Joseph Voss

Background

About

SRE with seven years operating large-scale bare-metal infrastructure, from first-in-class supercomputers to Cloudflare's global edge network. Drawn to the internals (how Git objects are structured, how containers work, how eBPF programs hook the kernel) and to using that knowledge to build tools no one has gotten to yet.

Work Experience

Systems Reliability Engineer, Cloudflare
Aug, 2021 - Present
SRE on the Edge Infrastructure team, owning reliability and automation for Cloudflare's global edge network of 310+ colos.
- Rotated through operational oncall covering 310+ edge colos globally; led incident response and authored post-mortems for high-severity customer-impacting events.
- Designed and built an automated colo recovery system using Temporal workflows, orchestrating the full recovery sequence for offline/degraded edge colos, including Redfish and IPMI power cycling triggered from ChatOps.
- Built a just-in-time access system for edge jumphosts with temporary, audited access grants and automatic expiration. Extended to FedRAMP High colos via dedicated GCP jumphosts with firewall enforcement.
- Designed and built an AI-powered operational memory system using sentence-transformer embeddings, a local vector database, and LLM-generated summaries to surface relevant alert context during oncall.
- Led end-to-end migration of the Salt CI pipeline from TeamCity to GitLab CI. Reduced CI runtime, standardized dependency management, and shipped complementary tooling.
- Authored the RFC and drove implementation of upstream Debian repository snapshotting, hardening the fleet's apt supply chain: GPG-validated packages are deployed progressively through a staged, health-mediated rollout instead of being pulled directly from upstream.
HPC Systems Engineer, Oak Ridge National Laboratory
Jun, 2018 - Aug, 2021 (3 years 2 months)
Systems engineer at the Oak Ridge Leadership Computing Facility (OLCF), provisioning and operating large-scale HPC clusters including Summit and Andes.
- Provisioned Andes, a new 700-node HPC cluster and the largest commodity cluster procured by ORNL to date.
- Wrote custom tool (Anchor) to boot HPC machines from container images. Transitioned several large-scale systems to use it.
- Developed eBPF wrapper (greggd) to compile and load Linux kernel profiling programs and stream data via Telegraf to Kafka.
- Extended Let's Encrypt Golang projects to create in-house certificate issuer for host authentication bootstrapping.
- Contributed to open source projects to improve system health and monitoring; used them to build a load-balanced, highly available data transfer cluster.
- Helped lead a team of student interns in the 2019 Student Cluster Competition.
- Created Helm charts and automated pipelines to move system services to Kubernetes.
- Developed CI pipeline to stage Puppet changes on bare metal servers.
- Reviewed proposals for new HPC systems; assisted in their provisioning and deployment.
- Used Puppet to configure and manage large-scale systems.
DevOps Engineer, MultiMechanics
Jan, 2018 - May, 2018 (4 months)
Short-term contract building automated cross-platform build infrastructure for simulation software.
- Created automated build system using Vagrant. Ported software tools from Windows to Red Hat and SUSE environments.
- Configured and installed the PBS Pro job scheduler to better share computing resources; led training on its usage.
Student Intern, High Performance Computing, Texas Advanced Computing Center
Feb, 2016 - Aug, 2017 (1 year 6 months)
Student intern developing automated HPC testing and monitoring tooling, and competing in the SC16 Student Cluster Competition (4th place).
- Developed an automated HPC testing harness using Jenkins, PyTest, and CMake that integrates with Slurm.
- Created a heatmap visualization showing historical degradation and improvement in system performance.
- Designed, built and managed a cluster of high performance compute nodes for the Student Cluster Competition.
- Developed remote power monitoring system using SNMP, Graphite, and Grafana.
- Attended Supercomputing Conference 2016 to compete with student teams from around the world; placed 4th overall.
Science and Engineering Apprentice, Applied Research Laboratory, UT Austin
May, 2014 - Aug, 2015 (1 year 3 months)
Undergraduate research apprentice developing GPS data collection and processing tools for navigation research.
- Created a suite of cross-compatible unit tests in C++ for open source software.
- Redesigned the method of reading/writing RINEX files to use OOP encapsulation.
- Developed an inexpensive COTS GPS data collection platform using Python; decodes binary streams and writes out to formatted RINEX files.

Projects Experience

Anchor, Oak Ridge National Laboratory
Jan, 2019 - Jan, 2021 (2 years)
Extensible initrd module to boot HPC clusters from immutable SquashFS container images with a read-write overlay, replacing in-house provisioning scripts with container tooling.
- Designed as a dracut initrd module: fetches a SquashFS image at boot, mounts it read-only with a tmpfs overlay, and uses ACME/Step-CA mutual-TLS certificate bootstrapping to authenticate nodes before image download.
- Deployed to production clusters at OLCF; supported both serverless (Kubernetes) and local management-server provisioning modes.
greggd, Oak Ridge National Laboratory
Jan, 2019 - Jan, 2021 (2 years)
Golang daemon that compiles and loads eBPF C programs into the Linux kernel at runtime using gobpf/bcc, attaches kprobes and perf output maps, binary-deserializes kernel structs via reflection, and streams structured data in InfluxDB line protocol to a local Unix socket.
- Deployed to large-scale HPC systems at OLCF; provided low-overhead kernel observability (file opens, exec, block I/O latency, TCP session lifetime) without modifying the kernel or rebooting nodes.
git-remote-s3
Jan, 2021 - Present
Rust CLI tool implementing a Git remote helper that lets an S3-compatible object store serve as a Git remote.
- Implements the full Git remote helper protocol in Rust.
- Supports push/pull/clone against S3-compatible backends (Backblaze B2, AWS S3, etc.).
Cloudflare Workers Photo Gallery
Jan, 2025 - Present
Self-hosted image gallery and Google Photos uploader running entirely on Cloudflare Workers, R2, KV, and Images - no compute server required.
- Gallery viewer built with Workers static assets, Fancybox, and Tailwind; thumbnails served via live Cloudflare Images transforms.
- Google Photos picker integration for direct-to-R2 uploads using OAuth, KV-backed session state, and Cloudflare Access for auth.

Skills

Languages
- Golang
- Python
- Rust
- Bash
- TypeScript
Infrastructure & Configuration Management
- SaltStack
- Puppet
- Terraform
- Kubernetes
- NixOS
- cdk8s / Helm
- Debian / RHEL Linux
- GCP
- Cloudflare Workers
Observability, Reliability & Systems
- Prometheus
- Grafana
- Consul
- Nomad
- Temporal
- Redfish / IPMI
- eBPF
- Kafka
CI/CD & Developer Tooling
- GitLab CI
- Jenkins
- Git / Jujutsu

Education

Mechanical Engineering, Bachelor of Science, University of Texas at Austin
Aug, 2014 - May, 2018

Certificates

Red Hat Certified Engineer (RHCE), Red Hat
Issued on: Aug 30, 2019
Red Hat Certified System Administrator (RHCSA), Red Hat
Issued on: Aug 30, 2019

Publications

Anchor: Diskless Cluster Provisioning Using Container Tools, SC '20: International Conference for High Performance Computing, Networking, Storage and Analysis
Published on: Jan 01, 2020
Presented at SC '20. Atlanta, GA, USA.
Automated System Health and Performance Benchmarking Platform, Supercomputing Conference '17: Proceedings of the 2nd International HPC System Professionals Workshop at SC'17
Published on: Jan 01, 2017
ACM, New York, NY, USA.
Student Cluster Competition 2016 Reproducibility Challenge: Genomic Partitioning with ParConnect, Parallel Computing
Published on: Jan 01, 2017
Parallel Computing journal.