Joseph Voss

Systems Reliability Engineer

Austin, TX, US

About

SRE with seven years operating large-scale bare-metal infrastructure, from first-in-class supercomputers to Cloudflare's global edge network. Drawn to the internals (how Git objects are structured, how containers work, how eBPF programs hook the kernel) and to using that knowledge to build tools no one has gotten to yet.

Work Experience

  • Systems Reliability Engineer, Cloudflare

    Aug, 2021 - Present

    SRE on the Edge Infrastructure team, owning reliability and automation for Cloudflare's global edge network of 310+ colos.

    • Rotated through operational oncall covering 310+ edge colos globally; led incident response and authored post-mortems for high-severity customer-impacting events.

    • Designed and built an automated colo recovery system using Temporal workflows, orchestrating the full recovery sequence for offline/degraded edge colos, including Redfish and IPMI power cycling from chat ops.

    • Built a just-in-time access system for edge jumphosts with temporary, audited access grants and automatic expiration. Extended to FedRAMP High colos via dedicated GCP jumphosts with firewall enforcement.

    • Designed and built an AI-powered operational memory system using sentence-transformer embeddings, a local vector database, and LLM-generated summaries to surface relevant alert context during oncall.

    • Led end-to-end migration of the Salt CI pipeline from TeamCity to GitLab CI. Reduced CI runtime, standardized dependency management, and shipped complementary tooling.

    • Authored the RFC for, and drove implementation of, upstream Debian repository snapshotting. This hardened the fleet's apt supply chain: GPG-validated packages are deployed progressively through a staged, health-mediated rollout instead of being pulled directly from upstream.

  • HPC Systems Engineer, Oak Ridge National Laboratory

    Jun, 2018 - Aug, 2021 (3 years 2 months)

    Systems engineer at the Oak Ridge Leadership Computing Facility (OLCF), provisioning and operating large-scale HPC clusters including Summit and Andes.

    • Provisioned Andes, a new 700-node HPC cluster and the largest commodity cluster procured by ORNL to date.

    • Wrote custom tool (Anchor) to boot HPC machines from container images. Transitioned several large-scale systems to use it.

    • Developed eBPF wrapper (greggd) to compile and load Linux kernel profiling programs and stream data via Telegraf to Kafka.

    • Extended Let's Encrypt Golang projects to create in-house certificate issuer for host authentication bootstrapping.

    • Contributed to open source projects to improve system health and monitoring; used them to build a load-balanced, highly available data transfer cluster.

    • Helped lead a team of student interns in the 2019 Student Cluster Competition.

    • Created Helm charts and automated pipelines to move system services to Kubernetes.

    • Developed CI pipeline to stage Puppet changes on bare metal servers.

    • Reviewed proposals for new HPC systems; assisted in their provisioning and deployment.

    • Used Puppet to configure and manage large-scale systems.

  • DevOps Engineer, MultiMechanics

    Jan, 2018 - May, 2018 (4 months)

    Short-term contract building automated cross-platform build infrastructure for simulation software.

    • Created automated build system using Vagrant. Ported software tools from Windows to Red Hat and SUSE environments.

    • Installed and configured the PBS Pro job scheduler to better share computing resources; led training on its usage.

  • Student Intern, High Performance Computing, Texas Advanced Computing Center

    Feb, 2016 - Aug, 2017 (1 year 6 months)

    Student intern developing automated HPC testing and monitoring tooling, and competing in the SC16 Student Cluster Competition (4th place).

    • Developed an automated HPC testing harness using Jenkins, PyTest, and CMake that integrates with Slurm.

    • Created a heatmap visualization showing historical degradation and improvement in system performance.

    • Designed, built and managed a cluster of high performance compute nodes for the Student Cluster Competition.

    • Developed remote power monitoring system using SNMP, Graphite, and Grafana.

    • Attended Supercomputing Conference 2016 to compete with student teams from around the world; placed 4th overall.

  • Science and Engineering Apprentice, Applied Research Laboratory, UT Austin

    May, 2014 - Aug, 2015 (1 year 3 months)

    Undergraduate research apprentice developing GPS data collection and processing tools for navigation research.

    • Created a suite of cross-compatible unit tests in C++ for open source software.

    • Redesigned the method of reading/writing RINEX files to use OOP encapsulation.

    • Developed an inexpensive COTS GPS data collection platform using Python; decodes binary streams and writes out to formatted RINEX files.

Projects Experience

  • Anchor, Oak Ridge National Laboratory

    Jan, 2019 - Jan, 2021 (2 years)

    Extensible initrd module to boot HPC clusters from immutable SquashFS container images with a read-write overlay, replacing in-house provisioning scripts with container tooling.

    • Designed as a dracut initrd module: fetches a SquashFS image at boot, mounts it read-only with a tmpfs overlay, and uses ACME/Step-CA mutual-TLS certificate bootstrapping to authenticate nodes before image download.

    • Deployed to production clusters at OLCF; supported both serverless (Kubernetes) and local management-server provisioning modes.

  • greggd, Oak Ridge National Laboratory

    Jan, 2019 - Jan, 2021 (2 years)

    Golang daemon that compiles and loads eBPF C programs into the Linux kernel at runtime using gobpf/bcc, attaches kprobes and perf output maps, binary-deserializes kernel structs via reflection, and streams structured data in InfluxDB line protocol to a local Unix socket.

    • Deployed to large-scale HPC systems at OLCF; provided low-overhead kernel observability (file opens, exec, block I/O latency, TCP session lifetime) without modifying the kernel or rebooting nodes.

  • git-remote-s3

    Jan, 2021 - Present

    Rust CLI tool implementing a Git remote helper that allows using an S3-compatible object store as a git remote repository.

    • Implements the full Git remote helper protocol in Rust.

    • Supports push/pull/clone against S3-compatible backends (Backblaze B2, AWS S3, etc.).

  • Cloudflare Workers Photo Gallery

    Jan, 2025 - Present

    Self-hosted image gallery and Google Photos uploader running entirely on Cloudflare Workers, R2, KV, and Images - no compute server required.

    • Gallery viewer built with Workers static assets, Fancybox, and Tailwind; thumbnails served via live Cloudflare Images transforms.

    • Google Photos picker integration for direct-to-R2 uploads using OAuth, KV-backed session state, and Cloudflare Access for auth.

Skills

  • Languages

    Golang

    Python

    Rust

    Bash

    TypeScript

  • Infrastructure & Configuration Management

    SaltStack

    Puppet

    Terraform

    Kubernetes

    NixOS

    cdk8s / Helm

    Debian / RHEL Linux

    GCP

    Cloudflare Workers

  • Observability, Reliability & Systems

    Prometheus

    Grafana

    Consul

    Nomad

    Temporal

    Redfish / IPMI

    eBPF

    Kafka

  • CI/CD & Developer Tooling

    GitLab CI

    Jenkins

    Git / Jujutsu

Education

  • Mechanical Engineering, Bachelor of Science, University of Texas at Austin

    Aug, 2014 - May, 2018
