Systems Reliability Engineer, Cloudflare
Aug 2021 - Present
SRE on the Edge Infrastructure team, owning reliability and automation for Cloudflare's global edge network of 310+ colos.
Served in the operational oncall rotation for the global edge network; led incident response and authored post-mortems for high-severity, customer-impacting events.
Designed and built an automated colo recovery system using Temporal workflows, orchestrating the full recovery sequence for offline or degraded edge colos, including Redfish and IPMI power cycling triggered from ChatOps.
Built a just-in-time access system for edge jumphosts with temporary, audited access grants and automatic expiration. Extended to FedRAMP High colos via dedicated GCP jumphosts with firewall enforcement.
Designed and built an AI-powered operational memory system using sentence-transformer embeddings, a local vector database, and LLM-generated summaries to surface relevant alert context during oncall.
Led end-to-end migration of the Salt CI pipeline from TeamCity to GitLab CI. Reduced CI runtime, standardized dependency management, and shipped complementary tooling.
Authored an RFC and drove implementation of upstream Debian repository snapshotting, hardening the fleet's APT supply chain: GPG-validated packages are deployed progressively through a staged, health-mediated rollout instead of being pulled directly from upstream.