Sergio Santiago

Staff Site Reliability Engineer

Triftstraße 38, Berlin, Germany, DE, 13353
+49 151 728 68755
Portuguese, English, French


About

Staff Site Reliability Engineer with 20+ years building mission-critical infrastructure at scale. Expert in cloud infrastructure modernization, Kubernetes platform engineering, and SRE best practices. Strategic leader focused on designing resilient systems, reducing operational complexity, and enabling teams through self-service infrastructure. Proven track record architecting systems supporting 100,000+ concurrent users while optimizing costs and driving organizational reliability culture.

Work Experience

  • Staff Site Reliability Engineer, Doctolib

    Jan, 2024 - Present

    Leading infrastructure modernization and reliability initiatives for Europe's largest health tech platform serving 100,000+ concurrent users across France, Germany, and Italy.

    • Engineered Argo Workflows ecosystem as complete replacement for legacy kubetoolbox with production-grade automation (Azure AD SSO, GitHub notifications, ECR integration, event sensors, CLI tooling) — enabling teams to automate complex operational tasks at scale

    • Designed a configurable validation ecosystem on Datadog Deployment Gates, with 10+ dedicated monitors replacing static queries, improving canary rollout reliability and deployment confidence

    • Co-authored 'Paradigm Shift' strategy shifting from centralized to decentralized, library-based configuration — enabling team autonomy and reducing SRE review bottlenecks

    • Implemented an EKS workload optimization strategy to reduce costs, redesigned monolith pod sizing, and optimized preproduction environments

    • Advanced the production-grade canary deployment strategy and ArgoCD integration to strengthen release confidence and rollback capabilities

  • Senior Site Reliability Engineer, Templafy

    Nov, 2021 - Mar, 2024 (2 years 4 months)

    Architected and maintained highly available infrastructure for content enablement platform while scaling operations and improving reliability culture.

    • Designed monitoring, alerting, and incident response frameworks for 99.99% uptime targets

    • Led disaster recovery planning and business continuity strategy

    • Mentored engineering teams on SRE principles and operational excellence

    • Owned capacity planning and cost optimization initiatives

  • Cloud Platform Engineer, The Adecco Group

    Jun, 2021 - Nov, 2021 (5 months)

    SME for Kubernetes/CNCF in cloud platform team supporting enterprise application deployments.

    • Automated cloud infrastructure deployment using Azure DevOps and Terraform

    • Designed CI/CD pipelines supporting agile development teams

    • Containerized applications using Helm and Kubernetes

  • Azure Rapid Response Senior Engineer, Microsoft

    Jun, 2019 - May, 2021 (1 year 11 months)

    Supported critical enterprise customers and top startups on Azure platform as subject matter expert for Azure Core Platform domains.

    • Specialized in Azure Compute, Linux, and Kubernetes (AKS)

    • Provided end-to-end Azure solution support for enterprise customers and startups

    • Member of global Azure Rapid Response team (EMEA, Americas, Asia)

  • Senior Telco NFV Engineer, VMware

    Nov, 2016 - Mar, 2018 (1 year 5 months)

    Provided carrier-grade support for Telco NFV platforms to communications service providers globally.

    • Supported VMware vCloud NFV stack for CSP customers

    • Expert in vSphere, VSAN, NSX, vCloud Director, and vRealize

    • Delivered mission-critical support for telecommunications infrastructure

Skills

  • Cloud Infrastructure & Platforms

    AWS (EKS, RDS/Aurora, ElastiCache, MSK)

    Azure (AKS, App Services, Service Bus)

    Kubernetes

    Docker

    Multi-cloud architecture

  • Infrastructure-as-Code

    Terraform

    Helm

    ArgoCD

    Ruby DSL

    GitOps

    Ansible

  • Automation & Reliability Engineering

    Argo Workflows

    CI/CD Pipelines

    SRE Practices

    Incident Management

    Disaster Recovery

    Capacity Planning

  • Observability & Monitoring

    Datadog

    Prometheus

    Grafana

    Application Performance Monitoring

    Metrics & Alerting

  • Leadership & Strategy

    Team Leadership

    SRE Culture

    Infrastructure Modernization

    Cost Optimization

    Architectural Design

  • Virtualization & Networking

    VMware (vSphere, NSX, vCloud)

    Network Security

    Load Balancing

    DNS/DHCP

Education

  • Electronic Engineering and Telecommunications, Bachelor of Science, University of Aveiro

    Jan, 1988 - Jan, 1994

Certificates
