Work across multiple business units to implement an SLO tracking tool (Dynatrace Site Reliability Guardian), to be deployed across 200+ critical applications in the coming months.
Partner with teams in Platform, Security, and Engineering to implement Synthetic Monitoring best practices, including alerting, maintenance, and cost management.
Implement a Dynatrace solution that uses Business Events to track MFA sign-ins and capture the user login experience.
Implement license management reporting and monitoring for Dynatrace and AppDynamics.
Designed automation to integrate, implement, and support Dynatrace in a large enterprise environment from the ground up. Using internal and vendor APIs, built automation that integrates with our Configuration Management Database (ServiceNow) for application metadata and with Slack and PagerDuty for alerting. This enables development teams to get observability 'out of the box' for their applications with minimal effort.
Present topics at all-hands meetings to promote observability best practices, introduce new features, and conduct training seminars.
Maintain internal observability documentation and training materials for new hires and existing team members.
Worked with teams across the organization to reduce synthetic monitoring costs, and built Python reporting tools to manage this effort.
Write and maintain ad-hoc scripts to interact with monitoring tool APIs, including Dynatrace, PagerDuty, Grafana, InfluxDB, GitHub, AWS, Thycotic Secret Server, and various internal APIs.
Updated our standard Terraform to include Dynatrace, which is now deployed on over 8,000 on-prem servers, thousands of cloud instances, ECS containers, Lambda functions, and 160+ Kubernetes clusters.
Conducted Enterprise Observability Tool Proofs of Concept to evaluate the existing open-source monitoring solution against two commercial vendors (Dynatrace and Datadog).
Designed and developed serverless automation for PagerDuty, including user integration with Active Directory, a Microsoft Teams meeting for every incident created, post-incident review processes, and an internal Application Catalog.
Re-architected the metrics ingestion platform, backend, and metrics routing to significantly increase reliability. Deployed a solution on Kubernetes that batches, aggregates, and routes metrics to different storage backends. Previously the on-call engineer handled weekly incidents; there has been only one production issue since deployment over four years ago.
Built and operated an open-source monitoring stack (Grafana, InfluxDB, Telegraf) in AWS to provide a self-service developer experience, which eliminated my team as a bottleneck for monitoring changes.
Mentor junior team members, conduct code reviews, and lead agile ceremonies.