Korwin Stevens

Observability Engineering

I work with technology teams to implement and support observability solutions. I have experience with open-source and commercial monitoring tools, including Dynatrace, Splunk, OpenTelemetry, Prometheus, and Grafana. I have a strong background in Python, Terraform, and Kubernetes. I aim to make systems reliable and stable by providing the right data to the right folks at the right time.

Work

Senior Systems Engineer

  • Work across multiple business units to implement an SLO tracking tool (Dynatrace Site Reliability Guardian), which will be deployed across 200+ critical applications in the coming months.

  • Partner with teams in Platform, Security, and Engineering to implement Synthetic Monitoring best practices, including alerting, maintenance, and cost management.

  • Implement a solution in Dynatrace to track MFA sign-ins, using Business Events to capture the user login experience.

  • Implement license management reporting and monitoring for Dynatrace and AppDynamics.

  • Designed automation to integrate, implement, and support Dynatrace in a large enterprise environment from the ground up. Using internal and vendor APIs, built automation that integrates with our Configuration Management Database (ServiceNow) for application metadata and with Slack and PagerDuty for alerting. This has enabled development teams to get observability 'out of the box' for their applications with minimal effort.

  • Present at all-hands meetings to promote observability best practices and introduce new features; conduct training seminars.

  • Maintain internal observability documentation and training materials for new hires and existing team members.

  • Worked with teams across the organization to reduce synthetic monitoring usage costs. Built reporting tools in Python to support this effort.

  • Write and maintain ad-hoc scripts that interact with monitoring tool APIs, including Dynatrace, PagerDuty, Grafana, InfluxDB, GitHub, AWS, Thycotic Secret Server, and various internal APIs.

  • Updated the standard Terraform configuration to include Dynatrace, which is now deployed on over 8,000 on-prem servers, thousands of cloud instances, ECS containers, Lambda functions, and 160+ Kubernetes clusters.

  • Conducted Enterprise Observability Tool Proofs of Concept to evaluate the existing open-source monitoring solution against two commercial vendors (Dynatrace and Datadog).

  • Designed and developed serverless automation for PagerDuty, including user integration with Active Directory, Microsoft Teams meetings for every incident created, post-incident review processes, and internal Application Catalog.

  • Re-architected the metrics ingestion platform, backend, and metrics routing to significantly increase reliability. Deployed a solution on Kubernetes that batched, aggregated, and routed metrics to different storage backends. Previously, the on-call engineer handled weekly incidents; there has been only one production issue since deployment over 4 years ago.

  • Built and operated an open-source monitoring stack (Grafana, InfluxDB, Telegraf) in AWS to provide a self-service developer experience, which eliminated my team as a bottleneck for monitoring changes.

  • Mentor junior team members, conduct code reviews, and lead Agile ceremonies.

Web Operations Engineer

  • Maintain Akamai CDN and WAF configurations for public-facing websites, including rocketmortgage.com, quickenloans.com, myql.com, and others.

  • Work across teams to prepare for large marketing campaigns around Cyber Monday, College Bowl games, and Super Bowls.

  • Implement AppDynamics for select web properties. Build dashboards and reports using the data.

  • Provide incident management for issues impacting our public websites, including security events.

Client Service Tech / Web Developer

  • Provide service, support, and hosting for over 70 websites on Windows and Linux environments.

  • Lead a team of developers on various projects, including ERP implementation and custom web applications.

  • Provide on-site desktop and server support for small and medium-sized businesses.

Skills

Observability

  • Monitoring
  • Alerting
  • Logging
  • Distributed Tracing

Dynatrace

  • Dashboards
  • Alerting
  • Synthetic Monitoring
  • APIs
  • License/Account Management
  • Real User Monitoring
  • Site Reliability Guardian

Site Reliability Engineering

  • SRE
  • Error Budgets
  • Incident Management
  • Automation

OpenTelemetry

  • Collectors
  • Instrumentation
  • Observability Pipelines
  • Routing Rules

SLOs

  • Service Level Objectives
  • Service Level Indicators
  • Service Level Agreements
  • Error Budgets

Python

Grafana

  • Dashboards
  • Alerting

Splunk

Prometheus

InfluxDB

  • Enterprise
  • Open Source

Telegraf

AppDynamics

PagerDuty

  • Automation
  • Incident Management
  • Integration
  • Alerting

Bash

Terraform

  • Infrastructure as Code

Kubernetes

  • Containers
  • Orchestration

Docker

Linux

  • Ubuntu
  • CentOS
  • Amazon Linux

AWS

  • EC2
  • S3
  • RDS
  • Lambda
  • CloudWatch
  • IAM
  • Route53
  • SQS
  • EventBridge

Presentations & Demos

  • Workshops
  • Training

DevOps

Agile

Mentorship

Vendor Relationship Management

Languages

English

Native speaker

Interests

Guitar

Home Automation

Cooking

Travel

Camping

Hiking

Dogs