Senior Software Engineer / Site Reliability Engineer (SRE) – Observability & Platform Engineering
About this role
Must-Have Skills (Required)
Core Engineering & Platform Skills
• Strong proficiency in at least one of the following: Python, JavaScript (Node.js), or Java
• Hands-on experience with API integrations (designing, consuming, and integrating APIs)
• Strong experience working in Kubernetes environments, including deployment, operations, and monitoring
Observability & Monitoring
• Experience with DataDog (preferred) or similar tools such as Prometheus or Grafana
• Ability to configure dashboards, alerts, and APM (tracing, metrics, logging)
• Experience monitoring containerized and microservices architectures
Cloud & Infrastructure
• Hands-on experience with AWS
• Experience integrating observability tools into cloud environments
SRE & Operations
• Experience with CI/CD integrations for observability (e.g., DataDog in pipelines)
• Ability to automate monitoring and operational tasks using scripting (Python preferred)
Strongly Preferred Skills
• Experience owning and operating an internal engineering platform
• Deep experience with observability platforms
• Demonstrated ownership of reliability, scalability, and performance
• Proven ability to proactively lead maintenance efforts and platform improvements
• Experience installing and configuring DataDog agents and integrations
• Experience managing API keys and secure configurations
• Experience managing user roles and access controls within observability platforms
Nice-to-Have Skills (Preferred)
• Familiarity with Go (Golang)
• Experience with additional observability tools such as New Relic, Dynatrace, Elastic, or Splunk Observability
Description
Project Overview:
We are seeking a Senior Software Engineer / SRE with an Observability focus to support platform reliability, monitoring, and modernization initiatives. This role blends software engineering (60–70%) with site reliability engineering (30–40%), with a strong emphasis on Kubernetes and observability platforms.
Key Responsibilities
• Support platform reliability, monitoring, and modernization initiatives
• Provide operational and training support for DataDog, the Observability Platform for R&D
• Enhance observability, reliability, and performance across engineering platforms
• Drive automation and operational excellence for monitoring and alerting frameworks
• Support Kubernetes-based platform operations and monitoring integrations
Timezone Coverage
• PST Coverage Required