Senior Platform Engineer — AI Agent Infrastructure
Accountabilities:
• Own and evolve the cloud infrastructure supporting AI agents running at scale in production environments
• Design and implement event-driven architectures using durable asynchronous messaging systems
• Improve inter-service communication by replacing synchronous dependencies with scalable messaging patterns
• Build and maintain infrastructure as code frameworks for provisioning, deployment, and environment consistency
• Ensure platform reliability, scalability, and performance across distributed workloads
• Develop advanced observability capabilities including dashboards, alerts, tracing, logging, and health monitoring
• Lead incident response and post-incident analysis, and proactively improve system resilience based on production learnings
• Evaluate emerging technologies and drive architectural decisions as the platform matures
• Optimize databases, storage systems, and caching layers for speed, availability, and cost efficiency
• Collaborate with engineering teams to support secure and efficient deployment of AI workloads
Requirements:
• 4+ years of experience in platform engineering, infrastructure engineering, SRE, or backend systems roles
• Strong expertise in event-driven architecture and messaging systems such as Kafka, RabbitMQ, NATS, or similar
• Deep AWS experience including EC2, VPC, IAM, S3, RDS, and internal networking concepts
• Solid experience with SQL databases such as PostgreSQL and NoSQL systems such as MongoDB or Redis
• Strong Docker knowledge including container lifecycle management, health checks, resource limits, and image optimization
• Proven experience debugging distributed systems, asynchronous flows, and cascading production failures
• Hands-on experience with Infrastructure as Code tools such as Terraform or Pulumi
• Strong observability skills using Datadog or equivalent tools for APM, logging, monitoring, and tracing
• Experience with Go or similar backend programming languages
• Strong communication skills and ability to lead technical decisions in remote teams
Preferred Qualifications:
• Experience supporting AI or MLOps infrastructure, model serving, LLM inference, or GPU workloads
• Familiarity with LangFuse, LangSmith, Braintrust, MLflow, or similar AI observability tools
• Experience building multi-tenant container platforms or internal PaaS environments
• Kubernetes migration or production operations experience
• Exposure to Airflow, Prefect, Snowflake, BigQuery, Databricks, or similar data platforms
• Familiarity with Amazon ECS and the AI agent framework ecosystem
Benefits:
• Competitive compensation package
• Fully remote work from anywhere
• One-time home office setup allowance
• Company-provided work equipment
• Stock options
• Health plan coverage regardless of location
• Flexible paid time off
• Language learning and professional development courses
• Personal growth and continuous learning support
• Opportunity to shape cutting-edge AI infrastructure at scale
How Jobgether works:
We use an AI-powered matching process to review your application quickly, objectively, and fairly against the role's core requirements. Our system identifies the best-fitting candidates, and that shortlist is shared directly with the hiring company, whose internal team manages the final decision and next steps (interviews, assessments).
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
#LI-CL1