SRE Architect, AI-Powered Reliability

Wexinc

📍 5 Locations 📅 Posted May 13, 2026

About the Team & Role

WEX operates across multiple lines of business (Mobility, Benefits, and Travel), serving enterprise customers globally with payment and technology solutions that demand uncompromising reliability. These are mission-critical systems handling high-volume financial transactions where availability, transactional integrity, and low latency are non-negotiable. Our SRE practice is in its early stages, and the decisions made now will define how we build, operate, and continuously improve reliable systems for years to come.

This person will define and enforce the reliability standards, operational practices, and architectural guardrails that every line of business at WEX must meet, and will use AI as a primary tool to establish, scale, and continuously improve those standards faster than traditional approaches alone can achieve.

This is not a role embedded in a single business unit. It sits at the center of WEX engineering with a mandate that spans all lines of business (LOBs). You will set the bar, and you will hold it, working with engineering leadership, platform teams, and LOB architects to make reliability a consistent, measurable, and continuously improving property of every system we operate.

How you'll make an impact

Enterprise Standards & Governance

• Define, publish, and enforce enterprise-wide SRE best practices and operational standards covering observability, incident management, resilience, capacity planning, and reliability architecture, applicable across all WEX lines of business.

• Define and lead WEX’s AI-Powered Reliability Engineering strategy, driving adoption of SRE agents across the software lifecycle, from design and development through deployment and operations, to improve reliability, automation, and operational efficiency.

• Architect and oversee the implementation of mission-critical systems, ensuring that reliability, availability, and transactional integrity requirements are designed in from the start, not bolted on after the fact.

• Establish and govern SLO, SLI, and error budget frameworks across LOBs, partnering with engineering leadership to align reliability targets with business and commercial expectations.

• Own the production readiness review process, defining the criteria every service must meet before going live and driving accountability for remediation when gaps are found.

• Serve as the primary technical advisor to engineering leadership across WEX on matters of reliability, resilience architecture, and operational excellence.

Observability

• Define the enterprise observability standard (what good looks like for metrics, distributed tracing, structured logging, and alerting) and hold all LOBs accountable to it.

• Use AI-powered tooling to move beyond static dashboards: deploy intelligent anomaly detection, dynamic baselining, and automated signal correlation to reduce noise and surface actionable signals at scale.

• Drive instrumentation practices that give engineering teams genuine insight into the health of high-availability, low-latency systems, including real-time payment flows and transaction pipelines where latency and consistency are critical.

• Lead the evaluation and adoption of AI-assisted observability platforms that reason across telemetry sources to accelerate detection and diagnosis.

Incident Management

• Establish the enterprise incident management framework: severity definitions, response playbooks, escalation paths, on-call standards, and cross-LOB communication protocols.

• Integrate AI into the full incident lifecycle: intelligent triage and automated runbook suggestions at detection, real-time signal correlation during active incidents, and AI-assisted timeline and impact summaries at resolution.

• Reduce cognitive burden on on-call engineers through tooling that surfaces relevant context, prior incidents, and likely remediation paths automatically during high-pressure situations.

• Define, track, and report on incident metrics (MTTD, MTTR, recurrence rate) across all LOBs, using trends to drive systemic improvement rather than one-off fixes.

Resilience Engineering & Self-Healing Systems

• Lead cross-functional initiatives to enhance system resilience and performance across WEX, advocating for circuit breakers, bulkheads, graceful degradation, retry strategies, and fault isolation as enterprise standards.

• Design self-healing and auto-recovery mechanisms that allow systems to detect, respond to, and recover from common failure modes without human intervention, reducing toil and improving mean time to recovery.

• Build and operate chaos engineering programs appropriate for WEX's financial systems, running controlled failure experiments that expose resilience gaps safely and systematically before they manifest as production incidents.

• Use AI to proactively identify resilience risks: analyze production telemetry, deployment signals, and dependency graphs to surface systems most likely to fail under stress before incidents occur.

Capacity Planning & Load Testing

• Develop enterprise capacity planning strategies, establishing the models, tooling, and review cadences that ensure every LOB can anticipate and provision for demand growth without last-minute scrambles or over-provisioning.

• Define and enforce load testing standards as a gate in the software delivery lifecycle, ensuring that services can handle peak transactional load, including burst demand on payment and fleet systems, before they reach production.

• Apply AI-driven forecasting to capacity planning: model historical growth patterns, seasonal demand signals, and business pipeline data to produce reliable capacity outlooks across LOBs.

Cloud Cost Optimization

• Drive cloud cost optimization and budgeting initiatives across WEX engineering, establishing the frameworks, tooling, and governance processes that ensure cloud spend is rationalized against reliability and performance outcomes.

• Identify and remediate cost inefficiencies without compromising availability: right-sizing, reserved capacity strategy, workload scheduling, and architecture patterns that reduce waste in high-availability deployments.

• Partner with LOB engineering and finance leadership to produce credible cloud cost forecasts, and hold teams accountable to efficiency targets.

Blameless Postmortem Culture

• Design and champion the enterprise blameless postmortem process, creating templates, facilitation standards, and review cadences that make postmortems genuinely useful and consistently practiced across all LOBs.

• Use AI to accelerate postmortem quality: generate draft timelines from incident telemetry, surface contributing factors from logs and traces, and identify systemic patterns across multiple incidents over time.

• Build a postmortem knowledge base that is searchable and actionable, so lessons from past incidents actively inform future architectural decisions and operational practices.

• Close the loop on postmortem action items, tracking completion rates across LOBs and escalating chronic non-compliance to engineering leadership.

Technical Advisory & Cross-LOB Enablement

• Serve as a technical advisor to engineering leaders and architects across WEX, reviewing system designs for reliability risk, providing guidance on high-availability and low-latency architecture patterns, and advising on operational tradeoffs.

• Lead cross-functional initiatives that span LOBs, bringing together engineering teams to solve shared reliability challenges, establish common tooling, and align on enterprise standards.

• Create and deliver internal enablement programs (workshops, documentation, office hours, and design review forums) that build SRE capability across WEX engineering without requiring headcount growth in every team.

• Communicate clearly and influentially to senior leadership: produce written strategy documents, present reliability trends and investment recommendations, and maintain executive visibility into the state of reliability across the enterprise.

Experience you'll bring

Required

• 12+ years in SRE, platform engineering, or distributed systems, with a hands-on track record of operating mission-critical systems at scale.

• Deep practical expertise across observability, incident management, resilience engineering, and capacity planning: not just familiarity, but proven delivery in production environments.

• Experience with high-availability, low-latency systems where transactional integrity and consistency are critical requirements, such as payment processing, financial platforms, or equivalent.

• Demonstrated experience using AI tools to solve real reliability problems: anomaly detection, incident triage, noise reduction, postmortem acceleration, capacity forecasting, or auto-remediation.

• Proven ability to define and enforce technical standards across multiple engineering teams or business units without direct managerial authority.

• Experience designing self-healing and auto-recovery mechanisms in production distributed systems.

• Strong background in cloud cost optimization, including architecture patterns, governance frameworks, and tooling for managing cloud spend at scale (AWS, GCP, or Azure).

• Excellent written and verbal communication skills, with the ability to produce authoritative strategy documents, lead cross-LOB forums, and advise VP and C-level engineering leaders.

Preferred

• Experience in payments, fintech, fleet technology, or benefits administration, and familiarity with the reliability and compliance demands of financial transaction systems.

• Experience building or maturing an SRE practice from an early stage across a multi-product or multi-LOB organization.

• Familiarity with AI-native observability or AIOps platforms (Dynatrace, Honeycomb, Coralogix, or similar).

• Background in chaos engineering (Gremlin, LitmusChaos, AWS Fault Injection Simulator) and controlled failure experimentation in regulated or financial environments.

• Experience with systems requiring strict transactional consistency, such as distributed databases, event-driven architectures, or payment settlement pipelines.

• Proficiency with Kubernetes, service mesh (Istio/Linkerd), and OpenTelemetry-based observability stacks.

• BS/MS in Computer Science, Engineering, or equivalent practical experience.

The base pay range represents the anticipated low and high end of the pay range for this position. Actual pay rates will vary and will be based on various factors, such as your qualifications, skills, competencies, and proficiency for the role. Base pay is one component of WEX's total compensation package. Most sales positions are eligible for commission under the terms of an applicable plan. Non-sales roles are typically eligible for a quarterly or annual bonus based on their role and applicable plan. WEX's comprehensive and market-competitive benefits are designed to support your personal and professional well-being. Benefits include health, dental, and vision insurance, a retirement savings plan, paid time off, a health savings account, flexible spending accounts, life insurance, disability insurance, tuition reimbursement, and more. For more information, check out the "About Us" section.

Pay Range: $200,600.00 - $250,400.00

This listing was aggregated by Perik.ai from Wexinc’s public job board.