
LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphenconnect
📍 San Francisco Bay Area, USA 📅 Posted April 24, 2026
Apply on Hyphenconnect’s website →

About this role

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate has a deep understanding of GPU clusters and an extensive systems engineering background, ensuring training processes stay efficient and reliable.

Responsibilities:

• Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.

• Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.

• Automate checkpointing and failure recovery during month-long training runs.
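The checkpointing-and-recovery responsibility above hinges on one pattern: checkpoints must be written atomically, so a node failure mid-write never corrupts the last good copy, and training must resume from the newest valid checkpoint. A minimal sketch of that pattern in plain Python (in a real run you would serialize model and optimizer state with `torch.save` or a distributed checkpoint library; the function names here are illustrative):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, state):
    """Atomically persist training state: write to a temp file in the
    same directory, then rename over the target. os.replace is atomic
    on POSIX, so a crash mid-write leaves the previous checkpoint intact."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Resume from the last checkpoint if one exists; otherwise start
    from step 0 with empty state."""
    if not os.path.exists(path):
        return 0, {}
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]
```

In a month-long run this pair sits inside the training loop: save every N steps, and on restart call `load_checkpoint` before entering the loop so the job picks up where it crashed rather than from step 0.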

Required Skills:

• Deep expertise in 3D parallelism (Data, Tensor, Pipeline).

• Experience managing SLURM or Kubernetes-based GPU clusters.

• Strong systems engineering background (C++, CUDA, Python).
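The 3D-parallelism skill listed above amounts to factoring the GPU world size into data-, tensor-, and pipeline-parallel groups and mapping each flat rank to a coordinate in that grid. A small sketch of one such mapping (the ordering below, with tensor ranks varying fastest, is an illustrative convention, not the exact layout of any particular framework):

```python
def rank_to_3d(rank, tp_size, pp_size, dp_size):
    """Map a flat global rank to (data, pipeline, tensor) coordinates.

    Convention assumed here: tensor-parallel ranks vary fastest (they
    need the highest-bandwidth links, e.g. NVLink within a node),
    then pipeline, then data parallel across nodes.
    """
    assert rank < tp_size * pp_size * dp_size, "rank outside the grid"
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp
```

For a 1,024-GPU job split as 16 data x 8 pipeline x 8 tensor, rank 0 maps to (0, 0, 0) and rank 64 starts the second data-parallel replica at (1, 0, 0).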

This listing was aggregated by Perik.ai from Hyphenconnect’s public job board. Click the button above to view the full job description and apply directly.
