Papers and resources on distributed systems and ML infrastructure
Tri Dao et al. · NeurIPS
The key lesson for me is that memory movement is often the real bottleneck; algorithm design should respect the hardware path.
Aakanksha Chowdhery et al. · JMLR
Useful for understanding how distributed training, model quality, and infrastructure design interact at scale.
Jared Kaplan et al. · arXiv
A clean reminder that compute, data, and parameter count must be considered together rather than as independent dials.
Samyam Rajbhandari et al. · SC
A great reminder that optimizer state and partitioning strategy are first-order system design choices, not implementation details.
Mohammad Shoeybi et al. · SC
Important for seeing how model parallelism turns into a practical systems problem.
Philipp Moritz et al. · OSDI
Useful for understanding how an execution framework can balance simplicity for users with scheduling complexity under the hood.
Google, etc. · Google Research / Industry
A foundational comparison for scheduler design, control-plane simplicity, and resource fairness.
Benjamin H. Sigelman et al. · Google Research
Still one of the clearest arguments for why tracing belongs inside the core systems workflow.