Reading List

Papers and resources on distributed systems and ML infrastructure

2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness →

Tri Dao et al. · NeurIPS

The key lesson for me is that memory movement between GPU HBM and on-chip SRAM is often the real bottleneck, so algorithm design should respect the memory hierarchy.

#Attention #GPU #Efficiency

2022

PaLM: Scaling Language Modeling with Pathways →

Aakanksha Chowdhery et al. · arXiv

Useful for understanding how distributed training, model quality, and infrastructure design interact at scale.

#DistributedTraining #Infrastructure

2020

Scaling Laws for Neural Language Models →

Jared Kaplan et al. · arXiv

A clean reminder that compute, data, and parameter count must be considered together rather than as independent dials.

#Scaling #LLMs

2020

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models →

Samyam Rajbhandari et al. · SC

A great reminder that optimizer state and partitioning strategy are first-order system design choices, not implementation details.

#DistributedTraining #Memory #Partitioning

2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism →

Mohammad Shoeybi et al. · arXiv

Important for seeing how model parallelism turns into a practical systems problem.

#Parallelism #Systems

2018

Ray: A Distributed Framework for Emerging AI Applications →

Philipp Moritz et al. · OSDI

Useful for understanding how an execution framework can balance simplicity for users with scheduling complexity under the hood.

#Ray #Scheduling #AIInfrastructure

2016

Borg, Omega, and Kubernetes →

Brendan Burns et al. · ACM Queue

A foundational comparison for scheduler design, control-plane simplicity, and resource fairness.

#Scheduling #ClusterSystems

2010

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure →

Benjamin H. Sigelman et al. · Google Research

Still one of the clearest arguments for why tracing belongs inside the core systems workflow.

#Tracing #Observability #DistributedSystems