Tinkerings

My notes to build mental models for how things work.

Triton Kernels

Modern GPUs have massive compute but limited memory bandwidth. The key metric is arithmetic intensity (AI):

Context Parallelism

Context Parallelism (CP) is a distributed training technique that enables training of LLMs with extremely long sequences (up to 1M tokens) by sharding the sequence dimension…

Expert Parallelism

Let’s distill how to run a Mixture-of-Experts (MoE) model with expert parallelism with an example.