This repo contains a minimal, reproducible PyTorch example of the approach from "DiLoCo: Distributed Low-Communication Training of Language Models", in 180 lines of code.
First, install the dependencies:
pip install -r requirements.txt
Then launch training with a single process:
torchrun --nproc_per_node=1 pure_torch_diloco.py --per-device-train-batch-size 16 --batch-size 256 --lr 1e-3 --warmup-steps 50 --local-steps 10
Or with two processes on the same node:
torchrun --nproc_per_node=2 pure_torch_diloco.py --per-device-train-batch-size 16 --batch-size 256 --lr 1e-3 --warmup-steps 50 --local-steps 10
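For intuition about what the number of local steps means in DiLoCo, here is a toy, single-process sketch of one communication round. It is not taken from pure_torch_diloco.py: the function names, the scalar "model", and the quadratic loss are all hypothetical, and plain gradient descent stands in for the AdamW inner optimizer used in the paper. The outer learning rate and momentum are illustrative values. The assumption is that --local-steps sets how many inner steps each worker takes before workers synchronize.

```python
# Toy simulation of the DiLoCo outer loop on a single scalar parameter.
# Hypothetical names throughout; this does NOT mirror pure_torch_diloco.py.

def inner_steps(theta, target, lr, local_steps):
    """Run `local_steps` of plain gradient descent on 0.5 * (theta - target)**2.

    Stand-in for the inner optimizer (the paper uses AdamW here).
    """
    for _ in range(local_steps):
        grad = theta - target
        theta = theta - lr * grad
    return theta


def diloco_round(global_theta, targets, momentum_buf,
                 inner_lr=0.1, local_steps=10,
                 outer_lr=0.7, outer_momentum=0.9):
    """One communication round: local training, then a single outer update."""
    # Every worker starts from the same global parameter and trains
    # independently on its own data (here: its own target value).
    local_thetas = [
        inner_steps(global_theta, t, inner_lr, local_steps) for t in targets
    ]

    # Pseudo-gradient per worker: how far the worker moved away from the
    # global parameter. Averaging these is the only communication step.
    pseudo_grads = [global_theta - th for th in local_thetas]
    avg_pseudo_grad = sum(pseudo_grads) / len(pseudo_grads)

    # Outer update: SGD with Nesterov momentum applied to the averaged
    # pseudo-gradient (same update convention as torch.optim.SGD).
    momentum_buf = outer_momentum * momentum_buf + avg_pseudo_grad
    step = avg_pseudo_grad + outer_momentum * momentum_buf
    return global_theta - outer_lr * step, momentum_buf


if __name__ == "__main__":
    theta, buf = 0.0, 0.0
    # Two simulated workers whose local optima are 1.0 and 3.0.
    for round_idx in range(50):
        theta, buf = diloco_round(theta, [1.0, 3.0], buf)
    print(theta)
```

In this toy setup the global parameter moves toward 2.0, the minimizer of the averaged loss, even though workers only exchange one value every 10 inner steps. In a real multi-process run the averaging would be done with a collective such as an all-reduce rather than a Python list.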