This repository contains the code samples used in each chapter of the book 《大模型动力引擎——PyTorch性能与显存优化手册》, written by Ailing Zhang (@ailzhang) and Zhanlue Yang (@jim19930609) and published by Tsinghua University Press in October 2024.
This is the public repo for the code snippets in the book "Efficient training in PyTorch" by Ailing Zhang (@ailzhang) and Zhanlue Yang (@jim19930609). The book is organized into three sections:
- Foundation: this section covers the essential knowledge of hardware, software, and profiling tools, including:
  - An introduction to concepts such as GPU compute and memory, CPUs, networking, etc.
  - PyTorch fundamentals such as tensors, operators, asynchronous CPU/GPU execution, dynamic graphs, and autograd, and how they work under the hood.
  - How to set up a benchmark environment and accurately measure time in the presence of asynchronous CPU/GPU execution (see the timing sketch after this list).
  - How to interpret PyTorch profiler results to identify bottlenecks (a profiling sketch also follows this list).
- Common problems and techniques: this section delves into:
  - Improving data loading speed, for example with multiprocessing dataloaders and prefetching, and diagnosing bottlenecks with tools like htop and iotop (see the dataloader sketch after this list).
  - Speeding up computation on a single GPU.
  - Reducing memory usage on a single GPU.
  - Distributed training strategies, including data parallelism, tensor parallelism, and pipeline parallelism.
- Advanced techniques: the final section covers:
  - Techniques like Automatic Mixed Precision, custom CUDA kernels, and compiler-based optimizations built on TorchDynamo (torch.compile); a sketch follows this list.
  - A worked example that optimizes the minGPT codebase, showing step by step how to improve GPU memory usage and performance with the techniques discussed in the previous chapters.
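As a taste of the benchmarking material, here is a minimal sketch (not taken from the book) of timing a GPU operation correctly despite PyTorch's asynchronous CUDA execution; the matrix size and iteration counts are arbitrary:

```python
import torch

# GPU kernels are launched asynchronously, so naive CPU wall-clock timing can
# return before the kernel has finished. CUDA events record time on the device.
device = "cuda"
x = torch.randn(4096, 4096, device=device)

# Warm up so one-time costs (context init, kernel selection) are excluded.
for _ in range(3):
    x @ x

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ x
end.record()

# Wait for all queued GPU work to finish before reading the elapsed time.
torch.cuda.synchronize()
print(f"matmul took {start.elapsed_time(end):.2f} ms")
```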
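Likewise, a minimal sketch of collecting a trace with `torch.profiler`; the toy model, iteration count, and output path are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
inp = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        model(inp).sum().backward()

# Print the ops that dominate GPU time, and export a trace viewable in
# chrome://tracing or Perfetto.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```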
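For the data loading chapter, the typical first step looks roughly like the sketch below; the synthetic dataset and the parameter values (`num_workers`, `prefetch_factor`, etc.) are illustrative and should be tuned to your machine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A synthetic image-classification dataset just for illustration.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32),
                        torch.randint(0, 10, (10_000,)))

# Worker processes overlap data preparation with GPU compute; pinned memory
# enables faster, asynchronous host-to-device copies; prefetch_factor controls
# how many batches each worker keeps ready ahead of time.
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,        # tune to your CPU core count and storage speed
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True,
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    ...  # training step goes here
```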
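Finally, for the advanced-techniques section, a rough sketch of combining Automatic Mixed Precision with `torch.compile` (which builds on TorchDynamo, PyTorch 2.x); the toy model and hyperparameters are only for illustration:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# torch.compile captures the model with TorchDynamo and emits optimized kernels.
compiled_model = torch.compile(model)

# GradScaler guards against underflow of float16 gradients.
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device="cuda")
target = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    # autocast runs eligible ops in half precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(compiled_model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```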
Please note that the book is currently written and published in Chinese, and there is no English version available yet. If you’re interested in an English edition, feel free to let us know by creating an issue.
- The folder structure mirrors the chapters of the book, with code samples arranged in the order they appear in each chapter.
- Profiling results are located in the `traces` subfolders within each chapter's directory. Please note that the profiling results may differ depending on your specific software and hardware setup.
- For chapter 10, the `main` branch contains a modified version of the vanilla minGPT code. Memory optimizations are located in the `chapter10_memory` branch, and performance optimizations in the `chapter10_perf` branch. Each commit corresponds to a specific technique discussed in the book.
- Please make sure `pre-commit` is installed, and feel free to add rules:

  ```bash
  pip install pre-commit
  pre-commit install
  ```