[Feature] All GPUs within the same TP group load training data from shared memory #91

xrsrke · 2024-03-03T03:15:01Z

In a typical data loading phase of distributed training, each GPU worker is equipped with its own data loader, responsible for reading training data into the CPU memory before forwarding it to the GPU. This leads to competition among workers for disk read bandwidth, thereby creating a bottleneck. Notably, we observe that in the LLM training setting, GPU workers within the same machine are in the same tensor parallel group. Consequently, their inputs for each iteration are inherently identical. Based on this observation, we adopt a two-layer tree-based approach. We use a single, dedicated data loader on each machine to read the training data into a piece of shared memory. Subsequently, each GPU worker is responsible for copying the necessary data to its own GPU memory. This eliminates redundant reads and significantly enhances the efficiency of data transfer.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
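
A minimal sketch of the two-layer idea described above, assuming Python's standard-library `multiprocessing.shared_memory`. It is not taken from the MegaScale implementation. The names `load_batch_to_shared_memory`, `worker_copy`, `run_iteration`, and `TP_SIZE` are hypothetical. For brevity the TP workers are simulated as sequential calls in one process. In real training each worker would be a separate process on the machine, and the local copy would be a host-to-device transfer.

```python
from array import array
from multiprocessing import shared_memory

TP_SIZE = 4  # hypothetical: number of GPU workers in the TP group on this machine


def load_batch_to_shared_memory(token_ids):
    """Dedicated per-machine loader: write the batch once into shared memory."""
    data = array("q", token_ids)  # signed 64-bit token ids
    nbytes = len(data) * data.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = data.tobytes()
    return shm


def worker_copy(shm_name, num_tokens):
    """TP worker: attach to the shared block by name and copy the batch out.

    The local list stands in for the worker's own GPU buffer.
    """
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        local = array("q")
        # The block may be rounded up to a page size, so slice explicitly.
        local.frombytes(bytes(shm.buf[: num_tokens * local.itemsize]))
        return local.tolist()
    finally:
        shm.close()


def run_iteration(token_ids, tp_size=TP_SIZE):
    """One iteration: a single write by the loader, then tp_size worker copies."""
    shm = load_batch_to_shared_memory(token_ids)
    try:
        return [worker_copy(shm.name, len(token_ids)) for _ in range(tp_size)]
    finally:
        shm.close()
        shm.unlink()  # the loader owns the block and frees it


if __name__ == "__main__":
    batch = [101, 7, 42, 9, 0, 3, 55, 102]
    copies = run_iteration(batch)
    print(all(c == batch for c in copies))
```

In this sketch the batch is written once per machine instead of once per worker, which is the redundancy the proposal aims to remove. Synchronization between the loader and the workers (for example, signalling that a new batch is ready) is left out and would be needed in a real implementation.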

xrsrke added the enhancement, help wanted, and Low Priority labels on Mar 3, 2024
