[Bug]: Question about causal_dataset's separate_last_epoch handling #9599

Open
1 task done
dynamicheart opened this issue Dec 10, 2024 · 1 comment
Labels: bug (Something isn't working)
dynamicheart (Contributor) commented Dec 10, 2024

Software environment

- paddlepaddle: develop
- paddlepaddle-gpu: develop
- paddlenlp: develop

Duplicate issue

  • I have searched the existing issues

Bug description

The pretraining loss drops sharply in the last data "epoch".

Steps to reproduce & code

The pretraining loss drops sharply in the last data "epoch":
[Figure: pretraining loss curve showing a sudden drop during the final data epoch]

  1. In run_pretrain.py, the train_sampler's shuffle is set to False, so shuffling of the dataset is handled entirely inside causal_dataset.

  2. When training requires more samples than the dataset can provide, the dataset is reused repeatedly; each pass over the data is called a data epoch. However, the number of samples needed for training is not always an integer multiple of the dataset size, so the last data epoch may be treated specially:

# If we have less than 80% of the samples for the last epoch,
# seperate out the epoch and treat it differently.
# Note: the 80% number is just based on common sense and can
# be adjusted if needed.
separate_last_epoch = last_epoch_num_samples < int(0.80 * num_samples_per_epoch)

  3. Consequently, the dataset's sample order is laid out in two parts: [samples from the earlier epochs, samples from the last epoch]. However, causal_dataset shuffles these two parts separately, so the two parts follow different data distributions, which produces the sudden drop in loss (see the sketch below): https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/data/causal_dataset.py#L691-L711
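For intuition on the threshold, assume one data epoch holds 1,000 samples and training needs 3,400: the last epoch then contributes only 400 samples, and since 400 < int(0.80 * 1000), separate_last_epoch becomes True. Below is a minimal sketch of the shuffle-index construction linked above (simplified names and dtypes, not the verbatim PaddleNLP source) that shows why the two segments end up with different distributions:

import numpy as np

def build_shuffle_idx(num_samples, total_size, np_rng):
    # num_samples: count of samples in the first (num_epochs - 1) full epochs
    # total_size:  count across all epochs, including the last partial one
    # Shuffle the earlier epochs' indices among themselves only.
    shuffle_idx_first = np.arange(num_samples, dtype=np.int64)
    np_rng.shuffle(shuffle_idx_first)
    if num_samples == total_size:
        return shuffle_idx_first
    # Shuffle the last epoch's indices in an independent second pass.
    shuffle_idx_last = np.arange(num_samples, total_size, dtype=np.int64)
    np_rng.shuffle(shuffle_idx_last)
    # Concatenating keeps every last-epoch sample at the tail of training,
    # so the final stretch of training draws from a different slice of data.
    return np.concatenate((shuffle_idx_first, shuffle_idx_last))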

Reference PR:

@dynamicheart added the bug label Dec 10, 2024
@dynamicheart changed the title from "[Bug]: casual_dataset separate_last_epoch handling question" to "[Bug]: causal_dataset separate_last_epoch handling question" Dec 11, 2024
ZHUI (Collaborator) commented Dec 16, 2024

Hi, we suggest modifying the shuffle part of the code so that the two parts are merged and shuffled together. That should eliminate the sudden change.

You could also ask on the Megatron side; I am not sure why Megatron requires that the last epoch not be globally shuffled.
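A minimal sketch of the suggested fix, reusing the simplified build_shuffle_idx interface from the sketch above (a hypothetical patch, not code from PaddleNLP or any actual PR): shuffle the entire index range in one pass so the last epoch's samples are interleaved with the rest.

import numpy as np

def build_shuffle_idx_joint(total_size, np_rng):
    # One global shuffle: samples from the last partial epoch are mixed in
    # with the earlier epochs, so the sample distribution stays uniform
    # across the whole training run and the loss curve has no seam.
    shuffle_idx = np.arange(total_size, dtype=np.int64)
    np_rng.shuffle(shuffle_idx)
    return shuffle_idx

One hedged guess at Megatron's motivation, not confirmed in this thread: separate shuffling guarantees that every sample has been seen the same number of times before the partial epoch repeats a subset of samples, and a single global shuffle gives up that property.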
