[Bert/Tensorflow] Failed to run on FP16 Mode #402
Comments
Caused by both assigning and passing a value, as explained here. It doesn't seem fp16-specific. Is fp32 training working? Can you link LMS?
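For illustration, the anti-pattern being described — updating a variable and reading it in the same session step with no control dependency — can be reduced to a minimal TF 1.x sketch (illustrative only, not the repository's actual code):

```python
# Minimal TF 1.x sketch of the "assign and read in the same step" anti-pattern.
# Illustrative only; not the repository's actual code.
import tensorflow as tf

accum = tf.Variable(0.0, trainable=False)
grad = tf.constant(1.0)

assign_op = accum.assign_add(grad)  # update the accumulator
read = accum * 2.0                  # read it in the same step, no dependency

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Execution order between the update and the read is unspecified,
    # so `read` may see either the old or the new accumulator value.
    print(sess.run([assign_op, read]))
```

Because the fetched read can observe either value depending on graph execution order, failures of this kind can look nondeterministic or mode-specific.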
@swethmandava FP32 works fine. The problem is in fp16.
So I set num_accumulation_steps to 1 and now it is working fine with fp16.
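For context, that workaround is consistent with the race sketched above: with num_accumulation_steps > 1, the gradient accumulator is updated and consumed in the same step. A minimal sketch of the safe pattern (illustrative only, not the repository's actual optimization.py) forces consumers to wait for the update via an explicit control dependency:

```python
# Minimal TF 1.x sketch of the safe pattern: an explicit control dependency
# forces consumers of the accumulator to see the post-update value.
# Illustrative only; not the repository's actual optimization.py.
import tensorflow as tf

accum_grad = tf.Variable(0.0, trainable=False)
step_grad = tf.constant(0.5)

update = accum_grad.assign_add(step_grad)
with tf.control_dependencies([update]):
    # tf.identity forces a fresh read that waits for `update` to run.
    scaled = tf.identity(accum_grad) * 2.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(scaled))  # deterministically 1.0
```

Setting num_accumulation_steps=1 presumably bypasses the accumulator path entirely, which would avoid the race.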
After many trials and errors, I am now sure that LMS is not compatible with this TensorFlow BERT code. I am not sure what the issue is, but I think it is somewhere in the optimization script.
@meatybobby can you check if this issue is still valid?
Related to Model/Framework(s)
BERT / TensorFlow
Describe the bug
I tried to train BERT using TensorFlow on SUMMIT. However, I got the following error when I tried to use fp16:
The only change I made was enabling large model support (LMS):
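The snippet showing that change was not captured in this copy of the issue. Purely as a hypothetical illustration, enabling IBM's TensorFlow Large Model Support in Estimator-based code like run_pretraining.py might look like the sketch below; the module name and hook usage are assumptions based on IBM's TFLMS v2 documentation for WML CE, not the poster's actual edit.

```python
# Hypothetical sketch only: the module and hook usage below are assumptions
# taken from IBM's TFLMS v2 documentation for WML CE, not the poster's edit.
from tensorflow_large_model_support import LMS

lms_hook = LMS()  # graph-rewriting hook that swaps tensors to host memory

# `estimator` and `train_input_fn` as constructed in run_pretraining.py:
estimator.train(input_fn=train_input_fn,
                hooks=[lms_hook],
                max_steps=FLAGS.num_train_steps)
```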
To Reproduce
```
ddlrun python run_pretraining.py \
    --input_files_dir=/data/train \
    --eval_files_dir=/data/eval \
    --output_dir=/models/bert/ \
    --bert_config_file=/models/bert/bert_xlarge_bfd_config.json \
    --do_train=True \
    --do_eval=False \
    --train_batch_size=11 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --num_train_steps=500000 \
    --num_accumulation_steps=8 \
    --num_warmup_steps=50000 \
    --save_checkpoints_steps=100 \
    --learning_rate=0.003 \
    --horovod=True \
    --allreduce_post_accumulation=True \
    --use_fp16=True \
    --use_xla=True
```
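As reported in the comments above, the same command completes in fp16 once gradient accumulation is disabled, i.e. with a single flag changed:

```
--num_accumulation_steps=1
```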
Environment
Supercomputer: SUMMIT
Module: ibm-wml-ce/1.6.2