[Bert/Tensorflow] Failed to run on FP16 Mode #402

Open
agemagician opened this issue Feb 14, 2020 · 5 comments
Labels: bug


agemagician commented Feb 14, 2020

Related to Model/Framework(s)
Bert / Tensorflow

Describe the bug
I tried to train BERT with TensorFlow on Summit, but I got the following error when I enabled fp16:

Traceback (most recent call last):
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node cond_4/cond/Assign/Switch_1 was passed float from lms/swapin_cond_4_cond_Assign_Switch_1:0 incompatible with expected float_ref.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_pretraining.py", line 648, in <module>
    tf.compat.v1.app.run()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining.py", line 588, in main
    estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node cond_4/cond/Assign/Switch_1 was passed float from lms/swapin_cond_4_cond_Assign_Switch_1:0 incompatible with expected float_ref.
I0214 14:52:28.498270 35184372402448 monitored_session.py:240] Graph was finalized.
2020-02-14 14:52:28.510054: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-14 14:52:28.511800: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x136e8f980 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-14 14:52:28.511821: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-14 14:52:28.513940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-14 14:52:28.517668: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x136ef2390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-14 14:52:28.517685: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-14 14:52:28.520372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-14 14:52:28.520397: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-14 14:52:28.520421: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-14 14:52:28.522190: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-14 14:52:28.522235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-14 14:52:28.523822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-14 14:52:28.525140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-14 14:52:28.528524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-14 14:52:28.533710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 5
2020-02-14 14:52:28.533733: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-14 14:52:28.533778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-14 14:52:28.533789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      5 
2020-02-14 14:52:28.533796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 5:   N 
2020-02-14 14:52:28.539013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14870 MB memory) -> physical GPU (device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0035:05:00.0, compute capability: 7.0)
I0214 14:52:29.025281 35184372402448 lms.py:1275] [LMS][0] Added 4815 operations to the model (2339 swap-out operations (67.86 GiB) and 2476 swap-in operations (73.32 GiB))
I0214 14:52:29.025513 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS, took: 622064.0621185303 ms
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node cond_4/cond/Assign/Switch_1 was passed float from lms/swapin_cond_4_cond_Assign_Switch_1:0 incompatible with expected float_ref.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_pretraining.py", line 648, in <module>
    tf.compat.v1.app.run()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining.py", line 588, in main
    estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
I0214 14:52:30.093494 35184372402448 monitored_session.py:240] Graph was finalized.
2020-02-14 14:52:30.105004: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-14 14:52:30.106514: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x171fc1b80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-14 14:52:30.106547: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-14 14:52:30.108710: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-14 14:52:30.112611: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x172024590 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-14 14:52:30.112623: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-14 14:52:30.115268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-14 14:52:30.115290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-14 14:52:30.115311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-14 14:52:30.117450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-14 14:52:30.117493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-14 14:52:30.119262: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-14 14:52:30.120732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-14 14:52:30.124358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-14 14:52:30.129455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 1
2020-02-14 14:52:30.129477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-14 14:52:30.129527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-14 14:52:30.129535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      1 
2020-02-14 14:52:30.129543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   N 
2020-02-14 14:52:30.134787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14870 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node cond_4/cond/Assign/Switch_1 was passed float from lms/swapin_cond_4_cond_Assign_Switch_1:0 incompatible with expected float_ref.

The only change I made was enabling Large Model Support (LMS) by adding its hook to the training hooks:

  # Excerpt from main() in run_pretraining.py.
  # LMS is assumed to be imported from IBM's tensorflow_large_model_support package.
  training_hooks = []
  if FLAGS.report_loss and (not FLAGS.horovod or hvd.rank() == 0):
    # Sequences per optimizer update, summed over all Horovod workers.
    global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps
    if FLAGS.horovod:
      global_batch_size *= hvd.size()
    training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, FLAGS.display_loss_steps))
  if FLAGS.horovod and hvd.size() > 1:
    training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))

  # The only addition: an LMS hook appended to the training hooks.
  lms_hook = LMS(swapout_threshold=712,
                 swapin_ahead=248,
                 swapin_groupby=16,
                 sync_mode=0)
  training_hooks.append(lms_hook)

  estimator = tf.estimator.Estimator(
      model_fn=model_fn,
      config=run_config)

To Reproduce
ddlrun python run_pretraining.py \
  --input_files_dir=/data/train \
  --eval_files_dir=/data/eval \
  --output_dir=/models/bert/ \
  --bert_config_file=/models/bert/bert_xlarge_bfd_config.json \
  --do_train=True \
  --do_eval=False \
  --train_batch_size=11 \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --num_train_steps=500000 \
  --num_accumulation_steps=8 \
  --num_warmup_steps=50000 \
  --save_checkpoints_steps=100 \
  --learning_rate=0.003 \
  --horovod=True \
  --allreduce_post_accumulation=True \
  --use_fp16=True \
  --use_xla=True
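
For reference, here is a rough sketch of the global batch size that the logging hook in the excerpt above derives from these flags. The function name and the worker count of 6 are assumptions for illustration only; the issue does not state how many GPUs the job used.

```python
def global_batch_size(train_batch_size, num_accumulation_steps, num_workers=1):
    """Sequences contributing to one optimizer update across all workers."""
    return train_batch_size * num_accumulation_steps * num_workers


# Flag values from the command above: train_batch_size=11, num_accumulation_steps=8.
single_worker = global_batch_size(11, 8)

# num_workers=6 is a hypothetical Horovod world size, not taken from the issue.
six_workers = global_batch_size(11, 8, num_workers=6)

print(single_worker, six_workers)
```

Without Horovod the worker factor drops out, which matches the conditional in the excerpt.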

Environment
Supercomputer: Summit
Module: ibm-wml-ce/1.6.2

agemagician added the bug label Feb 14, 2020

swethmandava (Contributor) commented

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node cond_4/cond/Assign/Switch_1 was passed float from lms/swapin_cond_4_cond_Assign_Switch_1:0 incompatible with expected float_ref.

This is caused by a value being both assigned and passed, as explained here. It doesn't seem to be fp16-specific. Is fp32 training working? Can you link to LMS?


agemagician commented Feb 14, 2020

I set num_accumulation_steps to 1, and it now works fine with fp16.
I think the problem is somewhere in the gradient accumulation part of the optimizer file.
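
To make clear what "the accumulation part" refers to, here is a toy, framework-free sketch of gradient accumulation in general. It is not the repository's optimizer code, and the class and parameter names are hypothetical. The point is only that an optimizer with accumulation has two branches per step (accumulate only, or apply and reset), and that with one accumulation step the accumulate-only branch is never taken.

```python
class AccumulatingSGD:
    """Toy SGD that applies an update only every `accumulation_steps` calls."""

    def __init__(self, weights, learning_rate, accumulation_steps):
        self.weights = list(weights)
        self.learning_rate = learning_rate
        self.accumulation_steps = accumulation_steps
        self.accumulated = [0.0] * len(self.weights)
        self.local_step = 0

    def step(self, grads):
        """Add one micro-batch gradient; return True if the weights were updated."""
        self.accumulated = [a + g for a, g in zip(self.accumulated, grads)]
        self.local_step += 1
        if self.local_step % self.accumulation_steps != 0:
            # Accumulate-only branch: the weights stay untouched.
            return False
        # Update branch: apply the averaged gradient, then reset the buffer.
        scale = self.learning_rate / self.accumulation_steps
        self.weights = [w - scale * a for w, a in zip(self.weights, self.accumulated)]
        self.accumulated = [0.0] * len(self.weights)
        return True


opt = AccumulatingSGD([1.0, 2.0], learning_rate=0.5, accumulation_steps=2)
first = opt.step([1.0, 1.0])   # only accumulates
second = opt.step([3.0, 1.0])  # applies the averaged gradient and resets
print(first, second, opt.weights)
```

In a graph-mode framework the two branches would typically be expressed as a conditional containing variable assignments, which is the kind of node the error message names.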


agemagician commented Feb 15, 2020

After much trial and error, I am now sure that LMS is not compatible with this TensorFlow BERT code.
However, the code that I took from:
https://github.com/lambdal/bert
works fine:
https://github.com/agemagician/bert

I am not sure what the issue is, but I think it is somewhere in the optimization script.

nv-kkudrynski (Collaborator) commented

@meatybobby can you check if this issue is still valid?
