
LossScaleOptimizer does not work #31956

Closed
tsc2017 opened this issue Aug 25, 2019 · 7 comments
Assignees
jvishnuvardhan
Labels
comp:apis (High-level API related issues) · TF 1.14 (for issues seen with TF 1.14) · type:bug (Bug)

Comments


tsc2017 commented Aug 25, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 x64
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.14.0
  • Python version: 3.7.3
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: GTX 1080Ti

Describe the current behavior
I am trying to run the sample code from https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer and get the following error when no gradient can be computed for some variables:

ValueError                                Traceback (most recent call last)
C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    526                 as_ref=input_arg.is_ref,
--> 527                 preferred_dtype=default_dtype)
    528           except TypeError as err:

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1223     if ret is None:
-> 1224       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1225 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    283           value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 284           allow_broadcast=allow_broadcast))
    285   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    453     if values is None:
--> 454       raise ValueError("None values not supported.")
    455     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    540               observed = ops.internal_convert_to_tensor(
--> 541                   values, as_ref=input_arg.is_ref).dtype.name
    542             except ValueError as err:

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1223     if ret is None:
-> 1224       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1225 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    283           value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 284           allow_broadcast=allow_broadcast))
    285   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    453     if values is None:
--> 454       raise ValueError("None values not supported.")
    455     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-5d2950c170d2> in <module>
     13 
     14 # Call minimize() on the loss scale optimizer.
---> 15 train_op = loss_scale_optimizer.minimize(loss)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py in minimize(self, loss, global_step, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, name, grad_loss)
    411 
    412     return self.apply_gradients(grads_and_vars, global_step=global_step,
--> 413                                 name=name)
    414 
    415   def compute_gradients(self, loss, var_list=None,

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\contrib\mixed_precision\python\loss_scale_optimizer.py in apply_gradients(self, grads_and_vars, global_step, name)
    148     is_finite_grad = []
    149     for g in grads:
--> 150       is_finite_grad.append(math_ops.reduce_all(gen_math_ops.is_finite(g)))
    151     is_overall_finite = math_ops.reduce_all(is_finite_grad)
    152 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py in is_finite(x, name)
   4919   try:
   4920     _, _, _op = _op_def_lib._apply_op_helper(
-> 4921         "IsFinite", x=x, name=name)
   4922   except (TypeError, ValueError):
   4923     result = _dispatch.dispatch(

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    543               raise ValueError(
    544                   "Tried to convert '%s' to a tensor and failed. Error: %s" %
--> 545                   (input_name, err))
    546             prefix = ("Input '%s' of '%s' Op has type %s that does not match" %
    547                       (input_name, op_type_name, observed))

ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.

Describe the expected behavior
No error occurs with some other optimizers, such as AdamOptimizer and MovingAverageOptimizer, even when no gradient can be computed for some of the variables.

Code to reproduce the issue

import tensorflow as tf

a1 = tf.Variable(1., name='a1')
a2 = tf.Variable(2., name='a2')

model_params = [var for var in tf.global_variables() if 'a' in var.name]
# Note: the loss depends only on a1, so the gradient for a2 will be None.
loss = a1**2
opt = tf.train.AdamOptimizer(learning_rate=.1, beta1=0., beta2=0.9)

# Choose a loss scale manager which decides how to pick the right loss scale
# throughout the training process.
loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(5000)

# Wrap the original optimizer in a LossScaleOptimizer.
loss_scale_optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(opt, loss_scale_manager)

# Call minimize() on the loss scale optimizer. This raises the ValueError shown above.
train_op = loss_scale_optimizer.minimize(loss, var_list=model_params)
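For comparison, a minimal sketch of the expected behavior (untested): calling minimize() on the bare AdamOptimizer with the same var_list completes without error, because TF 1 optimizers skip variables whose gradient is None when applying updates.

# Hedged sketch: the unwrapped optimizer tolerates the None gradient for a2.
train_op_plain = opt.minimize(loss, var_list=model_params)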
seijikun commented:

Sounds quite similar to #31953


oanush commented Aug 26, 2019

@tsc2017,
Can you also refer to the similar issues #783 and #17? Thanks!

@oanush oanush added stat:awaiting response Status - Awaiting response from author comp:apis Highlevel API related issues labels Aug 26, 2019
tsc2017 (Author) commented Aug 26, 2019

@oanush Thanks for the information.
I have now figured out that the problem can be fixed by removing (grad, var) pairs in which grad is None:

import tensorflow as tf

a1 = tf.Variable(1., name='a1')
a2 = tf.Variable(2., name='a2')

model_params = [var for var in tf.global_variables() if 'a' in var.name]
loss = a1**2
opt = tf.train.AdamOptimizer(learning_rate=.1, beta1=0., beta2=0.9)

# Choose a loss scale manager which decides how to pick the right loss scale
# throughout the training process.
loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(5000)

# Wrap the original optimizer in a LossScaleOptimizer.
loss_scale_optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(opt, loss_scale_manager)

# Compute gradients.
grads_and_vars = loss_scale_optimizer.compute_gradients(loss, var_list=model_params, colocate_gradients_with_ops=True)

# Remove (grad, var) pairs whose grad is None.
grads_and_vars = [(grad, var) for grad, var in grads_and_vars if grad is not None]

# Call apply_gradients() on the loss scale optimizer.
train_op = loss_scale_optimizer.apply_gradients(grads_and_vars)

Still, I wonder whether the source of LossScaleOptimizer should be updated so that its behavior is consistent with the other optimizers.
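For reference, a hedged sketch of what such a change might look like in contrib's loss_scale_optimizer.py, based on the loop shown in the traceback above (a hypothetical fix, not an actual patch):

# Hypothetical change inside LossScaleOptimizer.apply_gradients():
is_finite_grad = []
for g in grads:
  if g is not None:  # skip variables for which no gradient was computed
    is_finite_grad.append(math_ops.reduce_all(gen_math_ops.is_finite(g)))
is_overall_finite = math_ops.reduce_all(is_finite_grad)

This would let apply_gradients() tolerate None gradients the same way the wrapped optimizers already do.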

jvishnuvardhan (Contributor) commented:

I could reproduce the issue. Here is the gist. Thanks!

tanzhenyu (Contributor) commented:

There was a technical decision that users should pass in valid trainable variables, to avoid gradients being None.
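In the reproduction above, that would mean passing only the variables the loss actually depends on, for example (a sketch):

# Only a1 contributes to the loss, so no gradient is None.
train_op = loss_scale_optimizer.minimize(loss, var_list=[a1])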

reedwm (Member) commented Sep 3, 2019

This looks like a bug. However, contrib is being removed in TF 2.0, so the contrib LossScaleOptimizer is deprecated and no longer maintained.

You can use a tf.keras.mixed_precision.experimental.LossScaleOptimizer with a tf.keras optimizer. Or in TF 1, a tf.train.experimental.MixedPrecisionLossScaleOptimizer with a non-keras optimizer.
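For illustration, a minimal sketch of both recommended wrappers (untested; the fixed loss scale of 5000 mirrors the original example):

# TF 2 / Keras path: wrap a Keras optimizer.
keras_opt = tf.keras.optimizers.Adam(learning_rate=0.1)
keras_lso = tf.keras.mixed_precision.experimental.LossScaleOptimizer(keras_opt, 5000)

# TF 1 path: wrap a non-Keras (tf.train) optimizer.
v1_opt = tf.train.AdamOptimizer(learning_rate=0.1)
v1_lso = tf.train.experimental.MixedPrecisionLossScaleOptimizer(v1_opt, 5000)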

I'm closing this issue, as even if it is fixed, the fix will never make it to a stable version of TensorFlow.

reedwm closed this as completed Sep 3, 2019