
LossScaleOptimizer does not work #31956

Closed
tsc2017 opened this issue Aug 25, 2019 · 7 comments
Assignees
jvishnuvardhan
Labels
comp:apis (High-level API related issues) · TF 1.14 (for issues seen with TF 1.14) · type:bug (Bug)

Comments


tsc2017 commented Aug 25, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 x64
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.14.0
  • Python version: 3.7.3
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: GTX 1080Ti

Describe the current behavior
I am trying to run the sample code from https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer and get the following error when no gradient can be computed for some variables:

ValueError                                Traceback (most recent call last)
C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    526                 as_ref=input_arg.is_ref,
--> 527                 preferred_dtype=default_dtype)
    528           except TypeError as err:

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1223     if ret is None:
-> 1224       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1225 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    283           value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 284           allow_broadcast=allow_broadcast))
    285   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    453     if values is None:
--> 454       raise ValueError("None values not supported.")
    455     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    540               observed = ops.internal_convert_to_tensor(
--> 541                   values, as_ref=input_arg.is_ref).dtype.name
    542             except ValueError as err:

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1223     if ret is None:
-> 1224       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1225 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    283           value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 284           allow_broadcast=allow_broadcast))
    285   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    453     if values is None:
--> 454       raise ValueError("None values not supported.")
    455     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-5d2950c170d2> in <module>
     13 
     14 # Call minimize() on the loss scale optimizer.
---> 15 train_op = loss_scale_optimizer.minimize(loss)

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\training\optimizer.py in minimize(self, loss, global_step, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, name, grad_loss)
    411 
    412     return self.apply_gradients(grads_and_vars, global_step=global_step,
--> 413                                 name=name)
    414 
    415   def compute_gradients(self, loss, var_list=None,

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\contrib\mixed_precision\python\loss_scale_optimizer.py in apply_gradients(self, grads_and_vars, global_step, name)
    148     is_finite_grad = []
    149     for g in grads:
--> 150       is_finite_grad.append(math_ops.reduce_all(gen_math_ops.is_finite(g)))
    151     is_overall_finite = math_ops.reduce_all(is_finite_grad)
    152 

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py in is_finite(x, name)
   4919   try:
   4920     _, _, _op = _op_def_lib._apply_op_helper(
-> 4921         "IsFinite", x=x, name=name)
   4922   except (TypeError, ValueError):
   4923     result = _dispatch.dispatch(

C:\Users\admin\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    543               raise ValueError(
    544                   "Tried to convert '%s' to a tensor and failed. Error: %s" %
--> 545                   (input_name, err))
    546             prefix = ("Input '%s' of '%s' Op has type %s that does not match" %
    547                       (input_name, op_type_name, observed))

ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.

Describe the expected behavior
No error occurs with some other optimizers, such as AdamOptimizer and MovingAverageOptimizer, even when no gradient can be computed for some of the variables.

Code to reproduce the issue

import tensorflow as tf

a1 = tf.Variable(1., name='a1')
a2 = tf.Variable(2., name='a2')

model_params = [var for var in tf.global_variables() if 'a' in var.name]
# Note: the loss depends only on a1, so the gradient for a2 will be None.
loss = a1**2
opt = tf.train.AdamOptimizer(learning_rate=.1, beta1=0., beta2=0.9)

# Choose a loss scale manager which decides how to pick the right loss scale
# throughout the training process.
loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(5000)

# Wrap the original optimizer in a LossScaleOptimizer.
loss_scale_optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(opt, loss_scale_manager)

# Call minimize() on the loss scale optimizer. This raises the ValueError shown above.
train_op = loss_scale_optimizer.minimize(loss, var_list=model_params)
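For comparison, a minimal sketch of the expected behavior (untested): calling minimize() on the bare AdamOptimizer with the same var_list completes without error, because TF 1 optimizers skip variables whose gradient is None when applying updates.

# Hedged sketch: the unwrapped optimizer tolerates the None gradient for a2.
train_op_plain = opt.minimize(loss, var_list=model_params)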
seijikun commented:

Sounds quite similar to #31953


oanush commented Aug 26, 2019

@tsc2017,
Can you also refer to the similar issues #783 and #17? Thanks!

@oanush oanush added stat:awaiting response Status - Awaiting response from author comp:apis Highlevel API related issues labels Aug 26, 2019
tsc2017 (Author) commented Aug 26, 2019

@oanush Thanks for the information.
I have now figured out that the problem can be fixed by removing (grad, var) pairs in which grad is None:

import tensorflow as tf

a1 = tf.Variable(1., name='a1')
a2 = tf.Variable(2., name='a2')

model_params = [var for var in tf.global_variables() if 'a' in var.name]
loss = a1**2
opt = tf.train.AdamOptimizer(learning_rate=.1, beta1=0., beta2=0.9)

# Choose a loss scale manager which decides how to pick the right loss scale
# throughout the training process.
loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(5000)

# Wrap the original optimizer in a LossScaleOptimizer.
loss_scale_optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(opt, loss_scale_manager)

# Compute gradients.
grads_and_vars = loss_scale_optimizer.compute_gradients(loss, var_list=model_params, colocate_gradients_with_ops=True)

# Remove (grad, var) pairs whose grad is None.
grads_and_vars = [(grad, var) for grad, var in grads_and_vars if grad is not None]

# Call apply_gradients() on the loss scale optimizer.
train_op = loss_scale_optimizer.apply_gradients(grads_and_vars)

Still, I wonder whether the source of LossScaleOptimizer should be updated so that its behavior is consistent with the other optimizers.
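For reference, a hedged sketch of what such a change might look like in contrib's loss_scale_optimizer.py, based on the loop shown in the traceback above (a hypothetical fix, not an actual patch):

# Hypothetical change inside LossScaleOptimizer.apply_gradients():
is_finite_grad = []
for g in grads:
  if g is not None:  # skip variables for which no gradient was computed
    is_finite_grad.append(math_ops.reduce_all(gen_math_ops.is_finite(g)))
is_overall_finite = math_ops.reduce_all(is_finite_grad)

This would let apply_gradients() tolerate None gradients the same way the wrapped optimizers already do.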

jvishnuvardhan (Contributor) commented:

I could reproduce the issue. Here is the gist. Thanks!

tanzhenyu (Contributor) commented:

There was a technical decision that users should pass in valid trainable variables, to avoid gradients being None.
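In the reproduction above, that would mean passing only the variables the loss actually depends on, for example (a sketch):

# Only a1 contributes to the loss, so no gradient is None.
train_op = loss_scale_optimizer.minimize(loss, var_list=[a1])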

reedwm (Member) commented Sep 3, 2019

This looks like a bug. However, contrib is being removed in TF 2.0, so the contrib LossScaleOptimizer is deprecated and no longer maintained.

You can use a tf.keras.mixed_precision.experimental.LossScaleOptimizer with a tf.keras optimizer. Or in TF 1, a tf.train.experimental.MixedPrecisionLossScaleOptimizer with a non-keras optimizer.
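For illustration, a minimal sketch of both recommended wrappers (untested; the fixed loss scale of 5000 mirrors the original example):

# TF 2 / Keras path: wrap a Keras optimizer.
keras_opt = tf.keras.optimizers.Adam(learning_rate=0.1)
keras_lso = tf.keras.mixed_precision.experimental.LossScaleOptimizer(keras_opt, 5000)

# TF 1 path: wrap a non-Keras (tf.train) optimizer.
v1_opt = tf.train.AdamOptimizer(learning_rate=0.1)
v1_lso = tf.train.experimental.MixedPrecisionLossScaleOptimizer(v1_opt, 5000)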

I'm closing this issue, as even if it is fixed, the fix will never make it to a stable version of TensorFlow.

reedwm closed this as completed Sep 3, 2019