I think I'm running out of memory when I run the training code on the GPU. In nvidia-smi I can see the memory usage climb to the maximum and then drop again. However, I don't know whether this is TF trying to claim all available memory or the code itself. I changed the batch size to 16 but still have the same problem.
After all the CUDA libraries are loaded, I get "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR" at 13:47:58.783613.
I'm using Python 3.6.9 and 10 GB of GPU memory.
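One way to tell TF's own allocation apart from what the training code needs: by default TensorFlow reserves nearly all of the GPU memory visible to the process, which could explain the spike in nvidia-smi. The sketch below asks TensorFlow to allocate on demand through the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable. It assumes the installed build honours that variable, and the helper names `allow_growth_env` and `build_command` are hypothetical, not part of the training code. Whether this removes the cuDNN error on this setup is untested.

```python
import os
import sys


def allow_growth_env(base_env=None):
    """Return a copy of the environment that requests on-demand GPU allocation.

    Hypothetical helper: it only prepares the environment for a child process
    and does not import TensorFlow itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Read by TensorFlow when it initialises the GPU; "true" asks it to grow
    # its allocation as needed instead of reserving almost everything up front.
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def build_command(batch_size=16):
    """Rebuild the command line from the report (script name taken from it)."""
    return [
        sys.executable,
        "train_market1501.py",
        "--mode=train",
        "--batch_size={}".format(batch_size),
    ]
```

To use it, pass both results to `subprocess.run(build_command(), env=allow_growth_env())` from the directory that contains `train_market1501.py`, then watch nvidia-smi again: if usage now grows gradually rather than jumping to the maximum, the earlier spike came from TF's preallocation.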
$ python train_market1501.py --mode=train --batch_size=16
2021-07-23 13:47:53.804516: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
Using TensorFlow backend.
Train set size: 11606 images, 676 identities
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:229: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:236: The name tf.read_file is deprecated. Please use tf.io.read_file instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:238: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py:373: The name tf.FIFOQueue is deprecated. Please use tf.queue.FIFOQueue instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:252: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:19: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:28: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.
feature dimensionality: 128
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:97: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/losses.py:142: The name tf.log is deprecated. Please use tf.math.log instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:266: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:268: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:270: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:276: The name tf.losses.get_regularization_loss is deprecated. Please use tf.compat.v1.losses.get_regularization_loss instead.
---------------------------------------
Run ID: RLKUKK
Log directory: ./output/market1501/RLKUKK
---------------------------------------
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py:464: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
2021-07-23 13:47:56.598199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-23 13:47:56.630062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.630541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1634] Found device 0 with properties:
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.56
pciBusID: 0000:01:00.0
2021-07-23 13:47:56.630588: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.632912: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:56.633926: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-23 13:47:56.634333: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-23 13:47:56.636568: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-23 13:47:56.637146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-23 13:47:56.637365: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:56.637504: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.638077: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.638349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1762] Adding visible gpu devices: 0
2021-07-23 13:47:56.643456: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2021-07-23 13:47:56.643902: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61dc1a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-07-23 13:47:56.643913: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-07-23 13:47:56.690559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.691044: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61aefc0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-07-23 13:47:56.691061: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1650, Compute Capability 7.5
2021-07-23 13:47:56.691354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.691822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1634] Found device 0 with properties:
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.56
pciBusID: 0000:01:00.0
2021-07-23 13:47:56.691863: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.691943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:56.692006: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-23 13:47:56.692055: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-23 13:47:56.692082: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-23 13:47:56.692113: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-23 13:47:56.692127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:56.692242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.692557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.692838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1762] Adding visible gpu devices: 0
2021-07-23 13:47:56.692879: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.922074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1175] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-23 13:47:56.922102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] 0
2021-07-23 13:47:56.922108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1194] 0: N
2021-07-23 13:47:56.922484: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.923047: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.923448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2647 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-07-23 13:47:58.050318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:58.454611: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:58.783613: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.791662: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.799769: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.807322: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.914962: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915018: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915063: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915090: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1_1/Conv2D}}]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1_1/Conv2D}}]]
[[train_op/control_dependency/_373]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1004, in managed_session
yield sess
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775, in train
train_step_kwargs)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py", line 613, in _train_step_fn
session, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[train_op/control_dependency/_373]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'conv1_1/Conv2D':
File "train_market1501.py", line 130, in <module>
main()
File "train_market1501.py", line 71, in main
**train_kwargs)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 177, in train_loop
trainable_scopes=trainable_scopes)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 254, in create_trainer
feature_var, logit_var = network_factory(image_var)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 120, in factory_fn
weight_decay=weight_decay)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 26, in create_network
weights_regularizer=conv_regularizer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1162, in convolution2d
conv_dims=2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1060, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 201, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1176, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 662, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 252, in __call__
name=self.name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2052, in conv2d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_market1501.py", line 130, in <module>
main()
File "train_market1501.py", line 71, in main
**train_kwargs)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 188, in train_loop
save_interval_secs=save_interval_secs, number_of_steps=number_of_steps)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py", line 468, in run
**kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 790, in train
ignore_live_threads=ignore_live_threads)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1014, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 839, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1045, in run_loop
[self._sv.summary_op, self._sv.global_step])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[conv4_1/1/Elu-0-1-TransposeNCHWToNHWC-LayoutOptimizer/_333]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'conv1_1/Conv2D':
File "train_market1501.py", line 130, in <module>
main()
File "train_market1501.py", line 71, in main
**train_kwargs)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 177, in train_loop
trainable_scopes=trainable_scopes)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 254, in create_trainer
feature_var, logit_var = network_factory(image_var)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 120, in factory_fn
weight_decay=weight_decay)
File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 26, in create_network
weights_regularizer=conv_regularizer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1162, in convolution2d
conv_dims=2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1060, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 201, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1176, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 662, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 252, in __call__
name=self.name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2052, in conv2d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
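The log above is long, so the two facts that matter for a memory question are easy to miss: the device line reports a GeForce GTX 1650 with 2647 MB available to TensorFlow, which does not match the 10 GB mentioned in the description, and the cuDNN handle creation fails right after. A minimal sketch for pulling those facts out of a saved log follows. The function name `summarize_log` and the regular expressions are assumptions written against the exact message formats shown in this log, and other TensorFlow versions may word them differently.

```python
import re

# Patterns written against the message formats seen in the log above.
DEVICE_RE = re.compile(
    r"Created TensorFlow device \(.*? with (\d+) MB memory\).*?name: ([^,]+),"
)
CUDNN_RE = re.compile(r"Could not create cudnn handle: (\w+)")


def summarize_log(lines):
    """Collect the GPU name, the memory TensorFlow reports, and cuDNN errors."""
    summary = {"device": None, "memory_mb": None, "cudnn_errors": []}
    for line in lines:
        device_match = DEVICE_RE.search(line)
        if device_match:
            summary["memory_mb"] = int(device_match.group(1))
            summary["device"] = device_match.group(2).strip()
        cudnn_match = CUDNN_RE.search(line)
        if cudnn_match:
            summary["cudnn_errors"].append(cudnn_match.group(1))
    return summary


# Two lines copied from the log above, used as sample input.
sample = [
    "2021-07-23 13:47:56.923448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2647 MB memory) "
    "-> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, "
    "compute capability: 7.5)",
    "2021-07-23 13:47:58.783613: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR",
]
print(summarize_log(sample))
```

Comparing `memory_mb` against the total shown by nvidia-smi gives a quick check of how much memory TensorFlow actually had to work with when cuDNN tried to initialise.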