
Epoch limit reached #9

Open
rakibhasan48 opened this issue Mar 19, 2018 · 4 comments

@rakibhasan48

I am running the following:
python train.py --slices 55 --width 12 --stride 1 --Bwidth 350 --vocabulary_size 29
--height 25 --train_data_pattern ./tf-data/handwritten-test-{}.tfrecords --train_dir models-feds
--test_data_pattern ./tf-data/handwritten-test-{}.tfrecords --max_steps 20 --batch_size 20 --beam_size 1
--input_chanels 1 --start_new_model --rnn_cell LSTM --model LSTMCTCModel --num_epochs 6000

Output:
FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:tensorflow:/job:master/task:0: Tensorflow version: 1.1.0.
(8750, '', 25, 350, 1)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Removing existing train directory.
INFO:tensorflow:/job:master/task:0: Flag 'start_new_model' is set. Building a new model.
INFO:tensorflow:Using batch size of 20 for training.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of training files: 3.
(8750, '\n', 25, 350, 1)
(8750, '', 25, 350, 1)
INFO:tensorflow:Using batch size of 20 for testing.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of testing files: 3.
(8750, '\n', 25, 350, 1)
(8750, '********************', 25, 350, 1)
Tensor("Reshape:0", shape=(20, 25, 350, 1), dtype=float32)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Starting managed session.
2018-03-19 18:25:13.630111: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630151: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630174: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630180: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:16.371931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-19 18:25:16.372284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-03-19 18:25:16.372318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2018-03-19 18:25:16.372332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2018-03-19 18:25:16.372347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
INFO:tensorflow:/job:master/task:0: Entering training loop.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:models-feds/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-03-19 18:25:19.788531: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
2018-03-19 18:25:19.789465: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
INFO:tensorflow:/job:master/task:0: Done training -- epoch limit reached.
INFO:tensorflow:/job:master/task:0: Exited training loop.

@ghost

ghost commented Apr 5, 2018

@rakibhasan48 what is your OS, and how many GB of RAM do you have?

@JiteshPshah

I am also getting the same error. What is the solution?

@rakibhasan48
Author

I tested on AWS with a Titan Xp, so specs shouldn't be a problem.

@johnsmithm
Owner

Check the line files = [data_pattern.format(j) for j in range(3)] if nameT=='train' else [data_pattern.format(j) for j in range(3,6)] in train.py to specify the number of training and testing tfrecords, and check that each tfrecord actually contains images and labels.
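
For reference, a minimal sketch of what that line selects and one way to confirm each shard is non-empty (assuming TensorFlow 1.x, as in the log above; the feature key names inside your records may differ, so treat the printed keys as a check, not a spec):

import tensorflow as tf

# Same selection logic as the quoted line from train.py:
# shards 0-2 go to training, 3-5 to testing.
# Adjust the ranges to match how many tfrecord shards you actually generated.
data_pattern = './tf-data/handwritten-test-{}.tfrecords'
nameT = 'train'
files = ([data_pattern.format(j) for j in range(3)]
         if nameT == 'train'
         else [data_pattern.format(j) for j in range(3, 6)])

for f in files:
    # Count records in each shard; an empty shard closes the input queue early.
    count = sum(1 for _ in tf.python_io.tf_record_iterator(f))
    print(f, count)
    # Peek at the feature keys of the first record to confirm images and labels are present.
    for record in tf.python_io.tf_record_iterator(f):
        example = tf.train.Example()
        example.ParseFromString(record)
        print(sorted(example.features.feature.keys()))
        break

If any of those paths are missing or contain zero records, the shuffle queues can close almost immediately, which could end training with the "Done training -- epoch limit reached" message after only a couple of steps, as in the log above.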
