
Epoch limit reached #9

Open
rakibhasan48 opened this issue Mar 19, 2018 · 4 comments

@rakibhasan48

I am running the following:
python train.py --slices 55 --width 12 --stride 1 --Bwidth 350 --vocabulary_size 29
--height 25 --train_data_pattern ./tf-data/handwritten-test-{}.tfrecords --train_dir models-feds
--test_data_pattern ./tf-data/handwritten-test-{}.tfrecords --max_steps 20 --batch_size 20 --beam_size 1
--input_chanels 1 --start_new_model --rnn_cell LSTM --model LSTMCTCModel --num_epochs 6000

Output:
FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:tensorflow:/job:master/task:0: Tensorflow version: 1.1.0.
(8750, '', 25, 350, 1)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Removing existing train directory.
INFO:tensorflow:/job:master/task:0: Flag 'start_new_model' is set. Building a new model.
INFO:tensorflow:Using batch size of 20 for training.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of training files: 3.
(8750, '\n', 25, 350, 1)
(8750, '', 25, 350, 1)
INFO:tensorflow:Using batch size of 20 for testing.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of testing files: 3.
(8750, '\n', 25, 350, 1)
(8750, '********************', 25, 350, 1)
Tensor("Reshape:0", shape=(20, 25, 350, 1), dtype=float32)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Starting managed session.
2018-03-19 18:25:13.630111: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630151: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630174: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630180: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:16.371931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-19 18:25:16.372284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-03-19 18:25:16.372318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2018-03-19 18:25:16.372332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2018-03-19 18:25:16.372347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
INFO:tensorflow:/job:master/task:0: Entering training loop.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:models-feds/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-03-19 18:25:19.788531: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
2018-03-19 18:25:19.789465: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
INFO:tensorflow:/job:master/task:0: Done training -- epoch limit reached.
INFO:tensorflow:/job:master/task:0: Exited training loop.

@ghost

ghost commented Apr 5, 2018

@rakibhasan48 what is your OS, and how many GB of RAM do you have?

@JiteshPshah

I am also getting the same error. What is the solution?

@rakibhasan48
Author

I tested on AWS with a Titan Xp, so specs shouldn't be a problem.

@johnsmithm
Owner

Check the line files = [data_pattern.format(j) for j in range(3)] if nameT=='train' else [data_pattern.format(j) for j in range(3,6)] in train.py to specify the number of training and testing tfrecords, and check that each tfrecord actually contains images and labels.
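
For reference, a minimal sketch of what that line selects and one way to confirm each shard is non-empty (assuming TensorFlow 1.x, as in the log above; the feature key names inside your records may differ, so treat the printed keys as a check, not a spec):

import tensorflow as tf

# Same selection logic as the quoted line from train.py:
# shards 0-2 go to training, 3-5 to testing.
# Adjust the ranges to match how many tfrecord shards you actually generated.
data_pattern = './tf-data/handwritten-test-{}.tfrecords'
nameT = 'train'
files = ([data_pattern.format(j) for j in range(3)]
         if nameT == 'train'
         else [data_pattern.format(j) for j in range(3, 6)])

for f in files:
    # Count records in each shard; an empty shard closes the input queue early.
    count = sum(1 for _ in tf.python_io.tf_record_iterator(f))
    print(f, count)
    # Peek at the feature keys of the first record to confirm images and labels are present.
    for record in tf.python_io.tf_record_iterator(f):
        example = tf.train.Example()
        example.ParseFromString(record)
        print(sorted(example.features.feature.keys()))
        break

If any of those paths are missing or contain zero records, the shuffle queues can close almost immediately, which could end training with the "Done training -- epoch limit reached" message after only a couple of steps, as in the log above.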
