Hi, I get an error message like this: #19

Open
ross-Hr opened this issue Oct 12, 2022 · 8 comments

Comments

@ross-Hr

ross-Hr commented Oct 12, 2022

2022-10-12 15:43:57.254005: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.
I1012 15:43:57.996680 140468541171456 api.py:459] train_step begins...
I1012 15:44:07.279798 140468532778752 api.py:459] train_step begins...
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:10.852259 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:17.169317 140468541171456 api.py:446] Trainable variables:
I1012 15:44:17.426999 140468541171456 api.py:446] vit/stem_conv/kernel:0 (16, 16, 3, 768)
I1012 15:44:17.432081 140468541171456 api.py:446] vit/stem_conv/bias:0 (768,)
I1012 15:44:17.436969 140468541171456 api.py:446] vit/stem_ln/gamma:0 (768,)
....
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:31.484436 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:37.695064 140468532778752 api.py:459] train_step ends...
I1012 15:44:38.920633 140468541171456 api.py:459] train_step ends...
2022-10-12 15:45:08.671253: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 351529
Traceback (most recent call last):

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in get_area
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in <listcomp>
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

KeyError: 351529
2022-10-12 15:45:08.671413: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 415619
Traceback (most recent call last):

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in get_area
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in <listcomp>
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

KeyError: 415619


My GPUs are 2 × RTX 3070 with 8 GB.

@ross-Hr
Author

ross-Hr commented Oct 12, 2022

Is the GPU memory too small?

@chentingpc
Collaborator

chentingpc commented Oct 12, 2022 via email

@ross-Hr
Author

ross-Hr commented Oct 18, 2022

It was an annotations error. I reloaded the annotations, which solved it.
But now I get a new error:

W1018 09:27:13.350448 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense1.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
W1018 09:27:13.350490 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
W1018 09:27:13.350531 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias

I am using tf==2.10.0.

@chentingpc replied (via email): This looks like a data issue, since the complaint was about a KeyError, probably related to an image ID.
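
The traceback shows the KeyError being raised in `get_area`, where each entry of `ids` is looked up in `id_to_ann`. Below is a minimal sketch of a check that could be run before training. It assumes a COCO-style annotations dict whose `annotations` entries carry `id` and `area`, and it assumes the keys are annotation ids (whether the repo keys by annotation id or image id is not confirmed here). The helpers `build_id_to_ann` and `find_missing_ids` and the sample data are hypothetical, not part of the repo.

```python
def build_id_to_ann(coco):
    """Index COCO-style annotations by their id (assumed key)."""
    return {ann["id"]: ann for ann in coco["annotations"]}


def find_missing_ids(id_to_ann, ids):
    """Return the ids that would raise KeyError in id_to_ann[i]['area']."""
    return [i for i in ids if i not in id_to_ann]


# Hypothetical, tiny stand-in for a loaded annotations JSON file.
coco = {
    "annotations": [
        {"id": 1, "image_id": 10, "area": 120.0},
        {"id": 2, "image_id": 10, "area": 64.5},
    ]
}
id_to_ann = build_id_to_ann(coco)

# Ids as they might come from the dataset records; 351529 is absent here.
ids_from_dataset = [1, 2, 351529]
missing = find_missing_ids(id_to_ann, ids_from_dataset)
print("missing ids:", missing)
```

A non-empty result would suggest that the annotations file and the dataset records come from different versions or splits, which is consistent with reloading the annotations fixing the error.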

@chentingpc
Collaborator

chentingpc commented Oct 18, 2022 via email

@ross-Hr
Author

ross-Hr commented Oct 18, 2022

@chentingpc replied (via email): It looks like the specified checkpoint (either the pretrained checkpoint, or a checkpoint restored from the last training run in the same model directory) differs from the configured architecture/encoder. Please check that the architecture/encoder variant, depth, dim, etc. match.

I cloned the repo with git and did not change anything.
Which version of TF are you using?
I put the Objects365 checkpoint into model_dir and ran the following command:

python3 run.py --mode=train --model_dir=/data/c/Objects365-vitb-640/ --config=configs/config_det_finetune.py --config.dataset.data_dir=/data/c/pix2seq --config.dataset.coco_annotations_dir=/data/c/annotations --config.train.batch_size=8 --config.train.epochs=20 --config.optimization.learning_rate=3e-5

but I get the error above.
config.dataset.data_dir points to my offline COCO TFDS copy.

By the way, I wonder if this is wrong:
[image]

@ross-Hr
Author

ross-Hr commented Oct 18, 2022

Well, I changed this call in model.py:
latest_ckpt, ckpt, self._verify_restored = utils.restore_from_checkpoint(model_dir, False, model=model, global_step=optimizer.iterations, optimizer=optimizer)
by switching False to True, i.e. using
checkpoint.restore(latest_ckpt).expect_partial()
which avoids the error. But I am still confused about why.
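
The warnings quoted earlier all concern the optimizer's state 'v' rather than model weights, and `expect_partial()` silences reports about checkpoint values that were not matched. As a rough way to tell an optimizer-only mismatch from an architecture mismatch, the sketch below splits unmatched checkpoint entries into the two groups. It works on plain lists of names; the helper `split_unmatched` and the sample names are hypothetical. With TensorFlow, the checkpoint side of such a list could be read with `tf.train.list_variables`, although its key format may differ from the paths printed in the warnings.

```python
def split_unmatched(checkpoint_names, model_names):
    """Split checkpoint entries absent from the model into two groups."""
    model_set = set(model_names)
    unmatched = [n for n in checkpoint_names if n not in model_set]
    # Entries mentioning the optimizer are treated as optimizer state.
    optimizer_only = [n for n in unmatched if "optimizer" in n]
    model_side = [n for n in unmatched if "optimizer" not in n]
    return optimizer_only, model_side


# Hypothetical names, shortened from the paths in the warnings.
checkpoint_names = [
    "model.decoder.dec_layers.5.mlp.dense2.kernel",
    "model.decoder.dec_layers.5.mlp.dense2.bias",
    "optimizer.v.model.decoder.dec_layers.5.mlp.dense2.kernel",
    "optimizer.v.model.decoder.dec_layers.5.mlp.dense2.bias",
]
model_names = [
    "model.decoder.dec_layers.5.mlp.dense2.kernel",
    "model.decoder.dec_layers.5.mlp.dense2.bias",
]

optimizer_only, model_side = split_unmatched(checkpoint_names, model_names)
print("unmatched optimizer entries:", len(optimizer_only))
print("unmatched model entries:", len(model_side))
```

If only optimizer entries are unmatched, the model weights were likely restored and the warnings are about optimizer state alone. Unmatched model entries would instead point to a checkpoint that does not fit the configured architecture.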

@ross-Hr
Author

ross-Hr commented Nov 8, 2022

@chentingpc
Hi, do you know how to debug code that runs through strategy.run(...) in the train_multiple_steps function?
I cannot step into the train_step function.

@chentingpc
Collaborator

You should be able to use pdb in the code when running in eager mode.
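
A minimal sketch of what eager-mode debugging could look like, assuming TensorFlow 2.x. `tf.config.run_functions_eagerly(True)` makes `tf.function` bodies run as ordinary Python, so a `pdb` breakpoint inside them is hit. The `train_multiple_steps` and `train_step` below are toy stand-ins, not the repo's functions, and the default single-device strategy is used in place of the multi-GPU one.

```python
import tensorflow as tf

# Run tf.function bodies as plain Python so breakpoints work inside them.
tf.config.run_functions_eagerly(True)

# Default (single-device) strategy; an assumption made to keep the sketch small.
strategy = tf.distribute.get_strategy()


@tf.function
def train_multiple_steps(x):
    def train_step(v):
        # import pdb; pdb.set_trace()  # uncomment to stop inside train_step
        return v * 2.0

    return strategy.run(train_step, args=(x,))


result = train_multiple_steps(tf.constant(3.0))
print(float(result))
```

Eager execution is much slower than graph mode, so this is probably best kept to a few debugging steps and turned off again for real training.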
