Transfer learning on Jwyang's checkpoint #6

Open
Mahad-M opened this issue Jul 14, 2020 · 1 comment

Mahad-M commented Jul 14, 2020

I tried to resume training from a checkpoint created by jwyang's repository, using the -r flag. The checkpoint was trained for 50 epochs, and I now want to train it for 20 more. However, I get the following error when I train with the Adam optimizer.

python run.py train --net resnet101 --dataset voc_2007_trainval -r --epoch 50 --total_epoch 70 --cuda -o adam

Called with args:
Namespace(add_params=[], batch_size=None, class_agnostic=False, cuda=True, dataset='voc_2007_trainval', display_interval=100, epoch=50, learning_rate=None, lr_decay_gamma=None, lr_decay_step=None, mGPU=False, mode='train', net='resnet101', optimizer='adam', pretrain=False, resume=True, save_dir='models', session=1, total_epoch=70, vis_off=False)
Current device: CUDA:0
Using config:
GENERAL:
{'MAX_IMG_RATIO': 2.0,
'MAX_IMG_SIZE': 1200,
'MIN_IMG_RATIO': 0.5,
'MIN_IMG_SIZE': 800,
'POOLING_MODE': 'pool',
'POOLING_SIZE': 7}
TRAIN:
{'BATCH_SIZE': 2,
'BG_THRESHOLD_HI': 0.5,
'BG_THRESHOLD_LO': 0.0,
'BIAS_DECAY': False,
'DOUBLE_BIAS': True,
'FG_PROPOSAL_FRACTION': 0.25,
'FG_THRESHOLD': 0.5,
'LEARNING_RATE': 0.001,
'LR_DECAY_GAMMA': 0.1,
'LR_DECAY_STEP': 5,
'MOMENTUM': 0.9,
'PROPOSAL_PER_IMG': 256,
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_LABELS_FRACTION': 0.5,
'RPN_MAX_LABELS': 256,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESHOLD': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POST_NMS_TOP': 2000,
'RPN_PRE_NMS_TOP': 12000,
'USE_FLIPPED': False,
'WEIGHT_DECAY': 0.0005}
RPN:
{'ANCHOR_SCALES': [2, 4, 8, 16, 32],
'ANCHOR_RATIOS': [0.5, 1, 2, 4, 8],
'FEATURE_STRIDE': 16}
Loading image dataset...
WARNING! Cannot find "devkit_path" in additional parameters. Try to use default path (./data/VOCdevkit)...
Used image config: {'color_mode': 'BGR', 'range': 255, 'mean': [102.9801, 115.9465, 122.7717], 'std': [1.0, 1.0, 1.0]}
Data for voc_2007_trainval gt roidb loaded from /mnt/mahad/faster-rcnn-pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
Loaded PascalVoc 2007 trainval dataset.
Preparing image data...
Done.
Filtering image data (remove images without boxes)...
Before filtering, there are 403 images...
After filtering, there are 403 images...
Done.
Output directory: /mnt/mahad/faster-rcnn-pytorch/data/models/resnet101/voc_2007
Loading pretrained weights from /mnt/mahad/faster-rcnn-pytorch/data/pretrained_model/resnet101.pth...
Done.
Loading checkpoint /mnt/mahad/faster-rcnn-pytorch/data/models/resnet101/voc_2007/frcnn_1_50.pth...
Done.
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
Traceback (most recent call last):
File "run.py", line 135, in <module>
train(dataset=args.dataset, net=args.net, batch_size=args.batch_size,
File "/mnt/mahad/faster-rcnn-pytorch/script/train.py", line 141, in train
optimizer.step()
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
return wrapped(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/optim/adam.py", line 86, in step
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
KeyError: 'exp_avg'

With the SGD optimizer, the traceback is as follows:

python run.py train --net resnet101 --dataset voc_2007_trainval -r --epoch 50 --total_epoch 70 --cuda
Called with args:
Namespace(add_params=[], batch_size=None, class_agnostic=False, cuda=True, dataset='voc_2007_trainval', display_interval=100, epoch=50, learning_rate=None, lr_decay_gamma=None, lr_decay_step=None, mGPU=False, mode='train', net='resnet101', optimizer='sgd', pretrain=False, resume=True, save_dir='models', session=1, total_epoch=70, vis_off=False)
Current device: CUDA:0
Using config:
GENERAL:
{'MAX_IMG_RATIO': 2.0,
'MAX_IMG_SIZE': 1200,
'MIN_IMG_RATIO': 0.5,
'MIN_IMG_SIZE': 800,
'POOLING_MODE': 'pool',
'POOLING_SIZE': 7}
TRAIN:
{'BATCH_SIZE': 2,
'BG_THRESHOLD_HI': 0.5,
'BG_THRESHOLD_LO': 0.0,
'BIAS_DECAY': False,
'DOUBLE_BIAS': True,
'FG_PROPOSAL_FRACTION': 0.25,
'FG_THRESHOLD': 0.5,
'LEARNING_RATE': 0.001,
'LR_DECAY_GAMMA': 0.1,
'LR_DECAY_STEP': 5,
'MOMENTUM': 0.9,
'PROPOSAL_PER_IMG': 256,
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_LABELS_FRACTION': 0.5,
'RPN_MAX_LABELS': 256,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESHOLD': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POST_NMS_TOP': 2000,
'RPN_PRE_NMS_TOP': 12000,
'USE_FLIPPED': False,
'WEIGHT_DECAY': 0.0005}
RPN:
{'ANCHOR_SCALES': [2, 4, 8, 16, 32],
'ANCHOR_RATIOS': [0.5, 1, 2, 4, 8],
'FEATURE_STRIDE': 16}
Loading image dataset...
WARNING! Cannot find "devkit_path" in additional parameters. Try to use default path (./data/VOCdevkit)...
Used image config: {'color_mode': 'BGR', 'range': 255, 'mean': [102.9801, 115.9465, 122.7717], 'std': [1.0, 1.0, 1.0]}
Data for voc_2007_trainval gt roidb loaded from /mnt/mahad/faster-rcnn-pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
Loaded PascalVoc 2007 trainval dataset.
Preparing image data...
Done.
Filtering image data (remove images without boxes)...
Before filtering, there are 403 images...
After filtering, there are 403 images...
Done.
Output directory: /mnt/mahad/faster-rcnn-pytorch/data/models/resnet101/voc_2007
Loading pretrained weights from /mnt/mahad/faster-rcnn-pytorch/data/pretrained_model/resnet101.pth...
Done.
Loading checkpoint /mnt/mahad/faster-rcnn-pytorch/data/models/resnet101/voc_2007/frcnn_1_50.pth...
Done.
/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
Traceback (most recent call last):
File "run.py", line 135, in <module>
train(dataset=args.dataset, net=args.net, batch_size=args.batch_size,
File "/mnt/mahad/faster-rcnn-pytorch/script/train.py", line 141, in train
optimizer.step()
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
return wrapped(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/dl38/lib/python3.8/site-packages/torch/optim/sgd.py", line 106, in step
buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (100) must match the size of tensor b (3) at non-singleton dimension 3

Any help would be appreciated.

@loolzaaa
Owner

KeyError: 'exp_avg'

I will try to reproduce this myself and get back to you.
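In the meantime, one thing that may be worth checking (a sketch, not a confirmed diagnosis): a KeyError on 'exp_avg' inside adam.py would be consistent with the restored optimizer state not having been written by Adam. State saved by SGD with momentum holds a 'momentum_buffer' entry per parameter instead of 'exp_avg' and 'exp_avg_sq'. The helper name `optimizer_state_keys` and the toy dictionary below are hypothetical; only the 'state' / 'param_groups' layout follows what PyTorch's `Optimizer.state_dict()` returns.

```python
def optimizer_state_keys(opt_state_dict):
    """Collect every per-parameter state key found in an optimizer state_dict."""
    keys = set()
    for per_param in opt_state_dict.get("state", {}).values():
        keys.update(per_param.keys())
    return keys


# Toy stand-in for a state_dict saved by SGD with momentum (hypothetical values).
sgd_like = {
    "state": {0: {"momentum_buffer": None}, 1: {"momentum_buffer": None}},
    "param_groups": [{"lr": 0.001, "momentum": 0.9, "params": [0, 1]}],
}

found = optimizer_state_keys(sgd_like)
# Adam's step() reads these two entries from each parameter's state.
adam_ready = {"exp_avg", "exp_avg_sq"} <= found
print(found, "usable by Adam:", adam_ready)
```

If the real checkpoint shows only 'momentum_buffer', one option is to load the model weights but skip restoring the optimizer state when switching to Adam. Where the optimizer state sits inside the checkpoint file depends on how it was saved, so that part is left out here.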

As for SGD:

RuntimeError: The size of tensor a (100) must match the size of tensor b (3) at non-singleton dimension 3

The tensor sizes in the checkpoint are incompatible. What are the sizes of the network's input tensor, and what are the sizes of the checkpoint parameters?
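A minimal sketch of how the parameter sizes could be compared, assuming both sides are available as name-to-tensor dictionaries. The function name `shape_mismatches`, the `fake` helper, and the parameter names are hypothetical and only illustrate the idea; any object with a `.shape` attribute works, so the example runs without PyTorch.

```python
from types import SimpleNamespace


def shape_mismatches(model_state, ckpt_state):
    """Return {name: (model_shape, ckpt_shape)} for entries that differ.

    A shape of None means the name is missing on that side.
    """
    report = {}
    for name in sorted(set(model_state) | set(ckpt_state)):
        m = tuple(model_state[name].shape) if name in model_state else None
        c = tuple(ckpt_state[name].shape) if name in ckpt_state else None
        if m != c:
            report[name] = (m, c)
    return report


def fake(*shape):
    """Stand-in for a tensor: just an object carrying a .shape tuple."""
    return SimpleNamespace(shape=shape)


# Hypothetical parameter names and sizes, for illustration only.
model_state = {
    "conv.weight": fake(64, 3, 7, 7),
    "fc.weight": fake(21, 2048),
    "fc.bias": fake(21),
}
ckpt_state = {
    "conv.weight": fake(64, 3, 7, 7),
    "fc.weight": fake(4, 2048),
}

mismatches = shape_mismatches(model_state, ckpt_state)
for name, (m, c) in mismatches.items():
    print(f"{name}: model={m} checkpoint={c}")
```

With PyTorch, the first argument would be `model.state_dict()` and the second the weights dictionary obtained from `torch.load(path, map_location="cpu")`. Which key inside the checkpoint holds the weights depends on how the file was written, so treat that part as an assumption to verify.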
