
Cannot reproduce top-1 acc 77.0% on Kinetics #197

Open
gooners1886 opened this issue May 12, 2020 · 17 comments
Labels: question (Further information is requested)

Comments

gooners1886 commented May 12, 2020

I tried to train a model from scratch on Kinetics-400 using the same data as the Non-local Network. The config file is configs/Kinetics/SLOWFAST_8x8_R50.yaml, TRAIN.BATCH_SIZE is set to 64, and the hardware is 4x P40 GPUs. But I only got 74.13% top-1 accuracy on the validation set:

[INFO: logging.py: 67]: json_stats: {"split": "test_final", "top1_acc": "74.13", "top5_acc": "91.08"}

Is there something I need to modify to reproduce the 77% reported in the model zoo?

@haooooooqi haooooooqi added the question Further information is requested label May 12, 2020
haooooooqi (Contributor)

Hi, thanks for playing with PySlowFast!

I am not sure about your detailed setup (especially your dataset size and pre-processing), but one thing that does look wrong is the batch size. Could you make sure you have a batch size of 8 per GPU? If you only have 4 GPUs, you can change the LR following the linear scaling rule.
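
For anyone following along, here is a minimal sketch of the linear scaling rule mentioned above. The reference values (SOLVER.BASE_LR = 0.1 at TRAIN.BATCH_SIZE = 64) are an assumption based on the default SLOWFAST_8x8_R50 config, not something confirmed in this thread; substitute your own config values.

```python
# Linear scaling rule: the learning rate scales in proportion to the
# total mini-batch size. REFERENCE_BASE_LR = 0.1 at batch size 64 is an
# assumed default from the SLOWFAST_8x8_R50 config.
REFERENCE_BATCH_SIZE = 64
REFERENCE_BASE_LR = 0.1

def scaled_lr(total_batch_size: int) -> float:
    """Scale the base LR linearly with the total mini-batch size."""
    return REFERENCE_BASE_LR * total_batch_size / REFERENCE_BATCH_SIZE

print(scaled_lr(32))   # 0.05 -> e.g. 4 GPUs x 8 clips per GPU
print(scaled_lr(128))  # 0.2  -> e.g. 16 GPUs x 8 clips per GPU
```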

gooners1886 (Author)

@takatosp1

  1. My dataset is the same as the Non-local Network's: 234584 samples in the training set and 19760 samples in the validation set. I did no pre-processing because the Non-local Network videos already have the shorter side resized to 256. Is this config right?

  2. There is a difference between your config and mine in the number of GPUs:
     - your setting: 8 GPUs, batch size 8 per GPU, TRAIN.BATCH_SIZE set to 64
     - my setting: 4 GPUs, batch size 16 per GPU, TRAIN.BATCH_SIZE set to 64
     Since TRAIN.BATCH_SIZE is 64 in both cases, I think they should be equivalent. Are the two settings above equal, or do I need to change the LR?

haooooooqi (Contributor) commented May 12, 2020

The running mean and std are computed on each device (GPU) separately, so the running mean over 8 samples differs from the running mean over 16 samples. I think the more equivalent version would be a batch size of 32 with half of the original LR.
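
To illustrate the point about per-device statistics, here is a toy sketch (plain PyTorch, not PySlowFast internals): batch-norm layers compute statistics over the local per-GPU batch, so splitting the same 64 samples as 8-per-GPU versus 16-per-GPU yields different normalization statistics even though the global TRAIN.BATCH_SIZE is identical.

```python
import torch

# Toy illustration: statistics over a local batch of 8 differ from
# statistics over a local batch of 16, even when both are drawn from the
# same global batch. This is why per-GPU batch size matters for BatchNorm.
x = torch.randn(16, 8)            # a local batch of 16 samples, 8 features
stats_over_8 = x[:8].mean(dim=0)  # what a GPU holding 8 samples would see
stats_over_16 = x.mean(dim=0)     # what a GPU holding 16 samples would see
print(torch.allclose(stats_over_8, stats_over_16))  # False in general
```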

gooners1886 (Author)

@takatosp1 thank you very much for your guidance!
Another question: when I train Kinetics-400 with 4 GPUs, batch size 32, and half the LR, should I also modify SOLVER.MAX_EPOCH, WARMUP_EPOCHS, and WARMUP_START_LR?

gurkirt commented May 30, 2020

@gooners1886 were you able to train it on 4 GPUs?

gurkirt commented May 30, 2020

Also, when I trained I3D from scratch, I got "split": "test_final", "top1_acc": "72.82", "top5_acc": "90.65", which is about 0.5 lower than mentioned in the model zoo. @takatosp1 can you please reveal what you achieved with this code while benchmarking?

When I test with the provided Caffe2 weights I get:
I3D_R50 {"split": "test_final", "top1_acc": "73.04", "top5_acc": "90.34"}
SLOWFAST {"split": "test_final", "top1_acc": "76.44", "top5_acc": "92.22"}

For your reference, my dataset is the same as the dataset of the Non-local paper, as mentioned by Xiaolong here. I got a copy of it from him last year. It contains 234619 training videos and 19761 validation videos. However, when I run the testing script on the validation data I get a warning: [WARNING: meters.py: 302]: clip count tensor([30, 30, 30, ..., 30, 30, 30]) ~= num clips 30.

pyav==8.0.1
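
For context on where the 30 clips in that warning come from: if I read the default Kinetics test configs correctly, PySlowFast evaluates with TEST.NUM_ENSEMBLE_VIEWS = 10 temporal clips times TEST.NUM_SPATIAL_CROPS = 3 crops per video, i.e. 30 views, and averages the predictions. A minimal sketch of that aggregation (the actual meter internals may differ):

```python
import torch

def video_top1(view_logits: torch.Tensor) -> int:
    """view_logits: (num_views, num_classes) logits for one video,
    e.g. 10 temporal clips x 3 spatial crops = 30 views."""
    probs = torch.softmax(view_logits, dim=1)  # per-view class probabilities
    return int(probs.mean(dim=0).argmax())     # average over views, then argmax
```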

chunfuchen commented Jun 16, 2020

@gurkirt which SLOWFAST model did you test? I tested the SLOWFAST_8x8_R50.pkl model but I only get {"split": "test_final", "top1_acc": "74.55", "top5_acc": "91.36"}, which is about 2.5% worse. I have 19742 validation videos.

pyav==8.0.2
ffmpeg==4.2.3

haooooooqi (Contributor)

Thanks @gurkirt for the kind clarification. @chunfuchen feel free to follow what @gurkirt described and you should be able to reproduce the result.

gurkirt commented Jun 17, 2020

@takatosp1, is it expected to get 72.8 instead of 73.4 with the current setup? I know this is a small gap; I just want to make sure that I am not making any errors here.

gurkirt commented Jun 17, 2020

> @gurkirt which SLOWFAST model did you test? I tested the SLOWFAST_8x8_R50.pkl model

@chunfuchen I trained I3D_8x8_R50.cfg from scratch and got 72.8.

chunfuchen

@gurkirt is it possible I could get a copy of Kinetics-400 from you? Thanks.

gurkirt commented Jun 17, 2020

You can find it here: facebookresearch/video-nonlocal-net#67

chunfuchen

@takatosp1 I have followed @gurkirt's pointer to download the data.
I tested a model (SLOWFAST_8x8_R50) that has 77% top-1 accuracy on the model zoo page, but I only get 76.44%, which is 0.6% lower. (I did not retrain it; I just tested the model provided on GitHub.)
I know the model was trained and tested under Caffe2; do you expect a 0.6% gap when switching to PyTorch?

Thanks.

youngwanLEE

@chunfuchen same situation here.

Is there any way to get the intact Kinetics dataset?

bqhuyy commented Aug 9, 2020

> @takatosp1 thank you very much for your guidance!
> Another question: when I train Kinetics-400 with 4 GPUs, batch size 32, and half the LR, should I also modify SOLVER.MAX_EPOCH, WARMUP_EPOCHS, and WARMUP_START_LR?

Have you reproduced the result of SLOWFAST_8x8_R50? Can you share your config for training with 4 GPUs?

bqhuyy commented Aug 9, 2020

> The running mean and std are computed on each device (GPU) separately, so the running mean over 8 samples differs from the running mean over 16 samples. I think the more equivalent version would be a batch size of 32 with half of the original LR.

I am trying to reproduce SLOWFAST_8x8_R50 from scratch. Can you share the configuration for training on a 4-GPU machine? Thank you.

BoPang1996

I use the config configs/Kinetics/SLOWFAST_8x8_R50.yaml. The data I use is the copy shared by Xiaolong Wang for the Non-local paper, which contains 234643 training videos and 19761 validation videos. I do not have 16 nodes, so I trained the model on 2x8 V100 cards with a mini-batch of 8 on each card. The base learning rate is scaled to 0.2. The top-1 accuracy is 75.8, 1.2 lower than the official result.

For SLOWFAST_8x8_R101_101_101.yaml, the reproduced top-1 acc is 77.2, 0.7 lower than the official result.

For SLOW_8x8_R50.yaml, the reproduced top-1 acc is 74.0, 0.8 lower than the official result.

Does anyone else suffer from this problem? Is it caused by incomplete data?
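
Given that the training-set counts reported in this thread disagree (234584, 234619, 234643), a quick sanity check is to count the entries in your annotation CSVs. A minimal sketch, assuming the usual PySlowFast layout of one "path_to_video label" entry per line; the file names train.csv / val.csv are illustrative:

```python
def count_entries(csv_path: str) -> int:
    """Count non-empty lines, i.e. videos listed in the annotation file."""
    with open(csv_path) as f:
        return sum(1 for line in f if line.strip())

# Compare against the counts reported above (~234.6k train, ~19.76k val).
print("train:", count_entries("train.csv"))
print("val:", count_entries("val.csv"))
```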
