Reproducing the published COCO training results #253

andravin · 2021-11-01T00:29:44Z

andravin
Nov 1, 2021

Does efficientdet-pytorch reproduce the COCO dataset training results published in any version of the paper EfficientDet: Scalable and Efficient Object Detection?

I get reasonable results for efficientdet_d0 when training from scratch with --no-pretrained-backbone, but efficientdet_d1 results are not very close to the published results, and efficientdet_d2 actually is worse than the smaller models.

Edit: @rwightman points out that the EfficientDet paper initializes the model's backbone with ImageNet pretrained weights, so we shouldn't expect the same 300 epoch training schedule to reproduce the published results. Yet there may be other differences with the official implementation, which we pursue below.

rwightman · 2021-11-01T05:26:10Z

rwightman
Nov 1, 2021
Maintainer

@andravin there do appear to be some differences in behaviour from the official impl, as I've mentioned before after the official impl updated there weights at one point I found it hard to match them. It was part of a change they made for the longer training schedule on the larger models, but they also reposted better weights for the smaller ones but supposedly at the same epoch counts.

The official models are trained on TPUs with a custom batch norm syncing, that differs a bit from PyTorch. I've also found that PyTorch sync-bn really messed up some of my training runs (seems somewhat weight init specific) so I don't use it anymore but that does decrease stability of training as model size increases.

I realize I didn't train a pure D2 but I did train a 'Q2' fairly well which is similar but more params due to the QuadFPN. Certainly better than Q1 and Q0 runs and similarly D0 and D1. Again, I had to disable sync-bn.

I also always train with pretrained backbone. The original training scheme uses pretrained backbones so I don't think your proposed hparams would be similar. I don't recall if it was a blog post or another paper where the non-pretrained backbones were used, but I'm pretty sure they upped the epochs to 450 and beyond...

0 replies

andravin · 2021-11-01T06:33:10Z

andravin
Nov 1, 2021
Author

I also always train with pretrained backbone. The original training scheme uses pretrained backbones so I don't think your proposed hparams would be similar

Thanks, I will retry with the pre-trained backbone weights.

I think you are saying that PyTorch's SyncBatchNorm is broken. Is there a reference for this issue?

4 replies

rwightman Nov 1, 2021
Maintainer

@andravin I never figured out exactly what was happening but did burn a bunch of time on it. I have encountered a simpler failure case with SyncBatchNorm in the past, if you try training a standard ResNet for classification task and zero_init the last batch norm in the residual path it fails to train well. I suspect it's related.

andravin Nov 1, 2021
Author

I am searching the PyTorch github issues related to SyncBatchNorm. So far, none of the issues suggest that SyncBatchNorm computes the wrong output. Such a bug would affect many PyTorch users, not just efficientdet-pytorch users. I would expect it to have been reported shortly after it was introduced.

andravin Dec 7, 2021
Author

Here is a PyTorch SyncBatchNorm issue that was fixed last year: pytorch/pytorch#36382

andravin Dec 7, 2021
Author

Here is an interesting article about how someone went about isolating the above bug and others: https://ppwwyyxx.com/blog/2020/Fight-Against-Silent-Bugs-in-Deep-Learning-Libraries

andravin · 2021-11-01T23:52:41Z

andravin
Nov 1, 2021
Author

Initializing the backbone with pre-trained weights made a big difference. The *_d1 --model-ema training run overtook the --no-pretrained-backbone best accuracy after ~60 epochs.

0 replies

andravin · 2021-11-04T02:06:03Z

andravin
Nov 4, 2021
Author

@andravin there do appear to be some differences in behaviour from the official impl, as I've mentioned before after the official impl updated there weights at one point I found it hard to match them. It was part of a change they made for the longer training schedule on the larger models, but they also reposted better weights for the smaller ones but supposedly at the same epoch counts.

OK, this will be a review for you, but thought I would summarize what the different versions of the EfficientDet paper have to say about this.

In the last version (v7, 27 Jul 2020) of the EfficientDet paper, Appendix 1.1 "Hyperparameters" seems to imply that the increase in accuracy of EfficientNet-D1 from 38.3 in v1 of the paper to 40.2 in in v7 is entirely due to more training epochs (300) and greater scale jittering, [0.1, 2.0].

The penultimate paper (v6, 14 June 2020) reported EfficientDet-D1 at 39.1 mAP and trained for 300 epochs. It does not report the scale jittering used; but 39.1 is exactly the accuracy plotted in Figure 8 of v7 for the same model at 300 epochs and a lesser amount of scale jittter, [0.5, 1.5].

So the clues seem to plot a path from v1 to v7 that increases accuracy by increasing epochs to 300 and scale jitter to [0.1, 2.0].

It appears that the augmentation in efficientdet-pytorch uses the same scale jitter, [0.1, 2.0]:

efficientdet-pytorch/effdet/data/transforms.py

Lines 112 to 117 in c5b694a

    
           class RandomResizePad: 
        
               def __init__(self, target_size: int, scale: tuple = (0.1, 2.0), interpolation: str = 'random', 
        
                            fill_color: tuple = (0, 0, 0)): 
        
                   self.target_size = _size_tuple(target_size) 
        
                   self.scale = scale

Nevertheless, I would expect that learning rate schedule or scale jittering would be a good place to look for any difference in behavior between efficientdet-pytorch and the official EfficientDet implementation.

1 reply

rwightman Nov 4, 2021
Maintainer

@andravin I'm pretty sure 300 epochs was always used in the released models, at least since their code was available which defaulted to 300. I'm still not certain that jumps in accuracy were accounted for, not from the v1 but I think the iterations that had 39.x for D1 and then it popped up to 40.2 without changing epochs from what I could tell. Scale jitter was also always .1 to 2.0 from the point I implemented

andravin · 2021-11-04T20:28:16Z

andravin
Nov 4, 2021
Author

Here is the reported EfficientDet-D0 and D1 accuracy versus paper version on arxiv.org:

EfficientDet-D0

	v1	v2	v3	v7
mAP	32.4	32.9
test-dev AP			33.8	34.6
val AP			33.5	34.3

EfficientDet-D1

	v1	v2	v3	v7
mAP	38.3	38.9
test-dev AP			39.6	40.5
val AP			39.1	40.2

Since we seem to be an impasse, I think I will ask the authors to explain these numbers.

4 replies

rwightman Nov 4, 2021
Maintainer

I'm not sure there will be an ans for that question. On my end the differences haven't been significant enough for me to sink more time into it, I definitely dug through a few times, fixed a few small things that didn't end up having much impact.

Aside from potential bugs in my code, it's also really tough to say how significant the differences in numerics are. TPU vs GPU, Tensorflow vs PyTorch, batch norm eps / momentum differences, TF custom BN synv vs torch synbn (or none), SAME padding vs Pytorch, Pillow vs TF image processing, etc.

andravin Nov 4, 2021
Author

I'm not sure there will be an ans for that question.

The authors would of course know what they changed in each version of the paper.
I am very curious to learn how they achieved +2% mAP improvement between v1 and v7.

rwightman Nov 4, 2021
Maintainer

I meant as whether there is a response or not :p Either way, it might be worth training with their code to see if the results are actually reproducable there on GPU.

andravin Nov 5, 2021
Author

Not sure why anybody would consider it acceptable to ignore valid questions about the reproducibility of published results. I am not a scientist if I don't ask the question.

rwightman · 2021-11-05T02:36:52Z

rwightman
Nov 5, 2021
Maintainer

Agreed. Still probably worth trying to 4x GPU train in Tensorflow impl. Maybe just as challenging to reproduce there. If that's the case the question for them is different. Also understanding seed to seed variation is worthwhile. It's a week or two to run these exp for me so I've never done it.

…

On Thu, Nov 4, 2021, 6:41 PM Andrew Lavin ***@***.***> wrote: Not sure why anybody would consider it acceptable to ignore valid questions about the reproducibility of published results. I am not a scientist if I don't ask the question. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#253 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLQICH7YPO2TCUCYWPAONLUKM75LANCNFSM5HC4RBWA> .

0 replies

rwightman · 2021-11-17T01:46:48Z

rwightman
Nov 17, 2021
Maintainer

@andravin where did your experiments end up?

5 replies

andravin Nov 17, 2021
Author

I had to interrupt the experiments because urgent jobs needed the GPU resources.

andravin Dec 29, 2021
Author

@rwightman the mAP with the following hyperparameters is 38.7% at epoch 287; will have the final result soon.

Notice I did not use sync-batchnorm due to the large batch_size = 32 / GPU; also did not use exponential moving average. An experimental model did very well with the same exact hyperparamters.

I would guess the difference between my result and yours is ema. Thoughts?


DATASET=/data/mscoco

PROJECT=grr

BATCH_SIZE=32
PROCS_PER_NODE=4
LEARNING_RATE=0.16
WORKERS=5

EPOCHS=300
WARMUP=5

MODEL=efficientdet_d1
MASTER_PORT=29503

python3 -m torch.distributed.launch \
        --nproc_per_node=$PROCS_PER_NODE \
        --master_port=$MASTER_PORT \
        train.py \
        --batch-size $BATCH_SIZE \
        --lr $LEARNING_RATE \
        --epochs $EPOCHS \
        --warmup-epochs $WARMUP \
	--workers $WORKERS \
        --project $PROJECT \
        --model $MODEL \
	--native-amp \
	--opt momentum \
	$DATASET

andravin Dec 29, 2021
Author

model best:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.388
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.577
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.412
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.195
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.442
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.322
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.498
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.527
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.300
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.593
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.724

andravin Dec 31, 2021
Author

Checkpoint averaging did not change the mAP.

So, the 39.4 result in the efficientdet-pytorch README was trained with extra regularization in the form of --lr-noise 0.4 0.9.

My result of 38.8 might be the best efficientdet-pytorch can do with the hyperparameters published in the EfficientDet paper. It is close to the 38.9 mAP published in v2 of the EfficientDet paper and 39.1 published in v3. Again, the mystery is how the result became 40.2 in v7.

Maybe ema would help a bit, I might try it.

rwightman Dec 31, 2021
Maintainer

@andravin I'd be really curious to know where the official TF code ends up with the same GPU config / batch sizes / etc ..

Training with EMA could bring it up a little bit, likely closer to my 39.4, but not up to 40.2 based on my runs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducing the published COCO training results #253

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments 14 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Reproducing the published COCO training results #253

andravin Nov 1, 2021

Replies: 7 comments · 14 replies

rwightman Nov 1, 2021 Maintainer

andravin Nov 1, 2021 Author

rwightman Nov 1, 2021 Maintainer

andravin Nov 1, 2021 Author

andravin Dec 7, 2021 Author

andravin Dec 7, 2021 Author

andravin Nov 1, 2021 Author

andravin Nov 4, 2021 Author

rwightman Nov 4, 2021 Maintainer

andravin Nov 4, 2021 Author

rwightman Nov 4, 2021 Maintainer

andravin Nov 4, 2021 Author

rwightman Nov 4, 2021 Maintainer

andravin Nov 5, 2021 Author

rwightman Nov 5, 2021 Maintainer

rwightman Nov 17, 2021 Maintainer

andravin Nov 17, 2021 Author

andravin Dec 29, 2021 Author

andravin Dec 29, 2021 Author

andravin Dec 31, 2021 Author

rwightman Dec 31, 2021 Maintainer

andravin
Nov 1, 2021

Replies: 7 comments 14 replies

rwightman
Nov 1, 2021
Maintainer

andravin
Nov 1, 2021
Author

rwightman Nov 1, 2021
Maintainer

andravin Nov 1, 2021
Author

andravin Dec 7, 2021
Author

andravin Dec 7, 2021
Author

andravin
Nov 1, 2021
Author

andravin
Nov 4, 2021
Author

rwightman Nov 4, 2021
Maintainer

andravin
Nov 4, 2021
Author

rwightman Nov 4, 2021
Maintainer

andravin Nov 4, 2021
Author

rwightman Nov 4, 2021
Maintainer

andravin Nov 5, 2021
Author

rwightman
Nov 5, 2021
Maintainer

rwightman
Nov 17, 2021
Maintainer

andravin Nov 17, 2021
Author

andravin Dec 29, 2021
Author

andravin Dec 29, 2021
Author

andravin Dec 31, 2021
Author

rwightman Dec 31, 2021
Maintainer