
reproducibility and domain classifier loss #1

Open
vidit09 opened this issue Aug 27, 2020 · 26 comments

@vidit09 commented Aug 27, 2020

Hello,
Thanks for providing the code. I was able to run the code for Sim10k --> Cityscapes but got scores lower than reported: mAP 24.8, mAP@0.5 42.6, mAP@0.75 24.4, mAP@S 5.3, mAP@M 27.3, mAP@L 51.1. One thing I noticed is that the domain classifier loss drops to zero for all the pyramid scales after a few iterations. Could you please let me know what trend you see for the domain classifier losses? That could possibly be the reason I am seeing the lower scores.

Thanks

@chengchunhsu (Owner) commented Aug 27, 2020

Hi,

I assume you were running the pre-training stage (only GA is applied).

First, we do observe similar behavior: the domain classifier loss quickly drops to a small value.
Accordingly, we tend to select the weights from an earlier iteration (around 5000) rather than a later one (e.g., 20000 in your case) as the pre-trained weights for the next training stage (GA+CA).
I will add more related details to the README and config files.

Note that longer GA-only training can still improve performance, just not as much as GA+CA does.
In fact, the only reason we set the pre-training stage to 20000 iterations is to find the best result for the GA model, so that it can be compared fairly with the GA+CA model.

Second, there is still some randomness in the training procedure due to adversarial learning.

Here we provide the training log for Sim10k --> Cityscapes using VGG backbone (only GA).
Link: https://drive.google.com/file/d/1f-948P_FDSQ1fC-5Q6OsLVrDQ_obTrtM/view?usp=sharing

As far as I can tell, the performance at the final iteration is quite similar.
If you are still concerned about the reproducibility of the pre-training stage, we recommend evaluating the model at different iterations.
Otherwise, you can just move on to the next stage and see if there are other issues.

Hope the provided information is helpful!

Best,
Cheng-Chun

@vidit09 (Author) commented Aug 28, 2020

Thank you for your answer. The numbers reported above are after the GA+CA training, but I initialised the second-stage training with the weights from 20k iterations. I will try initialising the second stage with the weights from the 5k iteration. Could you please comment on the trend of the domain loss in your second-stage training?

Thanks
Vidit

@chengchunhsu (Owner) commented Aug 30, 2020

Hi,

I notice two issues that can lead to the small loss values shown in the log file.

  1. The logged loss values are scaled by the lambda value (e.g., L180 in trainer.py).
  2. The default BCELoss implementation of PyTorch applies a mean reduction over all elements (L64 in fcos_head_discriminator.py).

Accordingly, the loss value appears smaller than it should in the training log once it is rounded to four decimal places (see the sketch below).
I also observed these issues in a previous experiment.

For now, I believe the issue is caused by the implementation of the logger rather than the convergence of the training.
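For reference, here is a minimal sketch (not the repository code; the lambda value and tensor shapes are placeholders) of how a mean-reduced BCE loss scaled by a small lambda can round to zero in a four-decimal log:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # default reduction='mean' averages over every element

# Placeholder discriminator outputs for one FPN level (already sigmoid-activated).
probs = torch.sigmoid(torch.randn(2, 1, 32, 32))
domain_label = torch.zeros_like(probs)      # e.g., label 0 for this half of the batch

raw_loss = bce(probs, domain_label)         # already a mean over N*H*W elements
loss_lambda = 0.01                          # illustrative adversarial weight
logged_loss = loss_lambda * raw_loss        # this scaled value is what reaches the log

# Even a healthy raw loss (~0.7) logs as ~0.007; as the discriminator improves,
# the scaled value rounds to 0.0000 at four decimal places.
print(f"raw={raw_loss.item():.4f}  logged={logged_loss.item():.4f}")
```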

Best,
Cheng-Chun

@vidit09 (Author) commented Aug 31, 2020

Thanks for this update. Yes, I see that in the second stage the adversarial losses do not go to zero (P5-P7). Could you please tell me how to run the code without enabling DA?

Best,
Vidit

@chengchunhsu (Owner) commented Aug 31, 2020

Sure, you can turn off the adversarial adaptation by setting DA_ON to False (example).

Moreover, you can also clone the original version of FCOS (link) and run it.

Cheng-Chun

@vidit09 (Author) commented Sep 1, 2020

Thanks for the pointers. Another question I have is how you decide which iteration's weights to use to initialise the second-stage training. Do you use some validation set?

@chengchunhsu (Owner) commented:

In most of our experiments, the pre-trained iteration was set to 5k or 10k, except for the KITTI dataset, where we set it to 2k since KITTI requires fewer training iterations.
I think the result is not very sensitive to the pre-trained iteration.

Cheng-Chun

@andgitchang commented Sep 7, 2020

Excuse me for following up on this issue. @chengchunhsu
I noticed that the learning rate is also set to 0.005 in the single-GPU setting. However, don't we need to modify the learning rate according to the linear scaling rule, e.g., scale it to 0.00125 for single-GPU training (instead of 4 GPUs)?
In your experience, is adversarial training sensitive to the learning rate? It seems the larger learning-rate setting (Sim10k->Cityscapes, single GPU @ AP 49.7) is competitive with the camera-ready result (4 GPUs @ AP 49.0).
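For context, the linear scaling rule mentioned above simply scales the learning rate with the effective batch size; here is a minimal sketch, assuming the same number of images per GPU, so going from 4 GPUs to 1 GPU shrinks the batch (and the learning rate) by 4x:

```python
# Linear scaling rule: lr_new = lr_base * (batch_new / batch_base).
base_lr = 0.005        # learning rate used for 4-GPU training
base_gpus = 4
target_gpus = 1        # single-GPU training

scaled_lr = base_lr * target_gpus / base_gpus
print(scaled_lr)       # 0.00125
```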

@chengchunhsu (Owner) commented:

Hi @andgitchang

Thanks for bringing it up.
You are right that the learning rate should be adjusted based on the batch size.

Performing single-GPU training without adjusting the learning rate was simply a mistake.
I will fix the problem and note it in the README later.

As for your question, I am not sure of the answer since I do not have a direct comparison between different learning rates.
In general, though, a large learning rate can make the convergence of adversarial learning unstable.
Therefore, I recommend using the adjusted learning rate for single-GPU training.

Best,
Cheng-Chun

@andgitchang commented:

@chengchunhsu Thanks for your suggestion. I tried to reproduce VGG-16 Cityscapes->Foggy Cityscapes and ended up reaching AP 38.7 at iteration ~6000 (AP 38.3 at iteration 8000). The only difference is that the learning rate is 4x larger than in your script. It seems the 4x larger learning rate somehow didn't blow up the adversarial training and even increased performance by a large margin.

@vidit09 (Author) commented Sep 10, 2020

@chengchunhsu For sim10k->cityscapes without DA, for how many iterations do you train the network? Simply setting DA_ON to False did not work for me. I set USE_DIS_GLOBAL to False and also had to modify tools/train_net.py and engine/trainer.py in order to train on the source domain only; you can find the changes here --> https://gist.github.com/vidit09/dad5b704ebe8c0468d75f72838d9d799 . The AP@0.5 achieved in this case is 45.0, higher than the one reported in the paper (39.8). Could you please share the code used for source-domain-only training so I can verify what I did wrong?

Best,
Vidit

@andgitchang commented:

@vidit09 I didn't try the source-only experiment, since the proposed GA+CA method is trained from an ImageNet-pretrained backbone, not from a source-only model. Your modified trainer looks okay. Please also make sure DATASETS.TRAIN/TEST are set to sim10k/cityscapes accordingly. Otherwise, cloning the original FCOS repo may be the simplest way to reproduce/verify the source-only model.

@vidit09 (Author) commented Sep 11, 2020

Thanks, @andgitchang, for the clarification, but I still don't see how the original FCOS repo will give me the source-only result, since it was not trained on Sim10k.

@andgitchang commented:

@vidit09 It's straightforward, since both FCOS and EveryPixelMatters are built on maskrcnn-benchmark. You just need to repeat these steps in the FCOS repo, including paths_catalog and the configs.

@vidit09 (Author) commented Sep 11, 2020

@andgitchang I also need to know what learning rate was used and for how many iterations the network was trained.

@chengchunhsu (Owner) commented Sep 12, 2020

Hi @andgitchang

It seems our training is not sensitive to the learning rate when a small batch size is used.
Thanks for the feedback!

Please let me know if you found anything else.

Cheng-Chun

@chengchunhsu (Owner) commented:

Hi @vidit09,

I cannot see the exact problem from your description.
Generally, there is some randomness during training, yet the margin (i.e., 39.8 vs. 45.0) seems a bit too large.

As mentioned by @andgitchang, running the clean FCOS code would be an easier way, since our implementation and FCOS share the same code base.
This is exactly how we obtained the source-only result.
The only modifications are to the configs, the dataloader, and the VGG backbone.
Moreover, the training iterations and learning rate are set to 24k and 0.005, respectively.

I hope this can help you detect the problem.

Best,
Cheng-Chun

@vidit09 (Author) commented Sep 14, 2020

> In most of our experiments, the pre-trained iteration was set to 5k or 10k, except for the KITTI dataset, where we set it to 2k since KITTI requires fewer training iterations.
> I think the result is not very sensitive to the pre-trained iteration.
>
> Cheng-Chun

Hi Cheng-Chun,
Following your above suggestion, I trained GA+CA with the weights initialised from the 5k and the 10k iteration. The corresponding final AP@0.5 for sim10k-->cityscapes is 49.2 and 46, respectively. So it looks like the choice of iteration does influence the final result.

Best,
Vidit

@vidit09 (Author) commented Sep 15, 2020

Hi Cheng-Chun,
I trained the source-only model for 24k iterations with lr 0.005 and got 39.9 for sim10k-->cityscapes. Previously, I had trained for 80k iterations following the single-GPU training example, for which I achieved 45 AP. So here as well the number of iterations affects the final score. Could you please tell me how these maximum iterations were decided?

Best,
Vidit

@chengchunhsu (Owner) commented:

Hi @vidit09,

Sorry for the late reply as I was a bit busy this week.

> Following your above suggestion, I trained GA+CA with the weights initialised from the 5k and the 10k iteration. The corresponding final AP@0.5 for sim10k-->cityscapes is 49.2 and 46, respectively. So it looks like the choice of iteration does influence the final result.

Thanks for letting me know!
Just note that the training iteration of the second stage could also affect the final result sometimes.

> I trained the source-only model for 24k iterations with lr 0.005 and got 39.9 for sim10k-->cityscapes. Previously, I had trained for 80k iterations following the single-GPU training example, for which I achieved 45 AP. So here as well the number of iterations affects the final score. Could you please tell me how these maximum iterations were decided?

The selection of the training iteration is a bit tricky in domain adaptation, since we should not evaluate the model on the target dataset. Moreover, the randomness that occurs during training can also create a performance margin on the target domain.

For the source-only detector, the training iteration is derived from the original FCOS setting on the Cityscapes dataset.
We have also tried evaluating the model at different iterations but still could not achieve any better performance.
In addition, we have not tried to train a source-only detector on one GPU,
so I am not sure whether the difference is caused by the single-GPU setting or by the choice of training iterations.

@vidit09 (Author) commented Sep 21, 2020

Thanks, Cheng-Chun, for the clarifications.

@andgitchang commented:

Hi @chengchunhsu,
As described in subsection 3.2 of the paper, the domain label z of the source and target domains is 1 and 0, respectively; however, I noticed that in your implementation the source/target labels are the opposite (1 for target and 0 for source).
If I use Loss_D(source, 0) + Loss_D(target, 1) and unfold the binary cross-entropy terms, the discriminator loss may not be equivalent to your current Eq. (1).
I appreciate your help, thanks.
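To make the label convention concrete, here is a rough sketch (not the repository code; `d_source`/`d_target` stand for sigmoid outputs of the discriminator on source and target features) of how the domain labels enter the BCE terms:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_source, d_target, source_label=1.0, target_label=0.0):
    # BCE per element: -z*log(d) - (1-z)*log(1-d), mean-reduced by default.
    loss_s = F.binary_cross_entropy(d_source, torch.full_like(d_source, source_label))
    loss_t = F.binary_cross_entropy(d_target, torch.full_like(d_target, target_label))
    return loss_s + loss_t

# With the paper's convention (z=1 for source, z=0 for target) the source term is
# -log(D(x_s)) and the target term is -log(1 - D(x_t)); swapping the labels swaps
# which log-term each domain contributes, which is the mismatch described above.
```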

@chengchunhsu (Owner) commented:

Hi @andgitchang,

Now the code is consistent with the equation in the paper.
Thanks for pointing it out.

Cheng-Chun

@unlabeledData commented:

> In most of our experiments, the pre-trained iteration was set to 5k or 10k, except for the KITTI dataset, where we set it to 2k since KITTI requires fewer training iterations.
> I think the result is not very sensitive to the pre-trained iteration.

I set the pre-training iterations to 2k, 5k, 10k, and 20k for the KITTI->Cityscapes task.
The second-stage (GA+CA) results (AP50) are 33.4, 40.1, 38.5, and 38.6, respectively.
However, in your paper it is 43.2. All the experiments were trained using your scripts and hyperparameters.

In my opinion, we could select the pre-training iteration according to the model's performance on the source domain (a KITTI validation set), since we cannot evaluate the model on the target dataset.
However, I find that you used all KITTI training images containing cars for domain adaptation, so it is not possible to select a proper iteration on the source domain.

Why do you think my reproduction results are so strange? It looks like the best pre-training iteration appears at random. I have no idea about it.

@mochaojie commented:

Hi, I have received the Cityscapes dataset. However, when I click the link for Foggy Cityscapes, I don't know which one I really need. On the website, I see some Foggy Cityscapes datasets for semantic segmentation, not for object detection. Could you give me the specific link for the foggy dataset? Thanks!

@tmp12316 commented Feb 3, 2021

> @chengchunhsu For sim10k->cityscapes without DA, for how many iterations do you train the network? Simply setting DA_ON to False did not work for me. I set USE_DIS_GLOBAL to False and also had to modify tools/train_net.py and engine/trainer.py in order to train on the source domain only; you can find the changes here --> https://gist.github.com/vidit09/dad5b704ebe8c0468d75f72838d9d799 . The AP@0.5 achieved in this case is 45.0, higher than the one reported in the paper (39.8). Could you please share the code used for source-domain-only training so I can verify what I did wrong?

Hi, I also get much higher source-only results on both Sim10k and KITTI. Have you reproduced similar results on a pure FCOS baseline? Thank you. By the way, I got a similar source-only result (about 26 to 27) on Cityscapes by setting the lambda of the discriminator loss to 0, while I still get very high results on the other two datasets: about 44 for KITTI and 47 for Sim10k. All the experiments use ResNet-101.
