- Image classification task results
- ResNetCifar trained from scratch on CIFAR100
- Convformer finetuned from official pretrained weights on ImageNet1K (ILSVRC2012)
- DarkNet trained from scratch on ImageNet1K (ILSVRC2012)
- ResNet trained from scratch on ImageNet1K (ILSVRC2012)
- ResNet finetuned from ImageNet21K pretrained weights on ImageNet1K (ILSVRC2012)
- VAN finetuned from official pretrained weights on ImageNet1K (ILSVRC2012)
- ViT finetuned from self-trained MAE pretrained weights (400 epochs) on ImageNet1K (ILSVRC2012)
- ViT finetuned from official MAE pretrained weights (800 epochs) on ImageNet1K (ILSVRC2012)
- ResNet trained from ImageNet1K pretrained weights on ImageNet21K (Winter 2021 release)
- Knowledge distillation task results
- Masked image modeling task results
- Object detection task results
- All detection models trained from scratch on COCO2017
- All detection models finetuned from Objects365 pretrained weights on COCO2017
- All detection models trained on Objects365 (v2, 2020) from COCO2017 pretrained weights
- All detection models trained from scratch on VOC2007 and VOC2012
- All detection models finetuned from Objects365 pretrained weights on VOC2007 and VOC2012
- Semantic segmentation task results
- Instance segmentation task results
- Salient object detection task results
- Human matting task results
- OCR text detection task results
- OCR text recognition task results
- Face detection task results
- Face parsing task results
- Human parsing task results
- Interactive segmentation task results
- Diffusion model task results
Convformer
Paper:https://arxiv.org/pdf/2210.13452
DarkNet
Paper:https://arxiv.org/abs/1804.02767
ResNet
Paper:https://arxiv.org/abs/1512.03385
VAN
Paper:https://arxiv.org/abs/2202.09741
ViT
Paper:https://arxiv.org/abs/2010.11929
ResNetCifar differs from ResNet only in the first few layers (the stem).
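A minimal sketch of the typical difference, assuming the common CIFAR-style stem design (the exact layers in this repo may differ):

```python
import torch.nn as nn

# Standard ImageNet ResNet stem: aggressive early downsampling (stride 4 overall).
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-style stem: 32x32 inputs are too small to downsample immediately,
# so the 7x7/stride-2 conv and the max-pool are replaced by a single 3x3 conv.
cifar_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```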
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
ResNet18Cifar | 557.935M | 11.220M | 32x32 | 128 | 200 | 76.890 |
ResNet34Cifar | 1.164G | 21.328M | 32x32 | 128 | 200 | 78.010 |
ResNet50Cifar | 1.312G | 23.705M | 32x32 | 128 | 200 | 75.360 |
ResNet101Cifar | 2.531G | 42.697M | 32x32 | 128 | 200 | 77.180 |
ResNet152Cifar | 3.751G | 58.341M | 32x32 | 128 | 200 | 77.340 |
You can find more model training details in 0.classification_training/cifar100/.
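The macs and params columns throughout this section can be reproduced with a FLOP profiler. A minimal sketch using thop (an illustrative assumption; the repo may use a different counter):

```python
import torch
from thop import profile
from torchvision.models import resnet50

model = resnet50()
dummy = torch.randn(1, 3, 224, 224)  # match the "input size" column
macs, params = profile(model, inputs=(dummy,))
print(f"macs: {macs / 1e9:.3f}G, params: {params / 1e6:.3f}M")
```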
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
convformer-s18 | 3.953G | 24.184M | 224x224 | 2048 | 300 | 82.018 |
convformer-s36 | 7.663G | 37.424M | 224x224 | 1024 | 300 | 83.290 |
convformer-m36 | 12.876G | 53.994M | 224x224 | 1024 | 300 | 84.000 |
convformer-b36 | 22.673G | 95.216M | 224x224 | 1024 | 300 | 84.480 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
DarkNetTiny | 414.602M | 2.087M | 256x256 | 256 | 100 | 57.858 |
DarkNet19 | 3.669G | 20.842M | 256x256 | 256 | 100 | 74.364 |
DarkNet53 | 9.335G | 41.610M | 256x256 | 256 | 100 | 76.250 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
ResNet18 | 1.824G | 11.690M | 224x224 | 256 | 100 | 70.594 |
ResNet34 | 3.679G | 21.798M | 224x224 | 256 | 100 | 73.622 |
ResNet50 | 4.134G | 25.557M | 224x224 | 256 | 100 | 76.182 |
ResNet101 | 7.866G | 44.549M | 224x224 | 256 | 100 | 77.242 |
ResNet152 | 11.604G | 60.193M | 224x224 | 256 | 100 | 77.772 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
ResNet50 | 4.134G | 25.557M | 224x224 | 2048 | 300 | 80.258 |
ResNet101 | 7.866G | 44.549M | 224x224 | 1024 | 300 | 81.668 |
ResNet152 | 11.604G | 60.193M | 224x224 | 1024 | 300 | 81.934 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
van-b0 | 870.860M | 4.103M | 224x224 | 2048 | 300 | 75.424 |
van-b1 | 2.506G | 13.856M | 224x224 | 2048 | 300 | 80.740 |
van-b2 | 5.010G | 26.567M | 224x224 | 1024 | 300 | 82.592 |
van-b3 | 8.951G | 26.567M | 224x224 | 1024 | 300 | 83.202 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
ViT-Base-Patch16 | 16.880G | 86.416M | 224x224 | 256 | 100 | 82.676 |
ViT-Large-Patch16 | 59.731G | 304.124M | 224x224 | 128 | 50 | 84.978 |
ViT-Huge-Patch14 | 162.071G | 631.716M | 224x224 | 128 | 50 | 85.966 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Top-1 |
---|---|---|---|---|---|---|
ViT-Base-Patch16 | 16.880G | 86.416M | 224x224 | 256 | 100 | 83.404 |
ViT-Large-Patch16 | 59.731G | 304.124M | 224x224 | 128 | 50 | 85.672 |
ViT-Huge-Patch14 | 162.071G | 631.716M | 224x224 | 128 | 50 | 86.608 |
You can find more model training details in 0.classification_training/imagenet/.
Network | macs | params | input size | batch | epochs | Semantic Softmax Acc |
---|---|---|---|---|---|---|
ResNet50 | 4.134G | 25.557M | 224x224 | 2048 | 80 | 75.319 |
ResNet101 | 7.866G | 44.549M | 224x224 | 2048 | 80 | 76.795 |
ResNet152 | 11.604G | 60.193M | 224x224 | 1024 | 80 | 77.345 |
You can find more model training details in 0.classification_training/imagenet21k/.
DML loss
Paper:https://arxiv.org/abs/1706.00384
KD loss
Paper:https://arxiv.org/abs/1503.02531
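For reference, minimal sketches of the two distillation objectives following the cited papers; the temperature and loss weighting here are assumptions, see the training configs for the exact values used:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Hinton et al.: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def dml_loss(logits_a, logits_b):
    # Deep Mutual Learning: each network matches the other's posterior.
    # Both networks are trained jointly, so neither side is detached
    # (this corresponds to Freeze Teacher = False in the table below).
    kl_ab = F.kl_div(F.log_softmax(logits_a, dim=1),
                     F.softmax(logits_b, dim=1), reduction="batchmean")
    kl_ba = F.kl_div(F.log_softmax(logits_b, dim=1),
                     F.softmax(logits_a, dim=1), reduction="batchmean")
    return kl_ab + kl_ba
```

Both terms are added to the standard cross-entropy loss, matching the "CE+KD" and "CE+DML" entries in the method column.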
Teacher Network | Student Network | method | Freeze Teacher | input size | batch | epochs | Teacher Top-1 | Student Top-1 |
---|---|---|---|---|---|---|---|---|
ResNet152 | ResNet50 | CE+DML | False | 224x224 | 256 | 100 | 79.246 | 78.168 |
ResNet152 | ResNet50 | CE+DML+ViT Aug | False | 224x224 | 1024 | 300 | 82.760 | 80.798 |
ResNet152 | ResNet50 | CE+KD | True | 224x224 | 256 | 100 | 77.764 | 77.566 |
ResNet152 | ResNet50 | CE+KD+ViT Aug | True | 224x224 | 2048 | 300 | 81.936 | 80.806 |
You can find more model training details in 1.distillation_training/imagenet/.
MAE: Masked Autoencoders Are Scalable Vision Learners
Paper:https://arxiv.org/abs/2111.06377
Network | input size | batch | epochs | Loss |
---|---|---|---|---|
ViT-Base-Patch16 | 224x224 | 1024 | 400 | 0.388 |
ViT-Large-Patch16 | 224x224 | 1024 | 400 | 0.378 |
ViT-Huge-Patch14 | 224x224 | 1024 | 400 | 0.350 |
You can find more model training details in 2.masked_image_modeling_training/imagenet/.
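The Loss column is the final pretraining reconstruction loss. MAE's objective is mean squared error on the masked patches only, against per-patch normalized pixel targets; a minimal sketch following the paper (shapes and normalization are the paper's convention, not necessarily this repo's exact code):

```python
import torch

def mae_reconstruction_loss(pred, target_patches, mask):
    # pred, target_patches: (N, num_patches, patch_dim)
    # mask: (N, num_patches), 1 for masked (removed) patches, 0 for visible ones.
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + 1e-6).sqrt()  # per-patch normalization
    loss = ((pred - target) ** 2).mean(dim=-1)  # MSE per patch
    return (loss * mask).sum() / mask.sum()     # average over masked patches only
```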
DETR
Paper:https://arxiv.org/abs/2005.12872
DINO-DETR
Paper:https://arxiv.org/abs/2203.03605
RetinaNet
Paper:https://arxiv.org/abs/1708.02002
FCOS
Paper:https://arxiv.org/abs/1904.01355
Trained on the COCO2017 train dataset, tested on the COCO2017 val dataset.
mAP is COCOeval stats[0]: AP at IoU=0.50:0.95, area=all, maxDets=100.
Network | resize-style | input size | macs | params | batch | epochs | mAP |
---|---|---|---|---|---|---|---|
ResNet50-DETR | YoloStyle-1024 | 1024x1024 | 89.577G | 30.440M | 64 | 500 | 38.609 |
ResNet50-DINO-DETR | YoloStyle-1024 | 1024x1024 | 844.204G | 47.082M | 16 | 39 | 47.396 |
ResNet50-RetinaNet | RetinaStyle-800 | 800x1333 | 250.069G | 37.969M | 16 | 13 | 37.281 |
ResNet50-FCOS | RetinaStyle-800 | 800x1333 | 214.406G | 32.291M | 16 | 13 | 41.071 |
You can find more model training details in 3.detection_training/coco/.
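stats[0] is the first entry of pycocotools' summary vector. A minimal evaluation sketch (file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")  # per-image detections in COCO format

# iouType="bbox" for detection; instance segmentation uses "segm".
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
# stats[0] = AP @ IoU=0.50:0.95, area=all, maxDets=100 (the mAP reported above)
print(evaluator.stats[0])
```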
Trained on the COCO2017 train dataset, tested on the COCO2017 val dataset.
mAP is COCOeval stats[0]: AP at IoU=0.50:0.95, area=all, maxDets=100.
Network | resize-style | input size | macs | params | batch | epochs | mAP |
---|---|---|---|---|---|---|---|
ResNet50-RetinaNet | RetinaStyle-800 | 800x1333 | 250.069G | 37.969M | 16 | 13 | 40.947 |
ResNet50-FCOS | RetinaStyle-800 | 800x1333 | 214.406G | 32.291M | 16 | 13 | 46.511 |
You can find more model training details in 3.detection_training/coco/.
Trained on the Objects365 (v2, 2020) train dataset, tested on the Objects365 (v2, 2020) val dataset.
Network | resize-style | input size | batch | epochs | loss |
---|---|---|---|---|---|
ResNet50-RetinaNet | YoloStyle-1024 | 1024x1024 | 32 | 13 | 0.355 |
ResNet50-FCOS | YoloStyle-1024 | 1024x1024 | 64 | 13 | 0.968 |
Trained on the VOC2007 trainval + VOC2012 trainval datasets, tested on the VOC2007 test dataset.
mAP is AP at IoU=0.50, area=all, maxDets=100.
Network | resize-style | input size | macs | params | batch | epochs | mAP |
---|---|---|---|---|---|---|---|
ResNet50-RetinaNet | YoloStyle-640 | 640x640 | 84.947G | 36.724M | 32 | 13 | 83.765 |
ResNet50-FCOS | YoloStyle-640 | 640x640 | 80.764G | 32.153M | 32 | 13 | 83.250 |
You can find more model training details in 3.detection_training/voc/.
Trained on the VOC2007 trainval + VOC2012 trainval datasets, tested on the VOC2007 test dataset.
mAP is AP at IoU=0.50, area=all, maxDets=100.
Network | resize-style | input size | macs | params | batch | epochs | mAP |
---|---|---|---|---|---|---|---|
ResNet50-RetinaNet | YoloStyle-640 | 640x640 | 84.947G | 36.724M | 32 | 13 | 90.082 |
ResNet50-FCOS | YoloStyle-640 | 640x640 | 80.764G | 32.153M | 32 | 13 | 90.585 |
You can find more model training details in 3.detection_training/voc/.
DeepLabv3+
Paper:https://arxiv.org/abs/1802.02611
Network | input size | macs | params | batch | epochs | mean_iou |
---|---|---|---|---|---|---|
resnet50_deeplabv3plus | 512x512 | 43.500G | 30.254M | 32 | 100 | 40.462 |
convformerm36_deeplabv3plus | 512x512 | 83.898G | 56.760M | 32 | 100 | 47.826 |
You can find more model training details in 4.semantic_segmentation_training/ade20k/.
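mean_iou is the class-averaged intersection-over-union. A minimal sketch computing it from a confusion matrix:

```python
import numpy as np

def mean_iou(confusion):
    # confusion: (num_classes, num_classes), rows = ground truth, cols = prediction.
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)  # guard against empty classes
    return iou.mean()
```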
Network | input size | macs | params | batch | epochs | mean_iou |
---|---|---|---|---|---|---|
resnet50_deeplabv3plus | 512x512 | 43.500G | 30.254M | 64 | 100 | 68.975 |
convformerm36_deeplabv3plus | 512x512 | 83.898G | 56.760M | 64 | 100 | 74.214 |
You can find more model training details in 4.semantic_segmentation_training/coco/.
YOLACT
Paper:https://arxiv.org/abs/1904.02689
SOLOv2
Paper:https://arxiv.org/abs/2003.10152
Trained on the COCO2017 train dataset, tested on the COCO2017 val dataset.
mAP is COCOeval stats[0]: AP at IoU=0.50:0.95, area=all, maxDets=100.
Network | resize-style | input size | macs | params | batch | epochs | mAP |
---|---|---|---|---|---|---|---|
resnet50_yolact | YoloStyle-1024 | 1024x1024 | 202.012G | 31.165M | 64 | 39 | 26.342 |
convformerm36_yolact | YoloStyle-1024 | 1024x1024 | 382.336G | 60.452M | 64 | 39 | 34.047 |
resnet50_solov2 | YoloStyle-1024 | 1024x1024 | 248.965G | 46.582M | 32 | 39 | 37.807 |
convformerm36_solov2 | YoloStyle-1024 | 1024x1024 | 426.605G | 75.828M | 32 | 39 | 40.296 |
You can find more model training details in 5.instance_segmentation_training/coco/.
PFAN+Segmentation
Paper1:https://arxiv.org/abs/1903.00179
Paper2:https://arxiv.org/abs/2202.09741
Trained and tested on a combined dataset of DIS5K/HRS10K/HRSOD/UHRSD.
Network | macs | params | input size | batch | epochs | iou | precision | recall | f_squared_beta |
---|---|---|---|---|---|---|---|---|---|
resnet50_pfan_segmentation | 71.303G | 26.580M | 832x832 | 96 | 100 | 0.8461 | 0.8970 | 0.9346 | 0.9053 |
convformerm36_pfan_segmentation | 186.496G | 54.459M | 832x832 | 96 | 100 | 0.8865 | 0.9263 | 0.9517 | 0.9319 |
You can find more model training details in 6.salient_object_detection_training/.
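f_squared_beta is presumably the standard salient-object-detection F-measure with beta^2 = 0.3 (an assumption based on the name; check the evaluation code for the exact definition):

```python
def f_beta(precision, recall, beta_squared=0.3):
    # F-measure with beta^2 = 0.3, the usual choice in salient object detection,
    # which weights precision more heavily than recall.
    return ((1 + beta_squared) * precision * recall) / (beta_squared * precision + recall)

print(f_beta(0.8970, 0.9346))  # ~0.9054, consistent with the 0.9053 reported above
```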
PFAN+Matting
Paper1:https://arxiv.org/abs/1903.00179
Paper2:https://arxiv.org/abs/2104.14222
Paper3:https://arxiv.org/abs/2202.09741
Trained and tested on a combined dataset of Deep_Automatic_Portrait_Matting/RealWorldPortrait636/P3M10K.
Network | macs | params | input size | batch | epochs | iou | precision | recall | sad | mae | mse | grad | conn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
resnet50_pfan_matting | 86.093G | 29.654M | 832x832 | 96 | 100 | 0.9824 | 0.9884 | 0.9937 | 5.7071 | 0.0082 | 0.0047 | 6.7001 | 5.4373 |
convformerm36_pfan_matting | 195.854G | 55.503M | 832x832 | 96 | 100 | 0.9865 | 0.9912 | 0.9951 | 4.5806 | 0.0066 | 0.0033 | 5.0129 | 4.2882 |
You can find more model training details in 7.human_matting_training/.
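The sad, mae, mse, grad, and conn columns are the usual alpha-matting error metrics. A minimal sketch of the first three, assuming alpha maps in [0, 1] (scaling conventions such as SAD/1000 vary between codebases; grad and conn are omitted here since they require gradient-magnitude and connectivity computations):

```python
import numpy as np

def matting_errors(pred_alpha, gt_alpha):
    # pred_alpha, gt_alpha: float arrays in [0, 1] with identical shape.
    diff = pred_alpha - gt_alpha
    sad = np.abs(diff).sum() / 1000.0  # sum of absolute differences, often reported /1000
    mae = np.abs(diff).mean()
    mse = (diff ** 2).mean()
    return sad, mae, mse
```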
DBNet
Paper:https://arxiv.org/abs/1911.08947
Trained and tested on a combined dataset of ICDAR2017RCTW/ICDAR2019ART/ICDAR2019LSVT/ICDAR2019MLT.
Network | macs | params | input size | batch | epochs | precision | recall | f1 |
---|---|---|---|---|---|---|---|---|
resnet50_dbnet | 158.914G | 24.784M | 1024x1024 | 128 | 100 | 92.072 | 86.595 | 89.249 |
convformerm36_dbnet | 340.367G | 54.528M | 1024x1024 | 64 | 100 | 92.748 | 89.947 | 91.326 |
You can find more model training details in 8.ocr_text_detection_training/.
CRNN+LSTM+CTC
Paper:https://arxiv.org/abs/1507.05717
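CTC training is available directly in PyTorch; a minimal sketch of how per-timestep CRNN outputs feed nn.CTCLoss (shapes follow the usual convention, not necessarily this repo's exact code):

```python
import torch
import torch.nn as nn

T, N, C = 64, 8, 100  # time steps, batch size, classes (blank at index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)        # CRNN output sequence
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # label indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```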
Trained and tested on a combined dataset of aistudio_baidu_street/chinese_dataset/synthetic_chinese_string_dataset/meta_self_learning_dataset.
Network | macs | params | input size | batch | epochs | lcs_precision | lcs_recall |
---|---|---|---|---|---|---|---|
resnet50_ctc_model | 12.509G | 179.870M | 32x512 | 1024 | 50 | 99.498 | 99.212 |
convformerm36_ctc_model | 8.051G | 70.121M | 32x512 | 1024 | 50 | 99.452 | 99.201 |
You can find more model training details in 9.ocr_text_recognition_training/.
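lcs_precision and lcs_recall appear to be longest-common-subsequence character metrics (an assumption based on the names): the LCS length between prediction and ground truth divided by the prediction length and the ground-truth length, respectively. A minimal sketch:

```python
def lcs_length(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic program for longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_precision_recall(pred: str, gt: str):
    lcs = lcs_length(pred, gt)
    return lcs / len(pred), lcs / len(gt)  # (precision, recall)
```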
RetinaFace
Paper:https://arxiv.org/pdf/1905.00641
Trained on the WiderFace train and UFDD val sets, tested on the WiderFace val set.
Network | macs | params | input size | batch | epochs | Easy AP | Medium AP | Hard AP |
---|---|---|---|---|---|---|---|---|
resnet50_retinaface | 114.229G | 27.280M | 1024x1024 | 16 | 100 | 0.9369 | 0.9148 | 0.7801 |
You can find more model training details in 10.face_detection_training/.
PFAN face parsing
Paper1:https://arxiv.org/abs/1903.00179
Paper2:https://arxiv.org/abs/2202.09741
Sapiens
Paper:https://arxiv.org/pdf/2408.12569
Trained and tested separately on the FaceSynthetics and CelebAMask-HQ datasets (see the dataset column).
Network | dataset | macs | params | input size | batch | epochs | precision | recall | iou | dice |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_pfan_face_parsing | FaceSynthetics | 28.361G | 26.585M | 512x512 | 192 | 100 | 95.4084 | 95.0583 | 91.1481 | 95.2320 |
convformerm36_pfan_face_parsing | FaceSynthetics | 71.985G | 54.464M | 512x512 | 192 | 100 | 96.2895 | 96.2122 | 92.9436 | 96.2506 |
sapiens_0_3b_face_parsing | FaceSynthetics | 452.167G | 314.250M | 512x512 | 160 | 100 | 97.0999 | 96.9897 | 94.3823 | 97.0446 |
resnet50_pfan_face_parsing | CelebAMask-HQ | 28.361G | 26.585M | 512x512 | 192 | 100 | 82.0985 | 77.9908 | 69.3835 | 79.7142 |
convformerm36_pfan_face_parsing | CelebAMask-HQ | 71.985G | 54.464M | 512x512 | 192 | 100 | 83.4664 | 81.1791 | 72.6132 | 82.1953 |
sapiens_0_3b_face_parsing | CelebAMask-HQ | 452.167G | 314.250M | 512x512 | 160 | 100 | 86.0223 | 84.0680 | 76.2724 | 84.9471 |
You can find more model training details in 11.face_parsing_training/.
PFAN human parsing
Paper1:https://arxiv.org/abs/1903.00179
Paper2:https://arxiv.org/abs/2202.09741
Sapiens
Paper:https://arxiv.org/pdf/2408.12569
Trained and tested separately on the LIP and CIHP datasets (see the dataset column).
Network | dataset | macs | params | input size | batch | epochs | precision | recall | iou | dice |
---|---|---|---|---|---|---|---|---|---|---|
resnet50_pfan_human_parsing | LIP | 28.437G | 26.585M | 512x512 | 192 | 100 | 57.5257 | 50.6568 | 39.2989 | 53.2604 |
convformerm36_pfan_human_parsing | LIP | 72.060G | 54.464M | 512x512 | 192 | 100 | 60.6652 | 57.3280 | 44.3857 | 58.7892 |
sapiens_0_3b_human_parsing | LIP | 452.175G | 314.250M | 512x512 | 160 | 100 | 57.0063 | 51.9517 | 39.8993 | 54.0054 |
resnet50_pfan_human_parsing | CIHP | 28.437G | 26.585M | 512x512 | 192 | 100 | 61.9748 | 55.4004 | 44.7195 | 57.8736 |
convformerm36_pfan_human_parsing | CIHP | 72.060G | 54.464M | 512x512 | 192 | 100 | 67.4147 | 62.6415 | 51.0651 | 64.6072 |
sapiens_0_3b_human_parsing | CIHP | 452.175G | 314.250M | 512x512 | 160 | 100 | 65.0747 | 57.9976 | 47.1512 | 60.7108 |
You can find more model training details in 12.human_parsing_training/.
SAM
Paper:https://arxiv.org/pdf/2304.02643
SAM2
Paper:https://arxiv.org/pdf/2408.00714
Trained and tested on the sa_1b_11w dataset, the combined salient object detection dataset, and the combined human matting dataset.
You can find all Jupyter notebook examples in 13.interactive_segmentation_training/sam_predict_example/.
Network | dataset | input size | batch | epochs | loss |
---|---|---|---|---|---|
convformer_m36_sam_encoder | sa_1b_11w | 1024x1024 | 32 | 40 | 0.0030 |
convformer_m36_sam | sa_1b_11w | 1024x1024 | 32 | 5 | 0.1417 |
You can find more model training details in 13.interactive_segmentation_training/sa_1b/.
Network | dataset | input size | batch | epochs | loss | precision | recall | iou |
---|---|---|---|---|---|---|---|---|
convformer_m36_sam | combine dataset | 1024x1024 | 64 | 100 | 0.1012 | 0.9340 | 0.9554 | 0.8988 |
You can find more model training details in 13.interactive_segmentation_training/salient_object_detection_human_matting_pretrain/.
Network | dataset | input size | batch | epochs | iou | precision | recall | sad | mae | mse | grad | conn |
---|---|---|---|---|---|---|---|---|---|---|---|---|
convformer_m36_sam_matting1 | combine dataset | 1024x1024 | 48 | 200 | 0.9806 | 0.9874 | 0.9930 | 6.5461 | 0.0087 | 0.0051 | 6.9578 | 6.3325 |
convformer_m36_sam_matting2 | combine dataset | 1024x1024 | 32 | 200 | 0.9799 | 0.9877 | 0.9919 | 6.8052 | 0.0091 | 0.0055 | 6.9553 | 6.5909 |
You can find more model training details in 13.interactive_segmentation_training/human_matting/.
Network | dataset | input size | batch | epochs | iou | precision | recall | sad | mae | mse | grad | conn |
---|---|---|---|---|---|---|---|---|---|---|---|---|
convformer_m36_sam_matting1 | combine dataset | 1024x1024 | 48 | 200 | 0.8586 | 0.9151 | 0.9326 | 24.2321 | 0.0331 | 0.0320 | 40.8393 | 24.2515 |
convformer_m36_sam_matting2 | combine dataset | 1024x1024 | 32 | 200 | 0.8586 | 0.9193 | 0.9275 | 23.8464 | 0.0326 | 0.0315 | 40.2559 | 23.8651 |
You can find more model training details in 13.interactive_segmentation_training/salient_object_detection/.
Denoising Diffusion Probabilistic Models
Paper:https://arxiv.org/abs/2006.11239
Denoising Diffusion Implicit Models
Paper:https://arxiv.org/abs/2010.02502
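Both samplers share the same trained noise-prediction network; DDIM simply takes fewer, deterministic steps (50 here versus 1000 for ancestral DDPM sampling). A single deterministic DDIM update (eta = 0), following Eq. 12 of the DDIM paper:

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    # x_t: current noisy sample; eps_pred: the network's noise prediction at step t;
    # alpha_bar_*: cumulative noise-schedule products, as tensors.
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    # Deterministic (eta = 0) move to the previous timestep.
    return alpha_bar_prev.sqrt() * x0_pred + (1 - alpha_bar_prev).sqrt() * eps_pred
```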
Diffusion UNet trained on the CelebA-HQ dataset with the DDPM method. Number of test images: 28000.
sampling method | input size | steps | condition label (train/test) | FID | IS score (mean/std) |
---|---|---|---|---|---|
DDPM | 64x64 | 1000 | False/False | 6.409 | 2.486/0.082 |
DDIM | 64x64 | 50 | False/False | 14.623 | 2.622/0.073 |
You can find more model training details in 20.diffusion_model_training/celebahq/.
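FID compares Gaussian fits of Inception features between generated and real images; IS is the exponentiated average KL divergence between per-image and marginal Inception class predictions. Minimal formula sketches (Inception feature extraction omitted); the mean/std in the IS columns comes from evaluating over several splits of the generated set:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    # Frechet distance between two Gaussians fit to Inception pool features.
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean)

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) Inception softmax outputs for generated images.
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```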
Diffusion UNet trained on the CIFAR10 dataset with the DDPM method. Number of test images: 50000.
sampling method | input size | steps | condition label (train/test) | FID | IS score (mean/std) |
---|---|---|---|---|---|
DDPM | 32x32 | 1000 | False/False | 10.302 | 8.213/0.257 |
DDIM | 32x32 | 50 | False/False | 12.440 | 8.318/0.408 |
DDPM | 32x32 | 1000 | True/True | 5.049 | 8.654/0.112 |
You can find more model training details in 20.diffusion_model_training/cifar10/.
Diffusion UNet trained on the CIFAR100 dataset with the DDPM method. Number of test images: 50000.
sampling method | input size | steps | condition label (train/test) | FID | IS score (mean/std) |
---|---|---|---|---|---|
DDPM | 32x32 | 1000 | False/False | 16.298 | 8.398/0.281 |
DDIM | 32x32 | 50 | False/False | 21.402 | 8.344/0.192 |
DDPM | 32x32 | 1000 | True/True | 6.953 | 10.344/0.150 |
You can find more model training details in 20.diffusion_model_training/cifar100/.
Diffusion UNet trained on the FFHQ dataset with the DDPM method. Number of test images: 60000.
sampling method | input size | steps | condition label (train/test) | FID | IS score (mean/std) |
---|---|---|---|---|---|
DDPM | 64x64 | 1000 | False/False | 7.758 | 3.283/0.124 |
DDIM | 64x64 | 50 | False/False | 11.328 | 3.417/0.071 |
You can find more model training details in 20.diffusion_model_training/ffhq/.