Image classification task results

Convformer

Paper:https://arxiv.org/pdf/2210.13452

DarkNet

Paper:https://arxiv.org/abs/1804.02767

ResNet

Paper:https://arxiv.org/abs/1512.03385

VAN

Paper:https://arxiv.org/abs/2202.09741

ViT

Paper:https://arxiv.org/abs/2010.11929

ResNetCifar training from scratch on CIFAR100

ResNetCifar is different from ResNet in the first few layers.

Network macs params input size batch epochs Top-1
ResNet18Cifar 557.935M 11.220M 32x32 128 200 76.890
ResNet34Cifar 1.164G 21.328M 32x32 128 200 78.010
ResNet50Cifar 1.312G 23.705M 32x32 128 200 75.360
ResNet101Cifar 2.531G 42.697M 32x32 128 200 77.180
ResNet152Cifar 3.751G 58.341M 32x32 128 200 77.340

You can find more model training details in 0.classification_training/cifar100/.
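The Top-1 column in the classification tables is the percentage of test images whose highest-scoring class matches the label. A minimal sketch of that metric follows; the function name `top1_accuracy` and the toy scores are illustrative assumptions, not code from this repository.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example (hypothetical scores): 3 of 4 predictions are correct.
logits = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
labels = [1, 0, 1, 1]
print(top1_accuracy(logits, labels))  # 75.0
```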

Convformer finetune from official pretrain weight on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
convformer-s18 3.953G 24.184M 224x224 2048 300 82.018
convformer-s36 7.663G 37.424M 224x224 1024 300 83.290
convformer-m36 12.876G 53.994M 224x224 1024 300 84.000
convformer-b36 22.673G 95.216M 224x224 1024 300 84.480

You can find more model training details in 0.classification_training/imagenet/.

DarkNet training from scratch on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
DarkNetTiny 414.602M 2.087M 256x256 256 100 57.858
DarkNet19 3.669G 20.842M 256x256 256 100 74.364
DarkNet53 9.335G 41.610M 256x256 256 100 76.250

You can find more model training details in 0.classification_training/imagenet/.

ResNet training from scratch on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
ResNet18 1.824G 11.690M 224x224 256 100 70.594
ResNet34 3.679G 21.798M 224x224 256 100 73.622
ResNet50 4.134G 25.557M 224x224 256 100 76.182
ResNet101 7.866G 44.549M 224x224 256 100 77.242
ResNet152 11.604G 60.193M 224x224 256 100 77.772

You can find more model training details in 0.classification_training/imagenet/.

ResNet finetune from ImageNet21k pretrain weight on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
ResNet50 4.134G 25.557M 224x224 2048 300 80.258
ResNet101 7.866G 44.549M 224x224 1024 300 81.668
ResNet152 11.604G 60.193M 224x224 1024 300 81.934

You can find more model training details in 0.classification_training/imagenet/.

VAN finetune from official pretrain weight on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
van-b0 870.860M 4.103M 224x224 2048 300 75.424
van-b1 2.506G 13.856M 224x224 2048 300 80.740
van-b2 5.010G 26.567M 224x224 1024 300 82.592
van-b3 8.951G 26.567M 224x224 1024 300 83.202

You can find more model training details in 0.classification_training/imagenet/.

ViT finetune from self-trained MAE pretrain weight(400 epoch) on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
ViT-Base-Patch16 16.880G 86.416M 224x224 256 100 82.676
ViT-Large-Patch16 59.731G 304.124M 224x224 128 50 84.978
ViT-Huge-Patch14 162.071G 631.716M 224x224 128 50 85.966

You can find more model training details in 0.classification_training/imagenet/.

ViT finetune from official MAE pretrain weight(800 epoch) on ImageNet1K(ILSVRC2012)

Network macs params input size batch epochs Top-1
ViT-Base-Patch16 16.880G 86.416M 224x224 256 100 83.404
ViT-Large-Patch16 59.731G 304.124M 224x224 128 50 85.672
ViT-Huge-Patch14 162.071G 631.716M 224x224 128 50 86.608

You can find more model training details in 0.classification_training/imagenet/.

ResNet train from ImageNet1K pretrain weight on ImageNet21K(Winter 2021 release)

Network macs params input size batch epochs Semantic Softmax Acc
ResNet50 4.134G 25.557M 224x224 2048 80 75.319
ResNet101 7.866G 44.549M 224x224 2048 80 76.795
ResNet152 11.604G 60.193M 224x224 1024 80 77.345

You can find more model training details in 0.classification_training/imagenet21k/.

Knowledge distillation task results

DML loss

Paper:https://arxiv.org/abs/1706.00384

KD loss

Paper:https://arxiv.org/abs/1503.02531

ResNet distill from pretrain weight on ImageNet1K(ILSVRC2012)

| Teacher Network | Student Network | method | Freeze Teacher | input size | batch | epochs | Teacher Top-1 | Student Top-1 |
|---|---|---|---|---|---|---|---|---|
| ResNet152 | ResNet50 | CE+DML | False | 224x224 | 256 | 100 | 79.246 | 78.168 |
| ResNet152 | ResNet50 | CE+DML+Vit Aug | False | 224x224 | 1024 | 300 | 82.760 | 80.798 |
| ResNet152 | ResNet50 | CE+KD | True | 224x224 | 256 | 100 | 77.764 | 77.566 |
| ResNet152 | ResNet50 | CE+KD+Vit Aug | True | 224x224 | 2048 | 300 | 81.936 | 80.806 |

You can find more model training details in 1.distillation_training/imagenet/.
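The two distillation terms listed above can be sketched in plain Python. This follows the formulations in the cited KD and DML papers, not necessarily this repository's implementation; the temperature of 4.0 and all function names are assumptions made for illustration. In the KD setting the teacher only supplies targets, while in DML both networks receive a loss from each other, which is consistent with the Freeze Teacher column being True for KD and False for DML.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (q assumed strictly positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KD term: KL(teacher || student) on softened outputs, scaled by T^2.
    The temperature value is a hypothetical choice."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def dml_losses(logits_a, logits_b):
    """DML terms: each network is pulled toward the other's prediction.
    Returns (loss for network a, loss for network b)."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    return kl_divergence(p_b, p_a), kl_divergence(p_a, p_b)


teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(kd_loss(student, teacher))
print(dml_losses(student, teacher))
```

In both methods the CE part of "CE+KD" and "CE+DML" would be the usual cross-entropy against the ground-truth label, added to the term shown here.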

Masked image modeling task results

MAE:Masked Autoencoders Are Scalable Vision Learners

Paper:https://arxiv.org/abs/2111.06377

ViT MAE pretrain on ImageNet1K(ILSVRC2012)

Network input size batch epochs Loss
ViT-Base-Patch16 224x224 1024 400 0.388
ViT-Large-Patch16 224x224 1024 400 0.378
ViT-Huge-Patch14 224x224 1024 400 0.350

You can find more model training details in 2.masked_image_modeling_training/imagenet/.
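As a rough illustration of what the Loss column measures, MAE hides a random subset of patches and scores the reconstruction only on the hidden ones. The sketch below assumes the 75% mask ratio from the MAE paper and a mean squared error over masked patches; the helper names and the seed are hypothetical, and the repository's own settings may differ.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets.
    mask_ratio=0.75 follows the MAE paper; seed is only for reproducibility."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_keep]), sorted(order[num_keep:])


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked patches only.
    pred and target are lists of patches, each a list of pixel values."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# A 224x224 image with 16x16 patches has 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
keep, masked = random_masking(num_patches)
print(len(keep), len(masked))  # 49 147
```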

Object detection task results

DETR

Paper:https://arxiv.org/abs/2005.12872

DINO-DETR

Paper:https://arxiv.org/abs/2203.03605

RetinaNet

Paper:https://arxiv.org/abs/1708.02002

FCOS

Paper:https://arxiv.org/abs/1904.01355

All detection models training from scratch on COCO2017

Trained on COCO2017 train dataset, tested on COCO2017 val dataset.

mAP is the COCO mAP at IoU=0.5:0.95, area=all, maxDets=100 (COCOeval, stats[0]).

Network resize-style input size macs params batch epochs mAP
ResNet50-DETR YoloStyle-1024 1024x1024 89.577G 30.440M 64 500 38.609
ResNet50-DINO-DETR YoloStyle-1024 1024x1024 844.204G 47.082M 16 39 47.396
ResNet50-RetinaNet RetinaStyle-800 800x1333 250.069G 37.969M 16 13 37.281
ResNet50-FCOS RetinaStyle-800 800x1333 214.406G 32.291M 16 13 41.071

You can find more model training details in 3.detection_training/coco/.
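The "IoU=0.5:0.95" in the mAP definition means that average precision is computed at ten IoU thresholds (0.50 to 0.95 in steps of 0.05) and then averaged. The sketch below shows box IoU and the threshold list only; it is not the COCOeval implementation, and the `(x1, y1, x2, y2)` box format with continuous coordinates is an assumption.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which this prediction would count as a match."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in iou_thresholds if iou >= t]


pred_box = (0.0, 0.0, 10.0, 10.0)
gt_box = (1.0, 1.0, 11.0, 11.0)
print(box_iou(pred_box, gt_box))             # 81 / 119, about 0.68
print(matched_thresholds(pred_box, gt_box))  # [0.5, 0.55, 0.6, 0.65]
```

A prediction like this one counts as correct at the looser thresholds and as a miss at the stricter ones, which is why mAP at 0.5:0.95 is lower than mAP at 0.5 for the same model.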

All detection models finetune from objects365 pretrain weight on COCO2017

Trained on COCO2017 train dataset, tested on COCO2017 val dataset.

mAP is the COCO mAP at IoU=0.5:0.95, area=all, maxDets=100 (COCOeval, stats[0]).

Network resize-style input size macs params batch epochs mAP
ResNet50-RetinaNet RetinaStyle-800 800x1333 250.069G 37.969M 16 13 40.947
ResNet50-FCOS RetinaStyle-800 800x1333 214.406G 32.291M 16 13 46.511

You can find more model training details in 3.detection_training/coco/.

All detection models train on Objects365(v2,2020) from COCO2017 pretrain weight

Trained on objects365(v2,2020) train dataset, tested on objects365(v2,2020) val dataset.

Network resize-style input size batch epochs loss
ResNet50-RetinaNet YoloStyle-1024 1024x1024 32 13 0.355
ResNet50-FCOS YoloStyle-1024 1024x1024 64 13 0.968

All detection models training from scratch on VOC2007 and VOC2012

Trained on VOC2007 trainval dataset + VOC2012 trainval dataset, tested on VOC2007 test dataset.

mAP is measured at IoU=0.50, area=all, maxDets=100.

Network resize-style input size macs params batch epochs mAP
ResNet50-RetinaNet YoloStyle-640 640x640 84.947G 36.724M 32 13 83.765
ResNet50-FCOS YoloStyle-640 640x640 80.764G 32.153M 32 13 83.250

You can find more model training details in 3.detection_training/voc/.
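For a single IoU threshold such as 0.50, per-class average precision is the area under the precision-recall curve, and mAP is the mean over classes. The following is a minimal sketch using all-point interpolation; whether this repository uses all-point or 11-point interpolation is not stated here, so treat that choice, the function name, and the toy detections as assumptions.

```python
def average_precision(scores, is_true_positive, num_gt):
    """All-point interpolated AP for one class.

    scores: confidence of each detection.
    is_true_positive: flags assumed to come from IoU >= 0.5 matching.
    num_gt: number of ground-truth boxes for this class.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision with the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example: four detections, two ground-truth boxes.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
print(round(ap, 4))  # 0.8333
```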

All detection models finetune from objects365 pretrain weight on VOC2007 and VOC2012

Trained on VOC2007 trainval dataset + VOC2012 trainval dataset, tested on VOC2007 test dataset.

mAP is measured at IoU=0.50, area=all, maxDets=100.

Network resize-style input size macs params batch epochs mAP
ResNet50-RetinaNet YoloStyle-640 640x640 84.947G 36.724M 32 13 90.082
ResNet50-FCOS YoloStyle-640 640x640 80.764G 32.153M 32 13 90.585

You can find more model training details in 3.detection_training/voc/.

Semantic Segmentation task results

DeepLabv3+

Paper:https://arxiv.org/abs/1802.02611

All semantic segmentation models training from scratch on ADE20K

Network input size macs params batch epochs mean_iou
resnet50_deeplabv3plus 512x512 43.500G 30.254M 32 100 40.462
convformerm36_deeplabv3plus 512x512 83.898G 56.760M 32 100 47.826

You can find more model training details in 4.semantic_segmentation_training/ade20k/.

All semantic segmentation models training from scratch on COCO2017

Network input size macs params batch epochs mean_iou
resnet50_deeplabv3plus 512x512 43.500G 30.254M 64 100 68.975
convformerm36_deeplabv3plus 512x512 83.898G 56.760M 64 100 74.214

You can find more model training details in 4.semantic_segmentation_training/coco/.
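The mean_iou column in the semantic segmentation tables is conventionally the per-class intersection over union averaged across classes. A minimal sketch on flattened label maps follows; skipping classes that appear in neither prediction nor target is an assumption, and the repository may handle ignore labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU in percent over classes present in pred or target.
    pred and target are flat lists of integer class ids."""
    inter = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(inter[c] / union)
    return 100.0 * sum(ious) / len(ious)


# Toy example: per-class IoU is 1/3, 2/3 and 1/2, so the mean is 50%.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```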

Instance Segmentation task results

YOLACT

Paper:https://arxiv.org/abs/1904.02689

SOLOv2

Paper:https://arxiv.org/abs/2003.10152

All instance segmentation models training from scratch on COCO2017

Trained on COCO2017 train dataset, tested on COCO2017 val dataset.

mAP is the COCO mAP at IoU=0.5:0.95, area=all, maxDets=100 (COCOeval, stats[0]).

Network resize-style input size macs params batch epochs mAP
resnet50_yolact YoloStyle-1024 1024x1024 202.012G 31.165M 64 39 26.342
convformerm36_yolact YoloStyle-1024 1024x1024 382.336G 60.452M 64 39 34.047
resnet50_solov2 YoloStyle-1024 1024x1024 248.965G 46.582M 32 39 37.807
convformerm36_solov2 YoloStyle-1024 1024x1024 426.605G 75.828M 32 39 40.296

You can find more model training details in 5.instance_segmentation_training/coco/.

Salient object detection task results

PFAN+Segmentation

Paper1:https://arxiv.org/abs/1903.00179

Paper2:https://arxiv.org/abs/2202.09741

Trained and tested on the combined DIS5K/HRS10K/HRSOD/UHRSD dataset.

Network macs params input size batch epochs iou precision recall f_squared_beta
resnet50_pfan_segmentation 71.303G 26.580M 832x832 96 100 0.8461 0.8970 0.9346 0.9053
convformerm36_pfan_segmentation 186.496G 54.459M 832x832 96 100 0.8865 0.9263 0.9517 0.9319

You can find more model training details in 6.salient_object_detection_training/.
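The f_squared_beta column appears to be the F-measure with a weighted precision term. The sketch below assumes beta squared = 0.3, a common choice in salient object detection; that value is not stated in this document, although plugging the tabulated precision and recall into the formula reproduces the tabulated scores to about three decimals.

```python
def f_beta(precision, recall, beta_squared=0.3):
    """Weighted F-measure: (1 + b^2) * P * R / (b^2 * P + R).
    beta_squared=0.3 is an assumed value, not confirmed by the source."""
    denom = beta_squared * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta_squared) * precision * recall / denom


# Precision and recall taken from the two table rows above.
print(round(f_beta(0.8970, 0.9346), 4))  # table reports 0.9053
print(round(f_beta(0.9263, 0.9517), 4))  # table reports 0.9319
```

The small gap in the last digit is what one would expect if the reported score is averaged per image rather than computed from the averaged precision and recall.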

Human matting task results

PFAN+Matting

Paper1:https://arxiv.org/abs/1903.00179

Paper2:https://arxiv.org/abs/2104.14222

Paper3:https://arxiv.org/abs/2202.09741

Trained and tested on the combined Deep_Automatic_Portrait_Matting/RealWorldPortrait636/P3M10K dataset.

Network macs params input size batch epochs iou precision recall sad mae mse grad conn
resnet50_pfan_matting 86.093G 29.654M 832x832 96 100 0.9824 0.9884 0.9937 5.7071 0.0082 0.0047 6.7001 5.4373
convformerm36_pfan_matting 195.854G 55.503M 832x832 96 100 0.9865 0.9912 0.9951 4.5806 0.0066 0.0033 5.0129 4.2882

You can find more model training details in 7.human_matting_training/.
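Three of the matting columns (sad, mae, mse) are simple pixel-wise errors between the predicted and ground-truth alpha mattes. A minimal sketch follows; dividing SAD by 1000 is a common reporting convention in matting work and is assumed here, and the grad and conn metrics are not covered.

```python
def matting_errors(pred_alpha, true_alpha):
    """Return (sad, mae, mse) for flattened alpha mattes with values in [0, 1].
    The /1000 scaling of SAD is an assumed convention."""
    diffs = [abs(p - t) for p, t in zip(pred_alpha, true_alpha)]
    n = len(diffs)
    sad = sum(diffs) / 1000.0
    mae = sum(diffs) / n
    mse = sum(d * d for d in diffs) / n
    return sad, mae, mse


# Toy 4-pixel example (hypothetical values).
sad, mae, mse = matting_errors([0.0, 0.5, 1.0, 1.0], [0.0, 1.0, 1.0, 0.5])
print(sad, mae, mse)  # 0.001 0.25 0.125
```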

OCR text detection task results

DBNet

Paper:https://arxiv.org/abs/1911.08947

Trained and tested on a combined dataset of ICDAR2017RCTW/ICDAR2019ART/ICDAR2019LSVT/ICDAR2019MLT.

Network macs params input size batch epochs precision recall f1
resnet50_dbnet 158.914G 24.784M 1024x1024 128 100 92.072 86.595 89.249
convformerm36_dbnet 340.367G 54.528M 1024x1024 64 100 92.748 89.947 91.326

You can find more model training details in 8.ocr_text_detection_training/.
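The f1 column is consistent with the harmonic mean of the precision and recall columns. As a quick arithmetic check (the function name is illustrative only):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the two table rows above; results agree with the reported
# f1 of 89.249 and 91.326 to within rounding.
print(f1_score(92.072, 86.595))
print(f1_score(92.748, 89.947))
```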

OCR text recognition task results

CRNN+LSTM+CTC

Paper:https://arxiv.org/abs/1507.05717

Trained and tested on a combined dataset of aistudio_baidu_street/chinese_dataset/synthetic_chinese_string_dataset/meta_self_learning_dataset.

Network macs params input size batch epochs lcs_precision lcs_recall
resnet50_ctc_model 12.509G 179.870M 32x512 1024 50 99.498 99.212
convformerm36_ctc_model 8.051G 70.121M 32x512 1024 50 99.452 99.201

You can find more model training details in 9.ocr_text_recognition_training/.
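The lcs_precision and lcs_recall columns are not defined in this document. One plausible reading, sketched below as an assumption, is that they use the longest common subsequence (LCS) between the predicted and ground-truth strings: LCS length divided by prediction length for precision, and by ground-truth length for recall.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_precision_recall(pred, target):
    """Assumed definition: LCS / len(pred) and LCS / len(target), in percent."""
    lcs = lcs_length(pred, target)
    precision = 100.0 * lcs / len(pred) if pred else 0.0
    recall = 100.0 * lcs / len(target) if target else 0.0
    return precision, recall


# Toy example: 9 of the 10 predicted characters appear in order in the target.
precision, recall = lcs_precision_recall("hallo wrld", "hello world")
print(precision, round(recall, 3))  # 90.0 81.818
```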

Face detection task results

RetinaFace

Paper:https://arxiv.org/pdf/1905.00641

Trained on the WiderFace train and UFDD val datasets, tested on the WiderFace val dataset.

Network macs params input size batch epochs Easy AP Medium AP Hard AP
resnet50_retinaface 114.229G 27.280M 1024x1024 16 100 0.9369 0.9148 0.7801

You can find more model training details in 10.face_detection_training/.

Face parsing task results

PFAN face parsing

Paper1:https://arxiv.org/abs/1903.00179

Paper2:https://arxiv.org/abs/2202.09741

Sapiens

Paper:https://arxiv.org/pdf/2408.12569

Trained and tested on the FaceSynthetics and CelebAMask-HQ datasets.

Network dataset macs params input size batch epochs precision recall iou dice
resnet50_pfan_face_parsing FaceSynthetics 28.361G 26.585M 512x512 192 100 95.4084 95.0583 91.1481 95.2320
convformerm36_pfan_face_parsing FaceSynthetics 71.985G 54.464M 512x512 192 100 96.2895 96.2122 92.9436 96.2506
sapiens_0_3b_face_parsing FaceSynthetics 452.167G 314.250M 512x512 160 100 97.0999 96.9897 94.3823 97.0446
resnet50_pfan_face_parsing CelebAMask-HQ 28.361G 26.585M 512x512 192 100 82.0985 77.9908 69.3835 79.7142
convformerm36_pfan_face_parsing CelebAMask-HQ 71.985G 54.464M 512x512 192 100 83.4664 81.1791 72.6132 82.1953
sapiens_0_3b_face_parsing CelebAMask-HQ 452.167G 314.250M 512x512 160 100 86.0223 84.0680 76.2724 84.9471

You can find more model training details in 11.face_parsing_training/.

Human parsing task results

PFAN human parsing

Paper1:https://arxiv.org/abs/1903.00179

Paper2:https://arxiv.org/abs/2202.09741

Sapiens

Paper:https://arxiv.org/pdf/2408.12569

Trained and tested on the LIP and CIHP datasets.

Network dataset macs params input size batch epochs precision recall iou dice
resnet50_pfan_human_parsing LIP 28.437G 26.585M 512x512 192 100 57.5257 50.6568 39.2989 53.2604
convformerm36_pfan_human_parsing LIP 72.060G 54.464M 512x512 192 100 60.6652 57.3280 44.3857 58.7892
sapiens_0_3b_human_parsing LIP 452.175G 314.250M 512x512 160 100 57.0063 51.9517 39.8993 54.0054
resnet50_pfan_human_parsing CIHP 28.437G 26.585M 512x512 192 100 61.9748 55.4004 44.7195 57.8736
convformerm36_pfan_human_parsing CIHP 72.060G 54.464M 512x512 192 100 67.4147 62.6415 51.0651 64.6072
sapiens_0_3b_human_parsing CIHP 452.175G 314.250M 512x512 160 100 65.0747 57.9976 47.1512 60.7108

You can find more model training details in 12.human_parsing_training/.

Interactive segmentation task results

SAM

Paper:https://arxiv.org/pdf/2304.02643

SAM2

Paper:https://arxiv.org/pdf/2408.00714

Trained and tested on the sa_1b_11w dataset, the combined salient object detection dataset, and the combined human matting dataset.

You can find all Jupyter notebook examples in 13.interactive_segmentation_training/sam_predict_example/.

light sam distill from pretrain weight on sa_1b_11w

Network dataset input size batch epochs loss
convformer_m36_sam_encoder sa_1b_11w 1024x1024 32 40 0.0030
convformer_m36_sam sa_1b_11w 1024x1024 32 5 0.1417

You can find more model training details in 13.interactive_segmentation_training/sa_1b/.

light sam train on combined salient object detection and human matting dataset

| Network | dataset | input size | batch | epochs | loss | precision | recall | iou |
|---|---|---|---|---|---|---|---|---|
| convformer_m36_sam | combine dataset | 1024x1024 | 64 | 100 | 0.1012 | 0.9340 | 0.9554 | 0.8988 |

You can find more model training details in 13.interactive_segmentation_training/salient_object_detection_human_matting_pretrain/.

light sam matting train on combined human matting dataset

| Network | dataset | input size | batch | epochs | iou | precision | recall | sad | mae | mse | grad | conn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| convformer_m36_sam_matting1 | combine dataset | 1024x1024 | 48 | 200 | 0.9806 | 0.9874 | 0.9930 | 6.5461 | 0.0087 | 0.0051 | 6.9578 | 6.3325 |
| convformer_m36_sam_matting2 | combine dataset | 1024x1024 | 32 | 200 | 0.9799 | 0.9877 | 0.9919 | 6.8052 | 0.0091 | 0.0055 | 6.9553 | 6.5909 |

You can find more model training details in 13.interactive_segmentation_training/human_matting/.

light sam matting train on combined salient object detection dataset

| Network | dataset | input size | batch | epochs | iou | precision | recall | sad | mae | mse | grad | conn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| convformer_m36_sam_matting1 | combine dataset | 1024x1024 | 48 | 200 | 0.8586 | 0.9151 | 0.9326 | 24.2321 | 0.0331 | 0.0320 | 40.8393 | 24.2515 |
| convformer_m36_sam_matting2 | combine dataset | 1024x1024 | 32 | 200 | 0.8586 | 0.9193 | 0.9275 | 23.8464 | 0.0326 | 0.0315 | 40.2559 | 23.8651 |

You can find more model training details in 13.interactive_segmentation_training/salient_object_detection/.

Diffusion model task results

Denoising Diffusion Probabilistic Models

Paper:https://arxiv.org/abs/2006.11239

Denoising Diffusion Implicit Models

Paper:https://arxiv.org/abs/2010.02502

All diffusion models with different sampling methods on CelebA-HQ

Trained diffusion UNet on CelebA-HQ dataset (DDPM method). Test image num=28000.

| sampling method | input size | steps | condition label(train/test) | FID | IS score(mean/std) |
|---|---|---|---|---|---|
| DDPM | 64x64 | 1000 | False/False | 6.409 | 2.486/0.082 |
| DDIM | 64x64 | 50 | False/False | 14.623 | 2.622/0.073 |

You can find more model training details in 20.diffusion_model_training/celebahq/.
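The steps column contrasts DDPM sampling over all 1000 training timesteps with DDIM sampling over a 50-step subsequence of them. The sketch below illustrates the underlying schedule under stated assumptions: a linear beta schedule from 1e-4 to 0.02 (the values used in the DDPM paper) and a uniform stride for the DDIM subsequence. The repository's actual schedule and timestep spacing may differ, and all function names are illustrative.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (values assumed from the DDPM paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, noise):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]


def ddim_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Uniformly strided subsequence of timesteps, visited from high to low."""
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
steps = ddim_timesteps()
print(len(steps), steps[0], steps[-1])  # 50 980 0
```

Using 20 times fewer network evaluations is what makes DDIM sampling faster, at the cost of the higher FID reported in the tables.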

All diffusion models with different sampling methods on CIFAR10

Trained diffusion UNet on CIFAR10 dataset (DDPM method). Test image num=50000.

| sampling method | input size | steps | condition label(train/test) | FID | IS score(mean/std) |
|---|---|---|---|---|---|
| DDPM | 32x32 | 1000 | False/False | 10.302 | 8.213/0.257 |
| DDIM | 32x32 | 50 | False/False | 12.440 | 8.318/0.408 |
| DDPM | 32x32 | 1000 | True/True | 5.049 | 8.654/0.112 |

You can find more model training details in 20.diffusion_model_training/cifar10/.

All diffusion models with different sampling methods on CIFAR100

Trained diffusion UNet on CIFAR100 dataset (DDPM method). Test image num=50000.

| sampling method | input size | steps | condition label(train/test) | FID | IS score(mean/std) |
|---|---|---|---|---|---|
| DDPM | 32x32 | 1000 | False/False | 16.298 | 8.398/0.281 |
| DDIM | 32x32 | 50 | False/False | 21.402 | 8.344/0.192 |
| DDPM | 32x32 | 1000 | True/True | 6.953 | 10.344/0.150 |

You can find more model training details in 20.diffusion_model_training/cifar100/.

All diffusion models with different sampling methods on FFHQ

Trained diffusion UNet on FFHQ dataset (DDPM method). Test image num=60000.

| sampling method | input size | steps | condition label(train/test) | FID | IS score(mean/std) |
|---|---|---|---|---|---|
| DDPM | 64x64 | 1000 | False/False | 7.758 | 3.283/0.124 |
| DDIM | 64x64 | 50 | False/False | 11.328 | 3.417/0.071 |

You can find more model training details in 20.diffusion_model_training/ffhq/.