
MMPreTrain Release v1.0.0: Backbones, Self-Supervised Learning and Multi-Modality



Support more multi-modal algorithms and datasets

We are excited to announce support for several advanced multi-modal methods! We integrated huggingface/transformers with vision backbones in MMPreTrain to run inference, and training support is in development. The supported methods and datasets are listed below (the two columns are independent lists), followed by a quick inference sketch.

| Methods | Datasets |
| :--- | :--- |
| BLIP (arXiv'2022) | COCO (caption, retrieval, vqa) |
| BLIP-2 (arXiv'2023) | Flickr30k (caption, retrieval) |
| OFA (CoRR'2022) | GQA |
| Flamingo (NeurIPS'2022) | NLVR2 |
| Chinese CLIP (arXiv'2022) | NoCaps |
| MiniGPT-4 (arXiv'2023) | OCR VQA |
| LLaVA (arXiv'2023) | Text VQA |
| Otter (arXiv'2023) | VG VQA |
| | VisualGenomeQA |
| | VizWiz |
| | VSR |
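
With the high-level Python API, the new multi-modal models can be tried in a few lines. A minimal captioning sketch; the model identifier and image path are placeholders that should be checked against `list_models`:

```python
from mmpretrain import inference_model, list_models

# Discover the BLIP captioning checkpoints registered in this release;
# the task/pattern filter values below are illustrative.
print(list_models(task='Image Caption', pattern='blip'))

# Run captioning on a local image. Model name and image path are
# placeholders; substitute a name printed by list_models above.
result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result['pred_caption'])
```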

Add iTPN and SparK self-supervised learning algorithms.

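The new self-supervised algorithms are reachable through the same model zoo interface. A minimal sketch; the checkpoint name below is an assumption, so query `list_models` for the identifiers actually registered:

```python
import torch
from mmpretrain import get_model, list_models

# Discover which SparK and iTPN checkpoints this release registers.
print(list_models(pattern='spark'))
print(list_models(pattern='itpn'))

# Build one and extract features from a dummy batch; the name below is
# illustrative, not a guaranteed identifier.
model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=True)
feats = model.extract_feat(torch.rand(1, 3, 224, 224))
```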

Provide examples of New Config and DeepSpeed/FSDP

We tested DeepSpeed and FSDP with MMEngine. The figure below reports memory usage and training time for ViT-large, ViT-huge, and an 8B-parameter multi-modal model: memory on the left, training time on the right.

Test environment: 8×A100 (80 GB), PyTorch 2.0.0
[Figure: memory usage (left) and training time (right) for ViT-large, ViT-huge and the 8B multi-modal model]
Remark: both FSDP and DeepSpeed were tested with default, untuned configurations; manually tuning the FSDP wrap policy can further reduce training time and memory usage.
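
For reference, a minimal sketch of what enabling DeepSpeed looks like in a config, assuming MMEngine's `FlexibleRunner` with a `DeepSpeedStrategy`; the field values are illustrative, and the example configs shipped with this release hold the settings actually tested:

```python
# Config sketch: route training through MMEngine's FlexibleRunner with
# a DeepSpeed strategy. Values are illustrative defaults, not the
# benchmark settings above.
runner_type = 'FlexibleRunner'
strategy = dict(
    type='DeepSpeedStrategy',
    zero_optimization=dict(
        stage=3,                  # ZeRO stage-3 parameter partitioning
        overlap_comm=True,
        contiguous_gradients=True,
    ),
)
# DeepSpeed owns the optimizer step, so wrap the optimizer accordingly.
optim_wrapper = dict(
    type='DeepSpeedOptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=0.05),
)
```

Swapping `strategy` for `dict(type='FSDPStrategy', ...)` selects FSDP instead; its wrap policy is the knob the remark above refers to.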

New Features

  • Transfer shape-bias tool from mmselfsup (#1658)
  • Download datasets with MIM & OpenDataLab (#1630)
  • Support New Configs (#1639, #1647, #1665); see the sketch after this list
  • Support Flickr30k Retrieval dataset (#1625)
  • Support SparK (#1531)
  • Support LLaVA (#1652)
  • Support Otter (#1651)
  • Support MiniGPT-4 (#1642)
  • Add support for VizWiz dataset (#1636)
  • Add support for the VSR dataset (#1634)
  • Add InternImage Classification project (#1569)
  • Support OCR-VQA dataset (#1621)
  • Support OK-VQA dataset (#1615)
  • Support TextVQA dataset (#1569)
  • Support iTPN and HiViT (#1584)
  • Add retrieval mAP metric (#1552)
  • Support NoCaps dataset based on BLIP (#1582)
  • Add GQA dataset (#1585)
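
On the New Config items above: the new style replaces string-based `_base_` inheritance with Python imports inside `read_base()`. A minimal sketch; the base module paths are assumptions meant to mirror the repo's config layout:

```python
# New-style config: base configs are imported as Python modules inside
# read_base() rather than listed in a _base_ string field. The module
# paths below are illustrative.
from mmengine.config import read_base

with read_base():
    from .._base_.models.resnet18 import *
    from .._base_.datasets.imagenet_bs32 import *
    from .._base_.schedules.imagenet_bs256 import *
    from .._base_.default_runtime import *

# Overrides are plain Python after the imports, e.g. swap the optimizer
# (this replaces the base value wholesale).
optim_wrapper.update(optimizer=dict(type='SGD', lr=0.01, momentum=0.9))
```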

Improvements

  • Update FSDP ViT-huge and ViT-large configs (#1675)
  • Support DeepSpeed with FlexibleRunner (#1673)
  • Update Otter and LLaVA docs and configs (#1653)
  • Add image_only parameter to ScienceQA (#1613)
  • Support using "split" to specify the training/validation set (#1535); see the sketch after this list
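
On the last item: a dataset config can now select its subset through a `split` argument instead of hand-maintained annotation paths. A minimal sketch with `ImageNet`; `data_root` is a placeholder:

```python
# Dataset config sketch: 'split' picks the subset directly.
train_dataloader = dict(
    batch_size=32,
    num_workers=4,
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',  # placeholder path
        split='train',
    ),
)
val_dataloader = dict(
    batch_size=32,
    num_workers=4,
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',  # placeholder path
        split='val',
    ),
)
```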

Bug Fixes

  • Refactor _prepare_pos_embed in ViT (#1656, #1679)
  • Freeze pre norm in vision transformer (#1672)
  • Fix bug loading IN1k dataset (#1641)
  • Fix SAM bug (#1633)
  • Fixed circular import error for new transform (#1609)
  • Update torchvision transform wrapper (#1595)
  • Set default out_type in CAM visualization (#1586)

Docs Update

New Contributors