MMPreTrain Release v1.0.0: Backbones, Self-Supervised Learning and Multi-Modality
Support more multi-modal algorithms and datasets
We are excited to announce support for several advanced multi-modal methods! We integrated huggingface/transformers with the vision backbones in MMPreTrain to run inference, and training support is under development. The supported methods and datasets are listed below, followed by a minimal inference example after the table.
| Methods | Datasets |
| --- | --- |
| BLIP (arxiv'2022) | COCO (caption, retrieval, vqa) |
| BLIP-2 (arxiv'2023) | Flickr30k (caption, retrieval) |
| OFA (CoRR'2022) | GQA |
| Flamingo (NeurIPS'2022) | NLVR2 |
| Chinese CLIP (arxiv'2022) | NoCaps |
| MiniGPT-4 (arxiv'2023) | OCR VQA |
| LLaVA (arxiv'2023) | Text VQA |
| Otter (arxiv'2023) | VG VQA |
| | VisualGenomeQA |
| | VizWiz |
| | VSR |
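As a quick taste, here is a minimal inference sketch using mmpretrain's high-level `inference_model` API. The model name below is an assumption based on the model-zoo naming scheme; check the output of `list_models` for the exact identifiers available in your install.

```python
from mmpretrain import list_models, inference_model

# Find the registered multi-modal checkpoints (pattern matching on names)
print(list_models(pattern='blip'))

# Caption an image with a BLIP model; the model name below is an assumed
# model-zoo identifier -- replace it with one printed by list_models().
result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result['pred_caption'])
```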
Add the iTPN and SparK self-supervised learning algorithms.
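For a quick start, the sketch below shows how the new algorithms can be discovered and instantiated through mmpretrain's model registry. The config name passed to `get_model` is a hypothetical placeholder; use a real identifier printed by `list_models`.

```python
from mmpretrain import get_model, list_models

# Discover the checkpoints registered for the new algorithms
print(list_models(pattern='itpn'))
print(list_models(pattern='spark'))

# Build one of them; 'spark_sparse-resnet50_800e_in1k' is a hypothetical
# placeholder name -- substitute an identifier from the listing above.
model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=False)
```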
Provide examples of New Config and DeepSpeed/FSDP
We tested DeepSpeed and FSDP with MMEngine on ViT-large, ViT-huge and an 8B multi-modal model. The figures below show memory usage (left) and training time (right).
Test environment: 8×A100 (80 GB), PyTorch 2.0.0
Remark: both FSDP and DeepSpeed were tested with their default configurations without tuning; manually tuning the FSDP wrap policy can further reduce training time and memory usage.
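For reference, here is a minimal sketch of what a DeepSpeed setup can look like in a config, modeled on the examples added in this release (#1673, #1675). The concrete field values (ZeRO stage, learning rate, etc.) are illustrative assumptions, not the tuned settings used for the benchmarks above.

```python
# Minimal DeepSpeed sketch for MMEngine's FlexibleRunner; values are
# illustrative assumptions, not the settings used for the benchmarks above.
strategy = dict(
    type='DeepSpeedStrategy',
    # mixed-precision training handled by DeepSpeed
    fp16=dict(enabled=True, initial_scale_power=16),
    # ZeRO stage-3 shards parameters, gradients and optimizer states
    zero_optimization=dict(stage=3, overlap_comm=True),
)

# DeepSpeed manages the optimizer, so wrap it accordingly
optim_wrapper = dict(
    type='DeepSpeedOptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=0.05),
)

# Strategies require FlexibleRunner instead of the default Runner
runner_type = 'FlexibleRunner'
```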
New Features
- Transfer shape-bias tool from mmselfsup (#1658)
- Support downloading datasets via MIM & OpenDataLab (#1630)
- Support New Configs (#1639, #1647, #1665); see the config sketch after this list
- Support Flickr30k Retrieval dataset (#1625)
- Support SparK (#1531)
- Support LLaVA (#1652)
- Support Otter (#1651)
- Support MiniGPT-4 (#1642)
- Add support for VizWiz dataset (#1636)
- Add support for VSR dataset (#1634)
- Add InternImage Classification project (#1569)
- Support OCR-VQA dataset (#1621)
- Support OK-VQA dataset (#1615)
- Support TextVQA dataset (#1569)
- Support iTPN and HiViT (#1584)
- Add retrieval mAP metric (#1552)
- Support NoCaps dataset based on BLIP (#1582)
- Add GQA dataset (#1585)
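As referenced above, the New Config feature replaces string-based `_base_` inheritance with plain Python imports via MMEngine's `read_base`. Below is a hedged sketch of the style; the `_base_` module paths are illustrative placeholders that depend on where the config file lives.

```python
# Sketch of the new pure-Python config style; the _base_ module paths are
# illustrative placeholders -- adjust them to your config directory layout.
from mmengine.config import read_base

with read_base():
    # Base configs are inherited as Python modules instead of path strings,
    # so IDEs can navigate and refactor them.
    from .._base_.models.resnet18 import *
    from .._base_.datasets.imagenet_bs32 import *
    from .._base_.schedules.imagenet_bs256 import *
    from .._base_.default_runtime import *

# Inherited fields are plain Python objects and can be edited in place;
# the 'head'/'num_classes' keys here assume the base model defines them.
model['head']['num_classes'] = 100
```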
Improvements
- Update FSDP ViT-huge and ViT-large configs (#1675)
- Support DeepSpeed with FlexibleRunner (#1673)
- Update Otter and LLaVA docs and configs (#1653)
- Add `image_only` parameter to ScienceQA (#1613)
- Support using "split" to specify the training/validation set (#1535)
Bug Fixes
- Refactor _prepare_pos_embed in ViT (#1656, #1679)
- Freeze pre norm in vision transformer (#1672)
- Fix bug when loading the ImageNet-1k dataset (#1641)
- Fix SAM bug (#1633)
- Fix circular import error for new transforms (#1609)
- Update torchvision transform wrapper (#1595)
- Set default out_type in CAM visualization (#1586)
Docs Update
New Contributors
- @alexwangxiang made their first contribution in #1555
- @InvincibleWyq made their first contribution in #1615
- @yyk-wew made their first contribution in #1634
- @fanqiNO1 made their first contribution in #1673
- @Ben-Louis made their first contribution in #1679
- @Lamply made their first contribution in #1671
- @minato-ellie made their first contribution in #1644
- @liweiwp made their first contribution in #1629