A collection of AWESOME things about HUGE AI models.
There is a trend of training large-scale deep learning models (in terms of parameters, dataset size, and FLOPs), led by big companies. These models achieve SoTA performance at a high price, relying on bags of training tricks and distributed training systems. Keeping an eye on this trend tells us where the current boundaries of AI models lie. [Intro in Chinese]
- Survey
- Language Model
- Vision Models
- Models (Others)
- Recommendation Training Framework
- Distributed Training Framework
- Keys Explanations
- A Dive into Vision-Language Models
- Compute Trends Across Three Eras of Machine Learning [chart]
- A Roadmap to Big Model
- On the Opportunities and Risks of Foundation Models
- Pre-Trained Models: Past, Present and Future
- GPT-4 [OpenAI] Mar 2023 [close]
  GPT-4 Technical Report [Preprint]
  Field: Language-Vision
- LLaMA [Meta] Feb 2023 [open]
  LLaMA: Open and Efficient Foundation Language Models [Preprint]
  Field: Language Params: 65B Training Data: 4TB (1.4T tokens) Training Cost: 1,022,362 GPU hours (2048 80G-A100 x 21 days) Training Power Consumption: 449 MWh Architecture: De
- AnthropicLM [Anthropic] Dec 2022 [close]
  Constitutional AI: Harmlessness from AI Feedback
  Field: Language Params: 52B
- ChatGPT [OpenAI] Nov 2022 [close]
  Field: Language (Dialogue) Params: 175B
- BLOOM [BigScience] Nov 2022 [open]
  BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Preprint]
  Field: Language Params: 176B Training Data: 174GB Training Cost: 1M A100 GPU hours (384 80G-A100 x 4 months) Training Power Consumption: 475 MWh Architecture: De
- Flan-T5, Flan-PaLM [Google] Oct 2022
  Scaling Instruction-Finetuned Language Models [Preprint]
  Field: Language Note: Instruction tuning of T5 and PaLM
- UL2 [Google] May 2022 [open]
  Unifying Language Learning Paradigms [Preprint]
  Field: Language Params: 20B Training Data: 800GB Architecture: En-De
- OPT [Meta] May 2022 [open]
  OPT: Open Pre-trained Transformer Language Models [Preprint]
  Field: Language Params: 175B Training Data: 800GB (180B tokens) Training Cost: 809,472 A100 GPU hours (992 80G-A100 x 34 days) Training Power Consumption: 356 MWh Architecture: De
- PaLM [Google] Apr 2022 [close]
  PaLM: Scaling Language Modeling with Pathways [Preprint]
  Field: Language Params: 540B Training Data: 3TB (780B tokens) Training Cost: $10M (16,809,984 TPUv4 core-hours, 64 days) Training petaFLOPs: 2.5B Architecture: De
- GPT-NeoX [EleutherAI] Apr 2022 [open]
  GPT-NeoX-20B: An Open-Source Autoregressive Language Model [Preprint]
  Field: Language Params: 20B Training petaFLOPs: 93B Architecture: De
- InstructGPT [OpenAI] Mar 2022 [close]
  Training language models to follow instructions with human feedback [Preprint]
  Field: Language Params: 175B
- Chinchilla [DeepMind] Mar 2022 [close]
  Training Compute-Optimal Large Language Models [Preprint]
  Field: Language Params: 70B Training Data: 5.2TB Training petaFLOPs: 580M Architecture: De
- EVA 2.0 [BAAI] Mar 2022 [open]
  EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [Preprint]
  Field: Language (Dialogue) Params: 2.8B Training Data: 180G (1.4B samples, Chinese)
- AlphaCode [DeepMind] Mar 2022 [close]
  Competition-Level Code Generation with AlphaCode [Preprint]
  Field: Code Generation Params: 41B Training Data: (967B tokens) Architecture: De
- ST-MoE [Google] Feb 2022 [close]
  ST-MoE: Designing Stable and Transferable Sparse Expert Models [Preprint]
  Field: Language Params: 296B Architecture: En-De, MoE
- LaMDA [Google] Jan 2022 [close]
  LaMDA: Language Models for Dialog Applications [Preprint]
  Field: Language (Dialogue) Params: 137B Training Data: (1.56T words) Training petaFLOPs: 360M Architecture: De
- GLaM [Google] Dec 2021 [close]
  GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [Preprint]
  Field: Language Params: 1.2T Architecture: De, MoE
- Gopher [DeepMind] Dec 2021 [close]
  Scaling Language Models: Methods, Analysis & Insights from Training Gopher [Preprint]
  Field: Language Params: 280B Training Data: 1.3TB (300B tokens) Training petaFLOPs: 630M Architecture: De
- Yuan 1.0 [Inspur] Oct 2021 [close]
  Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [Preprint]
  Field: Language Params: 245B Training Data: 5TB (180B tokens, Chinese) Training petaFLOPs: 410M Architecture: De, MoE
- MT-NLG [Microsoft, Nvidia] Oct 2021 [close]
  Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [Preprint]
  Field: Language Params: 530B Training Data: (339B tokens) Training petaFLOPs: 1.4B Architecture: De
- Flan-LaMDA [Google] Sept 2021 [close]
  Finetuned Language Models are Zero-Shot Learners [Preprint]
  Field: Language Params: 137B Training Data: Instruction tuning on 60 NLP datasets Architecture: De
- Plato-XL [Baidu] Sept 2021 [close]
  PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [Preprint]
  Field: Language (Dialogue) Params: 11B Training Data: (1.2B samples)
- Jurassic-1 [AI21 Labs] Aug 2021 [close]
  Jurassic-1: Technical Details and Evaluation [Preprint]
  Field: Language Params: 178B Training petaFLOPs: 370M Architecture: De
- Codex [OpenAI] July 2021 [close]
  Evaluating Large Language Models Trained on Code [Preprint]
  Field: Code Generation Params: 12B Training Data: 159GB Architecture: De
- ERNIE 3.0 [Baidu] July 2021 [close]
  ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [Preprint]
  Field: Language Params: 10B Training Data: 4TB (375B tokens, with knowledge graph) Architecture: En Objective: MLM
- CPM-2 [BAAI] June 2021 [open]
  CPM-2: Large-scale Cost-effective Pre-trained Language Models [Preprint]
  Field: Language Params: 198B Training Data: 2.6TB (Chinese 2.3TB, English 300GB) Architecture: En-De Objective: MLM
- HyperClova [Naver] May 2021 [close]
  What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [Preprint]
  Field: Language Params: 82B Training Data: (562B tokens, Korean) Training petaFLOPs: 63B Architecture: De
- ByT5 [Google] May 2021 [open]
  ByT5: Towards a token-free future with pre-trained byte-to-byte models [TACL'22]
  Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
- PanGu-α [Huawei] Apr 2021 [close]
  PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [Preprint]
  Field: Language Params: 200B Training Data: 1.1TB (Chinese) Training petaFLOPs: 58M Architecture: De
- mT5 [Google] Mar 2021 [open]
  mT5: A massively multilingual pre-trained text-to-text transformer [Preprint]
  Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
- WuDao-WenHui [BAAI] Mar 2021 [open]
  Field: Language Params: 2.9B Training Data: 303GB (Chinese)
- GLM [BAAI] Mar 2021 [open]
  GLM: General Language Model Pretraining with Autoregressive Blank Infilling [Preprint]
  Field: Language Params: 10B Architecture: De
- Switch Transformer [Google] Jan 2021 [open]
  Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Preprint]
  Field: Language Params: 1.6T Training Data: 750GB Training petaFLOPs: 82M Architecture: En-De, MoE Objective: MLM
- CPM [BAAI] Dec 2020 [open]
  CPM: A Large-scale Generative Chinese Pre-trained Language Model [Preprint]
  Field: Language Params: 2.6B Training Data: 100G (Chinese) Training petaFLOPs: 1.8M Architecture: De Objective: LTR
- GPT-3 [OpenAI] May 2020 [close]
  Language Models are Few-Shot Learners [NeurIPS'20]
  Field: Language Params: 175B Training Data: 45TB (680B tokens) Training Time: 95 A100 GPU years (835,584 A100 GPU hours, 355 V100 GPU years) Training Cost: $4.6M Training petaFLOPs: 310M Architecture: De Objective: LTR
- Blender [Meta] Apr 2020 [close]
  Recipes for building an open-domain chatbot [Preprint]
  Field: Language (Dialogue) Params: 9.4B
- T-NLG [Microsoft] Feb 2020 [close]
  Field: Language Params: 17B Training petaFLOPs: 16M Architecture: De Objective: LTR
- Meena [Google] Jan 2020 [close]
  Towards a Human-like Open-Domain Chatbot [Preprint]
  Field: Language (Dialogue) Params: 2.6B Training Data: 341GB (40B words) Training petaFLOPs: 110M
- DialoGPT [Microsoft] Nov 2019 [open]
  DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [ACL'20]
  Field: Language (Dialogue) Params: 762M Training Data: (147M conversations) Architecture: De
- T5 [Google] Oct 2019 [open]
  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [JMLR'20]
  Field: Language Params: 11B Training Data: 800GB Training Cost: $1.5M Training petaFLOPs: 41M Architecture: En-De Objective: MLM
- Megatron-LM [Nvidia] Sept 2019 [open]
  Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
  Field: Language Params: 8.3B Training Data: 174GB Training petaFLOPs: 9.1M Architecture: De Objective: LTR
- Megatron-BERT [Nvidia] Sept 2019 [open]
  Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
  Field: Language Params: 3.9B Training Data: 174GB Training petaFLOPs: 57M Architecture: En Objective: MLM
- RoBERTa [Meta] July 2019 [open]
  RoBERTa: A Robustly Optimized BERT Pretraining Approach [Preprint]
  Field: Language Params: 354M Training Data: 160GB Training Time: 1024 V100 GPU days Architecture: En Objective: MLM
- XLNet [Google] June 2019 [open]
  XLNet: Generalized Autoregressive Pretraining for Language Understanding [NeurIPS'19]
  Field: Language Params: 340M Training Data: 113GB (33B words) Training Time: 1280 TPUv3 days Training Cost: $245k Architecture: En Objective: PLM
- GPT-2 [OpenAI] Feb 2019 [open]
  Language Models are Unsupervised Multitask Learners [Preprint]
  Field: Language Params: 1.5B Training Data: 40GB (8M web pages) Training Cost: $43k Training petaFLOPs: 1.5M Architecture: De Objective: LTR
- BERT [Google] Oct 2018 [open]
  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NAACL'19]
  Field: Language Params: 330M Training Data: 16GB (3.3B words) Training Time: 64 TPUv2 days (280 V100 GPU days) Training Cost: $7k Training petaFLOPs: 290k Architecture: En Objective: MLM, NSP
- GPT [OpenAI] June 2018 [open]
  Improving Language Understanding by Generative Pre-Training [Preprint]
  Field: Language Params: 117M Training Data: 1GB (7k books) Training petaFLOPs: 18k Architecture: De Objective: LTR
- MAE->WSP-2B [Meta] Mar 2023 [close]
  The effectiveness of MAE pre-pretraining for billion-scale pretraining
  Field: Vision Params: 6.5B Training Data: (3B images) Architecture: Transformer Objective: MAE, Weakly-Supervised
- OpenCLIP G/14 [LAION] Mar 2023 [open]
  Field: Vision-Language Params: 2.5B Training Data: (2B images)
- ViT-22B [Google] Feb 2023 [close]
  Scaling Vision Transformers to 22 Billion Parameters
  Field: Vision Params: 22B Training Data: (4B images) Architecture: Transformer Objective: Supervised
- InternImage-G [Shanghai AI Lab] Nov 2022 [open]
  InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [CVPR'23 Highlight]
  Field: Vision Params: 3B Architecture: CNN Core Operator: Deformable Convolution v3
- Stable Diffusion [Stability AI] Aug 2022 [open]
  Field: Image Generation (text to image) Params: 890M Training Data: (5B images) Architecture: Transformer, Diffusion
- Imagen [Google] May 2022
  Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Preprint]
  Field: Image Generation (text to image) Text Encoder: T5 Image Decoder: Diffusion, Upsampler
- Flamingo [DeepMind] Apr 2022 [close]
  Flamingo: a Visual Language Model for Few-Shot Learning [Preprint]
  Field: Vision-Language Params: 80B
- DALL·E 2 [OpenAI] Apr 2022
  Hierarchical Text-Conditional Image Generation with CLIP Latents [Preprint]
  Field: Image Generation (text to image) Text Encoder: GPT2 (CLIP) Image Encoder: ViT (CLIP) Image Decoder: Diffusion, Upsampler
- BaGuaLu [BAAI, Alibaba] Apr 2022
  BaGuaLu: targeting brain scale pretrained models with over 37 million cores [PPoPP'22]
  Field: Vision-Language Params: 174T Architecture: M6
- SEER [Meta] Feb 2022 [open]
  Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [Preprint]
  Field: Vision Params: 10B Training Data: (1B images) Architecture: Convolution Objective: SwAV
- ERNIE-ViLG [Baidu] Dec 2021
  ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [Preprint]
  Field: Image Generation (text to image) Params: 10B Training Data: (145M text-image pairs) Architecture: Transformer, dVAE + De
- NUWA [Microsoft] Nov 2021 [open]
  NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [Preprint]
  Field: Vision-Language Generation: Image, Video Params: 870M
- SwinV2-G [Microsoft] Nov 2021 [open]
  Swin Transformer V2: Scaling Up Capacity and Resolution [CVPR'22]
  Field: Vision Params: 3B Training Data: 70M Architecture: Transformer Objective: Supervised
- ViT-G/14 [Google] June 2021
  Scaling Vision Transformers [Preprint]
  Field: Vision Params: 1.8B Training Data: (300M images) Training petaFLOPs: 3.4M Architecture: Transformer Objective: Supervised
- CoAtNet [Google] June 2021 [open]
  CoAtNet: Marrying Convolution and Attention for All Data Sizes [NeurIPS'21]
  Field: Vision Params: 2.4B Training Data: (300M images) Architecture: Transformer, Convolution Objective: Supervised
- V-MoE [Google] June 2021
  Scaling Vision with Sparse Mixture of Experts [NeurIPS'21]
  Field: Vision Params: 15B Training Data: (300M images) Training Time: 16.8k TPUv3 days Training petaFLOPs: 33.9M Architecture: Transformer, MoE Objective: Supervised
- CogView [BAAI, Alibaba] May 2021 </>
  CogView: Mastering Text-to-Image Generation via Transformers [NeurIPS'21]
  Field: Vision-Language Params: 4B Training Data: (30M text-image pairs) Training petaFLOPs: 27M Image Encoder: VAE Text Encoder & Image Decoder: GPT2
- M6 [Alibaba] Mar 2021
  M6: A Chinese Multimodal Pretrainer [Preprint]
  Field: Vision-Language Params: 10T Training Data: 300GB texts + 2TB images Training petaFLOPs: 5.5M Fusion: Single-stream Objective: MLM, IC
- DALL·E [OpenAI] Feb 2021
  Zero-Shot Text-to-Image Generation [ICML'21]
  Field: Image Generation (text to image) Params: 12B Training Data: (250M text-image pairs) Training petaFLOPs: 47M Image Encoder: dVAE Text Encoder & Image Decoder: GPT2
- CLIP [OpenAI] Jan 2021
  Learning Transferable Visual Models From Natural Language Supervision [ICML'21]
  Field: Vision-Language Training Data: 400M text-image pairs Training petaFLOPs: 11M Image Encoder: ViT Text Encoder: GPT-2 Fusion: Dual Encoder Objective: CMCL
- ViT-H/14 [Google] Oct 2020 [open]
  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR'21]
  Field: Vision Params: 632M Training Data: (300M images) Training petaFLOPs: 13M Architecture: Transformer Objective: Supervised
- iGPT-XL [OpenAI] June 2020 [open]
  Generative Pretraining From Pixels [ICML'20]
  Field: Image Generation Params: 6.8B Training Data: (1M images) Training petaFLOPs: 33M Architecture: Transformer, De
- BigGAN-deep [DeepMind] Sept 2018 [open]
  Large Scale GAN Training for High Fidelity Natural Image Synthesis [ICLR'19]
  Field: Image Generation Params: 158M Training Data: (300M images) Training petaFLOPs: 3M Architecture: Convolution, GAN Resolution: 512x512
- PaLM-E [Google] Mar 2023
  PaLM-E: An Embodied Multimodal Language Model [Preprint]
  Field: Reinforcement Learning Params: 562B (540B LLM + 22B ViT)
- Gato [DeepMind] May 2022
  A Generalist Agent [Preprint]
  Field: Reinforcement Learning Params: 1.2B Training Data: (604 tasks) Objective: Supervised
- Zidongtaichu [CASIA] Sept 2021
  Field: Image, Video, Language, Speech Params: 100B
- AlphaFold 2 [DeepMind] July 2021 </>
  Highly accurate protein structure prediction with AlphaFold [Nature]
  Field: Biology Params: 21B Training petaFLOPs: 100k
- HuBERT [Meta] June 2021 </>
  HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [Preprint]
  Field: Speech Params: 1B Training Data: (60k hours) Objective: MLM
- wav2vec 2.0 [Meta] Oct 2020 </>
  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [NeurIPS'20]
  Field: Speech Params: 317M Training Data: (50k hours) Training petaFLOPs: 430M Objective: MLM
- HET [Tencent] Dec 2021
  HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [VLDB'22]
- Persia [Kuaishou] Nov 2021
  Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [Preprint]
  Embeddings Params: 100T
- ZionEX [Facebook] Apr 2021
  Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [ISCA'21]
  Embeddings Params: 10T
- ScaleFreeCTR [Huawei] Apr 2021
  ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [SIGIR'21]
- Kraken [Kuaishou] Nov 2020
  Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [SC'20]
- TensorNet [Qihoo360] Sept 2020 </>
- HierPS [Baidu] Mar 2020
  Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [MLSys'20]
- AIBox [Baidu] Oct 2019
  AIBox: CTR Prediction Model Training on a Single Node [CIKM'19]
  Embeddings Params: 0.1T
- XDL [Alibaba] Aug 2019
  XDL: an industrial deep learning framework for high-dimensional sparse data [DLP-KDD'19]
  Embeddings Params: 0.01T
Deep learning frameworks supporting distributed training are marked with *. A minimal PyTorch DDP sketch follows this list.
- Pathways [Google] Mar 2022
  Pathways: Asynchronous Distributed Dataflow for ML [Preprint]
- Colossal-AI [HPC-AI TECH] Nov 2021 </>
  Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [Preprint]
- OneFlow* [OneFlow] July 2020 </>
  OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [Preprint]
- GShard [Google] June 2020
  GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [Preprint]
- MindSpore* [Huawei] Mar 2020 </>
- DeepSpeed [Microsoft] Oct 2019 </>
  ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [SC'20]
- Megatron [Nvidia] Sept 2019 </>
  Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
- PaddlePaddle [Baidu] Nov 2018 </>
  End-to-end Adaptive Distributed Training on PaddlePaddle [Preprint]
- Horovod [Uber] Feb 2018 </>
  Horovod: fast and easy distributed deep learning in TensorFlow [Preprint]
- PyTorch* [Meta] Sept 2016 </>
  PyTorch: An Imperative Style, High-Performance Deep Learning Library [NeurIPS'19]
- TensorFlow* [Google] Nov 2015 </>
  TensorFlow: A system for large-scale machine learning [OSDI'16]
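To make the "distributed training" tag concrete, here is a minimal data-parallel sketch using PyTorch's DistributedDataParallel, one of the starred frameworks above. The model, data, and hyperparameters are placeholders chosen for illustration, not taken from any entry in this list.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel (DDP).
# Placeholder model/data; launch with: torchrun --nproc_per_node=NUM_GPUS ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL backend for multi-GPU training
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # wraps the model for gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                          # toy training loop on random data
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()                          # gradients synchronized across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```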
- Company tags: the primary company behind the model; other institutions may also have been involved.
- Params: number of parameters of the largest model
- Training data size, training cost, and training petaFLOPs carry some uncertainty.
- Training cost (a worked cost estimate follows this list)
  - TPUv2 hour: $4.5
  - TPUv3 hour: $8
  - V100 GPU hour: $0.55 (2022)
  - A100 GPU hour: $1.10 (2022)
- Architecture (a loading sketch for the three transformer variants follows this list)
  - En: Encoder-based Language Model
  - De: Decoder-based Language Model
  - En-De: Encoder-Decoder-based Language Model
  - The three architectures above are all transformer-based.
  - MoE: Mixture of Experts
- Objective (see the explanation in sections 6-8 of this paper)
  - MLM: Masked Language Modeling
  - LTR: Left-To-Right Language Modeling
  - NSP: Next Sentence Prediction
  - PLM: Permuted Language Modeling
  - IC: Image Captioning
  - VLM: Vision-Language Matching
  - CMCL: Cross-Modal Contrastive Learning
- FLOPs: number of floating-point operations [explanation] (a back-of-the-envelope estimation sketch follows this list)
  - 1 petaFLOPs = 1e15 FLOPs
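As a worked example for the Training cost entry, a rough dollar estimate simply multiplies reported accelerator hours by the hourly rates above. The GPU-hour figures below come from the OPT and BLOOM entries in this list; the resulting dollar amounts are ballpark estimates, not official numbers.

```python
# Rough training-cost estimate: accelerator hours x hourly rate (rates from the list above).
# GPU-hour figures taken from the OPT and BLOOM entries; outputs are ballpark estimates only.
RATES_USD_PER_HOUR = {"A100": 1.10, "V100": 0.55, "TPUv3": 8.0, "TPUv2": 4.5}

def training_cost(accelerator: str, hours: float) -> float:
    return RATES_USD_PER_HOUR[accelerator] * hours

print(f"OPT-175B:   ~${training_cost('A100', 809_472):,.0f}")    # ~ $890k
print(f"BLOOM-176B: ~${training_cost('A100', 1_000_000):,.0f}")  # ~ $1.1M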
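For the Training petaFLOPs entries, a common rule of thumb from the scaling-laws literature (an assumption of this sketch, not necessarily how the figures above were computed) is that training a dense decoder-only model costs roughly 6 x parameters x tokens FLOPs. Plugging in commonly reported token counts reproduces the listed numbers reasonably well.

```python
# Back-of-the-envelope training compute: FLOPs ~= 6 * params * tokens for dense decoder-only models.
# (A rule-of-thumb approximation; the token counts below are commonly reported figures,
#  not values taken from this list.)
def train_petaflops(params: float, tokens: float) -> float:
    return 6 * params * tokens / 1e15   # convert FLOPs to petaFLOPs

# GPT-3: 175B params, ~300B tokens seen during training -> ~315M petaFLOPs (list says 310M)
print(f"GPT-3:      {train_petaflops(175e9, 300e9):,.0f} petaFLOPs")
# Chinchilla: 70B params, ~1.4T tokens -> ~588M petaFLOPs (list says 580M)
print(f"Chinchilla: {train_petaflops(70e9, 1.4e12):,.0f} petaFLOPs")
```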
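The En / De / En-De tags map onto the three standard transformer families. A minimal sketch using the Hugging Face `transformers` library (an assumption of this example, not something the list depends on) loads one open representative of each from the catalog above: BERT (En, MLM), GPT-2 (De, LTR), and T5 (En-De).

```python
# Loading one open model from each architecture family in the list above,
# using the Hugging Face `transformers` library (assumed installed: pip install transformers torch).
from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # En, Objective: MLM
decoder = AutoModelForCausalLM.from_pretrained("gpt2")               # De, Objective: LTR
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")          # En-De, span-corruption (MLM-style)

for name, model in [("BERT", encoder), ("GPT-2", decoder), ("T5", enc_dec)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```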