A collection of AWESOME things about HUGE AI models.
[2023.06] We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly: training and inference code, instruction-tuned weights and datasets, pretrained weights, and the corpora used for pretraining. In this update, I try to catch up with the latest developments in the open-source wave of LLMs.
[2023.03] Only pretrained models are recorded here. Models are sorted by their first release date. To support the open-source movement around LLMs, we highlight open-sourced models with [open].
[2022.06] There is a trend of training large-scale deep learning models (w.r.t. parameters, dataset size, FLOPs) led by big companies. These models achieve SoTA performance at a high price, with bags of training tricks and distributed training systems. Keeping an eye on this trend informs us of the current boundaries of AI models. [Intro in Chinese]
- A Survey of Large Language Models [2023.03]
- A Dive into Vision-Language Models [2023.02]
- Compute Trends Across Three Eras of Machine Learning [chart] [2022.02]
- Vision-and-Language Pretrained Models: A Survey [2022.04]
- A Roadmap to Big Model [2022.03]
- A Survey of Vision-Language Pre-trained Models [2022.02]
- Transformers in Vision: A Survey [2022.01]
- On the Opportunities and Risks of Foundation Models [2021.08]
- Pre-Trained Models: Past, Present and Future [2021.06]
Resources list:
-
Baichuan [Baichuan] Jun. 2023 [open]
Field: Language Params: 7B Training Data: 1.2T tokens (English, Chinese, Private) License: Apache 2.0 Context Length: 4096
-
Falcon [TII] May. 2023 [open]
Field: Language Params: 40B Training Data: 1T tokens (RefinedWeb) License: Apache 2.0 Context Length: 2048
-
OpenLLaMA [OpenLM] May. 2023 [open]
Field: Language Params: 13B, 7B, 3B Training Data: 1T tokens (RedPajama) License: Apache 2.0 Context Length: 2048
-
Redpajama-INCITE [Together] May. 2023 [open]
Field: Language Params: 7B, 3B Training Data: 1T tokens (Redpajama) License: Apache 2.0 Context Length: 2048
-
MPT [MosaicML] May. 2023 [open]
Field: Language Params: 30B, 7B Training Data: 1T tokens (Private) License: Apache 2.0, CC BY-SA-3.0 Context Length: 84k
-
StableLM [Stability-AI] Apr. 2023 [open]
Field: Language Params: 7B, 3B Training Data: 1.5T tokens License: CC BY-SA-4.0
-
Lit-LLaMA [Lightning-AI] Apr. 2023 [open]
Field: Language Params: 13B, 7B Training Data: 1.2T tokens (Redpajama) License: Apache 2.0
-
h2oGPT [H2O.ai] [open]
h2oGPT: Democratizing Large Language Models
Field: Language Params: 13B, 7B Training Data: 1.0T tokens License: Apache 2.0 Context Length: 2048
-
Cerebras-GPT [Cerebras] Mar. 2023 [open]
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster [Preprint]
Field: Language Params: 13B Training Data: 371B tokens (The Pile) License: Apache 2.0 Context Length: 2048
-
Claude [Anthropic] Mar. 2023 [close]
Field: Language
-
GPT-4 [OpenAI] Mar. 2023 [close]
GPT-4 Technical Report [Preprint]
Field: Language-Vision Params: 1.7T (rumored) Architecture: De, MoE
-
Bard [Google]
Field: Language-Vision
-
LLaMA [Meta] Feb. 2023 [open]
LLaMA: Open and Efficient Foundation Language Models [Preprint]
Field: Language Params: 65B, 33B, 13B, 7B Training Data: 4TB (1.4T tokens) Training Cost: 1,022,362 A100 hours (2048 80G-A100 x 21 days) Training Power Consumption: 449 MWh Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc. License: GPL
-
RWKV-4 [Personal] Dec. 2022 [open]
Field: Language Params: 14B, 7B, 3B, 1.5B Training Data: 332B tokens Architecture: De, RNN License: Apache 2.0
-
AnthropicLM [Anthropic] Dec. 2022 [close]
Constitutional AI: Harmlessness from AI Feedback
Field: Language Params: 52B
-
BLOOM [BigScience] Nov. 2022 [open]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Preprint]
Field: Language Params: 176B Training Data: 174GB (336B tokens) Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months Training Power Consumption: 475 MWh Training Framework: Megatron + Deepspeed Instruction-tuned Variants: BLOOMZ License: OpenRAIL-M v1 Context Length: 2048
-
Galactica [Meta] Nov. 2022 [open]
Galactica: A Large Language Model for Science [Preprint]
Field: Language Params: 125M, 1.3B, 6.7B, 30B, 120B
-
Pythia [EleutherAI] Oct. 2022 [open]
Field: Language Params: 12B Instruction-tuned Variants: Dolly 2.0 License: Apache 2.0 Context Length: 2048
-
GLM-130B [BAAI] Oct. 2022 [open]
GLM-130B: An Open Bilingual Pre-trained Model [ICLR'23]
Field: Language Params: 130B Training Data: (400B tokens) Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days Training Framework: Megatron + Deepspeed
-
UL2 [Google] May 2022 [open]
Unifying Language Learning Paradigms [Preprint]
Field: Language Params: 20B (1T tokens) Training Data: 800GB Architecture: En-De Training Framework: Jax + T5x License: Apache 2.0 Instruction-tuned Variants: Flan-UL2 Context Length: 2048
-
OPT [Meta] May 2022 [open]
OPT: Open Pre-trained Transformer Language Models [Preprint]
Field: Language Params: 175B Training Data: 800GB (180B tokens) Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days Training Power Consumption: 356 MWh Architecture: De Training Framework: Megatron + Fairscale
-
PaLM [Google] Apr. 2022 [close]
PaLM: Scaling Language Modeling with Pathways [Preprint]
Field: Language Params: 540B Training Data: 3TB (780B tokens) Training Cost: $10M (16,809,984 TPUv4core-hours, 64 days) Training petaFLOPs: 2.5B Architecture: De Training Framework: Jax + T5x
-
GPT-NeoX [EleutherAI] Apr. 2022 [open]
GPT-NeoX-20B: An Open-Source Autoregressive Language Model [Preprint]
Field: Language Params: 20B Training Data: 525GiB Training petaFLOPs: 93B Architecture: De Training Framework: Megatron + DeepSpeed License: Apache 2.0 Context Length: 2048
-
InstructGPT [OpenAI] Mar. 2022 [close]
Training language models to follow instructions with human feedback [Preprint]
Field: Language Params: 175B
-
Chinchilla [DeepMind] Mar. 2022 [close]
Training Compute-Optimal Large Language Models [Preprint]
Field: Language Params: 70B Training Data: 5.2TB (1.4T tokens) Training petaFLOPs: 580M Architecture: De
-
EVA 2.0 [BAAI] Mar. 2022 [open]
EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [Preprint]
Field: Language (Dialogue) Params: 2.8B Training Data: 180GB (1.4B samples, Chinese)
-
AlphaCode [DeepMind] Mar. 2022 [close]
Competition-Level Code Generation with AlphaCode [Preprint]
Field: Code Generation Params: 41B Training Data: (967B tokens) Architecture: De
-
ST-MoE [Google] Feb. 2022 [close]
ST-MoE: Designing Stable and Transferable Sparse Expert Models [Preprint]
Field: Language Params: 296B Architecture: En-De, MoE
-
LaMDA [Google] Jan. 2022 [close]
LaMDA: Language Models for Dialog Applications [Preprint]
Field: Language (Dialogue) Params: 137B Training Data: (1.56T words) Training petaFLOPs: 360M Architecture: De
-
GLaM [Google] Dec. 2021 [close]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [Preprint]
Field: Language Params: 1.2T Architecture: De, MoE
-
Gopher [DeepMind] Dec. 2021 [close]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher [Preprint]
Field: Language Params: 280B Training Data: 1.3TB (300B tokens) Training petaFLOPs: 630M Architecture: De
-
Yuan 1.0 [inspur] Oct. 2021 [close]
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [Preprint]
Field: Language Params: 245B Training Data: 5TB (180B tokens, Chinese) Training petaFLOPs: 410M Architecture: De, MoE
-
MT-NLG [Microsoft, Nvidia] Oct. 2021 [close]
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [Preprint]
Field: Language Params: 530B Training Data: 339B tokens Training petaFLOPs: 1.4B Architecture: De
-
Plato-XL [Baidu] Sept. 2021 [close]
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [Preprint]
Field: Language (Dialogue) Params: 11B Training Data: (1.2B samples)
-
GPT-J [EleutherAI] Aug. 2021 [open]
Field: Language Params: 6B Programming Language: Jax
-
Jurassic-1 [AI21 Labs] Aug. 2021 [close]
Jurassic-1: Technical Details and Evaluation [Preprint]
Field: Language Params: 178B Training petaFLOPs: 370M Architecture: De
-
Codex [OpenAI] July 2021 [close]
Evaluating Large Language Models Trained on Code [Preprint]
Field: Code Generation Params: 12B Training Data: 159GB Architecture: De
-
ERNIE 3.0 [Baidu] July 2021 [close]
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [Preprint]
Field: Language Params: 10B Training Data: 4TB (375B tokens, with knowledge graph) Architecture: En Objective: MLM
-
CPM-2 [BAAI] June 2021 [open]
CPM-2: Large-scale Cost-effective Pre-trained Language Models [Preprint]
Field: Language Params: 198B Training Data: 2.6TB (Chinese 2.3TB, English 300GB) Architecture: En-De Objective: MLM
-
HyperClova [Naver] May 2021 [close]
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [Preprint]
Field: Language Params: 82B Training Data: 562B tokens (Korean) Training petaFLOPs: 63B Architecture: De
-
ByT5 [Google] May 2021 [open]
ByT5: Towards a token-free future with pre-trained byte-to-byte models [TACL'22]
Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
-
PanGu-α [Huawei] Apr. 2021 [close]
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [Preprint]
Field: Language Params: 200B Training Data: 1.1TB (Chinese) Training petaFLOPs: 58M Architecture: De
-
mT5 [Google] Mar. 2021 [open]
mT5: A massively multilingual pre-trained text-to-text transformer [Preprint]
Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
-
WuDao-WenHui [BAAI] Mar. 2021 [open]
Field: Language Params: 2.9B Training Data: 303GB (Chinese)
-
GLM [BAAI] Mar. 2021 [open]
GLM: General Language Model Pretraining with Autoregressive Blank Infilling [Preprint]
Field: Language Params: 10B Architecture: De
-
Switch Transformer [Google] Jan. 2021 [open]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Preprint]
Field: Language Params: 1.6T Training Data: 750GB Training petaFLOPs: 82M Architecture: En-De, MoE Objective: MLM
-
CPM [BAAI] Dec. 2020 [open]
CPM: A Large-scale Generative Chinese Pre-trained Language Model [Preprint]
Field: Language Params: 2.6B Training Data: 100GB (Chinese) Training petaFLOPs: 1.8M Architecture: De Objective: LTR
-
GPT-3 [OpenAI] May 2020 [close]
Language Models are Few-Shot Learners [NeurIPS'20]
Field: Language Params: 175B Training Data: 45TB (680B Tokens) Training Time: 95 A100 GPU years (835,584 A100 GPU hours, 355 V100 GPU years) Training Cost: $4.6M Training petaFLOPs: 310M Architecture: De Objective: LTR Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
-
Blender [Meta] Apr. 2020 [close]
Recipes for building an open-domain chatbot [Preprint]
Field: Language (Dialogue) Params: 9.4B
-
T-NLG [Microsoft] Feb. 2020 [close]
Field: Language Params: 17B Training petaFLOPs: 16M Architecture: De Objective: LTR
-
Meena [Google] Jan. 2020 [close]
Towards a Human-like Open-Domain Chatbot [Preprint]
Field: Language (Dialogue) Params: 2.6B Training Data: 341GB (40B words) Training petaFLOPs: 110M
-
DialoGPT [Microsoft] Nov. 2019 [open]
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [ACL'20]
Field: Language (Dialogue) Params: 762M Training Data: (147M conversations) Architecture: De
-
T5 [Google] Oct. 2019 [open]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [JMLR'20]
Field: Language Params: 11B Training Data: 800GB Training Cost: $1.5M Training petaFLOPs: 41M Architecture: En-De Objective: MLM License: Apache 2.0 Instruction-tuned Variants: Flan-T5 Context Length: 512
-
Megatron-LM [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
Field: Language Params: 8.3B Training Data: 174GB Training petaFLOPs: 9.1M Architecture: De Objective: LTR Training Framework: Megatron
-
Megatron-BERT [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
Field: Language Params: 3.9B Training Data: 174GB Training petaFLOPs: 57M Architecture: En Objective: MLM Training Framework: Megatron
-
RoBERTa [Meta] July 2019 [open]
RoBERTa: A Robustly Optimized BERT Pretraining Approach [Preprint]
Field: Language Params: 354M Training Data: 160GB Training Time: 1024 V100 GPU days Architecture: En Objective: MLM
-
XLNet [Google] June 2019 [open]
XLNet: Generalized Autoregressive Pretraining for Language Understanding [NeurIPS'19]
Field: Language Params: 340M Training Data: 113GB (33B words) Training Time: 1280 TPUv3 days Training Cost: $245k Architecture: En Objective: PLM
-
GPT-2 [OpenAI] Feb. 2019 [open]
Language Models are Unsupervised Multitask Learners [Preprint]
Field: Language Params: 1.5B Training Data: 40GB (8M web pages) Training Cost: $43k Training petaFLOPs: 1.5M Architecture: De Objective: LTR
-
BERT [Google] Oct. 2018 [open]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NAACL'19]
Field: Language Params: 330M Training Data: 16GB (3.3B words) Training Time: 64 TPUv2 days (280 V100 GPU days) Training Cost: $7k Training petaFLOPs: 290k Architecture: En Objective: MLM, NSP
-
GPT [OpenAI] June 2018 [open]
Improving Language Understanding by Generative Pre-Training [Preprint]
Field: Language Params: 117M Training Data: 1GB (7k books) Training petaFLOPs: 18k Architecture: De Objective: LTR
-
Eva02-E [BAAI] Mar. 2023 [open]
EVA-02: A Visual Representation for Neon Genesis [Preprint]
Field: Vision-Language Params: 5B Training Data: 2B image-text pairs Architecture: Transformer Objective: MIM, CLIP Contrastive
-
MAE->WSP-2B [Meta] Mar. 2023 [close]
The effectiveness of MAE pre-pretraining for billion-scale pretraining [Preprint]
Field: Vision Params: 6.5B Training Data: 3B images Architecture: Transformer Objective: MAE, Weakly-Supervised
-
OpenCLIP G/14 [LAION] Mar. 2023 [open]
Field: Vision-Language Params: 2.5B Training Data: 2B images
-
ViT-22B [Google] Feb. 2023 [close]
Scaling Vision Transformers to 22 Billion Parameters
Field: Vision Params: 22B Training Data: 4B images Architecture: Transformer Objective: Supervised
-
InternImage-G [Shanghai AI Lab] Nov. 2022 [open]
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [CVPR'23 Highlight]
Field: Vision Params: 3B Architecture: CNN Core Operator: Deformable Convolution v3
-
Stable Diffusion [Stability AI] Aug. 2022 [open]
Field: Image Generation (text to image) Params: 890M Training Data: 5B images Architecture: Transformer, Diffusion
-
Imagen [Google] May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Preprint]
Field: Image Generation (text to image) Text Encoder: T5 Image Decoder: Diffusion, Upsampler
-
Flamingo [DeepMind] Apr. 2022 [close]
Flamingo: a Visual Language Model for Few-Shot Learning [Preprint]
Field: Vision-Language Params: 80B
-
DALL·E 2 [OpenAI] Apr. 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents [Preprint]
Field: Image Generation (text to image) Text Encoder: GPT2 (CLIP) Image Encoder: ViT (CLIP) Image Decoder: Diffusion, Upsampler
-
BaGuaLu [BAAI, Alibaba] Apr. 2022
BaGuaLu: targeting brain scale pretrained models with over 37 million cores [PPoPP'22]
Field: Vision-Language Params: 174T Architecture: M6
-
SEER [Meta] Feb. 2022 [open]
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [Preprint]
Field: Vision Params: 10B Training Data: 1B images Architecture: Convolution Objective: SwAV
-
ERNIE-ViLG [Baidu] Dec. 2021
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [Preprint]
Field: Image Generation (text to image) Params: 10B Training Data: 145M text-image pairs Architecture: Transformer, dVAE + De
-
NUWA [Microsoft] Nov. 2021 [open]
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [Preprint]
Field: Vision-Language Generation: Image, Video Params: 870M
-
SwinV2-G [Microsoft] Nov. 2021 [open]
Swin Transformer V2: Scaling Up Capacity and Resolution [CVPR'22]
Field: Vision Params: 3B Training Data: 70M images Architecture: Transformer Objective: Supervised
-
Zidongtaichu [CASIA] Sept. 2021 [close]
Field: Image, Video, Language, Speech Params: 100B
-
ViT-G/14 [Google] June 2021
Scaling Vision Transformers [Preprint]
Field: Vision Params: 1.8B Training Data: 300M images Training petaFLOPs: 3.4M Architecture: Transformer Objective: Supervised
-
CoAtNet [Google] June 2021 [open]
CoAtNet: Marrying Convolution and Attention for All Data Sizes [NeurIPS'21]
Field: Vision Params: 2.4B Training Data: 300M images Architecture: Transformer, Convolution Objective: Supervised
-
V-MoE [Google] June 2021
Scaling Vision with Sparse Mixture of Experts [NeurIPS'21]
Field: Vision Params: 15B Training Data: 300M images Training Time: 16.8k TPUv3 days Training petaFLOPs: 33.9M Architecture: Transformer, MoE Objective: Supervised
-
CogView [BAAI, Alibaba] May 2021
CogView: Mastering Text-to-Image Generation via Transformers [NeurIPS'21]
Field: Vision-Language Params: 4B Training Data: 30M text-image pairs Training petaFLOPs: 27M Image Encoder: VAE Text Encoder & Image Decoder: GPT2
-
M6 [Alibaba] Mar. 2021
M6: A Chinese Multimodal Pretrainer [Preprint]
Field: Vision-Language Params: 10T Training Data: 300GB Texts + 2TB Images Training petaFLOPs: 5.5M Fusion: Single-stream Objective: MLM, IC
-
DALL·E [OpenAI] Feb. 2021
Zero-Shot Text-to-Image Generation [ICML'21]
Field: Image Generation (text to image) Params: 12B Training Data: 250M text-image pairs Training petaFLOPs: 47M Image Encoder: dVAE Text Encoder & Image Decoder: GPT2
-
CLIP [OpenAI] Jan. 2021
Learning Transferable Visual Models From Natural Language Supervision [ICML'21]
Field: Vision-Language Training Data: 400M text-image pairs Training petaFLOPs: 11M Image Encoder: ViT Text Encoder: GPT-2 Fusion: Dual Encoder Objective: CMCL
-
ViT-H/14 [Google] Oct. 2020 [open]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR'21]
Field: Vision Params: 632M Training Data: 300M images Training petaFLOPs: 13M Architecture: Transformer Objective: Supervised
-
iGPT-XL [OpenAI] June 2020 [open]
Generative Pretraining From Pixels [ICML'20]
Field: Image Generation Params: 6.8B Training Data: 1M images Training petaFLOPs: 33M Architecture: Transformer, De
-
BigGAN-deep [DeepMind] Sept. 2018 [open]
Large Scale GAN Training for High Fidelity Natural Image Synthesis [ICLR'19]
Field: Image Generation Params: 158M Training Data: 300M images Training petaFLOPs: 3M Architecture: Convolution, GAN Resolution: 512x512
-
PaLM-E [Google] March 2023 [close]
PaLM-E: An Embodied Multimodal Language Model [Preprint]
Field: Reinforcement Learning Params: 562B (540B LLM + 22B ViT)
-
Gato [DeepMind] May 2022 [close]
A Generalist Agent [Preprint]
Field: Reinforcement Learning Params: 1.2B Training Data: (604 Tasks) Objective: Supervised
-
USM [Google] Mar. 2023 [close]
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [Preprint]
Field: Speech Params: 2B Training Data: 12,000,000 hours
-
Whisper [OpenAI] Sept. 2022 [open]
Robust Speech Recognition via Large-Scale Weak Supervision [Preprint]
Field: Speech Params: 1.55B Training Data: 680,000 hours Objective: Weakly Supervised
-
HuBERT [Meta] June 2021 [open]
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [Preprint]
Field: Speech Params: 1B Training Data: 60,000 hours Objective: MLM
-
wav2vec 2.0 [Meta] Oct. 2020 [open]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [NeurIPS'20]
Field: Speech Params: 317M Training Data: 50,000 hours Training petaFLOPs: 430M Objective: MLM
-
DeepSpeech 2 [Baidu] Dec. 2015 [open]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [ICML'16]
Field: Speech Params: 300M Training Data: 21,340 hours
-
AlphaFold 2 [DeepMind] July 2021 [open]
Highly accurate protein structure prediction with AlphaFold [Nature]
Field: Biology Params: 21B Training petaFLOPs: 100k
This section will be reorganized. For now, since LLMs prevail and data quality is key to their performance, we keep track of notable open pretraining datasets here (a minimal loading sketch follows the list below).
- SlimPajama: 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
- RefinedWeb: ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
- MNBVC: 5TB (on-going, target 40TB), Chinese, MIT License
- The Pile: 825GB
- RedPajama: 1.2T tokens
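These corpora are large enough that streaming access is usually more practical than a full download. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset id `cerebras/SlimPajama-627B` and the `text` field are assumptions about how SlimPajama is mirrored on the Hub, so substitute whichever mirror you actually use.

```python
# Minimal streaming peek at an open pretraining corpus (a sketch, not a recipe).
# Assumption: SlimPajama is mirrored on the Hugging Face Hub under
# "cerebras/SlimPajama-627B" with the raw document text stored in a "text" field.
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])  # first 200 characters of each document
    if i == 2:                    # stop after a few samples
        break
```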
Deep learning frameworks supporting distributed training are marked with *. (A minimal data-parallel training sketch follows these lists.)
- Accelerate [Huggingface] Oct. 2020 [open]
- Hivemind Aug. 2020 [open]
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [Preprint]
- FairScale [Meta] July 2020 [open]
- DeepSpeed [Microsoft] Oct. 2019 [open]
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [SC'20]
- Megatron [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
- PyTorch* [Meta] Sept. 2016 [open]
PyTorch: An Imperative Style, High-Performance Deep Learning Library [NeurIPS'19]
- T5x [Google] Mar. 2022 [open]
Scaling Up Models and Data with t5x and seqio [Preprint]
- Alpa [Google] Jan. 2022 [open]
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [OSDI'22]
- Pathways [Google] Mar. 2021 [close]
Pathways: Asynchronous Distributed Dataflow for ML [Preprint]
- Colossal-AI [HPC-AI TECH] Nov. 2021 [open]
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [Preprint]
- GShard [Google] June 2020
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [Preprint]
- Jax* [Google] Oct. 2019 [open]
- Mesh Tensorflow [Google] Nov. 2018 [open]
- Horovod [Uber] Feb. 2018 [open]
Horovod: fast and easy distributed deep learning in TensorFlow [Preprint]
- TensorFlow* [Google] Nov. 2015 [open]
TensorFlow: A system for large-scale machine learning [OSDI'16]
- OneFlow* [OneFlow] July 2020 [open]
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [Preprint]
- MindSpore* [Huawei] Mar. 2020 [open]
- PaddlePaddle* [Baidu] Nov. 2018 [open]
End-to-end Adaptive Distributed Training on PaddlePaddle [Preprint]
- Ray [Berkeley] Dec. 2017 [open]
Ray: A Distributed Framework for Emerging AI Applications [OSDI'17]
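All of the starred frameworks above build on the same data-parallel core, sketched below with plain PyTorch DDP; systems such as DeepSpeed, Megatron, and Colossal-AI layer ZeRO sharding and tensor/pipeline parallelism on top of this pattern. This is a toy sketch, not how any of the listed models was actually trained, and the script name in the launch command is arbitrary.

```python
# Toy data-parallel training loop with PyTorch DDP.
# Launch with, e.g.:  torchrun --nproc_per_node=2 ddp_toy.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR for us
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()

    if torch.cuda.is_available():
        device = rank % torch.cuda.device_count()
        torch.cuda.set_device(device)
        model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[device])
    else:
        device = "cpu"
        model = DDP(torch.nn.Linear(1024, 1024))

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):
        x = torch.randn(8, 1024, device=device)
        loss = model(x).pow(2).mean()  # dummy objective
        loss.backward()                # gradients are all-reduced across ranks here
        opt.step()
        opt.zero_grad()
        if rank == 0:
            print(f"step {step}  loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```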
- Petals [BigScience] Dec. 2022 [open]
- FlexGen [Stanford, Berkeley, CMU, etc.] May 2022 [open]
- FasterTransformer [NVIDIA] Apr. 2021 [open]
- MegEngine [MegEngine] Mar. 2020
- DeepSpeed-Inference [Microsoft] Oct. 2019 [open]
- MediaPipe [Google] July 2019 [open]
- TensorRT [Nvidia] Jun 2019 [open]
- MNN [Alibaba] May 2019 [open]
- OpenVINO [Intel] Oct. 2019 [open]
- ONNX [Linux Foundation] Sep 2017 [open]
- ncnn [Tencent] July 2017 [open]
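Several of the inference stacks above (ONNX Runtime, TensorRT, OpenVINO) consume ONNX graphs, so a common first step is exporting a PyTorch module. A minimal sketch with a tiny stand-in MLP (not one of the models listed in this repo):

```python
# Export a tiny stand-in model to ONNX; TensorRT / OpenVINO / ONNX Runtime
# can all load the resulting file. This only illustrates the hand-off step.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
).eval()

dummy = torch.randn(1, 512)
torch.onnx.export(
    model,
    dummy,
    "tiny_mlp.onnx",
    input_names=["hidden_in"],
    output_names=["hidden_out"],
    dynamic_axes={"hidden_in": {0: "batch"}, "hidden_out": {0: "batch"}},
)

# Optional sanity check if onnxruntime is installed:
# import onnxruntime as ort
# sess = ort.InferenceSession("tiny_mlp.onnx")
# print(sess.run(None, {"hidden_in": dummy.numpy()})[0].shape)
```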
-
HET [Tencent] Dec. 2021
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [VLDB'22]
-
Persia [Kuaishou] Nov. 2021
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [Preprint]
Embeddings Params: 100T
-
ZionEX [Meta] Apr. 2021
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [ISCA'21]
Embeddings Params: 10T
-
ScaleFreeCTR [Huawei] Apr. 2021
ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [SIGIR'21]
-
Kraken [Kuaishou] Nov. 2020
Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [SC'20]
-
TensorNet [Qihoo360] Sept. 2020 [open]
-
HierPS [Baidu] Mar. 2020
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [MLSys'20]
-
AIBox [Baidu] Oct. 2019
AIBox: CTR Prediction Model Training on a Single Node [CIKM'19]
Embeddings Params: 0.1T
-
XDL [Alibaba] Aug. 2019
XDL: an industrial deep learning framework for high-dimensional sparse data [DLP-KDD'21]
Embeddings Params: 0.01T
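The "Embeddings Params" figures above are dominated by sparse embedding tables, whose size is roughly (number of ids) x (embedding dimension). The back-of-the-envelope sketch below uses illustrative id counts (assumptions, not numbers from the papers) to show how such tables reach the trillion-parameter range.

```python
# Rough parameter count of sparse embedding tables in a CTR/recommendation model.
# The id counts are made-up assumptions chosen only to illustrate the scale.
def embedding_params(num_ids: int, dim: int) -> int:
    return num_ids * dim

tables = {
    "user_id": embedding_params(1_000_000_000, 64),    # 1B users  x 64 dims
    "item_id": embedding_params(10_000_000_000, 64),   # 10B items x 64 dims
    "crosses": embedding_params(100_000_000_000, 16),  # hashed cross features
}

total = sum(tables.values())
print(f"total embedding params ~ {total / 1e12:.2f}T")  # ~2.30T for this toy setup
```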
- Company tags: the related company name. Other institutes may also have been involved.
- Params: number of parameters of the largest model
- Training data size, training cost and training petaFLOPs may have some uncertainty.
- Training cost
- TPUv2 hour: $4.5
- TPUv3 hour: $8
- V100 GPU hour: $0.55 (2022)
- A100 GPU hour: $1.10 (2022)
- Architecture
- En: Encoder-based Language Model
- De: Decoder-based Language Model
- En-De: Encoder-Decoder-based Language Model
- The above three architectures are all transformer-based.
- MoE: Mixture of Experts
- Objective (see explanations in sections 6-8 of this paper)
- MLM: Masked Language Modeling
- LTR: Left-To-Right Language Modeling
- NSP: Next Sentence Prediction
- PLM: Permuted Language Modeling
- IC: Image Captioning
- VLM: Vision-Language Matching
- CMCL: Cross-Modal Contrastive Learning
- FLOPs: number of FLOating-Point operations [explanation]
- 1 petaFLOPs = 1e15 FLOPs
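As a sanity check on the "Training petaFLOPs" numbers above, the usual rule of thumb for dense transformer training is FLOPs of roughly 6 x parameters x tokens. The token counts below (about 300B processed for GPT-3, 1.4T for Chinchilla) are the commonly reported figures, and the estimate is approximate by construction.

```python
# Estimate training compute with the ~6*N*D rule of thumb and compare with the
# "Training petaFLOPs" values listed above (rough consistency check only).
def train_petaflops(params: float, tokens: float) -> float:
    return 6 * params * tokens / 1e15  # 1 petaFLOPs = 1e15 FLOPs

print(f"GPT-3      ~{train_petaflops(175e9, 300e9) / 1e6:.0f}M petaFLOPs")  # ~315M (listed: 310M)
print(f"Chinchilla ~{train_petaflops(70e9, 1.4e12) / 1e6:.0f}M petaFLOPs")  # ~588M (listed: 580M)
```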