A collection of AWESOME things about HUGE AI models.
[2023.06] We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly: training and inference code, instruction-tuned weights and datasets, pretrained weights, and the corpora used for pretraining. In this update, I try to catch up with the latest developments in the open-source wave of LLMs.
[2023.03] Only pretrained models are recorded here. Models are sorted by their first release date. To support the open-source movement around LLMs, we highlight open-sourced models with [open].
[2022.06] There is a trend of training large-scale deep learning models (w.r.t. parameters, dataset size, FLOPs) led by big companies. These models achieve SoTA performance at a high price, with bags of training tricks and distributed training systems. Keeping an eye on this trend informs us of the current boundaries of AI models. [Intro in Chinese]
- A Survey of Large Language Models [2023.03]
- A Dive into Vision-Language Models [2023.02]
- Compute Trends Across Three Eras of Machine Learning [chart] [2022.02]
- Vision-and-Language Pretrained Models: A Survey [2022.04]
- A Roadmap to Big Model [2022.03]
- A Survey of Vision-Language Pre-trained Models [2022.02]
- Transformers in Vision: A Survey [2022.01]
- On the Opportunities and Risks of Foundation Models [2021.08]
- Pre-Trained Models: Past, Present and Future [2021.06]
Resources list:
-
Baichuan [Baichuan] Jun. 2023 [open]
Field: Language Params: 7B Training Data: 1.2T tokens (English, Chinese, Private) License: Apache 2.0 Context Length: 4096
-
Falcon [TII] May. 2023 [open]
Field: Language Params: 40B Training Data: 1T tokens (RefinedWeb) License: Apache 2.0 Context Length: 2048
-
OpenLLaMA [OpenLM] May. 2023 [open]
Field: Language Params: 13B, 7B, 3B Training Data: 1T tokens (RedPajama) License: Apache 2.0 Context Length: 2048
-
Redpajama-INCITE [Together] May. 2023 [open]
Field: Language Params: 7B, 3B Training Data: 1T tokens (Redpajama) License: Apache 2.0 Context Length: 2048
-
MPT [MosaicML] May. 2023 [open]
Field: Language Params: 30B, 7B Training Data: 1T tokens (Private) License: Apache 2.0, CC BY-SA-3.0 Context Length: 84k
-
StableLM [Stability-AI] Apr. 2023 [open]
Field: Language Params: 7B, 3B Training Data: 1.5T tokens License: CC BY-SA-4.0
-
Lit-LLaMA [Lightning-AI] Apr. 2023 [open]
Field: Language Params: 13B, 7B Training Data: 1.2T tokens (Redpajama) License: Apache 2.0
-
h2oGPT [H2O.ai] [open]
h2oGPT: Democratizing Large Language Models
Field: Language Params: 13B, 7B Training Data: 1.0T tokens License: Apache 2.0 Context Length: 2048
-
Cerebras-GPT [Cerebras] Mar. 2023 [open]
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster [Preprint]
Field: Language Params: 13B Training Data: 371B tokens (The Pile) License: Apache 2.0 Context Length: 2048
-
Claude [Anthropic] Mar. 2023 [close]
Field: Language
-
GPT-4 [OpenAI] Mar. 2023 [close]
GPT-4 Technical Report [Preprint]
Field: Language-Vision Params: 1.7T (rumored) Architecture: De, MoE
-
Bard [Google]
Field: Language-Vision
-
LLaMA [Meta] Feb. 2023 [open]
LLaMA: Open and Efficient Foundation Language Models [Preprint]
Field: Language Params: 65B, 33B, 13B, 7B Training Data: 4TB (1.4T tokens) Training Cost: 1,022,362 A100 hours (2048 80G-A100 x 21 days) Training Power Consumption: 449 MWh Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc. License: GPL
-
RWKV-4 [Personal] Dec. 2022 [open]
Field: Language Params: 14B, 7B, 3B, 1.5B Training Data: 332B tokens Architecture: De, RNN License: Apache 2.0
-
AnthropicLM [Anthropic] Dec. 2022 [close]
Constitutional AI: Harmlessness from AI Feedback
Field: Language Params: 52B
-
BLOOM [BigScience] Nov. 2022 [open]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Preprint]
Field: Language Params: 176B Training Data: 174GB (336B tokens) Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months Training Power Consumption: 475 MWh Training Framework: Megatron + Deepspeed Instruction-tuned Variants: BLOOMZ License: OpenRAIL-M v1 Context Length: 2048
-
Galactica [Meta] Nov. 2022 [open]
Galactica: A Large Language Model for Science [Preprint]
Field: Language Params: 125M, 1.3B, 6.7B, 30B, 120B
-
Pythia [EleutherAI] Oct. 2022 [open]
Field: Language Params: 12B Instruction-tuned Variants: Dolly 2.0 License: Apache 2.0 Context Length: 2048
-
GLM-130B [BAAI] Oct. 2022 [open]
GLM-130B: An Open Bilingual Pre-trained Model [ICLR'23]
Field: Language Params: 130B Training Data: (400B tokens) Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days Training Framework: Megatron + Deepspeed
-
UL2 [Google] May 2022 [open]
Unifying Language Learning Paradigms [Preprint]
Field: Language Params: 20B (1T tokens) Training Data: 800GB Architecture: En-De Training Framework: Jax + T5x License: Apache 2.0 Instruction-tuned Variants: Flan-UL2 Context Length: 2048
-
OPT [Meta] May 2022 [open]
OPT: Open Pre-trained Transformer Language Models [Preprint]
Field: Language Params: 175B Training Data: 800GB (180B tokens) Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days Training Power Consumption: 356 MWh Architecture: De Training Framework: Megatron + Fairscale
-
PaLM [Google] Apr. 2022 [close]
PaLM: Scaling Language Modeling with Pathways [Preprint]
Field: Language Params: 540B Training Data: 3TB (780B tokens) Training Cost: $10M (16,809,984 TPUv4core-hours, 64 days) Training petaFLOPs: 2.5B Architecture: De Training Framework: Jax + T5x
-
GPT-NeoX [EleutherAI] Apr. 2022 [open]
GPT-NeoX-20B: An Open-Source Autoregressive Language Model [Preprint]
Field: Language Params: 20B Training Data: 525GiB Training petaFLOPs: 93B Architecture: De Training Framework: Megatron + DeepSpeed License: Apache 2.0 Context Length: 2048
-
InstructGPT [OpenAI] Mar. 2022 [close]
Training language models to follow instructions with human feedback [Preprint]
Field: Language Params: 175B
-
Chinchilla [DeepMind] Mar. 2022 [close]
Training Compute-Optimal Large Language Models [Preprint]
Field: Language Params: 70B Training Data: 5.2TB (1.4T tokens) Training petaFLOPs: 580M Architecture: De
-
EVA 2.0 [BAAI] Mar. 2022 [open]
EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [Preprint]
Field: Language (Dialogue) Params: 2.8B Training Data: 180GB (1.4B samples, Chinese)
-
AlphaCode [DeepMind] Mar. 2022 [close]
Competition-Level Code Generation with AlphaCode [Preprint]
Field: Code Generation Params: 41B Training Data: (967B tokens) Architecture: De
-
ST-MoE [Google] Feb. 2022 [close]
ST-MoE: Designing Stable and Transferable Sparse Expert Models [Preprint]
Field: Language Params: 296B Architecture: En-De, MoE
-
LaMDA [Google] Jan. 2022 [close]
LaMDA: Language Models for Dialog Applications [Preprint]
Field: Language (Dialogue) Params: 137B Training Data: (1.56T words) Training petaFLOPs: 360M Architecture: De
-
GLaM [Google] Dec. 2021 [close]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [Preprint]
Field: Language Params: 1.2T Architecture: De, MoE
-
Gopher [DeepMind] Dec. 2021 [close]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher [Preprint]
Field: Language Params: 280B Training Data: 1.3TB (300B tokens) Training petaFLOPs: 630M Architecture: De
-
Yuan 1.0 [inspur] Oct. 2021 [close]
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [Preprint]
Field: Language Params: 245B Training Data: 5TB (180B tokens, Chinese) Training petaFLOPs: 410M Architecture: De, MoE
-
MT-NLG [Microsoft, Nvidia] Oct. 2021 [close]
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [Preprint]
Field: Language Params: 530B Training Data: 339B tokens Training petaFLOPs: 1.4B Architecture: De
-
Plato-XL [Baidu] Sept. 2021 [close]
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [Preprint]
Field: Language (Dialogue) Params: 11B Training Data: (1.2B samples)
-
GPT-J [EleutherAI] Aug. 2021 [open]
Field: Language Params: 6B Programming Language: Jax
-
Jurassic-1 [AI21 Labs] Aug. 2021 [close]
Jurassic-1: Technical Details and Evaluation [Preprint]
Field: Language Params: 178B Training petaFLOPs: 370M Architecture: De
-
Codex [OpenAI] July 2021 [close]
Evaluating Large Language Models Trained on Code [Preprint]
Field: Code Generation Params: 12B Training Data: 159GB Architecture: De
-
ERNIE 3.0 [Baidu] July 2021 [close]
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [Preprint]
Field: Language Params: 10B Training Data: 4TB (375B tokens, with knowledge graph) Architecture: En Objective: MLM
-
CPM-2 [BAAI] June 2021 [open]
CPM-2: Large-scale Cost-effective Pre-trained Language Models [Preprint]
Field: Language Params: 198B Training Data: 2.6TB (Chinese 2.3TB, English 300GB) Architecture: En-De Objective: MLM
-
HyperClova [Naver] May 2021 [close]
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [Preprint]
Field: Language Params: 82B Training Data: 562B tokens (Korean) Training petaFLOPs: 63B Architecture: De
-
ByT5 [Google] May 2021 [open]
ByT5: Towards a token-free future with pre-trained byte-to-byte models [TACL'22]
Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
-
PanGu-α [Huawei] Apr. 2021 [close]
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [Preprint]
Field: Language Params: 200B Training Data: 1.1TB (Chinese) Training petaFLOPs: 58M Architecture: De
-
mT5 [Google] Mar. 2021 [open]
mT5: A massively multilingual pre-trained text-to-text transformer [Preprint]
Field: Language Params: 13B Training Data: (101 languages) Architecture: En-De
-
WuDao-WenHui [BAAI] Mar. 2021 [open]
Field: Language Params: 2.9B Training Data: 303GB (Chinese)
-
GLM [BAAI] Mar. 2021 [open]
GLM: General Language Model Pretraining with Autoregressive Blank Infilling [Preprint]
Field: Language Params: 10B Architecture: De
-
Switch Transformer [Google] Jan. 2021 [open]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Preprint]
Field: Language Params: 1.6T Training Data: 750GB Training petaFLOPs: 82M Architecture: En-De, MoE Objective: MLM
-
CPM [BAAI] Dec. 2020 [open]
CPM: A Large-scale Generative Chinese Pre-trained Language Model [Preprint]
Field: Language Params: 2.6B Training Data: 100GB (Chinese) Training petaFLOPs: 1.8M Architecture: De Objective: LTR
-
GPT-3 [OpenAI] May 2020 [close]
Language Models are Few-Shot Learners [NeurIPS'20]
Field: Language Params: 175B Training Data: 45TB (680B Tokens) Training Time: 95 A100 GPU years (835,584 A100 GPU hours, 355 V100 GPU years) Training Cost: $4.6M Training petaFLOPs: 310M Architecture: De Objective: LTR Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
-
Blender [Meta] Apr. 2020 [close]
Recipes for building an open-domain chatbot [Preprint]
Field: Language (Dialogue) Params: 9.4B
-
T-NLG [Microsoft] Feb. 2020 [close]
Field: Language Params: 17B Training petaFLOPs: 16M Architecture: De Objective: LTR
-
Meena [Google] Jan. 2020 [close]
Towards a Human-like Open-Domain Chatbot [Preprint]
Field: Language (Dialogue) Params: 2.6B Training Data: 341GB (40B words) Training petaFLOPs: 110M
-
DialoGPT [Microsoft] Nov. 2019 [open]
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [ACL'20]
Field: Language (Dialogue) Params: 762M Training Data: (147M conversations) Architecture: De
-
T5 [Google] Oct. 2019 [open]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [JMLR'20]
Field: Language Params: 11B Training Data: 800GB Training Cost: $1.5M Training petaFLOPs: 41M Architecture: En-De Objective: MLM License: Apache 2.0 Instruction-tuned Variants: Flan-T5 Context Length: 512
-
Megatron-LM [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
Field: Language Params: 8.3B Training Data: 174GB Training petaFLOPs: 9.1M Architecture: De Objective: LTR Training Framework: Megatron
-
Megatron-BERT [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
Field: Language Params: 3.9B Training Data: 174GB Training petaFLOPs: 57M Architecture: En Objective: MLM Training Framework: Megatron
-
RoBERTa [Meta] July 2019 [open]
RoBERTa: A Robustly Optimized BERT Pretraining Approach [Preprint]
Field: Language Params: 354M Training Data: 160GB Training Time: 1024 V100 GPU days Architecture: En Objective: MLM
-
XLNet [Google] June 2019 [open]
XLNet: Generalized Autoregressive Pretraining for Language Understanding [NeurIPS'19]
Field: Language Params: 340M Training Data: 113GB (33B words) Training Time: 1280 TPUv3 days Training Cost: $245k Architecture: En Objective: PLM
-
GPT-2 [OpenAI] Feb. 2019 [open]
Language Models are Unsupervised Multitask Learners [Preprint]
Field: Language Params: 1.5B Training Data: 40GB (8M web pages) Training Cost: $43k Training petaFLOPs: 1.5M Architecture: De Objective: LTR
-
BERT [Google] Oct. 2018 [open]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NAACL'19]
Field: Language Params: 330M Training Data: 16GB (3.3B words) Training Time: 64 TPUv2 days (280 V100 GPU days) Training Cost: $7k Training petaFLOPs: 290k Architecture: En Objective: MLM, NSP
-
GPT [OpenAI] June 2018 [open]
Improving Language Understanding by Generative Pre-Training [Preprint]
Field: Language Params: 117M Training Data: 1GB (7k books) Training petaFLOPs: 18k Architecture: De Objective: LTR
-
Eva02-E [BAAI] Mar. 2023 [open]
EVA-02: A Visual Representation for Neon Genesis [Preprint]
Field: Vision-Language Params: 5B Training Data: 2B image-text pairs Architecture: Transformer Objective: MIM, CLIP Contrastive
-
MAE->WSP-2B [Meta] Mar. 2023 [close]
The effectiveness of MAE pre-pretraining for billion-scale pretraining [Preprint]
Field: Vision Params: 6.5B Training Data: 3B images Architecture: Transformer Objective: MAE, Weakly-Supervised
-
OpenCLIP G/14 [LAION] Mar. 2023 [open]
Field: Vision-Language Params: 2.5B Training Data: 2B images
-
ViT-22B [Google] Feb. 2023 [close]
Scaling Vision Transformers to 22 Billion Parameters
Field: Vision Params: 22B Training Data: 4B images Architecture: Transformer Objective: Supervised
-
InternImage-G [Shanghai AI Lab] Nov. 2022 [open]
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [CVPR'23 Highlight]
Field: Vision Params: 3B Architecture: CNN Core Operator: Deformable Convolution v3
-
Stable Diffusion [Stability AI] Aug. 2022 [open]
Field: Image Generation (text to image) Params: 890M Training Data: 5B images Architecture: Transformer, Diffusion
-
Imagen [Google] May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Preprint]
Field: Image Generation (text to image) Text Encoder: T5 Image Decoder: Diffusion, Upsampler
-
Flamingo [DeepMind] Apr. 2022 [close]
Flamingo: a Visual Language Model for Few-Shot Learning [Preprint]
Field: Vision-Language Params: 80B
-
DALL·E 2 [OpenAI] Apr. 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents [Preprint]
Field: Image Generation (text to image) Text Encoder: GPT2 (CLIP) Image Encoder: ViT (CLIP) Image Decoder: Diffusion, Upsampler
-
BaGuaLu [BAAI, Alibaba] Apr. 2022
BaGuaLu: targeting brain scale pretrained models with over 37 million cores [PPoPP'22]
Field: Vision-Language Params: 174T Architecture: M6
-
SEER [Meta] Feb. 2022 [open]
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [Preprint]
Field: Vision Params: 10B Training Data: 1B images Architecture: Convolution Objective: SwAV
-
ERNIE-ViLG [Baidu] Dec. 2021
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [Preprint]
Field: Image Generation (text to image) Params: 10B Training Data: 145M text-image pairs Architecture: Transformer, dVAE + De
-
NUWA [Microsoft] Nov. 2021 [open]
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [Preprint]
Field: Vision-Language Generation: Image, Video Params: 870M
-
SwinV2-G [Microsoft] Nov. 2021 [open]
Swin Transformer V2: Scaling Up Capacity and Resolution [CVPR'22]
Field: Vision Params: 3B Training Data: 70M images Architecture: Transformer Objective: Supervised
-
Zidongtaichu [CASIA] Sept. 2021 [close]
Field: Image, Video, Language, Speech Params: 100B
-
ViT-G/14 [Google] June 2021
Scaling Vision Transformers [Preprint]
Field: Vision Params: 1.8B Training Data: 300M images Training petaFLOPs: 3.4M Architecture: Transformer Objective: Supervised
-
CoAtNet [Google] June 2021 [open]
CoAtNet: Marrying Convolution and Attention for All Data Sizes [NeurIPS'21]
Field: Vision Params: 2.4B Training Data: 300M images Architecture: Transformer, Convolution Objective: Supervised
-
V-MoE [Google] June 2021
Scaling Vision with Sparse Mixture of Experts [NeurIPS'21]
Field: Vision Params: 15B Training Data: 300M images Training Time: 16.8k TPUv3 days Training petaFLOPs: 33.9M Architecture: Transformer, MoE Objective: Supervised
-
CogView [BAAI, Alibaba] May 2021
CogView: Mastering Text-to-Image Generation via Transformers [NeurIPS'21]
Field: Vision-Language Params: 4B Training Data: 30M text-image pairs Training petaFLOPs: 27M Image Encoder: VAE Text Encoder & Image Decoder: GPT2
-
M6 [Alibaba] Mar. 2021
M6: A Chinese Multimodal Pretrainer [Preprint]
Field: Vision-Language Params: 10T Training Data: 300GB Texts + 2TB Images Training petaFLOPs: 5.5M Fusion: Single-stream Objective: MLM, IC
-
DALL·E [OpenAI] Feb. 2021
Zero-Shot Text-to-Image Generation [ICML'21]
Field: Image Generation (text to image) Params: 12B Training Data: 250M text-image pairs Training petaFLOPs: 47M Image Encoder: dVAE Text Encoder & Image Decoder: GPT2
-
CLIP [OpenAI] Jan. 2021
Learning Transferable Visual Models From Natural Language Supervision [ICML'21]
Field: Vision-Language Training Data: 400M text-image pairs Training petaFLOPs: 11M Image Encoder: ViT Text Encoder: GPT-2 Fusion: Dual Encoder Objective: CMCL
-
ViT-H/14 [Google] Oct. 2020 [open]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR'21]
Field: Vision Params: 632M Training Data: 300M images Training petaFLOPs: 13M Architecture: Transformer Objective: Supervised
-
iGPT-XL [OpenAI] June 2020 [open]
Generative Pretraining From Pixels [ICML'20]
Field: Image Generation Params: 6.8B Training Data: 1M images Training petaFLOPs: 33M Architecture: Transformer, De
-
BigGAN-deep [DeepMind] Sept. 2018 [open]
Large Scale GAN Training for High Fidelity Natural Image Synthesis [ICLR'19]
Field: Image Generation Params: 158M Training Data: 300M images Training petaFLOPs: 3M Architecture: Convolution, GAN Resolution: 512x512
-
PaLM-E [Google] March 2023 [close]
PaLM-E: An Embodied Multimodal Language Model [Preprint]
Field: Reinforcement Learning Params: 562B (540B LLM + 22B ViT)
-
Gato [DeepMind] May 2022 [close]
A Generalist Agent [Preprint]
Field: Reinforcement Learning Params: 1.2B Training Data: (604 Tasks) Objective: Supervised
-
USM [Google] Mar. 2023 [close]
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [Preprint]
Field: Speech Params: 2B Training Data: 12,000,000 hours
-
Whisper [OpenAI] Sept. 2022 [open]
Robust Speech Recognition via Large-Scale Weak Supervision [Preprint]
Field: Speech Params: 1.55B Training Data: 680,000 hours Objective: Weakly Supervised
-
HuBERT [Meta] June 2021 [open]
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [Preprint]
Field: Speech Params: 1B Training Data: 60,000 hours Objective: MLM
-
wav2vec 2.0 [Meta] Oct. 2020 [open]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [NeurIPS'20]
Field: Speech Params: 317M Training Data: 50,000 hours Training petaFLOPs: 430M Objective: MLM
-
DeepSpeech 2 [Baidu] Dec. 2015 [open]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [ICML'16]
Field: Speech Params: 300M Training Data: 21,340 hours
-
AlphaFold 2 [DeepMind] July 2021 [open]
Highly accurate protein structure prediction with AlphaFold [Nature]
Field: Biology Params: 21B Training petaFLOPs: 100k
This section will be reorganized. For now, since LLMs prevail and data quality is key to their performance, we keep track of notable open pretraining datasets here (a minimal loading sketch follows the list below).
- SlimPajama: 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
- RefinedWeb: ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
- MNBVC: 5TB (on-going, target 40TB), Chinese, MIT License
- The Pile: 825GB
- RedPajama: 1.2T tokens
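These corpora are large enough that streaming access is usually more practical than a full download. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset id `cerebras/SlimPajama-627B` and the `text` field are assumptions about how SlimPajama is mirrored on the Hub, so substitute whichever mirror you actually use.

```python
# Minimal streaming peek at an open pretraining corpus (a sketch, not a recipe).
# Assumption: SlimPajama is mirrored on the Hugging Face Hub under
# "cerebras/SlimPajama-627B" with the raw document text stored in a "text" field.
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])  # first 200 characters of each document
    if i == 2:                    # stop after a few samples
        break
```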
Deep learning frameworks supporting distributed training are marked with *. (A minimal data-parallel training sketch follows these lists.)
- Accelerate [Huggingface] Oct. 2020 [open]
- Hivemind Aug. 2020 [open]
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [Preprint]
- FairScale [Meta] July 2020 [open]
- DeepSpeed [Microsoft] Oct. 2019 [open]
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [SC'20]
- Megatron [Nvidia] Sept. 2019 [open]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]
- PyTorch* [Meta] Sept. 2016 [open]
PyTorch: An Imperative Style, High-Performance Deep Learning Library [NeurIPS'19]
- T5x [Google] Mar. 2022 [open]
Scaling Up Models and Data with t5x and seqio [Preprint]
- Alpa [Google] Jan. 2022 [open]
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [OSDI'22]
- Pathways [Google] Mar. 2021 [close]
Pathways: Asynchronous Distributed Dataflow for ML [Preprint]
- Colossal-AI [HPC-AI TECH] Nov. 2021 [open]
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [Preprint]
- GShard [Google] June 2020
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [Preprint]
- Jax* [Google] Oct. 2019 [open]
- Mesh Tensorflow [Google] Nov. 2018 [open]
- Horovod [Uber] Feb. 2018 [open]
Horovod: fast and easy distributed deep learning in TensorFlow [Preprint]
- TensorFlow* [Google] Nov. 2015 [open]
TensorFlow: A system for large-scale machine learning [OSDI'16]
- OneFlow* [OneFlow] July 2020 [open]
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [Preprint]
- MindSpore* [Huawei] Mar. 2020 [open]
- PaddlePaddle* [Baidu] Nov. 2018 [open]
End-to-end Adaptive Distributed Training on PaddlePaddle [Preprint]
- Ray [Berkeley] Dec. 2017 [open]
Ray: A Distributed Framework for Emerging AI Applications [OSDI'17]
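All of the starred frameworks above build on the same data-parallel core, sketched below with plain PyTorch DDP; systems such as DeepSpeed, Megatron, and Colossal-AI layer ZeRO sharding and tensor/pipeline parallelism on top of this pattern. This is a toy sketch, not how any of the listed models was actually trained, and the script name in the launch command is arbitrary.

```python
# Toy data-parallel training loop with PyTorch DDP.
# Launch with, e.g.:  torchrun --nproc_per_node=2 ddp_toy.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR for us
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()

    if torch.cuda.is_available():
        device = rank % torch.cuda.device_count()
        torch.cuda.set_device(device)
        model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[device])
    else:
        device = "cpu"
        model = DDP(torch.nn.Linear(1024, 1024))

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):
        x = torch.randn(8, 1024, device=device)
        loss = model(x).pow(2).mean()  # dummy objective
        loss.backward()                # gradients are all-reduced across ranks here
        opt.step()
        opt.zero_grad()
        if rank == 0:
            print(f"step {step}  loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```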
- Petals [BigScience] Dec. 2022 [open]
- FlexGen [Stanford, Berkeley, CMU, etc.] May 2022 [open]
- FasterTransformer [NVIDIA] Apr. 2021 [open]
- MegEngine [MegEngine] Mar. 2020
- DeepSpeed-Inference [Microsoft] Oct. 2019 [open]
- MediaPipe [Google] July 2019 [open]
- TensorRT [Nvidia] Jun 2019 [open]
- MNN [Alibaba] May 2019 [open]
- OpenVINO [Intel] Oct. 2019 [open]
- ONNX [Linux Foundation] Sep 2017 [open]
- ncnn [Tencent] July 2017 [open]
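Several of the inference stacks above (ONNX Runtime, TensorRT, OpenVINO) consume ONNX graphs, so a common first step is exporting a PyTorch module. A minimal sketch with a tiny stand-in MLP (not one of the models listed in this repo):

```python
# Export a tiny stand-in model to ONNX; TensorRT / OpenVINO / ONNX Runtime
# can all load the resulting file. This only illustrates the hand-off step.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
).eval()

dummy = torch.randn(1, 512)
torch.onnx.export(
    model,
    dummy,
    "tiny_mlp.onnx",
    input_names=["hidden_in"],
    output_names=["hidden_out"],
    dynamic_axes={"hidden_in": {0: "batch"}, "hidden_out": {0: "batch"}},
)

# Optional sanity check if onnxruntime is installed:
# import onnxruntime as ort
# sess = ort.InferenceSession("tiny_mlp.onnx")
# print(sess.run(None, {"hidden_in": dummy.numpy()})[0].shape)
```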
-
HET [Tencent] Dec. 2021
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [VLDB'22]
-
Persia [Kuaishou] Nov. 2021
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [Preprint]
Embeddings Params: 100T
-
ZionEX [Meta] Apr. 2021
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [ISCA'21]
Embeddings Params: 10T
-
ScaleFreeCTR [Huawei] Apr. 2021
ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [SIGIR'21]
-
Kraken [Kuaishou] Nov. 2020
Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [SC'20]
-
TensorNet [Qihoo360] Sept. 2020 [open]
-
HierPS [Baidu] Mar. 2020
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [MLSys'20]
-
AIBox [Baidu] Oct. 2019
AIBox: CTR Prediction Model Training on a Single Node [CIKM'19]
Embeddings Params: 0.1T
-
XDL [Alibaba] Aug. 2019
XDL: an industrial deep learning framework for high-dimensional sparse data [DLP-KDD'21]
Embeddings Params: 0.01T
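The "Embeddings Params" figures above are dominated by sparse embedding tables, whose size is roughly (number of ids) x (embedding dimension). The back-of-the-envelope sketch below uses illustrative id counts (assumptions, not numbers from the papers) to show how such tables reach the trillion-parameter range.

```python
# Rough parameter count of sparse embedding tables in a CTR/recommendation model.
# The id counts are made-up assumptions chosen only to illustrate the scale.
def embedding_params(num_ids: int, dim: int) -> int:
    return num_ids * dim

tables = {
    "user_id": embedding_params(1_000_000_000, 64),    # 1B users  x 64 dims
    "item_id": embedding_params(10_000_000_000, 64),   # 10B items x 64 dims
    "crosses": embedding_params(100_000_000_000, 16),  # hashed cross features
}

total = sum(tables.values())
print(f"total embedding params ~ {total / 1e12:.2f}T")  # ~2.30T for this toy setup
```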
- Company tags: the related company name. Other institutes may also have been involved.
- Params: number of parameters of the largest model
- Training data size, training cost and training petaFLOPs may have some uncertainty.
- Training cost
- TPUv2 hour: $4.5
- TPUv3 hour: $8
- V100 GPU hour: $0.55 (2022)
- A100 GPU hour: $1.10 (2022)
- Architecture
- En: Encoder-based Language Model
- De: Decoder-based Language Model
- En-De: Encoder-Decoder-based Language Model
- The above three architectures are all transformer-based.
- MoE: Mixture of Experts
- Objective (see explanations in sections 6-8 of this paper)
- MLM: Masked Language Modeling
- LTR: Left-To-Right Language Modeling
- NSP: Next Sentence Prediction
- PLM: Permuted Language Modeling
- IC: Image Captioning
- VLM: Vision-Language Matching
- CMCL: Cross-Modal Contrastive Learning
- FLOPs: number of FLOating-Point operations [explanation]
- 1 petaFLOPs = 1e15 FLOPs
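As a sanity check on the "Training petaFLOPs" numbers above, the usual rule of thumb for dense transformer training is FLOPs of roughly 6 x parameters x tokens. The token counts below (about 300B processed for GPT-3, 1.4T for Chinchilla) are the commonly reported figures, and the estimate is approximate by construction.

```python
# Estimate training compute with the ~6*N*D rule of thumb and compare with the
# "Training petaFLOPs" values listed above (rough consistency check only).
def train_petaflops(params: float, tokens: float) -> float:
    return 6 * params * tokens / 1e15  # 1 petaFLOPs = 1e15 FLOPs

print(f"GPT-3      ~{train_petaflops(175e9, 300e9) / 1e6:.0f}M petaFLOPs")  # ~315M (listed: 310M)
print(f"Chinchilla ~{train_petaflops(70e9, 1.4e12) / 1e6:.0f}M petaFLOPs")  # ~588M (listed: 580M)
```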