awesome-huge-models

A collection of AWESOME things about HUGE AI models.

[2023.06] We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly, including training and inference code, instruction-tuning datasets and tuned weights, pretrained weights, and the datasets used for pretraining. In this update, I try to catch up with the latest developments in the open-source wave of LLMs.

[2023.03] Only pretrained models are recorded here. Models are sorted by first release date. To support the open-sourcing of LLMs, we highlight open-sourced models with [open]; a minimal loading sketch for such checkpoints follows the Language Model list.

[2022.06] There is a trend, led by big companies, of training large-scale deep learning models (in terms of parameters, dataset size, and FLOPs). These models achieve SoTA performance at a high price, relying on many training tricks and distributed training systems. Keeping an eye on this trend informs us of the current boundaries of AI models. [Intro in Chinese]

Contents

Survey

Big models in NLP

Resources list

Models

Language Model

LLM evolutionary tree

  • Baichuan [Baichuan] Jun. 2023 [open]

    Field: Language
    Params: 7B
    Training Data: 1.2T tokens (English, Chinese, Private)
    License: Apache 2.0
    Context Length: 4096
  • Falcon [TII] Jun. 2023 [open]

    Field: Language
    Params: 40B
    Training Data: 1T tokens (RefinedWeb)
    License: Apache 2.0
    Context Length: 2048
  • OpenLLaMA [OpenLM] May. 2023 [open]

    Field: Language
    Params: 13B, 7B, 3B
    Training Data: 1T tokens (RedPajama)
    License: Apache 2.0
    Context Length: 2048
  • RedPajama-INCITE [Together] May. 2023 [open]

    Field: Language
    Params: 7B, 3B
    Training Data: 1T tokens (Redpajama)
    License: Apache 2.0
    Context Length: 2048
  • MPT [MosaicML] May. 2023 [open]

    Field: Language
    Params: 30B, 7B
    Training Data: 1T tokens (Private)
    License: Apache 2.0, CC BY-SA-3.0
    Context Length: 84k
  • Stable-LM [Stability-AI] Apr. 2023 [open]

    Field: Language
    Params: 7B, 3B
    Training Data: 1.5T tokens
    License: CC BY-SA-4.0
  • Lit-LLaMA [Lightning-AI] Apr. 2023 [open]

    Field: Language
    Params: 13B, 7B
    Training Data: 1.2T tokens (Redpajama)
    License: Apache 2.0
  • h2oGPT [H2O.ai] [open]
    h2oGPT: Democratizing Large Language Models

    Field: Language
    Params: 13B, 7B
    Training Data: 1.0T tokens
    License: Apache 2.0
    Context Length: 2048
  • Cerebras-GPT [Cerebras] Mar. 2023 [open]
    Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster [Preprint]

    Field: Language
    Params: 13B
    Training Data: 371B tokens (Redpajama)
    License: Apache 2.0
    Context Length: 2048
  • Claude [Anthropic] Mar. 2023 [close]

    Field: Language-Vision
  • GPT-4 [OpenAI] Mar. 2023 [close]
    GPT-4 Technical Report [Preprint]

    Field: Language-Vision
    Params: 1.7T
    Architecture: De, MoE
  • Bard [Google]

    Field: Language-Vision
  • LLaMA [Meta] Feb. 2023 [open]
    Open and Efficient Foundation Language Models [Preprint]

    Field: Language
    Params: 65B, 33B, 13B, 7B
    Training Data: 4TB (1.4T tokens)
    Training Cost: 1,022,362 GPU hours (2048 80G-A100 x 21 days)
    Training Power Consumption: 449 MWh
    Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc.
    License: GPL
  • RWKV-4 [Personal] Dec. 2022 [open]

    Field: Language
    Params: 14B, 7B, 3B, 1.5B
    Training Data: 332B tokens
    Architecture: De, RNN
    License: Apache 2.0
  • AnthropicLM [Anthropic] Dec. 2022 [close]
    Constitutional AI: Harmlessness from AI Feedback

    Field: Language
    Params: 52B
  • BLOOM [BigScience] Nov. 2022 [open]
    A 176B-Parameter Open-Access Multilingual Language Model [Preprint]

    Field: Language
    Params: 176B
    Training Data: 174GB (336B tokens)
    Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months
    Training Power Consumption: 475 MWh
    Training Framework: Megatron + Deepspeed
    Instruction-tuned Variants: BLOOMZ
    License: OpenRAIL-M v1
    Context Length: 2048
  • Galactica [Meta] Nov. 2022 [open] Galactica: A Large Language Model for Science [Preprint]

    Field: Language
    Params: 120B, 30B, 6.7B, 1.3B, 125M
  • Pythia [EleutherAI] Oct. 2022 [open]

    Field: Language
    Params: 12B
    Instruction-tuned Variants: Dolly 2.0
    License: Apache 2.0
    Context Length: 2048
  • GLM-130B [BAAI] Oct. 2022 [open]
    GLM-130B: An Open Bilingual Pre-trained Model [ICLR'23]

    Field: Language
    Params: 130B
    Training Data: (400B tokens)
    Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days
    Training Framework: Megatron + Deepspeed
  • UL2 [Google] May 2022 [open]
    Unifying Language Learning Paradigms [Preprint]

    Field: Language
    Params: 20B
    Training Data: 800GB (1T tokens)
    Architecture: En-De
    Training Framework: Jax + T5x
    License: Apache 2.0
    Instruction-tuned Variants: Flan-UL2
    Context Length: 2048
  • OPT [Meta] May 2022 [open]
    OPT: Open Pre-trained Transformer Language Models [Preprint]

    Field: Language
    Params: 175B
    Training Data: 800GB (180B tokens)
    Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days
    Training Power Consumption: 356 MWh
    Architecture: De
    Training Framework: Megatron + Fairscale
  • PaLM [Google] Apr. 2022 [close]
    PaLM: Scaling Language Modeling with Pathways [Preprint]

    Field: Language
    Params: 540B
    Training Data: 3TB (780B tokens)
    Training Cost: $10M (16,809,984 TPUv4 core-hours, 64 days)
    Training petaFLOPs: 2.5B
    Architecture: De
    Training Framework: Jax + T5x
  • GPT-NeoX [EleutherAI] Apr. 2022 [open]
    GPT-NeoX-20B: An Open-Source Autoregressive Language Model [Preprint]

    Field: Language
    Params: 20B
    Training Data: 525GiB
    Training petaFLOPs: 93B
    Architecture: De
    Training Framework: Megatron + DeepSpeed
    License: Apache 2.0
    Context Length: 2048
  • InstructGPT [OpenAI] Mar. 2022 [close]
    Training language models to follow instructions with human feedback [Preprint]

    Field: Language
    Params: 175B
  • Chinchilla [DeepMind] Mar. 2022 [close]
    Training Compute-Optimal Large Language Models [Preprint]

    Field: Language
    Params: 70B
    Training Data: 5.2TB (1.4T tokens)
    Training petaFLOPs: 580M
    Architecture: De
  • EVA 2.0 [BAAI] Mar. 2022 [open]
    EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [Preprint]

    Field: Language (Dialogue)
    Params: 2.8B
    Training Data: 180G (1.4B samples, Chinese)
  • AlphaCode [DeepMind] Mar. 2022 [close]
    Competition-Level Code Generation with AlphaCode [Preprint]

    Field: Code Generation
    Params: 41B
    Training Data: (967B tokens)
    Architecture: De
  • ST-MoE [Google] Feb. 2022 [close]
    ST-MoE: Designing Stable and Transferable Sparse Expert Models [Preprint]

    Field: Language
    Params: 296B
    Architecture: En-De, MoE
  • LaMDA [Google] Jan. 2022 [close]
    LaMDA: Language Models for Dialog Applications [Preprint]

    Field: Language (Dialogue)
    Params: 137B
    Training Data: (1.56T words)
    Training petaFLOPs: 360M
    Architecture: De
  • GLaM [Google] Dec. 2021 [close]
    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [Preprint]

    Field: Language
    Params: 1.2T
    Architecture: De, MoE
  • Gopher [DeepMind] Dec. 2021 [close]
    Scaling Language Models: Methods, Analysis & Insights from Training Gopher [Preprint]

    Field: Language
    Params: 280B
    Training Data: 1.3TB (300B tokens)
    Training petaFLOPs: 630M
    Architecture: De
  • Yuan 1.0 [Inspur] Oct. 2021 [close]
    Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [Preprint]

    Field: Language
    Params: 245B
    Training Data: 5TB (180B tokens, Chinese)
    Training petaFLOPs: 410M
    Architecture: De, MoE
  • MT-NLG [Microsoft, Nvidia] Oct. 2021 [close]
    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [Preprint]

    Field: Language
    Params: 530B
    Training Data: 339B tokens
    Training petaFLOPs: 1.4B
    Architecture: De
  • PLATO-XL [Baidu] Sept. 2021 [close]
    PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [Preprint]

    Field: Language (Dialogue)
    Params: 11B
    Training Data: (1.2B samples)
  • GPT-J [EleutherAI] Aug. 2021 [open]

    Field: Language
    Params: 6B
    Programming Language: Jax
  • Jurassic-1 [AI21 Labs] Aug. 2021 [close]
    Jurassic-1: Technical Details and Evaluation [Preprint]

    Field: Language
    Params: 178B
    Training petaFLOPs: 370M
    Architecture: De
  • Codex [OpenAI] July 2021 [close]
    Evaluating Large Language Models Trained on Code [Preprint]

    Field: Code Generation
    Params: 12B
    Training Data: 159GB
    Architecture: De
  • ERNIE 3.0 [Baidu] July 2021 [close]
    ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [Preprint]

    Field: Language
    Params: 10B
    Training Data: 4TB (375B tokens, with knowledge graph)
    Architecture: En
    Objective: MLM
  • CPM-2 [BAAI] June 2021 [open]
    CPM-2: Large-scale Cost-effective Pre-trained Language Models [Preprint]

    Field: Language
    Params: 198B
    Training Data: 2.6TB (Chinese 2.3TB, English 300GB)
    Architecture: En-De
    Objective: MLM
  • HyperCLOVA [Naver] May 2021 [close]
    What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [Preprint]

    Field: Language
    Params: 82B
    Training Data: 562B tokens (Korean)
    Training petaFLOPs: 63B
    Architecture: De
  • ByT5 [Google] May 2021 [open]
    ByT5: Towards a token-free future with pre-trained byte-to-byte models [TACL'22]

    Field: Language
    Params: 13B
    Training Data: (101 languages)
    Architecture: En-De
  • PanGu-α [Huawei] Apr. 2021 [close]
    PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [Preprint]

    Field: Language
    Params: 200B
    Training Data: 1.1TB (Chinese)
    Training petaFLOPs: 58M
    Architecture: De
  • mT5 [Google] Mar. 2021 [open]
    mT5: A massively multilingual pre-trained text-to-text transformer [Preprint]

    Field: Language
    Params: 13B
    Training Data: (101 languages)
    Architecture: En-De
  • WuDao-WenHui [BAAI] Mar. 2021 [open]

    Field: Language
    Params: 2.9B
    Training Data: 303GB (Chinese)
  • GLM [BAAI] Mar. 2021 [open]
    GLM: General Language Model Pretraining with Autoregressive Blank Infilling [Preprint]

    Field: Language
    Params: 10B
    Architecture: De
  • Switch Transformer [Google] Jan. 2021 [open]
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Preprint]

    Field: Language
    Params: 1.6T
    Training Data: 750GB
    Training petaFLOPs: 82M
    Architecture: En-De, MoE
    Objective: MLM
  • CPM [BAAI] Dec. 2020 [open]
    CPM: A Large-scale Generative Chinese Pre-trained Language Model [Preprint]

    Field: Language
    Params: 2.6B
    Training Data: 100G (Chinese)
    Training petaFLOPs: 1.8M
    Architecture: De
    Objective: LTR
  • GPT-3 [OpenAI] May 2020 [close]
    Language Models are Few-Shot Learners [NeurIPS'20]

    Field: Language
    Params: 175B
    Training Data: 45TB (680B Tokens)
    Training Time: 95 A100 GPU years (835,584 A100 GPU hours, 355 V100 GPU years)
    Training Cost: $4.6M
    Training petaFLOPs: 310M
    Architecture: De
    Objective: LTR
    Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
  • Blender [Meta] Apr. 2020 [close]
    Recipes for building an open-domain chatbot [Preprint]

    Field: Language (Dialogue)
    Params: 9.4B
  • T-NLG [Microsoft] Feb. 2020 [close]

    Field: Language
    Params: 17B
    Training petaFLOPs: 16M
    Architecture: De
    Objective: LTR
  • Meena [Google] Jan. 2020 [close]
    Towards a Human-like Open-Domain Chatbot [Preprint]

    Field: Language (Dialogue)
    Params: 2.6B
    Training Data: 341GB (40B words)
    Training petaFLOPs: 110M
  • DialoGPT [Microsoft] Nov. 2019 [open]
    DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [ACL'20]

    Field: Language (Dialogue)
    Params: 762M
    Training Data: (147M conversations)
    Architecture: De
  • T5 [Google] Oct. 2019 [open]
    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [JMLR'20]

    Field: Language
    Params: 11B
    Training Data: 800GB
    Training Cost: $1.5M
    Training petaFLOPs: 41M
    Architecture: En-De
    Objective: MLM
    License: Apache 2.0
    Instruction-tuned Variants: Flan-T5
    Context Length: 512
  • Megatron-LM [Nvidia] Sept. 2019 [open]
    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]

    Field: Language
    Params: 8.3B
    Training Data: 174GB
    Training petaFLOPs: 9.1M
    Architecture: De
    Objective: LTR
    Training Framework: Megatron
  • Megatron-BERT [Nvidia] Sept. 2019 [open]
    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Preprint]

    Field: Language
    Params: 3.9B
    Training Data: 174GB
    Training petaFLOPs: 57M
    Architecture: En
    Objective: MLM
    Training Framework: Megatron
  • RoBERTa [Meta] July 2019 [open]
    RoBERTa: A Robustly Optimized BERT Pretraining Approach [Preprint]

    Field: Language
    Params: 354M
    Training Data: 160GB
    Training Time: 1024 V100 GPU days
    Architecture: En
    Objective: MLM
  • XLNet [Google] June 2019 [open]
    XLNet: Generalized Autoregressive Pretraining for Language Understanding [NeurIPS'19]

    Field: Language
    Params: 340M
    Training Data: 113GB (33B words)
    Training Time: 1280 TPUv3 days
    Training Cost: $245k
    Architecture: En
    Objective: PLM
  • GPT-2 [OpenAI] Feb. 2019 [open]
    Language Models are Unsupervised Multitask Learners [Preprint]

    Field: Language
    Params: 1.5B
    Training Data: 40GB (8M web pages)
    Training Cost: $43k
    Training petaFLOPs: 1.5M
    Architecture: De
    Objective: LTR
  • BERT [Google] Oct. 2018 [open]
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NAACL'19]

    Field: Language
    Params: 330M
    Training Data: 16GB (3.3B words)
    Training Time: 64 TPUv2 days (280 V100 GPU days)
    Training Cost: $7k
    Training petaFLOPs: 290k
    Architecture: En
    Objective: MLM, NSP
  • GPT [OpenAI] June 2018 [open] Improving Language Understanding by Generative Pre-Training [Preprint]

    Field: Language
    Params: 117M
    Training Data: 1GB (7k books)
    Training petaFLOPs: 18k
    Architecture: De
    Objective: LTR
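
For the models tagged [open] above, the released checkpoints can generally be loaded with the Hugging Face transformers library. The sketch below is a minimal example under stated assumptions, not an official recipe: the Falcon model id, dtype, and device placement are illustrative, and each checkpoint's own license and model card apply.

```python
# Minimal sketch (assumption-laden, not an official recipe): loading an [open]
# checkpoint from the list above with Hugging Face `transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # example id; other open checkpoints load the same way

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so a 7B model fits on a single modern GPU
    device_map="auto",          # requires the `accelerate` package for automatic placement
)

inputs = tokenizer("Large language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The instruction-tuned variants listed under each entry (Alpaca, Vicuna, BLOOMZ, Flan-T5, etc.) are loaded the same way under their own model ids.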

Vision Models

  • EVA-02-E [BAAI] Mar. 2023 [open]
    EVA-02: A Visual Representation for Neon Genesis [Preprint]

    Field: Vision-Language
    Params: 5B
    Training Data: 2B image-text pairs
    Architecture: Transformer
    Objective: MIM, CLIP Contrastive
  • MAE->WSP-2B [Meta] Mar. 2023 [close]
    The effectiveness of MAE pre-pretraining for billion-scale pretraining [Preprint]

    Field: Vision
    Params: 6.5B
    Training Data: 3B images
    Architecture: Transformer
    Objective: MAE, Weakly-Supervised
  • OpenCLIP G/14 [LAION] Mar. 2023 [open]

    Field: Vision-Language
    Params: 2.5B
    Training Data: 2B images
  • ViT-22B [Google] Feb. 2023 [close]
    Scaling Vision Transformers to 22 Billion Parameters

    Field: Vision
    Params: 22B
    Training Data: 4B images
    Architecture: Transformer
    Objective: Supervised
  • InternImage-G [Shanghai AI Lab] Nov. 2022 [open] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [CVPR'23 Highlight]

    Field: Vision
    Params: 3B
    Architecture: CNN
    Core Operator: Deformable Convolution v3
  • Stable Diffusion [Stability AI] Aug. 2022 [open]

    Field: Image Generation (text to image)
    Params: 890M
    Training Data: 5B images
    Architecture: Transformer, Diffusion
  • Imagen [Google] May 2022
    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Preprint]

    Field: Image Generation (text to image)
    Text Encoder: T5
    Image Decoder: Diffusion, Upsampler
  • Flamingo [DeepMind] Apr. 2022 [close]
    Flamingo: a Visual Language Model for Few-Shot Learning [Preprint]

    Field: Vision-Language
    Params: 80B
  • DALL·E 2 [OpenAI] Apr. 2022
    Hierarchical Text-Conditional Image Generation with CLIP Latents [Preprint]

    Field: Image Generation (text to image)
    Text Encoder: GPT2 (CLIP)
    Image Encoder: ViT (CLIP)
    Image Decoder: Diffusion, Upsampler
  • BaGuaLu [BAAI, Alibaba] Apr. 2022
    BaGuaLu: targeting brain scale pretrained models with over 37 million cores [PPoPP'22]

    Field: Vision-Language
    Params: 174T
    Architecture: M6
  • SEER [Meta] Feb. 2022 [open]
    Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [Preprint]

    Field: Vision
    Params: 10B
    Training Data: 1B images
    Architecture: Convolution
    Objective: SwAV
  • ERNIE-ViLG [Baidu] Dec. 2021
    ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [Preprint]

    Field: Image Generation (text to image)
    Params: 10B
    Training Data: 145M text-image pairs
    Architecture: Transformer, dVAE + De
  • NUWA [Microsoft] Nov. 2021 [open]
    NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [Preprint]

    Field: Vision-Language
    Generation: Image, Video
    Params: 870M
  • SwinV2-G [Microsoft] Nov. 2021 [open]
    Swin Transformer V2: Scaling Up Capacity and Resolution [CVPR'22]

    Field: Vision
    Params: 3B
    Training Data: 70M images
    Architecture: Transformer
    Objective: Supervised
  • Zidongtaichu [CASIA] Sept. 2021 [close]

    Field: Image, Video, Language, Speech
    Params: 100B
  • ViT-G/14 [Google] June 2021
    Scaling Vision Transformers [Preprint]

    Field: Vision
    Params: 1.8B
    Training Data: 300M images
    Training petaFLOPs: 3.4M
    Architecture: Transformer
    Objective: Supervised
  • CoAtNet [Google] June 2021 [open]
    CoAtNet: Marrying Convolution and Attention for All Data Sizes [NeurIPS'21]

    Field: Vision
    Params: 2.4B
    Training Data: 300M images
    Architecture: Transformer, Convolution
    Objective: Supervised
  • V-MoE [Google] June 2021
    Scaling Vision with Sparse Mixture of Experts [NeurIPS'21]

    Field: Vision
    Params: 15B
    Training Data: 300M images
    Training Time: 16.8k TPUv3 days
    Training petaFLOPs: 33.9M
    Architecture: Transformer, MoE
    Objective: Supervised
  • CogView [BAAI, Alibaba] May 2021
    CogView: Mastering Text-to-Image Generation via Transformers [NeurIPS'21]

    Field: Vision-Language
    Params: 4B
    Training Data: 30M text-image pairs
    Training petaFLOPs: 27M
    Image Encoder: VAE
    Text Encoder & Image Decoder: GPT2
  • M6 [Alibaba] Mar. 2021
    M6: A Chinese Multimodal Pretrainer [Preprint]

    Field: Vision-Language
    Params: 10T
    Training Data: 300G Texts + 2TB Images
    Training petaFLOPs: 5.5M
    Fusion: Single-stream
    Objective: MLM, IC
  • DALL·E [OpenAI] Feb. 2021
    Zero-Shot Text-to-Image Generation [ICML'21]

    Field: Image Generation (text to image)
    Params: 12B
    Training Data: 250M text-image pairs
    Training petaFLOPs: 47M
    Image Encoder: dVAE
    Text Encoder & Image Decoder: GPT2
  • CLIP [OpenAI] Jan. 2021
    Learning Transferable Visual Models From Natural Language Supervision [ICML'21]

    Field: Vision-Language
    Training Data: 400M text-image pairs
    Training petaFLOPs: 11M
    Image Encoder: ViT
    Text Encoder: GPT-2
    Fusion: Dual Encoder
    Objective: CMCL (a minimal loss sketch follows this list)
  • ViT-H/14 [Google] Oct. 2020 [open]
    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR'21]

    Field: Vision
    Params: 632M
    Training Data: 300M images
    Training petaFLOPs: 13M
    Architecture: Transformer
    Objective: Supervised
  • iGPT-XL [OpenAI] June 2020 [open]
    Generative Pretraining From Pixels [ICML'20]

    Field: Image Generation
    Params: 6.8B
    Training Data: 1M images
    Training petaFLOPs: 33M
    Architecture: Transformer, De
  • BigGAN-deep [DeepMind] Sept. 2018 [open]
    Large Scale GAN Training for High Fidelity Natural Image Synthesis [ICLR'19]

    Field: Image Generation
    Params: 158M
    Training Data: 300M images
    Training petaFLOPs: 3M
    Architecture: Convolution, GAN
    Resolution: 512x512
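
Several vision-language entries above (CLIP, OpenCLIP, EVA-02) list CMCL, the symmetric image-text contrastive objective, together with a dual-encoder fusion. The snippet below is a minimal sketch of that loss with assumed shapes (batch of 8, 512-dim embeddings) and an illustrative temperature; it is not any of these models' actual implementation.

```python
# Sketch of a CLIP-style CMCL objective: images and texts are embedded by
# separate encoders, and a symmetric cross-entropy is applied to the scaled
# cosine-similarity matrix. Shapes and the temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize both modalities so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity logits; matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: image-to-text and text-to-image classification.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```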

Reinforcement Learning

  • PaLM-E [Google] Mar. 2023 [close]
    PaLM-E: An Embodied Multimodal Language Model [Preprint]

    Field: Reinforcement Learning
    Params: 562B (540B LLM + 22B ViT)
  • Gato [DeepMind] May 2022 [close]
    A Generalist Agent [Preprint]

    Field: Reinforcement Learning
    Params: 1.2B
    Training Data: (604 Tasks)
    Objective: Supervised

Speech

  • USM [Google] Mar. 2023 [close]
    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [Preprint]

    Field: Speech
    Params: 2B
    Training Data: 12,000,000 hours
  • Whisper [OpenAI] Sept. 2022 [open]
    Robust Speech Recognition via Large-Scale Weak Supervision [Preprint]

    Field: Speech
    Params: 1.55B
    Training Data: 680,000 hours
    Objective: Weakly Supervised (a transcription sketch follows this list)
  • HuBERT [Meta] June 2021 [open]
    HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [Preprint]

    Field: Speech
    Params: 1B
    Training Data: 60,000 hours
    Objective: MLM
  • wav2vec 2.0 [Meta] Oct. 2020 [open]
    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [NeurIPS'20]

    Field: Speech
    Params: 317M
    Training Data: 50,000 hours
    Training petaFLOPs: 430M
    Objective: MLM
  • DeepSpeech 2 [Baidu] Dec. 2015 [open]
    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [ICML'16]

    Field: Speech
    Params: 300M
    Training Data: 21,340 hours
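
As a usage note for the open speech models above, the sketch below transcribes an audio file with the released Whisper checkpoints via the openai-whisper package; the "base" size and the file path are placeholder assumptions.

```python
# Minimal sketch: speech recognition with the open Whisper checkpoints via the
# `openai-whisper` package. The "base" size and audio path are placeholders.
import whisper

model = whisper.load_model("base")              # larger checkpoints trade speed for accuracy
result = model.transcribe("example_audio.wav")  # any local audio file (assumed to exist)
print(result["text"])                           # the decoded transcript
```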
    

Science

  • AlphaFold 2 [DeepMind] July 2021 [open]
    Highly accurate protein structure prediction with AlphaFold [Nature]

    Field: Biology
    Params: 21B
    Training petaFLOPs: 100k

Open LLM Training Dataset

This section will be reorganized. For now, since LLMs prevail and data quality is key to their performance, we keep track of open pretraining datasets here (a minimal loading sketch follows the list).

  • SlimPajama: 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
  • RefinedWeb: ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
  • MNBVC: 5TB (on-going, target 40TB), Chinese, MIT License
  • The Pile: 825GB
  • RedPajama: 1.2T tokens
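
A hedged sketch of reading one of these corpora with the Hugging Face datasets library, streaming so nothing has to be downloaded in full. The dataset id and the "text" field follow the public SlimPajama release and are assumptions; the other corpora use their own ids and schemas.

```python
# Minimal sketch (assumed dataset id and schema): streaming an open pretraining
# corpus with the Hugging Face `datasets` library instead of downloading it.
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record["text"][:200])  # raw document text; metadata fields vary by corpus
    if i == 2:                   # peek at the first few documents only
        break
```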

Distributed Training Framework

Deep Learning frameworks supporting distributed training are marked with *.

PyTorch Ecosystem

XLA Ecosystem

Other Frameworks

Inference Frameworks

Recommendation Training Framework

  • HET [Tencent] Dec. 2021
    HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [VLDB'22]

  • Persia [Kuaishou] Nov. 2021
    Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [Preprint]

    Embeddings Params: 100T
  • ZionEX [Meta] Apr. 2021
    Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [ISCA'21]

    Embeddings Params: 10T
  • ScaleFreeCTR [Huawei] Apr. 2021
    ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [SIGIR'21]

  • Kraken [Kuaishou] Nov. 2020
    Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [SC'20]

  • TensorNet [Qihoo360] Sept. 2020 [open]

  • HierPS [Baidu] Mar. 2020
    Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [MLSys'20]

  • AIBox [Baidu] Oct. 2019
    AIBox: CTR Prediction Model Training on a Single Node [CIKM'20]

    Embeddings Params: 0.1T
  • XDL [Alibaba] Aug. 2019
    XDL: an industrial deep learning framework for high-dimensional sparse data [DLP-KDD'21]

    Embeddings Params: 0.01T

Keys Explanations

  • Company tags: the primary company or lab behind the model; other institutes may also be involved.
  • Params: number of parameters of the largest model
  • Training data size, training cost, and training petaFLOPs are estimates and may carry some uncertainty.
  • Training cost (reference rates; a worked example follows this list)
    • TPUv2 hour: $4.5
    • TPUv3 hour: $8
    • V100 GPU hour: $0.55 (2022)
    • A100 GPU hour: $1.10 (2022)
  • Architecture
    • En: Encoder-based Language Model
    • De: Decoder-based Language Model
    • En-De: Encoder-Decoder-based Language Model
    • The above three architectures are all Transformer-based.
    • MoE: Mixture of Experts
  • Objective (See explanation in section 6–8 of this paper)
    • MLM: Masked Language Modeling
    • LTR: Left-To-Right Language Modeling
    • NSP: Next Sentence Prediction
    • PLM: Permuted Language Modeling
    • IC: Image Captioning
    • VLM: Vision-Language Matching
    • CMCL: Cross-Modal Contrastive Learning
  • FLOPs: number of FLOating-Point operations [explanation]
    • 1 petaFLOPs = 1e15 FLOPs
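
A worked example of these conventions, using figures already quoted in this list (OPT's GPU hours and GPT-3's petaFLOPs) together with the 2022 reference rates above; the results are back-of-the-envelope estimates, not reported invoices.

```python
# Worked example of the cost and FLOPs conventions above, using numbers quoted
# elsewhere in this list. Estimates only.
A100_HOUR_USD = 1.10      # A100 GPU hour (2022 reference rate above)
PETA = 1e15               # 1 petaFLOPs = 1e15 FLOPs

opt_gpu_hours = 809_472   # OPT entry: 992 80G-A100 x 34 days
estimated_cost = opt_gpu_hours * A100_HOUR_USD
print(f"OPT-175B estimated compute cost: ${estimated_cost:,.0f}")    # ~$890k

gpt3_petaflops = 310e6    # GPT-3 entry: 310M training petaFLOPs
print(f"GPT-3 training compute: {gpt3_petaflops * PETA:.2e} FLOPs")  # ~3.1e+23
```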