
MaLLaM 🌙 Malaysia Large Language Model


Please make sure you are logged in to a GitHub account to see the images.

MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation model, trained on 349GB of JSONL, equivalent to 90 billion tokens.

We released 3 different sizes:

  1. 1.1B Parameters, https://huggingface.co/mesolitica/mallam-1.1B-4096
  2. 3B Parameters, https://huggingface.co/mesolitica/mallam-3B-4096
  3. 5B Parameters, https://huggingface.co/mesolitica/mallam-5B-4096

What is a Foundation Model?

It is a non-technical term for a pretrained model, https://en.wikipedia.org/wiki/Foundation_models, but in this context it is a large language model (causal language model) trained on a massive dataset.

The term Foundation Model has been used in legal contexts, including:

  1. United States, the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.
  2. European Union, the European Parliament’s negotiated position on the E.U. AI Act.
  3. United Kingdom, the Competition and Markets Authority’s AI Foundation Models: Initial Report.

Why did we start it?

After https://github.com/huseinzol05, https://github.com/aisyahrzk and https://github.com/KamarulAdha scraped Malaysian websites for almost 4 months, we decided to train our own Malaysian Foundation Model from scratch.

Dataset

The main idea of this scraping was to prepare hyperlocalization for future LLM development; we want the LLM to be able to understand local slang, Manglish and so on. That future is here.

How big is the dataset?

We use 5 datasets:

  1. Dedup text dataset.
  2. Extra dedup text dataset (last minute research papers dataset).
  3. Filtered StarCoder dataset.
  4. Instruction dataset.
  5. Madlad-400 MS dataset, https://huggingface.co/datasets/allenai/MADLAD-400

In total, 349GB of JSONL, equivalent to 90 billion tokens using our custom tokenizer.

Text dedup dataset

You can check the list of websites we gathered at https://github.com/users/huseinzol05/projects/1. Each dataset already includes the notebooks showing how to reproduce the collection.

Lowyat, c.cari.com.my, b.cari.com.my, Carigold, news sites, everything is in there.

Extra dedup text dataset

  1. https://huggingface.co/datasets/syafie-nzm/crawl-jurnaldbp/resolve/main/jurnaldbp.jsonl
  2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpharm.org.jsonl
  3. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjgeosc.com.jsonl
  4. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjsustainagri.com.jsonl
  5. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akademisains.gov.my.jsonl
  6. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/crossref-pdf.jsonl
  7. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/Kamus_Dewan_Bahasa_Edisi_Keempat_pdf.pdf
  8. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/melayu-pdf.jsonl
  9. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majcafe.com.jsonl
  10. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjms.mohe.gov.my.jsonl
  11. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/newera.edu.my.jsonl

All of these come from local journals and Crossref, filtered using the keywords:

  1. malay
  2. malaysia
  3. melayu

Filtered StarCoder

The original dataset is at https://huggingface.co/datasets/bigcode/the-stack-dedup

We only pick specific programming languages:

  1. Python
  2. Julia
  3. C
  4. C++
  5. HTML
  6. CSS
  7. JavaScript
  8. Go
  9. Rust
  10. Java
  11. SQL
  12. Markdown
  13. R
  14. Dockerfile
  15. Ruby
  16. Typescript
  17. YAML

Each language is capped at 10GB.
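
A minimal sketch of how such a per-language cap can be applied, assuming the-stack-dedup keeps its per-language data/<language> layout and a content column (the dataset is gated, so you need to be logged in to the HuggingFace Hub); the actual filtering script may differ.

import json
from datasets import load_dataset

MAX_BYTES = 10 * 1024 ** 3  # 10GB cap per language

# stream one language; repeat with data_dir='data/julia', 'data/c', ... for the rest
dataset = load_dataset(
    'bigcode/the-stack-dedup',
    data_dir='data/python',
    split='train',
    streaming=True,
)

collected = 0
with open('the-stack-python.jsonl', 'w') as fout:
    for row in dataset:
        text = row['content']
        collected += len(text.encode('utf-8'))
        if collected > MAX_BYTES:
            break
        fout.write(json.dumps({'text': text}) + '\n')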

Instruction dataset

  1. https://huggingface.co/datasets/mesolitica/translated-glaive-code-assistant-v2/resolve/main/glaive_code_assistant_v2.translated.jsonl
  2. https://huggingface.co/datasets/mesolitica/translated-sql-create-context/resolve/main/sql_create_context_v4.translated.jsonl
  3. https://huggingface.co/datasets/mesolitica/translated-MetaMathQA/resolve/main/metamathqa.jsonl
  4. https://huggingface.co/datasets/mesolitica/translated-competition_math/resolve/main/gather-competition-math.jsonl
  5. https://huggingface.co/datasets/mesolitica/translated-MathInstruct/resolve/main/math-instruct.jsonl
  6. https://huggingface.co/datasets/mesolitica/translated-math_qa/resolve/main/math-qa.jsonl.translated
  7. https://huggingface.co/datasets/mesolitica/translated-mini-math23k-v1/resolve/main/mini-math23k.jsonl.requested
  8. https://huggingface.co/datasets/mesolitica/translated-WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.translated.jsonl
  9. https://huggingface.co/datasets/mesolitica/translated-unnatural_code_instructions_20M/resolve/main/unnatural-instructions.jsonl.requested
  10. https://huggingface.co/datasets/mesolitica/translated-code-context/resolve/main/code_context.jsonl.t5.translated
  11. https://huggingface.co/datasets/mesolitica/translated-python-evol-instruct-51k/resolve/main/python_evol_instruct_51k.jsonl.requested
  12. https://huggingface.co/datasets/mesolitica/dependency-parsing-instructions/resolve/main/dependency.jsonl
  13. https://huggingface.co/datasets/mesolitica/constituency-parsing-instructions/resolve/main/constituency.jsonl
  14. https://huggingface.co/datasets/mesolitica/kesalahan-tatabahasa-choice/resolve/main/kesalahan-tatabahasa-choice.jsonl
  15. https://huggingface.co/datasets/mesolitica/ms-wikipedia-choice/resolve/main/qa-ms-wikipedia.jsonl
  16. https://huggingface.co/datasets/mesolitica/dewanbahasa-jdbp-choice/resolve/main/qa-dewanbahasa-jdbp.jsonl
  17. https://huggingface.co/datasets/mesolitica/majalahsains-choice/resolve/main/qa-majalahsains.jsonl
  18. https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/jawi-rumi.jsonl
  19. https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/rumi-jawi.jsonl
  20. https://huggingface.co/datasets/mesolitica/ayat-aktif-pasif-instructions/resolve/main/synthetic-ayat-aktif-pasif.jsonl
  21. https://huggingface.co/datasets/mesolitica/maksud-instructions/resolve/main/maksud.jsonl
  22. https://huggingface.co/datasets/mesolitica/google-translate-camel-ai
  23. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1
  24. https://huggingface.co/datasets/mesolitica/malaysian-ultrachat
  25. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-pt3online.jsonl
  26. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-upsr.jsonl
  27. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalanspm.jsonl

These consist of translated instruction datasets, synthetic instruction datasets and crawled instruction datasets.

Postprocessing

After we gathered all the datasets, we deduplicated them and did simple postprocessing:

  1. Deduplicated at 95% similarity.
  2. Capped \n, \r to a maximum of 6 consecutive characters (a sketch follows below).
  3. Removed any related HTML error texts.

All steps to reproduce are at https://github.com/malaysia-ai/dedup-text-dataset
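
A minimal sketch of the newline-capping step, assuming each JSONL row stores its document under a text key (the key name and file names are assumptions); the real pipeline lives in the dedup-text-dataset repository above.

import json
import re

def cap_newlines(text, max_consecutive=6):
    # collapse runs of more than 6 consecutive \n or \r down to exactly 6
    text = re.sub(r'\n{%d,}' % (max_consecutive + 1), '\n' * max_consecutive, text)
    text = re.sub(r'\r{%d,}' % (max_consecutive + 1), '\r' * max_consecutive, text)
    return text

with open('dedup-text.jsonl') as fopen, open('postprocessed.jsonl', 'w') as fout:
    for line in fopen:
        row = json.loads(line)
        row['text'] = cap_newlines(row['text'])
        fout.write(json.dumps(row) + '\n')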

Train our own tokenizer

We trained 2 kinds of tokenizers with a maximum vocab size of 32k:

  1. BPE, https://huggingface.co/malaysia-ai/bpe-tokenizer
  2. SentencePiece, https://huggingface.co/malaysia-ai/sentencepiece-tokenizer

Both were trained on the same amount of data, covering the following languages:

  1. Malay
  2. Mandarin
  3. Tamil
  4. Jawi
  5. English
  6. Arabic

Steps to reproduce the tokenizer training are at https://github.com/malaysia-ai/prepare-tokenizer; we use https://huggingface.co/docs/tokenizers/index from HuggingFace (a minimal training sketch is shown after the next list).

Between BPE and SentencePiece, we proceeded with BPE because:

  1. We found that newline characters get messed up in the SentencePiece tokenizer we trained.
  2. We found that some Tamil and Jawi characters are missing in the SentencePiece tokenizer we trained.
  3. SentencePiece is super slow on very long texts.
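
A minimal sketch of what the BPE training looks like with the HuggingFace tokenizers library; the file names, special tokens and pre-tokenizer choice here are assumptions, see https://github.com/malaysia-ai/prepare-tokenizer for the actual scripts.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token='<unk>'))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>'],
)

# plain-text files covering Malay, Mandarin, Tamil, Jawi, English and Arabic
files = ['malay.txt', 'mandarin.txt', 'tamil.txt', 'jawi.txt', 'english.txt', 'arabic.txt']
tokenizer.train(files=files, trainer=trainer)
tokenizer.save('bpe-tokenizer.json')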

Why train our own tokenizers?

We want to reduce the number of tokens required during input and output. Let's compare using the Malaysian Ultrachat AstroAwani dataset, https://huggingface.co/datasets/mesolitica/malaysian-ultrachat

# !wget https://huggingface.co/datasets/mesolitica/malaysian-ultrachat/resolve/main/ultrachat-astroawani-malay.jsonl

import json
from tqdm import tqdm
from transformers import AutoTokenizer

tokenizer_mallam = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
tokenizer_llama2 = AutoTokenizer.from_pretrained('mesolitica/llama-7b-hf-2048-fpf')
tokenizer_mistral = AutoTokenizer.from_pretrained('mesolitica/mistral-7b-4096-fpf')

# accumulate how many tokens each tokenizer needs for the same Malay texts
mallam, llama2, mistral = 0, 0, 0
with open('ultrachat-astroawani-malay.jsonl') as fopen:
    for l in tqdm(fopen):
        l = json.loads(l)
        for r in l[1:]:
            if r['content_ms']:
                mallam += len(tokenizer_mallam(r['content_ms'])['input_ids'])
                llama2 += len(tokenizer_llama2(r['content_ms'])['input_ids'])
                mistral += len(tokenizer_mistral(r['content_ms'])['input_ids'])

print(mallam, llama2, mistral)
26157664 60391551 60823929

We are able to reduce the token count to roughly 43% of what the Llama2 and Mistral tokenizers produce (about a 57% reduction); the notebook is at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/compare-tokens.ipynb

Tokenizing dataset

As we mentioned, the accumulated text size is 349GB of JSONL, equivalent to 90B tokens. Tokenizing such a huge text corpus is not easy; we explain the tokenizing process and pain points (which we eventually solved) at https://github.com/malaysia-ai/dedup-text-dataset/tree/main/pretrain-llm

Total tokens:

  1. prepare-dedup-text-dataset-4096.ipynb, 31702310912
  2. prepare-starcoder-4096.ipynb, 40981254144
  3. prepare-madlad-400-4096.ipynb, 14983720960
  4. prepare-instructions.ipynb, 1577877504
  5. prepare-extra.ipynb, 1140461568

In total, ~90B tokens. We uploaded the tokenized dataset at https://huggingface.co/datasets/malaysia-ai/mosaic-combine-all, so you can use it directly with https://docs.mosaicml.com/projects/streaming/en/latest/index.html
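
A minimal sketch of reading the MDS shards with MosaicML Streaming, assuming you have downloaded https://huggingface.co/datasets/malaysia-ai/mosaic-combine-all to a local directory (the directory name below is a placeholder).

from streaming import StreamingDataset

# point local= at the downloaded shards; no remote= is needed for a local copy
dataset = StreamingDataset(local='./mosaic-combine-all', shuffle=False)

print(len(dataset))   # number of samples
sample = dataset[0]   # one sample as a dict of columns
print(sample.keys())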

GPU Infrastructure

We use 10 nodes of 8x A100 GPUs each, https://azureprice.net/vm/Standard_ND96amsr_A100_v4.

Multinode training

Why Ray?

We use a Ray cluster for multinode training. Why Ray? If you are used to a standard torch distributed setup, every node has to launch its own command, for example:

torchrun --nnodes=3 --nproc_per_node=8 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py
torchrun --nnodes=3 --nproc_per_node=8 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train.py
torchrun --nnodes=3 --nproc_per_node=8 --node_rank=2 --master_addr=10.0.0.1 --master_port=29500 train.py

Each node must run the training script manually. Yes, we could run the script during pod startup, but that is not really practical. So we use Ray! As long as the workers are connected to the head node, you just run one script that connects to the head, and you are done.
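
A minimal sketch (not our actual training entry point) of driving every GPU worker from a single script once they have joined the Ray head; the Service DNS name below is an assumption that depends on your cluster manifests.

import ray

# connect to the Ray head through its Kubernetes Service DNS name
ray.init(address='ray://ray-head.default.svc.cluster.local:10001')

@ray.remote(num_gpus=1)
def which_node():
    # each task lands on a GPU worker that has joined the head
    import socket
    return socket.gethostname()

# launched once from the driver, no manual command on every worker
print(ray.get([which_node.remote() for _ in range(8)]))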

Why Kubernetes?

  1. Because we use spot instances to reduce cost by up to 60%! Managed Kubernetes can autoscale back to the required node count if any instance dies.
  2. Attaching a ReadWriteMany volume is easier in Kubernetes; we want every worker to share the same directory to save and load checkpoints.
  3. Internal Kubernetes DNS: we can connect to the Ray cluster simply by using a Kubernetes Service.

Where is the battle-tested Ray cluster?

We explain more about our battle testing and pain points (which we eventually solved) at https://github.com/malaysia-ai/jupyter-gpu/tree/main/ray:

  1. We are able to use DeepSpeed ZeRO-3 across workers; this lets us train larger models when we have more GPUs.
  2. We use the HuggingFace Trainer API 100%, with our own fork.
  3. Automatic recovery from corrupted checkpoints.

How about the cluster deployments?

We created a Kubernetes node group with a count of 10 https://azureprice.net/vm/Standard_ND96amsr_A100_v4 instances, and deployed 2 Ray clusters.

  1. First cluster.
  2. Second cluster.

Why create 2 clusters? Because we use spot instances; if a head node dies, all scripts on that cluster crash, so it is better to have 2 head nodes.

Training scripts and sessions

  1. 1.1B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#11b-4096-context-length
  • 20 workers, equal to 20 GPUs.
  2. 3B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#3b-4096-context-length
  • 20 workers, equal to 20 GPUs.
  3. 5B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#5b-4096-context-length
  • 40 workers, equal to 40 GPUs.

All use the same configs.

You can check them at https://github.com/mesolitica/malaya/blob/5.1/pretrained-model/mistral/train.py#L347

WandB

  1. 1.1B, https://wandb.ai/mesolitica/pretrain-mistral-1.1b?workspace=user-husein-mesolitica
  2. 3B, https://wandb.ai/mesolitica/pretrain-mistral-3b?workspace=user-husein-mesolitica
  3. 5B, https://wandb.ai/mesolitica/pretrain-mistral-5b?workspace=user-husein-mesolitica

We created a WandB report at https://wandb.ai/mesolitica/pretrain-mistral-3b/reports/Pretrain-Larger-Malaysian-Mistral--Vmlldzo2MDkyOTgz

[Screenshot: WandB training report]

Training hiccup

This only happened to the 5B model; if you look at the graph,

[Screenshot: 5B training loss graph]

This usually happens for bigger models trained with a smaller batch size. Bigger models learn faster; if your batch size is not big enough and the model encounters a different kind of data sample (like code texts), this can cause a sudden loss spike that is slow to recover from. To solve this problem:

  1. Reshuffle the dataset by changing the torch seed or the dataset indices. We did not do this; we were worried the model would miss some of the dataset. (Reference needed.)
  2. Temporarily reduce the learning rate and revert to an older checkpoint; once stable enough, go back to the original learning rate (a rough sketch follows below). This is what we did.
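
A minimal sketch of option 2, written as a plain PyTorch loop rather than our actual Trainer-based setup; the model, checkpoint path and learning-rate values are placeholders.

import torch

# stand-ins for the real model and optimizer
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# revert to the last checkpoint saved before the loss spike (hypothetical file)
state = torch.load('last-good-checkpoint.pt')
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])

# temporarily reduce the learning rate
for group in optimizer.param_groups:
    group['lr'] *= 0.5

# ... continue training; once the loss is stable again, restore the original LR
for group in optimizer.param_groups:
    group['lr'] = 2e-4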

GPUs failures

We hit GPU failures 3 times:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

And if you try to run nvidia-smi in the Ray and nvidia-device-plugin-daemonset pods, it just hangs.

This can be one of 2 issues:

  1. nvidia-device-plugin-daemonset issue.
  2. hardware failure.

For the first and second failures, restarting nvidia-device-plugin-daemonset for the particular node solved the problem, but for the third we had to delete the node and pray the cloud provider gave us a new one:

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>

Benchmarking

After 1 epoch was done, we benchmarked on the Tatabahasa dataset,

[Screenshot: Tatabahasa benchmark results]

With only 90B tokens and only 80 GPUs, MaLLaM 🌙 is able to beat bigger models.

Our Malay LLM benchmark leaderboard is at https://huggingface.co/spaces/mesolitica/malay-llm-leaderboard

Total cost

[Screenshot: total cloud cost]

Because we use spot instances, the total is 17k USD to train 3 different model sizes, running 24/7 for 10 days.
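
As a rough sanity check, assuming all 80 GPUs ran continuously: 80 GPUs × 10 days × 24 hours = 19,200 GPU-hours, so 17,000 USD / 19,200 GPU-hours ≈ 0.89 USD per A100-hour on spot pricing.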

So, what next?

  1. This is a foundation model, so it cannot do much on its own. We have to finetune MaLLaM 🌙 on instruction datasets so it becomes like ChatGPT, able to hold almost human-like conversations and do multiturn QA. We gathered instruction datasets at https://huggingface.co/collections/mesolitica/malay-instructions-dataset-655ad6d1e0a6202d36868e4f and https://huggingface.co/collections/mesolitica/malaysian-synthetic-dataset-656c2673fe7fe0b1e9e25fe2
  2. Multimodal: we are preparing datasets for Vision and Speech.

2024-01-11 update

  1. Released finetuned MaLLaM 🌙 1.1B on instruction datasets, https://huggingface.co/mesolitica/mallam-1.1b-20k-instructions
  2. Released finetuned MaLLaM 🌙 5B on instruction datasets, https://huggingface.co/mesolitica/mallam-5b-20k-instructions
  3. Vision instruction dataset is done, https://huggingface.co/collections/mesolitica/vision-malaysian-llm-653a16214037a1bc4417eb3a
  4. Audio encoder is done, https://huggingface.co/collections/mesolitica/malaysian-whisper-6590b6b733d72b44f0cfae79
  5. Audio instruction dataset is done, https://huggingface.co/collections/mesolitica/audio-malaysian-llm-6590b69ee7c71d6d9e209104

Collaboration

No research papers have been produced for this development yet; we are open to any kind of collaboration:

  1. Datasets: if we are able to get up to 150B tokens, we will train 7B or more parameters.
  2. Research papers! We can provide GPUs if you are interested in writing a paper.

Contribution

  1. Special thanks to https://github.com/aisyahrzk for contributing scraped datasets and training scripts.
  2. Special thanks to the other Malaysia-AI volunteers; without you guys, MaLLaM 🌙 development would be really slow.