MaLLaM 🌙 Malaysia Large Language Model
Please make sure you are logged in to your GitHub account to see the images.
MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation model trained on 349GB of JSONL, equivalent to 90 billion tokens.
We released 3 different sizes:
- 1.1B Parameters, https://huggingface.co/mesolitica/mallam-1.1B-4096
- 3B Parameters, https://huggingface.co/mesolitica/mallam-3B-4096
- 5B Parameters, https://huggingface.co/mesolitica/mallam-5B-4096
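The released checkpoints follow the standard transformers causal language model format, so they can be loaded directly; below is a minimal sketch (the Malay prompt and generation settings are only illustrative, not recommended defaults):

```python
# Minimal sketch of loading a released MaLLaM checkpoint with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'mesolitica/mallam-1.1B-4096'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Simple completion; generation settings here are illustrative only.
inputs = tokenizer('Kuala Lumpur ialah', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```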
A foundation model is a non-technical term for a pretrained model (https://en.wikipedia.org/wiki/Foundation_models); in this context, it is a large language model (causal language model) trained on a massive dataset.
The term Foundation Model has been used legally in:
- United States, the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.
- European Union, the European Parliament's negotiated position on the E.U. AI Act.
- United Kingdom, the Competition and Markets Authority's AI Foundation Models: Initial Report.
After https://github.com/huseinzol05, https://github.com/aisyahrzk and https://github.com/KamarulAdha scraped Malaysian websites for almost 4 months, we decided to train our own Malaysian Foundation Model from scratch.
The main idea of this scraping was to prepare hyperlocalized data for future LLM development; we want the LLM to understand local slang, Manglish and so on, and that future is here.
We use 5 datasets:
- Dedup text dataset.
- Extra dedup text dataset (last minute research papers dataset).
- Filtered StarCoder dataset.
- Instruction dataset.
- Madlad-400 MS dataset, https://huggingface.co/datasets/allenai/MADLAD-400
In total, 349GB of JSONL, equivalent to 90 billion tokens using our custom tokenizer.
You can check the list of websites we gathered at https://github.com/users/huseinzol05/projects/1. Each dataset already includes notebooks on how to reproduce the data collection.
For the dedup text dataset: Lowyat, c.cari.com.my, b.cari.com.my, Carigold, news sites, everything is in there.
The extra dedup text dataset consists of:
- https://huggingface.co/datasets/syafie-nzm/crawl-jurnaldbp/resolve/main/jurnaldbp.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpharm.org.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjgeosc.com.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjsustainagri.com.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akademisains.gov.my.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/crossref-pdf.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/Kamus_Dewan_Bahasa_Edisi_Keempat_pdf.pdf
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/melayu-pdf.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majcafe.com.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjms.mohe.gov.my.jsonl
- https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/newera.edu.my.jsonl
All of these are from local journals and Crossref, filtered using the keywords:
- malay
- malaysia
- melayu
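As a rough illustration of this filtering (not the exact pipeline; the `text` field name, the error handling, and the file names are assumptions), a crawl JSONL can be filtered on those keywords like this:

```python
# Hypothetical sketch: keep only JSONL records whose text mentions one of the keywords.
import json

KEYWORDS = ('malay', 'malaysia', 'melayu')

def keep(record):
    # Assumption: each record stores its text under a 'text' field.
    text = record.get('text', '').lower()
    return any(k in text for k in KEYWORDS)

with open('crossref-pdf.jsonl') as fin, open('crossref-pdf.filtered.jsonl', 'w') as fout:
    for line in fin:
        record = json.loads(line)
        if keep(record):
            fout.write(json.dumps(record) + '\n')
```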
For the filtered StarCoder dataset, the original data comes from https://huggingface.co/datasets/bigcode/the-stack-dedup
We only picked specific programming languages:
- Python
- Julia
- C
- C++
- HTML
- CSS
- JavaScript
- Go
- Rust
- Java
- SQL
- Markdown
- R
- Dockerfile
- Ruby
- TypeScript
- YAML
Each language is capped at 10GB.
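A rough sketch of how this selection could be reproduced with the datasets library, streaming each language from the-stack-dedup and stopping at 10GB of raw text; the `data_dir` naming and the output format here are assumptions, not the exact notebook we used:

```python
# Hypothetical sketch: stream each selected language and cap it at 10GB of raw text.
import json
from datasets import load_dataset

LANGS = ['python', 'julia', 'c', 'c++', 'html', 'css', 'javascript', 'go', 'rust',
         'java', 'sql', 'markdown', 'r', 'dockerfile', 'ruby', 'typescript', 'yaml']
CAP = 10 * 1024 ** 3  # 10GB of raw text per language

with open('starcoder-filtered.jsonl', 'w') as fout:
    for lang in LANGS:
        # Assumption: each language lives under data/<language> in the-stack-dedup.
        dataset = load_dataset('bigcode/the-stack-dedup', data_dir=f'data/{lang}',
                               split='train', streaming=True)
        size = 0
        for row in dataset:
            text = row['content']
            size += len(text.encode('utf-8'))
            if size > CAP:
                break
            fout.write(json.dumps({'lang': lang, 'text': text}) + '\n')
```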
The instruction dataset consists of:
- https://huggingface.co/datasets/mesolitica/translated-glaive-code-assistant-v2/resolve/main/glaive_code_assistant_v2.translated.jsonl
- https://huggingface.co/datasets/mesolitica/translated-sql-create-context/resolve/main/sql_create_context_v4.translated.jsonl
- https://huggingface.co/datasets/mesolitica/translated-MetaMathQA/resolve/main/metamathqa.jsonl
- https://huggingface.co/datasets/mesolitica/translated-competition_math/resolve/main/gather-competition-math.jsonl
- https://huggingface.co/datasets/mesolitica/translated-MathInstruct/resolve/main/math-instruct.jsonl
- https://huggingface.co/datasets/mesolitica/translated-math_qa/resolve/main/math-qa.jsonl.translated
- https://huggingface.co/datasets/mesolitica/translated-mini-math23k-v1/resolve/main/mini-math23k.jsonl.requested
- https://huggingface.co/datasets/mesolitica/translated-WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.translated.jsonl
- https://huggingface.co/datasets/mesolitica/translated-unnatural_code_instructions_20M/resolve/main/unnatural-instructions.jsonl.requested
- https://huggingface.co/datasets/mesolitica/translated-code-context/resolve/main/code_context.jsonl.t5.translated
- https://huggingface.co/datasets/mesolitica/translated-python-evol-instruct-51k/resolve/main/python_evol_instruct_51k.jsonl.requested
- https://huggingface.co/datasets/mesolitica/dependency-parsing-instructions/resolve/main/dependency.jsonl
- https://huggingface.co/datasets/mesolitica/constituency-parsing-instructions/resolve/main/constituency.jsonl
- https://huggingface.co/datasets/mesolitica/kesalahan-tatabahasa-choice/resolve/main/kesalahan-tatabahasa-choice.jsonl
- https://huggingface.co/datasets/mesolitica/ms-wikipedia-choice/resolve/main/qa-ms-wikipedia.jsonl
- https://huggingface.co/datasets/mesolitica/dewanbahasa-jdbp-choice/resolve/main/qa-dewanbahasa-jdbp.jsonl
- https://huggingface.co/datasets/mesolitica/majalahsains-choice/resolve/main/qa-majalahsains.jsonl
- https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/jawi-rumi.jsonl
- https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/rumi-jawi.jsonl
- https://huggingface.co/datasets/mesolitica/ayat-aktif-pasif-instructions/resolve/main/synthetic-ayat-aktif-pasif.jsonl
- https://huggingface.co/datasets/mesolitica/maksud-instructions/resolve/main/maksud.jsonl
- https://huggingface.co/datasets/mesolitica/google-translate-camel-ai
- https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1
- https://huggingface.co/datasets/mesolitica/malaysian-ultrachat
- https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-pt3online.jsonl
- https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-upsr.jsonl
- https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalanspm.jsonl
These consist of translated instruction datasets, synthetic instruction datasets and crawled instruction datasets.
After we gathered all the datasets, we deduped and did simple postprocessing:
- Deduped at 95% similarity.
- Capped `\n` and `\r` to a maximum of 6 consecutive characters.
- Removed any related HTML error texts.
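A simplified sketch of the capping and HTML-error cleanup (the near-duplicate removal at 95% similarity is a separate step, e.g. MinHash-based, and is not shown here; the error markers below are assumptions):

```python
# Hypothetical sketch of the simple postprocessing: cap newline runs and drop HTML error texts.
import re

HTML_ERRORS = ('404 Not Found', '403 Forbidden', 'Request unsuccessful')  # assumed markers

def postprocess(text):
    # Cap runs of \n or \r at 6 consecutive characters.
    text = re.sub(r'\n{7,}', '\n' * 6, text)
    text = re.sub(r'\r{7,}', '\r' * 6, text)
    # Drop the whole record if it looks like an HTML error page.
    if any(err in text for err in HTML_ERRORS):
        return None
    return text

print(postprocess('hello' + '\n' * 20 + 'world'))
```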
All steps to reproduce are at https://github.com/malaysia-ai/dedup-text-dataset
We trained 2 kinds of tokenizers with a max 32k vocab size:
- BPE, https://huggingface.co/malaysia-ai/bpe-tokenizer
- SentencePiece, https://huggingface.co/malaysia-ai/sentencepiece-tokenizer
Both were trained on the same amount of data, consisting of these languages:
- Malay
- Mandarin
- Tamil
- Jawi
- English
- Arabic
Steps to reproduce the tokenizer training are at https://github.com/malaysia-ai/prepare-tokenizer; we use https://huggingface.co/docs/tokenizers/index from HuggingFace.
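A minimal sketch of training a 32k BPE tokenizer with the tokenizers library (this is not the exact prepare-tokenizer script; the special tokens, file names and the `text` field are assumptions):

```python
# Hypothetical sketch: train a 32k-vocab BPE tokenizer on JSONL text files.
import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token='<unk>'))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>'],  # assumed special tokens
)

def iter_texts(files):
    # Assumption: each JSONL line has a 'text' field.
    for path in files:
        with open(path) as fopen:
            for line in fopen:
                yield json.loads(line)['text']

tokenizer.train_from_iterator(iter_texts(['dedup-text.jsonl']), trainer=trainer)
tokenizer.save('bpe-tokenizer.json')
```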
Between BPE and SentencePiece, we proceeded with BPE because:
- We found that newline characters got messed up in the SentencePiece tokenizer we trained.
- We found that some Tamil and Jawi characters were missing in the SentencePiece tokenizer we trained.
- SentencePiece is super slow on very long texts.
We want to reduce the number of tokens required for input and output; let's compare using the Malaysian Ultrachat AstroAwani dataset, https://huggingface.co/datasets/mesolitica/malaysian-ultrachat
```python
# !wget https://huggingface.co/datasets/mesolitica/malaysian-ultrachat/resolve/main/ultrachat-astroawani-malay.jsonl

import json
from tqdm import tqdm
from transformers import AutoTokenizer

tokenizer_mallam = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
tokenizer_llama2 = AutoTokenizer.from_pretrained('mesolitica/llama-7b-hf-2048-fpf')
tokenizer_mistral = AutoTokenizer.from_pretrained('mesolitica/mistral-7b-4096-fpf')

mallam, llama2, mistral = 0, 0, 0

with open('ultrachat-astroawani-malay.jsonl') as fopen:
    for l in tqdm(fopen):
        l = json.loads(l)
        for r in l[1:]:
            if r['content_ms']:
                mallam += len(tokenizer_mallam(r['content_ms'])['input_ids'])
                llama2 += len(tokenizer_llama2(r['content_ms'])['input_ids'])
                mistral += len(tokenizer_mistral(r['content_ms'])['input_ids'])

print(mallam, llama2, mistral)
```

Output: (26157664, 60391551, 60823929)
Our tokenizer needs only about 43% of the tokens required by the Llama2 and Mistral tokenizers; the notebook is at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/compare-tokens.ipynb
As we mentioned, the accumulated text size is 349GB of JSONL, equivalent to 90B tokens. Tokenizing huge text files is not easy; we explain more about the tokenizing process and pain points (which we eventually solved) at https://github.com/malaysia-ai/dedup-text-dataset/tree/main/pretrain-llm
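As a rough illustration of the packing step (assumptions: the `text` field name, that our BPE tokenizer loads via AutoTokenizer, and the document-separator handling), each document is tokenized and concatenated into fixed 4096-token blocks:

```python
# Hypothetical sketch: tokenize JSONL text and pack it into fixed 4096-token sequences.
import json
from transformers import AutoTokenizer

CONTEXT = 4096
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/bpe-tokenizer')

buffer, blocks = [], []
with open('dedup-text.jsonl') as fopen:
    for line in fopen:
        text = json.loads(line)['text']  # assumption: text lives under 'text'
        ids = tokenizer(text)['input_ids']
        if tokenizer.eos_token_id is not None:
            ids = ids + [tokenizer.eos_token_id]  # separate documents if an EOS token exists
        buffer.extend(ids)
        while len(buffer) >= CONTEXT:
            blocks.append(buffer[:CONTEXT])
            buffer = buffer[CONTEXT:]

print(len(blocks), 'packed sequences of', CONTEXT, 'tokens')
```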
Total tokens:
- prepare-dedup-text-dataset-4096.ipynb, 31702310912
- prepare-starcoder-4096.ipynb, 40981254144
- prepare-madlad-400-4096.ipynb, 14983720960
- prepare-instructions.ipynb, 1577877504
- prepare-extra.ipynb, 1140461568
In total, ~90B tokens. We uploaded the dataset at https://huggingface.co/datasets/malaysia-ai/mosaic-combine-all, so you can use it directly with MosaicML Streaming, https://docs.mosaicml.com/projects/streaming/en/latest/index.html
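A minimal sketch of reading the uploaded dataset with MosaicML Streaming, assuming the MDS shards from malaysia-ai/mosaic-combine-all have been downloaded into a local directory and that each sample stores its tokens under `input_ids` (the field name is an assumption):

```python
# Hypothetical sketch: iterate the packed pretraining dataset with MosaicML Streaming.
from streaming import StreamingDataset

dataset = StreamingDataset(local='./mosaic-combine-all', shuffle=False)
print(len(dataset), 'packed sequences')

sample = dataset[0]
print(list(sample.keys()))
print(len(sample['input_ids']))  # expected to match the 4096-token context length
```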
We use 10 nodes of 8x A100 GPUs, https://azureprice.net/vm/Standard_ND96amsr_A100_v4:
- This means we have 10 x 8 = 80 physical GPUs; GPUs within each node are connected using NVLink, while nodes communicate with each other using NCCL inside torch distributed.
- This means we have 10 x 8 x 80GB = 6400GB of VRAM.
- Based on https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf, this means we have 10 x 8 x 312 TFLOPS = 24960 TFLOPS of bfloat16 compute.
We use a Ray cluster for multinode training. Why Ray? If you are used to a standard torch distributed script,
```
MASTER= python master.py
SLAVE= python slave.py
SLAVE= python slave.py
```
Each slave must run the training script manually. Yes, we could run the script during pod startup, but that is not really practical. So we use Ray! As long as the workers are connected to the master, you just run the script connected to the master, and you are done.
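For example, here is a minimal sketch of submitting work to a Ray cluster from any pod that can reach the head node; the Service name `ray-master` and the client port are assumptions, and our actual training entrypoint lives in the repos linked below:

```python
# Hypothetical sketch: connect to the Ray head and fan a task out to GPU workers.
import ray

ray.init(address='ray://ray-master:10001')  # assumed Kubernetes Service name for the Ray head

@ray.remote(num_gpus=1)
def train_shard(rank):
    # In the real setup each worker runs the HuggingFace Trainer with DeepSpeed;
    # here we just report where the task landed.
    import socket
    return f'rank {rank} running on {socket.gethostname()}'

print(ray.get([train_shard.remote(i) for i in range(8)]))
```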
Why Kubernetes?
- Because we use spot instances to reduce the cost by up to 60%! A cloud Kubernetes cluster can autoscale to the necessary node count if any instances die.
- Attaching ReadWriteMany storage is easier in Kubernetes; we want to share the same directory to save and load checkpoints across workers.
- Internal Kubernetes DNS: we can connect to the Ray cluster simply by using a Kubernetes Service.
We explain more about our battle-tested setup and pain points (which we eventually solved) at https://github.com/malaysia-ai/jupyter-gpu/tree/main/ray:
- We are able to use DeepSpeed ZeRO-3 on every worker, which lets us train larger models if we have more GPUs.
- We use the HuggingFace Trainer API 100%, with our own fork.
- Auto-recovery of corrupted checkpoints.
We created a Kubernetes node group with a node count of 10 (https://azureprice.net/vm/Standard_ND96amsr_A100_v4) and deployed 2 Ray clusters.
- First cluster:
  - 1 master, 4 workers, each with 8x GPUs, 5 x 8 = 40 GPUs.
  - Dedicated to training the 5B parameter model.
  - Master YAML, https://github.com/malaysia-ai/jupyter-gpu/blob/main/ray/master-stateful-us-west2.yaml
  - Workers YAML, https://github.com/malaysia-ai/jupyter-gpu/blob/main/ray/worker-stateful-us-west2.yaml
- Second cluster:
  - 1 master, 4 workers, each with 8x GPUs, 5 x 8 = 40 GPUs.
  - Shared cluster to train the 1.1B and 3B models, each using 20 workers == 20 GPUs.
  - Master YAML, https://github.com/malaysia-ai/jupyter-gpu/blob/main/ray/master-stateful-v2-us-west2.yaml
  - Workers YAML, https://github.com/malaysia-ai/jupyter-gpu/blob/main/ray/worker-stateful-v2-us-west2.yaml
Why did we create 2 clusters? Because we use spot instances; if the master dies, all scripts crash, so it is better to have 2 masters.
GPU allocation per model size:
- 1.1B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#11b-4096-context-length
  - 20 workers, equal to 20 GPUs.
- 3B
  - 20 workers, equal to 20 GPUs.
- 5B
  - 40 workers, equal to 40 GPUs.
All use the same configs:
- We use a 1e-4 learning rate with 2000 warmup steps.
- WarmupDecayLR scheduler from DeepSpeed, https://deepspeed.readthedocs.io/en/latest/schedulers.html#warmupdecaylr
- We use AdamW with a 0.1 weight decay rate.
- DeepSpeed ZeRO-3.
- Batch size of 24.
- 4096 context length.
- Mistral architecture.
You can check it at https://github.com/mesolitica/malaya/blob/5.1/pretrained-model/mistral/train.py#L347
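As a rough illustration only (the real training script is linked above), these hyperparameters map onto the HuggingFace Trainer + DeepSpeed roughly like the sketch below; any value not listed in the bullets above is an assumption:

```python
# Hypothetical sketch of expressing the listed hyperparameters with Trainer + DeepSpeed.
from transformers import TrainingArguments

# DeepSpeed ZeRO-3 + WarmupDecayLR, mirroring the bullet points above.
ds_config = {
    "zero_optimization": {"stage": 3},
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 2000,
            "total_num_steps": "auto",
        },
    },
    "bf16": {"enabled": True},
}

training_args = TrainingArguments(
    output_dir="./mallam-checkpoints",   # assumed path
    learning_rate=1e-4,
    warmup_steps=2000,
    weight_decay=0.1,                    # AdamW decay rate
    per_device_train_batch_size=24,
    bf16=True,
    deepspeed=ds_config,                 # Trainer accepts a dict or a path to a JSON config
)
```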
Training runs on WandB:
- 1.1B, https://wandb.ai/mesolitica/pretrain-mistral-1.1b?workspace=user-husein-mesolitica
- 3B, https://wandb.ai/mesolitica/pretrain-mistral-3b?workspace=user-husein-mesolitica
- 5B, https://wandb.ai/mesolitica/pretrain-mistral-5b?workspace=user-husein-mesolitica
We created a WandB report at https://wandb.ai/mesolitica/pretrain-mistral-3b/reports/Pretrain-Larger-Malaysian-Mistral--Vmlldzo2MDkyOTgz
Sudden loss spikes only happened to the 5B model, as you can see in the graph.
This usually happens for bigger models trained with a smaller batch size. Bigger models learn faster; if your batch size is not big enough and the model hits a different kind of data sample (like code texts), this can cause a sudden loss spike that is slow to recover from. To solve this problem:
- Reshuffle the dataset by changing the torch seed or changing the dataset indices. We did not do this because we were afraid the model would miss some of the dataset. (Reference required.)
- Temporarily reduce the learning rate and revert to an older checkpoint; once training is stable enough, revert back to the old learning rate. This is what we did.
We hit GPU failures 3 times:

```
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
```

And if you try to run `nvidia-smi` in the Ray and `nvidia-device-plugin-daemonset` pods, it just gets stuck.
This can be one of 2 issues:
- A `nvidia-device-plugin-daemonset` issue.
- A hardware failure.
For the first and second failures, restarting `nvidia-device-plugin-daemonset` for the affected node solved the problem, but for the third time we had to delete the node and pray the cloud provider gave us a new one:
```bash
kubectl cordon node
kubectl drain node
kubectl delete node node
```
After completing 1 epoch, we benchmarked on the Tatabahasa dataset.
With only 90B tokens and only 80 GPUs, MaLLaM 🌙 is able to beat bigger models.
Our Malay LLM benchmark leaderboard is at https://huggingface.co/spaces/mesolitica/malay-llm-leaderboard
Because we use spot instances, the total cost was 17k USD to train 3 different model sizes, running 24/7 for 10 days.
- This is a foundation model, so it cannot do much on its own; we have to finetune MaLLaM 🌙 on instruction datasets so it becomes like ChatGPT, able to hold almost human-like conversations and do multiturn QA. We gathered instruction datasets at https://huggingface.co/collections/mesolitica/malay-instructions-dataset-655ad6d1e0a6202d36868e4f and https://huggingface.co/collections/mesolitica/malaysian-synthetic-dataset-656c2673fe7fe0b1e9e25fe2
- Multimodal: we are preparing datasets for Vision and Speech.
- Released finetuned MaLLaM 🌙 1.1B on instruction datasets, https://huggingface.co/mesolitica/mallam-1.1b-20k-instructions
- Released finetuned MaLLaM 🌙 5B on instruction datasets, https://huggingface.co/mesolitica/mallam-5b-20k-instructions
- Vision instruction dataset is done, https://huggingface.co/collections/mesolitica/vision-malaysian-llm-653a16214037a1bc4417eb3a
- Audio encoder is done, https://huggingface.co/collections/mesolitica/malaysian-whisper-6590b6b733d72b44f0cfae79
- Audio instruction dataset is done, https://huggingface.co/collections/mesolitica/audio-malaysian-llm-6590b69ee7c71d6d9e209104
No research papers have been produced from this development yet; we are open to any kind of collaboration:
- Dataset: if we are able to get up to 150B tokens, we will train a model with 7B or more parameters.
- Research papers! We can provide GPUs if you are interested in writing a paper.
- Special thanks to https://github.com/aisyahrzk for contributing to dataset scraping and the training scripts.
- Special thanks to the other Malaysia-AI volunteers; without you guys, MaLLaM 🌙 development would be much slower.