
MaLLaM 🌙 Malaysia Large Language Model


Please make sure you are logged in to a GitHub account to see the images.

MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation model, trained on 349GB of JSONL, equivalent to 90 billion tokens.

We released 3 different sizes:

  1. 1.1B Parameters, https://huggingface.co/mesolitica/mallam-1.1B-4096
  2. 3B Parameters, https://huggingface.co/mesolitica/mallam-3B-4096
  3. 5B Parameters, https://huggingface.co/mesolitica/mallam-5B-4096

What is a Foundation Model?

It is a non-technical term for a pretrained model, https://en.wikipedia.org/wiki/Foundation_models, but in this context it is a large language model (causal language model) trained on a massive dataset.

The term Foundation Model has been used in legal contexts, including:

  1. United States, the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.
  2. European Union, the European Parliament’s negotiated position on the E.U. AI Act.
  3. United Kingdom, the Competition and Markets Authority’s AI Foundation Models: Initial Report.

Why did we start it?

After https://github.com/huseinzol05, https://github.com/aisyahrzk and https://github.com/KamarulAdha scraped Malaysian websites for almost 4 months, we decided to train our own Malaysian Foundation Model from scratch.

Dataset

The main idea of this scraping was to prepare hyperlocalization for future LLM development; we want the LLM to be able to understand local slang, Manglish and so on. That future is here.

How big is the dataset?

We use 5 datasets:

  1. Dedup text dataset.
  2. Extra dedup text dataset (last minute research papers dataset).
  3. Filtered StarCoder dataset.
  4. Instruction dataset.
  5. Madlad-400 MS dataset, https://huggingface.co/datasets/allenai/MADLAD-400

In total, 349GB of JSONL, equivalent to 90 billion tokens using our custom tokenizer.

Text dedup dataset

You can check the list of websites we gathered at https://github.com/users/huseinzol05/projects/1. Each dataset already includes the notebooks showing how to reproduce the collection.

Lowyat, c.cari.com.my, b.cari.com.my, Carigold, news sites, everything is in there.

Extra dedup text dataset

  1. https://huggingface.co/datasets/syafie-nzm/crawl-jurnaldbp/resolve/main/jurnaldbp.jsonl
  2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpharm.org.jsonl
  3. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjgeosc.com.jsonl
  4. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjsustainagri.com.jsonl
  5. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akademisains.gov.my.jsonl
  6. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/crossref-pdf.jsonl
  7. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/Kamus_Dewan_Bahasa_Edisi_Keempat_pdf.pdf
  8. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/melayu-pdf.jsonl
  9. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majcafe.com.jsonl
  10. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjms.mohe.gov.my.jsonl
  11. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/newera.edu.my.jsonl

All of these come from local journals and Crossref, filtered using the keywords:

  1. malay
  2. malaysia
  3. melayu

Filtered StarCoder

The original dataset is at https://huggingface.co/datasets/bigcode/the-stack-dedup

We only pick specific programming languages:

  1. Python
  2. Julia
  3. C
  4. C++
  5. HTML
  6. CSS
  7. JavaScript
  8. Go
  9. Rust
  10. Java
  11. SQL
  12. Markdown
  13. R
  14. Dockerfile
  15. Ruby
  16. Typescript
  17. YAML

Each language is capped at 10GB.
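
A minimal sketch of how such a per-language cap can be applied, assuming the-stack-dedup keeps its per-language data/<language> layout and a content column (the dataset is gated, so you need to be logged in to the HuggingFace Hub); the actual filtering script may differ.

import json
from datasets import load_dataset

MAX_BYTES = 10 * 1024 ** 3  # 10GB cap per language

# stream one language; repeat with data_dir='data/julia', 'data/c', ... for the rest
dataset = load_dataset(
    'bigcode/the-stack-dedup',
    data_dir='data/python',
    split='train',
    streaming=True,
)

collected = 0
with open('the-stack-python.jsonl', 'w') as fout:
    for row in dataset:
        text = row['content']
        collected += len(text.encode('utf-8'))
        if collected > MAX_BYTES:
            break
        fout.write(json.dumps({'text': text}) + '\n')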

Instruction dataset

  1. https://huggingface.co/datasets/mesolitica/translated-glaive-code-assistant-v2/resolve/main/glaive_code_assistant_v2.translated.jsonl
  2. https://huggingface.co/datasets/mesolitica/translated-sql-create-context/resolve/main/sql_create_context_v4.translated.jsonl
  3. https://huggingface.co/datasets/mesolitica/translated-MetaMathQA/resolve/main/metamathqa.jsonl
  4. https://huggingface.co/datasets/mesolitica/translated-competition_math/resolve/main/gather-competition-math.jsonl
  5. https://huggingface.co/datasets/mesolitica/translated-MathInstruct/resolve/main/math-instruct.jsonl
  6. https://huggingface.co/datasets/mesolitica/translated-math_qa/resolve/main/math-qa.jsonl.translated
  7. https://huggingface.co/datasets/mesolitica/translated-mini-math23k-v1/resolve/main/mini-math23k.jsonl.requested
  8. https://huggingface.co/datasets/mesolitica/translated-WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.translated.jsonl
  9. https://huggingface.co/datasets/mesolitica/translated-unnatural_code_instructions_20M/resolve/main/unnatural-instructions.jsonl.requested
  10. https://huggingface.co/datasets/mesolitica/translated-code-context/resolve/main/code_context.jsonl.t5.translated
  11. https://huggingface.co/datasets/mesolitica/translated-python-evol-instruct-51k/resolve/main/python_evol_instruct_51k.jsonl.requested
  12. https://huggingface.co/datasets/mesolitica/dependency-parsing-instructions/resolve/main/dependency.jsonl
  13. https://huggingface.co/datasets/mesolitica/constituency-parsing-instructions/resolve/main/constituency.jsonl
  14. https://huggingface.co/datasets/mesolitica/kesalahan-tatabahasa-choice/resolve/main/kesalahan-tatabahasa-choice.jsonl
  15. https://huggingface.co/datasets/mesolitica/ms-wikipedia-choice/resolve/main/qa-ms-wikipedia.jsonl
  16. https://huggingface.co/datasets/mesolitica/dewanbahasa-jdbp-choice/resolve/main/qa-dewanbahasa-jdbp.jsonl
  17. https://huggingface.co/datasets/mesolitica/majalahsains-choice/resolve/main/qa-majalahsains.jsonl
  18. https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/jawi-rumi.jsonl
  19. https://huggingface.co/datasets/mesolitica/rumi-jawi-instructions/resolve/main/rumi-jawi.jsonl
  20. https://huggingface.co/datasets/mesolitica/ayat-aktif-pasif-instructions/resolve/main/synthetic-ayat-aktif-pasif.jsonl
  21. https://huggingface.co/datasets/mesolitica/maksud-instructions/resolve/main/maksud.jsonl
  22. https://huggingface.co/datasets/mesolitica/google-translate-camel-ai
  23. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1
  24. https://huggingface.co/datasets/mesolitica/malaysian-ultrachat
  25. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-pt3online.jsonl
  26. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-upsr.jsonl
  27. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalanspm.jsonl

These consist of translated instruction datasets, synthetic instruction datasets and crawled instruction datasets.

Postprocessing

After we gathered all the datasets, we deduplicated them and did simple postprocessing:

  1. Deduplicated at 95% similarity.
  2. Capped \n, \r to a maximum of 6 consecutive characters (a sketch follows below).
  3. Removed any related HTML error texts.

All steps to reproduce are at https://github.com/malaysia-ai/dedup-text-dataset
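
A minimal sketch of the newline-capping step, assuming each JSONL row stores its document under a text key (the key name and file names are assumptions); the real pipeline lives in the dedup-text-dataset repository above.

import json
import re

def cap_newlines(text, max_consecutive=6):
    # collapse runs of more than 6 consecutive \n or \r down to exactly 6
    text = re.sub(r'\n{%d,}' % (max_consecutive + 1), '\n' * max_consecutive, text)
    text = re.sub(r'\r{%d,}' % (max_consecutive + 1), '\r' * max_consecutive, text)
    return text

with open('dedup-text.jsonl') as fopen, open('postprocessed.jsonl', 'w') as fout:
    for line in fopen:
        row = json.loads(line)
        row['text'] = cap_newlines(row['text'])
        fout.write(json.dumps(row) + '\n')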

Train our own tokenizer

We trained 2 kinds of tokenizers with a maximum vocab size of 32k:

  1. BPE, https://huggingface.co/malaysia-ai/bpe-tokenizer
  2. SentencePiece, https://huggingface.co/malaysia-ai/sentencepiece-tokenizer

Both were trained on the same amount of data, covering the following languages:

  1. Malay
  2. Mandarin
  3. Tamil
  4. Jawi
  5. English
  6. Arabic

Steps to reproduce the tokenizer training are at https://github.com/malaysia-ai/prepare-tokenizer; we use https://huggingface.co/docs/tokenizers/index from HuggingFace (a minimal training sketch is shown after the next list).

Between BPE and SentencePiece, we proceeded with BPE because:

  1. We found that newline characters get messed up in the SentencePiece tokenizer we trained.
  2. We found that some Tamil and Jawi characters are missing in the SentencePiece tokenizer we trained.
  3. SentencePiece is super slow on very long texts.
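
A minimal sketch of what the BPE training looks like with the HuggingFace tokenizers library; the file names, special tokens and pre-tokenizer choice here are assumptions, see https://github.com/malaysia-ai/prepare-tokenizer for the actual scripts.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token='<unk>'))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>'],
)

# plain-text files covering Malay, Mandarin, Tamil, Jawi, English and Arabic
files = ['malay.txt', 'mandarin.txt', 'tamil.txt', 'jawi.txt', 'english.txt', 'arabic.txt']
tokenizer.train(files=files, trainer=trainer)
tokenizer.save('bpe-tokenizer.json')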

Why train our own tokenizers?

We want to reduce the number of tokens required during input and output. Let's compare using the Malaysian Ultrachat AstroAwani dataset, https://huggingface.co/datasets/mesolitica/malaysian-ultrachat

# !wget https://huggingface.co/datasets/mesolitica/malaysian-ultrachat/resolve/main/ultrachat-astroawani-malay.jsonl

import json
from tqdm import tqdm
from transformers import AutoTokenizer

tokenizer_mallam = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
tokenizer_llama2 = AutoTokenizer.from_pretrained('mesolitica/llama-7b-hf-2048-fpf')
tokenizer_mistral = AutoTokenizer.from_pretrained('mesolitica/mistral-7b-4096-fpf')

# accumulate how many tokens each tokenizer needs for the same Malay texts
mallam, llama2, mistral = 0, 0, 0
with open('ultrachat-astroawani-malay.jsonl') as fopen:
    for l in tqdm(fopen):
        l = json.loads(l)
        for r in l[1:]:
            if r['content_ms']:
                mallam += len(tokenizer_mallam(r['content_ms'])['input_ids'])
                llama2 += len(tokenizer_llama2(r['content_ms'])['input_ids'])
                mistral += len(tokenizer_mistral(r['content_ms'])['input_ids'])

print(mallam, llama2, mistral)
26157664 60391551 60823929

We are able to reduce the token count to roughly 43% of what the Llama2 and Mistral tokenizers produce (about a 57% reduction); the notebook is at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/compare-tokens.ipynb

Tokenizing dataset

As we mentioned, the accumulated text size is 349GB of JSONL, equivalent to 90B tokens. Tokenizing such a huge text corpus is not easy; we explain the tokenizing process and pain points (which we eventually solved) at https://github.com/malaysia-ai/dedup-text-dataset/tree/main/pretrain-llm

Total tokens:

  1. prepare-dedup-text-dataset-4096.ipynb, 31702310912
  2. prepare-starcoder-4096.ipynb, 40981254144
  3. prepare-madlad-400-4096.ipynb, 14983720960
  4. prepare-instructions.ipynb, 1577877504
  5. prepare-extra.ipynb, 1140461568

In total, ~90B tokens. We uploaded the tokenized dataset at https://huggingface.co/datasets/malaysia-ai/mosaic-combine-all, so you can use it directly with https://docs.mosaicml.com/projects/streaming/en/latest/index.html
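
A minimal sketch of reading the MDS shards with MosaicML Streaming, assuming you have downloaded https://huggingface.co/datasets/malaysia-ai/mosaic-combine-all to a local directory (the directory name below is a placeholder).

from streaming import StreamingDataset

# point local= at the downloaded shards; no remote= is needed for a local copy
dataset = StreamingDataset(local='./mosaic-combine-all', shuffle=False)

print(len(dataset))   # number of samples
sample = dataset[0]   # one sample as a dict of columns
print(sample.keys())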

GPU Infrastructure

We use 10 nodes of 8x A100 GPUs each, https://azureprice.net/vm/Standard_ND96amsr_A100_v4.

Multinode training

Why Ray?

We use a Ray cluster for multinode training. Why Ray? If you are used to a standard torch distributed setup, every node has to launch its own command, for example:

torchrun --nnodes=3 --nproc_per_node=8 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py
torchrun --nnodes=3 --nproc_per_node=8 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train.py
torchrun --nnodes=3 --nproc_per_node=8 --node_rank=2 --master_addr=10.0.0.1 --master_port=29500 train.py

Each node must run the training script manually. Yes, we could run the script during pod startup, but that is not really practical. So we use Ray! As long as the workers are connected to the head node, you just run one script that connects to the head, and you are done.
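
A minimal sketch (not our actual training entry point) of driving every GPU worker from a single script once they have joined the Ray head; the Service DNS name below is an assumption that depends on your cluster manifests.

import ray

# connect to the Ray head through its Kubernetes Service DNS name
ray.init(address='ray://ray-head.default.svc.cluster.local:10001')

@ray.remote(num_gpus=1)
def which_node():
    # each task lands on a GPU worker that has joined the head
    import socket
    return socket.gethostname()

# launched once from the driver, no manual command on every worker
print(ray.get([which_node.remote() for _ in range(8)]))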

Why Kubernetes?

  1. Because we use spot instances to reduce cost by up to 60%! Managed Kubernetes can autoscale back to the required node count if any instance dies.
  2. Attaching a ReadWriteMany volume is easier in Kubernetes; we want every worker to share the same directory to save and load checkpoints.
  3. Internal Kubernetes DNS: we can connect to the Ray cluster simply by using a Kubernetes Service.

Where is the battle-tested Ray cluster?

We explain more about our battle testing and pain points (which we eventually solved) at https://github.com/malaysia-ai/jupyter-gpu/tree/main/ray:

  1. We are able to use DeepSpeed ZeRO-3 across workers; this lets us train larger models when we have more GPUs.
  2. We use the HuggingFace Trainer API 100%, with our own fork.
  3. Automatic recovery from corrupted checkpoints.

How about the cluster deployments?

We created a Kubernetes node group with a count of 10 https://azureprice.net/vm/Standard_ND96amsr_A100_v4 instances, and deployed 2 Ray clusters.

  1. First cluster.
  2. Second cluster.

Why create 2 clusters? Because we use spot instances; if a head node dies, all scripts on that cluster crash, so it is better to have 2 head nodes.

Training scripts and sessions

  1. 1.1B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#11b-4096-context-length
  • 20 workers, equal to 20 GPUs.
  2. 3B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#3b-4096-context-length
  • 20 workers, equal to 20 GPUs.
  3. 5B, https://github.com/mesolitica/malaya/tree/5.1/pretrained-model/mistral#5b-4096-context-length
  • 40 workers, equal to 40 GPUs.

All use the same configs.

You can check them at https://github.com/mesolitica/malaya/blob/5.1/pretrained-model/mistral/train.py#L347

WandB

  1. 1.1B, https://wandb.ai/mesolitica/pretrain-mistral-1.1b?workspace=user-husein-mesolitica
  2. 3B, https://wandb.ai/mesolitica/pretrain-mistral-3b?workspace=user-husein-mesolitica
  3. 5B, https://wandb.ai/mesolitica/pretrain-mistral-5b?workspace=user-husein-mesolitica

We created a WandB report at https://wandb.ai/mesolitica/pretrain-mistral-3b/reports/Pretrain-Larger-Malaysian-Mistral--Vmlldzo2MDkyOTgz

[Screenshot: WandB training report]

Training hiccup

This only happened to the 5B model; if you look at the graph,

[Screenshot: 5B training loss graph]

This usually happens for bigger models trained with a smaller batch size. Bigger models learn faster; if your batch size is not big enough and the model encounters a different kind of data sample (like code texts), this can cause a sudden loss spike that is slow to recover from. To solve this problem:

  1. Reshuffle the dataset by changing the torch seed or the dataset indices. We did not do this; we were worried the model would miss some of the dataset. (Reference needed.)
  2. Temporarily reduce the learning rate and revert to an older checkpoint; once stable enough, go back to the original learning rate (a rough sketch follows below). This is what we did.
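
A minimal sketch of option 2, written as a plain PyTorch loop rather than our actual Trainer-based setup; the model, checkpoint path and learning-rate values are placeholders.

import torch

# stand-ins for the real model and optimizer
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# revert to the last checkpoint saved before the loss spike (hypothetical file)
state = torch.load('last-good-checkpoint.pt')
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])

# temporarily reduce the learning rate
for group in optimizer.param_groups:
    group['lr'] *= 0.5

# ... continue training; once the loss is stable again, restore the original LR
for group in optimizer.param_groups:
    group['lr'] = 2e-4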

GPUs failures

We hit GPU failures 3 times:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

And if you try to run nvidia-smi in the Ray and nvidia-device-plugin-daemonset pods, it just hangs.

This can be one of 2 issues:

  1. nvidia-device-plugin-daemonset issue.
  2. hardware failure.

For the first and second failures, restarting nvidia-device-plugin-daemonset for the particular node solved the problem, but for the third we had to delete the node and pray the cloud provider gave us a new one:

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>

Benchmarking

After 1 epoch was done, we benchmarked on the Tatabahasa dataset,

[Screenshot: Tatabahasa benchmark results]

With only 90B tokens and only 80 GPUs, MaLLaM 🌙 is able to beat bigger models.

Our Malay LLM benchmark leaderboard is at https://huggingface.co/spaces/mesolitica/malay-llm-leaderboard

Total cost

[Screenshot: total cloud cost]

Because we use spot instances, the total is 17k USD to train 3 different model sizes, running 24/7 for 10 days.
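
As a rough sanity check, assuming all 80 GPUs ran continuously: 80 GPUs × 10 days × 24 hours = 19,200 GPU-hours, so 17,000 USD / 19,200 GPU-hours ≈ 0.89 USD per A100-hour on spot pricing.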

So, what next?

  1. This is a foundation model, so it cannot do much on its own. We have to finetune MaLLaM 🌙 on instruction datasets so it becomes like ChatGPT, able to hold almost human-like conversations and do multiturn QA. We gathered instruction datasets at https://huggingface.co/collections/mesolitica/malay-instructions-dataset-655ad6d1e0a6202d36868e4f and https://huggingface.co/collections/mesolitica/malaysian-synthetic-dataset-656c2673fe7fe0b1e9e25fe2
  2. Multimodal: we are preparing datasets for Vision and Speech.

2024-01-11 update

  1. Released finetuned MaLLaM 🌙 1.1B on instruction datasets, https://huggingface.co/mesolitica/mallam-1.1b-20k-instructions
  2. Released finetuned MaLLaM 🌙 5B on instruction datasets, https://huggingface.co/mesolitica/mallam-5b-20k-instructions
  3. Vision instruction dataset is done, https://huggingface.co/collections/mesolitica/vision-malaysian-llm-653a16214037a1bc4417eb3a
  4. Audio encoder is done, https://huggingface.co/collections/mesolitica/malaysian-whisper-6590b6b733d72b44f0cfae79
  5. Audio instruction dataset is done, https://huggingface.co/collections/mesolitica/audio-malaysian-llm-6590b69ee7c71d6d9e209104

Collaboration

No research papers have been produced for this development yet; we are open to any kind of collaboration:

  1. Datasets: if we are able to get up to 150B tokens, we will train 7B or more parameters.
  2. Research papers! We can provide GPUs if you are interested in writing a paper.

Contribution

  1. Special thanks to https://github.com/aisyahrzk for contributing scraped datasets and training scripts.
  2. Special thanks to the other Malaysia-AI volunteers; without you guys, MaLLaM 🌙 development would be really slow.