Sahal Shaji Mullappilly*, Abdelrahman Shaker*, Omkar Thawakar*, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan.
*Equal Contribution
Mohamed bin Zayed University of Artificial Intelligence, UAE
- May-20: Our code, models, and pre-processed datasets for the English version are released. We will release everything related to the Arabic version, along with the technical report, soon.
You can try our demo using the following links:
- ClimateGPT is a specialized Large Language Model (LLM) built on top of the Vicuna framework and fine-tuned specifically for Climate Change and Sustainability topics in both English and Arabic.
- We introduce a vector embedding and datastore framework that can be used during model inference for information retrieval, without the need for additional training (a minimal sketch follows this list).
- We have generated over 500k interactive conversational-style samples (question-answer pairs) based on public climate-change benchmark datasets. This augmentation with interactive conversational data greatly enhances LLM performance during fine-tuning. Our proposed dataset (Clima500) will be available on HuggingFace, and instructions for dataset creation will be released soon.
- To the best of our knowledge, this marks the first release of a substantial conversational-style Arabic dataset (question-answer pairs) dedicated to climate change and sustainability, comprising over 500k samples. The Arabic dataset will be released soon.
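The retrieval framework mentioned in the second bullet can be approximated with off-the-shelf components. Below is a minimal sketch using ChromaDB (acknowledged at the end of this README); the collection name, sample passages, and prompt template are illustrative assumptions, not ClimateGPT's exact pipeline:

```python
# A minimal sketch of retrieval-augmented prompting with ChromaDB.
# The passages, collection name, and prompt template are illustrative;
# they are not the exact ones used by ClimateGPT.
import chromadb

client = chromadb.Client()  # in-memory instance; use a persistent client for real data
collection = client.create_collection(name="climate_knowledge")

# Index reference passages once; Chroma embeds them with its default
# embedding model, so no additional training is required.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "CO2 is the primary greenhouse gas emitted through human activities.",
        "Global mean sea level has risen roughly 20 cm since 1900.",
    ],
)

def build_prompt(question: str, n_results: int = 2) -> str:
    """Retrieve the most similar passages and prepend them to the question."""
    hits = collection.query(query_texts=[question], n_results=n_results)
    context = "\n".join(hits["documents"][0])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What drives recent sea level rise?"))
```

The prompt built this way can then be passed to the fine-tuned model at inference time, which is what lets the datastore be updated without retraining.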
1. Prepare the code and the environment
Clone the repository and create an Anaconda environment:
git clone https://github.com/mbzuai-oryx/ClimateGPT.git
cd ClimateGPT
conda env create -f environment.yml
conda activate climateGPT
pip install -e .
OR
git clone https://github.com/mbzuai-oryx/ClimateGPT.git
cd ClimateGPT
conda create -n climateGPT python=3.8
conda activate climateGPT
pip install -r requirements.txt
pip install -e .
2. Prepare the Datasets for training
The Clima500 dataset, along with detailed instructions for dataset creation, will be released soon. Stay tuned for further updates!
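Until the dataset is released, note that FastChat's train_mem.py (used in the training command below) consumes a JSON list of conversations in the ShareGPT style, so the file passed via --data_path presumably follows that schema. A hypothetical Clima500_en_train.json entry might look like the sketch below; the id and Q&A text are placeholders:

```python
# A hedged sketch of the conversation schema FastChat's trainer expects.
# Field names follow FastChat's ShareGPT-style format; the id and the
# actual Clima500 content are placeholders until the dataset is released.
import json

sample = {
    "id": "clima500_en_000001",  # hypothetical id
    "conversations": [
        {"from": "human", "value": "What are the main causes of sea level rise?"},
        {"from": "gpt", "value": "Thermal expansion of warming oceans and melting land ice ..."},
    ],
}

with open("Clima500_en_train.json", "w") as f:
    json.dump([sample], f, indent=2)  # the trainer expects a JSON list of samples
```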
3. Fine-Tuned Model
The fine-tuned model checkpoint can be downloaded from here.
4. Prepare the pretrained Vicuna weights
We built ClimateGPT on the v1.1 version of Vicuna-7B.
Refer to the original repository for the Vicuna-7B model weights: Vicuna-7B
You can use the following command to train ClimateGPT on 4 x A100 (80GB) GPUs.
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
--model_name_or_path ~/path_to_model_weights/Vicuna-7B \
--data_path path_to_data/Clima500_en_train.json \
--bf16 True \
--output_dir output \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
Download the fine-tuned model checkpoint from here.
Save the model checkpoint at weights/ClimateGPT_en.
Run the following commands in separate terminals (see web_run.sh):
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path weights/ClimateGPT_en
python3 -m fastchat.serve.gradio_web_server
Refer to the Gradio Web GUI for more information.
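If you prefer to query the model programmatically instead of through the web GUI, the checkpoint can presumably be loaded like any Hugging Face LLaMA-style model. Below is a minimal sketch, assuming weights/ClimateGPT_en is a standard FastChat-produced checkpoint; note that the serving stack above additionally applies Vicuna's conversation template, which this raw prompt omits:

```python
# A minimal sketch of loading the checkpoint directly with transformers,
# assuming weights/ClimateGPT_en is a standard Hugging Face LLaMA-style
# checkpoint as produced by FastChat fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("weights/ClimateGPT_en", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "weights/ClimateGPT_en",
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

prompt = "What is the greenhouse effect?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```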
- Vicuna: The fantastic language ability of Vicuna is just amazing. And it is open-source!
- ChromaDB: Chroma, the open-source embedding database.
- LangChain: Building applications with LLMs through composability.
This repository is licensed under CC BY-NC-SA. Please refer to the license terms here.