This project aims to provide some guidence on Fine tuning LLMs

Tools

Jupyter Notebook
Cohere: Provides some powerful LLMs, free and limited API requests for researchers. Cohere
HuggingFace: A platform with huge amount of open-source LLMs, DataSets and so on. Huggingface
Docanno: A tool for labeling the dataset and can be used to build instruction + output for training. Docanno

Find the Data Source :You can take advantage of some open source data, or Crawling some data on the Internet (Please note the legality).
Pre-processing the dataset: Filter the data from the data source, leave only the information needed for further processing or training. Better to structure it to some structured format(e.g. json).

Use LLMs to help:

pre-processing_data/MT_Filter_5TEs_Huggingface.ipynb : This file provides an example to use the model on huggingface to help with processing data.
pre-processing_data/MT_Filter_ISO_Cohere.ipynb : This file provides an example to use the model provided by cohere to processing data.

Processing the dataset: Before training, you need to build you own dataset for training, typically including Instruction + Output. Docanno can be used as a tool.

processing_data/convert_dataset_to_train.ipynb : This file provides an example of converting the dataset to the format of the model needed.
processing_data/convert_jsonL.ipynb: This file provides an example of modifying dataset directly using json.
processing_data/split_dataset_to_train_test.ipynb: This file provides an example of spliting the dataset into the training set and test set.

Training model（Fine Tuning）: After build your own dataset, you can start to fine tune a pre-trained model (choose an instruct pre-trained model from HuggingFace).

training_model: This folder provides some examples to train the model using LoRA and upload the final model to HuggingFace.

Evaluation: After training the model, you can start evaluation, some metrics can be find in metrics:

evaluation_scripts/predicting_with_models: This folder provides some examples for predicting with the final model.(Either use GPU or Inference API in HuggingFace)
evaluation_scripts/evaluation_metric.ipynb: Gives an example of evaluation

Author: 0YHR0

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
evaluation_scripts		evaluation_scripts
pre-processing_data		pre-processing_data
processing_data		processing_data
training_model		training_model
.gitignore		.gitignore
CheckData.py		CheckData.py
README.md		README.md
try_final_model.ipynb		try_final_model.ipynb
visualization.ipynb		visualization.ipynb