This project serves as an on-ramp to Valohai and is designed to be the first step for individuals starting with their self-serve trial. The primary goal of this template is to showcase the power of Valohai for fine-tuning large language models, with a special focus on the Mistral 7B model.
- **Loading Data:** In our project, data is fetched from our S3 bucket. When you initiate an execution, the data is automatically stored in the `/valohai/inputs/` directory on the machine. The tokenizer is sourced directly from the Hugging Face repository and is also available under `/valohai/inputs/`.
- **Tokenization:** To make the data suitable for language models, it is tokenized using the tokenizer from Hugging Face's Mistral repository. Tokenization means breaking the text into smaller units, such as words or subwords, that the model can work with (see the sketch below).
- **Saving Processed Data:** After tokenization, the processed data is saved to Valohai datasets under a dedicated alias, making it convenient to pick up in the subsequent steps of the machine learning workflow.
This streamlined workflow empowers you to focus on your machine learning tasks, while Valohai handles data management, versioning, and efficient storage.
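For illustration, here is a minimal sketch of what the preprocessing step might look like; the input names (`dataset`, `tokenizer`), file names, and the dataset alias are assumptions, not the exact code in this repository:

```python
import json
from datasets import load_dataset
from transformers import AutoTokenizer

# Valohai mounts each declared input under /valohai/inputs/<input-name>/.
tokenizer = AutoTokenizer.from_pretrained("/valohai/inputs/tokenizer/")
dataset = load_dataset("json", data_files="/valohai/inputs/dataset/train.jsonl", split="train")

def tokenize(example):
    # Break the raw text into token IDs the model can consume.
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize)

# Anything written to /valohai/outputs/ is uploaded and versioned; a
# *.metadata.json sidecar attaches the file to a Valohai dataset version
# (the alias used here is illustrative).
tokenized.to_parquet("/valohai/outputs/train.parquet")
with open("/valohai/outputs/train.parquet.metadata.json", "w") as f:
    json.dump({"valohai.dataset-versions": ["dataset://mistral-data/v1"]}, f)
```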
- **Loading Data and Model:** The code loads the prepared training data from Valohai datasets and fetches the base model, a pre-trained Mistral model, from an S3 bucket.
- **Model Enhancement:** The base model is prepared for training with PEFT (Parameter-Efficient Fine-Tuning), which configures the model so that only a small set of adapter weights is trained instead of the full model, improving training efficiency (see the sketch below).
- **Training the Model:** The script then fine-tunes the model on the prepared data using the `Trainer` from the Transformers library, making it better at understanding video-gaming text.
- **Saving Results:** After training, the script saves checkpoints of the model's progress. These checkpoints are stored in Valohai datasets for easy access in the next steps, like inference.
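As a rough sketch of the fine-tuning step (paths, LoRA hyperparameters, and training arguments are illustrative assumptions, not the repository's exact configuration):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Base model, tokenizer, and tokenized data arrive as Valohai inputs; names are illustrative.
model = AutoModelForCausalLM.from_pretrained("/valohai/inputs/base-model/", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("/valohai/inputs/tokenizer/")
tokenizer.pad_token = tokenizer.eos_token
train_data = load_dataset("parquet", data_files="/valohai/inputs/tokenized-data/train.parquet", split="train")

# PEFT/LoRA: train small adapter matrices instead of all 7B base weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="/valohai/outputs/checkpoints",  # saved checkpoints are versioned by Valohai
        max_steps=100,
        per_device_train_batch_size=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```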
In the inference step, we use the fine-tuned language model to generate text based on a given prompt. Here's a simplified explanation of what happens in this code:
- **Loading Model and Checkpoints:** The code loads the base model from an S3 bucket and the fine-tuned checkpoint from the previous step, stored in Valohai datasets.
- **Inference:** Using the fine-tuned model and the provided test prompt, we generate a response, which the tokenizer decodes into human-readable text (see the sketch below).
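A minimal sketch of the inference step, assuming the base model, fine-tuned checkpoint, and tokenizer are wired in as Valohai inputs (input names and the prompt are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then apply the fine-tuned adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("/valohai/inputs/base-model/", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "/valohai/inputs/checkpoint/")
tokenizer = AutoTokenizer.from_pretrained("/valohai/inputs/tokenizer/")

prompt = "What makes an open-world game fun to play?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode token IDs back into human-readable text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```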
Before we can run any code, we need to set up the project. This section explains how to set up the project using the Valohai web app or the terminal.
🌐 Using the web app
Log in to the Valohai web app and create a new project.
Configure this repository as the project's repository by following these steps:
- Go to your project's page.
- Navigate to the Settings tab.
- Under the Repository section, locate the URL field.
- Enter the URL of this repository.
- Click on the Save button to save the changes.
⌨️ Using the terminal
To run your code on Valohai using the terminal, follow these steps:
- Install the Valohai CLI on your machine:

  ```bash
  pip install valohai-cli
  ```

- Log in to Valohai from the terminal:

  ```bash
  vh login
  ```

- Create a project for your Valohai workflow. Start by creating a directory for your project:

  ```bash
  mkdir valohai-mistral-example
  cd valohai-mistral-example
  ```

  Then, create the Valohai project:

  ```bash
  vh project create
  ```

- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/valohai/mistral-example.git .
  ```
🌐 / ⌨️ Setup for both
Authorize the Valohai project to download models and tokenizers from Hugging Face.
- Log in to the Hugging Face platform.
- Agree to the terms of the Mistral model; the license is Apache 2.0.
- Create an access token under your Hugging Face settings. You can either allow access to all public models you've agreed to, or only to the Mistral model. Copy the token and store it in a secure place; you won't be able to see it again.
- Add the `hf_xxx` token to your Valohai project as a secret named `HF_TOKEN`. All workloads in this project will then have scoped access to Hugging Face, unless you specifically restrict them.
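Valohai exposes project secrets to workloads as environment variables, so the code can authenticate roughly like this (a sketch; recent versions of `huggingface_hub` also pick up an `HF_TOKEN` environment variable automatically):

```python
import os
from huggingface_hub import login

# HF_TOKEN is the project secret defined above, injected as an environment variable.
login(token=os.environ["HF_TOKEN"])
```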
This repository defines the essential tasks, or "steps", such as data preprocessing, model fine-tuning, and inference for Mistral models. You can execute these steps individually or as part of a pipeline. This section covers how to run them individually.
🌐 Using the web app
⌨️ Using the terminal
To run individual steps, execute the following command:

```bash
vh execution run <step-name> --adhoc
```

For example, to run the data-preprocess step, use the command:

```bash
vh execution run data-preprocess --adhoc
```
When you have a collection of tasks that you want to run together, you create a pipeline. This section explains how to run the predefined pipelines in this repository.
🌐 Using the web app
⌨️ Using the terminal
To run pipelines, use the following command:

```bash
vh pipeline run <pipeline-name> --adhoc
```

For example, to run our pipeline, use the command:

```bash
vh pipeline run training-pipeline --adhoc
```
The completed pipeline view:
The generated response by the model looks like this:
Important
The example configuration runs only a limited number of fine-tuning steps. Achieving satisfactory results may require further experimentation with the model parameters.