- Click the link to jump to the section you're interested in.
- Follow the instruction and video to prepare your task, model and dataset.
- Finish your task with only a few clicks!
Different models are designed for different tasks, so it's essential to understand which type your task belongs to.
To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.
Here are the task types and their description, so you can recognize your task type based on your task description and objectives.
For Classification and Regression prediction task:
- Classification Task
- Regression Task
- Amino Acid Classification Task
- Pair Classification Task
- Pair Regression Task
For Zero-shot prediction task:
Train a model based on SaProt and use it to make predictions.
Task Type | Task Description | Example |
---|---|---|
Classification (Protein-level Classification) | Classify protein sequences. | - Fold Class Prediction<br />- Localization Prediction<br />- Function Prediction |
Regression (Protein-level Regression) | Predict the value of some property of a protein sequence. | - Thermal Stability Prediction<br />- Fluorescence Intensity Prediction<br />- Binding Affinity Prediction |
Amino Acid Classification (Residue-level Classification) | Classify the amino acids in a protein sequence. | - Secondary Structure Prediction<br />- Binding Site Prediction<br />- Active Site Prediction |
Pair Classification | Predict whether two proteins interact. | - Protein-Protein Interaction (PPI) Prediction<br />- Interaction Type Classification<br />- Disease-Associated Interaction Prediction |
Pair Regression | Predict the strength of the interaction between two proteins. | - Interaction Strength Prediction<br />- Binding Free Energy Calculation<br />- Interaction Affinity Prediction |
Directly use SaProt (650M) to make predictions.
Task Type | Task Description | Example |
---|---|---|
Mutational Effect Prediction | Predict the mutational effect based on the wild type sequence and mutation information. | - Enzyme Activity Prediction<br />- Virus Fitness Prediction<br />- Driver Mutation Prediction |
Inverse Folding Prediction | Predict the residue sequence given the structure backbone. | - Enzyme Function Optimization<br />- Protein Stability Enhancement<br />- Protein Folding Prediction |
You can use your private data to train and predict. Below are the various data formats corresponding to different data types.
We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.
The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.
Here you can convert your data into SA Sequence format.
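To make the SA sequence format concrete, here is a minimal sketch of how residue and structure tokens interleave. The function names `to_sa_sequence` and `split_sa_sequence` are hypothetical helpers for illustration, not part of ColabSaprot, and the structure-token string is assumed to have been produced elsewhere (for example by Foldseek).

```python
def to_sa_sequence(aa_seq: str, struct_seq: str) -> str:
    """Interleave residue tokens with structure tokens, one pair per site."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("both sequences must have one token per residue site")
    # Structure tokens are lowercased to match the SA sequence example in this guide.
    return "".join(aa + st.lower() for aa, st in zip(aa_seq, struct_seq))


def split_sa_sequence(sa_seq: str) -> tuple[str, str]:
    """Recover the residue string and the structure-token string."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("an SA sequence has two characters per residue site")
    return sa_seq[0::2], sa_seq[1::2]


sa = to_sa_sequence("MEETMKLATM", "dvvvppppap")
print(sa)
print(split_sa_sequence(sa))
```

The residue string `MEETMKLATM` is the single-sequence example used later in this guide; each residue is followed by the structure token for the same site.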
- Single AA Sequence
- Single SA Sequence
- Single UniProt ID
- Single PDB/CIF Structure
- Multiple AA Sequences
- Multiple SA Sequences
- Multiple UniProt IDs
- Multiple PDB/CIF Structures
- SaprotHub Dataset
For tasks that require two protein sequences as input (pair classification & pair regression):
- A pair of AA Sequences
- A pair of SA Sequences
- A pair of UniProt IDs
- A pair of PDB/CIF Structures
- Multiple pairs of AA Sequences
- Multiple pairs of SA Sequences
- Multiple pairs of UniProt IDs
- Multiple pairs of PDB/CIF Structures
- Go to Official SaProtHub Repository to find some datasets.
- Copy the `Dataset ID` for future use.
Function | Link |
---|---|
Get Structure-Aware Sequence | here |
Convert .fa file to .csv dataset (data type: Multiple AA Sequences) | here |
Randomly split your dataset | here |
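If you prefer to prepare the file yourself, the conversion and the random split can be sketched as below. This is an illustration under assumptions, not the linked tools: the helper names, the 80/10/10 split and the fixed seed are hypothetical choices, and the example sequences are made up.

```python
import random


def read_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def assign_stages(n: int, fractions=(0.8, 0.1, 0.1), seed: int = 0) -> list[str]:
    """Randomly label n samples as train / valid / test."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    stages = [""] * n
    for rank, idx in enumerate(order):
        if rank < n_train:
            stages[idx] = "train"
        elif rank < n_train + n_valid:
            stages[idx] = "valid"
        else:
            stages[idx] = "test"
    return stages


fasta = ">protein_a\nMEETMK\nLATM\n>protein_b\nMKLATMEETM\n"
records = read_fasta(fasta)
print(records)
print(assign_stages(10))
```

Each parsed sequence would become one row of the .csv file, with the assigned stage in its own column.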
- Official pretrained SaProt (35M)
- Official pretrained SaProt (650M)
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
- Multi-model on SaprotHub
Model type | Used for | Description | Input |
---|---|---|---|
Official pretrained SaProt (35M) | Training | Train a protein language model based on SaProt (35M) with your dataset. | - |
Official pretrained SaProt (650M) | Training | Train a protein language model based on SaProt (650M) with your dataset. | - |
Trained by yourself on ColabSaprot | Continual training, Prediction | Once you have finished training a model on ColabSaprot, select this option to use it for continual training or prediction. | Select the model from the dropdown menu |
Shared by peers on SaprotHub | Continual training, Prediction | Use models shared on SaprotHub for continual training or prediction. | Enter the model ID |
Saved in your local computer | Continual training, Prediction | Use a model saved on your local computer (the .zip file saved when training finished) for continual training or prediction. | Upload the .zip file |
Multi-models on SaprotHub | Prediction | Ensemble multiple models shared on SaprotHub for prediction. Each sample is predicted by every model. For classification tasks, voting determines the final predicted category; for regression tasks, the predicted values from the models are averaged. | Enter the model IDs |
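The two ensembling rules in the last row can be sketched as follows. This is only an illustration of majority voting and averaging; the function names are hypothetical, and how ColabSaprot breaks ties is not stated here, so the tie rule below is an assumption.

```python
from collections import Counter
from statistics import mean


def ensemble_classification(predictions: list[int]) -> int:
    """Majority vote over the class labels predicted by each model for one sample."""
    counts = Counter(predictions)
    # Assumption: on a tie, the label that appears first in the list wins.
    return counts.most_common(1)[0][0]


def ensemble_regression(predictions: list[float]) -> float:
    """Average the values predicted by each model for one sample."""
    return mean(predictions)


print(ensemble_classification([1, 0, 1]))
print(ensemble_regression([0.2, 0.4, 0.6]))
```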
- Go to Official SaProtHub Repository to find a model that fits your requirements.
- Copy the `Model ID` for future use.
For classification or regression tasks, you can train your model based on SaProt, or continually train a SaprotHub model (trained on ColabSaprot).
- Classification Task
- Regression Task
- Amino Acid Classification Task
- Pair Classification Task
- Pair Regression Task
Click here for detailed information on each model type.
- Official pretrained SaProt (35M)
- Official pretrained SaProt (650M)
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
The dataset should be a .csv file with three required columns: `sequence`, `label` and `stage`.
- The content of column `sequence` depends on your data type. See the table.
- The content of column `label` depends on your task type. See the table.
- The column `stage` indicates whether the sample is used for training, validation, or testing. Ensure your dataset includes samples for all three stages. The values are: `train`, `valid`, `test`.
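Before uploading, the requirements above can be checked locally. The following is a minimal sketch, assuming a protein-level classification dataset of AA sequences; `check_dataset` is a hypothetical helper and the three example rows are made up.

```python
import csv
import io

REQUIRED_COLUMNS = ("sequence", "label", "stage")
STAGES = {"train", "valid", "test"}


def check_dataset(csv_text: str) -> list[str]:
    """Return a list of problems found in the dataset (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["stage"] not in STAGES:
            problems.append(f"line {line_no}: unknown stage {row['stage']!r}")
        seen.add(row["stage"])
    # Every stage must have at least one sample.
    for stage in sorted(STAGES - seen):
        problems.append(f"no samples for stage {stage!r}")
    return problems


example = """sequence,label,stage
MEETMKLATM,1,train
MKLATMEETM,0,valid
MATMEEKLTM,1,test
"""
print(check_dataset(example))
```

An empty list means the required columns are present and all three stages have at least one sample.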
Data type | Interface | Input | Example |
---|---|---|---|
Multiple AA Sequences | An upload button | file: the .csv file | |
Multiple SA Sequences | An upload button | file: the .csv file | |
Multiple UniProt IDs | An upload button | file: the .csv file | |
Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing three columns: Sequence, type and chain<br />structure files: a .zip file containing all the structure files | type: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".<br />chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. |
SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |
Example of column `label` for different task types (the data type in these examples is Multiple SA Sequences)
Training config | Description |
---|---|
batch_size | batch_size depends on the number of training samples. "Adaptive" (the default choice) sets the batch size automatically according to your data size. If your training set is large enough, you can use 32, 64, 128, 256, ...; otherwise it can be set to 8, 4 or 2. Note that you cannot use a larger batch size on the default Colab T4 GPU; we strongly suggest subscribing to Colab Pro for an A100 GPU. |
max_epochs | max_epochs refers to the maximum number of training iterations (epochs). A larger value needs more training time. The best model will be saved after each iteration. You can adjust max_epochs to control training duration. Note that the maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users. |
learning_rate | learning_rate affects the convergence speed of the model. Through experimentation, we have found that 5.0e-4 is a good default value for the base model Official pretrained SaProt (650M) and 1.0e-3 for Official pretrained SaProt (35M). |
Note that you can expand the code cell to adjust `GPU_batch_size` and `accumulate_grad_batches` to control the number of samples used for each training step. If you do this, the `batch_size` selected in the dropdown menu will be overridden.
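As a rough guide to choosing the two values, the sketch below assumes standard gradient accumulation on a single GPU, where the number of samples per optimizer step is `GPU_batch_size * accumulate_grad_batches`. The helper `split_batch_size` and the memory limit passed to it are hypothetical; check the code cell itself for how ColabSaprot uses these settings.

```python
def split_batch_size(batch_size: int, max_gpu_batch_size: int) -> tuple[int, int]:
    """Pick (GPU_batch_size, accumulate_grad_batches) whose product equals batch_size.

    max_gpu_batch_size is the largest per-step batch that fits in GPU memory
    (an assumption you have to determine for your own model and GPU).
    """
    if batch_size < 1 or max_gpu_batch_size < 1:
        raise ValueError("both arguments must be positive")
    gpu_batch_size = min(batch_size, max_gpu_batch_size)
    # Shrink until it divides the target batch size evenly.
    while batch_size % gpu_batch_size != 0:
        gpu_batch_size -= 1
    return gpu_batch_size, batch_size // gpu_batch_size


gpu_batch_size, accumulate_grad_batches = split_batch_size(64, 8)
print(gpu_batch_size, accumulate_grad_batches)
```

With a target of 64 samples per step and room for 8 samples on the GPU, this yields 8 samples per forward pass accumulated over 8 passes.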
You can upload the model to your Huggingface repository and then contribute it to SaprotHub.
You need to add some description for your model:
- `name`: The name of your model.
- `description`: The description of your model (which task your model is used for).
- `label_meanings`: For a classification model, please provide detailed information about the meanings of all labels; for a regression model, please provide the numerical range of the value.
For example, in a Subcellular Localization Classification Task with 10 categories, label=0 means the protein is located in the Nucleus, label=1 means the protein is located in the Cytoplasm, and so on. The information should be provided as follows:
Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome
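To double-check that the order of the names matches your label indices, the string can be expanded into an index-to-meaning mapping. This is a small sketch; `parse_label_meanings` is a hypothetical helper, not a ColabSaprot function.

```python
def parse_label_meanings(text: str) -> dict[int, str]:
    """Map each label index to its meaning, in the order the names are listed."""
    names = [name.strip() for name in text.split(",")]
    return dict(enumerate(names))


label_meanings = (
    "Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, "
    "Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome"
)
mapping = parse_label_meanings(label_meanings)
for label, meaning in mapping.items():
    print(f"label={label}: {meaning}")
```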
You can also edit the model card (readme.md) to provide more information such as `Dataset description`, `Performance` and so on.
Step 1
Complete the input and selection of Task Configs
- `task_name` is the name of the training task you're working on.
- `task_objective` describes the goal of your task, such as sorting protein sequences into categories or predicting the values of some protein properties.
- `base_model` is the base model you use for training. By default, it is set to the officially pretrained SaProt, but you can also use models retrained by yourself on ColabSaprot or shared on SaprotHub. For example, you can choose `Trained-by-peers` to retrain a SaProt model shared by others on your own data. A wide range of retrained models is available on SaprotHub.
- `data_type` indicates the kind of data you're using, which is determined by the dataset file you upload. You can find more details about the formats for different types of data in the provided instruction.
Step 2
Click the run button to apply the configs.
Step 3
After clicking the "Run" button, additional input boxes will appear.
Complete the input of additional information and upload files.
(Note: Do not click the "Run" button of the next cell before completing the input and upload.)
Step 4
Complete the input of training configs
- `batch_size` depends on the number of training samples. If your training set is large enough, we recommend using 32, 64, 128, 256, ...; otherwise it can be set to 8, 4 or 2. (Note that you cannot use a larger batch size if you use the default Colab T4 GPU. We strongly suggest subscribing to Colab Pro for an A100 GPU.)
- `max_epochs` refers to the maximum number of training iterations (epochs). A larger value needs more training time. The best model will be saved after each iteration. You can adjust `max_epochs` to control training duration. (Note that the maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users.)
- `learning_rate` affects the convergence speed of the model. Through experimentation, we have found that `5.0e-4` is a good default value for the base model `Official pretrained SaProt (650M)` and `1.0e-3` for `Official pretrained SaProt (35M)`.
Step 5
Click the "Run" button to start training.
You can monitor the training process by these plots. After training, check the training results and the saved model.
- Classification Task
- Regression Task
- Amino Acid Classification Task
- Pair Classification Task
- Pair Regression Task
Click here for detailed information on each model type.
- Trained by yourself on ColabSaprot
- Shared by peers on SaprotHub
- Saved in your local computer
- Multi-model on SaprotHub
Data type | Interface | Input | Example |
---|---|---|---|
Single AA Sequence | An input box | sequence: the amino acid sequence | sequence: MEETMKLATM |
Single SA Sequence | An input box | sequence: the structure-aware sequence | sequence: MdEvEvTvMpKpLpApTaMp |
Single UniProt ID | An input box | sequence: the UniProt ID | sequence: O95905 |
Single PDB/CIF Structure | Two input boxes and an upload button | type: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2".<br />chain: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.<br />structure file: the .pdb/.cif structure file | type: AF2<br />chain: A<br />structure file: O95905.pdb |
Multiple AA Sequences | An upload button | file: the .csv file | |
Multiple SA Sequences | An upload button | file: the .csv file | |
Multiple UniProt IDs | An upload button | file: the .csv file | |
Multiple PDB/CIF Structures | Two upload buttons | file: a .csv file containing three columns: Sequence, type and chain<br />structure files: a .zip file containing all the structure files | |
SaprotHub Dataset | An input box | Dataset ID: SaprotHub Dataset ID | Find more datasets on SaprotHub |
Step 1
Complete the input and selection of Task Configs.
- `task_objective` describes the goal of your task, such as sorting protein sequences into categories or predicting the values of some protein properties.
- `use_model_from` depends on whether you want to use a local model or a Hugging Face model. If you choose `Shared by peers on SaprotHub`, please enter the Hugging Face model ID in the input box. If you choose `Local Model`, simply select your local model from the options. Additionally, a wide range of models is available on SaprotHub.
- `data_type` indicates the kind of data you're using, which determines the dataset file you should upload. You can find more details about the formats for different types of data in the provided instruction.
Step 2
Click the run button to apply the configs.
Step 3
After clicking the "Run" button, additional input boxes and upload button will appear.
Complete the input of additional information and upload files.
(Note: Do not click the "Run" button of the next cell before completing the input and upload.)
Step 4
Click the run button to start predicting. Check your results after finishing prediction.
- Single-site or Multi-site mutagenesis
- Saturation mutagenesis
The default model is `Official pretrained SaProt (650M)`.
Mutation information is represented as follows:
mode | mutation information |
---|---|
Single-site mutagenesis | H87Y |
Multi-site mutagenesis | H87Y:V162M:P179L:P179R |
- For `Single-site mutagenesis`, we use a term like "H87Y" to denote the mutation, where the first letter represents the original amino acid, the number in the middle represents the mutation site (indexed starting from 1), and the last letter represents the mutated amino acid.
- For `Multi-site mutagenesis`, we use a colon ":" to connect the single-site mutations, such as "H87Y:V162M:P179L:P179R".
- For `Saturation mutagenesis`, the mutation dataset is the same as the dataset used for classification/regression prediction tasks.
- For `Single-site or Multi-site mutagenesis`, one additional piece of information is required: `mutation`.
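The mutation notation described above can be parsed and checked against a wild type sequence with a few lines of code. This is a minimal sketch; `Mutation`, `parse_mutations` and `mismatches` are hypothetical names, not part of ColabSaprot, and the short test sequence is only for illustration.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    wild_type: str  # original amino acid
    position: int   # mutation site, indexed from 1
    mutant: str     # mutated amino acid


_TERM = re.compile(r"([A-Z])([1-9][0-9]*)([A-Z])")


def parse_mutations(info: str) -> list[Mutation]:
    """Parse 'H87Y' or colon-separated terms such as 'H87Y:V162M'."""
    mutations = []
    for term in info.split(":"):
        match = _TERM.fullmatch(term.strip())
        if match is None:
            raise ValueError(f"not a valid mutation term: {term!r}")
        wild_type, position, mutant = match.groups()
        mutations.append(Mutation(wild_type, int(position), mutant))
    return mutations


def mismatches(sequence: str, mutations: list[Mutation]) -> list[Mutation]:
    """Return mutations whose original amino acid disagrees with the sequence."""
    return [
        m for m in mutations
        if m.position > len(sequence) or sequence[m.position - 1] != m.wild_type
    ]


print(parse_mutations("H87Y:V162M:P179L:P179R"))
print(mismatches("MEETMKLATM", parse_mutations("E2A:K3R")))
```

Checking the original amino acid against the sequence is a quick way to catch off-by-one errors in the site index before running a prediction.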
Step 1
Complete the selection of Task Configs.
- `mutation_task` indicates the type of mutation task. You can choose from `Single-site or Multi-site mutagenesis` and `Saturation mutagenesis`.
- `data_type` indicates the kind of data you're using, which determines the dataset file you should upload. You can find more details about the formats for different types of data in the provided instruction.
Step 2
Click the run button to apply the configs.
Step 3
After clicking the "Run" button, additional input boxes and upload button will appear.
For a single sequence, enter the sequence and the mutation information into the corresponding input fields. (Note that for Saturation mutagenesis, you won't see the Mutation input box.)
For multiple sequences, click the upload button to upload your dataset. (Note that for Saturation mutagenesis, you don't need to provide mutation information in your dataset, which means only the `sequence` column is required in the .csv dataset.)
Step 4
Click the run button to start predicting. Check your results after finishing prediction.
- For a single sequence, the predicted score will be shown in the output.
- For multiple sequences, the predicted score will be saved in a .csv file.
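For intuition about why Saturation mutagenesis needs no mutation information, the sketch below enumerates every single-site substitution of a sequence using the notation from this guide. It assumes the 20 standard amino acids; how ColabSaprot enumerates and scores the variants internally is not described here, and `saturation_mutations` is a hypothetical helper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def saturation_mutations(sequence: str):
    """Yield every single-site substitution, e.g. 'M1A', for a sequence."""
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


variants = list(saturation_mutations("MEETMKLATM"))
print(len(variants))
print(variants[:3])
```

A sequence of length L made of standard amino acids gives 19 × L variants, so the set is fully determined by the sequence alone.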
- `method` refers to the prediction method. It can be either `argmax` or `multinomial`.
  - `argmax` selects the amino acid with the highest probability.
  - `multinomial` samples an amino acid from the multinomial distribution.
- `num_samples` refers to the number of output amino acid sequences.
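The difference between the two methods can be illustrated for a single position. This is a sketch only: the probabilities are made up, `pick_residue` is a hypothetical function, and the real model produces a distribution over amino acids at every masked site.

```python
import random


def pick_residue(probabilities: dict[str, float], method: str = "argmax", rng=None) -> str:
    """Choose one amino acid from a per-position probability table."""
    if method == "argmax":
        # Deterministic: always the most probable amino acid.
        return max(probabilities, key=probabilities.get)
    if method == "multinomial":
        # Stochastic: each amino acid is drawn in proportion to its probability.
        rng = rng or random.Random()
        residues = list(probabilities)
        weights = [probabilities[r] for r in residues]
        return rng.choices(residues, weights=weights, k=1)[0]
    raise ValueError(f"unknown method: {method!r}")


probs = {"A": 0.1, "L": 0.7, "G": 0.2}  # made-up values for illustration
print(pick_residue(probs, "argmax"))
samples = [pick_residue(probs, "multinomial", random.Random(seed)) for seed in range(5)]
print(samples)
```

With `argmax` every output sequence is identical, so asking for several samples is mainly useful with `multinomial`.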
The default model is `Official pretrained SaProt (650M)`.
PDB/CIF file
Step 1
Click the run button to upload the structure file, which could be in the format of .pdb or .cif file.
Step 2
After clicking the "Run" button, additional input boxes and upload button will appear.
Step 3
After uploading the structure file, it will be transformed into an AA sequence and a structure sequence.
Use '#' to mask some amino acids for prediction.
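If you want to mask many sites, it may be easier to build the masked AA sequence programmatically and paste it into the input box. The sketch below is an assumption-laden illustration: `mask_positions` is a hypothetical helper, and the 1-based indexing is chosen here to match the mutation notation used in this guide.

```python
def mask_positions(aa_seq: str, positions: list[int], mask: str = "#") -> str:
    """Replace the amino acids at the given 1-based positions with the mask symbol."""
    residues = list(aa_seq)
    for pos in positions:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        residues[pos - 1] = mask
    return "".join(residues)


print(mask_positions("MEETMKLATM", [2, 3]))
```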
Step 4
Choose the prediction method.
Step 5
Click the run button to make prediction.
Before contributing to SaprotHub, you need to join the SaprotHub Huggingface Organization to gain write access to the subset of repos within the Organization that you have created.
You have two ways to contribute to SaprotHub:
- Transfer your model to SaprotHub (Recommended)
- Create a new model repository and upload model files
Transfer your model to SaprotHub (Recommended)
Once you have uploaded the model to your Huggingface repository using ColabSaprot, you can directly transfer your model to SaprotHub.
Create a new model repository and upload model files
You can manually create a new model repository on SaprotHub, and then upload the model files to this repository.