Quick start

Click the link to jump to the section you're interested in.
Follow the instruction and video to prepare your task, model and dataset.
Finish your task with only a few clicks!

Instruction	Video
How to train your model	-YouTube - Bilibili
How to use model for classification/regression prediction	-YouTube
How to use model for mutational effect prediction
How to use model for inverse folding prediction
How to contribute to SaprotHub

Overview

Task

Different models are designed for different tasks, so it's essential to understand which type your task belongs to.

To view the full list of tasks supported by ColabSaprot, please refer to task_list.md.

Task type

Here are the task types and their descriptions, so you can identify your task type from your task description and objectives.

For Classification and Regression prediction task:

Classification Task
Regression Task
Amino Acid Classification Task
Pair Classification Task
Pair Regression Task

For Zero-shot prediction task:

Mutational effect prediction
Inverse folding prediction

Classification and Regression prediction task

Train a model based on SaProt and use it to make predictions.

Task Type	Task Description	Example
Classification (Protein-level Classification)	Classify protein sequences.	- Fold Class Prediction - Localization Prediction - Function Prediction
Regression (Protein-level Regression)	Predict the value of some property of a protein sequence.	- Thermal Stability Prediction - Fluorescence Intensity Prediction - Binding Affinity Prediction
Amino Acid Classification (Residue-level Classification)	Classify the amino acids in a protein sequence.	- Secondary Structure Prediction - Binding Site Prediction - Active Site Prediction
Pair Classification	Predict whether the two proteins interact.	- Protein-Protein Interaction (PPI) Prediction - Interaction Type Classification - Disease-Associated Interaction Prediction
Pair Regression	Predict the ability of interaction between the two proteins.	- Interaction Strength Prediction - Binding Free Energy Calculation - Interaction Affinity Prediction

Zero-shot prediction task

Directly use SaProt (650M) to make predictions.

Task Type	Task Description	Example
Mutational Effect Prediction	Predict the mutational effect based on the wild type sequence and mutation information.	- Enzyme Activity Prediction - Virus Fitness Prediction - Driver Mutation Prediction
Inverse Folding Prediction	Predict the residue sequence given the structure backbone.	- Enzyme Function Optimization - Protein Stability Enhancement - Protein Folding Prediction

Dataset

You can use your private data to train and predict. Below are the various data formats corresponding to different data types.

What is SA (Structure-aware) Sequence

We combine the residue and structure tokens at each residue site to create a Structure-aware sequence (SA sequence), merging both residue and structural information.

The structure tokens are generated by encoding the 3D structure of proteins using Foldseek.

Here you can convert your data into SA Sequence format.
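
To make the idea concrete, the sketch below interleaves an amino acid sequence with a string of structure tokens, one token per residue. It is only an illustration: in practice the structure tokens come from Foldseek, and both the helper names (`combine_to_sa`, `split_sa`) and the token string are assumptions. The tokens were read off the example SA sequence `MdEvEvTvMpKpLpApTaMp` that appears later in this tutorial.

```python
def combine_to_sa(aa_seq: str, struct_tokens: str) -> str:
    """Interleave residues (upper case) with structure tokens (lower case)."""
    if len(aa_seq) != len(struct_tokens):
        raise ValueError("Need exactly one structure token per residue")
    return "".join(
        aa + tok for aa, tok in zip(aa_seq.upper(), struct_tokens.lower())
    )


def split_sa(sa_seq: str) -> tuple[str, str]:
    """Recover the residue string and the structure-token string."""
    return sa_seq[0::2], sa_seq[1::2]


# Hypothetical structure tokens for a 10-residue protein.
sa = combine_to_sa("MEETMKLATM", "dvvvppppap")
print(sa)            # MdEvEvTvMpKpLpApTaMp
print(split_sa(sa))  # ('MEETMKLATM', 'dvvvppppap')
```

An SA sequence is therefore twice as long as the AA sequence it was built from.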

Data Type

Single AA Sequence
Single SA Sequence
Single UniProt ID
Single PDB/CIF Structure
Multiple AA Sequences
Multiple SA Sequences
Multiple UniProt IDs
Multiple PDB/CIF Structures
SaprotHub Dataset

For tasks that require two protein sequences as input (pair classification & pair regression):

A pair of AA Sequences
A pair of SA Sequences
A pair of UniProt IDs
A pair of PDB/CIF Structures
Multiple pairs of AA Sequences
Multiple pairs of SA Sequences
Multiple pairs of UniProt IDs
Multiple pairs of PDB/CIF Structures

How to find a SaprotHub Dataset

Go to Official SaProtHub Repository to find some datasets.
Copy the Dataset ID for future use.

Scripts for dataset preparation

	Link
Get Structure-Aware Sequence	here
Convert .fa file to .csv dataset (data type:`Multiple AA sequences`)	here
Randomly split your dataset	here
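
For readers who want to see what the .fa to .csv conversion involves, here is a minimal sketch. It is not the linked script, which may behave differently; the function name `fasta_to_sequences` and the toy FASTA text are assumptions. It reads FASTA records and writes a `sequence` column, which is the column the `Multiple AA Sequences` data type is described as using.

```python
import csv
import io


def fasta_to_sequences(fasta_text: str) -> list[str]:
    """Return one sequence per FASTA record, joining wrapped lines."""
    sequences, chunks, in_record = [], [], False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if in_record:
                sequences.append("".join(chunks))
            chunks, in_record = [], True
        elif in_record:
            chunks.append(line)
    if in_record:
        sequences.append("".join(chunks))
    return sequences


# Toy FASTA content (made up for illustration).
fasta = ">protein_1\nMEETM\nKLATM\n>protein_2\nMKLAT\n"
seqs = fasta_to_sequences(fasta)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sequence"])
writer.writerows([s] for s in seqs)
print(buffer.getvalue())
```

For training, the `label` and `stage` columns still have to be added to the resulting file.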

Model

Model type

Official pretrained SaProt (35M)
Official pretrained SaProt (650M)
Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer
Multi-model on SaprotHub

Model type	Used for	Description	Input
`Official pretrained SaProt (35M)`	Training	Train a protein language model based on SaProt(35M) with your dataset	-
`Official pretrained SaProt (650M)`	Training	Train a protein language model based on SaProt(650M) with your dataset	-
`Trained by yourself on ColabSaprot`	Continually training, Prediction	Once you have completed training the model, select this option to use the model you have trained on ColabSaprot for continual training or prediction	Select the model from the dropdown menu
`Shared by peers on SaprotHub`	Continually training, Prediction	Use models shared on SaprotHub for continual training or prediction	Enter the model ID
`Saved in your local computer`	Continually training, Prediction	Use models saved on your local computer (.zip file which were saved when finishing training) for continual training or prediction	Upload the .zip file
`Multi-models on SaprotHub`	Prediction	Ensemble multiple models shared on SaprotHub for prediction. Each sample is predicted by every model. For classification tasks, voting determines the final predicted category; for regression tasks, the predicted values from all models are averaged.	Enter the model IDs
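
The ensembling rule for the multi-model option can be sketched in a few lines. This is an illustration of voting and averaging in general, not ColabSaprot's actual code; the function names are hypothetical, and how ColabSaprot breaks ties in a vote is not stated in this tutorial.

```python
from collections import Counter
from statistics import mean


def ensemble_classification(predictions: list[int]) -> int:
    """Majority vote over the category predicted by each model."""
    # Ties go to the label that appears first in the list; this is an
    # arbitrary choice made for the sketch.
    return Counter(predictions).most_common(1)[0][0]


def ensemble_regression(predictions: list[float]) -> float:
    """Average of the values predicted by each model."""
    return mean(predictions)


# Three hypothetical models predicting the same sample.
print(ensemble_classification([1, 0, 1]))    # 1
print(ensemble_regression([0.2, 0.4, 0.6]))
```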

How to find a model on SaprotHub

Go to Official SaProtHub Repository to find a model that fits your requirements.
Copy the Model ID for future use.

How to train your model

For a classification or regression task, you can train your model based on SaProt, or continue training a SaprotHub model (trained on ColabSaprot).

Video

Task type

Classification Task
Regression Task
Amino Acid Classification Task
Pair Classification Task
Pair Regression Task

Base model

Click here for detailed information on each model type.

Official pretrained SaProt (35M)
Official pretrained SaProt (650M)
Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer

Training dataset

The dataset should be a .csv file with three required columns: sequence, label and stage.

The content of column sequence depends on your data type. See the table.
The content of column label depends on your task type. See the table.
The column stage indicates whether the sample is used for training, validation, or testing. Ensure your dataset includes samples for all three stages. The values are: train, valid, test.
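
As a rough sketch of how such a file could be produced, the snippet below shuffles a list of samples, tags each one with a `stage` value, and writes the three required columns. The helper `assign_stages`, the 80/10/10 proportions, and the toy data are assumptions made for illustration; the official split script may work differently.

```python
import csv
import io
import random


def assign_stages(rows, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows and tag each one with train / valid / test."""
    rows = [dict(r) for r in rows]
    random.Random(seed).shuffle(rows)
    # Keep at least one sample in the validation and test stages.
    n_valid = max(1, int(len(rows) * valid_frac))
    n_test = max(1, int(len(rows) * test_frac))
    for i, row in enumerate(rows):
        if i < n_valid:
            row["stage"] = "valid"
        elif i < n_valid + n_test:
            row["stage"] = "test"
        else:
            row["stage"] = "train"
    return rows


# Toy data: ten made-up sequences with binary labels.
toy = [{"sequence": "MEETMKLATM"[: 5 + i % 5], "label": i % 2} for i in range(10)]
split = assign_stages(toy)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sequence", "label", "stage"])
writer.writeheader()
writer.writerows(split)
print(buffer.getvalue())
```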

Data type	Interface	Input	Example
`Multiple AA Sequences`	An upload button	`file`: the .csv file
`Multiple SA Sequences`	An upload button	`file`: the .csv file
`Multiple UniProt IDs`	An upload button	`file`: the .csv file
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing three columns: `sequence`, `type` and `chain` `structure files`: a .zip file containing all the structure files	`type`: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default.
`SaprotHub Dataset`	An input box	`Dataset ID`: SaprotHub Dataset ID	Find more datasets on SaprotHub

Examples of column label for different task types (the data type in these examples is Multiple SA sequences)

Task type	Label	Description
Protein-level classification	Category index starting from zero	- The task has 2 protein sequence categories: 0, 1. - Each protein sequence has a corresponding category index.
Protein-level regression	Numerical values	- Each protein sequence has a corresponding numerical label to represent the value of some property.
Residue-level classification	A list of category indices for each amino acid	- The task has 3 amino acid categories: 0, 1, 2. - Each amino acid has a corresponding category index.
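
For residue-level classification, a quick sanity check is that every amino acid has exactly one label and that every label is a valid category index. The sketch below assumes the label has already been loaded as a Python list of integers and that the sequence is an SA sequence (two characters per residue); the function name `check_residue_labels` is hypothetical.

```python
def check_residue_labels(sa_seq: str, labels: list[int], num_classes: int) -> None:
    """Raise ValueError if the labels do not match the SA sequence."""
    num_residues = len(sa_seq) // 2  # one residue char + one structure token
    if len(labels) != num_residues:
        raise ValueError(
            f"Expected {num_residues} labels, got {len(labels)}"
        )
    bad = [lab for lab in labels if not 0 <= lab < num_classes]
    if bad:
        raise ValueError(f"Labels outside 0..{num_classes - 1}: {bad}")


# Four residues, three categories (0, 1, 2): passes silently.
check_residue_labels("MdEvEvTv", [0, 1, 2, 0], num_classes=3)
```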

Training config

Training config	Description
`batch_size`	`batch_size` depends on the number of training samples. "Adaptive" (the default choice) sets the batch size automatically according to your data size. If your training data set is large enough, you can use 32, 64, 128, 256, ...; otherwise use 8, 4 or 2. Note that you cannot use a larger batch size on the default Colab T4 GPU; we strongly suggest subscribing to Colab Pro for an A100 GPU.
`max_epochs`	`max_epochs` is the maximum number of training epochs. A larger value needs more training time. The best model is saved after each epoch. You can adjust `max_epochs` to control training duration. Note that the maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users.
`learning_rate`	`learning_rate` affects the convergence speed of the model. Through experimentation, we have found that `5.0e-4` is a good default value for base model `Official pretrained SaProt (650M)` and `1.0e-3` for `Official pretrained SaProt (35M)`.

Note that: You can expand the code cell to adjust GPU_batch_size and accumulate_grad_batches to control the number of samples used for each training step. If you do this, the batch_size selected in the dropdown menu will be overridden.
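
The relationship between these two settings can be sketched with simple arithmetic. The snippet assumes the usual gradient accumulation convention, in which the number of samples per optimizer step is the per-GPU batch size multiplied by the number of accumulated batches (on a single GPU); the helper names are hypothetical and the tutorial itself does not spell out the formula.

```python
def effective_batch_size(gpu_batch_size: int, accumulate_grad_batches: int) -> int:
    """Samples contributing to one optimizer step on a single GPU."""
    return gpu_batch_size * accumulate_grad_batches


def accumulation_steps(target_batch_size: int, gpu_batch_size: int) -> int:
    """Smallest number of accumulated batches that reaches the target."""
    return -(-target_batch_size // gpu_batch_size)  # ceiling division


# A small per-GPU batch that fits in memory, accumulated 16 times.
print(effective_batch_size(2, 16))   # 32
print(accumulation_steps(32, 4))     # 8
```

Accumulating gradients trades training speed for memory, which is one way to reach a larger effective batch size on a small GPU.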

Upload model

You can upload the model to your Huggingface repository and then contribute it to SaprotHub.

You need to add some description for your model:

name: The name of your model.
description: The description of your model (which task is your model used for).
label_meanings: For classification model, please provide detailed information about the meanings of all labels; for regression model, please provide the numerical range of the value.

For example, in a Subcellular Localization Classification Task with 10 categories, label=0 means the protein is located in the Nucleus, label=1 means the protein is located in the Cytoplasm, and so on. The information should be provided as follows:

Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome
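
The comma-separated string is ordered by label index, so it can be turned back into a lookup table as sketched below. The variable names are illustrative only.

```python
label_meanings = (
    "Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell.membrane, "
    "Endoplasmic.reticulum, Plastid, Golgi.apparatus, Lysosome/Vacuole, Peroxisome"
)

# Position in the list is the label index: 0 -> Nucleus, 1 -> Cytoplasm, ...
index_to_meaning = {
    index: name.strip() for index, name in enumerate(label_meanings.split(","))
}

print(index_to_meaning[0])    # Nucleus
print(index_to_meaning[1])    # Cytoplasm
print(len(index_to_meaning))  # 10
```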

You can also edit the model card (readme.md) to provide more information such as Dataset description, Performance and so on.

Instruction

Step 1

Complete the input and selection of Task Configs

task_name is the name of the training task you're working on.
task_objective describes the goal of your task, like sorting protein sequences into categories or predicting the values of some protein properties.
base_model is the base model you use for training. By default, it's set to the official pretrained SaProt, but you can also use models retrained by yourself on ColabSaprot or shared on SaprotHub. For example, choose Trained-by-peers to continue training a SaProt model shared by others with your own data. A wide range of retrained models is available on SaprotHub.
data_type indicates the kind of data you're using, which is determined by the dataset file you upload. You can find more details about the formats for different types of data in the provided instruction.

Step 2

Click the run button to apply the configs.

Step 3

After clicking the "Run" button, additional input boxes will appear.

Complete the input of additional information and upload files.

(Note: Do not click the "Run" button of the next cell before completing the input and upload.)

Step 4

Complete the input of training configs

batch_size depends on the number of training samples. If your training data set is large enough, we recommend using 32, 64, 128, 256, ...; otherwise use 8, 4 or 2. (Note that you cannot use a larger batch size if you use the default Colab T4 GPU. We strongly suggest subscribing to Colab Pro for an A100 GPU.)
max_epochs is the maximum number of training epochs. A larger value needs more training time. The best model is saved after each epoch. You can adjust max_epochs to control training duration. (Note that the maximum running time of Colab is 12 hours for unsubscribed users or 24 hours for Colab Pro+ users.)
learning_rate affects the convergence speed of the model. Through experimentation, we have found that 5.0e-4 is a good default value for base model Official pretrained SaProt (650M) and 1.0e-3 for Official pretrained SaProt (35M).

Step 5

Click the "Run" button to start training.

You can monitor the training process by these plots. After training, check the training results and the saved model.

How to use model for classification/regression prediction

Video

Task type

Classification Task
Regression Task
Amino Acid Classification Task
Pair Classification Task
Pair Regression Task

Model

Click here for detailed information on each model type.

Trained by yourself on ColabSaprot
Shared by peers on SaprotHub
Saved in your local computer
Multi-model on SaprotHub

Dataset

Data type	Interface	Input	Example
`Single AA Sequence`	An input box	`sequence`: the amino acid sequence	`sequence`: MEETMKLATM
`Single SA Sequence`	An input box	`sequence`: the structure-aware sequence	`sequence`: MdEvEvTvMpKpLpApTaMp
`Single UniProt ID`	An input box	`sequence`: the UniProt ID	`sequence`: O95905
`Single PDB/CIF structure`	Two input boxes and an upload button	`type`: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. `structure file`: the .pdb/.cif structure file	`type`: AF2 `chain`: A `structure file`: O95905.pdb
`Multiple AA Sequences`	An upload button	`file`: the .csv file
`Multiple SA Sequences`	An upload button	`file`: the .csv file
`Multiple UniProt IDs`	An upload button	`file`: the .csv file
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing three columns: `sequence`, `type` and `chain` `structure files`: a .zip file containing all the structure files
`SaprotHub Dataset`	An input box	`Dataset ID`: SaprotHub Dataset ID	Find more datasets on SaprotHub

Instruction

Step 1

Complete the input and selection of Task Configs.

task_objective describes the goal of your task, like sorting protein sequences into categories or predicting the values of some protein properties.
use_model_from depends on whether you want to use a local model or a Huggingface model. If you choose Shared by peers on SaprotHub, please enter the Hugging Face model ID in the input box. If you choose Local Model, simply select your local model from the options. Additionally, there's a wide range of models available on SaprotHub.
data_type indicates the kind of data you're using, which determines the dataset file you should upload. You can find more details about the formats for different types of data in the provided instruction.

Step 2

Click the run button to apply the configs.

Step 3

After clicking the "Run" button, additional input boxes and an upload button will appear.

Complete the input of additional information and upload files.

(Note: Do not click the "Run" button of the next cell before completing the input and upload.)

Step 4

Click the run button to start predicting. Check your results after finishing prediction.

How to use model for mutational effect prediction

Mutation Task

Single-site or Multi-site mutagenesis
Saturation mutagenesis

Model

Default model is Official pretrained SaProt (650M).

Mutation information

Mutation information is represented as follows:

mode	mutation information
Single-site mutagenesis	H87Y
Multi-site mutagenesis	H87Y:V162M:P179L:P179R

For Single-site mutagenesis, we use a term like "H87Y" to denote the mutation, where the first letter represents the original amino acid, the number in the middle represents the mutation site (indexed starting from 1), and the last letter represents the mutated amino acid.
For Multi-site mutagenesis, we use a colon ":" to join the single-site mutations, such as "H87Y:V162M:P179L:P179R".
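
The notation is easy to parse mechanically. The sketch below splits a mutation string on ":", checks each term against the pattern letter-number-letter, and applies the mutations to an amino acid sequence using 1-based positions. The function names are hypothetical and this is not ColabSaprot's own parser; the test sequence `MEETMKLA` is the amino acid part of the SA sequence `MdEvEvTvMpKpLpAp` used as an example in this tutorial.

```python
import re

_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(mutation: str) -> list[tuple[str, int, str]]:
    """Split 'H87Y:V162M' into [('H', 87, 'Y'), ('V', 162, 'M')]."""
    parsed = []
    for term in mutation.split(":"):
        match = _MUTATION.match(term)
        if match is None:
            raise ValueError(f"Malformed mutation term: {term!r}")
        wild_type, position, mutant = match.groups()
        parsed.append((wild_type, int(position), mutant))
    return parsed


def apply_mutations(aa_seq: str, mutation: str) -> str:
    """Return the mutated sequence; positions are indexed from 1."""
    residues = list(aa_seq)
    for wild_type, position, mutant in parse_mutations(mutation):
        if not 1 <= position <= len(residues):
            raise ValueError(f"Position {position} is outside the sequence")
        if aa_seq[position - 1] != wild_type:
            raise ValueError(
                f"Expected {wild_type} at position {position}, "
                f"found {aa_seq[position - 1]}"
            )
        residues[position - 1] = mutant
    return "".join(residues)


print(parse_mutations("H87Y:V162M"))
print(apply_mutations("MEETMKLA", "M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P"))  # HLQAPYVP
```

Checking the wild-type letter against the sequence catches off-by-one errors in the site index before any prediction is run.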

Mutation dataset

For Saturation mutagenesis, the mutation dataset is the same as the dataset used for classification/regression prediction tasks.
For Single-site or Multi-site mutagenesis, one more piece of information is required: mutation.
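
Saturation mutagenesis needs no mutation column because the set of mutations follows from the sequence itself: every position is substituted with every other amino acid. The sketch below enumerates that set in the notation described above, assuming the 20 standard amino acids; the function name is hypothetical and the tutorial does not say how ColabSaprot enumerates mutations internally.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def saturation_mutations(aa_seq: str) -> list[str]:
    """All single-site substitutions, written like 'M1A' (1-based sites)."""
    return [
        f"{wild_type}{position}{mutant}"
        for position, wild_type in enumerate(aa_seq, start=1)
        for mutant in AMINO_ACIDS
        if mutant != wild_type
    ]


mutations = saturation_mutations("MEET")
print(len(mutations))   # 76, i.e. 4 sites x 19 substitutions
print(mutations[:3])    # ['M1A', 'M1C', 'M1D']
```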

Data type	Interface	Input	Example
`Single SA Sequence`	Two input boxes	`sequence`: the structure-aware sequence `mutation`: the mutation information	`sequence`: MdEvEvTvMpKpLpAp `mutation`: M1H:E2L:E3Q:T4A:M5P:K6Y:L7V:A8P
`Single UniProt ID`	Two input boxes	`sequence`: the UniProt ID `mutation`: the mutation information	`sequence`: O95905 `mutation`: H87Y:V162M:P179L
`Single PDB/CIF structure`	Three input boxes and an upload button	`type`: Indicates whether the structure file is a real PDB structure or an AlphaFold 2 predicted structure. For AF2 (AlphaFold 2) structures, we will apply pLDDT masking. The value must be either "PDB" or "AF2". `chain`: For real PDB structures, since multiple chains may exist in one .pdb file, it is necessary to specify which chain is used. For AF2 structures, the chain is assumed to be A by default. `structure file`: the .pdb/.cif structure file `mutation`: the mutation information	`type`: AF2 `chain`: A `structure file`: O95905.pdb `mutation`: H87Y:V162M:P179L
`Multiple SA Sequences`	An upload button	`file`: the .csv file containing two columns: `sequence` and `mutation`
`Multiple UniProt IDs`	An upload button	`file`: the .csv file containing two columns: `sequence` and `mutation`
`Multiple PDB/CIF Structures`	Two upload buttons	`file`: a .csv file containing four columns: `sequence`, `type`, `chain` and `mutation` `structure files`: a .zip file containing all the structure files

Instruction

Step 1

Complete the selection of Task Configs.

mutation_task indicates the type of mutation task. You can choose from Single-site or Multi-site mutagenesis and Saturation mutagenesis.
data_type indicates the kind of data you're using, which determines the dataset file you should upload. You can find more details about the formats for different types of data in the provided instruction.

Step 2

Click the run button to apply the configs.

Step 3

After clicking the "Run" button, additional input boxes and an upload button will appear.

For a single sequence, enter the sequence and the mutation information into the corresponding input fields. (Note that for Saturation mutagenesis, you won't see the Mutation input box.)

For multiple sequences, click the upload button to upload your dataset. (Note that for Saturation mutagenesis, you don’t need to provide mutation information in your dataset, which means only sequence column is required in the .csv dataset.)

Step 4

Click the run button to start predicting. Check your results after finishing prediction.

For a single sequence, the predicted score will be shown in the output.
For multiple sequences, the predicted score will be saved in a .csv file.

How to use model for inverse folding prediction

Task config

method refers to the prediction method. It could be either argmax or multinomial.
- argmax selects the amino acid with the highest probability.
- multinomial samples an amino acid from the multinomial distribution.
num_samples refers to the number of output amino acid sequences.
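
The difference between the two methods can be illustrated with a toy example. The per-position probabilities below are made up and restricted to three amino acids per site for brevity; in ColabSaprot they would come from the model. The function names are hypothetical.

```python
import random


def pick_argmax(probs: dict[str, float]) -> str:
    """Return the amino acid with the highest probability."""
    return max(probs, key=probs.get)


def pick_multinomial(probs: dict[str, float], rng: random.Random) -> str:
    """Sample one amino acid in proportion to its probability."""
    residues = list(probs)
    weights = [probs[r] for r in residues]
    return rng.choices(residues, weights=weights, k=1)[0]


# Made-up distributions for a two-residue design.
position_probs = [
    {"A": 0.7, "G": 0.2, "S": 0.1},
    {"L": 0.5, "I": 0.3, "V": 0.2},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
argmax_seq = "".join(pick_argmax(p) for p in position_probs)
samples = [
    "".join(pick_multinomial(p, rng) for p in position_probs) for _ in range(3)
]

print(argmax_seq)  # AL
print(samples)
```

In this sketch argmax is deterministic and always returns the same sequence, while multinomial sampling can return a different sequence on each draw, which is what makes requesting several samples useful.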

Model

Default model is Official pretrained SaProt (650M).

Inverse folding dataset

PDB/CIF file

After generating the sequence

Predict the structure of generated sequence

Align proteins using TMalign

Instruction

Step 1

Click the run button to upload the structure file, which can be a .pdb or .cif file.

Step 2

After clicking the "Run" button, additional input boxes and an upload button will appear.

Step 3

After uploading the structure file, it will be transformed into an AA sequence and a structure sequence.

Use '#' to mask some amino acids for prediction.
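
If you prefer to prepare the masked sequence outside the notebook, the masking step can be sketched as follows. The helper name `mask_residues` and the use of 1-based positions are assumptions chosen to match the mutation site convention used elsewhere in this tutorial.

```python
def mask_residues(aa_seq: str, positions: list[int]) -> str:
    """Replace the residues at the given 1-based positions with '#'."""
    chars = list(aa_seq)
    for position in positions:
        if not 1 <= position <= len(chars):
            raise ValueError(f"Position {position} is outside the sequence")
        chars[position - 1] = "#"
    return "".join(chars)


print(mask_residues("MEETMKLATM", [2, 3, 10]))  # M##TMKLAT#
```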

Step 4

Choose the prediction method.

Step 5

Click the run button to make prediction.

How to contribute to SaprotHub

Join SaprotHub Organization

Before contributing to SaprotHub, you need to join the SaprotHub Huggingface Organization to gain write access to the subset of repos within the Organization that you have created.

Contribute to SaprotHub

You have two ways to contribute to SaprotHub:

Transfer your model to SaprotHub (Recommended)
Create a new model repository and upload model files

Transfer your model to SaprotHub (Recommended)

Once you have uploaded the model to your Huggingface repository using ColabSaprot, you can directly transfer your model to SaprotHub.

Create a new model repository and upload model files

You can manually create a new model repository on SaprotHub, and then upload the model files to this repository.
