This project runs from July 1, 2024, to August 30, 2024, in collaboration with the Lab of Systems Biology and Network Biology, Institute of Information Science, Academia Sinica. It aims to generate more heat-resistant detergent-compatible enzymes by fine-tuning the EvoDiff model and analyzing the generated sequences from both structural and sequence perspectives.
We collected sequences associated with detergent-compatible enzymes by identifying relevant EC numbers and microbial species from the literature. These sequences were retrieved from UniProt, aligned into multiple sequence alignments (MSAs), and split into training and test sets. The training set is the input for fine-tuning EvoDiff, where a disulfide bond reward mechanism is incorporated into the calculation of the training error. The test set is used to generate sequences, which are then analyzed with EpHod and TemStaPro to predict their optimal pH and temperature ranges. These predictions identify high-temperature microbes, which helps steer the fine-tuning process toward generating heat-resistant enzymes.
Model | Details | Reference |
---|---|---|
Clustal Omega | Tool for multiple sequence alignment. | Sievers, Fabian and Desmond G. Higgins. “The Clustal Omega Multiple Alignment Package.” Methods in molecular biology 2331 (2021): 3-16. |
CD-Hit | Tool for clustering and comparing protein/nucleotide sequences. | Li, Weizhong and Adam Godzik. “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.” Bioinformatics 22 13 (2006): 1658-9. Fu, Limin, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li. “CD-HIT: accelerated for clustering the next-generation sequencing data.” Bioinformatics 28 (2012): 3150-3152. |
EvoDiff | Generates diverse protein sequences and predicts their structures using OmegaFold. | Alamdari, Sarah, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini and Kevin Kaichuang Yang. “Protein generation with evolutionary diffusion: sequence is all you need.” bioRxiv (2023): n. pag. |
OmegaFold | Predicts protein structure from sequence. | Wu, Rui Min, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, Jianzhu Ma and Jian Peng. “High-resolution de novo structure prediction from primary sequence.” bioRxiv (2022): n. pag. |
InterProScan | Scans sequences for motifs. Version used: InterProScan 5.69-101 (released on 25 July 2024). | Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, John Maslen, Alex L. Mitchell, Gift Nuka, Sebastien Pesseat, Antony F. Quinn, Amaia Sangrador-Vegas, Maxim Scheremetjew, Siew-Yit Yong, Rodrigo Lopez and Sarah Hunter. “InterProScan 5: genome-scale protein function classification.” Bioinformatics 30 (2014): 1236-1240. |
TemStaPro | Predicts protein thermostability using embeddings generated by protein language models (pLMs). | Pudžiuvelytė, Ieva, Kliment Olechnovič, Egle Godliauskaite, Kristupas Sermokas, Tomas Urbaitis, Giedrius Gasiunas and Darius Kazlauskas. “TemStaPro: protein thermostability prediction using sequence representations from protein language models.” Bioinformatics 40 (2024): n. pag. |
EpHod | Predicts enzyme optimum pH from sequence data. | Gado, Japheth E., Matthew Knotts, Ada Y. Shaw, Debora S. Marks, Nicholas Paul Gauthier, Christensen Carsten Sander and Gregg T. Beckham. “Deep learning prediction of enzyme optimum pH.” bioRxiv (2023): n. pag. |
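    mkdir projects
    cd projects
    git clone https://github.com/Hlunlun/EzLLM.git
    cd EzLLM
    ./setup.sh

    # return to projects/ (the previous step ends inside projects/EzLLM)
    cd ..
    # download TemStaPro
    git clone https://github.com/ievapudz/TemStaPro.git
    cd TemStaPro
    make all
    # return to projects/ so that EpHod is cloned next to TemStaPro
    cd ..
    # download EpHod
    git clone https://github.com/jafetgado/EpHod.git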
Refer to this link to install the latest version of InterProScan, or simply run the script below to install it:
./install_installproscan.sh
- Collection from Literature:
  Detergent-compatible enzymes were gathered from scientific papers and compiled into a CSV file. The dataset groups the enzyme data by the types commonly found in detergents, including amylase, protease, mannanase, lipase, cellulase, and others. It is available on Hugging Face: lun610200/detergent-papers. Note that not all papers specify the EC number of the enzymes. For instance, protease is frequently mentioned across papers for its detergent functions, but detailed EC numbers are often omitted.
- Sequence Collection from UniProt:
  To study the evolutionary and functional relationships of detergent enzymes, sequence data in FASTA format was collected from UniProt using the EC numbers and the corresponding organisms or species mentioned in the literature. Since enzymes with the same EC number catalyze similar reactions, this approach facilitates the analysis of evolutionary relationships and functional similarities within specific organisms.
- Data Collection Process:
  - By EC Number and Specific Organism:
    Sequence data was gathered using the EC number and the specific organism.
  - By EC Number and Species:
    The search was extended from specific organisms to species to increase the dataset size for conditional generation input.
- Clustering to Identify Representative Sequences:
  Clustering was performed using CD-Hit at a 90% sequence identity threshold to identify representative sequences. This simplifies sequence analysis and increases the likelihood of generating representative detergent-compatible sequences during conditional generation, as the representative sequences are positioned first in the MSA. Below is a comparison of the number of representative sequences versus the total number of sequences before and after clustering, using different datasets:
  - By EC Number and Specific Organism
  - By EC Number and Species
- To perform Multiple Sequence Alignments (MSA):
  - Install Clustal Omega by `pip` or utilize the online version available at EMBL-EBI.
  - Perform MSA for:
    - Representative sequences with the same EC number.
    - Representative sequences with the same microorganism.

    These MSAs will be used as input for `conditional_generation_msa.py` and `train-msa.py`.
- Motif Position Scanning:
  Motif positions were identified using InterProScan with the Protein family (Pfam) database, as it contains the most extensive entry collection. Motif positions are conserved and functionally significant; by fixing these positions during sequence generation, we aim to produce sequences that retain their original detergent functionality and are more evolutionarily aligned.
- Reference Values for pH and Temperature:
  To find the optimal pH and temperature values for enzyme activity, look up the EC number and microorganism on BRENDA. These values can serve as reference points for pH and temperature ranges when generating sequences.
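The comparison of representative versus total sequence counts in the clustering step can be reproduced from the `.clstr` file that CD-HIT writes next to its output. The following is a minimal sketch, assuming the usual protein `.clstr` layout in which the representative member of each cluster is marked with `*`. The function name `parse_clstr` and the sample IDs are hypothetical and are not taken from the project code.

```python
import re
from typing import Dict, List

# Matches a member line such as "0\t120aa, >seqA... *" or "1\t118aa, >seqB... at 95.00%".
_MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s+[\d.]+%)")


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each representative sequence ID to all member IDs of its cluster."""
    clusters: Dict[str, List[str]] = {}
    members: List[str] = []
    representative = None
    # A sentinel header at the end flushes the last cluster.
    for raw in text.splitlines() + [">Cluster END"]:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            if representative is not None:
                clusters[representative] = members
            members, representative = [], None
            continue
        match = _MEMBER.search(line)
        if match is None:
            continue  # skip lines that do not look like member entries
        seq_id, flag = match.groups()
        members.append(seq_id)
        if flag == "*":
            representative = seq_id
    return clusters


# Toy input with made-up IDs, for illustration only.
sample = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 95.00%
>Cluster 1
0\t200aa, >seqC... *
"""

clusters = parse_clstr(sample)
n_total = sum(len(m) for m in clusters.values())
print(f"{len(clusters)} representative sequences out of {n_total} total")
```

Keeping the full member list per representative makes it possible to trace a representative sequence back to every organism it stands for.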
Data name | Detail |
---|---|
lun610200/detergent-papers | Papers categorized by amylase, cellulase, lipase, mannanase, protease, and others. These papers include records of EC numbers, organisms, optimal pH, and optimal temperature for all detergent-compatible enzymes. These records were then used to collect sequence data from UniProt. |
lun610200/detergent-motif | Sequence data categorized by different EC numbers, recording motif positions, pH optimum, and temperature optimum. |
lun610200/detergent-enzyme | This dataset is used for fine-tuning EvoDiff. It is split into training and test datasets, with a total of 644 representative detergent-compatible sequences. |
Collate the customized dataset using the code in `src/data_preprocess/`:

- `collate_data.ipynb`: This notebook handles the entire data processing workflow. It includes collecting and downloading sequence data, scanning motifs, clustering sequences, and organizing representative sequence data. The notebook contains comprehensive markdown annotations explaining each step of the process.
- `collect_data_webcrawl.py`: This script defines functions for web scraping data from Brenda and UniProt. It allows for querying by EC number, organism, and species to search for reviewed or unreviewed sequence data from these sources. It also supports downloading FASTA files.
- `create_datset.py`: This script creates datasets from papers, motifs, and sequences for uploading to Hugging Face. You need to define a `secrets.ini` file to store your personal Hugging Face API token.
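As an illustration of querying by EC number and organism, the sketch below builds a download URL for the public UniProt REST API instead of scraping web pages. It is not the code in `collect_data_webcrawl.py`. The helper name `build_uniprot_url` is hypothetical, and the query fields `ec`, `organism_name`, and `reviewed` are assumed to follow the UniProtKB query syntax.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_url(ec_number: str, organism: str, reviewed: Optional[bool] = None) -> str:
    """Return a URL that requests FASTA records for one EC number and organism."""
    terms = [f"(ec:{ec_number})", f'(organism_name:"{organism}")']
    if reviewed is not None:
        # reviewed:true restricts to Swiss-Prot, reviewed:false to TrEMBL
        terms.append(f"(reviewed:{str(reviewed).lower()})")
    params = {"query": " AND ".join(terms), "format": "fasta"}
    return f"{UNIPROT_STREAM}?{urlencode(params)}"


url = build_uniprot_url("3.4.21.62", "Bacillus subtilis", reviewed=True)
print(url)
# Decode the query string again to check what will be sent.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL can be fetched with any HTTP client. Leaving `reviewed` as `None` returns both reviewed and unreviewed entries.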
- Accessing the Dataset:
  The sequence data for the detergent-compatible enzyme dataset is available on the Hugging Face Hub: lun610200/detergent-enzyme
- Loading the Dataset:

      from datasets import load_dataset
      ds = load_dataset("lun610200/detergent-enzyme")
- Run the Training Script

      cd src/training/
      python train.py --gpus 0 --random-seed 42 --lr 1e-4 --epochs 60 --train-batch-size 4 --warmup-steps 10 --save-steps 10 --reweight True
- Parameters
  - `gpus`: Index of the GPU to use for training.
  - `random-seed`: Seed for the random number generators, to ensure reproducibility.
  - `lr`: Learning rate for the optimizer.
  - `train-batch-size`, `test-batch-size`, `validation-batch-size`: Batch sizes for loading data during training, testing, and validation.
  - `save-steps`: The interval of steps after which the model is saved. If not specified, the program saves only the best model during training.
  - `reweight`: By default, this is set to True, meaning the loss is reweighted following the order-agnostic autoregressive diffusion model (OADM) objective used by EvoDiff. If set to False, the model uses standard cross-entropy loss for weight updates.
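To make the effect of `reweight` more concrete, here is a rough sketch of the idea behind the OADM reweighting term: the loss at the masked positions of a sequence is scaled by one over the number of masked positions. This is a simplified illustration in plain Python, not the loss implemented in EvoDiff or in `train.py`. The function name `batch_masked_loss` and the normalization over the batch are assumptions.

```python
from typing import List, Sequence


def batch_masked_loss(nll_per_sequence: Sequence[Sequence[float]], reweight: bool = True) -> float:
    """Combine per-token negative log-likelihoods (NLLs) at masked positions.

    nll_per_sequence[i] holds the NLLs of the masked positions of sequence i.
    """
    sequences: List[Sequence[float]] = [s for s in nll_per_sequence if len(s) > 0]
    if not sequences:
        return 0.0
    if reweight:
        # Weight each masked token by 1 / (masked positions in its sequence),
        # so every sequence contributes its mean NLL regardless of mask size.
        return sum(sum(s) / len(s) for s in sequences) / len(sequences)
    # Plain cross-entropy: pool all masked tokens, so heavily masked sequences dominate.
    total_tokens = sum(len(s) for s in sequences)
    return sum(sum(s) for s in sequences) / total_tokens


# Toy NLL values: one sequence with a single masked token, one with four.
toy = [[2.0], [1.0, 1.0, 1.0, 1.0]]
print(batch_masked_loss(toy, reweight=True))
print(batch_masked_loss(toy, reweight=False))
```

In the toy batch, the lightly masked sequence has more influence on the reweighted loss than on the pooled one, which is the behavior the reweighting term is meant to produce.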
- Run the Training Script

      cd src/training/
      python train-msa.py --gpus 0 --random-seed 42 --lr 1e-4 --epochs 60 --train-batch-size 4 --warmup-steps 10 --save-steps 10 --reweight True
- Parameter
Refer to evodiff by Microsoft, which is a framework designed for evolutionary protein sequence generation and analysis. We can utilize this framework to generate detergent-compatible sequences by running the `conditional_generation.py` and `conditional_generation_msa.py` scripts.
Run:

    python src/main.py
    python conditional_generation_msa.py --cond-task scaffold --pdb A0A0A0PHP9 --num-seqs 1 --start-idx 0 --end-idx 10
    python conditional_generation.py --cond-task scaffold --pdb A0A0A0PHP9 --num-seqs 1 --start-idx 0 --end-idx 10
Run `example.ipynb` to generate detergent-compatible sequences with the fine-tuned model and the prepared dataset.
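The `--start-idx` and `--end-idx` arguments mark the motif that stays fixed while the rest of the sequence is regenerated. The sketch below shows one way to picture that scaffolding input. It assumes a half-open interval `[start_idx, end_idx)` and a `#` mask symbol; both are assumptions made for illustration, and the actual scripts may treat the indices differently. The helper `mask_outside_motif` and the example sequence are hypothetical.

```python
MASK = "#"  # assumed placeholder for positions the model is free to redesign


def mask_outside_motif(sequence: str, start_idx: int, end_idx: int, mask: str = MASK) -> str:
    """Keep sequence[start_idx:end_idx] fixed and mask every other position."""
    if not 0 <= start_idx < end_idx <= len(sequence):
        raise ValueError("indices must satisfy 0 <= start_idx < end_idx <= len(sequence)")
    prefix = mask * start_idx
    suffix = mask * (len(sequence) - end_idx)
    return prefix + sequence[start_idx:end_idx] + suffix


# Made-up 16-residue sequence, for illustration only.
template = "MKTAYIAKQRQISFVK"
print(mask_outside_motif(template, 0, 10))
```

With motif positions taken from the InterProScan results, the same helper can be used to check which residues a given index pair would keep before starting a generation run.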
- Structure Prediction and Evaluation:
  - Objective: Determine the structural reliability of generated sequences using OmegaFold.
  - Metrics: RMSD and pLDDT are used to evaluate the accuracy and confidence of the predicted structures.
  - Criteria for Success: Sequences with RMSD < 1 Å and pLDDT > 70 are kept for further analysis, ensuring that only high-quality structures are used.
- Activity Prediction and Grouping:
  - Objective: Assess the functional properties of the successful sequences.
  - Method: Utilize EpHod for predicting optimal pH and TemStaPro for assessing thermal stability.
  - Analysis: Categorize predictions by species to identify patterns and variations in pH and temperature preferences across different species.
- Visualization and Interpretation:
  - Objective: To visually identify which species' sequences exhibit higher heat tolerance.
  - Visualization Tools: Use plots such as violin plots, histograms, box plots, or scatter plots to illustrate the ranges and distributions.
  - Expectation: The goal is to achieve a higher temperature range, which would indicate a greater tolerance to heat, aligning with our project objectives for generating heat-resistant enzymes.
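The filtering and grouping steps above can be sketched in a few lines. The thresholds come from the success criteria (RMSD < 1 and pLDDT > 70). The record fields `species`, `rmsd`, `plddt`, `ph_opt`, and `temp_c` are hypothetical names, and the values are toy numbers rather than project results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

RMSD_MAX = 1.0    # structures must deviate less than this from the reference
PLDDT_MIN = 70.0  # structures must be predicted with more confidence than this


def passes_structure_filter(record: dict) -> bool:
    """Apply the success criteria to one generated sequence."""
    return record["rmsd"] < RMSD_MAX and record["plddt"] > PLDDT_MIN


def summarize_by_species(records: List[dict]) -> Dict[str, dict]:
    """Group the sequences that pass the filter and summarize pH and temperature."""
    groups = defaultdict(list)
    for record in records:
        if passes_structure_filter(record):
            groups[record["species"]].append(record)
    return {
        species: {
            "n": len(rows),
            "mean_ph": mean([r["ph_opt"] for r in rows]),
            "max_temp_c": max(r["temp_c"] for r in rows),
        }
        for species, rows in groups.items()
    }


# Toy records with made-up values, for illustration only.
records = [
    {"species": "species_a", "rmsd": 0.8, "plddt": 85.0, "ph_opt": 9.0, "temp_c": 60},
    {"species": "species_a", "rmsd": 0.6, "plddt": 75.0, "ph_opt": 10.0, "temp_c": 65},
    {"species": "species_a", "rmsd": 1.4, "plddt": 90.0, "ph_opt": 8.0, "temp_c": 70},
    {"species": "species_b", "rmsd": 0.9, "plddt": 65.0, "ph_opt": 7.0, "temp_c": 45},
]
print(summarize_by_species(records))
```

A per-species summary of this kind can feed directly into the box or violin plots mentioned above.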
- Execute `src/main.py`. This script contains a subprocess that runs `src/rmsd_analysis.py` to calculate RMSD and pLDDT values for the generated sequences.
- For more detailed information about the process and parameters used, please refer to evodiff.
- Execute the script to plot the pH and temperature distributions:

      python src/pH_temp_analysis/pH_temp_analysis.py

- The generated data will be collected and grouped by species, and the plots will be saved to the default path `plot/`.
Model | pH value | Temperature Range |
---|---|---|
Pretrained EvoDiff | | |
Fine-tuned EvoDiff | | |
Run the following command to start the web server and make the application accessible via your web browser:

    python app/main.py

Expected output: after executing the command, you should see output indicating that the Flask server is running, typically on http://127.0.0.1:5000/. You can access the application by entering this URL in your web browser.
The front-end of this application is built using JavaScript, which is responsible for controlling the user interface and handling user interactions.
Technologies Used:
- HTML/CSS for structure and styling
- JavaScript for dynamic functionality and UI control
File description

- `static/`: Contains css/, img/, and js/ related to front-end assets.
- `templates/`: Contains .html files corresponding to different topics such as home, results, and about.
The back-end of the application is built using Flask, a lightweight web framework for Python. This component is responsible for handling file uploads and generating sequences based on the uploaded files.
Technologies Used:
- Flask for creating the web server and managing requests
- Python for back-end logic and processing
File description

- `uploads/`: Files uploaded through the drop zone will be stored here.
- `main.py`: Flask application to process uploaded files.
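Before an uploaded file is passed to sequence generation, it helps to check that it is a well-formed FASTA file. The following is a minimal sketch of such a check using only the standard library. It is not taken from `main.py`; the function names and the choice to accept only the 20 standard amino acids are assumptions.

```python
from typing import Dict, List

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping from header to upper-case sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            if not current:
                raise ValueError("empty FASTA header")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current] += line.upper()
    return records


def invalid_residues(sequence: str) -> List[str]:
    """Return the characters that are not standard amino acids, sorted."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Made-up upload content, for illustration only.
upload = ">a desc\nmkt\nAY\n>b\nACDX\n"
for header, seq in parse_fasta(upload).items():
    print(header, seq, invalid_residues(seq))
```

A request handler could reject the upload, or report the offending characters back to the user, whenever `invalid_residues` returns a non-empty list.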
By following the instructions above, you can set up and run the application locally. Feel free to explore and modify the code to suit your needs!