Mass spectrometry, also called mass spec, is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio.
from Wikipedia
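As a small worked example of the mass-to-charge ratio mentioned above, the following sketch computes the m/z of a protonated ion from an elemental formula. The helper names (`neutral_mass`, `protonated_mz`) are hypothetical and introduced only for illustration; the monoisotopic masses are standard values rounded to a few decimals, and only protonation is considered.

```python
# Monoisotopic masses in daltons (standard values, rounded).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
}
PROTON_MASS = 1.00727647  # hydrogen atom minus one electron


def neutral_mass(formula):
    """Monoisotopic mass of a neutral molecule given as {element: count}."""
    return sum(MONOISOTOPIC_MASS[element] * count for element, count in formula.items())


def protonated_mz(formula, charge=1):
    """m/z of the [M + zH]^(z+) ion, assuming protonation only."""
    return (neutral_mass(formula) + charge * PROTON_MASS) / charge


# Caffeine, C8H10N4O2.
caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
print(f"[M+H]+ m/z: {protonated_mz(caffeine):.4f}")  # approximately 195.0877
```

A mass spectrum can then be thought of as a list of such m/z values, each paired with a measured intensity.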
A continuously updated list of awesome machine-learning papers and code related to small-molecule mass spectrometry. Please note that awesome lists are curations of the best, not of everything. Contributions are always welcome!
- Databases
- Papers
- Survey/Review papers
- Discussions in databases
- Discussions in pre-train models
- Small molecular representation learning
- Mass spectrometry-related properties prediction
- Mass spectra representation learning and matching
- Chemical formula prediction from mass spectra
- Mass spectra peak annotation/assignment
- Machine learning in small molecules chromatography
- Related awesome lists
- OC20 & OC22: The Open Catalyst Project focuses on using AI to find new renewable energy storage catalysts, releasing the OC20 and OC22 datasets with 1.3 million molecular relaxations from 260 million DFT calculations for research support.
- QM9: This dataset includes the computed geometric, energetic, electronic, and thermodynamic properties of 134,000 stable small organic molecules composed of CHONF.
- GEOM: This dataset features 37 million molecular conformations for over 450,000 molecules, generated using advanced sampling and semi-empirical density functional theory (DFT).
- MD17 & MD22: The MD22 benchmark dataset includes molecular dynamics trajectories of seven biomolecular and supramolecular systems, with atom counts ranging from 42 to 370, sampled at 400-500 K with 1 fs resolution, and energy and forces calculated using PBE+MBD theory.
- PCQM4Mv2: PCQM4Mv2 is a quantum chemistry dataset derived from the PubChemQC project, focusing on the ML task of predicting DFT-calculated HOMO-LUMO energy gaps of molecules using their 2D graphs, a significant task due to the expense of obtaining 3D equilibrium structures.
- MoleculeNet: MoleculeNet is a benchmark for testing machine learning methods on molecular properties, featuring over 700,000 compounds from multiple databases, integrated into the DeepChem package, and evaluates model performances using metrics like AUC-ROC, AUC-PRC, RMSE, and MAE.
- MassSpecGym: MassSpecGym is a benchmark for the discovery and identification of molecules, containing 231K high-quality MS/MS spectra for 19K molecules.
- NIST23: The NIST MS/MS Library 2023 is a collection of MS/MS spectra and search software. It contains 2,374,064 MS/MS spectra from 399,267 small molecules.
- MoNA: MoNA currently contains 2,061,612 mass spectral records from experimental and in-silico libraries, as well as from user contributions.
- GNPS: GNPS is a web-based mass spectrometry ecosystem that aims to be an open-access knowledge base for the community-wide organization and sharing of raw, processed, or annotated fragmentation mass spectrometry data (MS/MS).
- HMDB 5.0: The Human Metabolome Database (HMDB) Version 5.0 is an extensive and freely accessible electronic resource that contains 220,945 metabolite entries present in the human body and their experimental MS/MS spectra.
- SMRT: This dataset presents an experimentally acquired reverse-phase chromatography retention time dataset, covering up to 80,038 small molecules.
- RepoRT: RepoRT currently contains 373 datasets, 8,809 unique compounds, and 88,325 retention time entries measured on 49 different chromatographic columns using various eluents, flow rates, and temperatures.
- AllCCS: This collection includes more than 5,000 experimental CCS records and approximately 12 million calculated CCS values for over 1.6 million small molecules.
- AllCCS2: Compared to AllCCS, AllCCS2 incorporates newly available experimental CCS data, including 10,384 records from 4,326 compounds. After standardization, 7,713 unified CCS values with confidence scores were added.
- METLIN-CCS: The METLIN-CCS database includes collision cross section (CCS) values derived from IMS data for more than 27,000 molecular standards across 79 chemical classes.
- CCSbase: CCSbase is an integrated platform consisting of a comprehensive database of CCS measurements taken from a variety of sources and a high-quality, high-throughput CCS prediction model trained on this database using machine learning. [website]
- [J. Am. Soc. Mass Spectrom. 2024] Nguyen, Julia, et al. Advancing the Prediction of MS/MS Spectra Using Machine Learning
- [IJCAI 2023] Xia, Jun, et al. A Systematic Survey of Chemical Pre-trained Models
- [TrAC 2021] Debus, Bruno, et al. Deep learning in analytical chemistry
- [J. Cheminform. 2013] Scheubert, Kerstin, et al. Computational mass spectrometry for small molecules
- [Anal. Chem. 2024] Hoang, Corey, et al. Tandem Mass Spectrometry across Platforms
- [bioRxiv 2024] Kretschmer, Fleming, et al. Small molecule machine learning: All models are wrong, some may not even be useful
- [JCIM 2023] Zhang, Ziqiao, et al. Can Pre-trained Models Really Learn Better Molecular Representations for AI-aided Drug Discovery?
- [NeurIPS 2022] Sun, Ruoxi, et al. Does GNN Pretraining Help Molecular Representation?
Molecular representation learning models are categorized, according to the information they encode, as point-based (or quantum-based) methods, graph-based methods, and sequence-based methods. Because there are so many graph-based methods, they are further divided into self-supervised and supervised approaches. Note that point-based (or quantum-based) methods and graph-based methods differ in whether bonds (i.e., edges) are included in the encoding.
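To make the three categories concrete, here is a minimal sketch of how the same molecule (the heavy atoms of ethanol) might be encoded as a sequence, a point set, and a graph. The class names and the coordinates are illustrative assumptions, not taken from any of the listed papers; real pipelines would typically use a cheminformatics toolkit to build these inputs.

```python
import math
from dataclasses import dataclass


@dataclass
class PointCloud:
    """Point-based input: element types plus 3D coordinates, no bonds."""
    elements: list
    coords: list  # (x, y, z) tuples in angstroms


@dataclass
class MolGraph:
    """Graph-based input: element types plus explicit bonds (edges)."""
    elements: list
    bonds: list  # (atom_i, atom_j, bond_order) tuples


def pairwise_distances(cloud):
    """Interatomic distance matrix, the kind of feature point-based models build on."""
    return [[math.dist(a, b) for b in cloud.coords] for a in cloud.coords]


def adjacency(graph):
    """Bond adjacency matrix, the kind of structure graph-based models pass messages over."""
    n = len(graph.elements)
    matrix = [[0] * n for _ in range(n)]
    for i, j, order in graph.bonds:
        matrix[i][j] = matrix[j][i] = order
    return matrix


# Sequence-based: a SMILES string.
smiles = "CCO"

# Point-based: illustrative (not optimized) coordinates for C, C, O.
points = PointCloud(["C", "C", "O"], [(0.0, 0.0, 0.0), (1.52, 0.0, 0.0), (2.05, 1.32, 0.0)])

# Graph-based: the same atoms with two single bonds.
graph = MolGraph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])

print(smiles)
print(pairwise_distances(points))
print(adjacency(graph))
```

The point set carries geometry but no edges, while the graph carries edges but (in this minimal form) no geometry, which is the distinction drawn above.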
Point-based (or quantum-based) methods
- [ICLR 2023] Zhou, Gengmo, et al. Uni-Mol: A universal 3D molecular representation learning framework [code]
- [PMLR 2021] Schütt, Kristof, et al. Equivariant message passing for the prediction of tensorial properties and molecular spectra [code]
- [NeurIPS 2017] Schütt, Kristof, et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions [code]
Graph-based methods
Self-Supervised Learning
- [Brief. Bioinformatics 2024] Zhen, Wang, et al. BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation [code]
- [Bioinformatics 2023] [3DGCL] Moon, Kisung, et al. 3D graph contrastive learning for molecular property prediction [code]
- [ICLR 2023] [Mole-BERT] Xia, Jun, et al. Mole-BERT: Rethinking pre-training graph neural networks for molecules [code]
- [ICLR 2023 (spotlight)] [GNS TAT] Zaidi, Sheheryar, et al. Pre-training via denoising for molecular property prediction [code]
- [ICLR 2023] [GeoSSL-DDM] Liu, Shengchao, et al. Molecular geometry pretraining with SE(3)-invariant denoising distance matching [code]
- [ICLR 2022] [GraphMVP] Liu, Shengchao, et al. Pre-training molecular graph representation with 3D geometry [code]
- [NeurIPS 2021] [MGSSL] Zhang, Zaixi, et al. Motif-based graph self-supervised learning for molecular property prediction [code]
- [NeurIPS 2020] [GROVER] Rong, Yu, et al. Self-supervised graph transformer on large-scale molecular data [code]
- [ICLR 2020] [InfoGraph] Sun, Fan-Yun, et al. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization [code]
Supervised Learning
- [AAAI 2023] [Molformer] Wu, Fang, et al. Molformer: Motif-based transformer on 3D heterogeneous molecular graphs [code]
- [NeurIPS 2022] [ComENet] Wang, Limei, et al. ComENet: Towards Complete and Efficient Message Passing for 3D Molecular Graphs [code (implemented in DIG library)]
- [ICLR 2022] [GNS+Noisy Nodes] Godwin, Jonathan, et al. Simple GNN regularisation for 3D molecular property prediction & beyond [codes]
- [ICLR 2022] [MolR] Wang, Hongwei, et al. Chemical-reaction-aware molecule representation learning [code]
- [ICLR 2022] [SphereNet] Liu, Yi, et al. Spherical message passing for 3D graph networks [code (implemented in DIG library)]
- [Nat. Mach. Intell. 2022] [GEM] Fang, Xiaomin, et al. Geometry-enhanced molecular representation learning for property prediction [code]
- [NeurIPS 2021] [GemNet] Gasteiger, Johannes, et al. GemNet: Universal directional graph neural networks for molecules [code]
- [NeurIPS 2020] [DimeNet++] Klicpera, Johannes, et al. Fast and uncertainty-aware directional message passing for non-equilibrium molecules [code]
- [ICLR 2020] [DimeNet] Gasteiger, Johannes, et al. Directional message passing for molecular graphs [code]
- [Chem. Mater. 2019] [MEGNet] Chen, Chi, et al. Graph networks as a universal machine learning framework for molecules and crystals [preprint] [code]
- [PMLR 2017] Gilmer, Justin, et al. Neural message passing for quantum chemistry [code]
- [NeurIPS 2015] [Neural FPs] Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints [code]
Other Related Works
- [NeurIPS 2020] You, Yuning, et al. Graph contrastive learning with augmentations [code]
- [ICLR 2020] Hu, Weihua, et al. Strategies for pre-training graph neural networks [code]
Sequence-based methods
- [Patterns 2022] [SELFIES] Krenn, Mario, et al. SELFIES and the future of molecular string representations [code]
- [Nat. Mach. Intell. 2022] [MolFormer] Ross, Jerret, et al. Large-scale chemical language representations capture molecular structure and properties [code]
- [Chem. Sci. 2022] [R-SMILES] Zhong, Zipeng, et al. Root-aligned SMILES: a tight representation for chemical reaction prediction [code]
- [BCB 2019] [SMILES-BERT] Wang, Sheng, et al. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction [code]
Tandem mass spectra prediction
- [Anal. Chem. 2024] [PPGB_MS2] Zheng, Fujian, et al. Predicting Tandem Mass Spectra of Small Molecules Using Graph Embedding of Precursor-Product Ion Pair Graph [code]
- [Anal. Chem. 2023] Wang, Fei, et al. Deep Learning-Enabled MS/MS Spectrum Prediction Facilitates Automated Identification Of Novel Psychoactive Substances [code]
- [Nat. Mach. Intell. 2023] Goldman, Samuel, et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers [code]
- [Nat. Mach. Intell. 2024] Young, Adamo, et al. Tandem mass spectrum prediction for small molecules using graph transformers [code]
- [NeurIPS 2023] Goldman, Samuel, et al. Prefix-tree decoding for predicting mass spectra from molecules [code]
- [Bioinformatics 2023] Hong, Yuhui, et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations [code]
- [Anal. Chem. 2021] Wang, Fei, et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification [code]
- [ACS Cent. Sci. 2019] Wei, Jennifer N., et al. Rapid prediction of electron–ionization mass spectrometry using neural networks [code]
Retention time prediction
- [Bioinformatics 2024] [RT-Transformer] Xue, Jun, et al. RT-Transformer: Retention time prediction for metabolite annotation to assist in metabolite identification [code]
- [J. Chromatogr. A 2023] [DeepGCN-RT] Kang, Qiyue, et al. Deep graph convolutional network for small-molecule retention time prediction [code]
- [Anal. Chem. 2021] [GNN-RT] Yang, Qiong, et al. Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification [code]
- [Anal. Chem. 2020] [Retip] Bonini, Paolo, et al. Retip: retention time prediction for compound annotation in untargeted metabolomics [code]
- [Nat. Commun. 2019] Domingo-Almenara, Xavier, et al. The METLIN small molecule dataset for machine learning-based retention time prediction [code]
Collision cross section prediction
- [Anal. Chem. 2024] de Cripan, et al. Predicting the Predicted: A Comparison of Machine Learning-Based Collision Cross-Section Prediction Models for Small Molecules
- [Anal. Chem. 2022] [AllCCS2] Zhang, Haosong, et al. AllCCS2: Curation of Ion Mobility Collision Cross-Section Atlas for Small Molecules Using Comprehensive Molecular Representations [code]
- [Anal. Chem. 2022] [CCSP 2.0] Rainey, Markace A., et al. CCS Predictor 2.0: An open-source jupyter notebook tool for filtering out false positives in metabolomics [code]
- [Nat. Commun. 2020] [AllCCS] Zhou, Zhiwei, et al. Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics [code]
- [Anal. Chem. 2019] [DeepCCS] Plante, Pier-Luc, et al. Predicting ion mobility collision cross-sections using a deep neural network: DeepCCS [code]
Mass spectra representation learning and matching
- [Anal. Chem. 2023] [CLERMS] Guo, Hao, et al. Contrastive learning-based embedder for the representation of tandem mass spectra [code]
- [Nat. Commun. 2023] [FastEI] Yang, Qiong, et al. Ultra-fast and accurate electron ionization mass spectrum matching for compound identification with million-scale in-silico library [code]
- [PLoS Comput. Biol. 2021] Huber, Florian, et al. Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships [code]
- [J. Cheminform. 2021] Huber, Florian, et al. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra [code]
- [Anal. Chem. 2019] [DeepMASS] Ji, Hongchao, et al. Deep MS/MS-aided structural-similarity scoring for unknown metabolite identification [code]
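Learned similarity measures such as Spec2Vec and MS2DeepScore are usually compared against the classic cosine score between binned spectra. The sketch below shows that baseline only; it is not the method of any paper listed above. The function names, the bin width, the m/z range, and the example peaks are all illustrative assumptions.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length intensity vector."""
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:  # peaks outside the range are ignored
            vector[index] += intensity
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up spectra for illustration: (m/z, relative intensity).
query = [(110.07, 30.0), (138.07, 100.0), (195.09, 55.0)]
reference = [(110.07, 25.0), (138.07, 100.0), (195.09, 60.0)]
unrelated = [(57.07, 100.0), (85.10, 40.0)]

print(cosine_similarity(bin_spectrum(query), bin_spectrum(reference)))
print(cosine_similarity(bin_spectrum(query), bin_spectrum(unrelated)))
```

Fixed-width binning is the simplest choice; a score that matches peaks within an m/z tolerance avoids the edge effects that binning introduces.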
Chemical formula prediction from mass spectra
- [JCIM 2023] Goldman, Samuel, et al. MIST-CF: Chemical formula inference from tandem mass spectra [code]
- [Nat. Methods 2023] [BUDDY] Xing, Shipei, et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation [code1] [code2]
- [Nat. Methods 2019] [SIRIUS 4] Dührkop, Kai, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information [code]
Mass spectra peak annotation/assignment
- [J. Cheminform. 2016] Ruttkies, Christoph, et al. MetFrag relaunched: incorporating strategies beyond in silico fragmentation [website]
- [Anal. Chem. 2014] Ma, Yan, et al. MS2Analyzer: A software for small molecule substructure annotations from accurate tandem mass spectra [website]
- [Nucleic Acids Res. 2014] Allen, Felicity, et al. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra [website] CFM-ID is designed for three tasks: spectrum prediction, peak assignment, and compound identification.
Mass spectrometry is often coupled with chromatographic techniques, such as GC-MS (gas chromatography-mass spectrometry) or LC-MS (liquid chromatography-mass spectrometry). In these combined techniques, the chromatographic method separates the compounds, and then the mass spectrometer analyzes each separated compound for identification and quantification.
- [Anal. Chem. 2024] [3DMolCSP] Hong, Yuhui, et al. Enhanced Structure-Based Prediction of Chiral Stationary Phases for Chromatographic Enantioseparation from 3D Molecular Conformations [code]
- [Nat. Commun. 2023] [QGeoGNN] Xu, Hao, et al. Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network [code]
- [J. Sep. Sci. 2018] Piras, Patrick, et al. Modeling and predicting chiral stationary phase enantioselectivity: An efficient random forest classifier using an optimally balanced training dataset and an aggregation strategy
- [J. Chromatogr. A 2016] Sheridan, Robert, et al. Toward structure-based predictive tools for the selection of chiral stationary phases for the chromatographic separation of enantiomers
Related awesome lists
- Awesome Mass Spectra Libraries: This repository contains the latest libraries for mass spectral data and related properties.
- Awesome Small Molecule Machine Learning: This repository focuses on machine learning topics related to small molecules.
- Awesome Cheminformatics: This repository concentrates on computer-based methods in chemistry.
- Awesome Python Chemistry: This repository is dedicated to Python-based frameworks, libraries, software, and resources in the field of Chemistry.
- Awesome DeepBio & deeplearning-biology: These repositories focus on deep learning methods in biology.
- awesome-pretrain-on-molecules