ChemBERTa: A collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. To be presented at Baylearn and the Royal Society of Chemistry's Chemical Science Symposium.
- Tutorial
- ArXiv ChemBERTa-2 Paper
- ArXiv ChemBERTa Paper
- Poster
- Abstract
- BibTeX
- License: MIT
Right now, the notebooks are all for the RoBERTa model (a variant of BERT) trained on the task of masked-language modelling (MLM). Training was done over 10 epochs, until the loss converged to around 0.26 on the ZINC 250k dataset. The model weights for ChemBERTa pre-trained on various datasets (ZINC 100k, ZINC 250k, PubChem 100k, PubChem 250k, PubChem 1M, PubChem 10M) are available on HuggingFace. We expect to continue releasing larger models pre-trained on even larger subsets of ZINC, ChEMBL, and PubChem in the near future.
This library is currently primarily a set of notebooks with our pre-training and fine-tuning setup, and will be updated soon with model implementation + attention visualization code, likely after the ArXiv publication. Stay tuned!
I hope this is of use to developers, students and researchers exploring the use of transformers and the attention mechanism for chemistry!
Please cite ChemBERTa-2's ArXiv paper if you have used these models, notebooks, or examples in any way. The BibTeX entry is available below:
@article{ahmad2022chemberta,
  title={ChemBERTa-2: Towards Chemical Foundation Models},
  author={Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  journal={arXiv preprint arXiv:2209.01712},
  year={2022}
}
You can load the tokenizer + model for MLM prediction tasks using the following code:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Any of the pre-trained model weights listed above will work here.
# (AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead.)
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Build a fill-mask pipeline for masked-token (MLM) prediction on SMILES strings.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
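As a quick sanity check, you can mask a single position in a SMILES string and inspect the model's top suggestions. The aspirin SMILES below is just an illustrative input, not a prescribed test case:

# Mask one position in a SMILES string (aspirin here, chosen arbitrarily)
# and let the fill-mask pipeline propose completions.
masked_smiles = "CC(=O)Oc1ccccc1C(=O)" + tokenizer.mask_token

for prediction in fill_mask(masked_smiles):
    # Each prediction holds the completed sequence, the predicted token, and a score.
    print(prediction["sequence"], prediction["token_str"], prediction["score"])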
- Official DeepChem implementation of ChemBERTa using the model API (in progress)
- Open-source the attention visualization suite used in the paper (after formal publication, beginning of September).
- Release larger pre-trained models and support a wider array of property prediction tasks (BBBP, etc.) - see HuggingFace. A minimal fine-tuning sketch for such a task is included below.
- Finish writing notebook to train model
- Finish notebook to preload and run predictions on a single molecule -> test if HuggingFace works
- Train RoBERTa model until convergence
- Upload weights onto HuggingFace
- Create tutorial using evaluation + fine-tuning notebook.
- Create documentation + writing, visualizations for notebook.
- Set up PR into DeepChem
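As a starting point for the property prediction fine-tuning mentioned above, the sketch below loads a pre-trained ChemBERTa checkpoint with a sequence classification head and trains it on a tiny set of placeholder SMILES/label pairs. The SMILES strings, labels, training arguments, and output directory are illustrative stand-ins, not the setup used in the paper; a real run would load a benchmark such as BBBP from MoleculeNet/DeepChem.

# Minimal sketch of fine-tuning ChemBERTa for a binary property prediction task
# (e.g. a BBBP-style blood-brain-barrier penetration label). Placeholder data only.
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
# The classification head on top of the pre-trained encoder is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "seyonec/ChemBERTa-zinc-base-v1", num_labels=2
)

# Placeholder data: a handful of SMILES strings with made-up binary labels.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
train_labels = [0, 1, 1, 0]


class SmilesDataset(torch.utils.data.Dataset):
    """Tokenizes SMILES strings and pairs them with labels for the Trainer."""

    def __init__(self, smiles, labels):
        self.encodings = tokenizer(smiles, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemberta-finetune", num_train_epochs=3),
    train_dataset=SmilesDataset(train_smiles, train_labels),
)
trainer.train()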