MoleculeClassification is a molecule properties classifier based on neural networks. The application provides a CLI and REST interface and a Docker image ready to be deployed. The repository implements two predictive models and the README gives a benchmark of performance and computing time.
Two models were designed and implemented with Keras.
The first model uses a feature vector describing each molecule; the second uses a convolutional neural network.
The primary evaluation metric is accuracy. Recall and precision are also reported.
The minimized loss function is the binary cross-entropy.
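As an illustration of these quantities, here is a minimal pure-Python sketch. The function names, the 0.5 decision threshold and the toy labels are assumptions made for the example; in the repository these values are computed by Keras.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, q in pairs if y == 1 and q == 1)
    tn = sum(1 for y, q in pairs if y == 0 and q == 0)
    fp = sum(1 for y, q in pairs if y == 0 and q == 1)
    fn = sum(1 for y, q in pairs if y == 1 and q == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6]
metrics = classification_metrics(y_true, y_prob)
loss = binary_cross_entropy(y_true, y_prob)
```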
If the number of samples is very low, we apply a data augmentation strategy.
For each molecule in the training set, alternative SMILES strings that represent the same molecule are generated.
When the dataset is imbalanced, an undersampling strategy is applied (we select as many elements from the majority class as there are in the minority class).
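The undersampling step can be sketched as follows. This is not the repository code: the `undersample` helper, the fixed seed and the placeholder molecule names are assumptions; only the "P1" label column comes from the dataset description.

```python
import random

def undersample(rows, label_key="P1", seed=0):
    """Keep as many rows from each class as there are in the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))  # random subset, no repeats
    rng.shuffle(balanced)
    return balanced

# Placeholder rows: 8 positive and 3 negative samples.
rows = [{"smiles": f"mol_pos_{i}", "P1": 1} for i in range(8)]
rows += [{"smiles": f"mol_neg_{i}", "P1": 0} for i in range(3)]
balanced = undersample(rows)
```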
The training set is split into 4 folds, which are used for the hyperparameter search.
To avoid any bias, the augmented SMILES from the same initial molecule are grouped together within the same fold.
For each set of hyperparameters, the model is trained 4 times using 3 folds for training (in yellow) and one fold for validation (in green). The accuracy of the model is given by the average accuracy of the validation sets.
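The grouping constraint can be sketched in pure Python as below. The helper names and the greedy assignment (largest group first, into the smallest fold) are assumptions chosen for illustration; the repository may build its folds differently.

```python
from collections import defaultdict

def grouped_folds(group_ids, n_folds=4):
    """Assign sample indices to folds so that a group never spans two folds."""
    members = defaultdict(list)
    for index, group in enumerate(group_ids):
        members[group].append(index)
    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, always into the currently smallest fold.
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(members[group])
    return folds

def cross_validation_splits(folds):
    """Yield (training indices, validation indices), one pair per fold."""
    for k, validation in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation

# Each entry is the identifier of the original molecule a SMILES was derived from.
group_ids = ["m0", "m0", "m0", "m1", "m1", "m2", "m3", "m3", "m4", "m5", "m5", "m5"]
folds = grouped_folds(group_ids)
splits = list(cross_validation_splits(folds))
```

With 4 folds this yields 4 train/validation pairs, and no original molecule appears on both sides of a pair. scikit-learn's `GroupKFold` offers the same guarantee if that dependency is available.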
The model exploits a fully-connected neural network that learns the relationship between the ECFP characteristics of molecules and the presence of a certain property.
The ECFP features are stored in a binary vector of 2048 columns. To reduce dimensionality, only the columns that take more than one value across the dataset are kept.
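A minimal sketch of this column filter, using a toy 4-column fingerprint instead of 2048 columns. The function names are hypothetical, and the sketch assumes the kept columns are computed once on the training set and reused at prediction time.

```python
def varying_columns(fingerprints):
    """Indices of the columns that take more than one value across all rows."""
    n_columns = len(fingerprints[0])
    return [c for c in range(n_columns)
            if len({row[c] for row in fingerprints}) > 1]

def select_columns(fingerprints, columns):
    """Keep only the given columns, in order."""
    return [[row[c] for c in columns] for row in fingerprints]

# Toy fingerprints: columns 0 and 1 are constant, columns 2 and 3 vary.
fingerprints = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
kept = varying_columns(fingerprints)
reduced = select_columns(fingerprints, kept)
```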
The proposed neural network offers three layers of non-linear activations. Bayesian optimization selects the size of the 2 hidden layers (between 4 and 64 neurons) and the activation function (either relu or swish).
A dropout is added after activations of hidden layers to reduce overfitting. The dropout rate is also found by automated tuning.
Hyperparameters are searched by Bayesian optimization using keras-tuner. Hyperparameters are selected on the accuracy of consolidated models.
Depending on the number of hyperparameter trials, the model reaches an accuracy between 60% and 70%. This leaves room for improvement, which is the objective of the second model.
The convolutional neural network model exploits the spatial organization of the molecule to predict the property.
We represent a molecule as a fixed-size matrix. To do so, we one-hot encode the SMILES using smiles2vec.
For the two SMILES here (they are not real molecules), we create this representation:
The matrix has as many columns as there are symbols in the SMILES vocabulary (atoms, parentheses, chemical bonds, etc.) and as many rows as the length of the longest SMILES in the training dataset.
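The encoding can be illustrated with the sketch below. It is a simplification and not the smiles2vec implementation: it treats every character as one token (so two-letter atoms such as Cl would be split), and the helper names and example SMILES are assumptions.

```python
def build_vocabulary(smiles_list):
    """Sorted list of the distinct characters found in the training SMILES."""
    return sorted({ch for smiles in smiles_list for ch in smiles})

def one_hot_encode(smiles, vocabulary, max_length):
    """Matrix with one row per position and one column per vocabulary symbol."""
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in range(max_length)]
    for row, ch in enumerate(smiles[:max_length]):
        matrix[row][index[ch]] = 1  # raises KeyError for an unknown symbol
    return matrix  # rows past the end of the SMILES stay all-zero (padding)

training_smiles = ["CC(=O)O", "c1ccccc1"]
vocabulary = build_vocabulary(training_smiles)
max_length = max(len(s) for s in training_smiles)
encoded = one_hot_encode("CC(=O)O", vocabulary, max_length)
```

Here the vocabulary has 7 symbols and the longest SMILES has 8 characters, so each molecule becomes an 8 x 7 matrix.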
The matrix is used to find convolution filters that detect structures within molecules.
Here is an example of a filter (stride = 1, filter size = 2) sliding over one vectorized input:
The filter always spans the full width of the matrix, so it only moves along the rows.
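The sliding of a full-width filter can be sketched as follows. The function name, the 3-symbol vocabulary and the hand-written kernel weights are assumptions for illustration; in the model the weights are learned, and a bias and an activation function would normally follow.

```python
def full_width_convolution(matrix, kernel, stride=1):
    """Slide a kernel that spans every column down the rows of the matrix."""
    height = len(kernel)  # filter size: number of rows covered at once
    outputs = []
    for top in range(0, len(matrix) - height + 1, stride):
        total = 0.0
        for offset in range(height):
            for value, weight in zip(matrix[top + offset], kernel[offset]):
                total += value * weight
        outputs.append(total)
    return outputs

# One-hot rows over the toy vocabulary ["C", "O", "="], spelling "C=OC".
matrix = [
    [1, 0, 0],  # C
    [0, 0, 1],  # =
    [0, 1, 0],  # O
    [1, 0, 0],  # C
]
# Kernel of size 2 that responds to "=" immediately followed by "O".
kernel = [
    [0, 0, 1],
    [0, 1, 0],
]
activations = full_width_convolution(matrix, kernel)
```

The output has one value per row position (3 values for 4 rows and a filter size of 2), and the highest activation is at the position where "=O" starts.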
The neural network is similar to the previous one. The final activation function is a sigmoid and there is a fully-connected part (hidden layer 3). The big difference is the succession of two convolutional layers that learn full-width convolution filters.
The input has a volume (V, C, 1) with:
- V: the size of the vector, defined by default to 90. In the training set used, the longest vector was 75. 90 was defined to support predictions on longer SMILES.
- C: the SMILES vocabulary
The filter hyperparameters, such as the filter size and the number of kernels, are selected by Bayesian optimization, in the same way as for model #1.
The predictive performance of the models is tested with a private CSV dataset of 4999 lines. Each line represents a molecule and gives:
- The SMILES of the molecule, in textual format (column "smiles")
- The presence or not of the property to predict (column "P1")
The dataset is randomly divided into two parts:
- The training set contains 4499 rows (90%) of the initial dataset
- The test set contains 500 rows (10%) of the initial dataset
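A minimal sketch of such a random 90/10 split. The helper name, the fixed seed and the use of integers as stand-ins for the CSV rows are assumptions; the repository may split the data differently.

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and set aside a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(4999))  # stand-ins for the 4999 rows of the dataset
train_rows, test_rows = train_test_split(rows)
```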
The cross-validation strategy is applied on the training set. Throughout training, the model never sees the test data. The test set is used only when evaluating the model performance, with the "evaluate" module.
The benchmark is performed on the following configuration:
- CPU: Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz 2.30 GHz
- RAM: 16.0 GB
- System type: 64-bit operating system, x64 processor
Results on the training set:

Model | Accuracy | Precision | Recall | F1-score | Time to train |
---|---|---|---|---|---|
#1 (ECFP-FCNN) | 0.69 | 0.91 | 0.69 | 0.79 | < 10 min |
#2 (Smiles-CNN) | 0.75 | 0.95 | 0.75 | 0.84 | < 10 min |

Results on the test set:

Model | Accuracy | Precision | Recall | F1-score | Time to train |
---|---|---|---|---|---|
#1 (ECFP-FCNN) | 0.63 | 0.85 | 0.66 | 0.74 | < 10 min |
#2 (Smiles-CNN) | 0.66 | 0.85 | 0.70 | 0.77 | < 10 min |
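
As a consistency check on the two tables above, the F1-score can be recomputed from the reported precision and recall. This sketch assumes the F1-score is the harmonic mean of those two columns; because the published values are rounded to two decimals, a tolerance of 0.01 is used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the two tables, in order.
reported = {
    "table 1, model #1": (0.91, 0.69, 0.79),
    "table 1, model #2": (0.95, 0.75, 0.84),
    "table 2, model #1": (0.85, 0.66, 0.74),
    "table 2, model #2": (0.85, 0.70, 0.77),
}
differences = {
    name: abs(f1_score(p, r) - f1)
    for name, (p, r, f1) in reported.items()
}
consistent = all(diff <= 0.01 for diff in differences.values())
```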
The use of data augmentation is debatable. Although the method is reported as useful in some papers, I observed a strong degradation in performance with heavy data augmentation (rate of +600%).
For example, with the CNN, data augmentation gives an accuracy of 0.62 on the training set and 0.59 on the test set, against 0.75 and 0.66 without data augmentation.
Moreover, data augmentation at this rate multiplies the tuning time by 600%.
💡 However, precision increases on the test set: it goes from 0.85 (without data augmentation) to 0.93 (with data augmentation). Depending on the objective of the prediction, data augmentation may therefore be worthwhile. If the property to be predicted is considered an anomaly (e.g. toxicity), you need to choose an appropriate data augmentation rate. Data augmentation is disabled by default. To enable it, see the `RANDOM_SMILES` variable in `dataset.py`.
- Research shows very good performance with recurrent neural networks, especially of the LSTM type. It would be interesting to use this type of network instead of the proposed CNN. Other models, such as graph convolutional networks, are also very promising.
- The models are trained without a GPU and without cloud computing resources. They were chosen to be fast to tune on a single CPU.
- The models were designed, implemented and tested in approximately 30 hours, under tight time constraints. More time would be needed to improve their performance.
The installation of the application is done using pip from the root of the repository:
pip install .
For development purposes:
pip install -e .
A Dockerfile is defined and allows the user to create a docker image:
docker build . -t servier
When the application is installed, it can be used by means of the "servier" command.
Train a model:
servier --input-table dataset_single.csv --model 1 train
Evaluate the model:
servier --input-table dataset_single_test.csv --model 1 evaluate
Make predictions:
servier --input-table dataset_single_test.csv --model 1 predict
A REST API exposes the prediction module. To use it, train a model and then start the HTTP server:
servier --input-table dataset_single.csv --model 1 train
servier serve
The exposed endpoint is the following:
GET /predict?smiles=YOUR_ENCODED_SMILES_HERE
Note: as with any URL request, the SMILES must be URL-encoded:
Before encoding | After encoding |
---|---|
Nc1ccc(C(=O)O)c(O)c1 | Nc1ccc%28C%28%3DO%29O%29c%28O%29c1 |
Only model 1 is supported by the API.
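The encoding shown in the table above can be reproduced with the Python standard library. In this sketch the base URL (host and port) is a placeholder assumption, since it depends on how the server is started; only the `/predict?smiles=` path comes from the endpoint description.

```python
from urllib.parse import quote

smiles = "Nc1ccc(C(=O)O)c(O)c1"

# safe="" makes quote() percent-encode every reserved character,
# including "/" which it would otherwise leave untouched.
encoded = quote(smiles, safe="")

# Hypothetical base URL: adjust host and port to your own deployment.
base_url = "http://localhost:8000"
url = f"{base_url}/predict?smiles={encoded}"
```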
The Docker image provides a distribution that allows you to run the application. A docker-compose configuration lets you start a container that can run the same commands as described above (CLI section).
The image does not contain the input data. A mount point shared between the host and the container is set up at run time for the inputs, and a second shared mount point is used to save the model parameters.
The input mount point must be created by the user at the root of the repository:
mkdir data
mv train.csv data/
mv test.csv data/
The model mount point is automatically created.
The following environment variables should be defined by the user:
Variable | Description |
---|---|
COMMAND | Command to execute (either train, predict, evaluate or serve) |
CSV | Input table name to use for the training, the evaluation or the prediction. The file should be defined in the data directory |
MODEL | The model to use (either 1 or 2) |
SMILES | The smiles to predict |
The COMMAND variable must be defined for every execution. The other variables depend on the chosen COMMAND.
The container is created the following way:
# Creating the container
docker-compose build
The container can then be used with the following Linux-compatible commands:
# Training from input_train.csv
COMMAND=train CSV=input_train.csv MODEL=1 docker-compose up
# Evaluating from input_test.csv
COMMAND=evaluate CSV=input_test.csv MODEL=1 docker-compose up
# Serving
COMMAND=serve MODEL=1 docker-compose up