Microarrays make it possible to study the expression of thousands of genes simultaneously. Genes responsible for tumour growth can be identified by comparing gene expression in microarray data from normal and tumour samples. Highly efficient computational techniques are needed to analyze this large number of expression values and to find the most significant differentially expressed genes for a particular disease. Many computational methods have difficulty selecting an optimal gene set because the number of samples is small compared to the thousands of genes measured. Gene selection is therefore a very important step if machine learning classifiers are to separate tumour and normal samples accurately.
The gene selection methods used were:
1. Differential expression analysis using an independent t-test
2. Recursive feature elimination using an SVC classifier
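The two gene selection steps can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `select_genes`, the significance threshold `alpha`, the number of genes kept, and the synthetic data are all assumptions made for the example. It uses `scipy.stats.ttest_ind` for the per-gene t-test and scikit-learn's `RFE` wrapped around a linear-kernel `SVC`.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import RFE
from sklearn.svm import SVC


def select_genes(X, y, alpha=0.05, n_genes=10):
    """Return column indices of selected genes.

    X: (samples, genes) expression matrix.
    y: array of 'tumor' / 'normal' labels, one per sample.
    alpha and n_genes are illustrative defaults (assumptions).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: independent two-sample t-test for every gene (column).
    _, pvals = ttest_ind(X[y == "tumor"], X[y == "normal"], axis=0)
    candidates = np.flatnonzero(pvals < alpha)
    if candidates.size == 0:
        return candidates  # nothing passed the t-test filter

    # Step 2: recursive feature elimination with a linear SVC,
    # restricted to the genes that passed step 1.
    n_keep = min(n_genes, candidates.size)
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_keep)
    rfe.fit(X[:, candidates], y)
    return candidates[rfe.support_]


# Synthetic example: 40 samples, 50 genes, first 5 genes shifted in tumors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 50))
y_demo = np.array(["tumor"] * 20 + ["normal"] * 20)
X_demo[:20, :5] += 3.0

selected = select_genes(X_demo, y_demo, n_genes=5)
print("Selected gene indices:", selected)
```

In practice a multiple-testing correction on the p-values would usually be applied before thresholding; it is left out here to keep the sketch short.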
The following classifiers were trained:
1. Support vector machine classifier
2. Logistic regression
3. Linear discriminant analysis
4. Quadratic discriminant analysis
5. Decision tree
6. Gaussian naive Bayes
7. Random forest
8. Gaussian process classifier
9. AdaBoost
10. XGBoost
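As a rough sketch, the classifiers listed above could be instantiated with scikit-learn as shown below. The helper `build_classifiers` and every hyperparameter in it are assumptions chosen for illustration; the settings actually used for the trained models are not stated here. XGBoost is a separate package, so it is only added when `xgboost` is installed.

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(random_state=0):
    """Return a dict of name -> unfitted classifier (illustrative defaults)."""
    models = {
        "SVC": SVC(probability=True, random_state=random_state),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
        "DecisionTree": DecisionTreeClassifier(random_state=random_state),
        "GaussianNB": GaussianNB(),
        "RandomForest": RandomForestClassifier(random_state=random_state),
        "GaussianProcess": GaussianProcessClassifier(random_state=random_state),
        "AdaBoost": AdaBoostClassifier(random_state=random_state),
    }
    try:
        # Optional dependency: only available if xgboost is installed.
        from xgboost import XGBClassifier

        models["XGBoost"] = XGBClassifier(random_state=random_state)
    except ImportError:
        pass
    return models


models = build_classifiers()
print(sorted(models))
```

Recent XGBoost releases expect numeric class labels, so 'tumor' and 'normal' would need to be encoded (for example as 1 and 0) before fitting that model.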
The classifiers were tested on microarray data containing 228 samples from different independent experiments.
The evaluation metrics used were:
1. ROC curve
2. Precision recall curve
3. Confusion matrix
4. Accuracy
5. Area under the curve
6. F1 score
7. Average precision
8. Log loss
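The metrics listed above are all available in `sklearn.metrics`. The sketch below shows one way to compute them for a binary tumor/normal problem; the function name `evaluate`, the 0/1 label encoding (1 = tumor), the 0.5 decision threshold, and the toy inputs are assumptions for the example rather than the repository's own evaluation code.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the listed metrics.

    y_true: 0/1 labels with 1 = tumor (assumed encoding).
    y_prob: predicted probability of the tumor class.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)  # hard predictions

    fpr, tpr, _ = roc_curve(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return {
        "roc_curve": (fpr, tpr),
        "precision_recall_curve": (precision, recall),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "average_precision": average_precision_score(y_true, y_prob),
        "log_loss": log_loss(y_true, y_prob, labels=[0, 1]),
    }


# Toy example with four samples.
results = evaluate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
for name in ("accuracy", "auc", "f1", "average_precision", "log_loss"):
    print(name, round(float(results[name]), 4))
```

The curve entries hold the arrays needed for plotting; the scalar entries can be tabulated per classifier for comparison.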
This folder contains all the trained models. The models were trained on 400 microarray samples from 8 independent experiments on clear cell renal cell carcinoma.
This folder contains the test results.
This Python file contains the code for data preparation, i.e. feature extraction.
This Python file contains the code for training the ML classifiers.
This Python file contains the code for testing and evaluating the classifiers.
This Python file contains the functions used in training and testing.
The dataset should have the following shape for these classifiers:
columns - gene names or probe IDs
rows - samples
The last column should be 'labels', containing 'tumor' for tumor samples and 'normal' for normal samples.
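The layout described above can be checked with a few lines of pandas before passing data to a classifier. This is a sketch under assumptions: the helper `split_features_labels`, the exact column name `labels`, the 0/1 encoding, and the gene names in the toy table are hypothetical and only illustrate the expected shape.

```python
import pandas as pd


def split_features_labels(df):
    """Validate the expected layout and return (X, y).

    Expects samples as rows, genes/probes as columns, and a final
    column named 'labels' holding 'tumor' or 'normal' (assumed name).
    """
    if df.columns[-1] != "labels":
        raise ValueError("the last column must be 'labels'")
    unexpected = set(df["labels"].unique()) - {"tumor", "normal"}
    if unexpected:
        raise ValueError(f"unexpected label values: {sorted(unexpected)}")

    X = df.iloc[:, :-1]                       # expression values only
    y = (df["labels"] == "tumor").astype(int)  # 1 = tumor, 0 = normal
    return X, y


# Toy table with hypothetical gene names and made-up values.
demo = pd.DataFrame(
    {
        "GENE_A": [7.1, 5.3, 6.8],
        "GENE_B": [2.4, 2.9, 2.2],
        "labels": ["tumor", "normal", "tumor"],
    },
    index=["sample_1", "sample_2", "sample_3"],
)
X, y = split_features_labels(demo)
print(X.shape, y.tolist())
```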
These classifiers were trained on the whole-genome clear cell renal cell carcinoma microarray meta-dataset.
Using these classifiers and training them further on more diverse datasets for different diseases is encouraged.