Author: Jonathan Fetterolf
Malaria Prediction Web Application
Important Note: The main notebook, Predicting_Malaria.ipynb, was run in Google Colaboratory. That file contains instructions for recreating the notebook.
According to the latest World Malaria Report published by the World Health Organization, there were 247 million cases of malaria in 2021 compared to 245 million cases in 2020. The estimated number of malaria deaths stood at 619,000 in 2021 compared to 625,000 in 2020. An early diagnosis and subsequently early treatment of malaria will help doctors practicing in areas with high rates of malaria infection and malaria deaths. Four African countries accounted for over half of all malaria deaths worldwide: Nigeria (31.3%), the Democratic Republic of the Congo (12.6%), United Republic of Tanzania (4.1%) and Niger (3.9%).
The WHO | Regional Office for Africa recognizes that malaria is going undiagnosed and subsequently untreated in areas where the parasite is prevalent and the resources to diagnose and treat it are the lowest. The WHO wants to create a model that can accurately predict whether or not a cell from a stained blood smear is infected with malaria in order to more effectively diagnose and treat malaria in the population.
(Auxiliary data exploration can be found in the auxiliary_data.ipynb notebook.)
This application can save lives. According to the CDC, in an ideal situation malaria treatment should not be initiated until the diagnosis has been established by laboratory testing. “Presumptive treatment”, i.e., without prior laboratory confirmation, should be reserved for extreme circumstances, such as strong clinical suspicion of severe disease in a setting where prompt laboratory diagnosis is not available. Doctors will still be needed to take blood and provide treatments. Histologists will still be required to prepare slides and confirm the diagnoses. This technology will simply make their operations more efficient and allow them to diagnose and treat more patients.
Malaria parasites can be identified by examining under the microscope a drop of the patient’s blood, spread out as a “blood smear” on a microscope slide. Prior to examination, the specimen is stained (most often with the Giemsa stain) to give the parasites a distinctive appearance. This technique remains the gold standard for laboratory confirmation of malaria. However, it depends on the quality of the reagents, of the microscope, and on the experience of the laboratorian.
In the case of identifying cells parasitized by malaria, the Giemsa stain is particularly useful because the stain binds to the parasite's chromatin and makes it stand out under a microscope.
The CDC states that malaria must be recognized promptly in order to treat the patient in time and to prevent further spread of infection in the community via local mosquitoes. Malaria should be considered a potential medical emergency and should be treated accordingly. Delay in diagnosis and treatment is a leading cause of death in malaria patients in the United States.
When considering the diagnosis of malaria, false negatives are more costly than false positives for a few reasons:
- Treatment is relatively cheap (USD $3-6 as of 2013)
- Side effects are minimal
- Undiagnosed malaria can lead to community transmission and eventually to death
Recall will be a very important metric when evaluating the models as the goal is minimizing false negatives.
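As a minimal sketch of why recall is the metric to watch here, the snippet below computes recall and precision from confusion-matrix counts. The function names and the counts in the example are hypothetical and are not taken from the notebook.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of truly parasitized cells that the model flags as parasitized."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined without any positive samples")
    return true_positives / actual_positives


def precision(true_positives: int, false_positives: int) -> float:
    """Share of cells flagged as parasitized that really are parasitized."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        raise ValueError("precision is undefined without any positive predictions")
    return true_positives / predicted_positives


# Hypothetical counts: 95 infected cells caught, 5 missed, 10 false alarms.
example_recall = recall(true_positives=95, false_negatives=5)
example_precision = precision(true_positives=95, false_positives=10)
```

Every false negative lowers recall directly, while false positives only affect precision, which is why a model tuned for this problem can accept some false alarms in exchange for fewer missed infections.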
The data originally comes from the National Institutes of Health's National Library of Medicine (NLM - NIH). It can be found at TensorFlow or Kaggle. The data consists of 27,558 cell images with equal instances of parasitized and uninfected cells from the thin blood smear slide images of segmented cells. Having equal samples is important in the training of this model to avoid class bias in predictions generated by the model.
Note: I have constructed smaller datasets to require less processing power while running the notebook. These datasets also have equal instances of parasitized and uninfected cells.
I have also brought in auxiliary data that is not used in the modeling process. It's used to generate statistics and visualizations about malaria cases and deaths from around the world. This data is provided by the WHO and can be found in the following places:
Resizing gives every image the same dimensions, which keeps the input to the network consistent during training, while rescaling the pixel values helps the CNN learn more effectively.
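The two preprocessing steps can be illustrated with a small pure-Python sketch. It assumes a grayscale image stored as a nested list, nearest-neighbor resizing, and division by 255; the notebook itself most likely relies on library utilities and RGB images, so the helper names and choices below are illustrative only.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a 2D image (list of rows) with nearest-neighbor sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[row * old_height // new_height][col * old_width // new_width]
            for col in range(new_width)
        ]
        for row in range(new_height)
    ]


def rescale(image, max_value=255.0):
    """Map pixel values from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


# A tiny 2x2 "image" blown up to 4x4 and then scaled to the 0-1 range.
tiny = [[0, 255], [255, 0]]
resized = resize_nearest(tiny, 4, 4)
prepared = rescale(resized)
```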
Using this data augmentation will help avoid overfitting by creating unseen training examples from the existing ones, thereby increasing the size of the training dataset.
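A rough sketch of the idea, assuming a random horizontal flip followed by a random rotation in 90-degree steps. This simplification is an assumption made to keep the example dependency-free; an augmentation layer in a deep learning framework would typically rotate by arbitrary angles. The helper names are hypothetical.

```python
import random


def flip_horizontal(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, rng):
    """Return a randomly flipped and rotated copy of the image."""
    out = [list(row) for row in image]
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):  # 0, 1, 2, or 3 quarter turns
        out = rotate_90(out)
    return out


cell = [[1, 2], [3, 4]]
variant = augment(cell, random.Random(0))
```

Each call produces a copy with the same pixels in a different arrangement, so the label (parasitized or uninfected) stays valid while the network sees a new training example.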
The data I use for this problem is evenly balanced. A baseline model that classifies every cell as 'Uninfected' results in an accuracy of 50%.
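The 50% figure follows directly from the class balance. The sketch below rebuilds it from the dataset size stated above (27,558 images split equally); the function name and label strings are illustrative.

```python
def baseline_accuracy(labels, constant_prediction):
    """Accuracy of a model that predicts the same class for every sample."""
    correct = sum(1 for label in labels if label == constant_prediction)
    return correct / len(labels)


# 27,558 images with equal instances of each class.
labels = ["Parasitized"] * 13779 + ["Uninfected"] * 13779
accuracy = baseline_accuracy(labels, "Uninfected")
```

Such a baseline never flags an infected cell, so its recall on the parasitized class is zero, which is the failure mode the trained models need to avoid.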
I decided to build and train a Convolutional Neural Network (CNN) for this problem because it effectively learns from spatial features in images such as edges, corners, and textures. The CNN classifies the images based on these features and is typically very successful in image classification problems like this.
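To make "learning from spatial features" concrete, here is a small sketch of the sliding-window operation a convolutional layer applies. The kernel is hand-picked to respond to a vertical edge; in a trained CNN the kernel values are learned, and the image and names here are purely illustrative.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for col in range(out_w)
        ]
        for row in range(out_h)
    ]


# Dark left half, bright right half: a vertical edge down the middle.
image = [[0, 0, 1, 1] for _ in range(4)]
# Responds when brightness increases from left to right.
vertical_edge_kernel = [[-1, 1], [-1, 1]]
feature_map = convolve2d(image, vertical_edge_kernel)
```

The resulting feature map is non-zero only in the column where the brightness changes, which is how a convolutional layer localizes features such as edges.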
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 6,479,873
- Trainable params: 6,479,873
- Non-trainable params: 0
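For readers unfamiliar with the loss and the false-negative metric listed above, the following dependency-free sketch shows how each is computed. The clipping constant and the 0.5 decision threshold are assumptions that mirror common practice, and the function names are hypothetical.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
        total += -(target * math.log(prob) + (1 - target) * math.log(1 - prob))
    return total / len(y_true)


def count_false_negatives(y_true, y_pred, threshold=0.5):
    """Number of positive samples predicted below the decision threshold."""
    return sum(
        1 for target, prob in zip(y_true, y_pred)
        if target == 1 and prob < threshold
    )


# Label 1 = parasitized, label 0 = uninfected (an assumed encoding).
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
missed = count_false_negatives([1, 1, 0], [0.9, 0.2, 0.1])
```

The loss penalizes confident wrong predictions far more heavily than confident correct ones, while the false-negative count tracks the specific error this project aims to minimize.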
This model has the same structure but adds a data augmentation layer, which performs a random flip and a random rotation on each image.
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 6,479,873
- Trainable params: 6,479,873
- Non-trainable params: 0
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 6,747,265
- Trainable params: 6,744,897
- Non-trainable params: 2,368
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 1,246,305
- Trainable params: 1,246,305
- Non-trainable params: 0
This model returns to the structure of Model 2 but increases the number of epochs.
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 6,479,873
- Trainable params: 6,479,873
- Non-trainable params: 0
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 67,373,441
- Trainable params: 67,373,441
- Non-trainable params: 0
This model uses the structure from Model 2 and was trained on over 19,000 images and validated with 5,500 images.
Parameters
- Optimizer: adam
- Loss: binary crossentropy
- Metrics: accuracy, false negatives
- Total params: 6,479,873
- Trainable params: 6,479,873
- Non-trainable params: 0
Tested on 2,700 unseen images, with the following results:
- Accuracy: 0.9655172228813171
- Precision: 0.9766213893890381
- Recall: 0.9536082744598389
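As a quick worked calculation, the reported precision and recall can be combined into two derived figures: the F1 score and the miss rate. Neither number is reported in the notebook; they are simple arithmetic on the values above and are shown only to help interpret them.

```python
# Values copied from the test results above.
precision = 0.9766213893890381
recall = 0.9536082744598389

# Harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)

# Share of truly parasitized cells the model fails to flag.
miss_rate = 1 - recall
missed_per_1000_infected = miss_rate * 1000
```

Under these figures the model would miss roughly 46 of every 1,000 parasitized cells, which is the quantity that further data collection and retraining would aim to reduce.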
- I would like to collect more data and retrain the model.
- Create a new feature for the application. This will allow the user to submit an image of an entire blood smear with many blood cells, split that image into separate images of individual cells that can be used as input to the model.
- The model will now be able to deliver estimated parasitic burden which is used by clinicians to make decisions regarding treatment for malaria cases.
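One possible way to turn per-cell predictions into a burden estimate is sketched below. It assumes the smear has already been split into individual cells, that the model returns a probability of infection for each cell, and that 0.5 is the decision threshold; the function name is hypothetical and the feature does not exist in the application yet.

```python
def estimate_parasitemia(cell_probabilities, threshold=0.5):
    """Percentage of cells in a smear predicted to be parasitized."""
    if not cell_probabilities:
        raise ValueError("at least one cell prediction is required")
    infected = sum(1 for prob in cell_probabilities if prob >= threshold)
    return 100.0 * infected / len(cell_probabilities)


# Hypothetical model outputs for four segmented cells from one smear.
burden_percent = estimate_parasitemia([0.9, 0.1, 0.2, 0.8])
```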
This new tool will rapidly and accurately diagnose potential cases of malaria, estimate parasitic burden, and allow for the early treatment of more malaria cases, greatly reducing community transmission and saving lives around the world.
├── application
│ ├── pages
│ │ ├── 2_Data_Summary.py
│ │ └── 3_Model_Prediction.py
│ ├── model5.h5
│ └── requirements.txt
├── data
│ ├── Unseen Data
│ ├── confirmed_cases_malaria.csv
│ ├── estimated_cases_malaria.csv
│ └── estimated_deaths_malaria.csv
├── images
│ ├── conf_case_by_year.jpeg
│ ├── est_case_by_year.jpeg
│ ├── est_death_by_year.jpeg
│ ├── example_data.jpeg
│ ├── header.jpeg
│ ├── image_augmentation.jpeg
│ ├── jf.jpeg
│ ├── mal_cells.jpg
│ └── map_conf_cases.jpeg
├── .gitignore
├── LICENSE
├── Predicting_Malaria.ipynb
├── README.md
├── auxiliary_data.ipynb
└── predicting_malaria_slides.pdf