West Nile Virus Predictions

Problem Statement

This problem comes from a 2015 Kaggle competition (details here). The official problem statement for the competition is, “A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CPHD [Chicago Public Health Department] more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.”

Background

West Nile Virus (WNV) is an infectious disease that was discovered in 1937 in the West Nile region of Uganda. It was first seen in the United States in 1999 and is spread through infected mosquitos. Common symptoms include mild fever, headaches, body aches, and skin rashes. While most people don't experience serious symptoms, WNV can be life-threatening if it enters the brain, causing encephalitis. Unfortunately, there are no vaccines or treatments available, so the best way to prevent getting WNV is to avoid mosquito bites.

The City of Chicago and the Chicago Public Health Department (CPHD) have resorted to spraying areas where WNV is detected in order to kill infected mosquitos and prevent the virus from spreading. They have set up traps around the city to capture mosquitos, which are then sent off for testing to determine whether WNV is present.

To learn more about the distribution of WNV in our dataset, see our Tableau Public Story here.

Goal

Using advanced data modeling techniques, we want to predict when and where different species of captured mosquitos will test positive for West Nile Virus, based on past data and weather information. Ultimately, we want to find ways to use these predictions to help the City of Chicago and CPHD allocate resources better. How effective our predictions are will determine what they can reasonably be used for.

Methods

We were given data from the years 2007, 2009, 2011, and 2013, including when and where mosquitos were captured in Chicago's traps, what species they were, and whether or not they tested positive. We were also given weather data from two weather stations in Chicago for 2007-2014 (see the data dictionaries below for more information). Data on where the City of Chicago sprayed for mosquitos in 2011 and 2013 were also provided, but we did not use them for analysis because they were too sparse.

We cleaned the data, averaged the data from the two weather stations, and merged the weather data with the mosquito capture data for analysis. After that, we built several models, including Logistic Regression, Naive Bayes, and various Decision Tree-based models (Random Forest, AdaBoost, and Extreme Gradient Boosting). Because the competition required probability estimates in the final output, we did not attempt methods such as K-Nearest Neighbors and Support Vector Classification, which are designed primarily to assign categories rather than to estimate category probabilities.

We judged our models on the Area Under the Receiver Operating Characteristic (ROC) Curve (the ROC AUC score) instead of overall accuracy, because the presence of WNV was a rare event (around 5% of cases). The best-scoring model was an XGBoost model, so this was our final model.
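To illustrate why ROC AUC is more informative than accuracy for a rare event, here is a minimal sketch in plain Python. The labels and scores are made-up toy values (not taken from the project data), and `roc_auc` and `accuracy` are illustrative helper names rather than functions from the project notebook.

```python
def roc_auc(labels, scores):
    """ROC AUC computed as the chance that a random positive case
    is scored higher than a random negative case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at a fixed probability threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (assumption): 5 positives in 100 cases, roughly 5% prevalence.
labels = [1] * 5 + [0] * 95

# A useless model that gives every case the same low probability.
constant_scores = [0.05] * 100

# A model that ranks the positives above 90 of the 95 negatives.
informative_scores = [0.30] * 5 + [0.40] * 5 + [0.02] * 90

print(accuracy(labels, constant_scores), roc_auc(labels, constant_scores))
print(accuracy(labels, informative_scores), roc_auc(labels, informative_scores))
```

Both toy models reach 95% accuracy simply by never predicting a positive at the 0.5 threshold, but only the second one ranks positives above most negatives, and only ROC AUC reflects that difference.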

Results

Using XGBoost, we were able to achieve a cross-validated ROC AUC score of 0.85, which is considered a good score (the score ranges from 0.5, the equivalent of random guessing, to 1.0, always correct). To help interpret this score, consider one example point on the ROC curve: at a True Positive Rate (TPR) of 80%, our False Positive Rate (FPR) would be around 25%. That is, when we correctly predict 80% of the WNV-positive cases, we incorrectly predict 25% of the WNV-negative cases to be positive. Because WNV-positive cases are so much rarer than WNV-negative ones, this amounts to a 99% precision for WNV-negative (99% of cases predicted to be WNV-negative are correct) and a 15% precision for WNV-positive (15% of cases predicted to be WNV-positive are correct).
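The arithmetic behind these precision figures can be sketched as follows. This assumes a prevalence of exactly 5% (the Methods section says "around 5%") and uses the approximate TPR and FPR quoted above, so the output only roughly matches the rounded figures in the text. The function name `predictive_values` is illustrative.

```python
def predictive_values(tpr, fpr, prevalence):
    """Return (precision for positives, precision for negatives)
    given a point on the ROC curve and the share of positive cases."""
    true_pos = tpr * prevalence
    false_neg = (1 - tpr) * prevalence
    false_pos = fpr * (1 - prevalence)
    true_neg = (1 - fpr) * (1 - prevalence)
    precision_pos = true_pos / (true_pos + false_pos)
    precision_neg = true_neg / (true_neg + false_neg)
    return precision_pos, precision_neg


# Assumed inputs: TPR 80%, FPR 25%, prevalence 5%.
precision_pos, precision_neg = predictive_values(tpr=0.80, fpr=0.25, prevalence=0.05)
print(f"WNV-positive precision: {precision_pos:.1%}")  # 14.4%
print(f"WNV-negative precision: {precision_neg:.1%}")  # 98.6%
```

With 95% of cases negative, even a 25% FPR produces far more false positives than there are true positives, which is why the positive precision is so low.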

Recommendation

Because even a small False Positive Rate equates to a large number of false positives (there are over 2500 mosquito specimens tested each year, and 95% are WNV-negative), it is not reasonable to use our predictions to decide where to spray for mosquitos. Spraying is too expensive and potentially harmful to do where it is not needed. However, our predictions can help the City of Chicago and CPHD determine which specimens need to be sent in for testing and which don't. If a particular specimen is predicted to have a near-zero probability of testing positive for WNV, the city can save money by not sending it in for testing. At an estimated $500 per lab test, the savings add up quickly. The table below outlines the estimated yearly savings at different thresholds, based on five-fold cross-validated predictions on in-sample data:

TPR/Recall (% of actual WNV-positive cases predicted correctly)	Threshold (probability needed to be predicted WNV-positive)	Estimated # of Traps to NOT be Tested Per Year (predicted WNV-negative)	Estimated Annual Savings (based on $500 per test)
80%	5.6%	1,889	$944,500
85%	4.1%	1,729	$864,250
90%	3.0%	1,547	$773,625
95%	2.0%	1,321	$660,250
100%	0.1%	366	$183,000
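The savings column is the number of untested specimens multiplied by the cost per test. The sketch below recomputes it from the rounded counts in the table, so some rows differ slightly from the listed savings (for example, 1,729 × $500 = $864,500 rather than $864,250); presumably the table was built from unrounded yearly averages, though that is an assumption.

```python
COST_PER_TEST = 500  # estimated lab cost per test, in dollars (from the text above)

# Rounded counts copied from the table: TPR -> specimens not tested per year.
not_tested_per_year = {
    0.80: 1889,
    0.85: 1729,
    0.90: 1547,
    0.95: 1321,
    1.00: 366,
}

# Estimated annual savings at each TPR level.
savings = {tpr: n * COST_PER_TEST for tpr, n in not_tested_per_year.items()}

for tpr, dollars in savings.items():
    print(f"TPR {tpr:.0%}: ${dollars:,}")
```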

We recommend that the City of Chicago and CPHD aim for a True Positive Rate of 95%. In practice, this means sending a specimen for testing only if its estimated probability of being WNV-positive is greater than 2%. This would limit the number of WNV-positive specimens being missed while also saving a great deal of money.
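As a rough sketch of how a probability threshold can be derived from a target True Positive Rate, the snippet below picks the highest threshold that still catches the desired share of known positives. The labels and probabilities are made-up toy values, `threshold_for_recall` is a hypothetical helper, and the rule "send if probability >= threshold" is an assumption of this sketch.

```python
def threshold_for_recall(labels, scores, target_recall):
    """Highest threshold such that flagging every case with
    score >= threshold catches at least target_recall of the positives."""
    pos_scores = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    for k, score in enumerate(pos_scores, start=1):
        if k / len(pos_scores) >= target_recall:
            return score
    raise ValueError("target_recall must be between 0 and 1")


# Toy data (assumption): 4 WNV-positive and 6 WNV-negative specimens.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.10, 0.03, 0.015, 0.20, 0.02, 0.01, 0.01, 0.005, 0.001]

threshold = threshold_for_recall(labels, scores, target_recall=0.75)
sent_for_testing = [s for s in scores if s >= threshold]
skipped = len(scores) - len(sent_for_testing)

print(threshold)  # 0.03
print(skipped)    # 6
```

Raising the target recall lowers the threshold, so fewer specimens are skipped and the savings shrink, which is the trade-off shown in the table above.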

Risks and Assumptions

The results of this study depend greatly on certain assumptions, chief among them that WNV prevalence will not change dramatically from year to year. If prevalence rises sharply one year, we would have to adjust our calculations, and it's likely that more specimens would need to be tested, which would decrease savings. Another big assumption is that the distribution of WNV across the city of Chicago will not change dramatically; that is, no major changes will occur that increase the likelihood of WNV being present in an area where it hasn't been before. For the four years in the study, there are certain areas where WNV is more common, and this must remain consistent for the results to be valid. The City of Chicago might consider periodic testing of specimens from areas where WNV was previously not seen, just in case. Again, this would decrease cost savings but may catch these changes more quickly. Finally, if lab costs for testing WNV specimens change greatly, the savings will change along with them. These risks and assumptions should be kept in mind as decisions get made moving forward.

Data Dictionaries

Weather Data from May 1, 2007 to Oct 31, 2014 from NOAA

Column Name	Data Type	Description	Notes
Station	Integer	Which station the data in the row come from	Station 1 or Station 2
Date	DateTime	Date the row's data come from
Tmin	Integer	Minimum temperature that date
Tmax	Integer	Maximum temperature that date
Tavg	Integer	Average temperature across the day
Depart	Integer	Difference from normal for that day of the year	Only available for Station 1
DewPoint	Integer	Average Dew Point Temperature	Temperature where water vapor starts to condense out of the air
WetBulb	Integer	Average Wet Bulb Temperature	Adiabatic saturation temperature (more info here)
Heat	Integer	65 - Tavg (if Tavg <= 65)
Cool	Integer	Tavg - 65 (if Tavg > 65)
Sunrise	Time	Sunrise time in military time	Only available for Station 1
Sunset	Time	Sunset time in military time	Only available for Station 1
CodeSum	List	Code(s) for various weather conditions	e.g. FG = Fog, HZ = Haze
Depth	Integer	Precipitation Depth, if applicable (else 0)
SnowFall	Float	Snowfall (inches)
PrecipTotal	Float	Rain (inches)
StnPressure	Float	Average station pressure
SeaLevel	Float	Sea level pressure (inches of Hg)
ResultSpeed	Float	Resultant wind speed (mph)	Resultant wind = vector sum of wind speeds and directions
ResultDir	Integer	Resultant wind direction (degrees)	Resultant wind = vector sum of wind speeds and directions
AvgSpeed	Integer	Average wind speed

Source: Kaggle documentation
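The Heat and Cool columns above can be reproduced from Tavg with a short sketch like the one below. It assumes temperatures in degrees Fahrenheit and that each column is 0 when its condition does not hold; the function name `degree_days` is illustrative and not part of the dataset or notebook.

```python
BASE_TEMP_F = 65  # base temperature used in the Heat and Cool definitions


def degree_days(tavg):
    """Return (heat, cool) for one day's average temperature.

    heat = 65 - Tavg when Tavg <= 65, otherwise 0 (assumed)
    cool = Tavg - 65 when Tavg > 65, otherwise 0 (assumed)
    """
    heat = max(BASE_TEMP_F - tavg, 0)
    cool = max(tavg - BASE_TEMP_F, 0)
    return heat, cool


print(degree_days(58))  # (7, 0)
print(degree_days(72))  # (0, 7)
print(degree_days(65))  # (0, 0)
```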

Trap Data

Column Name	Data Type	Description
Date	DateTime	Date the row's data come from
Address	String	Approximate address of the trap location; this is the address sent to the GeoCoder
Species	String	Species of mosquito for that row of data
Block	Integer	Block Number of trap address
Street	String	Street name of trap address
Trap	String	Trap ID
AddressNumberAndStreet	String	Address and street of the trap
Latitude	Float	Trap latitude
Longitude	Float	Trap longitude
AddressAccuracy	Integer	Accuracy of the trap address returned by the GeoCoder
NumMosquitos	Integer	Number of mosquitos of a particular species found in a particular trap
WnvPresent	Boolean	Whether or not West Nile Virus was found in the sample

Source: Kaggle competition
