DrivenData is currently holding a competition in which we try to predict Dengue fever cases in two cities. Dengue is a neglected tropical disease, so called because it primarily harms poor people in tropical zones. If you had the good fortune to be born in a developed nation, you probably have not had Dengue, nor are you likely to know anyone who has.
This is a difficult problem to solve because of the randomness inherent in how Dengue fever cases arise. First, a mosquito has to become infected with the Dengue virus; then it has to bite a person with no prior immunity against that viral strain. We have a dataset containing various metrics that serve as proxies for how likely a Dengue transmission is, including temperature ranges, date, precipitation, humidity, and vegetation coverage. To start, let's take a look at our training dataset to see how the number of confirmed cases has varied over time.
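As a sketch, here is how that first look might go in pandas. The weekly series below is synthetic stand-in data (the real labels come from the competition's CSV downloads), with a seasonal shape that is only illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition's weekly case counts;
# the mid-year peak is built in purely for illustration.
rng = np.random.default_rng(42)
weeks = pd.date_range("2000-01-03", periods=156, freq="W-MON")
cases = rng.poisson(lam=10 - 8 * np.cos(np.arange(156) * 2 * np.pi / 52))

# Average cases by ISO week of year to expose the seasonal pattern.
seasonal = pd.Series(cases, index=weeks).groupby(weeks.isocalendar().week).mean()
print(seasonal.idxmax())  # week of year where average cases peak
```

A quick `plot` of either series would show the same seasonality visually; on the real data, that is the pattern worth eyeballing before modeling.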
On the bright side, the risk of leakage is very low: the dataset is small and contains no features with a one-to-one correspondence to the number of cases. This is great news, because leakage is an excellent way to fool yourself into believing your model is better than it actually is. Let's move on to the target itself.
As the competition scores submissions by mean absolute error, we will use that metric as well. We also need a baseline against which to judge future models: the mean absolute error we get by predicting the training median for every test case.
Starting out, we get a baseline error of 19.88, so that is the number to beat with future models.
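The baseline calculation itself is a one-liner; here is a sketch with toy numbers standing in for the real case counts:

```python
import numpy as np

# Toy stand-in for the real case counts; the actual baseline of 19.88
# comes from the competition's training labels.
y_train = np.array([0, 4, 7, 12, 30])
y_val = np.array([5, 9, 20])

# Predict the training median for every validation row, then score with MAE.
baseline = np.median(y_train)                    # 7.0 on this toy data
baseline_mae = np.abs(y_val - baseline).mean()   # (2 + 2 + 13) / 3 ≈ 5.67
print(baseline_mae)
```

Any model worth keeping has to beat this constant prediction.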
Our target is a count of cases, making this a regression problem. To start our analysis, let's fit a linear model to get a quick sense of how much could potentially be learned from the data.
The most salient features for a ridge regression model turn out to be the absolute and relative humidity and the dew point temperature. This makes sense, as we normally associate mosquitoes with hot, humid places where they can thrive. Running this model gives us a mean absolute error of 22 on our validation data, which does not beat our baseline. Let's check whether the relationship between the top three features and the target is actually linear, or whether we should use a non-linear model for prediction.
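A minimal sketch of the ridge step with scikit-learn, on synthetic data: the three columns only mimic the humidity and dew point features, and the coefficients and error here are illustrative, not the competition numbers.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic features standing in for rel. humidity, abs. humidity, dew point.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 5 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=2, size=n)

# Fit on the first 150 rows, validate on the last 50.
model = Ridge(alpha=1.0).fit(X[:150], y[:150])
mae = mean_absolute_error(y[150:], model.predict(X[150:]))
print(model.coef_, mae)
```

Inspecting `model.coef_` (ideally on standardized features) is what surfaces the "most salient features" mentioned above.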
Plots of our three most important features against the target show that the assumptions behind a linear model are violated. In particular, while the relationship appears somewhat linear at high case counts, the data is not homoscedastic: the variance is not constant across the range of case counts. For this reason, we will move on to a model that does not require that assumption, starting with a random forest. Let's look at a few of the predictions made by that model to get a sense of how it works. First, let's look at the middle of the year, where cases tend to be higher. Note that dates have to be converted into numbers for the model; this observation falls in the middle of the year.
Next, we can look at how the model treats cases at the end of the year, where cases are no longer peaking. See how the week start date is a strong predictor of the number of cases during certain times of the year.
Ultimately, we get a mean absolute error of 14 on our validation data, which does beat the baseline, but I think that we can improve on this.
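The random forest step could be sketched like this, again on synthetic data, with week-of-year standing in for the numeric date feature described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic seasonal data: cases peak mid-year, with a weak humidity effect.
rng = np.random.default_rng(1)
week = rng.integers(1, 53, size=400)          # week-of-year as a number
humidity = rng.normal(80, 5, size=400)
y = (10 - 8 * np.cos(week * 2 * np.pi / 52)
     + 0.2 * (humidity - 80)
     + rng.normal(scale=1, size=400))
X = np.column_stack([week, humidity])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:300], y[:300])
mae = mean_absolute_error(y[300:], rf.predict(X[300:]))
print(rf.feature_importances_, mae)  # the date feature dominates
```

The feature importances are the quickest way to confirm that the week start date is carrying most of the predictive weight, as observed above.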
One final model we can try is XGBoost, which should offer a low mean absolute error given the relatively few examples we have to predict from. The reason is that boosted ensembles of trees tend to be more robust to noise in the training data than our previous models, which may learn that noise and become less accurate on the test set.
After trying this algorithm, we achieve a mean absolute error of 5 on the validation data but 26 on the test data, far from the current best of 10 in the competition. That gap suggests overfitting; still, with no feature engineering, this is the best-performing model we have made so far.
The next step here would be to engineer new features, based on our understanding of the dataset, along with further hyperparameter tuning to ensure that we get optimal performance out of our algorithm.
The broader point is that Dengue fever is a debilitating disease that mostly does not exist in the developed world. Predicting the number of reported Dengue cases matters, but preventing those cases from happening at all could spare thousands of children death and extended suffering every year. The great part about DrivenData competitions is that they let data science be applied to problems that are often underserved in the real world. If you know data science yourself, I encourage you to beat my score here:
Make a connection: