
WINNING A HACKATHON


CONTEXT AND OBJECTIVE

In the context of the MIT Applied Data Science Program, a hackathon was organized during the last week, just before the completion ceremony. This was an opportunity to connect with and compete alongside skilled fellow students, apply what we had learned, and simply have fun! From a cohort of 400 students, about 80 enrolled in the hackathon, myself included. I ended up winning the competition.

While I am not allowed to share the details of the data science case we had to work on, suffice it to say that the goal was to solve a binary classification problem (overall satisfaction of Shinkansen travelers), with a tabular dataset consisting mostly of categorical features.

WHAT WAS DONE

I started the competition with a plan in mind:

  1. Do a quick first iteration, end to end, to get a good overview of the problem.
  2. Do an EDA, improve data preprocessing, pick a good model and fine-tune it.
  3. Explore additional ideas to improve the solution, and hopefully gain a few positions on the leaderboard.

Steps 1 and 2 got me to 3rd place, and step 3 to 1st place.


Quick first iteration

Load the data and get a quick overview (features, data types, number of records, proportion of positive vs. negative cases), create a basic pipeline that fixes obvious problems (missing data) and encodes the categorical features, and train a simple model. As this was a binary classification problem, I used logistic regression for this first iteration. The accuracy wasn’t great, but the goal was to go through the complete problem once, to better grasp its nuances.
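For illustration, here is a minimal sketch of what such a first iteration could look like with scikit-learn. The file name, the target column, and the train/validation split are all hypothetical, since the actual dataset cannot be shared:

```python
# A minimal first-pass sketch: impute, encode categoricals, fit a baseline.
# "train.csv" and the "satisfied" target are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")
X, y = df.drop(columns=["satisfied"]), df["satisfied"]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Basic pipeline: fix missing data, one-hot encode the categorical features.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=42)
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.4f}")
```

The point of a pipeline like this is not the score, it is that every later improvement (better imputation, new features, a stronger model) can be swapped into an already-working skeleton.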

Perform a more in-depth analysis

This involved spending quite some time on EDA, to better understand each feature and the relationships between them. This allowed me to fix features with problematic (i.e., extremely skewed) distributions, to apply more nuanced imputation techniques (mean, median, or mode are not always the best choices), and to take note of interesting feature combinations that could be engineered. I also tried more advanced models based on decision trees (they very often work great with tabular data), picked the most promising one, and spent time fine-tuning it. A lesson learned: scaling the data and/or using PCA to reduce dimensionality and collinearity doesn’t always help.
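To make the "more nuanced imputation" and tree-model comparison concrete, here is a hedged sketch continuing the hypothetical setup from the first code block (`df`, `X`, `y`, and `preprocess`). The column names "SkewedCol" and "GroupCol" are made up, standing in for patterns found during the EDA:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Group-wise imputation: fill with the median of the row's group, rather
# than a single global statistic, when the EDA shows the feature's
# distribution depends on another feature.
df["SkewedCol"] = df.groupby("GroupCol")["SkewedCol"].transform(
    lambda s: s.fillna(s.median())
)
# Log-transform to tame an extremely skewed (non-negative) distribution.
df["SkewedCol"] = np.log1p(df["SkewedCol"])
X = df.drop(columns=["satisfied"])  # refresh features after the fixes

# Quick comparison of tree-based ensembles through the same preprocessing.
for clf in (RandomForestClassifier(n_estimators=300, random_state=42),
            GradientBoostingClassifier(random_state=42)):
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{type(clf).__name__}: {scores.mean():.4f}")
```

Cross-validating each candidate through the full pipeline keeps the comparison fair, since every model sees exactly the same preprocessing.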

Final optimization

This final step was probably the most time-consuming. In a real-world scenario, I would have been happy with the solution from step 2, which performed well, was computationally efficient, and provided some level of explainability. But the beauty of competitions is that the score is all that matters, and anything that can provide a gain of 0.01% accuracy should be used! I tried many techniques. Things that didn’t work: further fine-tuning the model from step 2, or manually building more complex models (e.g., voting ensembles). Things that worked: engineering additional features (typically sums or ratios of features that look interesting), and exploring more complex models using an automated framework (MLJAR). Lessons learned: this final step can only gain you a few extra ranks, so you need to start from a sound solution; and you will typically end up with a complex model (a stacked ensemble in this case) that you probably wouldn’t use in production on a real business problem, but that is still a great opportunity to learn about more advanced ML algorithms.
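As an illustration of this final step, here is a minimal sketch combining the two ideas that worked, using the mljar-supervised package. The engineered features are hypothetical examples of the sums and ratios mentioned above, and `X`, `y`, and `X_test` reuse the hypothetical setup from the earlier sketches:

```python
# Final-step sketch: hand-engineered features plus an automated search.
from supervised.automl import AutoML  # pip install mljar-supervised

# Feature engineering: sums or ratios of features that looked related
# during the EDA. These column names are purely illustrative.
X["ServiceScoreSum"] = X["Seat_Comfort"] + X["Onboard_Service"]
X["DelayRatio"] = X["Arrival_Delay"] / (X["Departure_Delay"] + 1)

# "Compete" mode trains many model families, then blends and stacks them;
# this is the kind of search that produced the final stacked ensemble.
automl = AutoML(mode="Compete", total_time_limit=3600,
                eval_metric="accuracy")
automl.fit(X, y)
predictions = automl.predict(X_test)
```

The trade-off is explicit here: a one-hour automated search will happily produce a stacked ensemble that squeezes out a few extra hundredths of a percent, at the cost of the simplicity and explainability the step-2 model had.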