The annual ritual of preparing March Madness brackets for the NCAA basketball tournament has led people to try all kinds of ridiculous approaches: Pick your favorite school or favorite conference; focus on team mascots or even uniform colors. Data scientists have a different approach.
Check out the “March Machine Learning Mania” competition over at Kaggle, an online platform for predictive modeling and analytics competitions. A major player in the Big Data world, Kaggle is hosting a competition sponsored by Intel (INTC) to see how well machine learning and statistical techniques can improve the ability to forecast NCAA tournament winners. More than 200 different teams have already entered, with more to come. The contest is broken down into two parts:
1) Creating an algorithmic model to predict the results of the past five annual tournaments.
2) Trying out that model in real time to predict the results of the 2014 tournament kicking off March 18.
Coming up with a reliable predictive model to sift and weight all manner of basketball stats to predict which team will be cutting down the net on April 7 at AT&T (T) Stadium in Arlington, Tex., is an admittedly difficult task. Since January, competitors have been analyzing previous tournaments to back-test, tweak, and optimize their software, according to Will Cukierski and Jeff Sonas, two of the competition administrators.
The real, 64-team NCAA tournament has 63 games, and in normal bracket competitions you’re asked to make 63 game-winning picks. (Yes, there are actually 68 teams, but most brackets ignore the four “first round” games added by the NCAA a few years ago.) One problem with that: Let’s say you pick UCLA to win in round one, but the Bruins actually end up losing; now your second-round game won’t feature the expected matchup, making it impossible to assess that pairing of teams.
Kaggle’s competition isn’t about winning an office pool but rather about developing a predictive model that can analyze a lot of data in a variety of scenarios. That’s why the online platform has asked its competitors to submit predictions across an entire matrix of possible games. Theoretically, any one of the 64 teams could play any other team at some point in the tournament, so Kaggle wants predictions for every possible matchup—2,016 predictions in all (64 times 63 divided by 2).
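That matchup count is just the number of unordered pairs you can draw from 64 teams. A quick sketch (the team IDs are placeholders, not real Kaggle identifiers):

```python
from itertools import combinations

# Placeholder IDs standing in for the 64 tournament teams.
teams = [f"team_{i:02d}" for i in range(64)]

# Every unordered pair of distinct teams is a possible tournament matchup.
matchups = list(combinations(teams, 2))

print(len(matchups))  # 64 * 63 / 2 = 2016
```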
Furthermore, Kaggle doesn’t just ask “who is going to win?” It also asks for a measure of confidence in your choice: a percentage representing the likelihood that a team will win. Competitors don’t get to say, “Virginia will beat Delaware,” but rather, “I think Virginia has a 78 percent chance of beating Delaware.” It’s the 78 percent that matters. Scoring is based on the combination of your confidence and your accuracy: If you say a team has a 100 percent chance of winning but end up getting it wrong, the scoring system will dock you hard. Highly confident wrong answers are what will burn you. Sonas says this is “a more scientific approach” to looking at the tournament.
Think of this as portfolio allocation and risk diversification, as in the stock market. Picks with 90 percent confidence are high risk/high reward in terms of the points you can gain or lose. Going with a 51 percent pick means you don’t have confidence in your choice: You’re effectively guessing, and you’re admitting that. You won’t get a lot of credit for being right, nor will you lose much for being wrong. The scoring system focuses on matching your confidence with your accuracy. Are you right when you are sure you will be? It’s this strategy that competitors are trying to optimize. What factors will matter, and how much do you want to bet on your confidence?
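That penalty shape—cheap when a confident pick is right, brutal when it’s wrong, and nearly flat around a 50/50 guess—is what a logarithmic loss produces. A minimal sketch (the function and the sample probabilities are illustrative, not Kaggle’s actual scoring code):

```python
import math

def log_loss(predicted_prob, team_won):
    """Penalty for one game. predicted_prob is the stated chance that a
    team wins; team_won is 1 if it did win, 0 if it did not. Lower is better."""
    # Clip to avoid an infinite penalty at a probability of exactly 0 or 1.
    p = min(max(predicted_prob, 1e-15), 1 - 1e-15)
    return -(team_won * math.log(p) + (1 - team_won) * math.log(1 - p))

# A hedged 51 percent pick costs about the same whether it is right or wrong...
hedged_right = log_loss(0.51, 1)   # ~0.67
hedged_wrong = log_loss(0.51, 0)   # ~0.71
# ...while a confident 99 percent pick is cheap when right, brutal when wrong.
bold_right = log_loss(0.99, 1)     # ~0.01
bold_wrong = log_loss(0.99, 0)     # ~4.61
```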
When it comes to designing a powerful predictive model, Cukierski says the best technique is “stacking”—taking several techniques and mixing them together. A competitor might have one model focusing only on giving a higher win probability to better seeds. He might have a second model that considers a defensive metric, such as rebounding. And he might have a third model weighting the probability based on the two teams’ out-of-conference strength of schedule. Each of these models by itself won’t be as helpful as combining them into one stacked model.
The most important subsequent problem is finding the right balance among those three models: Is it 50-30-20 or 80-10-10? That’s where the data science comes in: getting the right blend of factors. This is also why Kaggle is allowing competitors to test their models on the last five years of tournament results, to see what tweaks they’ll need to make before the tournament begins next week. Sonas says they provide competitors with 18 years of back data, allowing people to come up with their own rankings of teams, trying to beat more established concepts, such as the Rating Percentage Index, or RPI.
In the first part of the competition—analyzing the last five tournaments—Kaggle has been posting results, but they don’t mean all that much. Cukierski is pretty sure the top leaders are all cheating, either intentionally or inadvertently. He points out that it’s easy to create a model that overfits past data, or that uses information that existed only after the games were played and would not have been available to make predictions in advance.
After the NCAA tournament selections are announced on March 16, the real competition begins. Kaggle contest participants will fire up their models and make their predictions. Cukierski says once the final submissions are made and tournament play begins, competitors will begin to discuss the types of strategies they took.
He says you’ll probably see a mix of approaches: those with “domain expertise” (or knowledge about basketball itself), and strategies that are “abstracted from reality” (focused on optimizing the percentages, balancing the risk/reward, and handling the volatility of high confidence). Sonas says that based on his past experience running competitions in other fields, he thinks the winners will more likely be analytical mathematicians, as opposed to somebody with deep basketball knowledge.
Cukierski thinks the ultimate winner will probably have to go with a high-confidence approach. This is analogous to a high-volatility approach to picking stocks. A high-volatility portfolio most likely can’t hold up in the long run, but for an individual trying to win a single contest, the possibility of high reward is worth the risk. Lower-confidence approaches (like a lower-volatility stock portfolio) may be more stable and hold up over many years, but that may not be the single best winning strategy for 2014.
Sonas and Cukierski both point out that because the Kaggle competition scores every game equally, competitors who falter in early rounds could still come from behind and win later. That means there should be plenty of action for early losers, unlike in the real tournament. As the Kaggle strategies are publicly revealed and the game results roll in, we’ll be able to follow up on the approaches that proved most successful.
Next up in our March Madness Data Series, we’ll look at how some of the better-known experts in the field rank teams relative to other teams.