Kaggle Competitions
It's often been said that competition brings out the best in us. We are all attracted to contests;
our passion for competing seems hardwired into our souls. Apparently, even predictive modelers
find the siren song of competition irresistible.
That's what a small Australian firm named Kaggle has discovered: when given the chance, data
scientists love to duke it out, just like everyone else. Kaggle describes itself as "an innovative
solution for statistical/analytics outsourcing." That's a very formal way of saying that Kaggle
manages competitions among the world's best data scientists.
Here's how it works: Corporations, governments and research laboratories are confronted with
complex statistical challenges. They describe the problems to Kaggle and provide datasets. Kaggle
converts the problems and the data into contests that are posted on its website. The contests
feature cash prizes ranging in value from $100 to $3 million. Kaggle's clients range in size from tiny
startups to multinational corporations such as Ford Motor Company and government agencies
like NASA.
"The idea is that someone comes to us with a problem, we put it up on our website, and then
people from all over the world can compete to see who can produce the best solution," says
Anthony Goldbloom, Kaggle's founder and CEO.
In essence, Kaggle has developed a remarkably effective global platform for crowdsourcing thorny
analytic problems. What's especially attractive about Kaggle's approach is that it is truly a win-win
scenario: contestants get access to real-world data (carefully anonymized to eliminate privacy
concerns), and prize sponsors reap the benefits of the contestants' creativity.
It is not surprising that many Kaggle contestants use programs or packages written in R, the
open-source programming language designed specifically for data analysis. Created by two university
professors in New Zealand, R has emerged as the lingua franca of statistical analysts worldwide.
Because R enables analysts to visualize and model data very rapidly, it has become a favorite for
handling the kind of extremely large, complex data that have become increasingly common in
today's networked global economy.
R is also uniquely suited for competitions such as those managed by Kaggle. That's because the
competitions tend to focus on prototyping and modeling, rather than on execution.
"R is a really powerful prototyping tool. It has so many packages that just about anything you
could think to try is readily available. So in that regard, it gives participants the opportunity to
experiment with techniques that would otherwise be cumbersome to implement," says
Goldbloom.
Jeremy Howard, a highly successful Kaggle contestant, was introduced to Goldbloom at an R user
group meeting in Melbourne. Goldbloom asked if he was the same Jeremy Howard whose name
appeared so frequently on Kaggle's lists of leading contenders. "We got to talking and it turned
out we were kindred spirits," says Howard, who now serves as Kaggle's chief data scientist.
At a recent R user group meeting, Howard talked to a packed room about his winning strategies.
"R is great for running six different models and seeing what works," says Howard. "R is definitely
an important tool in a data miner's arsenal."
Because R is an open-source project, there are literally thousands of free R packages available for
downloading. Many of them, notes Howard, include cutting-edge statistical analytics. "You can
jump into R, try something and find out quickly if it works for you. That's really nice."
Kaggle was recently featured in a Wall Street Journal article focusing on a $3 million prize offered
by Heritage Provider Network Inc., a California-based physicians group. The prize will be awarded
to the data analyst who can develop the best model for predicting the number of days a patient is
likely to spend in the hospital over the next year, according to the Journal. Kaggle is handling the
competition, which is the largest yet of its kind.
The scope and scale of the Heritage competition aren't likely to scare off any of Kaggle's die-hard
contestants. "Competitions are a platform for data scientists to test the robustness of their
algorithms and theories," says Ming-Hen Tsai, a Kaggle contestant and former undergrad at
National Taiwan University. "The emergence of data mining competitions such as those run by
Kaggle helps data scientists share their knowledge more openly and effectively," says Ming-Hen.
And the availability of open-source packages and algorithms from the worldwide R community
makes it easier to see beneath the hood of highly complex analytic processes, he notes.
"Revealing the methods for solving analytic problems is important for advancing the data mining
community," says Ming-Hen. "Although many experiments result in papers, there aren't enough
comprehensive studies of the statistical methods used. Many papers just report results on specific
sets of data."
The lack of transparency makes it more difficult to recreate experiments, which in turn slows the
advancement of data mining as a science. "I think we should have more open-source software
implementing state-of-the-art algorithms," says Ming-Hen. "Let the performance speak for itself,
and the world can judge which is best."
Kaggle recently ran a competition to create the best recommendation system for R packages.
When he prepared the data for the contest, Howard used a variety of statistical methods,
including programs written in R.
Now, Kaggle contestants have an additional weapon to deploy in their quest for victory. Kaggle
has partnered with Revolution Analytics to provide Revolution R Enterprise free of charge to
Kaggle competitors for use in the competitions. Because this enhanced distribution of R scales to
the Big Data problems that are now part of many Kaggle contests, competitors can use R on very
large datasets, and competition sponsors can implement the resulting models in production
settings thanks to the commercial-grade enhancements of Revolution R Enterprise.
Reflecting the field of statistics itself, Kaggle has generated unexpected results. It has become an
informal recruiting ground for companies looking for the best and brightest data analysts.
"When you're talking to a prospective employer, you can say, 'Look at my Kaggle profile.' It's by
far the best reputation tool in data science," says Howard. "People who are successful in Kaggle
competitions can work anywhere they want to in this field."