WHITE PAPER

Competition Brings Out the Best in Data Analytics
R Provides a Winning Edge for Competitive Data Scientists
By David Smith

It's often been said that competition brings out the best in us. We are all attracted to contests;
our passion for competing seems hardwired into our souls. Apparently, even predictive modelers
find the siren song of competition irresistible.
That's what a small Australian firm named Kaggle has discovered: when given the chance, data
scientists love to duke it out, just like everyone else. Kaggle describes itself as "an innovative
solution for statistical/analytics outsourcing." That's a very formal way of saying that Kaggle
manages competitions among the world's best data scientists.
Here's how it works: Corporations, governments and research laboratories are confronted with
complex statistical challenges. They describe the problems to Kaggle and provide datasets. Kaggle
converts the problems and the data into contests that are posted on its web site. The contests
feature cash prizes ranging in value from $100 to $3 million. Kaggle's clients range in size from tiny
startups to multinational corporations such as Ford Motor Company and government agencies
like NASA.
"The idea is that someone comes to us with a problem, we put it up on our website, and then
people from all over the world can compete to see who can produce the best solution," says
Anthony Goldbloom, Kaggle's founder and CEO.
In essence, Kaggle has developed a remarkably effective global platform for crowdsourcing thorny
analytic problems. What's especially attractive about Kaggle's approach is that it is truly a win-win
scenario: contestants get access to real-world data (that has been carefully anonymized to
eliminate privacy concerns) and prize sponsors reap the benefits of the contestants' creativity.
It is not surprising that many Kaggle contestants use programs or packages written in R, the
open-source programming language designed specifically for data analysis. Created by two university
professors in New Zealand, R has emerged as the lingua franca of statistical analysts worldwide.
Because R enables analysts to visualize and model data very rapidly, it has become a favorite for
handling the kind of extremely large, complex data that have become increasingly common in
today's networked global economy.
R is also uniquely suited for competitions such as those managed by Kaggle. That's because the
competitions tend to focus on prototyping and modeling, rather than on execution.

Copyright 2011 Revolution Analytics

"R is a really powerful prototyping tool. It has so many packages that just about anything you
could think to try is readily available. So in that regard, it gives participants the opportunity to
experiment with techniques that would otherwise be cumbersome to implement," says
Goldbloom.
Jeremy Howard, a highly successful Kaggle contestant, was introduced to Goldbloom at an R user
group meeting in Melbourne. Goldbloom asked if he was the same Jeremy Howard whose name
appeared so frequently on Kaggle's lists of leading contenders. "We got to talking and it turned
out we were kindred spirits," says Howard, who now serves as Kaggle's chief data scientist.
At a recent R user group meeting, Howard talked to a packed room about his winning strategies.
"R is great for running six different models and seeing what works," says Howard. "R is definitely
an important tool in a data miner's arsenal."
Because R is an open-source project, there are literally thousands of free R packages available for
downloading. Many of them, notes Howard, include cutting-edge statistical analytics. "You can
jump into R, try something and find out quickly if it works for you. That's really nice."
Kaggle was recently featured in a Wall Street Journal article focusing on a $3 million prize offered
by Heritage Provider Network Inc., a California-based physicians group. The prize will be awarded
to the data analyst who can develop the best model for predicting the number of days a patient is
likely to spend in the hospital over the next year, according to the Journal. Kaggle is handling the
competition, which is the largest yet of its kind.
The scope and scale of the Heritage competition aren't likely to scare off any of Kaggle's die-hard
contestants. "Competitions are a platform for data scientists to test the robustness of their
algorithms and theories," says Ming-Hen Tsai, a Kaggle contestant and former undergrad at
National Taiwan University. "The emergence of data mining competitions such as those run by
Kaggle helps data scientists share their knowledge more openly and effectively," says Ming-Hen.
And the availability of open-source packages and algorithms from the worldwide R community
makes it easier to see beneath the hood of highly complex analytic processes, he notes.
"Revealing the methods for solving analytic problems is important for advancing the data mining
community," says Ming-Hen. "Although many experiments result in papers, there aren't enough
comprehensive studies of the statistical methods used. Many papers just report results on specific
sets of data."
The lack of transparency makes it more difficult to recreate experiments, which in turn slows the
advancement of data mining as a science. "I think we should have more open-source software
implementing state-of-the-art algorithms," says Ming-Hen. "Let the performance speak for itself,
and the world can judge which is best."
Kaggle recently ran a competition to create the best recommendation system for R packages.
When he prepared the data for the contest, Howard used a variety of statistical methods
including programs written in R.

Now, Kaggle contestants have an additional weapon to deploy in their quest for victory. Kaggle
has partnered with Revolution Analytics to provide Revolution R Enterprise free of charge to
Kaggle competitors for use in the competitions. Because this enhanced distribution of R scales to
the Big Data problems that are now part of many Kaggle contests, competitors can use R on large
datasets, and competition sponsors can implement the resulting models in production
settings thanks to the commercial-grade enhancements of Revolution R Enterprise.
Reflecting the field of statistics itself, Kaggle has generated unexpected results. It has become an
informal recruiting ground for companies looking for the best and brightest data analysts.
"When you're talking to a prospective employer, you can say, 'Look at my Kaggle profile.' It's by
far the best reputation tool in data science," says Howard. "People who are successful in Kaggle
competitions can work anywhere they want to in this field."

Revolution R Enterprise: Available Free to Kaggle Competitors


Through a new partnership with Revolution Analytics, participants in current Kaggle competitions
can now download and use a FREE, full-featured version of Revolution R Enterprise software to
create their submissions. Built upon the powerful open source R language, this advanced
analytics software brings higher performance, 'Big Data' scalability, and greater productivity to
R, at a fraction of the cost of traditional statistics products.
Kaggle competitors can download Revolution R Enterprise by registering at
http://info.revolutionanalytics.com/Kaggle.html

About David Smith


David is the Vice President of Marketing at Revolution Analytics, the leading commercial provider
of software and support for the open source R statistical computing language. David is the
co-author, with Bill Venables, of the official R manual An Introduction to R. He is also the editor of
Revolutions (http://blog.revolutionanalytics.com), the leading blog focused on the R language, and
one of the originating developers of ESS: Emacs Speaks Statistics. You can follow David on Twitter
as @revodavid.

About Revolution Analytics


Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led
by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high
performance, productivity, and enterprise readiness to open source R, the most powerful
statistics language in the world.
In the last 10 years, R has exploded in popularity and functionality and has emerged as the data
scientist's tool of choice. Today R is used by over 2 million analysts worldwide in academia and at
cutting-edge analytics-driven companies such as Google, Facebook, and LinkedIn. To equip R for
the demands and requirements of all business environments, Revolution R Enterprise builds on
open source R with innovations in big data analysis, integration and user experience.
The company's flagship Revolution R product is available both as a workstation and server-based
offering.
Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs
of large organizations such as Merck, Bank of America and Mu Sigma, while Revolution R
Workstation offers productivity and development tools for individuals and small teams that need
to build applications and analyze data.
Revolution Analytics is committed to fostering the growth of the R community. The company
sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses
of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation
of data scientists. Revolution Analytics is headquartered in Palo Alto, Calif. and backed by North
Bridge Venture Partners and Intel Capital.
Please visit us at www.revolutionanalytics.com
