Big Data Now

O'Reilly Media
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Big Data Now
by O'Reilly Media
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol,
CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles
(http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Printing History:
September 2011: First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are
registered trademarks of O'Reilly Media, Inc. Big Data Now and related trade
dress are trademarks of O'Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Where those designations appear in this book,
and O'Reilly Media, Inc., was aware of a trademark claim, the designations have
been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher
and authors assume no responsibility for errors or omissions, or for damages
resulting from the use of the information contained herein.
ISBN: 978-1-449-31518-4
1316111277
Table of Contents
Foreword
1. Data Science and Data Tools
  What is data science?
    What is data science?
    Where data comes from
    Working with data at scale
    Making data tell its story
    Data scientists
  The SMAQ stack for big data
    MapReduce
    Storage
    Query
    Conclusion
  Scraping, cleaning, and selling big data
  Data hand tools
  Hadoop: What it is, how it works, and what it can do
  Four free data tools for journalists (and snoops)
    WHOIS
    Blekko
    bit.ly
    Compete
  The quiet rise of machine learning
  Where the semantic web stumbled, linked data will succeed
  Social data is an oracle waiting for a question
  The challenges of streaming real-time data
2. Data Issues
  Why the term "data science" is flawed but useful
    It's not a real science
    It's an unnecessary label
    The name doesn't even make sense
    There's no definition
    Time for the community to rally
  Why you can't really anonymize your data
    Keep the anonymization
    Acknowledge there's a risk of de-anonymization
    Limit the detail
    Learn from the experts
  Big data and the semantic web
    Google and the semantic web
    Metadata is hard: big data can help
  Big data: Global good or zero-sum arms race?
  The truth about data: Once it's out there, it's hard to control
3. The Application of Data: Products and Processes
  How the Library of Congress is building the Twitter archive
  Data journalism, data tools, and the newsroom stack
    Data journalism and data tools
    The newsroom stack
    Bridging the data divide
  The data analysis path is built on curiosity, followed by action
  How data and analytics can improve education
  Data science is a pipeline between academic disciplines
  Big data and open source unlock genetic secrets
  Visualization deconstructed: Mapping Facebook's friendships
    Mapping Facebook's friendships
    Static requires storytelling
  Data science democratized
4. The Business of Data
  There's no such thing as big data
  Big data and the innovator's dilemma
  Building data startups: Fast, big, and focused
    Setting the stage: The attack of the exponentials
    Leveraging the big data stack
    Fast data
    Big analytics
    Focused services
    Democratizing big data
  Data markets aren't coming: They're already here
    An iTunes model for data
  Data is a currency
  Big data: An opportunity in search of a metaphor
  Data and the human-machine connection
Foreword
This collection represents the full spectrum of data-related content we've
published on O'Reilly Radar over the last year. Mike Loukides kicked things off
in June 2010 with "What is data science?" and from there we've pursued the
various threads and themes that naturally emerged. Now, roughly a year later,
we can look back over all we've covered and identify a number of core data
areas:
Chapter 1: The tools and technologies that drive data science are of course
essential to this space, but the varied techniques being applied are also key to
understanding the big data arena.
Chapter 2: The opportunities and ambiguities of the data space are evident
in discussions around privacy, the implications of data-centric industries, and
the debate about the phrase "data science" itself.
Chapter 3: A "data product" can emerge from virtually any domain, including
everything from data startups to established enterprises to media/journalism
to education and research.
Chapter 4: Take a closer look at the actions connected to data (the finding,
organizing, and analyzing) that provide organizations of all sizes with the
information they need to compete.
To be clear: This is the story up to this point. In the weeks and months ahead
we'll certainly see important shifts in the data landscape. We'll continue to
chronicle this space through ongoing Radar coverage and our series of online
and in-person Strata events. We hope you'll join us.
Mac Slocum
Managing Editor, O'Reilly Radar
CHAPTER 1
Data Science and Data Tools
What is data science?
Analysis: The future belongs to the companies and people that turn data
into products.
by Mike Loukides
Report sections
What is data science?
Where data comes from
Working with data at scale
Making data tell its story
Data scientists
We've all heard it: according to Hal Varian, statistics is the next sexy job. Five
years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel
Inside." But what does that statement mean? Why do we suddenly care about
statistics and about data?
In this post, I examine the many sides of data science: the technologies, the
companies and the unique skill sets.
What is data science?
The web is full of "data-driven apps." Almost any e-commerce application is
a data-driven application. There's a database behind a web front end, and
middleware that talks to a number of other databases and data services (credit
card processing companies, banks, and so on). But merely using data isn't
really what we mean by "data science." A data application acquires its value
from the data itself, and creates more data as a result. It's not just an application
with data; it's a data product. Data science enables the creation of data
products.
One of the earlier data products on the Web was the CDDB database. The
developers of CDDB realized that any CD had a unique signature, based on
the exact length (in samples) of each track on the CD. Gracenote built a
database of track lengths, and coupled it to a database of album metadata (track
titles, artists, album titles). If you've ever used iTunes to rip a CD, you've taken
advantage of this database. Before it does anything else, iTunes reads the length
of every track, sends it to CDDB, and gets back the track titles. If you have a
CD that's not in the database (including a CD you've made yourself), you can
create an entry for an unknown album. While this sounds simple enough, it's
revolutionary: CDDB views music as data, not as audio, and creates new value
in doing so. Their business is fundamentally different from selling music,
sharing music, or analyzing musical tastes (though these can also be "data
products"). CDDB arises entirely from viewing a musical problem as a data
problem.
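
To make the "signature" idea concrete, here is a minimal sketch in Python. It is not the actual CDDB disc ID algorithm: hashing the track lengths with SHA-1, the in-memory `album_db` dictionary, and the function names are all assumptions made for illustration.

```python
import hashlib

# Hypothetical in-memory store: signature -> album metadata.
album_db = {}

def disc_signature(track_lengths):
    """Derive a lookup key from the exact length (in samples) of each track."""
    joined = ",".join(str(n) for n in track_lengths)
    return hashlib.sha1(joined.encode("ascii")).hexdigest()

def register_album(track_lengths, artist, album, titles):
    """Create an entry for a disc that is not yet in the database."""
    album_db[disc_signature(track_lengths)] = {
        "artist": artist,
        "album": album,
        "titles": list(titles),
    }

def lookup_album(track_lengths):
    """Return the stored metadata, or None for an unknown disc."""
    return album_db.get(disc_signature(track_lengths))

# Made-up track lengths and titles.
register_album([10584000, 9261000, 12127500],
               "Example Artist", "Example Album", ["One", "Two", "Three"])
```

The point of the sketch is that nothing in the lookup touches the audio: the key is built from track lengths alone.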
Google is a master at creating data products. Here are a few examples:
• Google's breakthrough was realizing that a search engine could use input
other than the text on the page. Google's PageRank algorithm was among
the first to use data outside of the page itself, in particular, the number of
links pointing to a page. Tracking links made Google searches much more
useful, and PageRank has been a key ingredient to the company's success.
• Spell checking isn't a terribly difficult problem, but by suggesting
corrections to misspelled searches, and observing what the user clicks in
response, Google made it much more accurate. They've built a dictionary
of common misspellings, their corrections, and the contexts in which they
occur.
• Speech recognition has always been a hard problem, and it remains
difficult. But Google has made huge strides by using the voice data they've
collected, and has been able to integrate voice search into their core search
engine.
• During the Swine Flu epidemic of 2009, Google was able to track the
progress of the epidemic by following searches for flu-related topics.
Flu trends
Google was able to spot trends in the Swine Flu epidemic roughly two weeks
before the Center for Disease Control by analyzing searches that people were
making in different regions of the country.
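
The first example, ranking pages by the links that point to them, can be sketched in a few lines of Python. This is the textbook power-iteration form of PageRank, not Google's production system; the damping factor of 0.85, the iteration count, and the toy link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In the toy graph, "c" ends up with the highest rank because it has the most incoming links, which is the signal the text describes: data from outside the page itself.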
Google isn't the only company that knows how to use data. Facebook and
LinkedIn use patterns of friendship relationships to suggest other people you
may know, or should know, with sometimes frightening accuracy. Amazon
saves your searches, correlates what you search for with what other users
search for, and uses it to create surprisingly appropriate recommendations.
These recommendations are "data products" that help to drive Amazon's more
traditional retail business. They come about because Amazon understands that
a book isn't just a book, a camera isn't just a camera, and a customer isn't just
a customer; customers generate a trail of "data exhaust" that can be mined
and put to use, and a camera is a cloud of data that can be correlated with the
customers' behavior, the data they leave every time they visit the site.
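
One simple way to turn that "data exhaust" into recommendations is to count which items show up together in users' histories. The sketch below is an illustration of that idea only, not a description of Amazon's system; the session data and function names are invented.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(sessions):
    """Count how often two items appear in the same user's history."""
    counts = {}
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts

def recommend(item, counts, k=3):
    """Return up to k items most often seen alongside the given item."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]

# Made-up search or purchase histories, one list per user.
sessions = [
    ["camera", "tripod", "memory card"],
    ["camera", "memory card"],
    ["camera", "tripod", "camera bag"],
    ["novel", "bookmark"],
]
counts = co_occurrence(sessions)
suggestions = recommend("camera", counts, k=2)
```

Every new session changes the counts, which is the feedback loop in miniature: users improve the product simply by using it.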
The thread that ties most of these applications together is that data collected
from users provides added value. Whether that data is search terms, voice
samples, or product reviews, the users are in a feedback loop in which they
contribute to the products they use. That's the beginning of data science.
In the last few years, there has been an explosion in the amount of data that's
available. Whether we're talking about web server logs, tweet streams, online
transaction records, "citizen science," data from sensors, government data, or
some other source, the problem isn't finding data, it's figuring out what to do
with it. And it's not just companies using their own data, or the data
contributed by their users. It's increasingly common to mashup data from a number
of sources. "Data Mashups in R" analyzes mortgage foreclosures in Philadelphia
County by taking a public report from the county sheriff's office, extracting
addresses and using Yahoo to convert the addresses to latitude and longitude,
then using the geographical data to place the foreclosures on a map
(another data source), and group them by neighborhood, valuation,
neighborhood per-capita income, and other socio-economic factors.
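
The shape of such a mashup is a join followed by a grouping. The sketch below shows that shape in Python rather than R, and everything in it is made up: the addresses, the valuations, the coordinates, and the `geocoded` dictionary that stands in for a geocoding service.

```python
from collections import defaultdict

# Made-up records standing in for the public report.
foreclosures = [
    {"address": "100 Example St", "valuation": 85000},
    {"address": "200 Sample Ave", "valuation": 120000},
    {"address": "300 Placeholder Rd", "valuation": 95000},
]

# Made-up geocoder output: address -> coordinates and neighborhood.
geocoded = {
    "100 Example St": {"lat": 39.95, "lon": -75.16, "neighborhood": "Northside"},
    "200 Sample Ave": {"lat": 39.96, "lon": -75.17, "neighborhood": "Northside"},
    "300 Placeholder Rd": {"lat": 39.93, "lon": -75.15, "neighborhood": "Southside"},
}

def mash_up(records, locations):
    """Join each record to its location, then group by neighborhood."""
    grouped = defaultdict(list)
    for record in records:
        location = locations.get(record["address"])
        key = location["neighborhood"] if location else "(not geocoded)"
        grouped[key].append({**record, **(location or {})})
    return dict(grouped)

def summarize(grouped):
    """Count and mean valuation per neighborhood."""
    return {
        name: {
            "count": len(rows),
            "mean_valuation": sum(r["valuation"] for r in rows) / len(rows),
        }
        for name, rows in grouped.items()
    }

summary = summarize(mash_up(foreclosures, geocoded))
```

Addresses the geocoder cannot resolve are kept under their own key rather than dropped, so the gaps in the join stay visible.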
The question facing every company today, every startup, every non-profit,
every project site that wants to attract a community, is how to use data effectively:
not just their own data, but all the data that's available and relevant. Using
data effectively requires something different from traditional statistics, where
actuaries in business suits perform arcane but fairly well-defined kinds of
analysis. What differentiates data science from statistics is that data science is a
holistic approach. We're increasingly finding data in the wild, and data
scientists are involved with gathering data, massaging it into a tractable form,
making it tell its story, and presenting that story to others.
To get a sense for what skills are required, let's look at the data lifecycle: where
it comes from, how you use it, and where it goes.
Where data comes from
Data is everywhere: your government, your web server, your business partners,
even your body. While we aren't drowning in a sea of data, we're finding that
almost everything can be (or has been) instrumented. At O'Reilly, we frequently
combine publishing industry data from Nielsen BookScan with our own sales
data, publicly available Amazon data, and even job data to see what's
happening in the publishing industry. Sites like Infochimps and Factual provide
access to many large datasets, including climate data, MySpace activity
streams, and game logs from sporting events. Factual enlists users to update
and improve its datasets, which cover topics as diverse as endocrinologists and
hiking trails.
Much of the data we currently work with is the direct consequence of Web
2.0, and of Moore's Law applied to data. The web has people spending more
time online, and leaving a trail of data wherever they go. Mobile applications
leave an even richer data trail, since many of them are annotated with
geolocation, or involve video or audio, all of which can be mined. Point-of-sale
devices and frequent-shopper's cards make it possible to capture all of your
retail transactions, not just the ones you make online. All of this data would
be useless if we couldn't store it, and that's where Moore's Law comes in. Since
the early '80s, processor speed has increased from 10 MHz to 3.6 GHz, a
360-fold increase (not counting increases in word length and number of cores).
But we've seen much bigger increases in storage capacity, on every level. RAM
has moved from $1,000/MB to roughly $25/GB, a price reduction by a factor
of about 40,000, to say nothing of the reduction in size and increase in speed.
Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250
pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card
weighs about half a gram. Whether you look at bits per gram, bits per dollar,
or raw capacity, storage has more than kept pace with the increase of CPU
speed.
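
The two ratios quoted above are easy to check. The only assumption in this short Python calculation is the conversion of 1 GB to 1,000 MB; using 1,024 MB instead nudges the RAM figure to about 41,000.

```python
# Processor clock speed: 10 MHz in the early '80s versus 3.6 GHz.
cpu_early_hz = 10e6
cpu_now_hz = 3.6e9
cpu_factor = cpu_now_hz / cpu_early_hz

# RAM price: $1,000 per MB versus roughly $25 per GB.
mb_per_gb = 1000                      # assumption: decimal gigabytes
ram_early_per_gb = 1000 * mb_per_gb   # dollars per GB at $1,000/MB
ram_now_per_gb = 25
ram_factor = ram_early_per_gb / ram_now_per_gb

print(f"CPU speed-up: {cpu_factor:,.0f}x")
print(f"RAM price reduction: {ram_factor:,.0f}x")
```

The gap between the two factors, two orders of magnitude, is the sense in which storage has more than kept pace with CPU speed.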
1956 disk drive
One of the first commercial disk drives from IBM. It has a 5 MB capacity and
it's stored in a cabinet roughly the size of a luxury refrigerator. In contrast, a 32
GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram.
Photo: Mike Loukides. Disk drive on display at IBM Almaden Research
The importance of Moore's law as applied to data isn't just geek pyrotechnics.
Data expands to fill the space you have to store it. The more storage is available,
the more data you will find to put into it. The data exhaust you leave behind
whenever you surf the web, friend someone on Facebook, or make a purchase
in your local supermarket, is all carefully collected and analyzed. Increased
storage capacity demands increased sophistication in the analysis and use of
that data. That's the foundation of data science.
So, how do we make that data useful? The first step of any data analysis project
is "data conditioning," or getting data into a state where it's usable. We are
seeing more data in formats that are easier to consume: Atom data feeds, web
services, microformats, and other newer technologies provide data in formats
that are directly machine-consumable. But old-style screen scraping hasn't died,
and isn't going to die. Many sources of "wild data" are extremely messy. They
aren't well-behaved XML files with all the metadata nicely in place. The
foreclosure data used in "Data Mashups in R" was posted on a public website by
the Philadelphia county sheriff's office. This data was presented as an HTML
file that was probably generated automatically from a spreadsheet. If you've
ever seen the HTML that's generated by Excel, you know that's going to be
fun to process.
Data conditioning can involve cleaning up messy HTML with tools like
Beautiful Soup, natural language processing to parse plain text in English and other
languages, or even getting humans to do the dirty work. You're likely to be
dealing with an array of data sources, all in different forms. It would be nice if
there was a standard set of tools to do the job, but there isn't. To do data
conditioning, you have to be ready for whatever comes, and be willing to use
anything from ancient Unix utilities such as awk to XML parsers and machine
learning libraries. Scripting languages, such as Perl and Python, are essential.
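
As a small taste of screen scraping, the sketch below pulls table rows out of spreadsheet-style HTML using only Python's standard-library `html.parser`. The `TableExtractor` class and the sample markup are invented for illustration; a more forgiving parser such as Beautiful Soup is usually the better choice for badly broken pages.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every table cell, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace left over from the markup.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td>
            if self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

# Made-up markup in the style a spreadsheet export might produce.
messy_html = """
<table><tr><td class=xl24><font size=2>100 Example St </font></td>
<td  class=xl25>  $85,000</td></tr>
<TR><TD>200 Sample Ave</TD><TD><b>$120,000</b></TD></TR></table>
"""

parser = TableExtractor()
parser.feed(messy_html)
parser.close()
rows = parser.rows
```

Most of the code deals with markup that is not quite right (unclosed cells, stray whitespace, inconsistent case), which is typical of data conditioning work.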
Once you've parsed the data, you can start thinking about the quality of your
data. Data is frequently missing or incongruous. If data is missing, do you
simply ignore the missing points? That isn't always possible. If data is
incongruous, do you decide that something is wrong with badly behaved data (after
all, equipment fails), or that the incongruous data is telling its own story, which
may be more interesting? It's reported that the discovery of ozone layer
depletion was delayed because automated data collection tools discarded
readings that were too low.* In data science, what you have is frequently all you're
going to get. It's usually impossible to get "better" data, and you have no
alternative but to work with the data at hand.
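
One way to avoid silently losing the interesting readings is to flag anomalies instead of discarding them. The sketch below uses a common robust rule, the median absolute deviation with a cutoff of 3.5; that rule, the cutoff, and the sample readings are my assumptions, not anything taken from the ozone story.

```python
from statistics import median

def flag_anomalies(readings, threshold=3.5):
    """Return (value, is_anomalous) pairs; no reading is thrown away."""
    center = median(readings)
    mad = median(abs(x - center) for x in readings)
    flagged = []
    for x in readings:
        # 0.6745 rescales the deviation so the score is comparable to a z-score.
        # If mad is 0 (most readings identical), nothing is flagged.
        score = 0.6745 * abs(x - center) / mad if mad else 0.0
        flagged.append((x, score > threshold))
    return flagged

# Made-up instrument readings with one suspiciously low value.
readings = [310, 305, 298, 312, 301, 180, 307]
flagged = flag_anomalies(readings)
suspicious = [value for value, is_odd in flagged if is_odd]
```

The flagged value stays in the dataset, so a person can still decide whether it reflects failed equipment or a real effect.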
If the problem involves human language, understanding the data adds another
dimension to the problem. Roger Magoulas, who runs the data analysis group
at O'Reilly, was recently searching a database for Apple job listings requiring
geolocation skills. While that sounds like a simple task, the trick was
disambiguating "Apple" from many job postings in the growing Apple industry. To
do it well you need to understand the grammatical structure of a job posting;
you need to be able to parse the English. And that problem is showing up more
and more frequently. Try using Google Trends to figure out what's happening
with the Cassandra database or the Python language, and you'll get a sense of
the problem. Google has indexed many, many websites about large snakes.
Disambiguation is never an easy task, but tools like the Natural Language
Toolkit library can make it simpler.
* The NASA article denies this, but also says that in 1984, they decided that the low values (which
went back to the 70s) were "real." Whether humans or software decided to ignore anomalous
data, it appears that data was ignored.
When natural language processing fails, you can replace artificial intelligence
with human intelligence. That's where services like Amazon's Mechanical
Turk come in. If you can split your task up into a large number of subtasks
that are easily described, you can use Mechanical Turk's marketplace for cheap
labor. For example, if you're looking at job listings, and want to know which
originated with Apple, you can have real people do the classification for
roughly $0.01 each. If you have already reduced the set to 10,000 postings with
the word "Apple," paying humans $0.01 to classify them only costs $100.
Working with data at scale
We've all heard a lot about "big data," but "big" is really a red herring. Oil
companies, telecommunications companies, and other data-centric industries
have had huge datasets for a long time. And as storage capacity continues to
expand, today's "big" is certainly tomorrow's "medium" and next week's
"small." The most meaningful definition I've heard: "big data" is when the size
of the data itself becomes part of the problem. We're discussing data problems
ranging from gigabytes to petabytes of data. At some point, traditional
techniques for working with data run out of steam.
What are we trying to do with data that's different? According to Jeff
Hammerbacher† (@hackingdata), we're trying to build information platforms or
dataspaces. Information platforms are similar to traditional data warehouses,
but different. They expose rich APIs, and are designed for exploring and
understanding the data rather than for traditional analysis and reporting. They
accept all data formats, including the most messy, and their schemas evolve
as the understanding of the data changes.
Most of the organizations that have built data platforms have found it
necessary to go beyond the relational database model. Traditional relational
database systems stop being effective at this scale. Managing sharding and
replication across a horde of database servers is difficult and slow. The need to
define a schema in advance conflicts with the reality of multiple, unstructured data
sources, in which you may not know what's important until after you've
analyzed the data. Relational databases are designed for consistency, to support
complex transactions that can easily be rolled back if any one of a complex set
of operations fails. While rock-solid consistency is crucial to many
applications, it's not really necessary for the kind of analysis we're discussing here.
Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has
an allure, but in most data-driven applications outside of finance, that allure
is deceptive. Most data analysis is comparative: if you're asking whether sales
to Northern Europe are increasing faster than sales to Southern Europe, you
aren't concerned about the difference between 5.92 percent annual growth
and 5.93 percent.
† "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data)
To store huge datasets effectively, we've seen a new breed of databases appear.
These are frequently called NoSQL databases, or Non-Relational databases,
though neither term is very useful. They group together fundamentally
dissimilar products by telling you what they aren't. Many of these databases are
the logical descendants of Google's BigTable and Amazon's Dynamo, and are
designed to be distributed across many nodes, to provide "eventual
consistency" but not absolute consistency, and to have very flexible schema. While
there are two dozen or so products available (almost all of them open source),
a few leaders have established themselves:
• Cassandra: Developed at Facebook, in production use at Twitter,
Rackspace, Reddit, and other large sites. Cassandra is designed for high
performance, reliability, and automatic replication. It has a very flexible data
model. A new startup, Riptano, provides commercial support.
• HBase: Part of the Apache Hadoop project, and modelled on Google's
BigTable. Suitable for extremely large databases (billions of rows, millions
of columns), distributed across thousands of nodes. Along with Hadoop,
commercial support is provided by Cloudera.
Storing data is only part of building a data platform, though. Data is only useful
if you can do something with it, and enormous datasets present computational
problems. Google popularized the MapReduce approach, which is basically a
divide-and-conquer strategy for distributing an extremely large problem across
an extremely large computing cluster. In the "map" stage, a programming task
is divided into a number of identical subtasks, which are then distributed
across many processors; the intermediate results are then combined by a single
reduce task. In hindsight, MapReduce seems like an obvious solution to
Google's biggest problem, creating large searches. It's easy to distribute a search
across thousands of processors, and then combine the results into a single set
of answers. What's less obvious is that MapReduce has proven to be widely
applicable to many large data problems, ranging from search to machine
learning.
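
The map and reduce stages can be sketched in a single Python process. Word counting is the customary toy example rather than anything from Google's systems, and the function names are my own. In a real cluster the map calls would run on many machines, and the grouping step (often called the shuffle) would move data between them.

```python
from collections import defaultdict

def map_phase(document):
    """Map: turn one document into (word, 1) pairs, independently of all others."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the intermediate values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values seen for one key."""
    return key, sum(values)

def word_count(documents):
    # Each map_phase call touches only its own document, so the calls
    # could be distributed across processors without coordination.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

totals = word_count(["big data", "Big ideas about data"])
```

Because the subtasks are identical and independent, adding processors speeds up the map stage almost linearly; only the combining step has to see all the intermediate results.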
The most popular open source implementation of MapReduce is the Hadoop
project. Yahoo's claim that they had built the world's largest production
Hadoop application, with 10,000 cores running Linux, brought it onto center
stage. Many of the key Hadoop developers have found a home at Cloudera,
which provides commercial support. Amazon's Elastic MapReduce makes it
much easier to put Hadoop to work without investing in racks of Linux
machines, by providing preconfigured Hadoop images for its EC2 clusters. You
can allocate and de-allocate processors as needed, paying only for the time you
use them.
Hadoop goes far beyond a simple MapReduce implementation (of which there
are several); it's the key component of a data platform. It incorporates
HDFS, a distributed filesystem designed for the performance and reliability
requirements of huge datasets; the HBase database; Hive, which lets
developers explore Hadoop datasets using SQL-like queries; a high-level dataflow
language called Pig; and other components. If anything can be called a one-stop
information platform, Hadoop is it.
Hadoop has been instrumental in enabling "agile" data analysis. In software
development, "agile practices" are associated with faster product cycles, closer
interaction between developers and consumers, and testing. Traditional data
analysis has been hampered by extremely long turn-around times. If you start
a calculation, it might not finish for hours, or even days. But Hadoop (and
particularly Elastic MapReduce) makes it easy to build clusters that can perform
computations on long datasets quickly. Faster computations make it easier to
test different assumptions, different datasets, and different algorithms. It's
easier to consult with clients to figure out whether you're asking the right
questions, and it's possible to pursue intriguing possibilities that you'd
otherwise have to drop for lack of time.
Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is
an experimental project that enables stream processing. HOP processes
data as it arrives, and delivers intermediate results in (near) real-time. Near
real-time data analysis enables features like trending topics on sites like
Twitter. These features only require soft real-time; reports on trending topics don't
require millisecond accuracy. As with the number of followers on Twitter, a
"trending topics" report only needs to be current to within five minutes, or
even an hour. According to Hilary Mason (@hmason), data scientist at
bit.ly, it's possible to precompute much of the calculation, then use one of the
experiments in real-time MapReduce to get presentable results.
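
Soft real-time is easy to illustrate on a small scale. The sketch below keeps topic counts over a sliding five-minute window in memory; it has nothing to do with how HOP, Twitter, or bit.ly work internally. The class name and the window length are assumptions, and it expects events to arrive in time order.

```python
from collections import Counter, deque

class TrendingTopics:
    """Count topics seen during the most recent window of time."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()    # (timestamp, topic) pairs, oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        """Record one mention; timestamps must not go backwards."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop mentions that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, now, k=3):
        """Return up to k topics with the most mentions in the window."""
        self._expire(now)
        return [topic for topic, _ in self.counts.most_common(k)]

trends = TrendingTopics(window_seconds=300)
trends.add(0, "hadoop")
trends.add(10, "strata")
trends.add(20, "strata")
```

A report built this way is never more than one window out of date, which is all a trending-topics feature needs.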
Machine learning is another essential tool for the data scientist. We now expect
web and mobile applications to incorporate recommendation engines, and
building a recommendation engine is a quintessential artificial intelligence
problem. You don't have to look at many modern web applications to see
classification, error detection, image matching (behind Google Goggles and
SnapTell) and even face detection: an ill-advised mobile application lets you
take someone's picture with a cell phone, and look up that person's identity
using photos available online. Andrew Ng's Machine Learning course is one
of the most popular courses in computer science at Stanford, with hundreds
of students (this video is highly recommended).
There are many libraries available for machine learning: PyBrain in Python,
Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just
announced their Prediction API, which exposes their machine learning
algorithms for public use via a RESTful interface. For computer vision, the
OpenCV library is a de-facto standard.
Mechanical Turk is also an important part of the toolbox. Machine learning
almost always requires a "training set," or a significant body of known data
with which to develop and tune the application. The Turk is an excellent way
to develop training sets. Once you've collected your training data (perhaps a
large collection of public photos from Twitter), you can have humans classify
them inexpensively, possibly sorting them into categories, possibly drawing
circles around faces, cars, or whatever interests you. It's an excellent way to
classify a few thousand data points at a cost of a few cents each. Even a
relatively large job only costs a few hundred dollars.
While I haven't stressed traditional statistics, building statistical models plays
an important role in any data analysis. According to Mike Driscoll
(@dataspora), statistics is the "grammar of data science." It is crucial to "making data
speak coherently." We've all heard the joke that eating pickles causes death,
because everyone who dies has eaten pickles. That joke doesn't work if you
understand what correlation means. More to the point, it's easy to notice that
one advertisement for R in a Nutshell generated 2 percent more conversions
than another. But it takes statistics to know whether this difference is
significant, or just a random fluctuation. Data science isn't just about the existence
of data, or making guesses about what that data might mean; it's about testing
hypotheses and making sure that the conclusions you're drawing from the data
are valid. Statistics plays a role in everything from traditional business
intelligence (BI) to understanding how Google's ad auctions work. Statistics has
become a basic skill. It isn't superseded by newer techniques from machine
learning and other disciplines; it complements them.
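
For the advertisement example, one standard way to ask whether a difference is more than a random fluctuation is a two-proportion z-test. The sketch below uses only the Python standard library. The conversion counts are made up, and reading "2 percent more" as two percentage points is my assumption.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Assumes the pooled data contains at least one conversion and at least
    one non-conversion; otherwise the standard error is zero.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)

# Made-up numbers: 12% versus 10% conversion at two different sample sizes.
z_small, p_small = two_proportion_z_test(120, 1000, 100, 1000)
z_large, p_large = two_proportion_z_test(1200, 10000, 1000, 10000)
```

With a thousand impressions per ad, the two-point gap is not significant at the conventional 0.05 level; with ten thousand per ad, the same gap is. The observed difference is identical in both cases, and only the statistics can tell them apart.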
While there are many commercial statistical packages, the open source R
language (and its comprehensive package library, CRAN) is an essential tool.
Although R is an odd and quirky language, particularly to someone with a
background in computer science, it comes close to providing "one stop
shopping" for most statistical work. It has excellent graphics facilities; CRAN
includes parsers for many kinds of data; and newer extensions extend R into
distributed computing. If there's a single tool that provides an end-to-end
solution for statistics work, R is it.
Making data tell its story
A picture may or may not be worth a thousand words, but a picture is certainly
worth a thousand numbers. The problem with most data analysis algorithms
is that they generate a set of numbers. To understand what the numbers mean,
the stories they are really telling, you need to generate a graph. Edward Tufte's
Visual Display of Quantitative Information is the classic for data visualization,
and a foundational text for anyone practicing data science. But that's not really
what concerns us here. Visualization is crucial to each stage of the data
scientist's work. According to Martin Wattenberg (@wattenberg, founder of Flowing
Media), visualization is key to data conditioning: if you want to find out just how
bad your data is, try plotting it. Visualization is also frequently the first step in
analysis. Hilary Mason says that when she gets a new data set, she starts by
making a dozen or more scatter plots, trying to get a sense of what might be
interesting. Once you've gotten some hints at what the data might be saying,
you can follow it up with more detailed analysis.
There are many packages for plotting and presenting data. GnuPlot is very
effective; R incorporates a fairly comprehensive graphics package; Casey Reas'
and Ben Fry's Processing is the state of the art, particularly if you need to create
animations that show how things change over time. At IBM's Many Eyes, many
of the visualizations are full-fledged interactive applications.
Nathan Yau's FlowingData blog is a great place to look for creative
visualizations. One of my favorites is this animation of the growth of Walmart over
time. And this is one place where "art" comes in: not just the aesthetics of the
visualization itself, but how you understand it. Does it look like the spread of
cancer throughout a body? Or the spread of a flu virus through a population?
Making data tell its story isn't just a matter of presenting results; it involves
making connections, then going back to other data sources to verify them.
Does a successful retail chain spread like an epidemic, and if so, does that give
us new insights into how economies work? That's not a question we could
even have asked a few years ago. There was insufficient computing power, the
data was all locked up in proprietary sources, and the tools for working with
the data were insufficient. It's the kind of question we now ask routinely.
Data scientists
Data science requires skills ranging from traditional computer science to
mathematics to art. Describing the data science group he put together at
Facebook (possibly the first data science group at a consumer-oriented web
property), Jeff Hammerbacher said:
    ... on any given day, a team member could author a multistage processing
    pipeline in Python, design a hypothesis test, perform a regression analysis
    over data samples with R, design and implement an algorithm for some
    data-intensive product or service in Hadoop, or communicate the results of
    our analyses to other members of the organization*
Where do you find the people this versatile? According to DJ Patil, chief
scientist at LinkedIn (@dpatil), the best data scientists tend to be "hard
scientists," particularly physicists, rather than computer science majors.
Physicists have a strong mathematical background, computing skills, and come
from a discipline in which survival depends on getting the most from the data.
They have to think about the big picture, the big problem. When you've just spent
a lot of grant money generating data, you can't just throw the data out if it isn't
as clean as you'd like. You have to make it tell its story. You need some
creativity for when the story the data is telling isn't what you think it's telling.
Scientists also know how to break large problems up into smaller problems.
Patil described the process of creating the group recommendation feature at
LinkedIn. It would have been easy to turn this into a high-ceremony
development project that would take thousands of hours of developer time, plus
thousands of hours of computing time to do massive correlations across
LinkedIn's membership. But the process worked quite differently: it started out
with a relatively small, simple program that looked at members' profiles and made
recommendations accordingly. Asking things like, did you go to Cornell? Then
you might like to join the Cornell Alumni group. It then branched out
incrementally. In addition to looking at profiles, LinkedIn's data scientists
started looking at events that members attended. Then at books members had in
their libraries. The result was a valuable data product that analyzed a huge
database, but it was never conceived as such. It started small, and added value
iteratively. It was an agile, flexible process that built toward its goal
incrementally, rather than tackling a huge mountain of data all at once.
This is the heart of what Patil calls "data jiujitsu": using smaller auxiliary
problems to solve a large, difficult problem that appears intractable. CDDB is
a great example of data jiujitsu: identifying music by analyzing an audio stream
directly is a very difficult problem (though not unsolvable; see midomi, for
example). But the CDDB staff used data creatively to solve a much more
tractable problem that gave them the same result. Computing a signature based on
track lengths, and then looking up that signature in a database, is trivially
simple.
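The signature-lookup idea can be sketched in a few lines of Python. This is
only an illustration of the technique, not CDDB's actual disc ID algorithm: the
hashing scheme, the names disc_signature and identify, and the sample track
lengths and album titles are all assumptions made up for the example.

```python
import hashlib

def disc_signature(track_lengths_seconds):
    """Build a short signature from the ordered track lengths (hypothetical scheme)."""
    # Track order is part of what distinguishes one disc from another,
    # so the lengths are joined in order rather than sorted.
    canonical = ",".join(str(int(n)) for n in track_lengths_seconds)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()[:16]

# A toy "database" mapping signatures to album titles (made-up data).
KNOWN_DISCS = {
    disc_signature([215, 187, 242, 301]): "Example Album A",
    disc_signature([198, 260, 174]): "Example Album B",
}

def identify(track_lengths_seconds):
    """Return the album title for a disc, or None if the signature is unknown."""
    return KNOWN_DISCS.get(disc_signature(track_lengths_seconds))

print(identify([215, 187, 242, 301]))
print(identify([100, 100]))
```

The hard problem (recognizing audio) is never attempted; the lookup reduces to
one hash computation and one dictionary access.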
* "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data)
Hiring trends for data science
It's not easy to get a handle on jobs in data science. However, data from O'Reilly
Research shows a steady year-over-year increase in Hadoop and Cassandra job
listings, which are good proxies for the "data science" market as a whole. This
graph shows the increase in Cassandra jobs, and the companies listing Cassandra
positions, over time.
Entrepreneurship is another piece of the puzzle. Patil's first flippant answer to
"what kind of person are you looking for when you hire a data scientist?" was
"someone you would start a company with." That's an important insight:
we're entering the era of products that are built on data. We don't yet know
what those products are, but we do know that the winners will be the people,
and the companies, that find those products. Hilary Mason came to the same
conclusion. Her job as scientist at bit.ly is really to investigate the data that
bit.ly is generating, and find out how to build interesting products from it. No
one in the nascent data industry is trying to build the 2012 Nissan Stanza or
Office 2015; they're all trying to find new products. In addition to being
physicists, mathematicians, programmers, and artists, they're entrepreneurs.
Data scientists combine entrepreneurship with patience, the willingness to
build data products incrementally, the ability to explore, and the ability to
iterate over a solution. They are inherently interdisciplinary. They can tackle
all aspects of a problem, from initial data collection and data conditioning to
drawing conclusions. They can think outside the box to come up with new
ways to view the problem, or to work with very broadly defined problems:
"here's a lot of data, what can you make from it?"
The future belongs to the companies who figure out how to collect and use
data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped
into their datastreams and made that the core of their success. They were the
vanguard, but newer companies like bit.ly are following their path. Whether
it's mining your personal biology, building maps from the shared experience
of millions of travellers, or studying the URLs that people pass to others, the
next generation of successful businesses will be built around data. The part of
Hal Varian's quote that nobody remembers says it all:
    The ability to take data, to be able to understand it, to process it, to
    extract value from it, to visualize it, to communicate it: that's going to
    be a hugely important skill in the next decades.
Data is indeed the new Intel Inside.
O'Reilly publications related to data science
R in a Nutshell
A quick and practical reference to learn what is becoming the standard for
developing statistical software.
Statistics in a Nutshell
An introduction and reference for anyone with no previous background in
statistics.
Data Analysis with Open Source Tools
This book shows you how to think about data and the results you want to
achieve with it.
Programming Collective Intelligence
Learn how to build web applications that mine the data created by people on
the Internet.
Beautiful Data
Learn from the best data practitioners in the field about how wide-ranging,
and beautiful, working with data can be.
Beautiful Visualization
This book demonstrates why visualizations are beautiful not only for their
aesthetic design, but also for elegant layers of detail.
Head First Statistics
This book teaches statistics through puzzles, stories, visual aids, and
real-world examples.
Head First Data Analysis
Learn how to collect your data, sort the distractions from the truth, and find
meaningful patterns.
The SMAQ stack for big data
Storage, MapReduce and Query are ushering in data-driven products
and services.
by Edd Dumbill
SMAQ report sections
→ "MapReduce"
→ "Storage"
→ "Query"
→ "Conclusion"
"Big data" is data that becomes large enough that it cannot be processed using
conventional methods. Creators of web search engines were among the first
to confront this problem. Today, social networks, mobile phones, sensors and
science contribute to petabytes of data created daily.
To meet the challenge of processing such large data sets, Google created
MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce
implementation have spawned an ecosystem of big data processing tools.
As MapReduce has grown in popularity, a stack for big data systems has
emerged, comprising layers of Storage, MapReduce and Query (SMAQ).
SMAQ systems are typically open source, distributed, and run on commodity
hardware.
In the same way the commodity LAMP stack of Linux, Apache, MySQL and
PHP changed the landscape of web applications, SMAQ systems are bringing
commodity big data processing to a broad audience. SMAQ systems underpin
a new era of innovative data-driven products and services, in the same way
that LAMP was a critical enabler for Web 2.0.
Though dominated by Hadoop-based architectures, SMAQ encompasses a
variety of systems, including leading NoSQL databases. This paper describes
the SMAQ stack and where today's big data tools fit into the picture.
MapReduce
Created at Google in response to the problem of creating web search indexes,
the MapReduce framework is the powerhouse behind most of today's big data
processing. The key innovation of MapReduce is the ability to take a query
over a data set, divide it, and run it in parallel over many nodes. This
distribution solves the issue of data too large to fit onto a single machine.
To understand how MapReduce works, look at the two phases suggested by
its name. In the map phase, input data is processed, item by item, and
transformed into an intermediate data set. In the reduce phase, these
intermediate results are reduced to a summarized data set, which is the desired
end result.
A simple example of MapReduce is the task of counting how many times each
unique word occurs in a document. In the map phase, each word is identified and
given the count of 1. In the reduce phase, the counts are added together for each
word.
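The two phases can be sketched on a single machine in plain Python. This is a
minimal illustration of the idea rather than a distributed implementation: the
function names are invented for the example, lowercasing words is an added
assumption, and the shuffle step stands in for the grouping by key that a
MapReduce framework performs between the two phases.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Add the counts together for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(document):
    return reduce_phase(shuffle(map_phase(document)))

print(word_count("the quick fox jumps over the lazy dog the end"))
```

Because each (word, 1) pair is produced independently and each word's counts are
summed independently, both phases could be split across many workers.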
If that seems like an obscure way of doing a simple task, that's because it is.
In order for MapReduce to do its job, the map and reduce phases must obey
certain constraints that allow the work to be parallelized. Translating queries
into one or more MapReduce steps is not an intuitive process. Higher-level
abstractions have been developed to ease this, discussed under Query below.
An important way in which MapReduce-based systems differ from
conventional databases is that they process data in a batch-oriented fashion.
Work must be queued for execution, and may take minutes or hours to process.
Using MapReduce to solve problems entails three distinct operations:
• Loading the data: This operation is more properly called Extract,
  Transform, Load (ETL) in data warehousing terminology. Data must be
  extracted from its source, structured to make it ready for processing, and
  loaded into the storage layer for MapReduce to operate on it.
• MapReduce: This phase will retrieve data from storage, process it, and
  return the results to the storage.
• Extracting the result: Once processing is complete, for the result to be
  useful to humans, it must be retrieved from the storage and presented.
Many SMAQ systems have features designed to simplify the operation of each
of these stages.
Hadoop MapReduce
Hadoop is the dominant open source MapReduce implementation. Funded
by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting,
reached "web scale" capability in early 2008.
The Hadoop project is now hosted by Apache. It has grown into a large
endeavor, with multiple subprojects that together comprise a full SMAQ stack.
Since it is implemented in Java, Hadoop's MapReduce implementation is
accessible from the Java programming language. Creating MapReduce jobs
involves writing functions to encapsulate the map and reduce stages of the
computation. The data to be processed must be loaded into the Hadoop
Distributed Filesystem.
Taking the word-count example from above, a suitable map function might
look like the following (taken from the Hadoop MapReduce documentation).
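    // Nested inside the job class. Needs java.io.IOException,
    // java.util.StringTokenizer, org.apache.hadoop.io.* and
    // org.apache.hadoop.mapreduce.Mapper.
    public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input record: the value holds one line of text.
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);  // emit (word, 1)
            }
        }
    }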
The corresponding reduce function sums the counts for each word.
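    // Needs org.apache.hadoop.mapreduce.Reducer in addition to the
    // imports used by the map class.
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per word, with all of the counts emitted for it.
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }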
The process of running a MapReduce job with Hadoop involves the following
steps:
• Defining the MapReduce stages in a Java program
• Loading the data into the filesystem
• Submitting the job for execution
• Retrieving the results from the filesystem
Run via the standalone Java API, Hadoop MapReduce jobs can be complex to
create, and necessitate programmer involvement. A broad ecosystem has
grown up around Hadoop to make the task of loading and processing data
more straightforward.
Other implementations
MapReduce has been implemented in a variety of other programming
languages and systems, a list of which may be found in Wikipedia's entry for
MapReduce. Notably, several NoSQL database systems have integrated
MapReduce, and are described later in this paper.
Storage
MapReduce requires storage from which to fetch data and in which to store
the results of the computation. The data expected by MapReduce is not
relational data, as used by conventional databases. Instead, data is consumed in
chunks, which are then divided among nodes and fed to the map phase as
key-value pairs. This data does not require a schema, and may be unstructured.
However, the data must be available in a distributed fashion, to serve each
processing node.
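As a rough single-machine sketch of that description, the Python below turns
unstructured text into key-value pairs, cuts them into chunks, and hands the
chunks to simulated nodes. The function names, the chunk size counted in
records, and the round-robin assignment are assumptions chosen for brevity; a
real storage layer splits data by byte ranges and places it with far more care.
Using the byte offset of each line as the key mirrors the LongWritable key and
Text value seen in the Hadoop mapper listing earlier.

```python
def to_records(text):
    """Turn raw text into (byte_offset, line) key-value pairs; no schema needed."""
    records, offset = [], 0
    for line in text.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line.encode("utf-8"))
    return records

def split_into_chunks(records, chunk_size):
    """Cut the record list into consecutive chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def assign_to_nodes(chunks, node_count):
    """Distribute chunks round-robin across a number of simulated nodes."""
    nodes = [[] for _ in range(node_count)]
    for index, chunk in enumerate(chunks):
        nodes[index % node_count].extend(chunk)
    return nodes

text = "first line\nsecond line\nthird line\nfourth line\n"
nodes = assign_to_nodes(split_into_chunks(to_records(text), 2), 2)
for node_id, records in enumerate(nodes):
    print(node_id, records)
```

Each node then runs the map function over its own records only, which is what
lets the map phase proceed in parallel.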
The design and features of the storage layer are important not just because of
the interface with MapReduce, but also because they affect the ease with which
data can be loaded and the results of computation extracted and searched.
Hadoop Distributed File System
The standard storage mechanism used by Hadoop is the Hadoop Distributed
File System, HDFS. A core part of Hadoop, HDFS has the following features,
as detailed in the HDFS design document.
• Fault tolerance: Assuming that failure will happen allows HDFS to run
  on commodity hardware.
• Streaming data access: HDFS is written with batch processing in mind,
  and emphasizes high throughput rather than random access to data.
• Extreme scalability: HDFS will scale to petabytes; such an installation
  is in production use at Facebook.
• Portability: HDFS is portable across operating systems.
• Write once: By assuming a file will remain unchanged after it is written,
  HDFS simplifies replication and speeds up data throughput.
• Locality of computation: Due to data volume, it is often much faster
  to move the program near to the data, and HDFS has features to facilitate
  this.
HDFS provides an interface similar to that of regular filesystems. Unlike a
database, HDFS can only store and retrieve data, not index it. Simple random
access to data is not possible. However, higher-level layers have been created
to provide finer-grained functionality to Hadoop deployments, such as HBase.
HBase, the Hadoop Database
One approach to making HDFS more usable is HBase. Modeled after Google's
BigTable database, HBase is a column-oriented database designed to store
massive amounts of data. It belongs to the NoSQL universe of databases, and
is similar to Cassandra and Hypertable.
HBase uses HDFS as a storage system, and thus is capable of storing a large
volume of data through fault-tolerant, distributed nodes. Like similar
column-store databases, HBase provides REST and Thrift based API access.
Because it creates indexes, HBase offers fast, random access to its contents,
though with simple queries. For complex operations, HBase acts as both a
source and a sink (destination for computed data) for Hadoop MapReduce.
HBase thus allows systems to interface with Hadoop as a database, rather than
the lower level of HDFS.
Hive
Data warehousing, or storing data in such a way as to make reporting and
analysis easier, is an important application area for SMAQ systems. Developed
originally at Facebook, Hive is a data warehouse framework built on top of
Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS
and makes it easy to load structured data. In contrast to HBase, Hive can only
run MapReduce jobs and is suited for batch data analysis. Hive provides a
SQL-like query language to execute MapReduce jobs, described in the Query
section below.
Cassandra and Hypertable
Cassandra and Hypertable are both scalable column-store databases that
follow the pattern of BigTable, similar to HBase.
An Apache project, Cassandra originated at Facebook and is now in
production in many large-scale websites, including Twitter, Facebook, Reddit and
Digg. Hypertable was created at Zvents and spun out as an open source project.
Both databases offer interfaces to the Hadoop API that allow them to act as a
source and a sink for MapReduce. At a higher level, Cassandra offers
integration with the Pig query language (see the Query section below), and
Hypertable has been integrated with Hive.
NoSQL database implementations of MapReduce
The storage solutions examined so far have all depended on Hadoop for
MapReduce. Other NoSQL databases have built-in MapReduce features that allow
computation to be parallelized over their data stores. In contrast with the
multi-component SMAQ architectures of Hadoop-based systems, they offer a
self-contained system comprising storage, MapReduce and query all in one.
Whereas Hadoop-based systems are most often used for batch-oriented
analytical purposes, the usual function of NoSQL stores is to back live
applications. The MapReduce functionality in these databases tends to be a
secondary feature, augmenting other primary query mechanisms. Riak, for example,
has a default timeout of 60 seconds on a MapReduce job, in contrast to the
expectation of Hadoop that such a process may run for minutes or hours.
These prominent NoSQL databases contain MapReduce functionality:
• CouchDB is a distributed database, offering semi-structured
  document-based storage. Its key features include strong replication support
  and the ability to make distributed updates. Queries in CouchDB are
  implemented using JavaScript to define the map and reduce phases of a
  MapReduce process.
• MongoDB is very similar to CouchDB in nature, but with a stronger
  emphasis on performance, and less suitability for distributed updates,
  replication, and versioning. MongoDB MapReduce operations are specified
  using JavaScript.
• Riak is another database similar to CouchDB and MongoDB, but places
  its emphasis on high availability. MapReduce operations in Riak may be
  specified with JavaScript or Erlang.
Integration with SQL databases
In many applications, the primary source of data is in a relational database
using platforms such as MySQL or Oracle. MapReduce is typically used with
this data in two ways:
• Using relational data as a source (for example, a list of your friends in a
  social network).
• Re-injecting the results of a MapReduce operation into the database (for
  example, a list of product recommendations based on friends' interests).
It is therefore important to understand how MapReduce can interface with
relational database systems. At the most basic level, delimited text files serve
as an import and export format between relational databases and Hadoop
systems, using a combination of SQL export commands and HDFS operations.
More sophisticated tools do, however, exist.
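The export half of that basic approach might look like the following sketch.
It is illustrative only: Python's built-in sqlite3 module stands in for a
production database such as MySQL or Oracle, an in-memory buffer stands in for
the file that would afterwards be copied into HDFS, and the export_table helper
and the friends table are invented for the example.

```python
import csv
import io
import sqlite3

def export_table(connection, table, out_file, delimiter="\t"):
    """Write every row of a table to out_file as delimited text; return the row count."""
    # Table names cannot be bound as SQL parameters, so pass only trusted names.
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY rowid")
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
connection.executemany("INSERT INTO friends VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])

buffer = io.StringIO()
exported = export_table(connection, "friends", buffer)
print(exported)
print(buffer.getvalue())
```

Each output line is one record with tab-separated fields, a form that a map
function can split without knowing anything about the source schema.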
The Sqoop tool is designed to import data from relational databases into
Hadoop. It was developed by Cloudera, an enterprise-focused distributor of
Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC
database API. Tables can be imported either wholesale, or using queries to restrict
the data import.
Sqoop also offers the ability to re-inject the results of MapReduce from HDFS
back into a relational database. As HDFS is a filesystem, Sqoop expects
delimited text files and transforms them into the SQL commands required to
insert data into the database.
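The general idea of turning delimited text into insert statements can be
illustrated as follows. This is not Sqoop's implementation or interface, just a
sketch of the transformation: sqlite3 again stands in for the target database,
and load_delimited, the word_counts table, and the sample lines are
hypothetical.

```python
import csv
import io
import sqlite3

def load_delimited(connection, table, column_count, in_file, delimiter="\t"):
    """Insert each delimited line of in_file as a row of table; return rows inserted."""
    placeholders = ", ".join("?" for _ in range(column_count))
    # Only the values are bound as parameters; the table name must be trusted.
    statement = f"INSERT INTO {table} VALUES ({placeholders})"
    rows = [row for row in csv.reader(in_file, delimiter=delimiter) if row]
    connection.executemany(statement, rows)
    connection.commit()
    return len(rows)

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE word_counts (word TEXT, total INTEGER)")

# Stand-in for a MapReduce output file retrieved from the filesystem.
results = io.StringIO("data\t42\nscience\t17\n")
inserted = load_delimited(connection, "word_counts", 2, results)
print(inserted)
print(connection.execute("SELECT word, total FROM word_counts ORDER BY word").fetchall())
```

Binding the values as parameters, rather than pasting them into the SQL text,
keeps odd characters in the data from breaking the statement.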
For Hadoop systems that utilize the Cascading API (see the Query section
below) the cascading.jdbc and cascading-dbmigrate tools offer similar source
and sink functionality.
Integration with streaming data sources
In addition to relational data sources, streaming data sources, such as web
server log files or sensor output, constitute the most common source of input
to big data systems. The Cloudera Flume project aims at providing convenient
integration between Hadoop and streaming data sources. Flume aggregates
data from both network and file sources, spread over a cluster of machines,
and continuously pipes these into HDFS. The Scribe server, developed at
Facebook, also offers similar functionality.
Commercial SMAQ solutions
Several massively parallel processing (MPP) database products have
MapReduce functionality built in. MPP databases have a distributed architecture
with independent nodes that run in parallel. Their primary application is in data
warehousing and analytics, and they are commonly accessed using SQL.
• The Greenplum database is based on the open source PostgreSQL DBMS,
  and runs on clusters of distributed hardware. The addition of MapReduce
  to the regular SQL interface enables fast, large-scale analytics over
  Greenplum databases, reducing query times by several orders of
  magnitude. Greenplum MapReduce permits the mixing of external data sources
  with the database storage. MapReduce operations can be expressed as
  functions in Perl or Python.
• Aster Data's nCluster data warehouse system also offers MapReduce
  functionality. MapReduce operations are invoked using Aster Data's
  SQL-MapReduce technology. SQL-MapReduce enables the intermingling of
  SQL queries with MapReduce jobs defined using code, which may be
  written in languages including C#, C++, Java, R or Python.
Other data warehousing solutions have opted to provide connectors with
Hadoop, rather than integrating their own MapReduce functionality.
• Vertica, famously used by Farmville creator Zynga, is an MPP
  column-oriented database that offers a connector for Hadoop.
• Netezza is an established manufacturer of hardware data warehousing and
  analytical appliances. Recently acquired by IBM, Netezza is working with
  Hadoop distributor Cloudera to enhance the interoperation between their
  appliances and Hadoop. While it solves similar problems, Netezza falls
  outside of our SMAQ definition, lacking both the open source and
  commodity hardware aspects.
Although creating a Hadoop-based system can be done entirely with open
source, it requires some effort to integrate such a system. Cloudera aims to
make Hadoop enterprise-ready, and has created a unified Hadoop distribution
in its Cloudera Distribution for Hadoop (CDH). CDH for Hadoop parallels
the work of Red Hat or Ubuntu in creating Linux distributions. CDH comes
in both a free edition and an Enterprise edition with additional proprietary
components and support. CDH is an integrated and polished SMAQ
environment, complete with user interfaces for operation and query. Cloudera's
work has resulted in some significant contributions to the Hadoop open source
ecosystem.
Query
Specifying MapReduce jobs in terms of defining distinct map and reduce
functions in a programming language is unintuitive and inconvenient, as is evident
from the Java code listings shown above. To mitigate this, SMAQ systems
incorporate a higher-level query layer to simplify both the specification of the
MapReduce operations and the retrieval of the result.
Many organizations using Hadoop will have already written in-house layers
on top of the MapReduce API to make its operation more convenient. Several
of these have emerged either as open source projects or commercial products.
Query layers typically offer features that handle not only the specification of
the computation, but the loading and saving of data and the orchestration of
the processing on the MapReduce cluster. Search technology is often used to
implement the final step in presenting the computed result back to the user.
Pig
Developed by Yahoo and now part of the Hadoop project, Pig provides a new
high-level language, Pig Latin, for describing and running Hadoop MapReduce
jobs. It is intended to make Hadoop accessible for developers familiar with
data manipulation using SQL, and provides an interactive interface as well as
a Java API. Pig integration is available for the Cassandra and HBase databases.
Below is shown the word-count example in Pig, including both the data
loading and storing phases (the notation $0 refers to the first field in a record).
    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();
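For readers who do not know Pig Latin, the same dataflow can be mirrored step
by step in plain Python. This is an in-memory analogy, not something Pig
generates: the function name is invented, and str.split only approximates
TOKENIZE, which also breaks on some punctuation.

```python
from collections import Counter

def word_count_dataflow(lines):
    # LOAD: each input line is one record with a single field ($0).
    records = list(lines)
    # FOREACH ... GENERATE FLATTEN(TOKENIZE($0)): one output record per token.
    words = [token for line in records for token in line.split()]
    # GROUP words BY $0, then COUNT(words) within each group.
    counts = Counter(words)
    # ORDER counts BY $0: sort by the word itself.
    return sorted(counts.items())

print(word_count_dataflow(["a rose is a rose", "is it"]))
```

The difference is in execution: the Python version runs in one process, while
Pig translates the equivalent steps into MapReduce jobs on a cluster.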
While Pig is very expressive, it is possible for developers to write custom steps
in User Defined Functions (UDFs), in the same way that many SQL databases
support the addition of custom functions. These UDFs are written in Java
against the Pig API.
Though much simpler to understand and use than the MapReduce API, Pig
suffers from the drawback of being yet another language to learn. It is SQL-like
in some ways, but it is sufficiently different from SQL that it is difficult for
users familiar with SQL to reuse their knowledge.
Hive
As introduced above, Hive is an open source data warehousing solution built
on top of Hadoop. Created by Facebook, it offers a query language very similar
to SQL, as well as a web interface that offers simple query-building
functionality. As such, it is suited for non-developer users, who may have some
familiarity with SQL.
Hive's particular strength is in offering ad-hoc querying of data, in contrast to
the compilation requirement of Pig and Cascading. Hive is a natural starting
point for more full-featured business intelligence systems, which offer a
user-friendly interface for non-technical users.
The Cloudera Distribution for Hadoop integrates Hive, and provides a
higher-level user interface through the HUE project, enabling users to submit
queries and monitor the execution of Hadoop jobs.
Cascading, the API Approach
The Cascading project provides a wrapper around Hadoop's MapReduce API
to make it more convenient to use from Java applications. It is an intentionally
thin layer that makes the integration of MapReduce into a larger system more
convenient. Cascading's features include:
• A data processing API that aids the simple definition of MapReduce jobs.
• An API that controls the execution of MapReduce jobs on a Hadoop
  cluster.
• Access via JVM-based scripting languages such as Jython, Groovy, or
  JRuby.
• Integration with data sources other than HDFS, including Amazon S3 and
  web servers.
• Validation mechanisms to enable the testing of MapReduce processes.
Cascading's key feature is that it lets developers assemble MapReduce
operations as a flow, joining together a selection of "pipes". It is well suited for
integrating Hadoop into a larger system within an organization.
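The notion of joining pipes into a flow can be shown with a small Python
analogy. None of the names below come from Cascading, whose real API is Java
and considerably richer; flow, tokenize, lowercase and count are hypothetical
helpers that only illustrate the composition style.

```python
from functools import reduce

def flow(*pipes):
    """Join pipes into one callable that feeds each pipe's output to the next."""
    return lambda records: reduce(lambda data, pipe: pipe(data), pipes, records)

def tokenize(lines):
    """Pipe: split every line into words."""
    return (word for line in lines for word in line.split())

def lowercase(words):
    """Pipe: normalize case."""
    return (word.lower() for word in words)

def count(words):
    """Pipe: total the occurrences of each word."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals

word_count_flow = flow(tokenize, lowercase, count)
print(word_count_flow(["Big data", "big DATA now"]))
```

Because each pipe only agrees on its input and output, pipes can be reordered,
reused in other flows, or tested in isolation.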
While Cascading itself doesn't provide a higher-level query language, a
derivative open source project called Cascalog does just that. Using the Clojure
JVM language, Cascalog implements a query language similar to that of Datalog.
Though powerful and expressive, Cascalog is likely to remain a niche query
language, as it offers neither the ready familiarity of Hive's SQL-like approach
nor Pig's procedural expression. The listing below shows the word-count
example in Cascalog: it is significantly terser, if less transparent.
    (defmapcatop split [sentence]
      (seq (.split sentence "\\s+")))

    (?<- (stdout) [?word ?count]
         (sentence ?s) (split ?s :> ?word)
         (c/count ?count))
Search with Solr
An important component of large-scale data deployments is retrieving and
summarizing data. The addition of database layers such as HBase provides
easier access to data, but does not provide sophisticated search capabilities.
To solve the search problem, the open source search and indexing platform
Solr is often used alongside NoSQL database systems. Solr uses Lucene search
technology to provide a self-contained search server product.
For example, consider a social network database where MapReduce is used to
compute the influencing power of each person, according to some suitable
metric. This ranking would then be reinjected to the database. Using Solr
indexing allows operations on the social network, such as finding the most
influential people whose interest profiles mention mobile phones, for instance.
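To make the example concrete, here is a toy in-memory version of that query in
Python. It is not Solr or its API: a small inverted index stands in for the
search server, the influence scores stand in for the output of a MapReduce
job, and all names and data are invented.

```python
from collections import defaultdict

def build_index(profiles):
    """Map each lowercase term to the set of people whose interests mention it."""
    index = defaultdict(set)
    for person, interests in profiles.items():
        for term in interests.lower().split():
            index[term].add(person)
    return index

def most_influential(index, influence, phrase, limit=3):
    """Return up to limit people matching every term, highest influence first."""
    terms = phrase.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches, key=lambda person: (-influence[person], person))[:limit]

profiles = {
    "ana": "mobile phones and photography",
    "bo": "cooking",
    "cy": "mobile phones",
    "di": "vintage phones",
}
# Precomputed scores, standing in for the result of a MapReduce ranking job.
influence = {"ana": 0.4, "bo": 0.9, "cy": 0.7, "di": 0.8}

index = build_index(profiles)
print(most_influential(index, influence, "mobile phones"))
```

The expensive ranking is computed in batch, while the index answers the
interactive question quickly; that division of labor is the point of pairing
search with MapReduce.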
Originally developed at CNET and now an Apache project, Solr has evolved
from being just a text search engine to supporting faceted navigation and
results clustering. Additionally, Solr can manage large data volumes over
distributed servers. This makes it an ideal solution for result retrieval over big
data sets, and a useful component for constructing business intelligence
dashboards.
Conclusion
MapReduce, and Hadoop in particular, offers a powerful means of distributing
computation among commodity servers. Combined with distributed storage
and increasingly user-friendly query mechanisms, the resulting SMAQ
architecture brings big data processing within reach for even small- and
solo-development teams.
It is now economical to conduct extensive investigation into data, or create data
products that rely on complex computations. The resulting explosion in
capability has forever altered the landscape of analytics and data warehousing
systems, lowering the bar to entry and fostering a new generation of products,
services and organizational attitudes, a trend explored more broadly in Mike
Loukides' "What is Data Science?" report.
The emergence of Linux gave power to the innovative developer with merely
a small Linux server at their desk: SMAQ has the same potential to streamline
data centers, foster innovation at the edges of an organization, and enable new
startups to cheaply create data-driven businesses.
Scraping, cleaning, and selling big data
Infochimps execs discuss the chaIIenges of data scraping.
Ly Auuiey Vatteis
In 200S, the Austin-Laseu uata staitup Inlochimps ieleaseu a sciape ol Twittei
uata that was latei taken uown at the ieguest ol the micioLlogging site Lecause
ol usei piivacy conceins. Inlochimps has since stiuck a ueal with Twittei to
make some uatasets availaLle on the site, anu the Inlochimps maiketplace now
contains moie than 10,000 uatasets liom a vaiiety ol souices. Not all these
uatasets have Leen oLtaineu via sciaping, Lut neveitheless, the company`s
piocess ol sciaping, cleaning, anu selling Lig uata is an inteiesting topic to
exploie, Loth technically anu legally.
With that in mind, Infochimps CEO Nick Ducoff, CTO Flip Kromer, and
business development manager Dick Hall explain the business of data scraping
in the following interview.
What are the legal implications of data scraping?
Dick Hall: There are three main areas you need to consider: copyright, terms
of service, and "trespass to chattels."
United States copyright law protects against unauthorized copying of "original
works of authorship." Facts and ideas are not copyrightable. However,
expressions or arrangements of facts may be copyrightable. For example, a recipe
for dinner is not copyrightable, but a recipe book with a series of recipes
selected based on a unifying theme would be copyrightable. This example
illustrates the "originality" requirement for copyright.
Let's apply this to a concrete web-scraping example. The New York Times
publishes a blog post that includes the results of an election poll arranged in
descending order by percentage. The New York Times can claim a copyright
on the blog post, but not the table of poll results. A web scraper is free to copy
the data contained in the table without fear of copyright infringement.
However, in order to make a copy of the blog post wholesale, the web scraper would
have to rely on a defense to infringement, such as fair use. The result is that it
is difficult to maintain a copyright over data, because only a specific
arrangement or selection of the data will be protected.
Most websites include a page outlining their terms of service (ToS), which
defines the acceptable use of the website. For example, YouTube forbids a user
from posting copyrighted materials if the user does not own the copyright.
Terms of service are based in contract law, but their enforceability is a gray
area in US law. A web scraper violating the letter of a site's ToS may argue that
they never explicitly saw or agreed to the terms of service.
Assuming ToS are enforceable, they are a risky issue for web scrapers. First,
every site on the Internet will have a different ToS: Twitter, Facebook, and
The New York Times may all have drastically different ideas of what is
acceptable use. Second, a site may unilaterally change the ToS without notice
and maintain that continued use represents acceptance of the new ToS by a
web scraper or user. For example, Twitter recently changed its ToS to make it
significantly more difficult for outside organizations to store or export tweets
for any reason.
There's also the issue of volume. High-volume web scraping could cause
significant monetary damages to the sites being scraped. For example, if a web
scraper checks a site for changes several thousand times per second, it is
functionally equivalent to a denial of service attack. In this case, the web scraper
may be liable for damages under a theory of "trespass to chattels," because the
site owner has a property interest in his or her web servers. A good-natured
web scraper should be able to avoid this issue by picking a reasonable
frequency for scraping.
OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering
for developers who are hands-on, doing the systems work and evolving
architectures and tools to manage data. (This event is co-located with OSCON.)
Save 20% on registration with the code OS11RAD
What are some of the challenges of acquiring data through scraping?
Flip Kromer: There are several problems with the scale and the metadata, as
well as historical complications.
• Scale: It's obvious that terabytes of data will cause problems, but so (on
most filesystems) will having tens of millions of files in the same directory
tree.
• Metadata: It's a chicken-and-egg problem. Since few programs can draw
on rich metadata, it's not much use annotating it. But since so few datasets
are annotated, it's not worth writing support into your applications. We
have an internal data-description language that we plan to open source as
it matures.
• Historical complications: Statisticians like SPSS files. Semantic web
advocates like RDF/XML. Wall Street quants like Mathematica exports.
There is no One True Format. Lifting each out of its source domain is time
consuming.
But the biggest non-obvious problem we see is source domain complexity. This
is what we call the "uber" problem. A developer wants the answer to a
reasonable question, such as "What was the air temperature in Austin at noon
on August 6, 1998?" The obvious answer, "damn hot," isn't acceptable.
Neither is:
Well, it's complicated. See, there are multiple weather stations, all reporting
temperatures (each with its own error estimate) at different times. So you
simply have to take the spatial- and time-average of their reported values across
the region. And by the way, did you mean Austin's city boundary, or its
metropolitan area, or its downtown region?
There are more than a dozen incompatible yet fundamentally correct ways to
measure time: Earth-centered? Leap seconds? Calendrical? Does the length of
a day change as the earth's rotational speed does?
Data at "everything" scale is sourced by domain experts, who necessarily live
at the "it's complicated" level. To make it useful to the rest of the world requires
domain knowledge, and often a transformation that is simply nonsensical
within the source domain.
How will data marketplaces change the work and direction of data startups?
Nick Ducoff: I vividly remember being taught about comparative advantage.
This might age me a bit, but the lesson was: Michael Jordan doesn't mow
his own lawn. Why? Because he should spend his time practicing basketball
since that's what he's best at and makes a lot of money doing. The same
analogy applies to software developers. If you are best at the presentation layer,
you don't want to spend your time futzing around with databases.
Infochimps allows these developers to spend their time doing what they do
best (building apps) while we spend ours doing what we do best: making
data easy to find and use. What we're seeing is startups focusing on pieces of
the stack. Over time the big cloud providers will buy these companies to
integrate into their stacks.
Companies like Heroku (acquired by Salesforce) and CloudKick (acquired by
Rackspace) have paved the way for this. Tools like ScraperWiki and Junar will
allow anybody to pull down tables off the web, and companies like Mashery,
Apigee and 3scale will continue to make APIs more prevalent. We'll help make
these tables and APIs findable and usable. Developers will be able to go from
idea to app in hours, not days or weeks.
This interview was edited and condensed.
Data hand tools
A data task illustrates the importance of simple and flexible tools.
by Mike Loukides
The flowering of data science has both driven, and been driven by, an explosion
of powerful tools. R provides a great platform for doing statistical analysis,
Hadoop provides a framework for orchestrating large clusters to solve
problems in parallel, and many NoSQL databases exist for storing huge amounts
of unstructured data. The heavy machinery for serious number crunching
includes perennials such as Mathematica, Matlab, and Octave, most of which
have been extended for use with large clusters and other big iron.
But these tools haven't negated the value of much simpler tools; in fact, they're
an essential part of a data scientist's toolkit. Hilary Mason and Chris Wiggins
wrote that "Sed, awk, grep are enough for most small tasks," and there's a
layer of tools below sed, awk, and grep that are equally useful. Hilary has
pointed out the value of exploring data sets with simple tools before proceeding
to a more in-depth analysis. The advent of cloud computing, Amazon's
EC2 in particular, also places a premium on fluency with simple command-line
tools. In conversation, Mike Driscoll of Metamarkets pointed out the value
of basic tools like grep to filter your data before processing it or moving it
somewhere else. Tools like grep were designed to do one thing and do it well.
Because they're so simple, they're also extremely flexible, and can easily be
used to build up powerful processing pipelines using nothing but the
command line. So while we have an extraordinary wealth of power tools at our
disposal, we'll be the poorer if we forget the basics.
With that in mind, here's a very simple, and not contrived, task that I needed
to accomplish. I'm a ham radio operator. I spent time recently in a contest that
involved making contacts with lots of stations all over the world, but
particularly in Russia. Russian stations all sent their two-letter oblast abbreviation
(equivalent to a US state). I needed to figure out how many oblasts I contacted,
along with counting oblasts on particular ham bands. Yes, I have software to
do that; and no, it wasn't working (bad data file, since fixed). So let's look at
how to do this with the simplest of tools.
(Note: Some of the spacing in the associated data was edited to fit on the page.
If you copy and paste the data, a few commands that rely on counting spaces
won't work.)
Log entries look like this:
QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041
QSO: 14000 CW 2011-03-19 1232 W1JQ 599 0002 SO2O 599 0043
QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR
QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO
...
Most of the fields are arcane stuff that we won't need for these exercises. The
Russian entries have a two-letter oblast abbreviation at the end; rows that end
with a number are contacts with stations outside of Russia. We'll also use the
second field, which identifies a ham radio band (21000 kHz, 14000 kHz, 7000
kHz, 3500 kHz, etc.). So first, let's strip everything but the Russians with
grep and a regular expression:
$ grep '599 [A-Z][A-Z]' rudx-log.txt | head -2
QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR
QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO
grep may be the most useful tool in the Unix toolchest. Here, I'm just searching
for lines that have 599 (which occurs everywhere) followed by a space,
followed by two uppercase letters. To deal with mixed case (not necessary here),
use grep -i. You can use character classes like [:upper:] rather than specifying
the range A-Z, but why bother? Regular expressions can become very complex,
but simple will often do the job, and be less error-prone.
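For readers who do want the character-class form, here is a minimal sketch. In POSIX regular expressions the class name sits inside a bracket expression, so it is written [[:upper:]]. The three sample lines are taken from the log excerpt above; the scratch file exists only to make the example self-contained.

```shell
# Three entries from the log excerpt above, written to a scratch file.
log=$(mktemp)
printf '%s\n' \
  'QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041' \
  'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR' \
  'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO' > "$log"

# Range form, as used in the text.
range_count=$(grep -c '599 [A-Z][A-Z]' "$log")

# Character-class form; the class name goes inside a bracket expression.
class_count=$(grep -c '599 [[:upper:]][[:upper:]]' "$log")

echo "range: $range_count  class: $class_count"
rm -f "$log"
```

Both forms select the same two Russian entries here. The class form is mainly worth the extra typing when the locale makes a plain A-Z range behave unexpectedly.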
If you're familiar with grep, you may be asking why I didn't use $ to match the
end of line, and forget about the 599 noise. Good question. There is some
whitespace at the end of the line; we'd have to match that, too. Because this
file was created on a Windows machine, instead of just a newline at the end
of each line, it has a return and a newline. The $ that grep uses to match the
end-of-line only matches a Unix newline. So I did the easiest thing that would
work reliably.
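If you would rather anchor on the end of the line, one option is to strip the carriage returns first. The sketch below assumes the only obstacles are the Windows line endings and optional trailing spaces; the two sample lines come from the log excerpt, with CRLF endings added by printf.

```shell
# Build a two-line sample with Windows (CRLF) line endings.
sample=$(mktemp)
printf 'QSO: 14000 CW 2011-03-19 1232 W1JQ 599 0002 SO2O 599 0043\r\n' > "$sample"
printf 'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR\r\n' >> "$sample"

# With the carriage return still present, anchoring on $ finds nothing.
# (grep exits non-zero when nothing matches, hence the "|| true".)
before=$(grep -c ' [A-Z][A-Z]$' "$sample") || true

# Delete carriage returns first, then anchor on the end of the line,
# allowing for any trailing spaces.
after=$(tr -d '\r' < "$sample" | grep -c ' [A-Z][A-Z] *$')

echo "matches before: $before, after: $after"
rm -f "$sample"
```

The extra tr stage costs one more process in the pipeline, which is the trade-off against matching on the 599 that precedes the abbreviation.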
The simple head utility is a jewel. If you leave head off of the previous
command, you'll get a long listing scrolling down your screen. That's rarely useful,
especially when you're building a chain of commands. head gives you the first
few lines of output: 10 lines by default, but you can specify the number of lines
you want. -2 says "just two lines," which is enough for us to see that this script
is doing what we want.
Next, we need to cut out the junk we don't want. The easy way to do this is
to use colrm (remove columns). That takes two arguments: the first and last
column to remove. Column numbering starts with one, so in this case we can
use colrm 1 72.
$ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | head -2
VR
MO
...
How did I know we wanted column 72? Just a little experimentation;
command lines are cheap, especially with command history editing. I should
actually use 73, but that additional space won't hurt, nor will the additional
whitespace at the end of each line. Yes, there are better ways to select columns;
we'll see them shortly. Next, we need to sort and find the unique abbreviations.
I'm going to use two commands here: sort (which does what you'd expect),
and uniq (to remove duplicates).
$ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort |\
uniq | head -2
AD
AL
Sort has a -u option that suppresses duplicates, but for some reason I prefer to
keep sort and uniq separate. sort can also be made case-insensitive (-f), can
select particular fields (meaning we could eliminate the colrm command, too),
can do numeric sorts in addition to lexical sorts, and lots of other things.
Personally, I prefer building up long Unix pipes one command at a time to hunting
for the right options.
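As a quick check that the two styles agree, the sketch below runs both on a handful of abbreviations. The five-line list is a made-up stand-in for the output of the grep and colrm stages, not data from the real log.

```shell
# Stand-in for the output of the grep | colrm stage: one abbreviation per line.
abbrevs='VR
MO
VR
AD
AD'

# Two commands: sort, then drop adjacent duplicates.
two_step=$(printf '%s\n' "$abbrevs" | sort | uniq)

# One command: sort -u sorts and suppresses duplicates in one go.
one_step=$(printf '%s\n' "$abbrevs" | sort -u)

echo "$one_step"
```

Keeping uniq separate does have one practical advantage: it can later be swapped for uniq -c to get a count per abbreviation.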
Finally, I said I wanted to count the number of oblasts. One of the most useful
Unix utilities is a little program called wc: "word count." That's what it does.
Its output is three numbers: the number of lines, the number of words, and
the number of characters it has seen. For many small data projects, that's really
all you need.
$ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | wc
38 38 342
So, 38 unique oblasts. You can say wc -l if you only want to count the lines;
sometimes that's useful. Notice that we no longer need to end the pipeline
with head; we want wc to see all the data.
But I said I also wanted to know the number of oblasts on each ham band.
That's the first number (like 21000) in each log entry. So we're throwing out
too much data. We could fix that by adjusting colrm, but I promised a better
way to pull out individual columns of data. We'll use awk in a very simple way:
$ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
sort | uniq
14000 AD
14000 AL
14000 AN
...
awk is a very powerful tool; it's a complete programming language that can
do almost any kind of text manipulation. We could do everything we've seen
so far as an awk program. But rather than use it as a power tool, I'm just using
it to pull out the second and eleventh fields from my input. The single quotes
are needed around the awk program, to prevent the Unix shell from getting
confused. Within awk's print command, we need to explicitly include the
space, otherwise it will run the fields together.
The cut utility is another alternative to colrm and awk. It's designed for
removing portions of a file. cut isn't a full programming language, but it can
make more complex transformations than simply deleting a range of columns.
However, although it's a simple tool at heart, it can get tricky; I usually find
that, when colrm runs out of steam, it's best jumping all the way to awk.
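For comparison, here is a minimal sketch of the same two-field extraction with cut. cut splits on a single delimiter character, so runs of spaces are squeezed with tr -s first; that squeezing step is an assumption about the real log, whose spacing was edited for print. The sample line is from the log excerpt, with extra spaces added by hand to show why the squeeze is needed.

```shell
# One log line with doubled spaces in two places (added for illustration).
line='QSO: 21000 CW 2011-03-19 1235 W1JQ  599 0003 RG3K  599 VR'

# Squeeze repeated spaces, then keep fields 2 (band) and 11 (oblast).
band_oblast=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 2,11)

echo "$band_oblast"
```

Without the tr -s stage, each doubled space would count as an empty field and the field numbers would shift, which is the kind of trickiness that makes awk the easier choice.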
We're still a little short of our goal: how do we count the number of oblasts
on each band? At this point, I use a really cheesy solution: another grep,
followed by wc:
$ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
sort | uniq | grep 21000 | wc
20 40 180
$ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
sort | uniq | grep 14000 | wc
26 52 234
...
OK, 20 oblasts on the 21 MHz band, 26 on the 14 MHz band. And at this
point, there are two questions you really should be asking. First, why not put
grep 21000 first, and save the awk invocation? That's just how the script
developed. You could put the grep first, though you'd still need to strip extra
gunk from the file. Second: What if there are gigabytes of data? You have to
run this command for each band, and for some other project, you might need
to run it dozens or hundreds of times. That's a valid objection. To solve this
problem, you need a more complex awk script (which has associative arrays
in which you can save data), or you need a programming language such as Perl,
Python, or Ruby. At the same time, we've gotten fairly far with our data
exploration, using only the simplest of tools.
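A minimal sketch of what such an awk script might look like follows: it counts unique oblasts for every band in a single pass, using two associative arrays. The array names seen and count are illustrative choices, and the last two sample lines are made up (the others are from the log excerpt) so that a duplicate oblast appears on one band.

```shell
log=$(mktemp)
printf '%s\n' \
  'QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041' \
  'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR' \
  'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO' \
  'QSO: 14000 CW 2011-03-19 1240 W1JQ 599 0005 RG3K 599 VR' \
  'QSO: 21000 CW 2011-03-19 1250 W1JQ 599 0006 RG3K 599 VR' > "$log"

per_band=$(awk '
  # Keep only rows whose eleventh field is a two-letter code.
  $11 ~ /^[A-Z][A-Z]$/ {
    # Count each band/oblast pair once, the first time it is seen.
    if (!(($2, $11) in seen)) {
      seen[$2, $11] = 1
      count[$2]++
    }
  }
  # Print one line per band: the band and its number of unique oblasts.
  END { for (band in count) print band, count[band] }
' "$log" | sort -n)

echo "$per_band"
rm -f "$log"
```

The file is read once no matter how many bands it contains, which addresses the "run it for each band" objection. The final sort -n is there because awk does not guarantee any order when looping over an array.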
Now let's up the ante. Let's say that there are a number of directories with lots
of files in them, including these rudx-log.txt files. Let's say that these
directories are organized by year (2001, 2002, etc.). And let's say we want to count
oblasts across all the years for which we have records. How do we do that?
Here's where we need find. My first approach is to take the filename
(rudx-log.txt) out of the grep command, and replace it with a find command that
looks for every file named rudx-log.txt in subdirectories of the current
directory:
$ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print` |\
awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc
48 96 432
OK, so 48 oblasts on the 14 MHz band, lifetime. I thought I had done better
than that. What's happening, though? That find command is simply saying
"look at the current directory and its subdirectories, find files with the given
name, and print the output." The backquotes tell the Unix shell to use the
output of find as arguments to grep. So we're just giving grep a long list of files,
instead of just one. Note the -print option: on older versions of find, if it's not
there, find happily does nothing (current versions print by default).
We're almost done, but there are a couple of bits of hair you should worry
about. First, if you invoke grep with more than one file on the command line,
each line of output begins with the name of the file in which it found a match:
...
./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 \\
UA6YW 599 AD
./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 \\
RG3K 599 VR
...
We're lucky. grep just sticks the filename at the beginning of the line without
adding spaces, and we're using awk to print selected whitespace-separated
fields. So the number of any field didn't change. If we were using colrm, we'd
have to fiddle with things to find the right columns. If the filenames had
different lengths (reasonably likely, though not possible here), we couldn't use
colrm at all. Fortunately, you can suppress the filename by using grep -h.
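The difference is easy to see on a small scratch tree. This sketch recreates two of the yearly directories with one line each (the two entries shown in the listing above); the temporary directory is only there so the example can run anywhere.

```shell
work=$(mktemp -d)
mkdir -p "$work/2008" "$work/2009"
printf '%s\n' 'QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD' \
  > "$work/2008/rudx-log.txt"
printf '%s\n' 'QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR' \
  > "$work/2009/rudx-log.txt"

# Default with several files: each match is prefixed with "filename:".
with_names=$(grep '599 [A-Z][A-Z]' "$work"/*/rudx-log.txt | head -n 1)

# -h suppresses the prefix, so column positions are the same as for one file.
without_names=$(grep -h '599 [A-Z][A-Z]' "$work"/*/rudx-log.txt | head -n 1)

echo "$with_names"
echo "$without_names"
rm -rf "$work"
```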
The second piece of hair is less common, but potentially more troublesome.
If you look at the last command, what we're doing is giving the grep command
a really long list of filenames. How long is long? Can that list get too long? The
answers are "we don't know," and "maybe." In the nasty old days, things broke
when the command line got longer than a few thousand characters. These
days, who knows what's too long ... But we're doing "big data," so it's easy to
imagine the find command expanding to hundreds of thousands, even millions
of characters. More than that, our single Unix pipeline doesn't parallelize very
well; and if we really have big data, we want to parallelize it.
The answer to this problem is another old Unix utility, xargs. Xargs dates back
to the time when it was fairly easy to come up with file lists that were too long.
Its job is to break up command line arguments into groups and spawn as many
separate commands as needed, running in parallel if possible (-P). We'd use it
like this:
$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc
48 96 432
This command is actually a nice little map-reduce implementation: the xargs
command maps grep across all the cores on your machine (once you add -P),
and the output is reduced (combined) by the awk/sort/uniq chain. xargs has
lots of command line options, so if you want to be confused, read the man page.
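Here is a sketch of the same pipeline with the parallel option switched on. The -print0, -0 and -P options are extensions found in GNU and BSD find/xargs rather than in POSIX, so treat their availability as an assumption about your system. The scratch tree holds the two log lines from the listing above.

```shell
work=$(mktemp -d)
mkdir -p "$work/2008" "$work/2009"
printf '%s\n' 'QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD' \
  > "$work/2008/rudx-log.txt"
printf '%s\n' 'QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR' \
  > "$work/2009/rudx-log.txt"

# -print0 / -0 pass filenames NUL-separated, so odd characters in names are safe.
# -n 1 hands each grep one file; -P 2 lets two greps run at the same time.
# grep -h keeps filenames out of the output whatever the batch size.
count=$(find "$work" -name rudx-log.txt -print0 |
  xargs -0 -n 1 -P 2 grep -h '599 [A-Z][A-Z]' |
  awk '{print $2 " " $11}' | grep 14000 | sort -u | wc -l | tr -d ' ')

echo "oblasts on 14 MHz: $count"
rm -rf "$work"
```

Output from parallel greps can arrive in any order, which is harmless here because the sort stage follows. With -n 1 every file costs one process, so for many small files a larger batch size is probably the better trade.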
Another approach is to use find's -exec option to invoke arbitrary commands.
It's somewhat more flexible than xargs, though in my opinion, find -exec has
the sort of overly flexible but confusing syntax that's surprisingly likely to lead
to disaster. (It's worth noting that the examples for -exec almost always involve
automating bulk file deletion. Excuse me, but that's a recipe for heartache.
Take this from the guy who once deleted the business plan, then found that
the backups hadn't been done for about 6 months.) There's an excellent
tutorial for both xargs and find -exec at Softpanorama. I particularly like this
tutorial because it emphasizes testing to make sure that your command won't
run amok and do bad things (like deleting the business plan).
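In that cautious spirit, here is a read-only sketch of find -exec. It first previews the command by putting echo in front of it, then runs the real grep. The word PATTERN in the preview is just a placeholder, and the scratch tree again holds the two log lines from the listing above.

```shell
work=$(mktemp -d)
mkdir -p "$work/2008" "$work/2009"
printf '%s\n' 'QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD' \
  > "$work/2008/rudx-log.txt"
printf '%s\n' 'QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR' \
  > "$work/2009/rudx-log.txt"

# Dry run: echo prints the command line that would be executed.
preview=$(find "$work" -name rudx-log.txt -exec echo grep -h PATTERN {} +)
echo "$preview"

# Real run: "{} +" appends as many filenames as fit to a single grep.
matches=$(find "$work" -name rudx-log.txt \
  -exec grep -h '599 [A-Z][A-Z]' {} + | wc -l | tr -d ' ')
echo "matching lines: $matches"
rm -rf "$work"
```

The echo preview costs nothing and is worth keeping as a habit for any -exec whose command modifies or deletes files.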
That's not all. Back in the dark ages, I wrote a shell script that did a recursive
grep through all the subdirectories of the current directory. That's a good shell
programming exercise which I'll leave to the reader. More to the point, I've
noticed that there's now a -R option to grep that makes it recursive. Clever
little buggers ...
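A short sketch of the recursive form follows. The --include option, which restricts the search to files whose names match a pattern, is available in GNU and BSD grep but is not part of POSIX, so treat it as an assumption about your grep. The notes.txt decoy file is made up to show what the filter excludes.

```shell
work=$(mktemp -d)
mkdir -p "$work/2008" "$work/2009"
printf '%s\n' 'QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD' \
  > "$work/2008/rudx-log.txt"
printf '%s\n' 'QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR' \
  > "$work/2009/rudx-log.txt"
# A decoy file that also matches the pattern.
printf '%s\n' 'test entry 599 ZZ' > "$work/2009/notes.txt"

# Plain -R searches every file under the directory, decoy included.
all_files=$(grep -R -h '599 [A-Z][A-Z]' "$work" | wc -l | tr -d ' ')

# --include limits the recursion to files with the given name.
logs_only=$(grep -R -h --include='rudx-log.txt' '599 [A-Z][A-Z]' "$work" |
  wc -l | tr -d ' ')

echo "all files: $all_files, logs only: $logs_only"
rm -rf "$work"
```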
Before closing, I'd like to touch on a couple of tools that are a bit more exotic,
but which should be in your arsenal in case things go wrong. od -c gives a raw
dump of every character in your file. (-c says to dump characters, rather than
octal or hexadecimal). It's useful if you think your data is corrupted (it
happens), or if it has something in it that you didn't expect (it happens a LOT).
od will show you what's happening; once you know what the problem is, you
can fix it. To fix it, you may want to use sed. sed is a cranky old thing: more
than a hand tool, but not quite a power tool; sort of an antique treadle-operated
drill press. It's great for editing files on the fly, and doing batch edits. For
example, you might use it if NUL characters were scattered through the data.
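To make the diagnose-then-fix loop concrete, the sketch below dumps a tiny made-up record containing a stray NUL and a Windows line ending, then cleans it. Note that the cleanup uses tr rather than sed: deleting individual bytes is exactly what tr -d does, and NUL handling in sed differs between implementations.

```shell
# A tiny record with a stray NUL and a CRLF ending (made up for illustration).
dump=$(printf 'AD\000\r\n' | od -c)
echo "$dump"

# od -c shows the NUL as \0 and the carriage return as \r.
# tr -d deletes both bytes, leaving clean Unix text.
cleaned=$(printf 'AD\000\r\n' | tr -d '\000\r')
echo "$cleaned"
```

For edits that depend on context, such as changing a field only on certain lines, sed remains the more natural tool.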
Finally, a tool I just learned about (thanks, @dataspora): the pipe viewer, pv.
It isn't a standard Unix utility. It comes with some versions of Linux, but the
chances are that you'll have to install it yourself. If you're a Mac user, it's in
MacPorts. pv tells you what's happening inside the pipes as the command
progresses. Just insert it into a pipe like this:
$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
3.41kB 0:00:00 [ 20kB/s] [<=>
48 96 432
The pipeline runs normally, but you'll get some additional output that shows
the command's progress. If something's malfunctioning or performing
too slowly, you'll find out. pv is particularly good when you have huge amounts
of data, and you can't tell whether something has ground to a halt, or you just
need to go out for coffee while the command runs to completion.
Whenever you need to work with data, don't overlook the Unix "hand tools."
Sure, everything I've done here could be done with Excel or some other fancy
tool like R or Mathematica. Those tools are all great, but if your data is living
in the cloud, using these tools is possible, but painful. Yes, we have remote
desktops, but remote desktops across the Internet, even with modern
high-speed networking, are far from comfortable. Your problem may be too large
to use the hand tools for final analysis, but they're great for initial explorations.
Once you get used to working on the Unix command line, you'll find that it's
often faster than the alternatives. And the more you use these tools, the more
fluent you'll become.
Oh yeah, that broken data file that would have made this exercise superfluous?
Someone emailed it to me after I wrote these scripts. The scripting took less
than 10 minutes, start to finish. And, frankly, it was more fun.
Hadoop: What it is, how it works, and what it can do
Cloudera CEO Mike Olson on Hadoop's architecture and its data
applications.
by James Turner
Hadoop (http://hadoop.apache.org/) gets a lot of buzz these days in database and
content management circles, but many people in the industry still don't really
know what it is or how it can be best applied.
Cloudera CEO and Strata speaker Mike Olson, whose company offers an
enterprise distribution of Hadoop and contributes to the project, discusses
Hadoop's background and its applications in the following interview.
Where did Hadoop come from?
Mike Olson: The underlying technology was invented by Google back in their
earlier days so they could usefully index all the rich textual and structural
information they were collecting, and then present meaningful and actionable
results to users. There was nothing on the market that would let them do that,
so they built their own platform. Google's innovations were incorporated into
Nutch, an open source project, and Hadoop was later spun off from that.
Yahoo has played a key role developing Hadoop for enterprise applications.
What problems can Hadoop solve?
Mike Olson: The Hadoop platform was designed to solve problems where
you have a lot of data (perhaps a mixture of complex and structured data)
and it doesn't fit nicely into tables. It's for situations where you want to run
analytics that are deep and computationally extensive, like clustering and
targeting. That's exactly what Google was doing when it was indexing the web
and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate
portfolio evaluation and risk analysis, you can build sophisticated models that
are hard to jam into a database engine. But Hadoop can handle it. In online
retail, if you want to deliver better search answers to your customers so they're
more likely to buy the thing you show them, that sort of problem is well
addressed by the platform Google built. Those are just a few examples.
Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif.,
will focus on the business and practice of data. The conference will provide
three days of training, breakout sessions, and plenary discussions, along with
an Executive Summit, a Sponsor Pavilion, and other events showcasing the
new data ecosystem.
Save 30% off registration with the code STR11RAD
How is Hadoop architected?
Mike Olson: Hadoop is designed to run on a large number of machines that
don't share any memory or disks. That means you can buy a whole bunch of
commodity servers, slap them in a rack, and run the Hadoop software on each
one. When you want to load all of your organization's data into Hadoop, what
the software does is bust that data into pieces that it then spreads across your
different servers. There's no one place where you go to talk to all of your data;
Hadoop keeps track of where the data resides. And because there are multiple
copy stores, data stored on a server that goes offline or dies can be
automatically replicated from a known good copy.
In a centralized database system, you've got one big disk connected to four or
eight or 16 big processors. But that is as much horsepower as you can bring to
bear. In a Hadoop cluster, every one of those servers has two or four or eight
CPUs. You can run your indexing job by sending your code to each of the
dozens of servers in your cluster, and each server operates on its own little
piece of the data. Results are then delivered back to you in a unified whole.
That's MapReduce: you map the operation out to all of those servers and then
you reduce the results back into a single result set.
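The map-then-reduce shape can be illustrated on a single machine, in the spirit of the xargs pipeline from "Data hand tools" earlier in this chapter. This is not Hadoop code: the three part-N files are made-up stand-ins for blocks of data spread across servers, each grep plays the role of a map task, and awk plays the reducer. The -print0, -0 and -P options are GNU/BSD extensions.

```shell
work=$(mktemp -d)
# Three small "blocks" standing in for pieces of data on different servers.
printf 'error\nok\nerror\n' > "$work/part-0"
printf 'ok\nerror\n'        > "$work/part-1"
printf 'error\n'            > "$work/part-2"

# Map: one grep per block, up to three at once; each prints a partial count.
# Reduce: awk adds the partial counts into a single result.
total=$(find "$work" -name 'part-*' -print0 |
  xargs -0 -n 1 -P 3 grep -c 'error' |
  awk '{ sum += $1 } END { print sum }')

echo "total matches: $total"
rm -rf "$work"
```

What the analogy leaves out is everything that makes Hadoop hard: moving code to the machine that holds the data, tracking where blocks live, and recovering when a server dies.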
Architecturally, the reason you're able to deal with lots of data is because
Hadoop spreads it out. And the reason you're able to ask complicated
computational questions is because you've got all of these processors, working in
parallel, harnessed together.
At this point, do companies need to develop their own Hadoop applications?
Mike Olson: It's fair to say that a current Hadoop adopter must be more
sophisticated than a relational database adopter. There are not that many
"shrink wrapped" applications today that you can get right out of the box and
run on your Hadoop processor. It's similar to the early '80s when Ingres and
IBM were selling their database engines and people often had to write
applications locally to operate on the data.
That said, you can develop applications in a lot of different languages that run
on the Hadoop framework. The developer tools and interfaces are pretty
simple. Some of our partners (Informatica is a good example) have ported
their tools so that they're able to talk to data stored in a Hadoop cluster using
Hadoop APIs. There are specialist vendors that are up and coming, and there
are also a couple of general process query tools: a version of SQL that lets you
interact with data stored on a Hadoop cluster, and Pig, a language developed
by Yahoo that allows for data flow and data transformation operations on a
Hadoop cluster.
Hadoop's deployment is a bit tricky at this stage, but the vendors are moving
quickly to create applications that solve these problems. I expect to see more
of the shrink-wrapped apps appearing over the next couple of years.
Where do you stand in the SQL vs NoSQL debate?
Mike Olson: I'm a deep believer in relational databases and in SQL. I think
the language is awesome and the products are incredible.
I hate the term "NoSQL." It was invented to create cachet around a bunch of
different projects, each of which has different properties and behaves in
different ways. The real question is, what problems are you solving? That's what
matters to users.
Four free data tools for journalists (and snoops)
A look at free services that reveal traffic data, server details and
popularity.
by Pete Warden
Note: The following is an excerpt from Pete Warden's free ebook "Where are the
bodies buried on the web? Big data for journalists."
There's been a revolution in data over the last few years, driven by an
astonishing drop in the price of gathering and analyzing massive amounts of
information. It only cost me $120 to gather, analyze and visualize 220 million public
Facebook profiles, and you can use 80legs to download a million web pages
for just $2.20. Those are just two examples.
The technology is also getting easier to use. Companies like Extractiv and
Needlebase are creating point-and-click tools for gathering data from almost
any site on the web, and every other stage of the analysis process is getting
radically simpler too.
What does this mean for journalists? You no longer have to be a technical
specialist to find exciting, convincing and surprising data for your stories. For
example, the following four services all easily reveal underlying data about web
pages and domains.
WHOIS
Many of you will already be familiar with WHOIS, but it's so useful for research
it's still worth pointing out. If you go to this site (or just type "whois
www.example.com" in Terminal.app on a Mac) you can get the basic registration
information for any website. In recent years, some owners have chosen "private"
registration, which hides their details from view, but in many cases you'll see
a name, address, email and phone number for the person who registered the
site.
You can also enter numerical IP addresses here and get data on the organization
or individual that owns that server. This is especially handy when you're trying
to track down more information on an abusive or malicious user of a service,
since most websites record an IP address for everyone who accesses them.
Blekko
Blekko is the newest search engine in town, and one of its selling points is the
richness of the data it offers. If you type in a domain name followed by /seo,
you'll receive a page of statistics on that URL:
The first tab shows other sites that are linking to the current domain, in
popularity order. This can be extremely useful when you're trying to understand
what coverage a site is receiving, and if you want to understand why it's ranking
highly in Google's search results, since they're based on those inbound links.
Inclusion of this information would have been an interesting addition to the
recent DecorMyEyes story, for example.
The othei hanuy taL is ¨Ciawl stats,¨ especially the ¨Cohosteu with¨ section:
This tells you which othei weLsites aie iunning liom the same machine. It`s
common loi scammeis anu spammeis to astiotuil theii way towaiu legitimacy
Ly Luiluing multiple sites that ieview anu link to each othei. They look like
Four free data tools for journalists (and snoops) | 45
inuepenuent uomains, anu may even have uilleient iegistiation uetails, Lut
olten they`ll actually live on the same seivei Lecause that`s a lot cheapei. These
statistics give you an insight into the hiuuen Lusiness stiuctuie ol shauy op-
eiatois.
bit.ly
I always turn to bit.ly when I want to know how people are sharing a particular
link. To use it, enter the URL you're interested in:
Then click on the 'Info Page+' link:
That takes you to the full statistics page (though you may need to choose
"aggregate bit.ly link" first if you're signed in to the service).
This will give you an idea of how popular the page is, including activity on
Facebook and Twitter. Below that you'll see public conversations about the
link provided by backtype.com.
I find this combination of traffic data and conversations very helpful when I'm
trying to understand why a site or page is popular, and who exactly its fans
are. For example, it provided me with strong evidence that the prevailing
narrative about grassroots sharing and Sarah Palin was wrong.
[Disclosure: O'Reilly AlphaTech Ventures is an investor in bit.ly.]
Compete
By surveying a cross-section of American consumers, Compete builds up
detailed usage statistics for most websites, and they make some basic details
freely available.
Choose the "Site Profile" tab and enter a domain:
You'll then see a graph of the site's traffic over the last year, together with
figures for how many people visited, and how often.
Since they're based on surveys, Compete's numbers are only approximate.
Nonetheless, I've found them reasonably accurate when I've been able to
compare them against internal analytics.
Compete's stats are a good source when comparing two sites. While the
absolute numbers may be off for both sites, Compete still offers a decent
representation of the sites' relative difference in popularity.
One caveat: Compete only surveys U.S. consumers, so the data will be poor
for predominantly international sites.
Additional data resources and tools are discussed in Pete's free ebook.
The quiet rise of machine learning
Alasdair Allan on how machine learning is taking over the mainstream.
by Jenn Webb
The concept of machine learning was brought to the forefront for the general
masses when IBM's Watson computer appeared on Jeopardy and wiped the
floor with humanity. For those same masses, machine learning quickly faded
from view as Watson moved out of the spotlight ... or so they may think.
Machine learning is slowly and quietly becoming democratized. Goodreads,
for instance, recently purchased Discovereads.com, presumably to make use
of its machine learning algorithms to make book recommendations.
To find out more about what's happening in this rapidly advancing field, I
turned to Alasdair Allan, an author and senior research fellow in Astronomy
at the University of Exeter. In an email interview, he talked about how machine
learning is being used behind the scenes in everyday applications. He also
discussed his current eSTAR intelligent robotic telescope network project and
how that machine learning-based system could be used in other applications.
In what ways is machine learning being used?
Alasdair Allan: Machine learning is quietly taking over in the mainstream.
Orbitz, for instance, is using it behind the scenes to optimize caching of hotel
prices, and Google is going to roll out smarter advertisements. Much of the
machine learning that consumers are seeing and using every day is invisible to
them.
The interesting thing about machine learning right now is that research in the
field is going on quietly as well because large corporations are tied up in
non-disclosure agreements. While there is a large amount of academic literature
on the subject, it's actually hard to tell whether this open research is actually
current.
Oddly, machine learning research mirrors the way cryptography research
developed around the middle of the 20th century. Much of the cutting edge
research was done in secret, and we're only finding out now, 40 or 50 years later,
what GCHQ or the NSA was doing back then. I'm hopeful that it won't take
quite that long for Amazon or Google to tell us what they're thinking about
today.
How does your eSTAR intelligent robotic telescope network work?
Alasdair Allan: My work has focused on applying intelligent agent
architectures and techniques to astronomy for telescope control and scheduling,
and also for data mining. I'm currently leading the work at Exeter building a
peer-to-peer distributed network of telescopes that, acting entirely
autonomously, can reactively schedule observations of time-critical transient
events in real-time. Notable successes include contributing to the detection of
the most distant object yet discovered, a gamma-ray burster at a redshift of 8.2.
A diagram showing how the eSTAR network operates. The Intelligent Agents
access telescopes and existing astronomical databases through the Grid.
CREDIT: Joint Astronomy Centre. Eta Carinae image courtesy of N. Smith (U.
Colorado), J. Morse (Arizona State U.), and NASA.
All the components of the system are thought of as agents, effectively
"smart" pieces of software. Negotiation takes place between the agents in the
system. Each of the resources bids to carry out the work, with the science agent
scheduling the work with the agent embedded at the resource that promises
to return the best result.
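The bidding round described above can be sketched in a few lines of Python.
This is only an illustration of the general idea (each resource decides locally
whether and how to bid, and the science agent picks the most promising offer).
It is not eSTAR's actual protocol, and the class names, the single "score"
figure, and the sample telescopes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    resource: str
    score: float  # the resource's own estimate of result quality; higher is better


class TelescopeAgent:
    """Resource-side agent: decides locally whether, and how well, it can observe."""

    def __init__(self, name: str, available: bool, sky_quality: float):
        self.name = name
        self.available = available
        self.sky_quality = sky_quality  # hypothetical 0..1 figure

    def bid(self, target: str) -> Optional[Bid]:
        if not self.available:
            return None  # a resource is free to decline
        return Bid(self.name, self.sky_quality)


class ScienceAgent:
    """User-side agent: collects bids and schedules with the best one."""

    def schedule(self, target: str, resources: List[TelescopeAgent]) -> Optional[Bid]:
        bids = [b for b in (r.bid(target) for r in resources) if b is not None]
        return max(bids, key=lambda b: b.score, default=None)


if __name__ == "__main__":
    telescopes = [
        TelescopeAgent("north", True, 0.6),
        TelescopeAgent("south", True, 0.9),
        TelescopeAgent("east", False, 1.0),  # unavailable, so it never bids
    ]
    print(ScienceAgent().schedule("transient-event", telescopes))
```

Note that nothing in the science agent ranks the telescopes directly; it only
compares the bids the resources choose to make.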
This architectural distinction of viewing both sides of the negotiation as
agents, and as equals, is crucial. Importantly, this preserves the autonomy of
individual resources to implement observation scheduling at their facilities as
they see fit, and it offers increased adaptability in the face of asynchronously
arriving data.
The system is a meta-network that layers communication, negotiation, and
real-time analysis software on top of existing telescopes, allowing scheduling
and prioritization of observations to be done locally. It is flat, peer-to-peer,
and owned and operated by disparate groups with their own goals and
priorities. There is no central master-scheduler overseeing the network;
optimization arises through emerging complexity and social convention.
How could the ideas behind eSTAR be applied elsewhere?
Alasdair Allan: Essentially what I've built is a geographically distributed
sensor architecture. The actual architectures I've used to do this are entirely
generic. Fundamentally, it's just a peer-to-peer distributed system for
optimizing scarce resources in real-time in the face of a constantly changing
environment.
The architectures are therefore equally applicable to other systems. The most
obvious use case is sensor motes. Cheap, possibly even disposable, single-use,
mesh-networked sensor bundles could be distributed over a large geographic
area to get situational awareness quickly and easily. Despite the underlying
hardware differences, the same distributed machine learning-based
architectures can be used.
At February's Strata conference, Alasdair Allan discussed the ambiguity
surrounding a formal definition of machine learning:
http://youtube.com
This interview was edited and condensed.
Where the semantic web stumbled, linked data will
succeed
Linked data allows for deep and serendipitous consumer experiences.
by Tyler Bell
In the same way that the Holy Roman Empire was neither holy nor Roman,
Facebook's OpenGraph Protocol is neither open nor a protocol. It is, however,
an extremely straightforward and applicable standard for document metadata.
From a strictly semantic viewpoint, OpenGraph is considered hardly worthy
of comment: it is a frankenstandard, a mishmash of microformats and loosely-
typed entities, lobbed casually into the semantic web world with hardly a
backward glance.
But this is not important. While OpenGraph avoids, or outright ignores, many
of the problematic issues surrounding semantic annotation (see Alex Iskold's
excellent commentary on OpenGraph here on Radar), criticism focusing only
on its technical purity is missing half of the equation. Facebook gets it right
where other initiatives have failed. While OpenGraph is incomplete and
imperfect, it is immediately usable and sympathetic with extant approaches.
Most importantly, OpenGraph is one component in a wider ecosystem. Its
deployment benefits are apparent to the consumer and the developer: add the
metatags, get the "likes," know your customers.
Such consumer causality is critical to the adoption of any semantic mark-up.
We've seen it before with microformats, whose eventual popularity was driven
by their ability to improve how a page is represented in search engine
listings, and not by an abstract desire to structure the unstructured. Successful
adoption will often entail sacrificing standardization and semantic purity for
pragmatic ease-of-use; this is where the semantic web appears to have
stumbled, and where linked data will most likely succeed.
Linked data intends to make the Web more interconnected and data-oriented.
Beyond this outcome, the term is less rigidly defined. I would argue that linked
data is more of an ethos than a standard, focused on providing context,
assisting in disambiguation, and increasing serendipity within the user
experience. This idea of linked data can be delivered by a number of existing
components that work together on the data, platform, and application levels:
• Entity provision: Defining the who, what, where and when of the
Internet, entities encapsulate meaning and provide context by type. In its most
basic sense, an entity is one row in a list of things organized by type (such
as people, places, or products), each with a unique identifier.
Organizations that realize the benefits of linked data are releasing entities like
never before, including the publication of 10,000 subject headings by the New
York Times, admin regions and postcodes from the UK's Ordnance Survey,
placenames from Yahoo GeoPlanet, and the data infrastructures being
created by Factual [disclosure: I've just signed on with Factual].
• Entity annotation: There are numerous formats for annotating entities
when they exist in unstructured content, such as a web page or blog post.
Facebook's OpenGraph is a form of entity annotation, as are HTML5
microdata, RDFa, and microformats such as hcard. Microdata is the shiny,
new player in the game, but see Evan Prodromou's great post on RDFa v.
microformats for a breakdown of these two more established approaches.
• Endpoints and Introspection: Entities contribute best to a linked data
ecosystem when each is associated with a Uniform Resource Identifier
(URI), an Internet-accessible, machine readable endpoint. These
endpoints should provide introspection, the means to obtain the properties of
that entity, including its relationship to others. For example, the Ordnance
Survey URI for the "City of Southampton" is
http://data.ordnancesurvey.co.uk/id/7000000000037256. Its properties can be
retrieved in machine-readable format (RDF/XML, Turtle and JSON) by
appending an "rdf," "ttl," or "json" extension to the above. To be properly
open, URIs must be accessible outside a formal API and authentication
mechanism, exposed to semantically-aware web crawlers and search tools
such as Yahoo BOSS. Under this definition, local business URLs, for
example, can serve in-part as URIs: 'view source' to see the semi-structured
data in these listings from Yelp (using hcard and OpenGraph), and
Foursquare (using microdata and OpenGraph).
• Entity extraction: Some linked data enthusiasts long for the day when
all content is annotated so that it can be understood equally well by
machines and humans. Until we get to that happy place, we will continue to
rely on entity extraction technologies that parse unstructured content for
recognizable entities, and make contextually intelligent identifications of
their type and identifier. Named entity recognition (NER) is one approach
that employs the above entity lists, which may also be combined with
heuristic approaches designed to recognize entities that lie outside of a
known entity list. Yahoo, Google and Microsoft are all hugely interested
in this area, and we'll see an increasing number of startups like
Semantinet emerge with ever-improving precision and recall. If you want to see
how entity extraction works first-hand, check out Reuters-owned Open
Calais and experiment with their form-based tool.
• Entity concordance and crosswalking: The multitude of place
namespaces illustrates how a single entity, such as a local business, will reside
in multiple lists. Because an identifier is unique only within a given
namespace, a world driven by linked data requires systems that
explicitly match a single entity across namespaces. Examples of crosswalking
services include: Placecast's Match API, which returns the Placecast IDs
of any place when supplied with an hcard equivalent; Yahoo's
Concordance, which returns the Where on Earth Identifier (WOEID) of a place
using as input the place ID of one of fourteen external resources, including
OpenStreetMap and Geonames; and the Guardian Content API, which
allows users to search Guardian content using non-Guardian identifiers.
These systems are the unsung heroes of the linked data world, facilitating
interoperability by establishing links between identical entities across
namespaces. Huge, unrealized value exists within these applications, and
we need more of them.
• Relationships: Entities are only part of the story. The real power of the
semantic web is realized in knowing how entities of different types relate
to each other: actors to movies, employees to companies, politicians to
donors, restaurants to neighborhoods, or brands to stores. The power of
all graphs (these networks of entities) is not in the entities themselves
(the nodes), but how they relate together (the edges). However, I may be
alone in believing that we need to nail the problem of multiple instances
of the same entity, via concordance and crosswalking, before we can tap
properly into the rich vein that entity relationships offer.
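To make the concordance and crosswalking item above concrete, here is a
minimal Python sketch of a crosswalk table. It assumes each identifier is a
(namespace, id) pair and uses a union-find structure so that identifiers
asserted to be the same entity end up in one group. The class, its methods,
and every identifier in the example are hypothetical; this is not how
Placecast, Yahoo, or the Guardian implement their services.

```python
class Crosswalk:
    """Groups (namespace, id) pairs that refer to the same real-world entity."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unknown keys as their own group, then walk up to the root.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            self._parent[key] = self._parent[self._parent[key]]  # path halving
            key = self._parent[key]
        return key

    def link(self, a, b):
        """Assert that identifiers a and b name the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def lookup(self, key, namespace):
        """Return the ids in `namespace` known to match the entity named by `key`."""
        root = self._find(key)
        return sorted(
            ident
            for (ns, ident) in list(self._parent)
            if ns == namespace and self._find((ns, ident)) == root
        )


if __name__ == "__main__":
    cw = Crosswalk()
    # Made-up identifiers for one place as it appears in three namespaces.
    cw.link(("woeid", "12345"), ("osm", "n678"))
    cw.link(("osm", "n678"), ("geonames", "999"))
    print(cw.lookup(("geonames", "999"), "woeid"))
```

Links are transitive here: matching A to B and B to C is enough to answer a
query from C back to A. A production service would also need to record who
asserted each match and how confident the match is.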
The approaches outlined above combine to help publishers and application
developers provide intelligent, deep and serendipitous consumer experiences.
Examples include the semantic handset from Aro Mobile, the BBC's World
Cup experience, and aggregating references on your Facebook news feed.
Linked data will triumph in this space because efforts to date focus less on the
how and more on the why. RDF, SPARQL, OWL, and triple stores are onerous.
URIs, micro-formats, RDFa, and JSON, less so. Why invest in difficult
technologies if consumer outcomes can be realized with extant tools and
knowledge? We have the means to realize linked data now: the pieces of the puzzle
are there and we (just) need to put them together.
Linked data is, at last, bringing the discussion around to the user. The
consumer "end" trumps the semantic "means."
Social data is an oracle waiting for a question
"Mining the Social Web" author Matthew Russell on the questions and
answers social data can handle.
by Mac Slocum
We're still in the stage where access to massive amounts of social data has
novelty. That's why companies are pumping out APIs and services are
popping up to capture and sort all that information. But over time, as the novelty
fades and the toolsets improve, we'll move into a new phase that's defined by
the application of social data. Access will be implied. It's what you do with the
data that will matter.
Matthew Russell (@ptwobrussell), author of "Mining the Social Web" and a
speaker at the upcoming Where 2.0 Conference, has already rounded that
corner. In the following interview, Russell discusses the tools and the mindset
that can unlock social data's real utility.
How do you define the "social web"?
Matthew Russell: The "social web" is admittedly a notional entity with some
blurry boundaries. There isn't a Venn diagram that carves the "social web" out
of the overall web fabric. The web is inherently a social fabric, and it's getting
more social all the time.
The distinction I make is that some parts of the fabric are much easier to access
than others. Naturally, the platforms that expose their data with well-defined
APIs will be the ones to receive the most attention and capture the mindshare
when someone thinks of the "social web."
In that regard, the social web is more of a heatmap where the hot areas are
popular social networking hubs like Twitter, Facebook, and LinkedIn. Blogs,
mailing lists, and even source code repositories such as SourceForge and
GitHub, however, are certainly part of the social web.
What sorts of questions can social data answer?
Matthew Russell: Here are some concrete examples of questions I asked (and
answered) in "Mining the Social Web":
• What's your potential influence when you tweet?
• What does Justin Bieber have (or not have) in common with the Tea Party?
• Where does most of your professional network geographically reside, and
how might this impact career decisions?
• How do you summarize the content of blog posts to quickly get the gist?
• Which of your friends on Twitter, Facebook, or elsewhere know one
another, and how well?
It's not hard at all to ask lots of valuable questions against social web data and
answer them with high degrees of certainty. The most popular sources of social
data are popular because they're generally platforms that expose the data
through well-crafted APIs. The effect is that it's fairly easy to amass the data
that you need to answer questions.
With the necessary data in hand to answer your questions, the selection of a
programming language, toolkit, and/or framework that makes shaking out the
answer easy is a critical step that shouldn't be taken lightly. The more efficient
it is to test your hypotheses, the more time you can spend analyzing your data.
Spending sufficient time in analysis engenders the kind of creative freedom
needed to produce truly interesting results. This is why organizations like
Infochimps and GNIP are filling a critical void.
Where 2.0: 2011, being held April 19-21 in Santa Clara, Calif., will explore
the intersection of location technologies and trends in software development,
business strategies, and marketing.
Save 25% on registration with the code WHR11RAD
What programming skills or development background do you need to effectively
analyze social data?
Matthew Russell: A basic programming background definitely helps, because
it allows you to automate so many of the mundane tasks that are involved in
getting the data and munging it into a normalized form that's easy to work
with. That said, the lack of a programming background should be among the
last things that stops you from diving head first into social data analysis. If
you're sufficiently motivated and analytical enough to ask interesting
questions, there's a very good chance you can pick up an easy language, like Python
or Ruby, and learn enough to be dangerous over a weekend. The rest will take
care of itself.
Why did you opt to use GitHub to share the example code from the book?
Matthew Russell: GitHub is a fantastic source code management tool, but
the most interesting thing about it is that it's a social coding repository. What
GitHub allows you to do is share code in such a way that people can clone
your code repository. They can make improvements or fork the examples into
an entirely new form, and then share those changes with the rest of the world
in a very transparent way.
If you look at the project I started on GitHub, you can see exactly who did
what with the code, whether I incorporated their changes back into my own
repository, whether someone else has done something novel by using an
example listing as a template, etc. You end up with a community of people that
emerge around common causes, and amazing things start to happen as these
people share and communicate about important problems and ways to solve
them.
While I of course want people to buy the book, all of the source code is out
there for the taking. I hope people put it to good use.
The challenges of streaming real-time data
Jud Valeski on how Gnip handles the Twitter fire hose.
by Audrey Watters
Although Gnip handles real-time streaming of data from a variety of social
media sites, it's best known as the official commercial provider of the Twitter
activity stream.
Frankly, "stream" is a misnomer. "Fire hose," the colloquial variation, better
represents the torrent of data Twitter produces. That hose pumps out around
155 million tweets per day, and it's all addressed at a sustained rate.
I recently spoke with Gnip CEO Jud Valeski (@jvaleski) about what it takes
to manage Twitter's flood of data and how the Internet's architecture needs
to adapt to real-time needs. Our interview follows.
The Internet wasn't really built to handle a river of big data. What are the
architectural challenges of running real-time data through these pipes?
Jud Valeski: The most significant challenge is rusty infrastructure. Just as with
many massive infrastructure projects that the world has seen, adopted, and
exploited (aqueducts, highways, power/energy grids), the connective tissue of
the network becomes excruciatingly dated. We're lucky to have gotten as far
as we have on it. The capital build-outs on behalf of the telecommunications
industry have yielded relatively low-bandwidth solutions laden with false
advertising about true throughput. The upside is that highly transactional HTTP
REST apps are relatively scalable in this environment and they "just work." It
isn't until we get into heavy payload apps (video streaming, large-scale
activity fire hoses like Twitter) that the deficiencies in today's network get put
in the spotlight. That's when the pipes begin to burst.
We can redesign applications to create smaller activities/actions in order to
reduce overall sizes. We can use tighter protocols/formats (Protocol Buffers
for example), and compression to minimize sizes as well. However, with the
ever-increasing usage of social networks generating more "activities," we're
running into true pipe capacity limits, and those limits often come with very
hard stops. Typical business-class network connections don't come close to
handling high volumes, and you can forget about consumer-class connections
handling them.
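As a rough illustration of the size reduction mentioned above, the following
Python sketch serializes a batch of activities compactly and compresses it with
zlib from the standard library. Compact JSON stands in here for a tighter
binary format such as Protocol Buffers, which needs a third-party package and
a schema; the activity fields and function names are invented for the example.

```python
import json
import zlib


def encode_batch(activities):
    """Serialize without extra whitespace, then compress the bytes."""
    compact = json.dumps(activities, separators=(",", ":")).encode("utf-8")
    return zlib.compress(compact, 6)


def decode_batch(payload):
    """Reverse encode_batch: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical activities; real payloads carry many more fields.
    activities = [
        {"id": i, "verb": "post", "actor": "user%d" % i, "body": "good food downtown"}
        for i in range(200)
    ]
    raw = json.dumps(activities).encode("utf-8")
    packed = encode_batch(activities)
    print(len(raw), "bytes raw,", len(packed), "bytes packed")
    assert decode_batch(packed) == activities
```

Batching matters as much as the codec: compressing many similar activities
together lets the compressor exploit the repeated field names, at the cost of
a little added latency while the batch fills.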
Strata Conference New York 2011, being held Sept. 22-23, covers the latest
and best tools and technologies for data science, from gathering, cleaning,
analyzing, and storing data to communicating data intelligence effectively.
Save 30% on registration with the code STN11RAD
Beyond infrastructure issues, as engineers, the web app programming we've
been doing over the past 15 years has taught us to build applications in a highly
synchronous transactional manner. Because each HTTP transaction generally
only lasts a second or so at most, it's easy to digest and process many discrete
chunks of data. However, the bastard stepchild of every HTTP lib's "get()"
routine that returns the complete result, is the "read()" routine that only gives
you a poorly bounded chunk.
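The "poorly bounded chunk" problem can be shown with a small Python sketch.
It assumes the stream carries one record per line, separated by "\r\n", with
blank lines used as keep-alives; that framing, the function name, and the
sample chunks are assumptions for illustration, not a description of any
particular provider's wire format.

```python
def iter_records(chunks, delimiter=b"\r\n"):
    """Yield complete records from an iterable of arbitrarily split byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            record, sep, rest = buffer.partition(delimiter)
            if not sep:
                break  # no complete record yet; wait for the next chunk
            buffer = rest
            if record:  # skip blank keep-alive lines
                yield record
    # Whatever is still in the buffer is an incomplete record and is not yielded.


if __name__ == "__main__":
    # Chunk boundaries fall in the middle of records and of the delimiter itself.
    chunks = [b'{"id":1}\r', b'\n{"id"', b':2}\r\n\r\n{"id":3', b'}\r\n{"par']
    for record in iter_records(chunks):
        print(record)
```

The consumer never sees chunk boundaries, only whole records. A client that
reconnects after a dropped connection should discard the leftover buffer,
since the partial record it holds will not be completed.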
You would be shocked at the ratio of engineers who can't build event-driven,
asynchronous data processing applications, to those who can, yet this is a big
part of this space. Lack of ecosystem knowledge around these kinds of
programming primitives is a big problem. Many higher level abstractions exist for
streaming HTTP apps, but they're not industrial strength, and therefore you
have to really know what's going on to build your own.
Shifting back to infrastructure: Often the bigger issue plaguing the network
itself is one of latency, not throughput. While data tends to move quickly once
streaming connections are established, inevitable reconnects create gaps. The
longer those connections take to stand up, the bigger the gaps. Run a traceroute
to your favorite API and see how many hops you take. It's not pretty. Latencies
on the network are generally a function of router and gateway clutter, as our
packets bounce across a dozen servers just to get to the main server and then
back to the client.
How is Gnip addressing these issues?
Jud Valeski: On the infrastructure side, we are trying (successfully to-date)
to use existing, relatively off the shelf, back plane network topologies in the
cloud to build our systems. We live on EC2 Larges and XLs to ensure dedicated
NICs in our clusters. That helps with the router and gateway clutter. We're
also working with Amazon to ensure seamless connection upgrades as volumes
increase. These are use cases they actually want to solve at a platform level, so
our incentives are nicely aligned. We also play at the IP-stack level to ensure
packet transmission is optimized for constant high-volume streams.
Once total volumes move past standard inbound and outbound connection
capabilities, we will be offering dedicated interconnects. However, those come
at a very steep price for us and our volume customers.
All of this leads me to my real answer: Trimming the fat.
While a sweet spot for us is certainly high-volume data consumers, there are
many folks who don't want volume, they want coverage. Coverage of just the
activities they care about; usually their customers' brands or products. We
take on the challenge of digesting and processing the high volume on inbound,
and distill the stream down to just the bits our coverage customers desire. You
may need 100% of the activities that mention "good food," but that obviously
isn't 100% of a publisher's fire hose. Processing high-velocity root streams on
behalf of hundreds of customers without adversely impacting latency takes a
lot of work. Today, that means good ol'-fashioned engineering.
What tools and infrastructure changes are needed to better handle big-data
streaming?
Jud Valeski: "Big data" as we talk about it today has been slayed by lots of
cool abstractions (e.g. Hadoop) that fit nicely into the way we think about the
stack we all know and love. "Big streams," on the other hand, challenge the
parallelization primitives folks have been solving for "big data." There's very
little overlap, unfortunately. So, on the software solution side, better and more
widely used frameworks are needed. Companies like BackType and Gnip
pushing their current solutions onto the network for open refinement would
be an awesome step forward. I'm intrigued by the prospect of BackType's
Storm project, and I'm looking forward to seeing more of it. More brains lead
to better solutions.
We shouldn't be giving CPU and network latency injection a second thought,
but we have to. The code I write to process bits as they come off the wire
(quickly) should just "go fast," regardless of its complexity. That's too hard
today. It requires too much custom code.
On the infrastructure side of things, ISPs need to provide cheaper access to
reliable fat pipes. If they don't, software will outpace their lack of innovation.
To be clear, they don't get this and the software will lap them. You asked what
I think we need, not what I think we'll actually get.
This interview was edited and condensed.
CHAPTER 2
Data Issues
Why the term “data science” is flawed but useful
Counterpoints to four common data science criticisms.
by Pete Warden
Mention "data science" to a lot of the high-profile people you might think
practice it and you're likely to see rolling eyes and shaking heads. It has taken
me a while, but I've learned to love the term, despite my doubts. The key reason
is that the rest of the world understands roughly what I mean when I use it.
After years of stumbling through long-winded explanations about what I do,
I can now say "I'm a data scientist" and move on. It is still an incredibly hazy
definition, but my former descriptions left people confused as well, so this
approach is no worse and at least saves time.
With that in mind, here are the arguments I've heard against the term, and
why I don't think they should stop its adoption.
It’s not a real science
I just finished reading "The Philosophical Breakfast Club," the story of four
Victorian friends who created the modern structure of science, as well as
inventing the word "scientist." I grew up with the idea that physics, chemistry
and biology were the only real sciences and every other subject using the term
was just stealing their clothes ("Anything that needs science in the name is not
a real science"). The book shows that from the beginning the label was never
restricted to just the hard experimental sciences. It was chosen to promote a
disciplined approach to reasoning that relied on data rather than the poorly-
supported logical deductions many contemporaries favored. Data science fits
comfortably in this more open tradition.
OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering
for developers who are hands-on, doing the systems work and evolving
architectures and tools to manage data. (This event is co-located with OSCON.)
Save 20% on registration with the code OS11RAD
It’s an unnecessary label
To me, it's obvious that there has been a massive change in the landscape over
the last few years. Data and the tools to process it are suddenly abundant and
cheap. Thousands of people are exploiting this change, making things that
would have been impossible or impractical before now, using a whole new set
of techniques. We need a term to describe this movement, so we can create
job ads, conferences, training and books that reach the right people. Those
goals might sound very mundane, but without an agreed-upon term we just
can't communicate.
The name doesn’t even make sense
As a friend said, "show me a science that doesn't involve data." I hate the name
myself, but I also know it could be a lot worse. Just look at other fields that
suffer under terms like "new archaeology" (now more than 50 years old) or
"modernist art" (pushing a century). I learned from teenage bands that the
naming process is the most divisive part of any new venture, so my philosophy
has always been to take the name you're given, and rely on time and hard work
to give it the right associations. Apple and Microsoft (née Micro-soft) are
terrible startup names by any objective measure, but they've earned their
mindshare. People are calling what we're doing "data science," so let's accept
that and focus on moving the subject forward.
There’s no definition
This is probably the deepest objection, and the one with the most teeth. There
is no widely accepted boundary for what's inside and outside of data science's
scope. Is it just a faddish rebranding of statistics? I don't think so, but I also
don't have a full definition. I believe that the recent abundance of data has
sparked something new in the world, and when I look around I see people with
shared characteristics who don't fit into traditional categories. These people
tend to work beyond the narrow specialties that dominate the corporate and
institutional world, handling everything from finding the data, processing it
at scale, visualizing it and writing it up as a story. They also seem to start by
looking at what the data can tell them, and then picking interesting threads to
follow, rather than the traditional scientist's approach of choosing the problem
first and then finding data to shed light on it. I don't know what the eventual
consensus will be on the limits of data science, but we're starting to see some
outlines emerge.
Time for the community to rally
I'm betting a lot on the persistence of the term. If I'm wrong the Data Science
Toolkit will end up sounding as dated as "surfing the information
superhighway." I think data science, as a phrase, is here to stay though,
whether we like it or not. That means we as a community can either step up
and steer its future, or let others exploit its current name recognition and
dilute it beyond usefulness. If we don't rally around a workable definition to
replace the current vagueness, we'll have lost a powerful tool for explaining
our work.
Why you can’t really anonymize your data
It's time to accept and work within the limits of data anonymization.
by Pete Warden
One of the joys of the last few years has been the flood of real-world datasets
being released by all sorts of organizations. These usually involve some record
of individuals' activities, so to assuage privacy fears, the distributors will claim
that any personally-identifying information (PII) has been stripped. The idea
is that this makes it impossible to match any record with the person it's
recording.
Something that my friend Arvind Narayanan has taught me, both with
theoretical papers and repeated practical demonstrations, is that this
anonymization process is an illusion. Precisely because there are now so many
different public datasets to cross-reference, any set of records with a
non-trivial amount of information on someone's actions has a good chance of
matching identifiable public records. Arvind first demonstrated this when he
and his fellow researcher took the "anonymous" dataset released as part of the
first Netflix prize, and demonstrated how he could correlate the movie rentals
listed with public IMDB reviews. That let them identify some named
individuals, and then gave access to their complete rental histories. More
recently, he and his collaborators used the same approach to win a Kaggle
contest by matching the topography of the anonymized and a publicly crawled
version of the social connections on Flickr. They were able to take two partial
social graphs, and like piecing together a jigsaw puzzle, figure out fragments
that matched and represented the same users in both.
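The cross-referencing idea can be sketched in a few lines of Python. This is a
toy illustration only, not the method Narayanan and his collaborators used:
the records, the names, and the `min_overlap` threshold are all made up, and
the matching rule (count identical title/date pairs) is an assumption chosen
for simplicity.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: user IDs replaced by opaque tokens.
anonymized = [
    ("u1", "Brazil", "2005-03-01"),
    ("u1", "Heat", "2005-03-04"),
    ("u1", "Alien", "2005-03-09"),
    ("u2", "Heat", "2005-03-04"),
    ("u2", "Fargo", "2005-04-11"),
]

# Hypothetical public reviews posted under real names.
public = [
    ("Alice Example", "Brazil", "2005-03-01"),
    ("Alice Example", "Alien", "2005-03-09"),
    ("Bob Example", "Fargo", "2005-06-30"),
]


def link_records(anonymized, public, min_overlap=2):
    """Return {token: name} where a token and a name share at least
    min_overlap identical (title, date) pairs."""
    by_token = defaultdict(set)
    for token, title, date in anonymized:
        by_token[token].add((title, date))
    by_name = defaultdict(set)
    for name, title, date in public:
        by_name[name].add((title, date))

    matches = {}
    for token, events in by_token.items():
        best_name, best_score = None, 0
        for name, reviews in by_name.items():
            score = len(events & reviews)
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= min_overlap:
            matches[token] = best_name
    return matches


matches = link_records(anonymized, public)
print(matches)
```

In this toy data only `u1` is linked to a name, but once it is, every other
row carrying that token (here, the `Heat` rental) is exposed too, which is the
"complete rental histories" problem described above.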
All the known examples of this type of identification are from the research
world (no commercial or malicious uses have yet come to light), but they
prove that anonymization is not an absolute protection. In fact, it creates a
false sense of security. Any dataset that has enough information on people to
be interesting to researchers also has enough information to be
de-anonymized. This is important because I want to see our tools applied to
problems that really matter in areas like health and crime. This means
releasing detailed datasets on those areas to researchers, and those are bound
to contain data more sensitive than movie rentals or photo logs. If just one of
those sets is de-anonymized and causes a user backlash, we'll lose access to
all of them.
So, what should we do? Accepting that anonymization is not a complete
solution doesn't mean giving up, it just means we have to be smarter about our
data releases. Below I outline four suggestions.
Keep the anonymization
Just because it's not totally reliable, don't stop stripping out PII. It's a good
first step, and makes the reconstruction process much harder for any attacker.
Acknowledge there’s a risk of de-anonymization
Don't make false promises to users about how anonymous their data is. Make
the case to them that you're minimizing the risk and possible harm of any data
leaks, sell them on the benefits (either for themselves or the wider world) and
get their permission to go ahead. This is a painful slog, but the more
organizations that take this approach, the easier it will be. A great model is
Reddit, which asked their users to opt-in to sharing their data. They got a
great response.
Limit the detail
Look at the records you're getting ready to open up to the world, and imagine
that they can be linked back to named people. Are there parts of it that are
more sensitive than others, and maybe less important to the sort of applications
you have in mind? Can you aggregate multiple people together into cohorts
that represent the average behavior of small groups?
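As a rough sketch of what cohort aggregation might look like, the Python
below coarsens two identifying fields and publishes only per-cohort averages.
The records, the field choices, and the `min_size` cutoff are hypothetical,
and averaging small groups is a mitigation rather than a privacy guarantee.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-person records: (age, zip_code, visits_per_year).
records = [
    (23, "94110", 2),
    (27, "94110", 4),
    (29, "94110", 3),
    (41, "94110", 9),
    (44, "94607", 1),
    (46, "94607", 5),
    (48, "94607", 6),
]


def cohort_key(age, zip_code):
    """Coarsen age to a decade and the zip code to its first three digits."""
    decade = (age // 10) * 10
    return (f"{decade}s", zip_code[:3])


def aggregate(records, min_size=3):
    """Average the sensitive value per cohort, dropping cohorts smaller
    than min_size so no published row describes only one or two people."""
    groups = defaultdict(list)
    for age, zip_code, visits in records:
        groups[cohort_key(age, zip_code)].append(visits)
    return {
        key: {"people": len(values), "avg_visits": mean(values)}
        for key, values in groups.items()
        if len(values) >= min_size
    }


cohorts = aggregate(records)
print(cohorts)
```

The lone 41-year-old in the `941` area falls into a cohort of one and is
suppressed rather than published, which is the trade-off this suggestion asks
you to weigh: less detail in exchange for less exposure.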
Learn from the experts
There's many decades of experience of dealing with highly sensitive and
personal data in sociology and economics departments across the globe.
They've developed techniques that could prove useful to the emerging
community of data scientists, such as subtle distortions of the information to
prevent identification of individuals, or even the sort of locked-down
clean-room conditions that are required to access detailed IRS data.
There's so much good that can be accomplished using open datasets, it would
be a tragedy if we let this slip through our fingers with preventable errors. With
a bit of care up front, and an acknowledgement of the challenges we face, I
really believe we can deliver concrete benefits without destroying people's
privacy.
Big data and the semantic web
At war, indifferent, or intimately connected?
by Edd Dumbill
On Quora, Gerald McCollum asked if big data and the semantic web were
indifferent to each other, as there was little discussion of the semantic web
topic at Strata this February.
My answer in brief is: big data's going to give the semantic web the massive
amounts of metadata it needs to really get traction.
As the chair of the Strata conference, I see a vital link between big data and
semantic web, and have my own roots in the semantic web world. Earlier this
year however, the interaction was not yet of sufficient utility to make a strong
connection in the conference agenda.
Google and the semantic web
A good example of the development of the relationship between big data and
the semantic web is Google. Early on, Google search eschewed explicit use of
semantics, preferring to infer a variety of signals in order to generate results.
They used big data to create signals such as PageRank.
Now, as the search algorithms mature, Google's mission is to make their
results ever more useful to users. To achieve this, their software must start to
understand more about the actual world. Who's an author? What's a recipe?
What do my friends find useful? So the connections between entities become
more important. To achieve this Google is using data from initiatives such as
schema.org, RDFa and microformats.
Google do not use these semantic web techniques to replace their search, but
rather to augment it and make it more useful. To get all fancypants about it:
Google are starting to promote the information they gather toward being
knowledge. They even renamed their search group as "Knowledge".
Strata Conference New York 2011, being held Sept. 22-23, covers the latest
and best tools and technologies for data science, from gathering, cleaning,
analyzing, and storing data to communicating data intelligence effectively.
Save 30% on registration with the code STN11RAD
Metadata is hard: big data can help
Conventionally, semantic web systems generate metadata and identified
entities explicitly, i.e. by hand or as the output of database values. But as
anybody who's tried to get users to do it will tell you, generating metadata is
hard. This is part of why the full semantic web dream isn't yet realized.
Analytical approaches take a different approach: surfacing and classifying the
metadata from analysis of the actual content and data itself. (Freely exposing
metadata is also controversial and risky, as open data advocates will attest.)
Once big data techniques have been successfully applied, you have identified
entities and the connections between them. If you want to join that
information up to the rest of the web, or to concepts outside of your system,
you need a language in which to do that. You need to organize, exchange and
reason about those entities. It's this framework that has been steadily built up
over the last 15 years with the semantic web project.
To give an already widespread example: many data scientists use Wikipedia
to help with entity resolution and disambiguation, using Wikipedia URLs to
identify entities. This is a classic use of the most fundamental of semantic web
technologies: the URI.
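A minimal sketch of the idea, assuming a hand-built alias table (real systems
derive such tables from far larger sources): different surface strings resolve
to one Wikipedia URL, and the URL, not the string, is what identifies the
entity. The `ALIASES` table and the `resolve` helper are hypothetical.

```python
# Hypothetical alias table mapping surface forms to Wikipedia URLs.
ALIASES = {
    "big apple": "https://en.wikipedia.org/wiki/New_York_City",
    "new york city": "https://en.wikipedia.org/wiki/New_York_City",
    "nyc": "https://en.wikipedia.org/wiki/New_York_City",
    "new york state": "https://en.wikipedia.org/wiki/New_York_(state)",
}


def resolve(mention):
    """Map a text mention to a canonical URI, or None if unknown."""
    return ALIASES.get(mention.strip().lower())


mentions = ["NYC", "Big Apple", "New York State", "Gotham"]
resolved = {m: resolve(m) for m in mentions}

# Two different strings that share a URI refer to the same entity.
same_entity = resolve("NYC") == resolve("Big Apple")
print(resolved, same_entity)
```

Because the identifier is a URI, two datasets that never shared a schema can
still agree that "NYC" and "Big Apple" mean the same thing.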
For Strata, as our New York series of conferences approaches, we will be
starting to include a little more semantic web, but with a strict emphasis on
utility. Strata itself is not as much beholden to big data, as about being
data-driven, and the ongoing consequences that has for technology, business
and society.
Big data: Global good or zero-sum arms race?
It remains to be seen if big data will catalyze exponential growth.
by Jim Stogdill
Last month, Netezza CEO Jim Baum gave a talk at the GigaOM big data
event. If I'm honest, I was checking my email and missed most of it, but I do
remember tuning in just in time to hear him say something like "big data is
going to have a huge economic impact."
I spend most of my days considering how the component pieces of this big
data transformation will impact the corporate enterprise. Baum's comment
got me thinking, though, about a more meta question: Is "big data" a key to
some kind of industrial revolution reboot? Or, is it just going to be expensive
table stakes for previously simple-to-understand businesses?
For 200-plus years the industrial revolution* has been a kind of Moore's law
of human productivity. Over that period our economic output per person has
been growing like clockwork, and whatever you think of the various political
-isms that sprung from industrialization, this march of productivity has pulled
a lot of people out of poverty and is cause for the first sustained increase in
wealth across human history.
* For the purposes of this post, I'm treating the industrial age and the
information age as two parts of one continuum.
But like Moore's law in a single core, our industrial revolution in advanced
economies is kinda playing out. Our economy has been shifting for some time
toward services that are proving to be impervious to order-of-magnitude
productivity gains. The thousand-fold increases in productivity we saw on the
farm and in the factory just don't seem likely to happen in health care and
other service intensive sectors.
Of course our economy continues to grow, but at a rate that is staying just a
skosh ahead of population growth. And since the top 1% are taking all of
that (and perhaps more), for the first time in American history parents are
worrying that their kids won't have opportunities better than their own. Voila!
There stems the populist anger that feeds the Tea Party.
That's the U.S.-centric view. Of course on a global basis there is tremendous
growth as late-stage industrial revolution innovations are applied with vigor
to developing economies. The 8% growth rates many countries are achieving
will double their population's wealth every 10 years. But for the U.S., achieving
growth requires parallelism. Of course, in this context we call it globalism and
it means if we can't be more productive in one place, we have to take advantage
of modern communications to do it in a bunch of other cheaper places. The
problem is, after a few decades of those 8% overseas growth rates, there will
be less comparative advantage for us to take advantage of and if we want
continued economic growth, we really will need to find ways to be more
productive.
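The doubling claim is easy to check. The short Python sketch below assumes
steady compound growth at a constant annual rate (a simplification; real
growth rates vary year to year) and computes both the doubling time at 8%
and the total growth over a decade.

```python
import math


def doubling_time(rate):
    """Years for a quantity growing at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)


def growth_factor(rate, years):
    """Total multiplier after compounding `rate` for `years` years."""
    return (1 + rate) ** years


years_to_double = doubling_time(0.08)
factor_after_decade = growth_factor(0.08, 10)
print(f"doubling time at 8%: {years_to_double:.1f} years")
print(f"growth over 10 years at 8%: {factor_after_decade:.2f}x")
```

At 8% the doubling time works out to roughly nine years, so "every 10 years"
is, if anything, slightly conservative.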
So, that's why Baum's comment stuck in my head.
At the risk of way over generalizing, so far "big data" has mostly been about
behavioral analysis to better target ads. Is that what Baum meant? That more
effectively matching producer and consumer long tails through precision ad
placement is going to fundamentally change the economy? That type of
matching can promote economic activity, which is good, but I don't see the
link to fundamentally improved productivity. If this kind of innovation pulls
another tranche of the bell curve out of poverty it will do it by putting more
people to work doing the same stuff, not by making our economy
fundamentally more efficient.
When I heard "huge impact on the economy," my first thought was maybe it's
just a throw-away comment. Maybe he just meant the economy as seen
through the narrow lens of his company revenues. But then I tried to think
about this on a deeper level: What's here that I haven't considered? Could he
somehow be saying that this is a catalyst for the next big phase of productivity
growth in our 200-year-old industrial revolution? Is this the industrial
revolution equivalent of nanometer chip design, which starts the next decade
of doubling? Is it the thing that gets the middle class growing again and eases
all this populist anger?
Yeah, that might sound kind of absurd, but that's how my head works: a
daily stream of ADHD-fueled big dreams immediately dashed on the rocks of
reality.
(As an aside, half way through writing this I came across a prediction of the
"wine and roses" we'll all experience with this "New Information Age." Don't
sweat the death of privacy, the surveillance state is highly unlikely.)
So, back to the question: Is big data an economic driver or just a must-have to
be in the game?
As early as the 1950s it was obvious that robotic automation was going to
fundamentally change manufacturing. As automobiles increasingly were built
by robotic labor, the industry saw incredible productivity gains. The hours of
human labor per automobile dropped by orders of magnitude over the next
30 years. Naturally, cars didn't just get cheaper, they also got more complex
and feature-rich. But anyone could understand the return on capital of
installing robotic lines. What's the return on capital look like for a Hadoop
cluster?
It's worth noting that robots didn't just increase productivity, they also
reshaped labor's relationship with management. If you're labor, competing
with a robot sucks. This was presciently described by Norbert Wiener in his
classic "The Human Use of Human Beings, Cybernetics and Society." Of course
we don't need history's warning to know that big data might have a dark side,
too. If you don't see it now, you will when you download a new car stereo
software version and it resets all your radio station presets based on Toyota's
notion of people like you. Of course, for a car company to be as obnoxious as
your software and search bar providers have long been, they have to learn as
much about you as those software guys do, and that's weird. We aren't really
used to the idea of a manufacturer knowing where we go and who we go there
with.
There obviously are places where large-scale data and analysis will improve
efficiencies and productivity. Particularly in areas like smart grid, where it will
reduce the investment necessary in power plant construction, or financial
services, where it promises to help fight fraudulent transactions. What else?
Are there big opportunities for order-of-magnitude productivity gains out
there that come to mind? Or is most of the value created by this "new
information age" going to be in some mushy upper region of Maslow's
hierarchy? A kind of middle class feel-good machine that remains completely
irrelevant to the working poor dreaming of their first homes?
Norbert Wiener was concerned that automation-based productivity gains
would disrupt the working man and woman's living. He held that concern in
the face of the obvious and compelling productivity gains that were sure to
flow through to GDP as wealth.
As we enter the big data era of the information age and give up what's left of
our privacy, I'd like to think that it will be for more than a zero-sum game of
musical chairs to decide the next winners.
The truth about data: Once it’s out there, it’s hard to
control
Jeff Jonas on data ownership, security concerns, and privacy trade offs.
by Jenn Webb
The amount of data being produced is increasing exponentially, which raises
big questions about security and ownership. Do we need to be more concerned
about the information many of us readily give out to join popular social
networks, sign up for website community memberships, or subscribe to free
online email? And what happens to that data once it's out there?
In a recent interview, Jeff Jonas (@JeffJonas), IBM distinguished engineer and
a speaker at the O'Reilly Strata Online Conference, said consumers'
willingness to give away their data is a concern, but it's perhaps secondary to
the sheer number of data copies produced.
Our interview follows.
What is the current state of data security?
Jeff Jonas: A lot of data has been created, and a boatload more is on its way;
we have seen nothing yet. Organizations now wonder how they are going
to protect all this data, especially how to protect it from unintended
disclosure. Healthcare providers, for example, are just as determined to
prevent a "wicked leak" as anyone else. Just imagine the conversation between
the CIO and the board trying to explain the risk of the enemy within (the
"insider threat") and the endless and ever-changing attack vectors.
I'm thinking a lot these days about data protection, ranging from reducing the
number of copies of data to data anonymization to perpetual insider threat
detection.
How are advancements in data gathering, analysis, and application affecting
privacy, and should we be concerned?
Jeff Jonas: When organizations only collect what they need in order to
conduct business, tell the consumer what they are collecting, why and how
they are going to use it, and then use it this way, most would say "fair game."
This is all in line with Fair Information Practices (FIPs).
There continues to be some progress in the area of privacy-enhancing
technology. For example, tamper-resistant audit logs, which are a way to record
how a system was used that even the database administrator cannot alter. On
the other hand, the trend that I see involves the willingness of consumers to
give up all kinds of personal data in return for some benefit: free email or a
fantastic social network site, for example.
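(Editorial sketch, not part of the interview.) One common way to make a log
tamper-evident is to chain entries together with a cryptographic hash, so that
changing an old entry invalidates everything after it. The Python below
illustrates that general idea only; it is not the specific technology Jonas
refers to, and the record fields and the `GENESIS` constant are made up.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def entry_hash(prev_hash, record):
    """Hash the previous entry's hash together with this record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append(log, record):
    """Append a record, chaining it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": entry_hash(prev_hash, record)})


def verify(log):
    """Recompute the chain; an edited entry, or one removed from the
    start or middle, makes a stored hash stop matching."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"user": "dba", "action": "SELECT", "table": "patients"})
append(log, {"user": "dba", "action": "EXPORT", "table": "patients"})
intact = verify(log)

# Quietly rewriting history is detected on the next verification.
log[0]["record"]["action"] = "LOGIN"
tampered_ok = verify(log)
print(intact, tampered_ok)
```

A chain like this only detects tampering; it resists it only if the most recent
hash is also kept somewhere the administrator cannot reach, since otherwise
the whole chain could be recomputed or its tail silently truncated.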
While it is hard to not be concerned about what is happening to our privacy,
I have to admit that for the most part technology advances are really delivering
a lot of benefit to mankind.
The Strata Online Conference, being held April 6, will look at how
information, and the ability to put it to work, will shape tomorrow's markets.
Scheduled speakers include: Gavin Starks from AMEE, Jeff Jonas from IBM,
Chris Thorpe from Artfinder, and Ian White from Urban Mapping.
Registration is open
What are the major issues surrounding data ownership?
Jeff Jonas: If users continue to give their data away because the benefits are
irresistible, then there will be fewer battles, I suppose. The truth about data is
that once it is out there, it's hard to control.
I did a back of the envelope estimate a few years ago to estimate the number
of copies a single piece of data may experience. Turns out the number is
roughly the same as the number of licks it takes to get to the center of a Tootsie
Pop: a play on an old TV commercial that basically translates to more than
you can easily count.
A well-thought-out data backup strategy alone may create more than 100
copies. Then what about the operational data stores, data warehouses, data
marts, secondary systems and their backups? Thousands of copies would not
be uncommon. Even if a consumer thought they could own their data (which
they can't in many settings), how could they ever do anything to affect it?
CHAPTER 3
The Application of Data: Products
and Processes
How the Library of Congress is building the Twitter
archive
Checking in on the Library of Congress' Twitter archive, one year later.
by Audrey Watters
In April 2010, Twitter announced it was donating its entire archive of public
tweets to the Library of Congress. Every tweet since Twitter's inception in 2006
would be preserved. The donation of the archive to the Library of Congress
may have been in part a symbolic act, a recognition of the cultural significance
of Twitter. Although several important historical moments had already been
captured on Twitter when the announcement was made last year (the first
tweet from space, for example, Barack Obama's first tweet as President, or
news of Michael Jackson's death), since then our awareness of the significance
of the communication channel has certainly grown.
That's led to a flood of inquiries to the Library of Congress about how and
when researchers will be able to gain access to the Twitter archive. These
research requests were perhaps heightened by some of the changes that Twitter
has made to its API and firehose access.
But creating a Twitter archive is a major undertaking for the Library of
Congress, and the process isn't as simple as merely cracking open a file for
researchers to peruse. I spoke with Martha Anderson, the head of the library's
National Digital Information Infrastructure and Preservation Program
(NDIIP), and Leslie Johnston, the manager of the NDIIP's Technical
Architecture Initiatives, about the challenges and opportunities of archiving
digital data of this kind.
It's important to note that the Library of Congress is quite adept with the
preservation of digital materials, as it's been handling these types of projects
for more than a decade. The library has been archiving congressional and
presidential campaign websites since 2000, for example, and it currently has
more than 200 terabytes of web archives. It also has hundreds of terabytes of
digitized newspapers, and petabytes of data from other sources, such as film
archives and materials from the Folklife Center. So the Twitter archives fall
within the purview of these sorts of digital preservation efforts, and in terms
of the size of the archive, it is actually not too unwieldy.
Even with a long experience with archiving "born digital" content, Anderson
says the Library of Congress "felt pretty brave about taking on Twitter."
What makes the endeavor challenging, if not the size of the archive, is its
composition: billions and billions and billions of tweets. When the donation
was announced last year, users were creating about 50 million tweets per day.
As of Twitter's fifth anniversary several months ago, that number has increased
to about 140 million tweets per day. The data keeps coming too, and the
Library of Congress has access to the Twitter stream via Gnip for both
real-time and historical tweet data.
Each tweet is a JSON file, containing an immense amount of metadata in
addition to the contents of the tweet itself: date and time, number of followers,
account creation date, geodata, and so on. To add another layer of complexity,
many tweets contain shortened URLs, and the Library of Congress is in
discussions with many of these providers as well as with the Internet Archive
and its 301works project to help resolve and map the links.
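To make the "tweet plus metadata" shape concrete, here is a small Python
sketch that parses one made-up record and pulls out the kinds of fields listed
above. The field names and nesting are illustrative assumptions, not the
archive's actual schema, and `summarize` is a hypothetical helper.

```python
import json

# A made-up record; the field names are illustrative, not a schema.
raw = """
{
  "text": "Hello from the archive",
  "created_at": "2011-03-21T12:00:00Z",
  "user": {
    "screen_name": "example_user",
    "followers_count": 42,
    "created_at": "2009-06-01T08:30:00Z"
  },
  "geo": {"lat": 38.8887, "lon": -77.0047}
}
"""


def summarize(tweet_json):
    """Pull the tweet text and a few metadata fields out of one record."""
    tweet = json.loads(tweet_json)
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text"),
        "posted": tweet.get("created_at"),
        "followers": user.get("followers_count"),
        "account_created": user.get("created_at"),
        "has_geodata": tweet.get("geo") is not None,
    }


summary = summarize(raw)
print(summary)
```

Even in this toy record the metadata outweighs the message itself, which hints
at why indexing billions of such records is the hard part of the project.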
As it stands, Anderson and Johnston say they won't be crawling all these
external sites and end-points, although Anderson says that in her "grand vision
of the future" all of this data (not just from the Library of Congress but from
all these different technological and cultural heritage institutions) would be
linked. In the meantime, the Library of Congress won't be creating a catalog
of all these tweets and all this data, but they do want to be able to index the
material so researchers can effectively search it.
This requires a significant technological undertaking on the part of the library
in order to build the infrastructure necessary to handle inquiries, and
specifically to handle the sorts of inquiries that researchers are clamoring for.
Anderson and Johnston say that a cross-departmental team has been assembled
at the library, and they're actively taking input from researchers to find out
exactly what their needs for the material may be. Expectations also need to be
set about exactly what the search parameters will be; this is a high-bandwidth,
high-computing-power undertaking after all.
The project is still very much under construction, and the team is weighing a
number of different open source technologies in order to build out the storage,
management and querying of the Twitter archive. While the decision hasn't
been made yet on which tools to use, the library is testing the following in
various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and
Hadoop.
A pilot workshop is slated to run this summer with researchers who can help
guide the Library of Congress in building out the archive and its accessibility.
Anderson and Johnston say they expect an initial offering to be made available
in four or five months. But even then, access to the Twitter archive will be
restricted to "known researchers" who will need to go through the Library of
Congress approval process to gain access to the data. Based on the sheer
number of research requests, there are going to be plenty of scholars lined up
to have a closer examination of this important cultural and technological
archive.
Photo: Library of Congress Reading Room 1 by maveric2003, on Flickr
Data journalism, data tools, and the newsroom stack
The 2011 Knight News Challenge winners illustrate data's ascendance
in media and government.
by Alex Howard
MIT's recent Civic Media Conference and the latest batch of Knight News
Challenge winners made one reality crystal clear: as a new era of
technology-fueled transparency, innovation and open government dawns, it
won't depend on any single CIO or federal program. It will be driven by a
distributed community of media, nonprofits, academics and civic advocates
focused on better outcomes, more informed communities and the new news,
whatever form it is delivered in.
The themes that unite this class of Knight News Challenge winners were data
journalism and platforms for civic connections. Each theme draws from central
realities of the information ecosystems of today. Newsrooms and citizens are
confronted by unprecedented amounts of data and an expanded number of
news sources, including a social web populated by our friends, family and
colleagues. Newsrooms, the traditional hosts for information gathering and
dissemination, are now part of a flattened environment for news, where news
breaks first on social networks, is curated by a combination of professionals
and amateurs, and then analyzed and synthesized into contextualized
journalism.
Data journalism and data tools
In an age of information abundance, journalists and citizens alike all need
better tools, whether we're curating the samizdat of the 21st century in the
Middle East, like Andy Carvin, processing a late night data dump, or looking
for the best way to visualize water quality to a nation of consumers. As we
grapple with the consumption challenges presented by this deluge of data, new
publishing platforms are also empowering us to gather, refine, analyze and
share data ourselves, turning it into information.
In this future of media, as Mathew Ingram wrote at GigaOm, big data meets
journalism, in the same way that startups see data as an innovation engine, or
civic developers see data as the fuel for applications. "The media industry is
(hopefully) starting to understand that data can be useful for its purposes as
well," Ingram wrote. He continued:
... data and the tools to manipulate it are the modern equivalent of the
microfiche libraries and envelopes full of newspaper clippings that used to make
up the research arm of most media outlets. They are just tools, but as some of
the winners of the Knight News Challenge have already shown, these new tools
can produce information that might never have been found before through
traditional means.
The Poynter Institute took note of the attention paid to data by the Knight
Foundation as well. As Steve Myers reported, the Knight News Challenge gave
$1.5 million to projects that filter and examine data. The winners that relate
to data journalism include:
• Overview, which is a tool to help journalists find stories in large amounts
of data by cleaning, visualizing and interactively exploring large document
and data sets. Associated Press data journalist Jonathan Stray called
Overview the "Drupal of data visualization".
• ScraperWiki, a favorite tool of civic coders at Code for America and
elsewhere, enables anyone to collect, store and publish public data. With the
Knight funding, the next version of ScraperWiki will be even more
powerful.
• OpenBlock Rural will use the OpenBlock platform to partner with local
governments and community newspapers to collect and publish data in
North Carolina, including crime, real estate, school ratings and restaurant
inspections.
• The PANDA Project will try to make research easier in the newsroom with
a set of open source, web-based tools oriented at making it easier for
journalists to use and analyze data.
I talked more with the AP's Jonathan Stray about data journalism and
Overview at the MIT Civic Media in the video below. For an even deeper dive
into his thinking on what journalists need in the age of big data, read his
thoughts on "the editorial search engine."
http://youtube.com
The newsroom stack
Vith these investments in the lutuie ol jouinalism, moie seeus have Leen
planteu to auu to a ¨newsroom stack,¨ to Loiiow a technical teim lamiliai
to Rauai ieaueis, comLining a seiies ol technologies loi use in a given entei-
piise.
¨I like the thought ol it,¨ saiu Biian Boyei, the pioject managei loi PANDA,
in an inteiview at the MIT Meuia LaL. ¨The newsioom stack coulu auu up to
the kit ol tools that you ought to Le using in youi uay to uay iepoiting.¨
Boyer described how the flow of data might move from a spreadsheet (as a .CSV
file) to Google Refine (for tidying, clustering, adding columns) to PANDA and
then on to Overview or Fusion Tables or Many Eyes, for visualization. This is
about "small pieces, loosely joined," he said. "I would rather build one really
good small piece than one big project that does everything."
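The "small pieces, loosely joined" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample data, and the fingerprint rule are hypothetical, and the clustering step is a simple stand-in for the kind of tidying described above, not the actual behavior of Google Refine or PANDA. It also assumes every row has a value in every column.

```python
import csv
import io
import re
from collections import defaultdict

# Made-up spreadsheet export, used only for illustration.
SAMPLE_CSV = "agency,amount\nParks Dept. ,100\nDept Parks,250\nparks dept,75\nLibrary,40\n"


def read_rows(csv_text):
    """Piece 1: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def tidy(rows):
    """Piece 2: trim stray whitespace from every field."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def fingerprint(text):
    """Reduce a string to lowercase, punctuation-free, sorted unique words."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(sorted(set(words)))


def cluster(rows, column):
    """Piece 3: group the distinct spellings in one column by fingerprint."""
    groups = defaultdict(set)
    for row in rows:
        groups[fingerprint(row[column])].add(row[column])
    return {key: sorted(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(cluster(tidy(read_rows(SAMPLE_CSV)), "agency"))
```

Each function does one job and hands plain rows to the next, so any piece could be swapped for a better tool without touching the others.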
PANDA and Overview are squarely oriented at bread-and-butter issues for
newsrooms in the age of big data. "It's a pain to search across datasets, but we
also have this general newsroom content management issue," said Boyer. "The
data stuck on your hard drive is sad data. Knowledge management isn't a sexy
problem to solve, but it's a real business problem. People could be doing better
reporting if they knew what was available. Data should be visible internally."
Boyer thinks the trends toward big data in media are pretty clear, and that he
and other hacker journalists can help their colleagues to not only understand
it but to thrive. "There's a lot more of it, with government releasing its stuff
more rapidly," he said. "The city of Chicago is dropping two datasets a week
right now. We're going for increased efficiency, to help people work faster and
write better stories. Every major news org in the country is hiring a news app
developer right now. Or two. For smaller news organizations, it really works
for them. Their data apps account for the majority of their traffic."
Bridging the data divide
There's some caution merited here. Big data is not a panacea for all things, in
media or otherwise. Greg Borenstein explored some of these issues in his post
on big data and cybernetics earlier this month. Short version: humans still
matter in building human relationships and making sense of what matters,
however good our personalized relevance engines for news become. Proponents
of open data have to consider a complementary concern: digital literacy.
As Jesse Lichtenstein asserted, "open data alone isn't enough," following the
thread of danah boyd's "transparency is not enough" talk at the 2010 Gov 2.0
Expo. Open data can empower the empowered.
To make open government data sing, infomediaries need to have time and
resources. If we're going to hope that citizens will draw their own conclusions
from public data shown in real time, we'll need to educate them to be
critical thinkers. As Andy Carvin tweeted during the MIT Civic Media
conference, "you need to be sure those people have high levels of digital
literacy and media literacy." There's a data divide that has to be considered
here, as Nick Clark Judd pointed out over at techPresident.
It looks like those concerns were at least partially factored into the judges'
decision on other Knight News Challenge winners. Spending Stories, from the
Open Knowledge Foundation, is designed to add context to news stories based
upon government data by connecting stories to the data used. Poderopedia
will try to bring more transparency to Chile using data visualizations that draw
upon a database of editorial and crowdsourced data. The State Decoded
will try to make the law more user-friendly. The project has notable open
government DNA: Waldo Jaquith's work on OpenVirginia was aimed at providing
an API for the Commonwealth.
There were citizen science and transparency projects alongside all of those data
plays too, including:
• Public Laboratory, a tool kit and online community for grassroots data
gathering and research that builds upon the success of Grassroots Mapping.
• NextDrop, a project to provide mobile access to information about water
availability to residents of a city in India.
Given the recent story here at Radar on citizen science and crowdsourced
radiation data, there's good reason to watch both of these projects evolve. And
given research from the Pew Internet and Life Project on the role of the Internet
as a platform for collective action, the effect of connecting like-minded citizens
to one another through efforts like the Tiziano Project may prove far reaching.
Photo: NYTimes: 365/360 - 1981 (in color) by blprnt_van, on Flickr
The data analysis path is built on curiosity, followed by
action
Why simplicity, empiricism, and DIY are keys to data analysis.
by Mac Slocum
A traditional view of data analysis involves precision, preparation, and
methodical examination of defined datasets. Philipp Janert, author of "Data
Analysis with Open Source Tools," has a somewhat different perspective.
Those traditional elements are still important, but Janert also thinks simplicity,
experimentation, action, and natural curiosity all shape effective data work.
He expands on these ideas in the following interview.
Is data analysis inherently complicated?
Philipp Janert: I observe a tendency to do something complicated and fancy;
to bring in a statistical concept and other "sophisticated" stuff. The problem
is that the sophisticated stuff isn't that easy to understand.
Why not just look at the data set? Just look at it in an editor. Maybe you'll see
something. Or, draw some graphs. Graphs don't require any sort of formal
analytical training. These simple methods can be illuminating precisely
because you don't need anything complicated, and nothing is hidden.
Why do analysts shy from simplicity?
PJ: I often perceive a great sense of insecurity in my co-workers when it comes
to math. Because of that, I get the sense people are trying to almost hide behind
complicated methods.
The classic case for me is that usually within the first three minutes of a
conversation, people start talking about standard deviations. It's the one
concept from classical statistics that everyone has heard of. But contextually,
it's not clear what "standard deviation" really means. Are they talking about
what's being measured by the standard deviation, namely the width of the
distribution? Are they referring to one particular measure and how it's being
calculated? Do they mean the conclusions that can be drawn from standard
deviations in the Normal case?
We need to keep it simple and not get sucked into abstract concepts that may
or may not be fully understood.
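A small numerical sketch may help separate those meanings. It is not from the interview: the two samples below are invented, and the helper name is hypothetical. The point is only that the same calculated width can sit on top of very different shapes, so conclusions borrowed from the Normal case do not automatically carry over.

```python
import statistics


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(1 for v in values if abs(v - mean) <= sd) / len(values)


# A bell-shaped sample: evenly spaced quantiles of a standard Normal curve.
bell_curve = statistics.NormalDist(mu=0.0, sigma=1.0)
bell = [bell_curve.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

# A very different shape: half the values at -1, half at +1.
two_lumps = [-1.0] * 500 + [1.0] * 500

if __name__ == "__main__":
    for name, sample in (("bell", bell), ("two lumps", two_lumps)):
        width = statistics.pstdev(sample)
        print(name, round(width, 2), round(share_within_one_sd(sample), 2))
```

Both samples have a standard deviation close to 1, yet roughly two thirds of the bell-shaped values fall within one standard deviation of the mean, while every value of the two-lump sample does.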
What tool or method offers the best starting point for data analysis?
PJ: Start by plotting the data set. Plot all of the data points and look at them.
Don't try to calculate indicator quantities or summary statistics. Just look at
what you see in the plot. Almost anything worthwhile can be seen in a good
graph.
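As a rough illustration of "just plot it," here is a sketch that draws one mark per data point using nothing but text output. The function name and the measurements are made up for this example; any real plotting tool would do the job better.

```python
def text_plot(values, bins=10, mark="*"):
    """Return one text row per bin, with one mark for every data point in it."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    counts = [0] * bins
    for value in values:
        index = min(int((value - low) / span * bins), bins - 1)
        counts[index] += 1
    return [
        f"{low + span * i / bins:8.1f} | {mark * count}"
        for i, count in enumerate(counts)
    ]


if __name__ == "__main__":
    response_times = [1, 2, 2, 3, 3, 3, 50]  # made-up measurements
    print("\n".join(text_plot(response_times, bins=5)))
```

The picture shows six points bunched at the low end and one far away. The mean of these values, about 9.1, describes none of them, which is the kind of thing a summary statistic hides and a plot reveals.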
Is there a defined career path for people who want to become data scientists?
PJ: The stunning development over the 12 months I was writing this book is
that "big data" became the thing that's on everybody's mind. All of a sudden,
people are really concerned about very large datasets. Of course, this seems to
be mostly driven by the social networking phenomenon. But the question is:
What do we do with that data?
I know that for my purposes, I never need big data. When I ask people what
they do with big data, I've found that it's not what I would call "analysis" at
all, because it does not involve the development of conceptual models. It does
not involve the inductive/deductive cycle of scientific reasoning.
It falls into one of two camps. The first is reporting. For instance, if a company
is being paid based on the number of pages they serve, then counting the
number of served pages is important. The resulting log files tend to be huge,
so that's technically big data. But it's a very straightforward counting and
reporting game.
The other camp is what I consider "generalized search." These are scenarios
like: If User A likes movies B, C, and D, what other specific movie might User
A want? That's a form of searching because you're not actually trying to create
a conceptual model of user behavior. You're comparing individual data points;
you're trying to find the movie that has the greatest similarity to a very specific
other set of predefined movies. For this kind of generalized, exhaustive search,
you need a lot of data because you look for the individual data points. But
that's not really analysis as I understand it, either.
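To make the "generalized search" idea concrete, here is a minimal sketch that scans every candidate movie and keeps the one most similar to the movies a user already likes. The data and function names are hypothetical, and Jaccard overlap of fan sets is just one simple similarity measure chosen for illustration; it is not a method named in the interview.

```python
def jaccard(first, second):
    """Overlap between two sets: size of intersection over size of union."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def most_similar_movie(liked, fans_by_movie):
    """Exhaustively score every movie the user has not liked yet.

    `fans_by_movie` maps a movie title to the set of users who liked it.
    """
    best_title, best_score = None, -1.0
    for title, fans in fans_by_movie.items():
        if title in liked:
            continue
        score = sum(jaccard(fans, fans_by_movie.get(m, set())) for m in liked)
        if score > best_score:
            best_title, best_score = title, score
    return best_title


if __name__ == "__main__":
    fans = {
        "B": {"u1", "u2", "u3"},
        "C": {"u1", "u2"},
        "D": {"u2", "u3"},
        "E": {"u1", "u2", "u3"},
        "F": {"u9"},
    }
    print(most_similar_movie({"B", "C", "D"}, fans))
```

Nothing here models why anyone likes a movie. The code only compares individual data points, which is why this kind of search improves with more data and yet is not analysis in Janert's sense.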
So coming back to your original question - is there a path to becoming a "data
scientist?" - we need to first find out what data science might be. It will
encompass different things: the kind of big data I mentioned; reporting and
business intelligence; hopefully the kind of conceptual modeling that I do. But
depending on what you're trying to accomplish, you could require very
different skills.
For what I do - and this is really the only data analysis I can speak about with
any sense of confidence - the most important skill is curiosity. This sounds a
little tacky, but I mean it. Are you curious why the grass is green? Are you
curious why the sky is blue? I'm talking about questions of this sort. These are
representative of the inquisitive mind of a scientist. If you have that, you're in
good shape and you can start anywhere.
The skills and tools of data science will be discussed at the Strata Conference,
being held Feb. 1-3 in Santa Clara, Calif. Save 30% off registration with the
code SRT11RAD.
Besides curiosity, are there other traits or skills that benefit data analysts?
PJ: You need experience with empirical work. And by that I mean someone
who looks at the "idiot lights" on a router to make sure the cable is plugged
in before they troubleshoot. We've all been in the situation where you reinstall
the IP stack because you can't get network connectivity, and only later did you
realize the router wasn't plugged in. These failures of empirical work are
critical because empirical skills can be learned.
It's also nice, but not essential, to have taken a college math class and retained
a bit. You should learn a programming language as well because you need to
know how to manipulate data on your own. Any of the current scripting
languages will do.
The last thing is that you need to actually do the work. Find a dataset that you’re interested in and work on it. It
doesn’t have to be fancy, but you have to get started. You can’t just sit there and expect it to happen. Experience
and practice are really important.
It sounds like the "just start" mindset you find in the Maker/DIY community also
applies to data. Is that right?
PJ: I don't know about other people, but I do this because it's fun. And that's
a similar mentality to the Make space. They're more about creating something
as opposed to understanding something, but the mentality is very much the
same.
It's about curiosity followed by action. You look at the dataset and then go
deeper to discover something. And this process isn't defined by tools.
Personally, I'm interested in what somebody's trying to find rather than if
they're using all the right statistical methods.
How data and analytics can improve education
George Siemens on the applications and challenges of education data.
by Audrey Watters
Schools have long amassed data: tracking grades, attendance, textbook
purchases, test scores, cafeteria meals, and the like. But little has actually been
done with this information - whether due to privacy issues or technical
capacities - to enhance students' learning.
With the adoption of technology in more schools and with a push for more
open government data, there are clearly a lot of opportunities for better data
gathering and analysis in education. But what will that look like? It's a
politically charged question, no doubt, as some states are turning to things like
standardized test score data in order to gauge teacher effectiveness and, in turn,
retention and promotion.
I asked education theorist George Siemens, from the Technology Enhanced
Knowledge Research Institute at Athabasca University, about the possibilities
and challenges for data, teaching, and learning.
Our interview follows.
What kinds of data have schools traditionally tracked?
George Siemens: Schools and universities have long tracked a broad range of
learner data - often drawn from applications (universities) or enrollment
forms (schools). This data includes any combination of: location, previous
learning activities, health concerns (physical and emotional/mental),
attendance, grades, socio-economic data (parental income), parental status,
and so on. Most universities will store and aggregate this data under the
umbrella of institutional statistics.
Privacy laws differ from country to country, but generally will prohibit
academics from accessing data that is not relevant to a particular class, course,
or program. Unfortunately, most schools and universities do very little with this
wealth of data, other than possibly producing an annual institutional profile
report. Even a simple analysis of existing institutional data could raise the
profile of potential at-risk students or reveal attendance or assignment
submission patterns that indicate the need for additional support.
What new types of educational data can now be captured and mined?
George Siemens: In terms of learning analytics or educational data-mining,
the growing externalization of learning activity (i.e. capturing how learners
interact with content and the discourse they have around learning materials
as well as the social networks they form in the process) is driven by the
increased attention to online learning. For example, a learning management
system like Moodle or Desire2Learn captures a significant amount of data,
including time spent on a resource, frequency of posting, number of logins, etc.
This data is fairly similar to what Google Analytics or Piwik collects regarding
website traffic. A new generation of tools, such as SNAPP, uses this data to
analyze social networks, degrees of connectivity, and peripheral learners.
Discourse analysis tools, such as those being developed at the Knowledge Media
Institute at the Open University, UK, are also effective at evaluating the
qualitative attributes of discourse and discussions and rate each learner's
contributions by depth and substance in relation to the topic of discussion.
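The notions of "degrees of connectivity" and "peripheral learners" can be sketched with a few lines of code. This is a simplified illustration, not a description of how SNAPP works: the function names, the reply data, and the threshold for "peripheral" are all assumptions made for the example.

```python
from collections import defaultdict


def reply_degrees(replies):
    """Count distinct discussion partners for each learner.

    `replies` is an iterable of (author, replied_to) pairs from a forum.
    """
    partners = defaultdict(set)
    for author, replied_to in replies:
        if author == replied_to:
            continue  # replying to yourself adds no connection
        partners[author].add(replied_to)
        partners[replied_to].add(author)
    return {learner: len(others) for learner, others in partners.items()}


def peripheral_learners(replies, enrolled, max_degree=1):
    """Enrolled learners with at most `max_degree` discussion partners."""
    degrees = reply_degrees(replies)
    return sorted(
        learner for learner in enrolled if degrees.get(learner, 0) <= max_degree
    )


if __name__ == "__main__":
    forum = [("ana", "ben"), ("ben", "cam"), ("cam", "ana"), ("dee", "ana")]
    print(peripheral_learners(forum, {"ana", "ben", "cam", "dee", "eli"}))
```

Passing the full enrollment list matters: a learner who never posts does not appear in the reply data at all, and is exactly the person such an analysis should surface.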
An area of data gathering that universities and schools are largely overlooking
relates to the distributed social interactions learners engage in on a daily basis
through Facebook, blogs, Twitter, and similar tools. Of course, privacy issues
are significant here. However, as we are researching at Athabasca University,
social networks can provide valuable insight into how connected learners are
to each other and to the university. Potential models are already being
developed on the web that would translate well to school settings. For example,
Klout measures influence within a network and Radian6 tracks discussions in
distributed networks.
The existing data gathering in schools and universities pales in comparison to
the value of data mining and learning analytics opportunities that exist in the
distributed social and informational networks that we all participate in on a
daily basis. It is here, I think, that most of the novel insights on learning and
knowledge growth will occur. When we interact in a learning management
system (LMS), we do so purposefully - to learn or to complete an assignment.
Our interaction in distributed systems is more "authentic" and can yield novel
insights into how we are connected, our sentiments, and our needs in relation
to learning success. The challenge, of course, is how to balance concerns of
the Hawthorne effect with privacy.
Discussions about data ownership and privacy lag well behind what is
happening in learning analytics. Who owns learner-produced data? Who owns
the analysis of that data? Who gets to see the results of analysis? How much
should learners know about the data being collected and analyzed?
I believe that learners should have access to the same dashboard for analytics
that educators and institutions see. Analytics can be a powerful tool in learner
motivation - how do I compare to others in this class? How am I doing against
the progress goals that I set? If data and analytics are going to be used for
decision making in teaching and learning, then we need to have important
conversations about who sees what and what are the power structures created
by the rules we impose on data and analytics access.
How can analytics change education?
George Siemens: Education is, today at least, a black box. Society invests
significantly in primary, secondary, and higher education. Unfortunately, we
don't really know how our inputs influence or produce outputs. We don't
know, precisely, which academic practices need to be curbed and which need
to be encouraged. We are essentially swatting flies with a sledgehammer and
doing a fair amount of peripheral damage.
Learning analytics are a foundational tool for informed change in education.
Over the past decade, calls for educational reform have increased, but very
little is understood about how the system of education will be impacted by the
proposed reforms. I sometimes fear that the solution being proposed to what
ails education will be worse than the current problem. We need a means, a
foundation, on which to base reform activities. In the corporate sector,
business intelligence serves this "decision foundation" role. In education, I
believe learning analytics will serve this role. Once we better understand the
learning process - the inputs, the outputs, the factors that contribute to learner
success - then we can start to make informed decisions that are supported by
evidence.
However, we have to walk a fine line in the use of learning analytics. On the
one hand, analytics can provide valuable insight into the factors that influence
learners' success (time on task, attendance, frequency of logins, position
within a social network, frequency of contact with faculty members or
teachers). Peripheral data analysis could include the use of physical services in
a school or university: access to library resources and learning help services. On
the other hand, analytics can't capture the softer elements of learning, such as
the motivating encouragement from a teacher and the value of informal social
interactions. In any assessment system, whether standardized testing or
learning analytics, there is a real danger that the target becomes the object of
learning, rather than the assessment of learning.
With that as a caveat, I believe learning analytics can provide dramatic,
structural change in education. For example, today, our learning content is
created in advance of the learners taking a course in the form of curriculum like
textbooks. This process is terribly inefficient. Each learner has differing levels of
knowledge when they start a course. An intelligent curriculum should adjust
and adapt to the needs of each learner. We don't need one course for 30
learners; each learner should have her own course based on her life experiences,
learning pace, and familiarity with the topic. The content in the courses that
we take should be adaptive, flexible, and continually updated. The black
box of education needs to be opened and adapted to the requirements of each
individual learner.
In terms of evaluation of learners, assessment should be in-process, not at the
conclusion of a course in the form of an exam or a test. Let's say we develop
semantically-defined learning materials and ways to automatically compare
learner-produced artifacts (in discussions, texts, papers) to the knowledge
structure of a field. Our knowledge profile could then reflect how we compare
to the knowledge architecture of a domain - i.e. "you are 64% on your way
to being a psychologist" or "you are 38% on your way to being a statistician."
Basically, evaluation should be done based on a complete profile of an
individual, not only the individual in relation to a narrowly defined subject area.
Programs of study should also include non-school-related learning (prior
learning assessment). A student that volunteers with a local charity or a student
that plays sports outside of school is acquiring skills and knowledge that is
currently ignored by the school system. "Whole-person analytics" is required
where we move beyond the micro-focus of exams. For students that return to
university mid-career to gain additional qualifications, recognition for
non-academic learning is particularly important.
Much of the current focus on analytics relates to reducing attrition or student
dropouts. This is the low-hanging fruit of analytics. An analysis of the signals
learners generate (or fail to - such as when they don't log in to a course) can
provide early indications of which students are at risk for dropping out. By
recognizing these students and offering early interventions, schools can reduce
dropouts dramatically.
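One of the simplest signals mentioned here, a learner who stops logging in, can be checked with a short sketch like the following. The function name, the 14-day threshold, and the sample dates are assumptions for illustration only; a real early-warning system would combine many more signals.

```python
from datetime import date, timedelta


def at_risk_learners(last_login_by_learner, today, max_idle_days=14):
    """Flag learners whose last login is missing or older than the cutoff.

    `last_login_by_learner` maps a learner name to a date, or to None if
    the learner has never logged in.
    """
    cutoff = today - timedelta(days=max_idle_days)
    flagged = [
        learner
        for learner, last_login in last_login_by_learner.items()
        if last_login is None or last_login < cutoff
    ]
    return sorted(flagged)


if __name__ == "__main__":
    logins = {
        "ana": date(2011, 9, 29),
        "ben": date(2011, 9, 1),
        "cam": None,  # enrolled but never logged in
    }
    print(at_risk_learners(logins, today=date(2011, 9, 30)))
```

A list like this is only a prompt for a human conversation. As Siemens notes elsewhere in the interview, the softer elements of learning do not show up in login counts.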
All of this is to say that learning analytics serve as a foundation for informed
change in education, altering how schools and universities create curriculum,
deliver it, assess student learning, provide learning support, and even allocate
resources.
What technologies are behind learning analytics?
George Siemens: Some of the developments in learning analytics track the
development of the web as a whole - including the use of recommender
systems, social network analysis, personalization, and adaptive content. We are
at an exciting cross-over point between innovations in the technology space
and research in university research labs. Language recognition, artificial
intelligence, machine learning, neural networks, and related concepts are being
combined with the growth of social network services, collaborative learning,
and participatory pedagogy.
The combination of technical and social innovations in learning offers huge
potential for a better, more effective learning model. Together with Stephen
Downes and Dave Cormier, I've experimented with "massive open online
courses" over the past four years. This experimentation has resulted in
software that we've developed to encourage distributed learning, while still
providing a loose level of aggregation that enables analytics. Tools like Open
Study take a similar approach: decentralized learning, centralized analytics.
Companies like Grockit and Knewton are creating personalized adaptive
learning platforms. Not to be outdone, traditional publishers like Pearson and
McGraw-Hill are investing heavily in adaptive learning content and are starting
to partner with universities and schools to deliver the content and even
evaluate learner performance. Learning management system providers (such as
Desire2Learn and Blackboard) are actively building analytics options into their
offerings.
Essentially, in order for learning analytics to have a broad impact in education,
the focus needs to move well beyond basic analytics techniques such as those
found in Google Analytics. An integrated learning and knowledge model is
required where the learning content is adaptive, prior learning is included in
assessment, and learning resources are provided in various contexts (e.g. "in
class today you studied Ancient Roman laws, two blocks from where you are
now, a museum is holding a special exhibit on Roman culture"). The profile
of the learner, not pre-planned content, needs to drive curriculum and
learning opportunities.
What are the major obstacles facing education data and analytics?
George Siemens: In spite of the enormous potential they hold to improve
education, learning analytics are not without concerns. Privacy for learners
and teachers is a critical issue. While I see analytics as a means to improve
learner success, opportunities exist to use analytics to evaluate and critique
the performance of teachers. Data access and ownership are equally important
issues: who should be able to see the analysis that schools perform on learners?
Other concerns relate to error-correction in analytics. If educators rely heavily
on analytics, effort should be devoted to evaluating the analytics models and
understanding in which contexts those analytics are not valid.
With regard to the adoption of learning analytics, now is an exceptionally
practical time to explore analytics. The complex challenges that schools
and universities face can, at least partially, be illuminated through analytics
applications.
Data science is a pipeline between academic disciplines
Drew Conway on how data science intersects with research and the social
sciences.
by Audrey Watters
We talk a lot about the ways in which data science affects various businesses,
organizations, and professions, but how are we actually preparing future data
scientists? What training, if any, do university students get in this area? The
answer may be obvious if students focus on math, statistics or hard science
majors, but what about other disciplines?
I recently spoke with Drew Conway (@drewconway) about data science and
academia, particularly with regard to the social sciences. Conway, a PhD
candidate in political science at New York University, will expand on some of
these topics during a session at next month's Strata Conference in New York.
Our interview follows.
How has the work of academia - particularly political science - been affected
by technology, open data, and open source?
Drew Conway: There are fundamentally two separate questions in here, so I
will try to address both of them. First is the question of how academic research
has changed as a result of these technologies. And for my part, I can only really
speak for how they have affected social science research. The open data
movement has impacted research most notably in compressing the time it takes
a researcher to go from the moment of inception ("hmm, that would be
interesting to look at!") to actually looking at data and searching for interesting
patterns. This is especially true of the open data movement happening at the
local, state and federal government levels.
Only a few years ago, the task of identifying, collecting, and normalizing these
data would have taken months, if not years. This meant that a researcher could
have spent all of that time and effort only to find out that their hypothesis was
wrong and that - in fact - there was nothing to be found in a given dataset.
The richness of data made available through open data allows for a much more
rapid research cycle, and hopefully a greater breadth of topics being
researched.
Open source has also had a tremendous impact on how academics do research.
First, open source tools for performing statistical analysis, such as R and
Python, have robust communities around them. Academics can develop and
share code within their niche research area, and as a result the entire
community benefits from their effort. Moreover, the philosophy of open source
has started to enter into the framework of research. That is, academics are
becoming much more open to the idea of sharing data and code at early stages
of a research project. Also, many journals in the social sciences are now
requiring that authors provide replication code and data.
The second piece of the question is how these technologies affect the
dissemination of research. In this case blogs have become the de facto source
for early access to new research, or scientific debate. In my own discipline, The
Monkey Cage is most political scientists' first source for new research. What
is fantastic about the Monkey Cage, and other academic blogs, is that they are
not only read by other academics. Journalists, policy makers, and engaged
citizens can also interact with academics in this way - something that was
not possible before these academic blogs became mainstream.
Let's sidestep the history of the discipline and debates about what constitutes a
hard or soft science. But as its name suggests, "political science" has long been
interested in models, statistics, quantifiable data and so on. Has the discipline
been affected by the rise of data science and big data?
Drew Conway: The impact of big data has been slow, but there are a few
champions who are doing really interesting work. Political science, at its core,
is most interested in understanding how people collectively make decisions,
and as researchers we attempt to build models and collect data to that end. As
such, the massive data on social interactions being generated by social media
services like Facebook and Twitter present unprecedented opportunities for
research.
While some academics have been able to leverage this data for interesting
work, there seems to be a clash between these services' terms of service and
the desire of scientists to collect data and generate reproducible findings
from this data. I wrote about my own experience using Twitter data for
research, but there are many other researchers from all disciplines that have run
into similar problems.
With respect to how academics have been impacted by data science, I think
the impact has mostly flowed in the other direction. One major component of
data science is the ability to extract insight from data using tools from math,
statistics and computer science. Most of this is informed by the work of
academics, and not the other way around. That said, as more academic
researchers become interested in examining large-scale datasets (on the order of
Twitter or Facebook), many of the technical skills of data science will have to be
acquired by academics.
How does data science change the work of the grad student - in terms of
necessary skills but also in terms of access to information/informants?
Drew Conway: Unfortunately, having sophisticated technical skills, i.e.,
those of a data scientist, is still undervalued in academia. Being involved in
open-source projects, or producing statistical software is not something that
will help a graduate student land a high-profile academic job, or help a young
faculty member get tenure. Publications are still the currency of success, and
that - as I mentioned - clashes with the data-sharing policies of many large
social media services.
Graduate students and faculty do themselves a disservice by not actively
staying technically relevant. As so much more data gets pushed into the open, I
believe basic data hacking skills - scraping, cleaning, and visualization -
will be prerequisites to any academic research project. But, then again, I've
always been a weird academic, double majoring in computer science and
political science as an undergrad.
How docs thc risc oj data scicncc and its sprcad bcyond thc rca|n oj nath and
statistics changc thc wor|d oj tcchno|ogy, cithcr jron an acadcnic or cntrcprc-
ncuria| pcrspcctivc?
Drew Conway: From an entrepreneurial perspective I think it has dramatically
changed the way new businesses think about building a team. Whether
it is at Strata, or any of the other conferences in the same vein, you will see a
glut of job openings or panels on how to "build a data team." At present, people
who have the blend of skills I associate with data science (hacking, math/stats,
and substantive expertise) are a rare commodity. This dearth of talent,
however, will be short-lived.
I see in my undergrads many more students who grew up with data and
computing as ubiquitous parts of their lives. They're interested in pursuing routes
of study that provide them with data science skills, both in terms of technical
competence, and also in creative outlets such as interactive design.
How does "human subjects compliance" work when you're talking about "data"
versus "people" - that's an odd distinction, of course, and an inaccurate one at
that. But I'm curious if some of the rules and regulations that govern research on
humans account for research on humans' data.
Drew Conway: I think it is an excellent question, and one that academe is
still struggling to deal with. In some sense, mining social data that is freely
available on the Internet provides researchers a way to sidestep traditional IRB
regulation. I don't think there's anything ethically questionable about
recording observations that are freely made public. That's akin to observing the
meanderings of people in a park.
Where things get interesting is when researchers use crowdsourcing
technology, like Mechanical Turk, as a survey mechanism. This is much more
of a gray area. I suppose, technically, the Amazon terms of service cover
researchers, but ethically this is something that would seem to me to fall within
the scope of an IRB. Unfortunately, the likely outcome is that institutions
won't attempt to understand the difference until some problem arises.
This interview was edited and condensed.
Big data and open source unlock genetic secrets
Charlie Quinn is mixing data to advance genetic discovery.
by Alex Howard
The world is experiencing an unprecedented data deluge, a reality that my
colleague Edd Dumbill described as another "industrial revolution" at
February's Strata Conference. Many sectors of the global economy are waking up to
the need to use data as a strategic resource, whether in media, medicine, or
moving trucks. Open data has been a major focus of Gov 2.0, as federal and
state governments move forward with creating new online platforms for open
government data.
The explosion of data requires new tools and management strategies. These
new approaches include more than technical evolution, as a recent
conversation with Charlie Quinn, director of data integration technologies at the
Benaroya Research Institute, revealed: they involve cultural changes that create
greater value by sharing data between institutions. In Quinn's field, genomics,
big data is far from a buzzword, with scanned sequences now rating on the
terabyte scale.
In the interview below, Quinn shares insights about applying open source to
data management and combining public data with experimental data. You can
hear more about open data and open source in advancing personalized
medicine from Quinn at the upcoming OSCON Conference.
How did you become involved in data science?
Charlie Quinn: I got into the field through a friend of mine. I had been doing
data mining for fraud on credit cards and the principal investigator, who I work
with now, was going to work in Texas. We had a novel idea that to build the
tools for researchers, we should hire software people. What had happened in
the past was you had bioinformaticians writing scripts. They found the
programs that they needed did about 80% of what they wanted, and they had a
hard time gaining the last 20%. So we had had a talk way back when saying,
"if you really want proper software tools, you ought to hire software people to
build them for you." He called my boss to come on down and take a look. I
did, and the rest is history.
You've said that there's a "data explosion" in genomics research. What do you
mean? What does this mean for your field?
Charlie Quinn: It's like the difference between analog and digital technology.
The amount of data you'd have with analog is still substantial, but as we move
toward digital, it grows exponentially. If we're looking at technology in gene
expression values, which is what we've been focusing on in genomics, it's
about a gigabyte per scan. As we move into doing targeted RNA sequencing,
or even high frequency sequencing, if you take the raw output from the
sequence, you're looking at terabytes per scan. It's orders of magnitude more
data.
What that means from a practical perspective is there's more data being
generated than just for your request. There's more data being generated than a
single researcher could possibly ever hope to get their head wrapped around.
Where the data explosion becomes interesting is how we engage researchers
to take data they're generating and share it with others, so that we can reuse
data, and other people might be able to find something interesting in it.
Health IT at OSCON 2011: The conjunction of open source and open
data with health technology promises to improve creaking infrastructure and
give greater control and engagement for patients. These topics will be explored
in the healthcare track at OSCON (July 25-29 in Portland, Ore.).
Save 20% on registration with the code OS11RAD
What are the tools you're using to organize and make sense of all that data?
Charlie Quinn: A lot of it's been homegrown so far, which is a bit of an issue
as you start to integrate with other organizations because everybody seems to
have their own homegrown system. There's an open source group in Seattle
called Lab Key, which a lot of people have started to use. We're taking another
look at them to see if we might be able to use some of their technology to help
us move forward in organizing the backend. A lot of this is so new. It's hard
to keep up with where we're at and quite often, we're outpacing it. It's a
question of homegrown and integrating with other applications as we can.
How does open source relate to that work?
Charlie Quinn: We try and use open source as much as we can. We try and
contribute back where we can. We haven't been contributing back anywhere
near as much as we'd like to, but we're going to try and get into that more.
We're huge proponents not only of open source, but of open data. What we've
been doing is going around and trying to convince people that we understand
they have to keep data private up to a certain point, but let's try and release as
much data as we can as early as we can.
When we go back to talking about the explosion of data, if we're looking at
Gene X and we happen to see something that might be interesting on Y or Z,
we can post a quick discovery note or a short blurb. In that way, you're trying
to push ideas out and take the data behind those ideas and make it public.
That's where I think we're going to get traction: trying to share data earlier
rather than later.
At OSCON, you'll talk about how experimental data combines with public
data. When did you start folding the two together?
Charlie Quinn: We've been playing with it for a while. What we're hoping
to do is make more of it public, now that we're getting the institutional support
for it. Years ago, we went and indexed all of the abstracts at PubMed by gene
so that when people went to a text engine, you could type in your query and
you would get a list of genes, as opposed to a list of articles. That helped
researchers find what they were looking for, and that's just leveraging openly
available data. Now, with NIH's mandate for more people to publish their
results back into repositories, we're downloading that data and combining it
with the data we have internally. Now, as we go across a project or across a
disease trying to find how a gene is acting or how a protein is acting, it's just
giving us a bigger dataset to work with.
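The gene-keyed index Quinn describes can be pictured as an ordinary inverted index with one extra step: matching abstracts are rolled up into the genes they mention. The sketch below is only an illustration of that idea, not the institute's actual system. The sample abstracts, the gene vocabulary, and every function name are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (abstract_id, text). Real input would come
# from an export of a literature database.
ABSTRACTS = [
    ("a1", "TP53 mutations alter the response to BRCA1 loss."),
    ("a2", "Expression of BRCA1 in breast tissue."),
    ("a3", "Interferon signalling and STAT1 in autoimmune disease."),
]

# Hypothetical gene vocabulary; a real one would be a curated symbol list.
GENE_SYMBOLS = {"TP53", "BRCA1", "STAT1"}


def tokenize(text):
    """Split text into upper-cased alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text.upper())


def build_indexes(abstracts, gene_symbols):
    """Return (term -> abstract ids, abstract id -> genes mentioned)."""
    term_index = defaultdict(set)
    genes_by_abstract = {}
    for abstract_id, text in abstracts:
        tokens = set(tokenize(text))
        for token in tokens:
            term_index[token].add(abstract_id)
        genes_by_abstract[abstract_id] = tokens & gene_symbols
    return term_index, genes_by_abstract


def genes_for_query(query, term_index, genes_by_abstract):
    """Genes mentioned in abstracts matching every query term, most frequent first."""
    terms = tokenize(query)
    if not terms:
        return []
    matching = set.intersection(*(term_index.get(t, set()) for t in terms))
    counts = defaultdict(int)
    for abstract_id in matching:
        for gene in genes_by_abstract[abstract_id]:
            counts[gene] += 1
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts, key=lambda g: (-counts[g], g))


if __name__ == "__main__":
    terms, genes = build_indexes(ABSTRACTS, GENE_SYMBOLS)
    print(genes_for_query("autoimmune disease", terms, genes))
```

The design choice worth noticing is that the query still runs against article text; only the result is re-keyed by gene, which is what turns "a list of articles" into "a list of genes."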
What are some of the challenges you've encountered in your work?
Charlie Quinn: The issues we've had are with the quality of the datasets in
the public repositories. You need to hire a curator to validate if the data is going
to be usable or not, to make sure it's comparable to the data that we want to
use it with.
What's the future of open data in research and personalized medicine?
Charlie Quinn: We're going to be seeing multiple tiers of data sharing. In the
long run, you're going to have very well curated public repositories of data.
We're a fair ways away from there in reality because there's still a lot of inertia
against doing that within the research community. The half-step to get there
will be large project consortiums where we start sharing data
inter-institutionally. As people get more comfortable with that, we'll be able to open it up
to a wider audience.
This interview was edited and condensed.
Photo: Replicating Nanomachines by jurvetson, on Flickr
Visualization deconstructed: Mapping Facebook’s
friendships
A deep look at Paul Butler's popular Facebook visualization.
by Sébastien Pierre
In the first post in Radar's new "visualization deconstructed" series, I talked
about how data visualization originated from cartography (which some now
just call "mapping"). Cartography initially focused on mapping physical
spaces, but at the end of the 20th century we created and discovered new spaces
that were made possible by the Internet. By abstracting away the constraints
of the physical space, social networks such as Facebook emerged and opened
up new territories, where topology is primarily defined by the social fabric
rather than physical space. But is this fabric completely de-correlated from the
physical space?
Mapping Facebook’s friendships
Last December, Paul Butler, an intern on Facebook's data infrastructure
engineering team, posted a visualization that examined a subset of the relations
between Facebook users. Users were positioned in their respective cities and
arcs denoted friendships.
Paul extracted the data and started playing with it. As he put it:
Visualizing data is like photography. Instead of starting with a blank canvas,
you manipulate the lens used to present the data from a certain angle.
There is definitely discovery involved in the process of creating a visualization,
where by giving visual attributes to otherwise invisible data, you create a form
for data to embody.
The most striking discovery that Paul made while creating his visualization
was the unraveling of a very detailed map of the world, including the shapes
of the continents (remember that only lines representing relationships are
drawn).
If you compare the Facebook visualization with NASA's world at night
pictures, you can see how close the two maps are, except for Russia and parts of
China. It seems that Facebook has a big growth opportunity in these regions!
So let's have a look at Paul's visualization:
• A complex network of arcs and lines does a great job communicating the
notions of human activity and organic social fabric.
• The choice of color palette works very well, as it immediately makes us
think about night shots of earth, where the light of the city makes human
activity visible. The color contrast is well balanced, so that we don't see
too much blurring or bleeding of colors.
• Choosing to draw only lines and arcs makes the visualization very
interesting, as at first sight, we would think that the outlines of continents and
the cities have been pre-drawn. Instead, they emerge from the drawing of
arcs representing friendships between people in different cities, and we
can make the interesting discovery of a possible correlation between
physical location and social friendships on the Internet.
Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif.,
will focus on the business and practice of data. The conference will provide
three days of training, breakout sessions, and plenary discussions, along with
an Executive Summit, a Sponsor Pavilion, and other events showcasing the
new data ecosystem.
Save 30% off registration with the code STR11RAD
Overall, this is a great visualization that had a lot of success last December,
being mentioned in numerous blogs and liked by more than 2,000 people on
Facebook. However, I can see a couple ways to improve it and open up new
possibilities:
• Play with the color scale: By using a less linear gradient as a color scale,
or by using more than two colors, some other patterns may emerge. For
instance, by using a clearer cut-off in the gradient, we could better see
relations with a weight above a specific threshold. Also, using more than
one color in the gradient might reveal the predominance of one color over
another in specific regions. Again, it's something to try, and we'll probably
lose some of the graphic appeal in favor of (perhaps) more insights into
the data.
• Play with the drawing of the lines: Because the lines are spread all over
the map, it's a little difficult to identify "streams" of lines that all flow in
the same direction. It would be interesting to draw the lines in three parts,
where the middle part would be shared by many lines, creating "pipelines"
of relationships from one region to another. Of course, this would require
a lot of experimentation and it might not even be possible with the tools
used to draw the visualization.
• Use a different reference to position cities: Cities in the visualization
are positioned using their geographical position, but there are other ways
they could be placed. For instance, we could position them on a grid,
ordered by their population, or GDP. What kind of patterns and trends
would emerge by changing this perspective?
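Two of the suggestions above are easy to prototype before committing to a full redraw. The following is a minimal sketch, not the code behind the original visualization: it compares a plain linear gradient with one that has a hard cut-off, and it places cities on a grid ordered by some attribute. The colors, the threshold, the city records, and all function names are hypothetical placeholders.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB tuples for t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def linear_scale(weight, low=(0, 0, 64), high=(255, 255, 255)):
    """Plain linear gradient: weight in [0, 1] maps evenly from low to high."""
    t = min(max(weight, 0.0), 1.0)
    return lerp_color(low, high, t)


def cutoff_scale(weight, threshold=0.5, low=(0, 0, 64), high=(255, 255, 255)):
    """Gradient with a hard cut-off (threshold must be below 1).

    Weights under the threshold all collapse to the low color, and the
    remaining range is stretched over the full gradient, so only the
    stronger relations stand out.
    """
    t = min(max(weight, 0.0), 1.0)
    if t < threshold:
        return low
    return lerp_color(low, high, (t - threshold) / (1.0 - threshold))


def grid_positions(cities, key, columns):
    """Place cities on a grid in descending order of `key`.

    Cells are filled left to right, then top to bottom; each city maps to
    a (column, row) pair.
    """
    ordered = sorted(cities, key=lambda c: c[key], reverse=True)
    return {c["name"]: (i % columns, i // columns) for i, c in enumerate(ordered)}


# Hypothetical records with made-up values, purely to exercise the functions.
CITIES = [
    {"name": "A", "population": 5},
    {"name": "B", "population": 9},
    {"name": "C", "population": 2},
]

if __name__ == "__main__":
    for w in (0.2, 0.5, 0.75, 1.0):
        print(w, linear_scale(w), cutoff_scale(w))
    print(grid_positions(CITIES, "population", columns=2))
```

Swapping `key` from population to GDP (or anything else in the records) changes the layout without touching the drawing code, which is the point of separating position from geography.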
Static requires storytelling
In last week's post, I looked at an interactive visualization, where users can
explore the data and its different representations. With the Facebook data, we
have a static visualization where we can only look, not touch; it's like gazing
at the stars.
Although a static visualization has the potential to evolve into an interactive
visualization, I think creating a static image involves a little bit more care.
Interactive visualizations can be used as exploration tools, but static
visualizations need to present the insight the data explorer had when creating the
visualization. A static image has to tell a story to be interesting.
Data science democratized
With new tools arriving, data science may soon be in the hands of
non-programmers.
by Mac Slocum
I am not a data scientist. Nor am I a programmer. I've got an inclination toward
technology, but my core skill set very much resides in the humanities domain.
I offer this biographical sketch up front because I think I have a lot in common
with the people who work around and near tech spaces: academics, business
users, entertainment professionals, editors, writers, producers, etc. The
interesting thing about data science (and the reason why I'm glad Mike Loukides
wrote "What is data science?") is that vast stores of data have relevance to all
sorts of folks, including people like me who lack a pure technical pedigree.
Data science's democratizing moment will come when its associated tools can
be picked up by tech-savvy non-programmers. I'm thinking of the HTML
coders and the Excel power users: the people who aren't full-fledged
mechanics, but they're skilled enough to pop the hood and change their own oil.
I'm encouraged because that democratizing moment is close. I saw a demo
recently that connects a web-based spreadsheet with huge data stores and
cloud infrastructure. This type of system (and I'm sure there are many others
in the pipeline) takes a process that once had immense technical and financial
barriers and makes it almost as easy as phpMyAdmin. That's an important
step. Within a year or two, I expect to see further usability improvements in
these tools. A data science dashboard that mimics Google Analytics can't be
far off.
During that demo, Datameer CTO Stefan Groschupf told me about a fun
Twitter inquiry he instigated. Groschupf had previously gathered around 45
million tweets and fed them into EC2. Later, over the course of a two-beer
evening, Groschupf poked at that data to see if any interesting patterns turned
up when comparing two vastly different hashtags (#justinbieber vs.
#teaparty). He used his company's system to parse the data, then he fed results
through a free visualization tool.
Here's the #justinbieber cluster:
And here's the #teaparty cluster:
As you can see, the #teaparty folks are far more connected than their distant
#justinbieber cousins. That's interesting, but not really surprising. The
political world has more connective tissue than of-the-moment entertainment.
But that specific conclusion isn't what's important here. Even if your end-point
is inevitable, a data-driven conversation has more power and resonance than
an anecdotal observation. Groschupf didn't tell me the Tea Party movement
is more connected. He showed me.
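One simple way to put a number on "more connected" is graph density: the share of possible links between members of a group that actually exist. The sketch below assumes that measure for illustration only; the article does not say which metric or tool was used for the hashtag comparison, and the two toy edge lists are invented.

```python
def density(nodes, edges):
    """Undirected graph density: edges present divided by edges possible.

    Duplicate edges (in either direction) and self-loops are ignored.
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    return 2 * len(unique) / (n * (n - 1))


# Hypothetical toy clusters: (accounts, links between them). These are
# invented for illustration and are not derived from any real tweets.
TIGHT_CLUSTER = (
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")],
)
LOOSE_CLUSTER = (
    ["w", "x", "y", "z"],
    [("w", "x")],
)

if __name__ == "__main__":
    print("tight:", density(*TIGHT_CLUSTER))
    print("loose:", density(*LOOSE_CLUSTER))
```

A single number like this loses the shape a picture conveys, but it makes the comparison between two groups explicit and repeatable.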
Significant implications emerge when you can bounce a question, even an
innocuous one, against a huge storehouse of data. If someone like me can plug
questions into a system and have it do the same kind of processing once
reserved for a skilled minority, that will inspire me to ask a lot more questions.
It'll inspire a lot of other people to ask questions, too. And some of those
questions might even be important.
That's a big deal. Myself and others may never become full-fledged data
scientists, but having access to easy-to-use data tools will get people thinking and
exploring in all sorts of domains.
CHAPTER 4
The Business of Data
There’s no such thing as big data
Even if you have petabytes of data, you still need to know how to ask the
right questions to apply it.
by Alistair Croll
"You know," said a good friend of mine last week, "there's really no such thing
as big data."
I sighed a bit inside. In the past few years, cloud computing critics have said
similar things: that clouds are nothing new, that they're just mainframes, that
they're just painting old technologies with a cloud brush to help sales. I'm wary
of this sort of techno-Luddism. But this person is sharp, and not usually prone
to verbal linkbait, so I dug deeper.
He's a ridiculously heavy traveler, racking up hundreds of thousands of miles
in the air each year. He's the kind of flier airlines dream of: loyal, well-heeled,
and prone to last-minute, business-class trips. He is exactly the kind of person
an airline needs to court aggressively, one who represents a disproportionately
large amount of revenue. He's an outlier of the best kind. He'd been a
top-ranked passenger with United Airlines for nearly a decade, using their Mileage
Plus program for everything from hotels to car rentals.
And then his company was acquired.
The acquiring firm had a contractual relationship with American Airlines, a
competitor of United with a completely separate loyalty program. My friend's
air travel on United and its partner airlines dropped to nearly nothing.
He continued to book hotels in Shanghai, rent cars in Barcelona, and buy meals
in Tahiti, and every one of those transactions was tied to his loyalty program
with United. So the airline knew he was traveling, just not with them.
Astonishingly, nobody ever called him to inquire about why he'd stopped
flying with them. As a result, he's far less loyal than he was. But more importantly,
United has lost a huge opportunity to try to win over a large company's
business, with a passionate and motivated inside advocate.
And this was his point about big data: that given how much traditional
companies put it to work, it might as well not exist. Companies have countless ways
they might use the treasure troves of data they have on us. Yet all of this data
lies buried, sitting in silos. It seldom sees the light of day.
When a company does put data to use, it's usually a disruptive startup. Zappos
and customer service. Amazon and retailing. Craigslist and classified ads.
Zillow and house purchases. LinkedIn and recruiting. eBay and payments.
Ryanair and air travel. One by one, industry incumbents are withering under the
harsh light of data.
Strata Jumpstart New York 2011, being held on September 19, is a crash
course in how to manage the data deluge that's transforming traditional
business practices across the board. Jumpstart is an intense, day-long deep dive for
managers, strategists, and entrepreneurs who are putting the promise of big
data into practice.
Save 30% on registration with the code STN11RAD
Big data and the innovator’s dilemma
Large companies with entrenched business models tend to cling to their
buggy-whips. They have a hard time breaking their own business models, as Clay
Christensen so clearly stated in "The Innovator's Dilemma," but it's too easy
to point the finger at simple complacency.
Early-stage companies have a second advantage over more established ones:
they can ask for forgiveness instead of permission. Because they have less to
lose, they can make risky bets. In the early days of PayPal, the company could
skirt regulations more easily than Visa or Mastercard, because it had far less
to fear if it was shut down. This helped it gain market share while established
credit-card companies were busy with paperwork.
The real problem is one of asking the right questions.
At a big data conference run by The Economist this spring, one of the speakers
maue a gieat point: Archimedes had taken baths before.
(Quick historical recap: In an almost certainly apocryphal tale, Hiero of
Syracuse had asked Archimedes to devise a way of measuring density, an indicator
of purity, in irregularly shaped objects like gold crowns. Archimedes realized
that the level of water in a bath changed as he climbed in, making it an indicator
of volume. Eureka!)
The speaker's point was this: it was the question that prompted Archimedes'
realization.
Small, agile startups disrupt entire industries because they look at traditional
problems with a new perspective. They're fearless, because they have less to
lose. But big, entrenched incumbents should still be able to compete, because
they have massive amounts of data about their customers, their products, their
employees, and their competitors. They fail because often they just don't know
how to ask the right questions.
In a recent study, McKinsey found that by 2018, the U.S. will face a shortage
of 1.5 million managers who are fluent in data-based decision making. It's a
lesson not lost on leading business schools: several of them are introducing
business courses in analytics.
Ultimately, this is what my friend's airline example underscores. It takes an
employee, deciding that the loss of high-value customers is important, to run
a query of all their data and find him, and then turn that into a business
advantage. Without the right questions, there really is no such thing as big data,
and today, it's the upstarts that are asking all the good questions.
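To make "run a query of all their data and find him" concrete, here is a minimal sketch of what such a question might look like once someone decides to ask it. Everything in it is assumed for illustration: the `activity` table, its columns, the sample rows, and the thresholds are hypothetical and do not describe any airline's real loyalty system.

```python
import sqlite3

# Hypothetical loyalty-program ledger: one row per member, year, and
# spending category. All values are invented sample data.
SAMPLE_ROWS = [
    ("m1", 2009, "flight", 40000), ("m1", 2009, "hotel", 6000),
    ("m1", 2010, "flight", 500), ("m1", 2010, "hotel", 7000),
    ("m1", 2010, "car", 1500),
    ("m2", 2009, "flight", 30000), ("m2", 2010, "flight", 32000),
    ("m3", 2009, "flight", 800), ("m3", 2010, "hotel", 300),
]

# Members who flew a lot last year, barely fly now, yet are still
# spending with partners (so they are clearly still traveling).
QUERY = """
SELECT member_id
FROM activity
GROUP BY member_id
HAVING SUM(CASE WHEN year = :prev AND category = 'flight'
                THEN amount ELSE 0 END) >= :high_value
   AND SUM(CASE WHEN year = :cur AND category = 'flight'
                THEN amount ELSE 0 END)
       < :drop_ratio * SUM(CASE WHEN year = :prev AND category = 'flight'
                                THEN amount ELSE 0 END)
   AND SUM(CASE WHEN year = :cur AND category <> 'flight'
                THEN amount ELSE 0 END) > 0
ORDER BY member_id
"""


def build_demo_db():
    """Create an in-memory database holding the sample ledger."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity "
        "(member_id TEXT, year INTEGER, category TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def lapsed_high_value_fliers(conn, prev, cur, high_value=20000, drop_ratio=0.25):
    """Return ids of high-value fliers whose flight spend collapsed."""
    params = {
        "prev": prev,
        "cur": cur,
        "high_value": high_value,
        "drop_ratio": drop_ratio,
    }
    return [row[0] for row in conn.execute(QUERY, params)]


if __name__ == "__main__":
    print(lapsed_high_value_fliers(build_demo_db(), prev=2009, cur=2010))
```

The query itself is short. The hard part, as the story suggests, is someone deciding that this is a question worth asking.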
When it comes to big data, you either use it or lose.
This is what we're hoping to explore at Strata JumpStart in New York next
month. Rather than taking a vertical look at a particular industry, we're looking
at the basics of business administration through a big data lens. We'll be
looking at applying big data to HR, strategic planning, risk management, competitive
analysis, supply chain management, and so on. In a world flooded by too much
data and too many answers, tomorrow's business leaders need to learn how
to ask the right questions.
Building data startups: Fast, big, and focused
Low costs and cloud tools are empowering new data startups.
by Michael Driscoll
This is a written follow-up to a talk presented at a recent Strata online event.
A new breed of startup is emerging, built to take advantage of the rising tides
of data across a variety of verticals and the maturing ecosystem of tools for its
large-scale analysis.
These are data startups, and they are the sumo wrestlers on the startup stage.
The weight of data is a source of their competitive advantage. But like their
sumo mentors, size alone is not enough. The most successful of data startups
must be fast (with data), big (with analytics), and focused (with services).
Setting the stage: The attack of the exponentials
The question of why this style of startup is arising today, versus a decade ago,
owes to a confluence of forces that I call the Attack of the Exponentials. In
short, over the past five decades, the cost of storage, CPU, and bandwidth has
been exponentially dropping, while network access has exponentially
increased. In 1980, a terabyte of disk storage cost $14 million. Today,
it's at $30 and dropping. Classes of data that were previously economically
unviable to store and mine, such as machine-generated log files, now represent
prospects for profit.
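The two price points above are enough to back out how fast the curve implies storage has been getting cheaper. The sketch below assumes a constant exponential decline between them and takes "today" to be 2011, the book's publication year; both are assumptions added for this calculation, not figures from the talk.

```python
import math


def halving_time(cost_then, cost_now, years):
    """Years for the cost to halve, assuming a constant exponential decline."""
    halvings = math.log2(cost_then / cost_now)
    return years / halvings


def annual_decline(cost_then, cost_now, years):
    """Fractional cost decrease per year under the same assumption."""
    return 1 - (cost_now / cost_then) ** (1 / years)


# Figures quoted in the text; the 31-year span (1980 to 2011) is an assumption.
COST_1980 = 14_000_000
COST_TODAY = 30
SPAN_YEARS = 2011 - 1980

if __name__ == "__main__":
    print("halving time (years):", halving_time(COST_1980, COST_TODAY, SPAN_YEARS))
    print("annual decline:", annual_decline(COST_1980, COST_TODAY, SPAN_YEARS))
```

Under those assumptions the price of a terabyte halves roughly every year and two-thirds, which is a drop of about a third each year.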
At the same time, these technological forces are not symmetric: CPU and
storage costs have fallen faster than those of network and disk IO. Thus data is
heavy; it gravitates toward centers of storage and compute power in proportion
to its mass. Migration to the cloud is the manifest destiny for big data, and the
cloud is the launching pad for data startups.
Leveraging the big data stack
As the foundational layer in the big data stack, the cloud provides the scalable
persistence and compute power needed to manufacture data products.
At the middle layer of the big data stack is analytics, where features are
extracted from data, and fed into classification and prediction algorithms.
Finally, at the top of the stack are services and applications. This is the level
at which consumers experience a data product, whether it be a music
recommendation or a traffic route prediction.
Let's take each of the layers and discuss the competitive axes at each.
The competitive axes and representative technologies on the Big Data stack are
illustrated here. At the bottom tier of data, free tools are shown in red (MySQL,
Postgres, Hadoop), and we see how their commercial adaptations (InfoBright,
Greenplum, MapR) compete principally along the axis of speed, offering faster
processing and query times. Several of these players are pushing up towards the
second tier of the data stack, analytics. At this layer, the primary competitive
axis is scale: few offerings can address terabyte-scale data sets, and those that do
are typically proprietary. Finally, at the top layer of the big data stack lie the
services that touch consumers and businesses. Here, focus within a specific sector,
combined with depth that reaches downward into the analytics tier, is the defining
competitive advantage.
Fast data
At the base of the big data stack, where data is stored, processed, and queried,
the dominant axis of competition was once scale. But as cheaper commodity
disks and Hadoop have effectively addressed scalable persistence and
processing, the focus of competition has shifted toward speed. The demand for
faster disks has led to an explosion in interest in solid-state disk firms, such as
Fusion-IO, which went public recently. And several startups, most notably
MapR, are promising faster versions of Hadoop.
Fusion-IO and MapR represent another trend at the data layer: commercial
technologies that challenge open source or commodity offerings on an
efficiency basis, namely watts or CPU cycles consumed. With energy costs driving
between one-third and one-half of data center operating costs, these
efficiencies have a direct financial impact.
Finally, just as many large-scale, NoSQL data stores are moving from disk to
SSD, others have observed that many traditional, relational databases will soon
be entirely in memory. This is particularly true for applications that require
repeated, fast access to a full set of data, such as building models from
customer-product matrices. This brings us to the second tier of the big data stack,
analytics.
Big analytics
At the second tier of the big data stack, analytics is the brains to cloud
computing's brawn. Here, however, the speed is less of a challenge; given an
addressable data set in memory, most statistical algorithms can yield results in
seconds. The challenge is scaling these out to address large datasets, and
rewriting algorithms to operate in an online, distributed manner across many
machines.
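A small example of what "online, distributed" means in practice: mean and variance can be kept as a running summary that is updated one value at a time and merged across machines, so no node ever needs the full dataset. The sketch below uses the standard Welford update and the pairwise merge formula; the class name and the sample partitions are illustrative choices, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations, updatable incrementally."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        """Online step (Welford's update): fold in one value, O(1) memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Distributed step: combine two partial summaries into a new one."""
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(values):
    """Stream through one partition's values and return its summary."""
    stats = RunningStats()
    for value in values:
        stats.add(value)
    return stats


if __name__ == "__main__":
    # Two hypothetical partitions, as if held on separate machines.
    left = summarize([1, 2, 3, 4])
    right = summarize([5, 6, 7, 8, 9, 10])
    merged = left.merge(right)
    print(merged.n, merged.mean, merged.variance)
```

Only three numbers per partition cross the network, regardless of how many rows each machine holds.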
Because data is heavy, and algorithms are light, one key strategy is to push
code deeper to where the data lives, to minimize network IO. This often
requires a tight coupling between the data storage layer and the analytics, and
algorithms often need to be re-written as user-defined functions (UDFs) in a
language compatible with the data layer. Greenplum, leveraging its Postgres
roots, supports UDFs written in both Java and R. Following Google's
BigTable, HBase is introducing coprocessors in its 0.92 release, which allows Java
code to be associated with data tablets, and minimize data transfer over the
network. Netezza pushes even further into hardware, embedding an array of
functions into FPGAs that are physically co-located with the disks of its storage
appliances.
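The effect of pushing code to the data can be shown with a toy simulation. The sketch below does not use any Greenplum, HBase, or Netezza API. It simply models partitions as in-memory lists and counts how many records would have to cross the network under each strategy. The partition contents and function names are hypothetical.

```python
# Hypothetical partitions: each inner list stands for rows stored on one node.
PARTITIONS = [
    [("shoes", 40.0), ("hats", 15.0), ("shoes", 60.0)],
    [("hats", 20.0), ("shoes", 30.0)],
    [("bags", 90.0), ("hats", 5.0), ("bags", 10.0), ("shoes", 25.0)],
]


def local_totals(rows):
    """The 'UDF': reduce a set of rows to per-key totals."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0.0) + amount
    return totals


def combine(partials):
    """Merge per-partition totals at the coordinator."""
    result = {}
    for partial in partials:
        for key, amount in partial.items():
            result[key] = result.get(key, 0.0) + amount
    return result


def ship_data(partitions):
    """Baseline: move every row to the coordinator, then aggregate there.

    Returns (totals, number of records moved).
    """
    moved = [row for part in partitions for row in part]
    return local_totals(moved), len(moved)


def ship_code(partitions):
    """Run the aggregation next to each partition and move only the results.

    Returns (totals, number of records moved).
    """
    partials = [local_totals(part) for part in partitions]
    return combine(partials), sum(len(p) for p in partials)


if __name__ == "__main__":
    print("ship data:", ship_data(PARTITIONS))
    print("ship code:", ship_code(PARTITIONS))
```

With only a handful of rows the saving is small, but the records moved by `ship_code` are bounded by the number of distinct keys per partition, while `ship_data` grows with every row stored.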
The field of what's alternatively called business or predictive analytics is
nascent, and while a range of enabling tools and platforms exist (such as R, SPSS,
and SAS), most of the algorithms developed are proprietary and
vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling
analytical services, such as recommendation engines, that interoperate
across data platforms. But in the near-term, consultancies like Accenture and
McKinsey are positioning themselves to provide big analytics via billable
hours.
Outside of consulting, firms with analytical strengths push upward, surfacing
focused products or services to achieve success.
Strata Conference New York 2011, being held Sept. 22-23, covers the latest
and best tools and technologies for data science, from gathering, cleaning,
analyzing, and storing data to communicating data intelligence effectively.
Save 30% on registration with the code STN11RAD
Focused services
The top of the big data stack is where data products and services directly touch
consumers and businesses. For data startups, these offerings more frequently
take the form of a service, offered as an API rather than a bundle of bits.
BillGuard is a great example of a startup offering a focused data service. It
monitors customers' credit card statements for dubious charges, and even
leverages the collective behavior of users to improve its fraud predictions.
Several startups are working on algorithms that can crack the content relevance
nut, including Flipboard and News.me. Klout delivers a pure data service that
uses social media activity to measure online influence. My company,
Metamarkets, crunches server logs to provide pricing analytics for publishers.
For data startups, data processes and algorithms define their competitive
advantage. Poor predictions, whether of fraud, relevance, influence, or price,
will sink a data startup, no matter how well-designed their web UI or mobile
application.
114 | Chapter 4:The Business of Data
Focuseu uata seivices aien`t limiteu to staitups: LinkeuIn`s People You May
Know anu FouiSguaie`s Exploie leatuie enhance engagement ol theii com-
panies` coie piouucts, Lut only when they coiiectly suggest people anu places.
Democratizing big data
The axes of strategy in the big data stack show analytics to be squarely at the
center. Data platform providers are pushing upwards into analytics to
differentiate themselves, touting support for fast, distributed code execution
close to the data. Traditional analytics players, such as SAS and SAP, are
expanding their storage footprints and challenging the need for alternative data
platforms as staging areas. Finally, data startups and many established firms
are creating services whose success hinges directly on proprietary analytics
algorithms.
The emergence of data startups highlights the democratizing consequences of
a maturing big data stack. For the first time, companies can successfully build
offerings without deep infrastructure know-how and focus at a higher level,
developing analytics and services. By all indications, this is a democratic force
that promises to unleash a wave of innovation in the coming decade.
Data markets aren’t coming: They’re already here
Gnip's Jud Valeski on data resellers, end-user responsibility, and the
threat of black markets.
by Julie Steele
Jud Valeski (@jvaleski) is cofounder and CEO of Gnip, a social media data
provider that aggregates feeds from sites like Twitter, Facebook, Flickr,
delicious, and others into one API.
Jud will be speaking at Strata next week on a panel titled "What's Mine is
Yours: the Ethics of Big Data Ownership."
If you're attending Strata, you can also find out more about the growing
business of data marketplaces at a "Data Marketplaces" panel with Ian White of
Urban Mapping, Peter Marney of Thomson Reuters, Moe Khosravy of Microsoft, and
Dennis Yang of Infochimps.
My interview with Jud follows.
Why is social media data important? What can we do with it or learn from it?
Jud Valeski: Social media today is the first time a reasonably large population
has communicated digitally in relative public. The ability to programmatically
analyze collective conversation has never really existed. Being able to analyze
the collective human consciousness has been the dream of researchers and
analysts since day one.
The data itself is important because it can be analyzed to assist in disaster
detection and relief. It can be analyzed for profit in an industry that has always
struggled to pinpoint how and where to spend money. It can be analyzed to
determine financial market viability (stock trading, for example). It can be
analyzed to understand community sentiment, which has political
ramifications; we all want our voices heard in order to shape public policy.
What are some of the most common or surprising queries run through Gnip?
Jud Valeski: We don't look at the queries our customers use. One pattern we
have seen, however, is that there are some people who try to use the software
to siphon as much data as possible out of a given publisher. "More data, more
data, more data." We hear that all the time. But how our customers configure
the Gnip software is up to them.
Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif.,
will focus on the business and practice of data. The conference will provide
three days of training, breakout sessions, and plenary discussions, along with
an Executive Summit, a Sponsor Pavilion, and other events showcasing the
new data ecosystem.
Save 30% off registration with the code STR11RAD
With Gnip, customers can choose the data sources they want not just by site but
also by category within the site. Can you tell me more about the options for
Twitter, which include Decahose, Halfhose, and Spritzer?
Jud Valeski: We tend to categorize social media sources into three buckets:
Volume, Coverage, or Both. Volume streams provide a consumer with a
sampled rate of volume (Decahose is 10%, for example, while a full firehose is
100% of some service's activities). Statisticians and analysts like the Volume
stuff.
Coverage streams exist to provide full coverage of a certain set of things (e.g.,
keywords, or the User Mention Stream for Twitter). Advertisers like Coverage
streams because their interests are very targeted. There are some products that
fall into both categories, but Volume and Coverage tend to describe the overall
view.
For Twitter in particular, we use their algorithm as described on their dev
pages, adjusted for each particular volume rate desired.
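To make the idea of a sampled Volume stream concrete, here is a minimal sketch of rate-based sampling. It is an illustration only, not Twitter's or Gnip's actual algorithm: the function `in_sample`, its parameters, and the choice of hashing an activity ID into 100 buckets are all assumptions made for this example.

```python
import hashlib


def in_sample(activity_id: str, rate_percent: int) -> bool:
    """Return True if the activity falls inside a rate_percent% sample.

    Hypothetical scheme: hash the ID, map it to one of 100 buckets, and keep
    the activity when its bucket number is below the requested rate.
    """
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    digest = hashlib.sha256(activity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Made-up activity IDs standing in for a stream of activities.
ids = [str(n) for n in range(100_000)]
decahose = [i for i in ids if in_sample(i, 10)]    # roughly a 10% stream
firehose = [i for i in ids if in_sample(i, 100)]   # everything
print(len(firehose), len(decahose))
```

One property of this particular design is that the samples nest: anything kept at 10% is also kept at 50% and at 100%, and the same ID always gets the same decision, so a consumer could move between rates without seeing contradictory streams.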
Gnip is currently the only licensed reseller of the full Twitter firehose. Are there
other partnerships coming up?
Jud Valeski: "Currently" is the operative word here. While we're enjoying the
implied exclusivity of the current conditions, we fully expect Twitter to grow
its VAR tier to ensure a more competitive marketplace.
From my perspective, Twitter enabling VARs allows them to focus on what is
near and dear to their hearts (developer use cases, promoted Tweets, end
users, and the display ecosystem) while enabling firms focused on the
data-delivery business to distribute underlying data for non-display use. Gnip
provides stream enrichments for all of the data that flows through our software.
Those enrichments include format and protocol normalization, as well as
stream augmentation features such as global URL unwinding. Those
value-adds make social media API integration and data leverage much easier than
doing a bunch of one-off integrations yourself.
We're certainly working on other partnerships of this level of significance, but
we have nothing to announce at this time.
What do you wish more people understood about data markets and/or the way
large datasets can be used?
Jud Valeski: First, data is not free, and there's always someone out there that
wants to buy it. As an end-user, educate yourself about how the content you
create using someone else's service could ultimately be used by the
service-provider.
Second, black markets are a real problem, and just because "everyone else is
doing it" doesn't mean it's okay. As an example, botnet-like distributed IP
address polling infrastructure is commonly used to extract more data from a
publisher's service than their API usage terms allow. While perhaps fun to
build and run (sometimes), these approaches clearly result in aggregated pools
of publisher data that the publisher never intended to promote. Once
collected, the aggregated pools of data are sold to data-hungry analytics firms. This
results in end-user frustration, in that the content they produced was used in
a manner that flagrantly violated the terms under which they signed up. These
databases are frequently called out as infringing on privacy.
Everyone loves a good Robin Hood story, and that's how I'd characterize the
overall state of data collection today.
How has real-time data changed the field of customer relationship management
(CRM)?
Jud Valeski: CRM firms have a new level of awareness. They no longer rely
exclusively on dated user studies. A customer service rep may know about your
social life through their dashboard the moment you are connected to them
over the phone.
I ultimately see the power of understanding collective consciousness in
responding to customer service issues. We haven't even scratched the surface
here. Imagine if Company X reached out to you directly every time you had a
problem with their product or service. Proactivity can pay huge dividends.
Companies haven't tapped even 10% of the potential here, and part of that is
because they're not spending enough money in the area yet.
Today, "social" is a checkbox that CRM tools attempt to check off just to keep
the boss happy. Tomorrow, social data and metaphors will define the tools
outright.
Have you learned anything as a social media user yourself from working on Gnip?
Is there anything social media users should be more aware of?
Jud Valeski: Read the terms of service for social media services you're using
before you complain about privacy policies or how and where your data is
being used. Unless you are on a private network, your data is treated as public
for all to use, see, sell, or buy. Don't kid yourself. Of course, this brings us all
the way back around to black markets. Black markets, and publishers'
generally lackadaisical response to them, cloud these waters.
If you can't make it to Strata, you can learn more about the architectural
challenges of distributing social and location data across the web in real time, and
how Gnip has evolved to address those challenges, in Jud's contribution to
"Beautiful Data."
An iTunes model for data
Datasets as albums? Entities as singles? How an iTunes for data might
work.
by Audrey Watters
As we move toward a data economy, can we take the digital content model
and apply it to data acquisition and sales? That's a suggestion that Gil Elbaz
(@gilelbaz), CEO and co-founder of the data platform Factual, made in passing
at his recent talk at Web 2.0 Expo.
Elbaz spoke about some of the hurdles that startups face with big data: not
just the question of storage, but the question of access. But as he addressed
the emerging data economy, Elbaz said we will likely see novel access methods
and new marketplaces for data. Startups will be able to build value-added
services on top of big data, rather than having to worry about gathering and
storing the data themselves. "An iTunes for data," is how he described it.
So what would it mean to apply the iTunes model to data sales and
distribution? I asked Elbaz to expand on his thoughts.
What problems does an iTunes model for data solve?
Gil Elbaz: One key framework that will catalyze data sharing, licensing and
consumption will be an open data marketplace. It is a place where data can be
programmatically searched, licensed, accessed, and integrated directly into a
consumer application. One might call it the "eBay of data" or the "iTunes of
data." iTunes might be the better metaphor because it's not just the content
that is valuable, but also the convenience of the distribution channel and the
ability to pay for only what you will consume.
How would an iTunes model for data address licensing and ownership?
Gil Elbaz: In the case of iTunes, in a single click I purchase a track, download
it, establish licensing rights on my iPhone and up to four other authorized
devices, and it's immediately integrated into my daily life. Similarly, the
deepest value will come for a marketplace that, with a single click, allows a
developer to license data and have it automatically integrated into their particular
application development stack. That might mean having the data instantly
accessible via API, automatically replicated to a MySQL server on EC2,
synchronized at Database.com, or copied to Google App Engine.
An iTunes for data could be priced from a single record/entity to a complete
dataset. And it could be licensed for single use, caching allowed for 24 hours,
or perpetual rights for a specific application.
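The three licensing options mentioned above can be sketched as a small data structure. This is a hypothetical model written for illustration; the names `LicenseKind`, `DataLicense`, `may_use`, and `record_use` are assumptions and do not describe Factual's or any marketplace's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class LicenseKind(Enum):
    SINGLE_USE = "single_use"    # the data may be used once
    CACHE_24H = "cache_24h"      # the data may be cached for 24 hours
    PERPETUAL = "perpetual"      # perpetual rights for one named application


@dataclass
class DataLicense:
    kind: LicenseKind
    granted_at: datetime
    application: Optional[str] = None  # only meaningful for PERPETUAL
    uses: int = 0

    def may_use(self, now: datetime, application: Optional[str] = None) -> bool:
        """Check whether the licensed data may be used at time `now`."""
        if self.kind is LicenseKind.SINGLE_USE:
            return self.uses == 0
        if self.kind is LicenseKind.CACHE_24H:
            return now - self.granted_at <= timedelta(hours=24)
        # PERPETUAL: valid forever, but only for the licensed application.
        return application == self.application

    def record_use(self) -> None:
        self.uses += 1


granted = datetime(2011, 9, 1, 12, 0)
cached = DataLicense(LicenseKind.CACHE_24H, granted)
print(cached.may_use(granted + timedelta(hours=23)))  # True
print(cached.may_use(granted + timedelta(hours=25)))  # False
```

Keeping the license terms next to the data, rather than in a separate contract, is what would let a marketplace enforce them programmatically at access time.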
What needs to happen for us to move away from "buying the whole album" to
buying the data equivalent of a single?
Gil Elbaz: The marketplace will eventually facilitate competitive bidding,
which will bring the price down for developers. iTunes is based on a fairly
simple set-pricing model. But, in a world of multiple data vendors with
commodity data, only truly unique data will command a premium price. And, of
course, we'll need great search technology to find the right data or data API
based on the developer's codified requirements: specified data schema, data
quality bar, licensing needs, and the bid price.
Another dimension that is relevant to Factual's current model: data as a
currency. Some of our most interesting partnerships are based on an open
exchange of information. Partners access our data and also contribute back
streams of edits and other bulk data into our ecosystem. We highly value the
contributions our partners make. "Currency" is a medium of exchange and a
basis for accessing other scarce resources. In a world where not everyone is yet
actively looking to license data, unique data is increasingly an important
medium of exchange.
This interview was edited and condensed.
Photos: iTunes interface courtesy Apple, Inc.; Software Development LifeCycle
Templates By Phase Spreadsheet by Ivan Walsh, on Flickr
Data is a currency
The trade in data is only in its infancy
by Edd Dumbill
If I talk about data marketplaces, you probably think of large resellers like
Bloomberg or Thomson Reuters. Or startups like InfoChimps. What you
probably don't think of is that we as consumers trade in data.
Since the advent of computers in enterprises, our interaction with business has
caused us to leave a data imprint. In return for this data, we might get lower
prices or some other service. The web has only accelerated this, primarily
through advertising, and big data technologies are adding further fuel to this
change.
When I use Facebook I'm trading my data for their service. I've entered into
this commerce perhaps unwittingly, but using the same mechanism
humankind has known throughout our history: trading something of mine for
something of theirs.
So let's guard our privacy by all means, but recognize this is a bargain and a
marketplace we enter into. Consumers will grow more sophisticated about the
nature of this trade, and adopt tools to manage the data they give up.
Is this all one-way traffic? Business is certainly ahead of the consumer in the
data management game, but there's a race for control on both sides. To
continue the currency analogy, browsers have had "wallets" for a while, so we can
keep our data in one place.
The maturity of the data currency will be signalled by personal data bank
accounts that give us, as consumers, control and traceability. The Locker project
is a first step towards this goal, giving users a way to get their data back from
disparate sites, but is one of many future models.
Who runs data banks themselves will be another point of control in the struggle
for data ownership.
Big data: An opportunity in search of a metaphor
Big data as a discipline or a conference topic is still in its formative years.
by Tyler Bell
The crowd at the Strata Conference could be divided into two broad
contingents:
1. Those attending to learn more about data, having recently discovered its
potential.
2. Long-time data enthusiasts watching with mixed emotions as their interest
is legitimized, experiencing a feeling not unlike when a band that you've
been following for years suddenly becomes popular.
A data-oriented event like this, outside a specific vertical, could not have drawn
a large crowd with this level of interest, even two years ago. Until recently,
data was mainly an artifact of business processes. It now takes center stage;
organizationally, data has left the IT department and become the responsibility
of the product team.
Of course "data," in its abstract sense, has not changed. But our ability to
obtain, manipulate, and comprehend data certainly has. Today, data merits
top billing due to a number of confluent factors, not least its increased
accessibility via on-demand platforms and tools. Server logs are the new
cash-for-gold: act now to realize the neglected riches within your upper drive bay.
But the idea of "big data" as a discipline, as a conference subject, or as a
business, remains in its formative years and has yet to be satisfactorily defined.
This immaturity is perhaps best illustrated by the array of language employed
to define big data's merits and its associated challenges. Commentators are
employing very distinct wording to make the ill-defined idea of "big data" more
familiar; their metaphors fall cleanly into three categories:
• Natural resources ("the new oil," "goldrush" and of course "data
mining"): Highlights the singular value inherent in data, tempered by the effort
required to realize its potential.
• Natural disasters ("data tornado," "data deluge," "data tidal wave"):
Frames data as a problem of near-biblical scale, with subtle undertones of
assured disaster if proper and timely preparations are not considered.
• Industrial devices ("data exhaust," "firehose," "Industrial Revolution"):
A convenient grab-bag of terminologies that usually portrays data as a
mechanism created and controlled by us, but one that will prove harmful
if used incorrectly.
If Strata's Birds-of-a-Feather conference sessions are anything to go by, the
idea of "big data" requires the definition and scope these metaphors attempt
to provide. Over lunch you could have met with like-minded delegates to
discuss big data analysis, cloud computing, Wikipedia, peer-to-peer
collaboration, real-time location sharing, visualization, data philanthropy, Hadoop
(natch), data mining competitions, dev ops, data tools (but "not trivial
visualizations"), Cassandra, NLP, GPU computing, or health care data. There are
two takeaways here: the first is that we are still figuring out what big data is
and how to think about it; the second is that any alternative is probably an
improvement on "big data."
Strata is about "making data work": the tenor of the conference was less of
a "how-to" guide, and more about defining the problem and shaping the
discussion. Big data is a massive opportunity; we are searching for its identity
and the language to define it.
Data and the human-machine connection
Opera Solutions' Arnab Gupta says human plus machine always trumps
human vs machine.
by Julie Steele
Arnab Gupta is the CEO of Opera Solutions, an international company
offering big data analytics services. I had the chance to chat with him recently about
the massive task of managing big data and how humans and machines
intersect. Our interview follows.
Tell me a bit about your approach to big data analytics.
Arnab Gupta: Our company is a science-oriented company, and the core
belief is that behavior, human or otherwise, can be mathematically
expressed. Yes, people make irrational value judgments, but they are driven by
common motivation factors, and the math expresses that.
I look at the so-called "big data phenomenon" as the instantiation of human
experience. Previously, we could not quantitatively measure human
experience, because the data wasn't being captured. But Twitter recently announced
that they now serve 350 billion tweets a day. What we say and what we do has
a physical manifestation now. Once there is a physical manifestation of a
phenomenon, then it can be mathematically expressed. And if you can express it,
then you can shape business ideas around it, whether that's in government or
health care or business.
How do you handle rapidly increasing amounts of data?
Arnab Gupta: It's an impossible battle when you think about it. The amount
of data is going to grow exponentially every day, every week, every year, so
capturing it all can't be done. In the economic ecosystem there is extraordinary
waste. Companies spend vast amounts of money, and the ratio of investment
to insight is growing, with much more investment for similar levels of insight.
This method just mathematically cannot work.
So, we don't look for data, we look for signal. What we've said is that the
shortcut is a priori identifying the signals to know where the fish are swimming,
instead of trying to dam the water to find out which fish are in it. We focus on
the flow, not a static data capture.
What role does visualization play in the search for signal?
Arnab Gupta: Visualization is essential. People dumb it down sometimes by
calling it "UI" and "dashboards," and they don't apply science to the question
of how people perceive. We need understanding that feeds into the left brain
through the right brain via visual metaphor. At Opera Solutions, we are
increasingly trying to figure out the ways in which the mind understands and
transforms the visualization of algorithms and data into insights.
If understanding is a priority, then which do you prefer: a black-box model with
better predictability, or a transparent model that may be less accurate?
Arnab Gupta: People bifurcate, and think in terms of black-box machines vs.
the human mind. But the question is whether you can use machine learning
to feed human insight. The power lies in expressing the black box and making
it transparent. You do this by stress testing it. For example, if you were looking
at a model for mortgage defaults, you would say, "What happens if home prices
went down by X percent, or interest rates go up by X percent?" You make your
own heuristics, so that when you make a bet you understand exactly how the
machine is informing your bet.
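The stress-testing idea can be sketched in a few lines: treat the model as an opaque function, change one input at a time, and watch how the output moves. The example below is only an illustration under stated assumptions. `default_probability` is a toy stand-in with invented coefficients, not a real mortgage model or anything Opera Solutions has described, and `stress_test` is a hypothetical helper.

```python
import math


def default_probability(home_price_change: float, interest_rate: float) -> float:
    """Toy stand-in for an opaque model; the coefficients are invented."""
    score = -3.0 - 4.0 * home_price_change + 30.0 * interest_rate
    return 1.0 / (1.0 + math.exp(-score))


def stress_test(model, baseline: dict, shocks: dict) -> dict:
    """Re-run the model with shocked inputs; report the change from baseline."""
    base = model(**baseline)
    results = {}
    for label, overrides in shocks.items():
        scenario = {**baseline, **overrides}  # baseline with one input shocked
        results[label] = model(**scenario) - base
    return results


baseline = {"home_price_change": 0.0, "interest_rate": 0.05}
shocks = {
    "home prices down 10%": {"home_price_change": -0.10},
    "rates up one point": {"interest_rate": 0.06},
}
for label, delta in stress_test(default_probability, baseline, shocks).items():
    print(f"{label}: default probability changes by {delta:+.3f}")
```

The same harness works for any callable, which is the point: the person placing the bet does not need to open the black box to learn which inputs it is most sensitive to.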
Humans can do analysis very well, but the machine does it consistently well;
it doesn't make mistakes. What the machine lacks is the ability to consider
orthogonal factors, and the creativity to consider what could be. The human
mind fills in those gaps and enhances the power of the machine's solution.
So you advocate a partnership between the model and the data scientist?
Arnab Gupta: We often create false dichotomies for ourselves, but the truth
is it's never been man vs. machine; it has always been man plus machine.
Increasingly, I think it's an article of faith that the machine beats the human in
most large-scale problems, even chess. But though the predictive power of
machines may be better on a large-scale basis, if the human mind is trained to
use it powerfully, the possibilities are limitless. In the recent Jeopardy
showdown with IBM's Watson, I would have had a three-way competition with
Watson, a Jeopardy champion, and a combination of the two. Then you would
have seen where the future lies.
Does this mean we need to change our approach to education, and train people
to use machines differently?
Arnab Gupta: Absolutely. If you look back in time between now and the
1850s, everything in the world has changed except the classroom. But I think
we are dealing with a phase-shift occurring. Like most things, the inertia of
power is very hard to shift. Change can take a long time and there will be a lot
of debris in the process.
One major hurdle is that the language of machine-plus-human interaction has
not yet begun to be developed. It's partly a silent language, with data
visualization as a significant key. The trouble is that language is so powerful that the
left brain easily starts dominating, but really almost all of our critical inputs
come from non-verbal signals. We have no way of creating a new form of
language to describe these things yet. We are at the beginning of trying to
develop this.
Another open question is: What's the skill set and the capabilities necessary
for this? At Opera we have focused on the ability to teach machines how to
learn. We have 150-160 people working in that area, which is probably the
largest private concentration in that area outside IBM and Google. One of the
reasons we are hiring all these scientists is to try to innovate at the level of core
competencies and the science of comprehension.
The business outcome of that is simply practical. At the end of the day, much
of what we do is prosaic; it makes money or it doesn't make money. It's a
business. But the philosophical fountain from which we drink needs to be a
deep one.
Associated photo on home and category pages: prd brain scan by Patrick Denker,
on Flickr


Hadoop MapReduce

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }


Pig

    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();


    (defmapcatop split [sentence]
      (seq (.split sentence "\\s+")))

    (?<- (stdout) [?word ?count]
         (sentence ?s) (split ?s :> ?word)
         (c/count ?count))


Data hand tools


    QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041
    QSO: 14000 CW 2011-03-19 1232 W1JQ 599 0002 SO2O 599 0043
    QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR
    QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO
    ...

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | head -2
    QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR
    QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0004 UD3D 599 MO

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | head -2
    VR
    MO
    ...

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort |\
      uniq | head -2
    AD
    AL

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | colrm 1 72 | sort | uniq | wc
    38 38 342

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq
    14000 AD
    14000 AL
    14000 AN
    ...

    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq | grep 14000 | wc
    20 40 180
    $ grep '599 [A-Z][A-Z]' rudx-log.txt | awk '{print $2 " " $11}' |\
      sort | uniq | grep 21000 | wc
    26 52 234

    ...
    ./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD
    ./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR
    ...

    $ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print` |\
      awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc
    48 96 432

    $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
      awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc
    48 96 432

    $ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]' |\
      awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
    3.41kB 0:00:00 [ 20kB/s] [<=>
    48 96 432

and what it can do 40 | Chapter 1: Data Science and Data Tools .Hadoop: What it is. how it works.

how it works. and what it can do | 41 .Hadoop: What it is.

42 | Chapter 1: Data Science and Data Tools .

Four free data tools for journalists (and snoops) WHOIS Four free data tools for journalists (and snoops) | 43 .

Blekko 44 | Chapter 1: Data Science and Data Tools .

Four free data tools for journalists (and snoops) | 45 .

ly 46 | Chapter 1: Data Science and Data Tools .bit.

Compete Four free data tools for journalists (and snoops) | 47 .

The quiet rise of machine learning 48 | Chapter 1: Data Science and Data Tools .

The quiet rise of machine learning | 49 .

50 | Chapter 1: Data Science and Data Tools .

linked data will succeed | 51 . linked data will succeed Where the semantic web stumbled.Where the semantic web stumbled.

52 | Chapter 1: Data Science and Data Tools .

Where the semantic web stumbled. linked data will succeed | 53 .

Social data is an oracle waiting for a question 54 | Chapter 1: Data Science and Data Tools .

Social data is an oracle waiting for a question | 55 .

The challenges of streaming real-time data 56 | Chapter 1: Data Science and Data Tools .

The challenges of streaming real-time data | 57 .

58 | Chapter 1: Data Science and Data Tools .

The challenges of streaming real-time data | 59 .

60 | Chapter 1: Data Science and Data Tools .

CHAPTER 2 Data Issues Why the term “data science” is flawed but useful It’s not a real science 61 .

It’s an unnecessary label The name doesn’t even make sense 62 | Chapter 2: Data Issues .

There’s no definition Time for the community to rally Why you can’t really anonymize your data Why you can’t really anonymize your data | 63 .

64 | Chapter 2: Data Issues .

Keep the anonymization Acknowledge there’s a risk of de-anonymization Limit the detail Why you can’t really anonymize your data | 65 .

Learn from the experts

Big data and the semantic web

Google and the semantic web

Metadata is hard: big data can help

Big data: Global good or zero-sum arms race?

The truth about data: Once it’s out there, it’s hard to control

CHAPTER 3
The Application of Data: Products and Processes

How the Library of Congress is building the Twitter archive

Data journalism, data tools, and the newsroom stack

Data journalism and data tools

The newsroom stack

Bridging the data divide

The data analysis path is built on curiosity, followed by action

The last thing is that you need to actually do the work. Find a dataset that you’re interested in and work on it. It doesn’t have to be fancy, but you have to get started. You can’t just sit there and expect it to happen. Experience and practice are really important.

How data and analytics can improve education

Data science is a pipeline between academic disciplines

Big data and open source unlock genetic secrets

Visualization deconstructed: Mapping Facebook’s friendships

Mapping Facebook’s friendships

Static requires storytelling

Data science democratized

CHAPTER 4
The Business of Data

There’s no such thing as big data

Big data and the innovator’s dilemma

Building data startups: Fast, big, and focused

Setting the stage: The attack of the exponentials

Leveraging the big data stack

Fast data

Big analytics

Focused services

Democratizing big data

Data markets aren’t coming: They’re already here

An iTunes model for data

Data is a currency

Big data: An opportunity in search of a metaphor

Data and the human-machine connection
