Professional Documents
Culture Documents
1 / 49
What is Statistics?
2 / 49
Some more definitions
3 / 49
Google trends
4 / 49
Keeping things in perspective
5 / 49
One view of their relationship
6 / 49
20th Century Innovation
7 / 49
Statistics also played a key role
8 / 49
... to answer these questions
Answers:
• Does fertilizer increase crop yields? Collect and analyze
agricultural experimental data
• Does Streptomycin cure Tuberculosis? Analyze randomized
trials data
• Does smoking cause lung-cancer? Collect and analyze
observational studies data
• Who will win the election? Collect data and try to
extropolate
Answering these was the job of: boring old statisticians
9 / 49
Big data explosion
21st*Century*
12 / 49
In God we trust. All others bring data.
attributed to W. Deming
13 / 49
Emerging themes
14 / 49
Emerging themes
14 / 49
The availability of massive amounts of good data
changes everything
15 / 49
21st century examples
17 / 49
The Supervising Learning Paradigm
19 / 49
The sexiest job?
“I keep saying the sexy job in the next ten years will be
statisticians. People think I’m joking, but who would’ve
guessed that computer engineers would’ve been the sexy job of
the 1990s”
- Hal Varian, Google’s Chief Economist
20 / 49
21st
century
version
brain:normal
liver:normal
2.5
spleen:normal
●
● ●
●
●
●
● ● ●
1.5
● ● ● ●
● ●
● ●
●
● ● ●
● ●
●
● ● ●
● ● ● ● ●
● ●
●
M
● ● ●
● ●
● ● ● ● ● ●
● ● ●
● ● ● ● ●
●
● ● ● ●
● ●
● ●
● ●
● ● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ●
● ● ● ●
● ● ● ●
●
● ● ● ●
● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
0.5
● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ●
●
● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●
● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ●
● ● ● ● ●
●
● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ● ●
●
● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ●
● ●
● ●
−0.5
● ● ● ●
● ●
● ●
●
● ●
● ●
● ●
●
CpG density
0.00 0.10
Shore Island
+
Genes
− PRTFDC1
Modern high-‐throughput technology Sta8s8cs has helped bring data into focus
29 / 49
Business and marketing: a new era
30 / 49
The Netflix Recommender
31 / 49
NeUlix*Challenge*
In&Sept&2009&a&team&lead&by&&Chris&
Volinsky&from&StaGsGcs&Research&
AT&T&Research&was&announced&as&
winner!&
32 / 49
NeUlix*
• &A&USPbased&DVD&rentalPby&mail&company&
• &>10M&customers,&100K&Gtles,&ships&1.9M&DVDs&per&day&
Good&recommendaGons&=&happy&customers&
Courtesy&of&Chris&Volinsky&
33 / 49
NeUlix*Prize**
• &October,&2006:&
• Offers*$1,000,000&for&an&improved&recommender&algorithm&
& user movie score date
• Training&data& 1 21 1 2002-01-03
2 345 4 2002-05-05
• 480,000&users& 2 123 4 2002-05-05
3 76 5 2003-10-10
• 6&years&of&data:&2000P2005&
4 45 4 2004-10-11
5 568 1 2004-10-11
5 234 2 2004-12-12
• Last&few&raGngs&of&each&user&(2.8&million)& 6 76 5 2005-01-02
• EvaluaGon&via&RMSE:&&root&mean&squared&error&& 6 56 4 2005-01-31
• Neqlix&Cinematch&RMSE:&0.9514&
&
• &CompeGGon&
• $1*million*grand&prize&for&10%*improvement&
• If&10%¬&met,&$50,000&annual&“Progress&Prize”&for&
best&improvement&
Courtesy&of&Chris&Volinsky&
34 / 49
NeUlix*Prize**
• &October,&2006:&
• Offers*$1,000,000&for&an&improved&recommender&algorithm&
& user movie score date
• Training&data& 1 21 1 2002-01-03
• 100&million&raGngs& user
1 213
movie score
5 2002-04-04
date
2 345 4 2002-05-05
1 212 ? 2003-01-03
• 480,000&users& 2 123 4 2002-05-05
1 1123 ? 2002-05-04
• 17,770&movies& 2
2
25
768
?
3 2003-05-03
2002-07-05
3 76 5 2003-10-10
• 6&years&of&data:&2000P2005& 2 8773 ? 2002-09-05
4 45 4 2004-10-11
2 98 ? 2004-05-03
5 568 1 2004-10-11
3 16 ? 2003-10-10
• &Test&data& 4
5
2450
342
?
2 2004-10-11
2004-10-11
5 234 2 2004-12-12
• Last&few&raGngs&of&each&user&(2.8&million)& 5
6
2032
76
?
5
2004-10-11
2005-01-02
5 9098 ? 2004-10-11
• EvaluaGon&via&RMSE:&&root&mean&squared&error&& 5
6
11012
56
?
4 2005-01-31
2004-12-12
• &CompeGGon&
• $1*million*grand&prize&for&10%*improvement&
• If&10%¬&met,&$50,000&annual&“Progress&Prize”&for&
best&improvement&
Courtesy&of&Chris&Volinsky&
35 / 49
Latent*Factors*Model*
FratPhouse,&grossPout&comedies& CriGcally&–&acclaimed/&Strong&female&leads&
Artsy,&edgy,&
movies& A&latent*
factors*model&
idenGfies&
factors&with&
maximum&
discriminaGon&
between&
movies&
Hollywood&Big&budget&
Courtesy&of&Chris&Volinsky&
36 / 49
Data science in sports and politics
37 / 49
The*Data*Scien6st*
Actual& Hollywood&
38 / 49
Nate Silver and 538
2012*results*
Observed&difference&
Silver’s&predicted&difference&
42 / 49
How Google has changed advertising
43 / 49
More examples of current Data Science applications
44 / 49
Why Statistical thinking is still important
45 / 49
Statistics versus Machine Learning
46 / 49
Statistics versus Machine Learning
46 / 49
Why statistical inference is important
• In many situations we care about the identity of the
features— e.g. biomarker studies: Which genes are related
Figure 2:
toCumulative
cancer? odds ratio as a function of total genetic information as determined by
successive meta-analyses. a) Eight cases in which the final analysis di↵ered by more than
• from
chance There is a crisis
the claims in reproducibility
of the initial in inScience:
study. b) Eight cases which a significant association
Ioannidis
was found at the end(2005) “Why Most
of the meta-analysis, Published
but was Research
not claimed Findings
by the initial study. From
Ioannidis et al. 2001.
Are False”
47 / 49