Uploaded by Rahulsinghoooo

© All Rights Reserved

So we motivated the discussion of statistical inference and estimation by bringing up this so-called decline effect, where the effect size of scientific results seems to be going down over time and reproducibility is suffering. And so we gave some reasons for this: publication bias, mistakes and fraud, and the multiple hypothesis problem. And so we used these motivating scenarios to bring up various topics and techniques.

So we talked a little bit about basic statistical inference, where I just gave you an overview and that's it. We talked about effect size. We brought up the specific term heteroskedasticity. For fraud detection, we brought up Benford's Law. And then we covered multiple hypothesis testing, which is perhaps the most important part of the discussion: we talked about the familywise error rate and the false discovery rate and gave correction procedures for both of these.

Okay, and so this hopefully was a tour of not just some basic concepts but also some, if not advanced, at least things that don't necessarily come up in a Stats 101 course, but that I think are pretty important for us data scientists to understand. In fact, there's a view amongst statisticians that these topics are not very well understood by data scientists. And in fact, they'll point to typical machine learning classes, where understanding the population, understanding the various biases, and understanding how to correct for the problems that can arise is not taught at all, and it's more of a blind application of algorithms. So I think it's pretty important to go over this choice of topics now.

So, what about big data? What changes? Well, Brad Efron, who's a world-renowned statistician, describes it this way. He says classical statistics was fashioned for small problems, a few hundred data points at most and just a few parameters. And the bottom line is that we've entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics. And so this suggests that something is changing in the area of big data.

Now, what can go wrong here? Well, as we've talked about, you can find spurious relationships in big data. And so this is a picture that I got from a colleague recently that was emailed to him, which is a plot that someone took the time to make. It may or may not have been made as a joke, but as you can see here, it says "Internet Explorer versus the murder rate." And so this is the murders in the US in blue, along with the market share of Internet Explorer in green. And the corresponding discussion that went along with this plot was somewhat amusing, talking about various theories for why the murder rate might go up as the Internet Explorer market share goes up, or why the murder rate goes down as the Internet Explorer market share also goes down. But the point here is that without some common sense, or without the [UNKNOWN] application of an understanding of the scenario of the problem, you can make discoveries of this form.

Okay. Alright, and so there are other examples that have been talked about in the literature, again brought up as bad examples: the number of police officers and the number of crimes. So, why might these two things be correlated? You know, maybe police officers cause crimes. Well, no, it's probably because in densely populated areas there are both more police officers and more crimes. By the way, just to point out again, these authors here are not the authors that made these claims. These are authors that brought up the mistake.

Okay, the amount of ice cream sold and deaths by drowning. Why would these things be correlated? Well, there's a seasonality, right? In the summertime you sell more ice cream and more people go swimming. And a population increase has been used as evidence that storks do indeed bring newborns to families. Well again, in more densely populated areas there are more people to actually see the storks, and so you get an increase in sightings.
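
The police-and-crime example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the linear model, the coefficients, and the noise levels are all hypothetical, and police numbers have no effect on crime in the simulated data. Population density drives both quantities, so they come out strongly correlated, and the correlation largely disappears once density is accounted for.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(y, x):
    """What is left of y after subtracting a least-squares line fitted on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(0)

# Hypothetical model: density drives both quantities; police never affect crimes.
density = [rng.uniform(1, 10) for _ in range(5000)]
police = [2 * d + rng.gauss(0, 1) for d in density]
crimes = [5 * d + rng.gauss(0, 2) for d in density]

raw_r = pearson(police, crimes)
adjusted_r = pearson(residuals(police, density), residuals(crimes, density))

print(f"raw correlation:        {raw_r:.2f}")
print(f"after removing density: {adjusted_r:.2f}")
```

The adjustment here is a simple linear one; with a real data set the confounder may not be measured at all, which is where understanding the scenario comes in.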

So, these kinds of procedures to remove bias, to understand the population you are sampling from, and to understand the possible explanations for correlations are taught in statistics programs, but are not typically taught in machine learning classes. Okay.

So what does that have to do with big data? Well, there might be a view that there's a curse of big data. As Vincent Granville put it, it is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power. And so the example he gives is to consider stock prices for some large number of companies over a one-month period.

And then you check for correlations between all pairs. And it actually doesn't stop there, because that would be over the same exact one-month period. But you might want to account for lags. Maybe the stock price of Google affects, a few days later, the stock price of smaller companies that depend on it. So now you're not just checking the pairwise correlation of these time series, on the order of 500 squared comparisons, but you are also checking the pairwise correlation of slightly offset ones. Okay, and so these are the cross-correlation procedures.
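
To get a feel for how quickly the number of checks grows, here is a rough sketch. The 500 series come from the thought experiment; the maximum lag of 5 steps, the helper names, and the toy data are assumptions made only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

def comparisons(n_series, max_lag):
    """Correlations checked: every unordered pair at lag 0, plus both
    lead/lag directions for each lag from 1 to max_lag."""
    pairs = n_series * (n_series - 1) // 2
    return pairs * (1 + 2 * max_lag)

# Toy data: the follower repeats the leader two steps later.
leader = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
follower = [0.0, 0.0] + leader[:-2]

print(lagged_correlation(leader, follower, 2))  # the lag-2 alignment matches exactly
print(comparisons(500, 0))                      # same-period pairs only
print(comparisons(500, 5))                      # with lags of up to 5 steps either way
```

Every one of those comparisons is another chance for a coincidence, which is the multiple hypothesis problem again.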

So this is very basic time series analysis; this is just to measure the correlation, and I just wanted to throw the formulas up here, where the covariance of two data sets is measured this way. Alright, so you take the data point x_i and subtract the mean of x, and multiply that by y_i minus the mean of y. You add all that up and average over the data points, and that's the covariance. And then you divide the covariance by the product of the standard deviations of the two data sets, and so this gives you the correlation.
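
Written out as code, the two formulas might look like the following minimal sketch. It assumes the population convention (dividing by n rather than n - 1); the choice cancels out in the correlation as long as the covariance and the standard deviations use the same one.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Average product of each point's deviation from the mean of its data set."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

def std_dev(x):
    """Standard deviation: square root of the covariance of x with itself."""
    return sqrt(covariance(x, x))

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

print(covariance(x, y))
print(correlation(x, y))                # y rises exactly in step with x
print(correlation(x, [-v for v in y]))  # flipping the sign flips the correlation
```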

Okay.

So, what does this experiment look like? Well, I generated this plot by running random walks for stock prices that start at $10. They all start at the same point, and at each step, which is an hour of simulated time, I draw a sample from a normal distribution where the mean is the current stock price and the standard deviation is one percent of that current stock price. Okay. And this is not especially defensible, but you can see just sort of visually that it does generate stock-price-looking things, and you do get some variance here. Alright.

So, clearly this is random. This plot shows the number of correlations at a level of 0.9, which is a pretty strong correlation, as a function of the number of stock prices tracked. So I went up from 10 to 100; I didn't go all the way up to 500, which is what Vincent Granville described in the thought experiment. This is the number of spurious correlations you find. Okay, and this is also not doing the lagged cross-correlation; this is just directly [INAUDIBLE] the correlation of these two [INAUDIBLE] of time series across this month. And that's a pretty long period, across a month.
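
A rough re-creation of this experiment is sketched below. It is not the code behind the plot: the 30-day month of 24 hourly steps, the random seed, the reading of "a level of 0.9" as an absolute correlation of at least 0.9, and the smaller range of series counts (to keep the run short) are all assumptions.

```python
import random
from math import sqrt

STEPS = 24 * 30  # hourly steps over an assumed 30-day month

def random_walk(rng, start=10.0, steps=STEPS):
    """Each step is drawn from a normal with mean = current price, sd = 1% of it."""
    prices = [start]
    for _ in range(steps):
        current = prices[-1]
        prices.append(rng.gauss(current, 0.01 * current))
    return prices

def normalize(series):
    """Center and scale to unit length so a dot product gives the correlation."""
    m = sum(series) / len(series)
    centered = [v - m for v in series]
    norm = sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def count_strong_pairs(series_list, threshold=0.9):
    """Count unordered pairs whose absolute correlation reaches the threshold."""
    unit = [normalize(s) for s in series_list]
    count = 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            r = sum(a * b for a, b in zip(unit[i], unit[j]))
            if abs(r) >= threshold:
                count += 1
    return count

rng = random.Random(42)
walks = [random_walk(rng) for _ in range(40)]

for n in (10, 20, 40):
    pairs = n * (n - 1) // 2
    print(f"{n} series, {pairs} pairs, {count_strong_pairs(walks[:n])} at |r| >= 0.9")
```

Every walk is independent by construction, so any pair that clears the threshold is a coincidence.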

So what's the point?

Well, [INAUDIBLE] gives more opportunities for spurious findings.

Okay.

Now, it's not all bad news. So, how is big data different? Well, there's a notion of big p versus big n, where big p is sort of the number of columns and big n is the number of rows. And the experiment we just did with the time series was sort of a big p scenario. We looked at an increasing number of companies, and then we looked at all possible correlations between them, so this was growing sort of quadratically.

Okay.

With big n, the marginal cost of increasing the number of records is essentially zero. It's gotten cheaper and cheaper and cheaper to collect data. Okay. Great. Now that's very, very powerful, right? The increase in the number of records adds statistical power and helps us get lower and lower p-values, but it also amplifies bias. If you're collecting the wrong data, if you're looking at the wrong population, you're going to make so-called discoveries that are simply false.

And so, for example, if you log all the clicks to your website, you have a very, very large data set and you can very precisely model user behavior. But that would only model your current users, when perhaps the whole point of modeling user behavior is to try to attract new customers. Well, for example, if your current user base is early adopters, you're only going to be modeling their behavior. You haven't actually sampled the population at large.

Another example is mobile data. And this comes up in polling for, say, the presidential election: you're only sampling people that have cell phones. And this may or may not be the same population you want to be sampling. Okay, this may ignore lower-income groups or different age groups.
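
The way more records sharpen a biased answer without correcting it can be sketched with a toy poll. All of the numbers below are hypothetical: a data source that reaches only 30% of the population, and different underlying rates inside and outside that group.

```python
import random
from math import sqrt

rng = random.Random(1)

# Hypothetical population: two groups with different rates of some behavior.
SHARE_REACHABLE = 0.3    # fraction of the population the data source reaches
RATE_REACHABLE = 0.60    # rate inside the reachable group
RATE_UNREACHABLE = 0.40  # rate among everyone else
TRUE_RATE = (SHARE_REACHABLE * RATE_REACHABLE
             + (1 - SHARE_REACHABLE) * RATE_UNREACHABLE)

def biased_estimate(n):
    """Estimate the rate from n records drawn only from the reachable group."""
    hits = sum(rng.random() < RATE_REACHABLE for _ in range(n))
    p_hat = hits / n
    std_err = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err

results = {}
for n in (100, 10_000, 100_000):
    results[n] = biased_estimate(n)
    p_hat, std_err = results[n]
    print(f"n={n:>7}: estimate={p_hat:.3f} +/- {1.96 * std_err:.3f} "
          f"(population rate {TRUE_RATE:.2f})")
```

The interval tightens as n grows, but it tightens around the reachable group's rate, not the population's.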

You need to be careful about multiple hypothesis tests as well, as we pointed out. So there's a fantastic comic from XKCD that makes this point very, very clear, where they sort of demonstrate that green jelly beans cause acne. Right, and the story here is that there are 20 different colors of jelly beans, and at a p-value threshold of 0.05 we do 20 experiments. And sure enough, we find that one of the colors indeed causes acne. But that would be expected purely by chance. And so I encourage you to look up that comic.
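
The arithmetic behind the jelly bean story is short enough to write down. This sketch assumes the 20 tests are independent and that no color has any real effect; the Bonferroni adjustment shown is one standard familywise correction, used here only as an example.

```python
ALPHA = 0.05   # per-test significance threshold, as in the comic
N_TESTS = 20   # one experiment per jelly bean color

# Chance that at least one of 20 independent no-effect tests looks significant.
p_any_false_positive = 1 - (1 - ALPHA) ** N_TESTS
expected_false_positives = ALPHA * N_TESTS

# Bonferroni: test each color at ALPHA / N_TESTS instead.
bonferroni_alpha = ALPHA / N_TESTS
p_any_after_correction = 1 - (1 - bonferroni_alpha) ** N_TESTS

print(f"P(at least one false positive): {p_any_false_positive:.3f}")
print(f"expected false positives:       {expected_false_positives:.1f}")
print(f"per-test threshold, corrected:  {bonferroni_alpha:.4f}")
print(f"P(at least one), corrected:     {p_any_after_correction:.3f}")
```

So one "significant" color out of 20 is about what chance alone would produce.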

And the other comment I'll make is about Taleb's Black Swan events. So these are things that are sort of inherently unpredictable, or whose distribution does not follow a normal distribution, sort of a bell curve distribution, where the tails of the bell curve mean that extreme values become exponentially more rare. That's sort of the definition of the normal distribution. But in some cases, extreme values are not exponentially less common. They happen, okay?
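
To make the contrast concrete, the sketch below compares how fast tail probabilities shrink under a normal distribution and under a heavy-tailed one. The Cauchy distribution is an assumption chosen purely for illustration; it is not a distribution named in the lecture.

```python
from math import atan, erfc, pi, sqrt

def normal_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return erfc(k / sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy X, a heavy-tailed distribution."""
    return 1 - (2 / pi) * atan(k)

for k in (1, 2, 4, 6, 8):
    print(f"k={k}: normal={normal_tail(k):.2e}  cauchy={cauchy_tail(k):.2e}")
```

Under the normal, a value six units out is essentially never seen; under the heavy-tailed alternative, it is an everyday occurrence.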

And so the example that he uses in this case is that if a turkey was to model your behavior, it would get increasingly more confident that you mean it no harm, and that you mean it, you know, good will. Every day you come and feed the turkey, and every day you take care of it and you look out for its well-being. But then on the day before Thanksgiving it gets slaughtered, perhaps. And so that was Taleb's argument for a Black Swan event. A black swan itself refers to the fact that people didn't believe black swans existed and then found out that they did, so it was an unexpected event, okay. All right, we'll talk more about that in some detail.
