
EE178/EE278A: Plotting a histogram

Andrea Montanari
Stanford University

November 24, 2013

Andrea Montanari (Stanford)

EE178/EE278A

November 24, 2013

1 / 37

Why???

Summarize data (list of numbers)


Validate models
Infer properties of a population from a sample
Predict future outcomes


Example

Fuel consumptions of 8773 oil boilers in NYC:

716160, 550000, 492750, 482040, 480000, 459900, 423400, 420042,
395491, 384117, 375584, 374490, 373020, 367920, 365000, 361356,
354364, 350000, 343075, 328500, 328500, 328500, 317952, 306600,
281853, 280000, 279882, 279225, 262800, 262075, 260610, 252945,
246375, 233600, 230607, 229950, ...

To make it interesting, we will use a subsample of $n = 400$ data points.

A first shot

[Figure: histogram of the subsample. x-axis: consumpt. (0e+00 to 8e+05); y-axis: Frequency.]

Log-scale

[Figure: histogram of the log values. x-axis: log(consumpt.); y-axis: Frequency.]

Log-scale

log values $= \{X_1, X_2, X_3, \dots, X_n\}$

\[ \mathrm{Height}(\mathrm{bin}_k) = \#\{\, i : X_i \in \mathrm{bin}_k \,\} \]
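The counting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the boiler consumptions, the logarithm is taken in base 10 (suggested by the axis range of the plot, not stated on the slides), and the helper name `bin_counts` and the bin layout are hypothetical.

```python
import math
import random

random.seed(0)
# Hypothetical stand-in for the consumption data: synthetic positive numbers.
consumption = [random.lognormvariate(11.0, 1.0) for _ in range(400)]

# Log values X_1, ..., X_n (base 10 is an assumption, see the lead-in).
x = [math.log10(c) for c in consumption]


def bin_counts(values, lo, width, num_bins):
    """Return Height(bin_k) = #{i : X_i in bin_k} for bins
    [lo + k*width, lo + (k+1)*width), k = 0, ..., num_bins - 1."""
    counts = [0] * num_bins
    for v in values:
        k = math.floor((v - lo) / width)
        if 0 <= k < num_bins:  # values outside the binned range are ignored
            counts[k] += 1
    return counts


# 16 bins of width 0.25 covering [3.0, 7.0); an arbitrary illustrative choice.
heights = bin_counts(x, lo=3.0, width=0.25, num_bins=16)
print(heights)
```

Changing `width` changes every height at once, which is the issue the following slides address.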

Bin size?

[Figure: two histograms of the same log values drawn with different bin sizes. x-axis: log(consumpt.); y-axis: Frequency. The vertical scales differ: the tick labels reach 80 in one panel and 30 in the other.]

Something annoying...

Need to fix the vertical scale

[Figure: histogram of the log values. x-axis: log(consumpt.); y-axis: Frequency.]

Fixing the vertical scale

\[ \mathrm{Height}(\mathrm{bin}_k) = \frac{\#\{\, i : X_i \in \mathrm{bin}_k \,\}}{n \cdot \mathrm{length}(\mathrm{bin}_k)} \]
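A small sketch of why this normalization fixes the vertical scale: dividing each count by $n \cdot \mathrm{length}(\mathrm{bin}_k)$ makes the total area of the bars equal to 1 whatever the bin size, provided every point falls in some bin. The sample below is synthetic and the function name `density_heights` is hypothetical.

```python
import math
import random


def density_heights(values, lo, width, num_bins):
    """Height(bin_k) = #{i : X_i in bin_k} / (n * length(bin_k))."""
    n = len(values)
    counts = [0] * num_bins
    for v in values:
        k = math.floor((v - lo) / width)
        if 0 <= k < num_bins:
            counts[k] += 1
    return [c / (n * width) for c in counts]


random.seed(1)
# Hypothetical log-scale sample (not the boiler data).
x = [random.gauss(4.8, 0.4) for _ in range(400)]

coarse = density_heights(x, lo=2.0, width=0.5, num_bins=12)   # covers [2, 8)
fine = density_heights(x, lo=2.0, width=0.125, num_bins=48)   # covers [2, 8)

# Total area = sum of height * bin length; it is 1 for both bin sizes
# when every sample point lies inside the binned range.
area_coarse = sum(h * 0.5 for h in coarse)
area_fine = sum(h * 0.125 for h in fine)
print(area_coarse, area_fine)
```

With raw counts the two histograms would have very different heights; with the rescaled heights they can be drawn on the same axis.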

Fixing the vertical scale

[Figure: rescaled histogram of the log values. x-axis: log(consumpt.); y-axis: Density.]

The rationale

Data sample

\[ X_1, X_2, \dots, X_n \stackrel{\mathrm{i.i.d.}}{\sim} f(x) \]

\[ N_k \equiv \#\{\, i : X_i \in \mathrm{bin}_k \,\}, \qquad \mathrm{bin}_k \equiv [a_k, b_k). \]

The rationale

[Figure: kernel density estimate of the log values, with plot title "density.default(x = logConsumption[logConsumption > 0])". x-axis: log(consumpt.); y-axis: Density.]

The rationale

Data sample

\[ X_1, X_2, \dots, X_n \stackrel{\mathrm{i.i.d.}}{\sim} f(x), \qquad N_k \equiv \#\{\, i : X_i \in \mathrm{bin}_k \,\}, \qquad \mathrm{bin}_k \equiv [a_k, b_k) \]

Therefore

\[ E N_k = n \int_{a_k}^{b_k} f(x)\,dx \approx n\, f(a_k)\, |b_k - a_k| \]

Hence

\[ H_k = \mathrm{Height}(\mathrm{bin}_k), \qquad E H_k = \frac{E N_k}{n\,|b_k - a_k|} = \frac{1}{|b_k - a_k|} \int_{a_k}^{b_k} f(x)\,dx \approx f(a_k) \]

Law of large numbers

\[ H_k = \frac{1}{n\,|b_k - a_k|} \sum_{i=1}^{n} Z_i, \qquad Z_i = \begin{cases} 1 & \text{if } X_i \in [a_k, b_k), \\ 0 & \text{otherwise.} \end{cases} \]

By the LLN

\[ H_k \in [\, E H_k - \varepsilon,\; E H_k + \varepsilon \,], \qquad E H_k \approx f(a_k) \]

with probability converging to 1 as $n \to \infty$.
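The concentration of $H_k$ can be illustrated with a short simulation. This is a sketch under assumptions that are not in the slides: the density $f$ is taken to be standard normal, the bin is $[0, 0.1)$, and the function names are hypothetical.

```python
import math
import random


def normal_pdf(x):
    """Standard normal density, playing the role of f in this sketch."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bin_height(sample, a_k, b_k):
    """H_k = (1 / (n |b_k - a_k|)) * sum_i Z_i, with Z_i = 1 if X_i in [a_k, b_k)."""
    z_sum = sum(1 for x in sample if a_k <= x < b_k)
    return z_sum / (len(sample) * (b_k - a_k))


random.seed(2)
a_k, b_k = 0.0, 0.1  # hypothetical bin
for n in (100, 10_000, 200_000):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    # As n grows, H_k settles near E[H_k], which is close to f(a_k).
    print(n, bin_height(sample, a_k, b_k), normal_pdf(a_k))
```

For small $n$ the printed height fluctuates noticeably from run to run; for large $n$ it stays close to $f(a_k)$, up to the small gap between $E H_k$ and $f(a_k)$.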

How big should the bins be?

[Figures: density histograms of the log values drawn with several different bin sizes $\Delta$, one per slide. x-axis: log(consumpt.); y-axis: Density.]

First histogram: Not very detailed!

Last histogram: Are these details or noise?

Variance:

[Figures: density histograms of the log values for several bin sizes $\Delta$, one per slide. x-axis: log(consumpt.); y-axis: Density.]

A tradeoff

Small $\Delta$: Very noisy.
Large $\Delta$: Misses details.

Bias-Variance tradeoff

Why are details useful?

Example: Is $f(x)$ Gaussian?

Is $f(x)$ Gaussian?

[Figure: density histogram of the log values. x-axis: log(consumpt.); y-axis: Density.]

Need more detail

Is $f(x)$ Gaussian?

[Figure: density histogram of the log values. x-axis: log(consumpt.); y-axis: Density.]

Enough detail, but can we trust it?

A Gaussian sample

[Figure: density histogram titled "A Gaussian sample". x-axis label: log(consumpt.); y-axis: Density.]

Enough detail, but can we trust it?

Is $f(x)$ Gaussian?

[Figure: density histogram of the log values. x-axis: log(consumpt.); y-axis: Density.]

About right amount of detail

Theory


Reminder/Notation (bin length $= 2\Delta$)

\[ H_k = \mathrm{Height}(\mathrm{bin}_k), \qquad \mathrm{bin}_k = [a_k, b_k) \equiv [a - \Delta,\, a + \Delta) \]

\[ H_k = \frac{1}{2 n \Delta} \sum_{i=1}^{n} Z_i, \qquad Z_i = \begin{cases} 1 & \text{if } X_i \in [a - \Delta,\, a + \Delta), \\ 0 & \text{otherwise.} \end{cases} \]

\[ E H_k = \frac{1}{2\Delta} \int_{a-\Delta}^{a+\Delta} f(x)\,dx \]

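Error

\[
\begin{aligned}
\mathrm{MSE} &= E\big\{ [H_k - f(a)]^2 \big\} \\
&= E\big\{ [H_k - E(H_k) + E(H_k) - f(a)]^2 \big\} \\
&= E\big\{ [H_k - E(H_k)]^2 \big\} + 2\, E\big\{ H_k - E(H_k) \big\}\, [E(H_k) - f(a)] + [E(H_k) - f(a)]^2 \\
&= \underbrace{\mathrm{Var}(H_k)}_{\text{variance}} + \big[ \underbrace{E(H_k) - f(a)}_{\text{bias}} \big]^2
\end{aligned}
\]

The cross term vanishes because $E\{H_k - E(H_k)\} = 0$.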
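The decomposition can be checked numerically by repeating the experiment many times. This is a minimal Monte Carlo sketch, assuming (beyond what the slides state) a standard normal $f$, a bin centered at $a = 1$ with $\Delta = 0.3$, and $n = 200$; all names are hypothetical. With empirical averages over the trials, MSE equals variance plus squared bias up to rounding.

```python
import math
import random


def normal_pdf(x):
    """Standard normal density, playing the role of f in this sketch."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bin_height(sample, a, delta):
    """H_k for the bin [a - delta, a + delta), which has length 2 * delta."""
    count = sum(1 for x in sample if a - delta <= x < a + delta)
    return count / (2.0 * len(sample) * delta)


random.seed(3)
a, delta, n, trials = 1.0, 0.3, 200, 2000  # hypothetical settings
target = normal_pdf(a)                     # f(a)

heights = []
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    heights.append(bin_height(sample, a, delta))

mean_h = sum(heights) / trials
mse = sum((h - target) ** 2 for h in heights) / trials
variance = sum((h - mean_h) ** 2 for h in heights) / trials
bias = mean_h - target

# The two printed numbers agree: MSE = variance + bias^2.
print(mse, variance + bias ** 2)
```

Rerunning with a larger `delta` shrinks `variance` and grows `bias ** 2`, which is the tradeoff discussed earlier.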

Variance

\[ H_k = \frac{1}{2 n \Delta} \sum_{i=1}^{n} Z_i, \qquad \mathrm{Var}(H_k) = \frac{1}{4 n \Delta^2}\, \mathrm{Var}(Z_i) \]

\[ \mathrm{Var}(Z_i) = E(Z_i^2) - E(Z_i)^2 = E(Z_i) - E(Z_i)^2 \]

\[ E(Z_i) = P\big( X_i \in [a - \Delta,\, a + \Delta) \big) = \int_{a-\Delta}^{a+\Delta} f(x)\,dx \]

\[ \mathrm{Var}(H_k) = \frac{1}{4 n \Delta^2} \int_{a-\Delta}^{a+\Delta} f(x)\,dx \left[ 1 - \int_{a-\Delta}^{a+\Delta} f(x)\,dx \right] \]

For $\Delta$ not too large

\[
\begin{aligned}
E H_k &= \frac{1}{2\Delta} \int_{a-\Delta}^{a+\Delta} f(x)\,dx \\
&= \frac{1}{2\Delta} \int_{a-\Delta}^{a+\Delta} \Big[ f(a) + f'(a)\,(x - a) + \frac{1}{2} f''(a)\,(x - a)^2 + \dots \Big]\,dx \\
&\approx f(a) + \frac{1}{6} f''(a)\,\Delta^2
\end{aligned}
\]

\[
\begin{aligned}
\mathrm{Var}(H_k) &= \frac{1}{4 n \Delta^2} \int_{a-\Delta}^{a+\Delta} f(x)\,dx \left[ 1 - \int_{a-\Delta}^{a+\Delta} f(x)\,dx \right] \\
&\approx \frac{1}{4 n \Delta^2}\, f(a)\, 2\Delta = \frac{1}{2 n \Delta}\, f(a).
\end{aligned}
\]
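To see how good these small-$\Delta$ approximations are, one can compare them with the exact expressions in a case where the integrals are available in closed form. The sketch below assumes a standard normal $f$ (so that $f''(a) = (a^2 - 1) f(a)$ and the integral is a difference of normal CDFs); the point $a = 0.5$, the sample size $n = 400$, and the function names are hypothetical choices.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def exact_mean_and_var(a, delta, n):
    """Exact E[H_k] and Var(H_k) for the bin [a - delta, a + delta)."""
    p = normal_cdf(a + delta) - normal_cdf(a - delta)  # E[Z_i]
    return p / (2.0 * delta), p * (1.0 - p) / (4.0 * n * delta ** 2)


def approx_mean_and_var(a, delta, n):
    """Approximations f(a) + f''(a) delta^2 / 6 and f(a) / (2 n delta)."""
    f = normal_pdf(a)
    f2 = (a * a - 1.0) * f  # second derivative of the standard normal density
    return f + f2 * delta ** 2 / 6.0, f / (2.0 * n * delta)


a, n = 0.5, 400
for delta in (0.4, 0.2, 0.1, 0.05):
    # The exact and approximate pairs get closer as delta decreases.
    print(delta, exact_mean_and_var(a, delta, n), approx_mean_and_var(a, delta, n))
```

The mean approximation improves quickly as $\Delta$ shrinks, while the variance approximation differs from the exact value by the factor $1 - \int f$, which tends to 1.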

Hence

\[ \text{bias} = E H_k - f(a) \approx \frac{1}{6} f''(a)\,\Delta^2 = \mathrm{const} \cdot \Delta^2 \]

Independent of $n$. Increases with $\Delta$.

\[ \text{variance} = \mathrm{Var}(H_k) \approx \frac{f(a)}{2 n \Delta} = \frac{\mathrm{const}}{n \Delta} \]

Decreases with $n$. Decreases with $\Delta$.

Summing up

[Figure: MSE as a function of Delta. x-axis: Delta (0.00 to 0.25); y-axis: MSE (0.00 to 0.10).]

\[ \mathrm{MSE} = c_1\, \Delta^4 + \frac{c_2}{n \Delta} \]

Summing up

\[ \mathrm{MSE} = c_1\, \Delta^4 + \frac{c_2}{n \Delta} \]

Optimizing over $\Delta$

\[ \Delta_{\mathrm{opt}} \propto n^{-1/5}, \qquad \mathrm{MSE}_{\mathrm{opt}} \propto n^{-4/5} \]

More data $\Rightarrow$ Smaller bins $\Rightarrow$ More detail
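The two scalings follow from setting the derivative of the MSE with respect to $\Delta$ to zero, which gives $\Delta_{\mathrm{opt}}^5 = c_2 / (4 c_1 n)$. A small sketch confirming the exponents numerically; the constant values $c_1 = c_2 = 1$ and the function names are hypothetical.

```python
import math


def mse(delta, n, c1, c2):
    """MSE = c1 * delta^4 + c2 / (n * delta)."""
    return c1 * delta ** 4 + c2 / (n * delta)


def delta_opt(n, c1, c2):
    """Minimizer of mse in delta: solves 4 c1 delta^3 - c2 / (n delta^2) = 0."""
    return (c2 / (4.0 * c1 * n)) ** 0.2


c1, c2 = 1.0, 1.0  # hypothetical constants
n_small, n_large = 100, 1_000_000

d_small, d_large = delta_opt(n_small, c1, c2), delta_opt(n_large, c1, c2)
m_small, m_large = mse(d_small, n_small, c1, c2), mse(d_large, n_large, c1, c2)

# Slopes on a log-log scale: close to -1/5 for delta_opt and -4/5 for MSE_opt.
log_ratio_n = math.log(n_large / n_small)
slope_delta = math.log(d_large / d_small) / log_ratio_n
slope_mse = math.log(m_large / m_small) / log_ratio_n
print(slope_delta, slope_mse)
```

Different constants shift both curves but leave the two slopes unchanged.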

In $d$ dimensions

\[ \Delta_{\mathrm{opt}} \propto n^{-1/(4+d)}, \qquad \mathrm{MSE}_{\mathrm{opt}} \propto n^{-4/(4+d)} \]

E.g. for $d = 12$:

\[ \mathrm{MSE}_{\mathrm{opt}} \propto n^{-0.25} \]

The curse of dimensionality!
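To get a feel for what the exponent $4/(4+d)$ means in practice, the sketch below computes it for a few dimensions and asks how many samples would be needed in $d$ dimensions to match the rate achieved by a given sample size in one dimension. Constants are ignored, so the numbers are only indicative; the function names are hypothetical.

```python
def mse_exponent(d):
    """Exponent r such that MSE_opt scales as n^(-r) in d dimensions: r = 4 / (4 + d)."""
    return 4.0 / (4.0 + d)


def samples_for_same_rate(n_1d, d):
    """Sample size m in d dimensions with m^(-4/(4+d)) = n_1d^(-4/5), ignoring constants."""
    return n_1d ** (mse_exponent(1) / mse_exponent(d))


for d in (1, 2, 4, 12):
    # The required sample size grows very quickly with the dimension d.
    print(d, mse_exponent(d), samples_for_same_rate(400, d))
```

For $d = 12$ the exponent is 0.25, matching the slide, and the required sample size is orders of magnitude larger than in one dimension.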
