You are on page 1of 67

Probability and Statistics ì

for Computer Science

“The statement that “The


average US family has 2.6
children” invites mockery” –
Prof. Forsyth reminds us
about critical thinking
Credit: wikipedia

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 8.24.2023


Last lecture
✺ Welcome/Orientation
✺ Big picture of the contents
✺ Lecture 1 - Data Visualization &
Summary (I)
Warm up question:
✺ What kind of data is a letter grade?
Ordinal.
✺ What do you ask for usually about the
stats of an exam with numerical
scores?
Objectives
✺ Simple visualization of 1-D data
✺ Grasp Summary Statistics
✺ Learn more Data Visualization for
Relationships
Simple Visualization of Data
✺ General principles
✺ Bar chart
✺ Histogram
✺ Conditional histogram
Simple Visualization of Data
✺ General principles
Must not mislead or distort;
Aesthetically pleasing;
Clear, Attractive, Convincing;
Show message/significance.
Simple Visualization of Data
✺ Bar chart
A set of bars that
are organized
by categorical
or ordinal feature

Data: “mtcars”
An example of good, ugly, bad, wrong

The difference a b ugly


between good,
5 5
4 4

ugly, bad and

value
3
value
3
2 2

wrong 1
0
1
0

visualization
A B C A B C

c bad d wrong
a – good 4

3
5
4 4

b – ugly 3 3
value

2
2 2

c – bad 1

0
1
0
1
0

d - wrong
A B C A B C
Q: Is this a good bar chart?

A. Yes
I
B. No
How about using a color scale
Visualizing Data with Histogram
✺ Histogram
A set of bars that
are organized
by bins that
contains
numerical data
(discrete or
continuous)

Data: “iris”
Visualizing Data with Histogram (II)
✺ Conditional
histogram
S

Histogram I
/
/

generated by 1
W
I

I
'

I /
/

subsets of the
I /
, i 1
W I
/

...
9
-

data
/
/

11. - I
- /
/ /

...
,
/

"
/

"I -

iii,
/
/ I
/
I I " I
/
/
W
I
/ ! ↳
,
,
/

I I S

Y 1,
I

i
S
! i I

↳.., /
!! -
Data: “iris”
, - I E
/ /
/
Which group has the higher total scores?

I
W

!
7

I I

I I

W
W

I
-
I
I
I

:
I
I
I

I I 2
W ↳

CS361 Spring 2021


Which group has the higher total scores?

i
7

I I

.........
.
Summarizing 1D continuous data
For a data set {x} or annotated as {xi}, we
summarize with:
✺ Location Parameters
Mean Median Mode

✺ Scale parameters
Std variance IQR
Summarizing 1D continuous data
✺ Mean
1 !N
mean({xi }) = xi
N i=1

It’s the centroid of the data geometrically,


by identifying the data set at that point, you find
the center of balance.
32:3 it [1,8]

9x,3 =
1, 2, 3, 4, 5, 6, 7, 12

I
Mean((x:7) 5
=

X
I -
i2;456
-
S

7 12

p
Properties of the mean
Scaling data scales the mean
mean(((xi c3)
+

3:33
=K. means
+
C
mean({k · xi }) = k · mean({xi })

TranslaAng the data translates the mean

mean({xi + c}) = mean({xi }) + c


Less obvious properties of the mean
The signed distances from the mean
sum to 0 !
N
(xi − mean({xi })) = 0
i=1

The mean minimizes the sum of the


squared distance from any real value
!
N
argmin (xi − µ)2 = mean({xi })
µ
i=1
Proof:I(x: - mean (3x;3)) 0 =

i 1
=

LHS xi
= - mean (3x:3)
i I
=

I
i=

- [xi -N:mean (xi)


i
=
1
N
i N- H.
= ↳xi- -
i
N
N N
- -
0
-
I Xi 2 X.
-
-

i .
L L
Proof: Argmin((x: -n() =

e mean(2:3) i1
=

the
Argument es that minimizes

function that follows &

LHS=Argmin't) e
I

N
e)2
f
e (xi
=
-
=

i1
=

Eh(u) gre
N
=
I
=

i 1
=
i 1
=

X
smooth
b
satisfies 0
=

for a
function.
xY x2

*
y
=

argmin(y' =

?
x

dy
-= 2x 0
=

dx
i= 0
Proof: Argmin((x: -n() =

e mean(2:3) i1
=

f
S

find
g2
=0 to le
Set
del h =

the chain rule


using
at=
del ch =
=
d2
=

n
g2;g
= -
-
xi - M

Bu -I
i1
=
eg.( 1) -

2(
=
-

2g) 0
=
Proof: Argmin((x: -n() =

e mean(2:3)
i1 =

3 2g 0
=

2 z(xi u)
=
0
- -

[(xi )
0
=
-

: N.M
=

-
0

d+(M)
-
2
du
Y
i E
=
Q1:
What is the answer for
mean({mean({xi})}) ?
I
A. mean({xi}) B. unsure C. 0
Standard Deviation (σ)
The standard deviaAon
!
"
"1 $ N
std({xi }) = # (xi − mean({xi }))2
N i=1

!
= mean({(xi − mean({xi }))2 })
How much the

.
data speeds
out ur t mean

!
↑ !
X X
- ·- X X
0 R

Std=
* i1
=
di
Q2. Can a standard deviation of a dataset
be -1?

A. YES
I
B. NO
Properties of the standard deviation
Scaling data scales the standard deviaAon
std({k · xi }) = |k| · std({xi })

TranslaAng the data does NOT change the


standard deviaAon

std({xi + c}) = std({xi })


Standard deviation: Chebyshev’s
inequality (1 look)
st
units of
N
At most items are k standard
k2 X

deviaAons (σ) away from the mean


Rough jusAficaAon: Assume mean =0
N

Hot
N− 2 Hot
0.5N k zewos 0.5N
·K8
0
X

k2 k2
y

−kσ kσ
!
1 N 2 N
std = [(N − )0 + 2 (kσ)2 ] = σ
N k k
Variance 2
(σ )
✺ Variance = (standard deviation)2
1 !N
var({xi }) = (xi − mean({xi }))2
W
N i=1

✺ Scaling and translating similar to standard


deviation var({k · xi }) = k 2 · var({xi })
var({xi + c}) = var({xi })
Q3: Standard deviation
✺ What is the value of
std({mean({xi})}) ?
I
A. 0 B. 1 C. unsure
Standard Coordinates/normalized
data
✺ The mean tells where the data set is and the
standard deviation tells how spread out it is.
If we are interested only in comparing the
shape, we could
define: xi − mean({xi })
x!i =
std({xi })
✺ We say {x!i } is in standard coordinates
Q4: Mean of standard coordinates
✺ mean( {x!i } ) is:
A. 1 B. 0 C. unsure

xi − mean({xi })
x!i =
std({xi })
Q5: Standard deviation (σ) of
standard coordinates
✺ Std( {x!i } ) is:

H
A. 1 B. 0 C. unsure

xi − mean({xi })
x!i =
std({xi })
Q6: Variance of standard
coordinates
✺ Variance of {x!i } is:

I
A. 1 B. 0 C. unsure

xi − mean({xi })
x!i =
std({xi })
Q7: Estimate the range of data in
standard coordinates -

✺ Estimate as close as possible, 90% data


-

N
-
-

is within:
&
-

A. [-10, 10]
1/
/
--
K/ L

- is 5
I S

k -

1
=

C
B. [-100, 100]
Yk 2-

10 Tr
C. [-1, 1]
xi − mean({xi })
-
x!i =
D. [-4, 4] std({xi })

E. others
Standard Coordinates/normalized data to
μ=0, σ=1, σ2=1
✺ Data in standard coordinates always has
mean = 0; standard deviation =1;
variance = 1.
✺ Such data is unit-less, plots based on this
sometimes are more comparable
✺ We see such normalization very often in
statistics
Median
✺ We first sort the data set {xi}
✺ Then if the number of items N is odd
median = middle item's value
if the number of items N is even
median = mean of middle 2 items'
values
Properties of Median
✺ Scaling data scales the median

median({k · xi }) = k · median({xi })

✺ Translating data translates the median

median({xi + c}) = median({xi }) + c


Percentile

✺ kth percentile is the value relative to


which k% of the data items have smaller
or equal numbers
✺ Median is roughly the 50th percentile

(1, 2, 3, 4. 5
5
/
6,7,12}
75th percentile =?
Interquartile range
✺ iqr = (75th percentile) - (25th percentile)
✺ Scaling data scales the interquartile range
iqr({k · xi }) = |k| · iqr({xi })

✺ Translating data does NOT change the


interquartile range
iqr({xi + c}) = iqr({xi })
Box plots
✺ Boxplots Vehicle death by region
✺ Simpler than
histogram

DEATH
✺ Good for outliers
✺ Easier to use
for comparison

Data from
https://www2.stetson.edu/~jrasp/data.htm
Boxplots details, outliers
✺ How to Outlier

define
> 1.5 iqr
outliers? Whisker

(the default) Box


Interquartile
Median Range (iqr)
< 1.5 iqr
Sensitivity of summary statistics to
outliers
✺ mean and standard deviation are
very sensitive to outliers
✺ median and interquartile range are
not sensitive to outliers
Modes
✺ Modes are peaks in a histogram
✺ If there are more than 1 mode, we
should be curious as to why
Multiple modes
✺ We have seen
the “iris” data
which looks to
have several
peaks

Data: “iris” in R
Example Bi-modes distribution
✺ Modes may
indicate
multiple
populations

Data: Erythrocyte cells in


healthy humans

Piagnerelli, JCP 2007


Tails and Skews
Symmetric Histogram
mode, median, mean all on top of
one another

left right
tail tail

Left Skew Right Skew


mean mean
mode
mode

medium medium

left right left right


tail tail tail tail

Credit: Prof.Forsyth
Looking at relationships in data
✺ Finding relationships between
features in a data set or many data
sets is one of the most important
tasks in data science
Heatmap
✺ Display matrix of data via gradient of color(s)

Summarization of 4 locations’ annual mean


temperature by month
3D bar chart
✺ Transparent
3D bar chart
is good for
small # of
samples
across
categories
Relationship between data feature
and time
✺ Example: How does Amazon’s stock change over
1 years?
take out the pair of
features
x: Day
y: AMZN
Relationship between data features
✺ Example: does the weight of people relate to
their height?

✺ x : HIGHT, y: WEIGHT
The visual way for continuous
features
✺ Time series plot
✺ Scatter plot
Time Series Plot: Stock of Amazon
Scatter plot
✺ A most effective tool for geographic
data and 2D data in general. It should
be your first step with a new 2D
dataset.
Scatter plot
✺ Body Fat data set
Scatter plot
✺ Scatter plot with density
Scatter plot
✺ Removed of outliers & standardized
Scatter plot
✺ Coupled with
heatmap to
show a 3rd
feature
Correlation seen from scatter plots
Zero Positive Negative
Correlation correlation correlation

Weights, outliers removed, normalized


No Correlation Positive Correlation Negative Correlation
Normalized heart rate

Density
Normalized body temperature Heights, outliers removed, normalized Body fat
What kind of Correlation?
✺ line of code in a database and number of bugs

✺ GPA and hours spent playing video games

✺ earnings and happiness

Credit: Prof. David Varodayan


Correlation doesn’t mean causation
✺ Shoe size is correlated to reading skills,
but it doesn’t mean making feet grow
will make one person read faster.
Assignments
✺ HW1 due 8/31 Thurs.
✺ Reading upto Chapter 2.1
✺ Next time: the quantitative part of
correlation coefficient
Additional References
✺ Charles M. Grinstead and J. Laurie Snell
"Introduction to Probability”
✺ Morris H. Degroot and Mark J. Schervish
"Probability and Statistics”
See you next time

See
You!

You might also like