Professional Documents
Culture Documents
Data: “mtcars”
An example of good, ugly, bad, wrong
value
3
value
3
2 2
wrong 1
0
1
0
visualization
A B C A B C
c bad d wrong
a – good 4
3
5
4 4
b – ugly 3 3
value
2
2 2
c – bad 1
0
1
0
1
0
d - wrong
A B C A B C
Q: Is this a good bar chart?
A. Yes
I
B. No
How about using a color scale
Visualizing Data with Histogram
✺ Histogram
A set of bars that
are organized
by bins that
contains
numerical data
(discrete or
continuous)
Data: “iris”
Visualizing Data with Histogram (II)
✺ Conditional
histogram
S
Histogram I
/
/
generated by 1
W
I
I
'
I /
/
subsets of the
I /
, i 1
W I
/
...
9
-
data
/
/
11. - I
- /
/ /
...
,
/
"
/
"I -
iii,
/
/ I
/
I I " I
/
/
W
I
/ ! ↳
,
,
/
I I S
Y 1,
I
i
S
! i I
↳.., /
!! -
Data: “iris”
, - I E
/ /
/
Which group has the higher total scores?
I
W
!
7
I I
I I
W
W
I
-
I
I
I
:
I
I
I
I I 2
W ↳
i
7
I I
.........
.
Summarizing 1D continuous data
For a data set {x} or annotated as {xi}, we
summarize with:
✺ Location Parameters
Mean Median Mode
✺ Scale parameters
Std variance IQR
Summarizing 1D continuous data
✺ Mean
1 !N
mean({xi }) = xi
N i=1
9x,3 =
1, 2, 3, 4, 5, 6, 7, 12
I
Mean((x:7) 5
=
X
I -
i2;456
-
S
7 12
p
Properties of the mean
Scaling data scales the mean
mean(((xi c3)
+
3:33
=K. means
+
C
mean({k · xi }) = k · mean({xi })
i 1
=
LHS xi
= - mean (3x:3)
i I
=
I
i=
i .
L L
Proof: Argmin((x: -n() =
e mean(2:3) i1
=
the
Argument es that minimizes
LHS=Argmin't) e
I
N
e)2
f
e (xi
=
-
=
i1
=
Eh(u) gre
N
=
I
=
i 1
=
i 1
=
X
smooth
b
satisfies 0
=
for a
function.
xY x2
*
y
=
argmin(y' =
?
x
dy
-= 2x 0
=
dx
i= 0
Proof: Argmin((x: -n() =
e mean(2:3) i1
=
f
S
find
g2
=0 to le
Set
del h =
n
g2;g
= -
-
xi - M
Bu -I
i1
=
eg.( 1) -
2(
=
-
2g) 0
=
Proof: Argmin((x: -n() =
e mean(2:3)
i1 =
3 2g 0
=
2 z(xi u)
=
0
- -
[(xi )
0
=
-
: N.M
=
-
0
d+(M)
-
2
du
Y
i E
=
Q1:
What is the answer for
mean({mean({xi})}) ?
I
A. mean({xi}) B. unsure C. 0
Standard Deviation (σ)
The standard deviaAon
!
"
"1 $ N
std({xi }) = # (xi − mean({xi }))2
N i=1
!
= mean({(xi − mean({xi }))2 })
How much the
.
data speeds
out ur t mean
!
↑ !
X X
- ·- X X
0 R
Std=
* i1
=
di
Q2. Can a standard deviation of a dataset
be -1?
A. YES
I
B. NO
Properties of the standard deviation
Scaling data scales the standard deviaAon
std({k · xi }) = |k| · std({xi })
↑
Hot
N− 2 Hot
0.5N k zewos 0.5N
·K8
0
X
k2 k2
y
−kσ kσ
!
1 N 2 N
std = [(N − )0 + 2 (kσ)2 ] = σ
N k k
Variance 2
(σ )
✺ Variance = (standard deviation)2
1 !N
var({xi }) = (xi − mean({xi }))2
W
N i=1
xi − mean({xi })
x!i =
std({xi })
Q5: Standard deviation (σ) of
standard coordinates
✺ Std( {x!i } ) is:
H
A. 1 B. 0 C. unsure
xi − mean({xi })
x!i =
std({xi })
Q6: Variance of standard
coordinates
✺ Variance of {x!i } is:
I
A. 1 B. 0 C. unsure
xi − mean({xi })
x!i =
std({xi })
Q7: Estimate the range of data in
standard coordinates -
is within:
&
-
A. [-10, 10]
1/
/
--
K/ L
- is 5
I S
k -
1
=
C
B. [-100, 100]
Yk 2-
↓
10 Tr
C. [-1, 1]
xi − mean({xi })
-
x!i =
D. [-4, 4] std({xi })
E. others
Standard Coordinates/normalized data to
μ=0, σ=1, σ2=1
✺ Data in standard coordinates always has
mean = 0; standard deviation =1;
variance = 1.
✺ Such data is unit-less, plots based on this
sometimes are more comparable
✺ We see such normalization very often in
statistics
Median
✺ We first sort the data set {xi}
✺ Then if the number of items N is odd
median = middle item's value
if the number of items N is even
median = mean of middle 2 items'
values
Properties of Median
✺ Scaling data scales the median
median({k · xi }) = k · median({xi })
(1, 2, 3, 4. 5
5
/
6,7,12}
75th percentile =?
Interquartile range
✺ iqr = (75th percentile) - (25th percentile)
✺ Scaling data scales the interquartile range
iqr({k · xi }) = |k| · iqr({xi })
DEATH
✺ Good for outliers
✺ Easier to use
for comparison
Data from
https://www2.stetson.edu/~jrasp/data.htm
Boxplots details, outliers
✺ How to Outlier
define
> 1.5 iqr
outliers? Whisker
Data: “iris” in R
Example Bi-modes distribution
✺ Modes may
indicate
multiple
populations
left right
tail tail
medium medium
Credit: Prof.Forsyth
Looking at relationships in data
✺ Finding relationships between
features in a data set or many data
sets is one of the most important
tasks in data science
Heatmap
✺ Display matrix of data via gradient of color(s)
✺ x : HIGHT, y: WEIGHT
The visual way for continuous
features
✺ Time series plot
✺ Scatter plot
Time Series Plot: Stock of Amazon
Scatter plot
✺ A most effective tool for geographic
data and 2D data in general. It should
be your first step with a new 2D
dataset.
Scatter plot
✺ Body Fat data set
Scatter plot
✺ Scatter plot with density
Scatter plot
✺ Removed of outliers & standardized
Scatter plot
✺ Coupled with
heatmap to
show a 3rd
feature
Correlation seen from scatter plots
Zero Positive Negative
Correlation correlation correlation
Density
Normalized body temperature Heights, outliers removed, normalized Body fat
What kind of Correlation?
✺ line of code in a database and number of bugs
See
You!