
Introduction to Data Science

Ziqin Gong

March 7, 2023

Contents
1 Data fundamentals
  1.1 Data attributes
  1.2 Basic statistical descriptions of data
  1.3 Measuring data similarity and dissimilarity
  1.4 Probability inequalities

2 MapReduce
  2.1 Distributed file systems

Lecture 2: Basics of data Tue Feb 21 2023 10:02

1 Data fundamentals
1.1 Data attributes
• Data object: an entity in the dataset
• Data attribute: a particular data field representing a characteristic or feature of a data object (also simply called a feature)

1.1.1 Record data


Usually stored in a relational database.
• Each row represents a data object.
• Each column represents a data attribute.

1.1.2 Image data


A 3-channel matrix (3 × h × w) of real values in [0, 255].

1.1.3 Text data


A sequence of words/tokens that represents semantic meanings.


1.1.4 Graph data


A directed/undirected graph.
• Possibly with additional information for nodes and edges.

1.1.5 Streaming data


A sequence of readings.

1.1.6 Spatio-temporal data


A sequence of (time, location, info) tuples.

1.2 Basic statistical descriptions of data


Capture the properties of a given dataset:
• Central tendency: describes the center around which the data is distributed
• Dispersion: describes the data spread

1.2.1 Measuring the central tendency


Mean
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

• Weighted arithmetic mean:
\[ \mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

• Geometric mean: \( \mu = \sqrt[n]{\prod_i x_i} \)
  – µ_geometric ≤ µ_arithmetic.
  – The geometric mean is more sensitive to values near zero.
  – Geometric means make sense with ratios.
    ∗ The ratios 1/2 and 2 should average to 1: \( \sqrt{(1/2) \cdot 2} = 1 \).

Median Middle value if there is an odd number of values, otherwise the average of the middle two values.
• Mean is meaningful for symmetric distributions without outliers, e.g. height and weight.
• Median is better for skewed distributions or data with outliers, e.g. wealth and income.


Mode Value that occurs most frequently in the data.


• Unimodal, bimodal, trimodal
• Empirical formula (for a moderately skewed distribution):

mean − mode ≈ 3 × (mean − median)

Example. • Ten data points: 1, 1, 1, 1, 1, 2, 2, 2, 3, 3
• Mean: 1.7; Median: 1.5; Mode: 1 (computed in the sketch at the end of this subsection)
Symmetric vs. skewed data

• Positively skewed data: mode < median


• Symmetric data: mode = median
• Negatively skewed data: mode > median
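
A minimal Python sketch (standard library only; the dataset is the ten-point example above) computing these central-tendency measures:

import statistics

data = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]

print(statistics.mean(data))            # 1.7
print(statistics.median(data))          # 1.5 (average of the two middle values)
print(statistics.mode(data))            # 1 (most frequent value)
print(statistics.geometric_mean(data))  # ~1.53, <= the arithmetic mean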

1.2.2 Measuring the dispersion of data


Variance and standard deviation
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i = E[X] \]
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 = E[X^2] - E[X]^2 \]
\[ \sigma = \sqrt{\sigma^2} \]

The normal distribution curve
• [µ − σ, µ + σ]: contains about 68% of the measurements
• [µ − 2σ, µ + 2σ]: about 95%
• [µ − 3σ, µ + 3σ]: about 99.7%

Regardless of how the data is distributed, at least 1 − 1/k² of the points must lie within kσ of the mean (e.g. k = 2 guarantees at least 75%).

Quartiles, outliers and plots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Five-number summary: min, Q1, median, Q3, max
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1 (see the sketch at the end of this subsection)
Histograms and boxplots:
• Histograms: graphical display of tabulated frequencies shown as bars
  – shows what proportion of cases fall into each of several categories


• Datasets that have the same boxplot representation may have distinct histograms.
Quantile plot: each value xi is paired with fi, indicating that approximately 100fi% of the data is less than or equal to xi.

Quantile-quantile (Q-Q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
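
A short Python sketch (NumPy assumed available; the injected value 12 is my own example) of the dispersion measures above:

import numpy as np

data = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 12])  # 12 is an injected outlier

print(np.var(data), np.std(data))  # population variance and standard deviation

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(data.min(), q1, median, q3, data.max())  # five-number summary

# 1.5 x IQR rule: flag values far below Q1 or above Q3
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [12]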

1.3 Measuring data similarity and dissimilarity


Proximity measure:
• Proximity: refers to a similarity or dissimilarity of two data objects
– Similarity:
∗ numerical measure of how alike two data objects are
∗ Value is higher when objects are more alike.
∗ often falls in the range [0, 1]
– Dissimilarity (e.g. distance):
∗ numerical measure of how different two data objects are
∗ lower when objects are more alike
∗ minimum dissimilarity is often 0, while the upper limit varies
• Applications: clustering, anomaly detection, nearest-neighbor search

Figure 1: Contingency table for binary data. For two binary objects i and j, q counts the attributes equal to 1 in both, r those equal to 1 in i and 0 in j, s those equal to 0 in i and 1 in j, and t those equal to 0 in both.

1.3.1 Proximity measure for binary attributes


• Distance measure for symmetric binary variables:
\[ d(i, j) = \frac{r + s}{q + r + s + t} \]
• Distance measure for asymmetric binary variables (used when the negative matches t dominate and carry little information):
\[ d(i, j) = \frac{r + s}{q + r + s} \]
• Jaccard coefficient:
\[ \mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s} \]
  – The Jaccard coefficient is the ratio of intersection over union of two sets.
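
A small Python sketch of these measures on 0/1 vectors (the helper function is my own, not from the lecture):

def binary_counts(x, y):
    # contingency-table counts (q, r, s, t) for two 0/1 vectors
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t

x, y = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
q, r, s, t = binary_counts(x, y)
print((r + s) / (q + r + s + t))  # symmetric distance: 0.4
print((r + s) / (q + r + s))      # asymmetric distance: 0.5
print(q / (q + r + s))            # Jaccard coefficient: 0.5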

1.3.2 Minkowski distance


Definition 1. Minkowski distance: for
\[ x_i = (x_{i1}, x_{i2}, \cdots, x_{ip}), \quad x_j = (x_{j1}, x_{j2}, \cdots, x_{jp}), \]
\[ d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{\frac{1}{h}} \]
where h is the order (so the distance is also called the Lh norm).

Properties:
• Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
• Symmetry: d(i, j) = d(j, i)
• Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric.


Manhattan distance i.e. the L1 norm (h = 1)
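
A minimal sketch of the Minkowski distance for a few orders h (the helper is hypothetical, not from the lecture):

def minkowski(x, y, h):
    # L_h distance between two equal-length vectors
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan (L1): 5.0
print(minkowski(x, y, 2))  # Euclidean (L2): ~3.61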

1.3.3 Cosine similarity

Definition 2. If d1 and d2 are two vectors, then
\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \cdot \|d_2\|} \]
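
And a corresponding sketch for cosine similarity:

import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: parallel vectors
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal vectors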

1.4 Probability inequalities


1.4.1 Markov’s inequality

Theorem 1. If X is a non-negative r.v., then for every c > 0:
\[ \Pr[X \ge cE[X]] \le \frac{1}{c} \]

Proof. Assume X takes non-negative integer values (the continuous case is analogous):
\begin{align*}
E[X] &= \sum_i i \Pr[X = i] \\
&\ge \sum_{i \ge cE[X]} i \Pr[X = i] \\
&\ge \sum_{i \ge cE[X]} cE[X] \Pr[X = i] \\
&= cE[X] \Pr[X \ge cE[X]]
\end{align*}
\[ \Rightarrow \Pr[X \ge cE[X]] \le \frac{1}{c} \]

• Pro: always works


• Cons:
– Not very precise.
– Doesn’t work for the lower tail: Pr [X ≤ cE[X]]
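
A quick numerical sanity check of Markov's inequality (the exponential distribution is my own arbitrary choice):

import random

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]  # E[X] = 1
c = 3
empirical = sum(x >= c for x in samples) / len(samples)  # Pr[X >= c E[X]]
print(empirical, "<=", 1 / c)  # ~0.05 <= 0.33: the bound holds but is loose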

1.4.2 Chebyshev’s inequality

Theorem 2. For every c > 0,
\[ \Pr\left[ |X - E[X]| \ge c\sqrt{\mathrm{Var}[X]} \right] \le \frac{1}{c^2} \]


Proof.
\begin{align*}
\Pr\left[ |X - E[X]| \ge c\sqrt{\mathrm{Var}[X]} \right] &= \Pr\left[ |X - E[X]|^2 \ge c^2 \mathrm{Var}[X] \right] \\
&= \Pr\left[ |X - E[X]|^2 \ge c^2 E[|X - E[X]|^2] \right] \\
&\le \frac{1}{c^2}
\end{align*}
where the last step applies Markov's inequality to the non-negative r.v. |X − E[X]|².

1.4.3 Chernoff bound


Theorem 3. Let X1, X2, · · · , Xt be i.i.d. random variables with range [0, 1] and expectation µ. Then if X = (1/t) Σi Xi and 0 < δ < 1,
\[ \Pr[|X - \mu| \ge \delta\mu] \le 2\exp\left( -\frac{\mu t \delta^2}{3} \right) \]

Example. Let X = (1/t) Σi Xi and σ² = Var[Xi]:
• Chebyshev: \( \Pr[|X - \mu| \ge c'] \le \frac{\mathrm{Var}[X]}{c'^2} = \frac{\sigma^2}{t c'^2} \)
• Chernoff: \( \Pr[|X - \mu| \ge \delta\mu] \le 2\exp\left( -\frac{\mu t \delta^2}{3} \right) \)

If t is very big:
• Chebyshev: \( \Pr[|X - \mu| \ge z] = O\left( \frac{1}{t} \right) \)
• Chernoff: \( \Pr[|X - \mu| \ge z] = \exp(-\Omega(t)) \)
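
A small simulation comparing the empirical tail with both bounds as t grows (Bernoulli(0.5) variables and δ = 0.2 are my own arbitrary choices):

import math
import random

random.seed(0)
mu, delta, trials = 0.5, 0.2, 2_000

for t in [10, 100, 1000]:
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() < mu for _ in range(t)) / t
        hits += abs(mean - mu) >= delta * mu
    chebyshev = mu * (1 - mu) / (t * (delta * mu) ** 2)  # sigma^2 / (t c'^2), c' = delta*mu
    chernoff = 2 * math.exp(-mu * t * delta ** 2 / 3)
    # the Chebyshev bound shrinks like 1/t; the Chernoff bound shrinks exponentially in t
    print(t, hits / trials, min(chebyshev, 1.0), min(chernoff, 1.0))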

Lecture 3: MapReduce Tue Feb 28 2023 10:00

2 MapReduce
2.1 Distributed file systems
2.1.1 Cluster architecture
cluster -- switch ----- rack_0 --- node_0, node_1, ...
                  |
                  +---- rack_1 --- node_0, node_1, ...
                  |
                  ...

• Switch: 2–10 Gbps backbone between racks
• Rack: computing nodes connected by a (rack-level) switch

