
Introduction to Data Science

Ziqin Gong

March 7, 2023

Contents
1 Data fundamentals
  1.1 Data attributes
  1.2 Basic statistical descriptions of data
  1.3 Measuring data similarity and dissimilarity
  1.4 Probability inequalities

2 MapReduce
  2.1 Distributed file systems

Lecture 2: Basics of data Tue Feb 21 2023 10:02

1 Data fundamentals
1.1 Data attributes
• Data object: an entity in the dataset
• Data attribute: a particular data field representing a characteristic or feature of a data object (also simply called a feature)

1.1.1 Record data


Usually stored in a relational database.
• Each row represents a data object.
• Each column represents a data attribute.

1.1.2 Image data


A 3-channel matrix (3 × h × w) of real values in [0, 255].

1.1.3 Text data


A sequence of words/tokens that represents semantic meanings.


1.1.4 Graph data


A directed/undirected graph.
• Possibly with additional information for nodes and edges.

1.1.5 Streaming data


A sequence of readings.

1.1.6 Spatio-temporal data


A sequence of (time, location, info) tuples.

1.2 Basic statistical descriptions of data


Capture the properties of a given dataset:
• Central tendency: describes the center around which the data is distributed
• Dispersion: describes the data spread

1.2.1 Measuring the central tendency


Mean
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

• Weighted arithmetic mean:
\[ \mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

• Geometric mean: \( \mu = \sqrt[n]{\prod_i x_i} \)
  – µ_geometric ≤ µ_arithmetic.
  – The geometric mean is more sensitive to values near zero.
  – Geometric means make sense with ratios.
    ∗ The ratios 1/2 and 2 should average to 1: \( \sqrt{(1/2) \cdot 2} = 1 \).

Median Middle value if there is an odd number of values, otherwise the average of the middle two values.
• Mean is meaningful for symmetric distributions without outliers, e.g. height and weight.
• Median is better for skewed distributions or data with outliers, e.g. wealth and income.


Mode Value that occurs most frequently in the data.


• Unimodal, bimodal, trimodal
• Empirical formula (for a moderately skewed distribution):

mean − mode ≈ 3 × (mean − median)

Example. • Ten data points: 1, 1, 1, 1, 1, 2, 2, 2, 3, 3
• Mean: 1.7; Median: 1.5; Mode: 1 (computed in the sketch at the end of this subsection)
Symmetric vs. skewed data

• Positively skewed data: mode < median


• Symmetric data: mode = median
• Negatively skewed data: mode > median
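
A minimal Python sketch (standard library only; the dataset is the ten-point example above) computing these central-tendency measures:

import statistics

data = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]

print(statistics.mean(data))            # 1.7
print(statistics.median(data))          # 1.5 (average of the two middle values)
print(statistics.mode(data))            # 1 (most frequent value)
print(statistics.geometric_mean(data))  # ~1.53, <= the arithmetic mean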

1.2.2 Measuring the dispersion of data


Variance and standard deviation
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i = E[X] \]
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 = E[X^2] - E[X]^2 \]
\[ \sigma = \sqrt{\sigma^2} \]

The normal distribution curve
• [µ − σ, µ + σ]: contains about 68% of the measurements
• [µ − 2σ, µ + 2σ]: about 95%
• [µ − 3σ, µ + 3σ]: about 99.7%

Regardless of how the data is distributed, at least 1 − 1/k² of the points must lie within kσ of the mean (e.g. k = 2 guarantees at least 75%).

Quartiles, outliers and plots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Five-number summary: min, Q1, median, Q3, max
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1 (see the sketch at the end of this subsection)
Histograms and boxplots:
• Histograms: graphical display of tabulated frequencies shown as bars
  – shows what proportion of cases fall into each of several categories


• Datasets that have the same boxplot representation may have distinct histograms.
Quantile plot: each value xi is paired with fi, indicating that approximately 100fi% of the data is less than or equal to xi.

Quantile-quantile (Q-Q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
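
A short Python sketch (NumPy assumed available; the injected value 12 is my own example) of the dispersion measures above:

import numpy as np

data = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 12])  # 12 is an injected outlier

print(np.var(data), np.std(data))  # population variance and standard deviation

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(data.min(), q1, median, q3, data.max())  # five-number summary

# 1.5 x IQR rule: flag values far below Q1 or above Q3
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [12]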

1.3 Measuring data similarity and dissimilarity


Proximity measure:
• Proximity: refers to a similarity or dissimilarity of two data objects
– Similarity:
∗ numerical measure of how alike two data objects are
∗ Value is higher when objects are more alike.
∗ often falls in the range [0, 1]
– Dissimilarity (e.g. distance):
∗ numerical measure of how different two data objects are
∗ lower when objects are more alike
∗ minimum dissimilarity is often 0, while the upper limit varies
• Applications: clustering, anomaly detection, nearest-neighbor search

Figure 1: Contingency table for binary data. For two binary objects i and j, q counts the attributes equal to 1 in both, r those equal to 1 in i and 0 in j, s those equal to 0 in i and 1 in j, and t those equal to 0 in both.

1.3.1 Proximity measure for binary attributes


• Distance measure for symmetric binary variables:
\[ d(i, j) = \frac{r + s}{q + r + s + t} \]
• Distance measure for asymmetric binary variables (used when the negative matches t dominate and carry little information):
\[ d(i, j) = \frac{r + s}{q + r + s} \]
• Jaccard coefficient:
\[ \mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s} \]
  – The Jaccard coefficient is the ratio of intersection over union of two sets.
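
A small Python sketch of these measures on 0/1 vectors (the helper function is my own, not from the lecture):

def binary_counts(x, y):
    # contingency-table counts (q, r, s, t) for two 0/1 vectors
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t

x, y = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
q, r, s, t = binary_counts(x, y)
print((r + s) / (q + r + s + t))  # symmetric distance: 0.4
print((r + s) / (q + r + s))      # asymmetric distance: 0.5
print(q / (q + r + s))            # Jaccard coefficient: 0.5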

1.3.2 Minkowski distance


Definition 1. Minkowski distance: for
\[ x_i = (x_{i1}, x_{i2}, \cdots, x_{ip}), \quad x_j = (x_{j1}, x_{j2}, \cdots, x_{jp}), \]
\[ d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{\frac{1}{h}} \]
where h is the order (so the distance is also called the Lh norm).

Properties:
• Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
• Symmetry: d(i, j) = d(j, i)
• Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric.


Manhattan distance i.e. the L1 norm (h = 1)
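
A minimal sketch of the Minkowski distance for a few orders h (the helper is hypothetical, not from the lecture):

def minkowski(x, y, h):
    # L_h distance between two equal-length vectors
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan (L1): 5.0
print(minkowski(x, y, 2))  # Euclidean (L2): ~3.61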

1.3.3 Cosine similarity

Definition 2. If d1 and d2 are two vectors, then
\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \cdot \|d_2\|} \]
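
And a corresponding sketch for cosine similarity:

import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: parallel vectors
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal vectors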

1.4 Probability inequalities


1.4.1 Markov’s inequality

Theorem 1. If X is a non-negative r.v., then for every c > 0:
\[ \Pr[X \ge cE[X]] \le \frac{1}{c} \]

Proof. Assume X takes non-negative integer values (the continuous case is analogous):
\begin{align*}
E[X] &= \sum_i i \Pr[X = i] \\
&\ge \sum_{i \ge cE[X]} i \Pr[X = i] \\
&\ge \sum_{i \ge cE[X]} cE[X] \Pr[X = i] \\
&= cE[X] \Pr[X \ge cE[X]]
\end{align*}
\[ \Rightarrow \Pr[X \ge cE[X]] \le \frac{1}{c} \]

• Pro: always works


• Cons:
– Not very precise.
– Doesn’t work for the lower tail: Pr [X ≤ cE[X]]
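
A quick numerical sanity check of Markov's inequality (the exponential distribution is my own arbitrary choice):

import random

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]  # E[X] = 1
c = 3
empirical = sum(x >= c for x in samples) / len(samples)  # Pr[X >= c E[X]]
print(empirical, "<=", 1 / c)  # ~0.05 <= 0.33: the bound holds but is loose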

1.4.2 Chebyshev’s inequality

Theorem 2. For every c > 0,
\[ \Pr\left[ |X - E[X]| \ge c\sqrt{\mathrm{Var}[X]} \right] \le \frac{1}{c^2} \]


Proof.
\begin{align*}
\Pr\left[ |X - E[X]| \ge c\sqrt{\mathrm{Var}[X]} \right] &= \Pr\left[ |X - E[X]|^2 \ge c^2 \mathrm{Var}[X] \right] \\
&= \Pr\left[ |X - E[X]|^2 \ge c^2 E[|X - E[X]|^2] \right] \\
&\le \frac{1}{c^2}
\end{align*}
where the last step applies Markov's inequality to the non-negative r.v. |X − E[X]|².

1.4.3 Chernoff bound


Theorem 3. Let X1, X2, · · · , Xt be i.i.d. random variables with range [0, 1] and expectation µ. Then if X = (1/t) Σi Xi and 0 < δ < 1,
\[ \Pr[|X - \mu| \ge \delta\mu] \le 2\exp\left( -\frac{\mu t \delta^2}{3} \right) \]

Example. Let X = (1/t) Σi Xi and σ² = Var[Xi]:
• Chebyshev: \( \Pr[|X - \mu| \ge c'] \le \frac{\mathrm{Var}[X]}{c'^2} = \frac{\sigma^2}{t c'^2} \)
• Chernoff: \( \Pr[|X - \mu| \ge \delta\mu] \le 2\exp\left( -\frac{\mu t \delta^2}{3} \right) \)

If t is very big:
• Chebyshev: \( \Pr[|X - \mu| \ge z] = O\left( \frac{1}{t} \right) \)
• Chernoff: \( \Pr[|X - \mu| \ge z] = \exp(-\Omega(t)) \)
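
A small simulation comparing the empirical tail with both bounds as t grows (Bernoulli(0.5) variables and δ = 0.2 are my own arbitrary choices):

import math
import random

random.seed(0)
mu, delta, trials = 0.5, 0.2, 2_000

for t in [10, 100, 1000]:
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() < mu for _ in range(t)) / t
        hits += abs(mean - mu) >= delta * mu
    chebyshev = mu * (1 - mu) / (t * (delta * mu) ** 2)  # sigma^2 / (t c'^2), c' = delta*mu
    chernoff = 2 * math.exp(-mu * t * delta ** 2 / 3)
    # the Chebyshev bound shrinks like 1/t; the Chernoff bound shrinks exponentially in t
    print(t, hits / trials, min(chebyshev, 1.0), min(chernoff, 1.0))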

Lecture 3: MapReduce Tue Feb 28 2023 10:00

2 MapReduce
2.1 Distributed file systems
2.1.1 Cluster architecture
cluster -- switch ----- rack_0 --- node_0, node_1, ...
                  |
                  +---- rack_1 --- node_0, node_1, ...
                  |
                  ...

• Switch: 2–10 Gbps backbone between racks
• Rack: computing nodes connected by a (rack-level) switch

