Professional Documents
Culture Documents
Ziqin Gong
March 7, 2023
Contents
1 Data fundamentals 1
1.1 Data attibutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Basic statistical descriptions of data . . . . . . . . . . . . . . . . . . . . 2
1.3 Measuring data similarity and dissimilarity . . . . . . . . . . . . . . . . . 4
1.4 Probability inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 MapReduce 8
2.1 Distributed file systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1 Data fundamentals
1.1 Data attibutes
• Data object: an entity in the dataset
• A data attribute is aparticular data field, representing a characteristic or feature of
a data object.(feature)
1
Lecture 2: Basics of data
• Geometric mean: µ =
p
n
Q
i xi
– µgeometric ≤ µarithmetic .
– The geometric mean is more sensitive to values near zero.
– Geometric means make sense with ratios.
∗ The ratios 1/2 and 2 should average to 1.
Median Middle value if odd number of values, or average of the middle two values oth-
erwise.
• Mean is meaningful for symmetric distributions without outliers. e.g. height and
weight
• Median is better for skewed distributions or data with outliers. e.g. wealth and
income
1 DATA FUNDAMENTALS 2
Lecture 2: Basics of data
1 DATA FUNDAMENTALS 3
Lecture 2: Basics of data
• Those have the same boxplot representation may have distict histograms.
Quantile plot: each value xi is paired with fi indicating that approximatedly 100fi % of
data is less than or equal to xi .
Quantile-Quantile (Q-Q) plot: graphs the quantiles of one univariate distribution against
the corresponding quatiles of another
1 DATA FUNDAMENTALS 4
Lecture 2: Basics of data
• Jaccard coefficient:
q
simJaccard (i, j) =
q+r+s
– Jaccard coefficient is the ratio of intersection over union of two sets.
Properties:
• Positive definiteness: d(i, j) > 0 if i ̸= j, d(i, i) = 0
• Symmetry: d(i, j) = d(j, i)
• Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric.
1 DATA FUNDAMENTALS 5
Lecture 2: Basics of data
Proof.
X
E[X] = i Pr[X = i]
i
∞
X
≥ i Pr[X = i]
i=cE[X]
∞
X
≥ cE[X] Pr[X = i]
i=cE[X]
∞
X
= cE[X] Pr[X = i]
i=cE[X]
1 DATA FUNDAMENTALS 6
Lecture 3: MapReduce
Proof.
h p i
Pr |X − E[X]| ≥ c V ar[X] = Pr |X − E[X]|2 ≥ c2 V ar[X]
1
≤ 2
c
µtδ 2
Pr [|X − µ| ≥ δµ] ≤ 2 exp −
3
Example. Let X = 1
Xi ,σ = V ar[Xi ]:
P
t i
V ar[X]
• Chebyshev: Pr [|X − µ| ≥ c′ ] ≤ c′2 = σ
tc′2
2
• Chernoff: Pr [|X − µ| ≥ δµ] ≤ 2 exp − µtδ
3c
If t is very big:
• Chebyshev: Pr [|X − µ| ≥ z] = O 1
t
2 MapReduce
2.1 Distributed file systems
2.1.1 Cluster architecture
c l u s t e r −− s w i t c h −−−−− r a c k _ 0 −−− n o d e _ 0 , n o d e _ 1 , ...
|
+−− r a c k _ 1 −−− n o d e _ 0 , n o d e _ 1 , ...
|
...
2 MAPREDUCE 7