Professional Documents
Culture Documents
Record Ordered
Relational records Video data: sequence of images
Data matrix, e.g., numerical matrix, Temporal data: time-series
crosstabs Sequential Data: transaction sequences
Document data: text documents: term- Genetic sequence data تسلسل جيني
frequency vector
Graph and network
Transaction data
World Wide Web
Spatial, image and multimedia:
Social or information networks
Spatial data: maps
Molecular Structures تراكيب جزيئية
Image data
Video data
Attributes
Data sets are made up of data objects.
A data object represents an entity. Student ID Name GPA Age
Attribute Types
Quantitative
Qualitative
Attribute Types (Qualitative)
6
Ordinal:
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Attribute Types (Qualitative)
7
Binary:
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
◼ e.g., gender
Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV positive)
Attribute Types (Quantitative)
8
Interval Ratio
o Measured on a scale of equal-sized units o Inherent zero-point
o Values have order o we can speak of a value as being a multiple (or
o No true zero-point ratio) of another value
o e.g., temperature in C˚or F˚, calendar dates o e.g., counts( years of experiences, number of words),
monetary quantities
Comparison between qualitative and quantitative attribute types
9
Qualitative Quantitative
OK to compute.... ? (Categorical) (Numeric)
Nominal Ordinal Interval Ratio
Median and percentiles .. No Yes Yes Yes
Ratios No No No Yes
Discrete vs. Continuous Attributes
10
Discrete Attribute:
Has only a finite or countably infinite set of values
◼ E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute:
Has real numbers as attribute values
◼ E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
11
Motivation
To better understand the data: central tendency, variation and spread
Measures of central tendency:
Given an attribute, where do most of its values fall?
The dispersion of the data:
How are the data spread out?
Graphic displays of basic statistical descriptions:
Visually inspect data
Measuring the Central Tendency:
Mean
12
Mean: The most common and effective numeric measure of the “center” of a set of
data
Let 𝑥1 , 𝑥2 , …, 𝑥𝑁 be a set of N values or observations, such as for some numeric
attribute X. The mean of this set of values is:
1 n
x = xi
n i =1
Weighted arithmetic mean:
n
w x i i
x= i =1
n
w
i =1
i
A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
Trimmed mean: chopping extreme values
Measuring the Central Tendency:
Median, Mode, and Midrange
13
Median:
Middle value if odd number of values, or average of the middle two values otherwise.
The median generally applies to numeric data; however, it may extend the concept to ordinal data.
Mode:
Value that occurs most frequently in the data.
It can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Midrange:
It is the average of the largest and smallest values in the set.
Symmetric vs. Skewed Data
14
Median, mean and mode of symmetric, positively and negatively skewed data
Measuring the Dispersion of Data
15
4-quantiles→ Quartiles
100-quantiles→ Percentiles
Measuring the Dispersion of Data
16
Variance and standard deviation indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be very close to the mean.
A high standard deviation indicates that the data are spread out over a large range of values.
Variance:
n n
1 1
= ( xi − ) = i 2
2 2
x − 2
N i =1 N i =1
The two histograms shown in the left may have the same boxplot representation
The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
Quantile Plot
21
Quantile-Quantile (Q-Q) Plot
22
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
Provides a first look at bivariate data to see clusters of points, outliers, or to explore
the possibility of correlation relationships.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
Scatter Plot: Correlation
24
Two attributes, X, and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated).
Proximity
Similarity and Dissimilarity
26
Data matrix: stores the n data objects Dissimilarity matrix: This structure
in the form of a relational table. stores a collection of proximities that
are available for all pairs of n
objects.
A triangular matrix
Proximity Measure for Nominal Attributes
28
Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
d (i, j) = p −
p
m
Dissimilarity matrix
“test-1”
Proximity Measure for Binary Attributes
30
2. Map the range of each variable onto [0, 1] by replacing 𝑖𝑡ℎ object in the 𝑓 𝑡ℎ
variable by:
zif = r −1
if
𝑀𝑓 : number of possible values for variable 𝑓.
M −1 f
Dissimilarity matrix
“test-2”
pf = 1 ij( f ) dij( f )
d (i, j) =
pf = 1 ij( f )
𝑓 is binary or nominal:
𝑓 is numeric: use the normalized distance
𝑓 is ordinal:
𝜹𝒊𝒋 = 0 IF either
◼ Compute ranks and zif = r − 1
if
M −1 f
(1) 𝑥𝑖𝑓 or 𝑥𝑗𝑓 is missing OR
◼ Treat as numeric (2) 𝑥𝑖𝑓 = 𝑥𝑗𝑓 = 0 and attribute 𝑓 is asymmetric binary;
Otherwise, 𝜹𝒊𝒋 = 1.
Example: Mixed Attributes
38
Calculate d(3,1).
Cosine similarity measures the similarity between two vectors by the cosine of the
angle between them.
x
y
Θ
Θ 0 22.5 45 67.5 90
cos(Θ)≈ 1 0.92 0.71 0.38 0
Cosine Similarity
40
x y
sim( x, y ) = cos( x, y ) =
x y
x • y = xi * yi x = x
i
2
i
i
Example: Cosine Similarity
41
Term-frequency vectors are very long and sparse (contains many zeros).
Traditional distance measures are not suitable;
Many zero-matches between two documents does not mean that they are similar.
Must ignore zero-matches → cosine similarity.
Example: Cosine Similarity
42
d1 = 52 + 02 + 32 + 02 + 22 + 02 + 02 + 22 + 02 + 02 = 6.48
d 2 = 32 + 02 + 22 + 02 + 12 + 12 + 02 + 12 + 02 + 12 = 4.12
25
cos(d1, d 2) = = 0.94
6.48 4.12
Exercise
43
1 2 3 4
Summary
44