
2 Data

IT 326: Data Mining


First semester 2020

Chapter 2, “Data Mining: Concepts and Techniques” (3rd ed.)


Outline
2

 Data Objects and Attribute Types


 Basic Statistical Descriptions of Data
 Measuring Data Similarity and Dissimilarity
 Summary
Types of Data Sets
3

 Record
◼ Relational records
◼ Data matrix, e.g., numerical matrix, crosstabs
◼ Document data: text documents, term-frequency vector
◼ Transaction data
 Graph and network
◼ World Wide Web
◼ Social or information networks
◼ Molecular structures
 Ordered
◼ Video data: sequence of images
◼ Temporal data: time-series
◼ Sequential data: transaction sequences
◼ Genetic sequence data
 Spatial, image and multimedia
◼ Spatial data: maps
◼ Image data
◼ Video data

[Figure: examples of transaction data, document data, and a graph]


Data Objects
4

 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
◼ sales database: customers, store items, sales
◼ medical database: patients, treatments
◼ university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Attributes are also called dimensions, features, variables.
 Database: rows → data objects; columns → attributes.

Relational records (rows are objects, columns are attributes):

Object      Student ID   Name   GPA    Age
Student 1   8000         Sam    3.45   19
Student 2   5001         Jill   2.65   21
Attributes
5

 Attribute: a data field, representing a characteristic or feature of a data object.


 E.g., customer_ID, name, address
 The type of an attribute is determined by the set of possible values the attribute can
have.

Attribute Types
◼ Qualitative: nominal, ordinal, binary
◼ Quantitative: numeric
Attribute Types (Qualitative)
6

 Nominal: categories, states, or “names of things”


 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes

 Ordinal:
 Values have a meaningful order (ranking) but magnitude between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
Attribute Types (Qualitative)
7

 Binary:
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
◼ e.g., gender
 Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV positive)
Attribute Types (Quantitative)
8

 Numeric: a measurable quantity, represented in integer or real values.


 Numeric attributes can be further categorized into ( interval or ratio)

Interval
o Measured on a scale of equal-sized units
o Values have order
o No true zero-point
o e.g., temperature in °C or °F, calendar dates

Ratio
o Inherent zero-point
o We can speak of a value as being a multiple (or ratio) of another value
o e.g., counts (years of experience, number of words), monetary quantities
Comparison between qualitative and quantitative attribute types
9

                           Qualitative (Categorical)   Quantitative (Numeric)
OK to compute...?          Nominal     Ordinal         Interval    Ratio
Median and percentiles     No          Yes             Yes         Yes
Add or subtract            No          No              Yes         Yes
Mean, standard deviation   No          No              Yes         Yes
Ratios                     No          No              No          Yes
Discrete vs. Continuous Attributes
10

Any type of attribute can be categorized into discrete or continuous.

 Discrete Attribute:
 Has only a finite or countably infinite set of values
◼ E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes, represented as integer variables
 Note: Binary attributes are a special case of discrete attributes

 Continuous Attribute:
 Has real numbers as attribute values
◼ E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
11

 Motivation
 To better understand the data: central tendency, variation and spread
 Measures of central tendency:
 Given an attribute, where do most of its values fall?
 The dispersion of the data:
 How are the data spread out?
 Graphic displays of basic statistical descriptions:
 Visually inspect data
Measuring the Central Tendency:
Mean
12

 Mean: The most common and effective numeric measure of the “center” of a set of
data
 Let 𝑥1 , 𝑥2 , …, 𝑥𝑁 be a set of N values or observations, such as for some numeric
attribute X. The mean of this set of values is:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
 Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$

 A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
 Trimmed mean: chopping extreme values
Measuring the Central Tendency:
Median, Mode, and Midrange
13

 Median:
 Middle value if odd number of values, or average of the middle two values otherwise.
 The median generally applies to numeric data; however, the concept may be extended to ordinal data.
 Mode:
 Value that occurs most frequently in the data.
 It can be determined for qualitative and quantitative attributes.
 Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
 Midrange:
 It is the average of the largest and smallest values in the set.
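
The measures of central tendency above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names (`weighted_mean`, `trimmed_mean`, `midrange`) and the sample values are assumptions made for this example and are not taken from the slides.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping off the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


# Illustrative sample (hypothetical); 110 behaves like an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))
print("modes:", multimode(salaries))       # two modes, so the data are bimodal
print("midrange:", midrange(salaries))
print("trimmed mean (k=1):", trimmed_mean(salaries, 1))
```

Dropping one value from each end pulls the trimmed mean below the plain mean, which shows how a single extreme value shifts the mean while the median is unaffected.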
Symmetric vs. Skewed Data
14

 Median, mean and mode of symmetric, positively and negatively skewed data
Measuring the Dispersion of Data
15

 Range, Quartiles, and Interquartile Range:


 Range: is the difference between the largest and smallest values.
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1

4-quantiles→ Quartiles

100-quantiles→ Percentiles
Measuring the Dispersion of Data
16

 Five-Number Summary, Boxplots, and Outliers
 Five-number summary: Minimum, Q1, Median, Q3, Maximum
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Example (Branch 1): Q3 = 100, Q1 = 60
◼ IQR = Q3 − Q1 = 100 − 60 = 40
◼ 1.5 × IQR = 60
◼ IF x > Q3 + 60 = 160 OR x < Q1 − 60 = 0 THEN x is an outlier
 Boxplot: graphic display of the five-number summary
o Data is represented with a box
o The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
o The median is marked by a line within the box
o Whiskers: two lines outside the box extended to the Minimum and Maximum (when outliers are plotted separately, the whiskers stop at the most extreme values within the outlier threshold)
o Outliers: points beyond a specified outlier threshold, plotted individually
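
The five-number summary and the 1.5 × IQR rule can be sketched as follows. This is a hedged example: quartile conventions differ between tools, and the choice of `statistics.quantiles` with `method="inclusive"` is an assumption of this sketch, as are the function names and the `prices` list. Only the Branch 1 quartiles (Q1 = 60, Q3 = 100) come from the slide.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def outlier_fences(q1, q3, k=1.5):
    """Lower and upper thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values):
    """Values lying outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Branch 1 quartiles from the slide: Q1 = 60, Q3 = 100.
low, high = outlier_fences(60, 100)
print(low, high)  # 0.0 160.0

# Hypothetical unit prices, used only to exercise the functions.
prices = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(prices))
print(find_outliers(prices))
```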
Measuring the Dispersion of Data
17

 Variance and standard deviation indicate how spread out a data distribution is.
 A low standard deviation means that the data observations tend to be very close to the mean.
 A high standard deviation indicates that the data are spread out over a large range of values.
 Variance:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \mu^2$$

 Standard deviation (or σ) is the square root of variance.


◼ σ =0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.
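
As a small check of the two equivalent forms of the variance, the sketch below computes the population variance both ways and takes the square root for σ. The function names and the sample list are illustrative assumptions, not part of the slides.

```python
from math import isclose, sqrt


def population_variance(values):
    """sigma^2 = (1/N) * sum((x_i - mu)^2)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Equivalent form: (1/N) * sum(x_i^2) - mu^2."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


def population_std(values):
    """Standard deviation: square root of the variance."""
    return sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
print(population_variance(data), population_variance_shortcut(data))
print(population_std(data))
print(population_std([3, 3, 3]))  # no spread, so sigma is 0
```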
Graphic Displays of Basic Statistical Descriptions
18

 Boxplot (see slide 16)


 Histogram
 Quantile plot
 Quantile-quantile (q-q) plot
 Scatter plot
Histogram
19

 Plotting histograms is a graphical method for summarizing the distribution of a given attribute X.
 If X is nominal, then a vertical bar is drawn for each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is a "bar chart".
 If X is numeric, the range of values for X is partitioned into disjoint consecutive subranges. For each subrange, a bar is drawn with a height that represents the total count of items observed within the subrange. The resulting graph is a "histogram".
 The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X.
 The range of a bucket is known as the width.
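
To make the bucket idea concrete, here is a minimal sketch of equal-width binning for a numeric attribute. The function name `equal_width_histogram` and the sample values are assumptions for illustration; the sketch also assumes the attribute has at least two distinct values, so that the bucket width is non-zero.

```python
def equal_width_histogram(values, num_bins):
    """Partition [min, max] into num_bins equal-width buckets and count items.

    Returns (edges, counts). Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bucket.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + k * width for k in range(num_bins + 1)]
    return edges, counts


edges, counts = equal_width_histogram([1, 2, 2, 3, 5, 8, 9, 10], 3)
print(edges)   # bucket boundaries; each bucket has width 3
print(counts)  # number of values falling in each bucket
```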


Histograms Often Tell More than Boxplots
20

 Two histograms may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
 But they can have rather different data distributions
Quantile Plot
21
Quantile-Quantile (Q-Q) Plot
22

 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example: shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
Scatter plot
23

 A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
 Provides a first look at bivariate data to see clusters of points, outliers, or to explore
the possibility of correlation relationships.
 Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
Scatter Plot: Correlation
24

 Two attributes, X, and Y, are correlated if one attribute implies the other.
 Correlations can be positive, negative, or null (uncorrelated).
Proximity
Similarity and Dissimilarity
26

 Proximity refers to a similarity or dissimilarity


 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
Data Matrix and Dissimilarity Matrix
27

 Data matrix: stores the n data objects in the form of a relational table.
 Dissimilarity matrix: stores a collection of proximities that are available for all pairs of n objects.
◼ A triangular matrix
Proximity Measure for Nominal Attributes
28

 Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)

 Method 1: Simple matching


 m: # of matches, p: total # of variables

$$d(i, j) = \frac{p - m}{p}$$

 Method 2: Use a large number of binary attributes


 creating a new binary attribute for each of the M nominal states
Example:

Obj   Color        Obj   col_red   col_yel   col_blue   col_green
1     Red          1     1         0         0          0
2     Yel          2     0         1         0          0
3     Red          3     1         0         0          0
4     Green        4     0         0         0          1
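
Both methods can be sketched briefly in Python. The function names are illustrative assumptions; the colors are the values from the Color column of the example. Method 1 computes d(i, j) = (p − m)/p, and Method 2 creates one binary attribute per nominal state.

```python
def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p, with m matches out of p nominal attributes."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def one_hot(value, states):
    """Encode one nominal value as a list of binary attributes, one per state."""
    return [1 if value == s else 0 for s in states]


colors = ["Red", "Yel", "Red", "Green"]        # objects 1 to 4
states = ["Red", "Yel", "Blue", "Green"]       # the M nominal states
encoded = [one_hot(c, states) for c in colors]
for obj_id, row in enumerate(encoded, start=1):
    print(obj_id, row)

# With a single nominal attribute, d is 0 for a match and 1 otherwise.
print(simple_matching_dissimilarity(["Red"], ["Red"]))
print(simple_matching_dissimilarity(["Red"], ["Yel"]))
```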
Example: Nominal attributes
29

Dissimilarity matrix
“test-1”
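Proximity Measure for Binary Attributes
30

 A contingency table for binary attributes (q, r, s, t count the attributes with each combination of values; p = q + r + s + t):

                 Object j
                 1        0        sum
Object i   1     q        r        q + r
           0     s        t        s + t
           sum   q + s    r + t    p

 Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
 Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s}$$
 Jaccard coefficient: similarity measure for asymmetric binary variables.
$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$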

Example: Binary Attributes
31

 Gender is a symmetric attribute


 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0
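
A minimal sketch of the binary proximity measures, assuming the usual contingency-table counts: q (both objects are 1), r (i is 1, j is 0), s (i is 0, j is 1), and t (both are 0). The function names and the two example vectors are hypothetical and are not the data from the slide's table.

```python
def contingency_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_binary_distance(a, b):
    """(r + s) / (q + r + s + t): 0-0 and 1-1 matches both count."""
    q, r, s, t = contingency_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_distance(a, b):
    """(r + s) / (q + r + s): 0-0 matches are ignored.

    Undefined (division by zero) if neither object has any 1.
    """
    q, r, s, _ = contingency_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    """q / (q + r + s), i.e. 1 minus the asymmetric binary distance."""
    q, r, s, _ = contingency_counts(a, b)
    return q / (q + r + s)


# Hypothetical asymmetric attributes with Y/P mapped to 1 and N mapped to 0.
obj_a = [1, 0, 1, 0, 0, 0]
obj_b = [1, 0, 1, 0, 1, 0]
print(contingency_counts(obj_a, obj_b))
print(asymmetric_binary_distance(obj_a, obj_b))
print(jaccard_similarity(obj_a, obj_b))
```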
Distance on Numeric Data: Minkowski Distance
32

 Minkowski distance: A popular distance measure

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two objects described by p numeric attributes, and h is the order (the distance so defined is also called the L-h norm)
 Properties:
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
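Distance on Numeric Data: Special cases
33

For two objects i and j with attribute values $x_{if}$ and $x_{jf}$, f = 1, ..., p:

 h = 1: Manhattan (city block, L1 norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
 E.g., the Hamming distance: the number of bits that are different between two binary vectors
 h = 2: Euclidean (L2 norm) distance
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$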

Example: Numeric Distance
34

Data Matrix

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
L1   x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)
L2   x1     x2    x3     x4
x1   0
x2   3.61   0
x3   2.24   5.1   0
x4   4.24   1     5.39   0
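
The two dissimilarity matrices can be reproduced with a short sketch. The function names are illustrative assumptions; the points are the four rows of the data matrix above, and h = 1 and h = 2 select the Manhattan and Euclidean cases of the Minkowski distance.

```python
def minkowski_distance(x, y, h):
    """(sum over attributes of |x_f - y_f|^h)^(1/h); h=1 Manhattan, h=2 Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def dissimilarity_matrix(points, h):
    """Lower-triangular distances keyed by (row, column) point names."""
    names = list(points)
    return {
        (names[i], names[j]): round(
            minkowski_distance(points[names[i]], points[names[j]], h), 2
        )
        for i in range(len(names))
        for j in range(i)
    }


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)

for pair in manhattan:
    print(pair, manhattan[pair], euclidean[pair])
```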
Proximity Measure for Ordinal Attributes
35

 Order is important, e.g., rank


 Can be treated like interval-scaled:
1. Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of possible values for variable f.

2. Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by:

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

3. Compute the dissimilarity using methods for numeric variables.


Example: Ordinal Attributes
36

Dissimilarity matrix
“test-2”

Obj ID   Test-2      Rank   Normalized
1        Excellent   3      (3-1)/(3-1) = 1
2        Fair        1      0
3        Good        2      0.5
4        Excellent   3      1
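
The rank normalization and the resulting dissimilarities can be sketched as follows. The function name is a hypothetical choice, and using the absolute difference of the normalized ranks as the numeric distance is an assumption of this sketch (with a single attribute, Manhattan and Euclidean distance coincide).

```python
def normalize_ranks(values, ordered_states):
    """Map each ordinal value to z = (rank - 1) / (M - 1), with ranks 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m = len(ordered_states)
    return [(rank[v] - 1) / (m - 1) for v in values]


test2 = ["Excellent", "Fair", "Good", "Excellent"]        # objects 1 to 4
z = normalize_ranks(test2, ["Fair", "Good", "Excellent"])
print(z)

# Lower-triangular dissimilarity matrix: row i holds d(i, 1), ..., d(i, i).
dissim = [[abs(zi - zj) for zj in z[: i + 1]] for i, zi in enumerate(z)]
for row in dissim:
    print(row)
```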
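Dissimilarity for Attributes of Mixed Types
37

 A database may contain all attribute types:
 Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

 The indicator $\delta_{ij}^{(f)} = 0$ if either
(1) $x_{if}$ or $x_{jf}$ is missing, or
(2) $x_{if} = x_{jf} = 0$ and attribute f is asymmetric binary;
otherwise, $\delta_{ij}^{(f)} = 1$.
 f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$
 f is numeric: use the normalized distance $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
 f is ordinal:
◼ Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
◼ Treat $z_{if}$ as numeric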
Example: Mixed Attributes
38

 Calculate d(3,1).

Obj ID   Test-1 (nominal)   Test-2 (ordinal)   Test-3 (numeric)
1        Code A             Excellent → 1      45
2        Code B             Fair               22 (min)
3        Code C             Good → 0.5         64 (max)
4        Code A             Excellent          28

$$d_{3,1}^{(\text{Test-1})} = 1, \qquad d_{3,1}^{(\text{Test-2})} = 0.5, \qquad d_{3,1}^{(\text{Test-3})} = \frac{|64 - 45|}{64 - 22} = 0.45$$

$$d(3, 1) = \frac{1(1) + 1(0.5) + 1(0.45)}{1 + 1 + 1} = 0.65$$
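
The calculation of d(3,1) can be checked with a short sketch of the weighted formula. The function name and the (indicator, distance) pair representation are assumptions made for illustration; the attribute values are those of objects 3 and 1 in the table.

```python
def mixed_dissimilarity(per_attribute):
    """Combine per-attribute (delta, d) pairs: sum(delta * d) / sum(delta).

    delta is 0 when an attribute must be skipped (e.g. a missing value)
    and 1 otherwise.
    """
    numerator = sum(delta * d for delta, d in per_attribute)
    denominator = sum(delta for delta, _ in per_attribute)
    return numerator / denominator


# Test-1 (nominal): Code C vs. Code A differ, so the distance is 1.
d_test1 = 0.0 if "Code C" == "Code A" else 1.0
# Test-2 (ordinal): normalized ranks 0.5 (Good) and 1 (Excellent).
d_test2 = abs(0.5 - 1.0)
# Test-3 (numeric): normalized by the attribute range, max 64 and min 22.
d_test3 = abs(64 - 45) / (64 - 22)

d_31 = mixed_dissimilarity([(1, d_test1), (1, d_test2), (1, d_test3)])
print(round(d_test3, 2), round(d_31, 2))  # 0.45 0.65
```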
Cosine Similarity
39

 Cosine similarity measures the similarity between two vectors by the cosine of the
angle between them.
[Figure: two vectors x and y separated by the angle Θ]

Θ (degrees)   0   22.5   45     67.5   90
cos(Θ) ≈      1   0.92   0.71   0.38   0
Cosine Similarity
40

x y
sim( x, y ) = cos( x, y ) =
x y

x • y =  xi * yi x = x
i
2
i
i
Example: Cosine Similarity
41

 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
 Term-frequency vectors are very long and sparse (they contain many zeros).
 Traditional distance measures are not suitable:
◼ Many zero-matches between two documents do not mean that they are similar.
◼ Must ignore zero-matches → cosine similarity.
Example: Cosine Similarity
42

 Ex: Find the similarity between documents 1 and 2.


 d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

$$d_1 \cdot d_2 = (5 \times 3) + (0 \times 0) + (3 \times 2) + (0 \times 0) + (2 \times 1) + (0 \times 1) + (0 \times 0) + (2 \times 1) + (0 \times 0) + (0 \times 1) = 25$$

$$\|d_1\| = \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48$$

$$\|d_2\| = \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12$$

$$\cos(d_1, d_2) = \frac{25}{6.48 \times 4.12} = 0.94$$
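
The same computation can be written as a short function. This is a minimal sketch with an illustrative function name; it assumes neither vector is all zeros, since the norm would then be 0 and the ratio undefined.

```python
from math import sqrt


def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) for equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Positions where both documents have a zero contribute nothing to the dot product or to the norms, which is why the measure suits sparse term-frequency vectors.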
Exercise
43

Summary
44

 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled


 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion, graphical displays
 Measure data similarity and dissimilarity
