Data Mining
Dr. Mohammadi Zanjireh
Imam Khomeini International University (IKIU)
• Midterms: 4*10=40
• Exercises: 10
• Project: 20
• Final exam: 30
Chapter 1
Introduction
§ “We are living in the information age” is a popular saying; however, we are actually living in the data age.
§ This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful tools are badly needed to automatically uncover valuable information from the tremendous amounts of data and to transform such data into organised knowledge.
§ Example 1-1: Search Engine.
§ A data rich but information poor situation.
§ Data tombs.
§ What is Data Mining?
Data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.
§ Knowledge discovery from data (KDD):
§ Data Cleaning. (Preprocessing Step)
§ Data Integration.
§ Data Selection.
§ Data Mining.
§ Pattern Evaluation.
§ Knowledge Presentation.
• What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, transactional data, data streams, multimedia data, and the WWW.
• Data mining tasks can be classified into two categories:
– Descriptive: Characterises properties of the data in a target data set.
– Predictive: Performs induction on the current data in order to make predictions.
– Drill-down
– Roll-up
§ Statistics
Chapter 2
Observations
1004 Hasan 15.45
the data mining process.
– Numeric
• Interval_scaled.
Student_i
Student_n
} Median:
} Mode:
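As a small illustration of the median and mode next to the mean, the sketch below uses Python's standard `statistics` module. It borrows the five values 3, 4, 4, 5, 9 that appear in the normalisation example later in these slides; the variable names are only illustrative.

```python
import statistics

# Values borrowed from the normalisation example later in the slides.
values = [3, 4, 4, 5, 9]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

The single large value 9 pulls the mean above the median, which is why the median is often preferred for skewed data.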
} Dissimilarity Matrix:
Dissimilarity Matrix =
0.00
0.75  0.00
1.00  0.75  0.00
0.75  1.00  0.75  0.00
} Proximity measures for Ordinal attributes
Small  Medium  Large
1      2       3
} Example:
} Degree={Diploma, Undergraduate, Master, PhD}
} Drink_size={Small, Medium, Large}
Id  Degree         Drink_size
1   Undergraduate  Medium
2   PhD            Small
3   Diploma        Medium
4   Undergraduate  Large
} Example (ranks normalised to [0, 1]):
Id  Degree  Drink_size
1   0.33    0.50
2   1.00    0.00
3   0.00    0.50
4   0.33    1.00
• Normalising:
Id  Grade  Grade-min  (Grade-min)/(max-min)
1   30     10         0.14
2   52     32         0.45
3   84     64         0.90
4   45     25         0.35
5   25     5          0.07
6   20     0          0.00
7   91     71         1.00
8   65     45         0.63
9   42     22         0.31
10  32     12         0.17
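The rank normalisation of the Degree and Drink_size values and the (Grade-min)/(max-min) scaling of the grades are the same linear rescaling to [0, 1]. The sketch below reproduces both tables. The function names are illustrative, and the Degree ranks (Diploma=1 up to PhD=4) are an assumption taken from the order in which the categories are listed; only the Drink_size ranks are given explicitly on the slide.

```python
def min_max(values):
    """Rescale values linearly to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalise_rank(rank, m):
    """Map a rank r in 1..M to (r - 1) / (M - 1)."""
    return (rank - 1) / (m - 1)


# Ordinal attributes: replace each category by its rank, then rescale.
# Degree ranks are assumed from the listed order of the categories.
degree_rank = {"Diploma": 1, "Undergraduate": 2, "Master": 3, "PhD": 4}
size_rank = {"Small": 1, "Medium": 2, "Large": 3}

objects = [
    ("Undergraduate", "Medium"),
    ("PhD", "Small"),
    ("Diploma", "Medium"),
    ("Undergraduate", "Large"),
]

ordinal_rows = [
    (
        round(normalise_rank(degree_rank[d], len(degree_rank)), 2),
        round(normalise_rank(size_rank[s], len(size_rank)), 2),
    )
    for d, s in objects
]

# Numeric attribute: the ten grades from the normalising table.
grades = [30, 52, 84, 45, 25, 20, 91, 65, 42, 32]
grade_scaled = [round(g, 2) for g in min_max(grades)]

print(ordinal_rows)
print(grade_scaled)
```

Once ordinal values are mapped to [0, 1] this way, they can be compared with the same distance measures used for numeric attributes.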
} Proximity measures for mixed types:
} Examples:
Id  Test_1(nominal)  Test_2(ordinal)  Test_3(numeric)
1   Code A           1                0.55
2   Code B           0                0.00
3   Code C           0.5              1.00
4   Code A           1                0.14
Dissimilarity Matrix =
0.00
0.85  0.00
0.65  0.83  0.00
0.13  0.71  0.79  0.00
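The mixed-type matrix can be checked with a short sketch. It assumes the usual rule of averaging the per-attribute dissimilarities with equal weight and no missing values: 0 or 1 for the nominal attribute (match or mismatch), and the absolute difference of the already normalised values for the ordinal and numeric attributes. The rule and the function name are assumptions; the slide itself shows only the data and the result.

```python
# Rows of the example table: (Test_1 nominal, Test_2 ordinal, Test_3 numeric),
# with Test_2 and Test_3 already normalised to [0, 1].
rows = [
    ("Code A", 1.0, 0.55),
    ("Code B", 0.0, 0.00),
    ("Code C", 0.5, 1.00),
    ("Code A", 1.0, 0.14),
]


def mixed_dissimilarity(a, b):
    """Average of the three per-attribute dissimilarities (assumed rule)."""
    d_nominal = 0.0 if a[0] == b[0] else 1.0  # match = 0, mismatch = 1
    d_ordinal = abs(a[1] - b[1])              # normalised ranks
    d_numeric = abs(a[2] - b[2])              # normalised numeric values
    return (d_nominal + d_ordinal + d_numeric) / 3


# Lower-triangular dissimilarity matrix, one row per object.
matrix = [
    [mixed_dissimilarity(rows[i], rows[j]) for j in range(i + 1)]
    for i in range(len(rows))
]

for line in matrix:
    print("  ".join(f"{d:.2f}" for d in line))
```

The results agree with the matrix on the slide to within rounding. For objects 4 and 1 the computation from the rounded inputs gives about 0.137, where the slide shows 0.13.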
} Examples:
Id  Test_1(nominal)  Test_2(ordinal)  Test_3(numeric)
} Exercise:
Id  Test_1(nominal)  Test_2(ordinal)  Test_3(numeric)
Chapter 3- Data Preprocessing
} Cosine similarity:
D#  team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
D1  5     0      3       0         2       0        0      2    0     0
D2  3     0      2       0         1       1        0      1    0     1
D3  0     7      0       2         1       0        0      3    0     0
D4  0     1      0       0         1       2        2      0    3     0
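Cosine similarity between two term-frequency vectors is their dot product divided by the product of their Euclidean lengths. A minimal sketch for the D1 to D4 vectors in the table follows; the function and variable names are illustrative.

```python
import math

# Term-frequency vectors from the table (team, coach, hockey, ..., season).
docs = {
    "D1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    "D2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    "D3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "D4": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


sim_d1_d2 = cosine_similarity(docs["D1"], docs["D2"])
print(f"sim(D1, D2) = {sim_d1_d2:.2f}")
```

For D1 and D2 the dot product is 25 and the lengths are sqrt(42) and sqrt(17), giving a similarity of about 0.94, so the two documents use the terms in very similar proportions.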
} 3.4-Data Transform.
} 3.1- Data Cleaning
} 3.1.1-Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.
} 3.1.2-Noise
o Binning
o Smoothing by bin means.
o Smoothing by bin boundaries.
o Outlier analysis.
} Outlier Analysis: Fig. 3.3
} Meta data:
} 3.3- Data Reduction
} 3.3.1-Dimensionality reduction
o Discrete Wavelet Transform (DWT).
o Principal Components Analysis (PCA).
} Binning example (3.1.2-Noise):
Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34.
Smoothing by bin means: bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.
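The binning example (bins 4, 8, 15; 21, 21, 24; 25, 28, 34) can be reproduced with a short sketch. It assumes equal-frequency bins of three sorted values, and for smoothing by bin boundaries it assumes that ties go to the lower boundary; the function names are illustrative.

```python
# Sorted values from the binning example.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3  # equal-frequency partitioning: three values per bin

bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an assumption, not stated on the slide).
    """
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


print("bins:      ", bins)
print("means:     ", smooth_by_means(bins))
print("boundaries:", smooth_by_boundaries(bins))
```

Smoothing by means gives 9, 22 and 29 for the three bins, matching the slide. Smoothing by boundaries gives 4, 4, 15; 21, 21, 24; and 25, 25, 34.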
} 3.4- Data Transform
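o Normalisation:
o Min_max normalisation: v' = (v - min) / (max - min)
o Z-Score normalisation: v' = (v - Ā) / σ, where Ā is the attribute mean and σ its standard deviation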
} Example:
3, 4, 4, 5, 9
Ā = 5.00, σ = 2.34
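The z-score example can be checked with Python's standard `statistics` module. The value σ = 2.34 on the slide matches the sample standard deviation (dividing by n - 1, about 2.345); the population standard deviation of the same values would be about 2.10. That reading is an inference from the numbers, not something the slide states.

```python
import statistics

values = [3, 4, 4, 5, 9]

mean = statistics.mean(values)
sigma = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# Z-score normalisation: subtract the mean, divide by the standard deviation.
z_scores = [(v - mean) / sigma for v in values]

print(f"mean = {mean:.2f}, sigma = {sigma:.3f}")
print([f"{z:.2f}" for z in z_scores])
```

After the transformation the values have mean 0, and the largest value, 9, sits about 1.71 standard deviations above the mean.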