
Data Mining
Dr. Mohammadi Zanjireh
Imam Khomeini International University (IKIU)

• Midterms: 4 × 10 = 40
• Exercises: 10
• Project: 20
• Final exam: 30

Data Mining Reference

• Chapter 1 – Introduction
• Chapter 2 – Getting to Know Your Data
• Chapter 3 – Data Preprocessing
• Chapter 4 – Frequent Pattern Mining
• Chapter 5 – Classification
• Chapter 6 – Clustering
Chapter 1 – Introduction

§ This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful tools are badly needed to automatically uncover valuable information from the tremendous amounts of data and to transform such data into organised knowledge.

§ This necessity has led to the birth of data mining. The field is young, dynamic, and promising. Data mining has made, and will continue to make, great strides in our journey from the data age toward the coming information age.

§ “We are living in the information age” is a popular saying; however, we are actually living in the data age.

§ Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.

§ Example 1-1: Search Engine.
§ A data-rich but information-poor situation.
§ Data tombs.

§ What is Data Mining?
Data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.

§ Knowledge discovery from data (KDD):
§ Data Cleaning. (preprocessing step)
§ Data Integration.
§ Data Selection.
§ Data Mining.
§ Pattern Evaluation.
§ Knowledge Presentation.

• What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, transactional data, data streams, multimedia data, and the WWW.

• Data mining tasks can be classified into two categories:
– Descriptive: characterises properties of the data in a target data set.
– Predictive: performs induction on the current data in order to make predictions.
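A toy illustration of the two categories (my own example using scikit-learn; not part of the slides): clustering describes groups already present in the data, while a classifier is induced from labelled data and then predicts labels for new data.

    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # four 2-D points
    y = [0, 0, 1, 1]                       # labels (used only by the classifier)

    # Descriptive: characterise the data set itself (two natural groups).
    print(KMeans(n_clusters=2, n_init=10).fit(X).labels_)

    # Predictive: induce a model from current data, then predict for new data.
    model = DecisionTreeClassifier().fit(X, y)
    print(model.predict([[2, 1], [8, 9]]))  # -> [0 1]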

Exercise 1: What is the difference between a database and a data warehouse?

Exercise 2: Describe a number of data mining applications.

§ Data Cube (Fig. 1.7).
– Drill-down: move to a finer level of detail.
– Roll-up: aggregate to a coarser level of a dimension hierarchy.
– Slice: select a single value on one dimension.
– Dice: select a sub-cube on two or more dimensions.
– OLAP: Online Analytical Processing.
– OLTP: Online Transaction Processing.

§ Which Technologies Are Used?

§ Statistics.
§ Machine Learning (ML):
  § Supervised Learning.
  § Unsupervised Learning.
  § Semi-supervised Learning.
§ Efficiency and Scalability:
– The running time of a data mining algorithm must be predictable, short, and acceptable to applications.

§ Parallel and distributed mining algorithms (a minimal sketch follows below):
– Such algorithms first partition the data into “pieces”.
– Each piece is processed in parallel.
– The patterns from each partition are eventually merged.
– Parallel: on one machine.
– Distributed: across multiple machines.
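The partition/process/merge pattern can be sketched in a few lines of Python (my own toy example, with simple frequency counting standing in for the “mining” step; the slides do not prescribe any code):

    from multiprocessing import Pool
    from collections import Counter

    def mine_partition(piece):
        # "Mine" one piece: here, just count item frequencies.
        return Counter(piece)

    def parallel_mine(data, n_pieces=4):
        # 1) Partition the data into pieces.
        pieces = [data[i::n_pieces] for i in range(n_pieces)]
        # 2) Process each piece in parallel (one machine, several cores).
        with Pool(n_pieces) as pool:
            partials = pool.map(mine_partition, pieces)
        # 3) Merge the patterns found in each partition.
        merged = Counter()
        for p in partials:
            merged += p
        return merged

    if __name__ == "__main__":
        print(parallel_mine(["a", "b", "a", "c", "a", "b"] * 100))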


• Handling noise, errors, exceptions, and outliers:
– Data often contain noise, errors, exceptions, or outliers.
– These may confuse the data mining process, leading to the derivation of erroneous patterns.
– Data cleaning, data preprocessing, and outlier detection and removal are examples of techniques that need to be integrated with the data mining process.

• Privacy-preserving data mining:
– Data mining is useful; however, it poses the risk of disclosing an individual’s personal information.
– We have to observe data sensitivity while performing data mining.

Chapter 2 – Getting to Know Your Data

• Attributes and Observations:

Student_ID  Name    Average
1001        Ali     17.12
1002        Reza    13.89
1003        Maryam  16.02
1004        Hasan   15.45

(Columns are the attributes; rows are the observations.)
• Attribute types:
– Nominal: Subject, Occupation.
– Binary (0/1, T/F): Gender, medical test.
  • Symmetric Binary.
  • Asymmetric Binary.
– Ordinal: Drink_size (small, medium, large).
– Numeric:
  • Interval-scaled.
  • Ratio-scaled.

} Discrete vs. Continuous attributes.

} Mean: x̄ = (1/n) Σ x_i
} Median: the middle value of the ordered data (the mean of the two middle values when n is even).
} Mode: the most frequent value.

} Variance and Standard Deviation:
σ² = (1/n) Σ (x_i − x̄)²,  σ = √σ²

} Data Matrix: an n × p table with one row per object and one column per attribute:
Student_1
⋮
Student_i
⋮
Student_n
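A quick check of these statistics on the Average column above (a sketch using Python’s statistics module; not from the slides):

    import statistics as st

    averages = [17.12, 13.89, 16.02, 15.45]  # the Average column above

    print(st.mean(averages))       # mean: 15.62
    print(st.median(averages))     # median: 15.735 (mean of the two middle values)
    print(st.pvariance(averages))  # population variance
    print(st.pstdev(averages))     # population standard deviation
    # mode is omitted here: every value occurs once, so no value is most frequent.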
} Proximity measures for Binary attributes:
With q = number of attributes equal to 1 for both objects, r = 1 for i only, s = 1 for j only, and t = 0 for both:
– Symmetric binary: d(i, j) = (r + s)/(q + r + s + t).
– Asymmetric binary: d(i, j) = (r + s)/(q + r + s); the negative matches t are ignored. The corresponding similarity, sim(i, j) = q/(q + r + s), is the Jaccard coefficient.
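A minimal sketch of both coefficients (variable names are mine, not from the slides):

    def binary_dissimilarity(x, y, symmetric=True):
        # x, y: equal-length lists of 0/1 values.
        q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
        s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
        t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
        if symmetric:
            return (r + s) / (q + r + s + t)
        return (r + s) / (q + r + s)  # asymmetric: negative matches t ignored

    # e.g. two patients' test results (1 = positive), a made-up example:
    print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0],
                               symmetric=False))  # -> 0.67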

} Proximity measures for Nominal attributes:
d(i, j) = (p − m)/p, where p is the number of attributes and m is the number of attributes on which i and j match.

} Example:
Id  Subject      Birth_City  Living_City  Eye_Colour
1   Computer     Teh         Teh          Black
2   Electricity  Teh         Kar          Brown
3   Mechanic     Qaz         Qaz          Brown
4   Computer     Kar         Qaz          Green

} Dissimilarity Matrix (e.g. d(2, 1) = (4 − 1)/4 = 0.75, since only Birth_City matches):

0.00
0.75  0.00
1.00  0.75  0.00
0.75  1.00  0.75  0.00

} Similarity Matrix: Sim(i, j) = 1 − d(i, j)

} Euclidean Distance: d(i, j) = √( Σ_k (x_ik − x_jk)² )
} Manhattan Distance: d(i, j) = Σ_k |x_ik − x_jk|
} Supremum Distance: d(i, j) = max_k |x_ik − x_jk|
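A small sketch (my own, not from the slides) that reproduces the dissimilarity matrix above and evaluates the three distances on a pair of toy vectors:

    # Nominal dissimilarity: d(i, j) = (p - m) / p.
    records = [
        ("Computer",    "Teh", "Teh", "Black"),
        ("Electricity", "Teh", "Kar", "Brown"),
        ("Mechanic",    "Qaz", "Qaz", "Brown"),
        ("Computer",    "Kar", "Qaz", "Green"),
    ]

    def nominal_d(a, b):
        m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
        return (len(a) - m) / len(a)

    for i in range(len(records)):
        print(["%.2f" % nominal_d(records[i], records[j]) for j in range(i + 1)])
    # -> 0.00 / 0.75 0.00 / 1.00 0.75 0.00 / 0.75 1.00 0.75 0.00

    # The three numeric distances on a pair of toy vectors:
    x, y = [1.0, 3.0, 2.0], [4.0, 1.0, 2.0]
    print(sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5)  # Euclidean: 3.61
    print(sum(abs(a - b) for a, b in zip(x, y)))           # Manhattan: 5.0
    print(max(abs(a - b) for a, b in zip(x, y)))           # supremum: 3.0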
} Proximity measures for Ordinal attributes:
Replace each value by its rank r ∈ {1, …, M}, then map it onto [0, 1] via z = (r − 1)/(M − 1) and treat the result as numeric.

} Degree = {Diploma, Undergraduate, Master, PhD}
} Drink_size = {Small, Medium, Large}
        Small  Medium  Large
rank:   1      2       3

} Example:
Id  Degree         Drink_size
1   Undergraduate  Medium
2   PhD            Small
3   Diploma        Medium
4   Undergraduate  Large
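A minimal sketch of the rank mapping (helper names are mine):

    degree_levels = ["Diploma", "Undergraduate", "Master", "PhD"]
    drink_levels  = ["Small", "Medium", "Large"]

    def ordinal_to_unit(value, levels):
        # z = (r - 1) / (M - 1), with rank r starting at 1.
        r = levels.index(value) + 1
        return (r - 1) / (len(levels) - 1)

    rows = [("Undergraduate", "Medium"), ("PhD", "Small"),
            ("Diploma", "Medium"), ("Undergraduate", "Large")]
    for deg, size in rows:
        print(round(ordinal_to_unit(deg, degree_levels), 2),
              round(ordinal_to_unit(size, drink_levels), 2))
    # -> 0.33 0.5 / 1.0 0.0 / 0.0 0.5 / 0.33 1.0  (the table below)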

} The same example with ranks mapped onto [0, 1]:
Id  Degree  Drink_size
1   0.33    0.50
2   1.00    0.00
3   0.00    0.50
4   0.33    1.00

• Normalising:
(Grade − min)/(max − min)

Id  Grade  Grade − min  Normalised
1   30     10           0.14
2   52     32           0.45
3   84     64           0.90
4   45     25           0.35
5   25     5            0.07
6   20     0            0.00
7   91     71           1.00
8   65     45           0.63
9   42     22           0.31
10  32     12           0.17
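The same min-max computation on the Grade column, as a quick check (a sketch, not course code):

    grades = [30, 52, 84, 45, 25, 20, 91, 65, 42, 32]
    lo, hi = min(grades), max(grades)   # 20 and 91
    normalised = [round((g - lo) / (hi - lo), 2) for g in grades]
    print(normalised)
    # -> [0.14, 0.45, 0.9, 0.35, 0.07, 0.0, 1.0, 0.63, 0.31, 0.17]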
} Proximity measures for mixed types:
d(i, j) = Σ_f δ_ij(f) · d_ij(f) / Σ_f δ_ij(f),
where each attribute f contributes a dissimilarity d_ij(f) computed by the rule for its own type, and the indicator δ_ij(f) is 0 when the value is missing and 1 otherwise.

} Example:
Id  Test_1 (nominal)  Test_2 (ordinal)  Test_3 (numeric)
1   Code A            Excellent         45
2   Code B            Fair              22
3   Code C            Good              64
4   Code A            Excellent         28

} The same data after conversion (ordinal ranks mapped onto [0, 1]; numeric values min-max normalised):
Id  Test_1 (nominal)  Test_2 (ordinal)  Test_3 (numeric)
1   Code A            1                 0.55
2   Code B            0                 0.00
3   Code C            0.5               1.00
4   Code A            1                 0.14

Dissimilarity Matrix =
0.00
0.85  0.00
0.65  0.83  0.00
0.13  0.71  0.79  0.00
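A sketch (my own) that reproduces this matrix from the converted table. Its last row starts with 0.14 rather than 0.13 because the inputs here are pre-rounded; the slide’s 0.13 follows from the unrounded values:

    rows = [  # (nominal, ordinal in [0,1], numeric in [0,1])
        ("Code A", 1.0, 0.55),
        ("Code B", 0.0, 0.00),
        ("Code C", 0.5, 1.00),
        ("Code A", 1.0, 0.14),
    ]

    def mixed_d(a, b):
        per_attr = [
            0.0 if a[0] == b[0] else 1.0,  # nominal: match / mismatch
            abs(a[1] - b[1]),              # ordinal (already rank-normalised)
            abs(a[2] - b[2]),              # numeric (already min-max normalised)
        ]
        return sum(per_attr) / len(per_attr)  # every delta = 1: no missing values

    for i in range(len(rows)):
        print(["%.2f" % mixed_d(rows[i], rows[j]) for j in range(i + 1)])
    # -> 0.00 / 0.85 0.00 / 0.65 0.83 0.00 / 0.14 0.71 0.79 0.00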

} Exercise: compute the dissimilarity matrix for the same data when some values are missing (set δ = 0 for each missing value, marked “----”):

Id  Test_1 (nominal)  Test_2 (ordinal)  Test_3 (numeric)
1   Code A            Excellent         ----
2   ----              Fair              22
3   Code C            ----              64
4   Code A            Excellent         28
} Cosine similarity:
sim(d_i, d_j) = (d_i · d_j) / (‖d_i‖ ‖d_j‖)

D#  team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
D1  5     0      3       0         2       0        0      2    0     0
D2  3     0      2       0         1       1        0      1    0     1
D3  0     7      0       2         1       0        0      3    0     0
D4  0     1      0       0         1       2        2      0    3     0

Similarity Matrix =
1.00
0.94  1.00
0.16  0.12  1.00
0.07  0.17  0.23  1.00
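A small sketch reproducing these values (my own code, not from the slides):

    import math

    docs = [
        [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],  # D1
        [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],  # D2
        [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],  # D3
        [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],  # D4
    ]

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norm = (math.sqrt(sum(a * a for a in x)) *
                math.sqrt(sum(b * b for b in y)))
        return dot / norm

    for i in range(len(docs)):
        print(["%.2f" % cosine(docs[i], docs[j]) for j in range(i + 1)])
    # -> 1.00 / 0.94 1.00 / 0.16 0.12 1.00 / 0.07 0.17 0.23 1.00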

Chapter 3 – Data Preprocessing

} 3.1 Data Cleaning.
} 3.2 Data Integration.
} 3.3 Data Reduction.
} 3.4 Data Transformation.
} 3.1 Data Cleaning

} 3.1.1 Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.

} 3.1.2 Noise
o Binning:
  o Smoothing by bin means.
  o Smoothing by bin boundaries.
o Outlier analysis.

} Example: 4, 8, 15, 21, 21, 24, 25, 28, 34 (sorted)

Partition into equal-frequency bins:
bin1: 4, 8, 15.   bin2: 21, 21, 24.   bin3: 25, 28, 34.

Smoothing by bin means:
bin1: 9, 9, 9.   bin2: 22, 22, 22.   bin3: 29, 29, 29.

Smoothing by bin boundaries:
bin1: 4, 4, 15.   bin2: 21, 21, 24.   bin3: 25, 25, 34.
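A sketch of both smoothing rules (my own helper code; equal-frequency bins of size 3, as in the example):

    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = [data[i:i + 3] for i in range(0, len(data), 3)]

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value snaps to the nearer boundary.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]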
} Outlier Analysis: Fig. 3.3
} Metadata:

} 3.2 Data Integration:

} 3.3 Data Reduction
} 3.3.1 Dimensionality reduction
o Discrete Wavelet Transform (DWT).
o Principal Component Analysis (PCA).
} 3.3.2 Numerosity reduction
o Sampling.
o Clustering.
} 3.4 Data Transformation
q Normalisation:
q Min-max normalisation: v′ = (v − min)/(max − min)
q Z-score normalisation: v′ = (v − x̄)/σ

} Example:
3, 4, 4, 5, 9
x̄ = 5.00, σ ≈ 2.10

Min-max: 0, 0.17, 0.17, 0.33, 1.00

Z-score: -0.95, -0.48, -0.48, 0, 1.91
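A quick check of both normalisations (a sketch using Python’s statistics module; pstdev is the population standard deviation, which is what the z-scores above use):

    import statistics as st

    data = [3, 4, 4, 5, 9]
    mean = st.mean(data)     # 5.0
    sigma = st.pstdev(data)  # population std dev ~ 2.10

    lo, hi = min(data), max(data)
    print([round((v - lo) / (hi - lo), 2) for v in data])
    # -> [0.0, 0.17, 0.17, 0.33, 1.0]
    print([round((v - mean) / sigma, 2) for v in data])
    # -> [-0.95, -0.48, -0.48, 0.0, 1.91]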
