CSD479 Data Mining: Lecture # 2 Chapter # 1 of The Text Book
Data Mining
Lecture # 2
Data Preprocessing
Handling Missing and Noisy Data (Data Cleaning).
Techniques we will cover.
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.
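As a minimal sketch of the first technique in the list, the snippet below fills the missing entries of one numeric attribute with its mean, median or mode. The `values` list and the `impute` helper are hypothetical names introduced only for illustration; they do not come from the lecture.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    # Pick the summary statistic that will stand in for missing entries.
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with one missing value (None).
values = [4, None, 6, 4, 10]

mean_filled = impute(values, "mean")
median_filled = impute(values, "median")
mode_filled = impute(values, "mode")
```

The mean is sensitive to extreme values, so the median or mode is often the safer choice for skewed or categorical attributes.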
Itemset                 Support
{Butter}                4
{Bread}                 3
{Egg}                   2
{Bread, Butter}         3
{Bread, Butter, Egg}    2
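The support of an itemset is the number of transactions that contain all of its items. The slide does not show the underlying transactions, so the four transactions below are hypothetical, chosen only so that the counts agree with the table; the `support` helper is likewise an illustrative name.

```python
# Hypothetical transactions, chosen only so the counts match the table.
transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter"},
    {"Butter"},
]

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

support_butter = support({"Butter"}, transactions)
support_bread = support({"Bread"}, transactions)
support_egg = support({"Egg"}, transactions)
support_bread_butter = support({"Bread", "Butter"}, transactions)
support_all = support({"Bread", "Butter", "Egg"}, transactions)
```

Note that support can only stay the same or shrink as items are added to an itemset, which is the property that frequent itemset mining algorithms exploit for pruning.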
Data Mining Functionalities (2)
Association Rule Mining
Topics we will cover
Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
Fault-Tolerant/Approximate Frequent Itemset Mining.
N-Most Interesting Frequent Itemset Mining.
Closed and Maximal Frequent Itemset Mining.
Incremental Frequent Itemset Mining
Sequential Patterns.
Projects
• Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
• Application of Fault-Tolerant Frequent Patterns to Missing Values
Imputation (Course Project).
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or
concepts for future prediction.
Example: Classify cities as rainy or non-rainy based on the Temperature,
Humidity and Windy attributes.
Requires training records whose outcomes (previous business decisions)
are already known (Supervised Learning).
City         Temperature   Humidity   Windy   Rain
Lahore       hot           low        false   No
Islamabad    hot           high       true    Yes
Islamabad    hot           high       false   Yes
Multan       mild          low        false   No
Karachi      cool          normal     false   No
Rawalpindi   hot           high       true    Yes
Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.
Prediction for unknown records
City         Temperature   Humidity   Windy   Rain
Muree        hot           high       false   ?
Sibi         mild          low        true    ?
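The rule learned from the labeled records can be written as a small function and checked against the training data before it is applied to the unknown records. This is only a sketch: the `predict_rain` name is hypothetical, and the `default` label returned when the rule does not fire is an assumption, since the slide states only the Yes case.

```python
# Labeled records from the slide: (city, temperature, humidity, windy, rain).
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

def predict_rain(temperature, humidity, windy, default="No"):
    """Apply the rule: Temperature = hot and Humidity = high -> Rain = Yes.

    `default` covers records the rule does not match; this is an assumption,
    not something the slide specifies. `windy` is unused by this rule.
    """
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return default

# Fraction of labeled records on which the rule agrees with the known outcome.
correct = sum(predict_rain(t, h, w) == rain for _, t, h, w, rain in training)
accuracy = correct / len(training)

# Unknown records from the slide.
muree = predict_rain("hot", "high", False)
sibi = predict_rain("mild", "low", True)
```

Checking the rule on the labeled records first is what makes this supervised: the known outcomes are used to judge the model before it predicts anything new.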
Data Mining Functionalities (2)
Cluster Analysis
Group data to form new classes from unlabeled data.
Business decisions are unknown (also called Unsupervised Learning).
Example: Group cities into clusters based on the Temperature,
Humidity and Windy attributes, without knowing the Rain label.
City         Temperature   Humidity   Windy   Rain
Lahore       hot           low        false   ?
Islamabad    hot           high       true    ?
Islamabad    hot           high       false   ?
Multan       mild          low        false   ?
Karachi      cool          normal     false   ?
Rawalpindi   hot           high       true    ?
3 clusters
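One way to group these unlabeled records is bottom-up (agglomerative) clustering with a distance that counts differing attributes. The slide does not say which algorithm or distance produces its 3 clusters, so the sketch below is an assumption throughout: the Hamming distance, the single-linkage merge rule and the `Islamabad-1`/`Islamabad-2` keys (used only to tell the two Islamabad rows apart) are all illustrative choices.

```python
from itertools import combinations

# Unlabeled records from the slide: (temperature, humidity, windy).
records = {
    "Lahore": ("hot", "low", False),
    "Islamabad-1": ("hot", "high", True),
    "Islamabad-2": ("hot", "high", False),
    "Multan": ("mild", "low", False),
    "Karachi": ("cool", "normal", False),
    "Rawalpindi": ("hot", "high", True),
}

def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(hamming(records[a], records[b]) for a in c1 for b in c2)

def agglomerate(names, k):
    """Start with one cluster per record and merge the closest pair until k remain."""
    clusters = [[name] for name in names]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

clusters = agglomerate(list(records), k=3)
```

Several pairs of records are equally far apart here, so the exact grouping depends on how ties are broken; a different linkage or distance may yield different clusters.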
Data Mining Functionalities (3)
Outlier Analysis
Outlier: A data object that does not comply with the general behavior
of the data.
It can be treated as noise or as an exception, but it is quite useful in
fraud detection and rare-event analysis.
City         Temperature   Humidity   Windy   Rain
Lahore       hot           low        false   ?
Islamabad    hot           high       true    ?
Islamabad    hot           high       false   ?
Multan       mild          low        false   ?
Karachi      cool          normal     false   ?
Rawalpindi   hot           high       true    ?
2 outliers
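A simple distance-based view of outliers scores each record by how different it is from all the others and flags the highest scores. The slide does not say which two records it counts as outliers or how they were found, so the scoring below (total Hamming distance to every other record) is an assumed, illustrative choice.

```python
# Records from the slide: (city, (temperature, humidity, windy)).
records = [
    ("Lahore", ("hot", "low", False)),
    ("Islamabad", ("hot", "high", True)),
    ("Islamabad", ("hot", "high", False)),
    ("Multan", ("mild", "low", False)),
    ("Karachi", ("cool", "normal", False)),
    ("Rawalpindi", ("hot", "high", True)),
]

def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def outlier_scores(records):
    """Score each record by its total distance to all other records."""
    scores = []
    for i, (name, attrs) in enumerate(records):
        total = sum(
            hamming(attrs, other)
            for j, (_, other) in enumerate(records)
            if j != i
        )
        scores.append((name, total))
    return scores

# Flag the two records that differ most from the rest.
outliers = sorted(outlier_scores(records), key=lambda s: s[1], reverse=True)[:2]
outlier_names = [name for name, _ in outliers]
```

Under this scoring the two flagged cities are Karachi and Multan, the only records whose temperature is not hot.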
Are All the “Discovered” Patterns
Interesting?
A data mining system/query may generate thousands of
patterns; not all of them are interesting.
Suggested approach: query-based, constraint-based
mining.
Interestingness Measures: A pattern is interesting if
it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm
Can We Find All and Only Interesting
Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Remember that many problems in Data Mining are NP-hard.
Finding a guaranteed globally best solution is therefore generally infeasible for large data sets.
Search for only interesting patterns: Optimization
Can a data mining system find only the interesting patterns?
Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns: constraint-based mining (supply
threshold factors during mining).
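The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "support at least a given threshold". This is a sketch under assumptions: the transactions are hypothetical, both function names are illustrative, and the second function uses level-wise pruning in the spirit of Apriori rather than reproducing any specific algorithm from the course.

```python
from itertools import combinations

# Hypothetical transactions used only for illustration.
transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter"},
    {"Butter"},
]

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def generate_then_filter(transactions, min_support):
    """Approach 1: enumerate every itemset, then drop the infrequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(candidate, transactions)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result

def constrained_mining(transactions, min_support):
    """Approach 2: push the threshold into the search and extend only frequent itemsets."""
    items = sorted(set().union(*transactions))
    result = {}
    frontier = [frozenset([item]) for item in items]
    while frontier:
        next_size = len(frontier[0]) + 1
        frequent = []
        for candidate in frontier:
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
                frequent.append(candidate)
        # Build larger candidates only from itemsets that met the threshold.
        frontier = list(
            {a | b for a in frequent for b in frequent if len(a | b) == next_size}
        )
    return result

all_then_filter = generate_then_filter(transactions, min_support=3)
constrained = constrained_mining(transactions, min_support=3)
```

Both functions return the same itemsets, but the first examines every possible combination while the second never builds a candidate from an itemset that already failed the threshold.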
Reading Assignment
Book Chapter
Chapter 1 of “Data Mining: Concepts and Techniques”
by Jiawei Han and Micheline Kamber.
Data Mining: Where?
Some Nice Resources
ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.