
CSD479

Data Mining
Lecture # 2

Chapter # 1 of the Text Book


Data Mining Functionalities (1)

 Data Preprocessing
 Handling Missing and Noisy Data (Data Cleaning).
 Techniques we will cover.
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.

TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
(? = missing value)
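
As a minimal sketch of the first technique listed above, the snippet below fills gaps with the mean, median, or mode of the known values. The Refund column is taken from the table; the numeric column is the Taxable Income column with row 5 blanked purely for illustration, and the helper name impute is an assumption of this sketch.

```python
from statistics import mean, median, mode

# Refund column from the table above; None marks a missing value.
refund = ["Yes", None, "No", None, "No"]

# Taxable Income in thousands, with row 5 blanked for illustration only
# (the slide's table has no missing income values).
income_k = [125, 100, 70, 120, None]


def impute(values, strategy):
    """Replace each None with a statistic computed from the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


refund_filled = impute(refund, mode)      # most frequent known value
income_mean = impute(income_k, mean)      # average of the known values
income_median = impute(income_k, median)  # middle of the known values
```

Mode suits categorical attributes such as Refund, while mean and median apply to numeric attributes; the median is less sensitive to extreme values.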
Data Mining Functionalities (1)
 Data Preprocessing
 Data Transformation (Discretization and Normalization).
 With the help of data transformation, rules become more general and
compact.
 General and compact rules increase the accuracy of classification.
Age  Age (discretized)
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young

Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)

Before discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08
then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09
then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10
then Buy_Computer = No.

After discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child
then Buy_Computer = No.
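
The discretization on this slide can be sketched as a simple range lookup. The age ranges are the ones given on the slide; the function and variable names are assumptions made for this example.

```python
# Age ranges from the slide: (label, low, high), both bounds inclusive.
AGE_BINS = [("Child", 0, 20), ("Young", 21, 47), ("Old", 48, 120)]


def discretize_age(age):
    """Map a numeric age to its category label."""
    for label, low, high in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside every bin")


ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]

# Ages 8, 9 and 10 all map to the same label, which is why the three
# rules on the slide collapse into a single, more general rule.
rule_ages = {discretize_age(a) for a in (8, 9, 10)}
```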
Data Mining Functionalities (1)
 Data Preprocessing
 Attribute Selection/Feature Selection
• Selection of the attributes that are most relevant to the data mining
task.
• Advantage 1: Decreases the processing time of the mining task.
• Advantage 2: Generalizes the rules.
 Example
• If our mining goal is to find which countries have more Cheat
cases, and at which Taxable Income,
• then obviously the Date attribute will not be an important factor
in our mining task.

Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/200   No      Australia  120K            Yes
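
A minimal sketch of the idea, assuming the mining goal stated above: keep only the attributes relevant to the task, then count Cheat cases per country and income. The names SELECTED and select_attributes are hypothetical, and the last date is copied exactly as it appears on the slide.

```python
from collections import Counter

rows = [
    {"Date": "11/02/2002", "Refund": "Yes", "Country": "USA",
     "Taxable Income": "125K", "Cheat": "No"},
    {"Date": "13/02/2002", "Refund": "Yes", "Country": "UK",
     "Taxable Income": "100K", "Cheat": "No"},
    {"Date": "16/02/2002", "Refund": "No", "Country": "Australia",
     "Taxable Income": "120K", "Cheat": "Yes"},
    {"Date": "21/03/200", "Refund": "No", "Country": "Australia",
     "Taxable Income": "120K", "Cheat": "Yes"},  # date as printed on the slide
]

# Attributes judged relevant to the goal; Date and Refund are dropped.
SELECTED = ("Country", "Taxable Income", "Cheat")


def select_attributes(records, keep):
    """Project every record onto the chosen attributes."""
    return [{k: r[k] for k in keep} for r in records]


projected = select_attributes(rows, SELECTED)

# Count Cheat = Yes cases per (Country, Taxable Income) pair.
cheat_counts = Counter(
    (r["Country"], r["Taxable Income"]) for r in projected if r["Cheat"] == "Yes"
)
```

With Date removed, the two Australian records become identical, so they support the same rule instead of two date-specific ones.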
Data Mining Functionalities (1)
 Data Preprocessing
 We will cover the following Attribute/Feature
Selection Technique
• Principal Component Analysis
Data Mining Functionalities (2)
 Association Rule Mining
 In the Association Rule Mining framework we have to find all the
rules in a transactional/relational dataset whose support
(frequency) is at least some minimum support (min_sup)
threshold (provided by the user).

 For example, with min_sup = 50% (at least 2 of the 4 transactions):


Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
5000            Butter, Ice cream, Cake

Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
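
The support counts above can be reproduced with a brute-force sketch that checks every possible itemset against every transaction. This is not one of the algorithms covered later (Apriori, FP-Growth); it is only an illustration of the definition, and the function name is an assumption.

```python
from itertools import combinations

# Transactions from the slide, keyed by Transaction ID.
transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}
MIN_SUP = 0.5  # 50% of the transactions


def frequent_itemsets(db, min_sup):
    """Return {itemset: support count} for every itemset meeting min_sup."""
    min_count = min_sup * len(db)
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing the whole itemset.
            count = sum(1 for t in db.values() if set(candidate) <= t)
            if count >= min_count:
                result[frozenset(candidate)] = count
    return result


frequent = frequent_itemsets(transactions, MIN_SUP)
```

Besides the five itemsets listed in the table, exhaustive enumeration also reports {Bread, Egg} and {Butter, Egg}, each with support 2.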
Data Mining Functionalities (2)
 Association Rule Mining
 Topics we will cover
 Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
 Fault-Tolerant/Approximate Frequent Itemset Mining.
 N-Most Interesting Frequent Itemset Mining.
 Closed and Maximal Frequent Itemset Mining.
 Incremental Frequent Itemset Mining.
 Sequential Patterns.
 Projects
• Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
• One application of Fault-Tolerant Frequent Patterns is Missing Value
Imputation (Course Project).
Data Mining Functionalities (2)
 Classification and Prediction
 Finding models (functions) that describe and distinguish classes or
concepts for future prediction.
 Example: Classify rainy/non-rainy cities based on the Temperature,
Humidity and Windy attributes.
 The previous business decisions (class labels) must be known
(Supervised Learning).
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes

Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.

Prediction of unknown records
City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?
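
A small sketch of how the single rule on this slide would be applied to the unknown records. The slide gives only one rule, so this example returns None when that rule does not fire rather than guessing a default class; the function name predict_rain is an assumption.

```python
# Labeled records from the slide: (City, Temperature, Humidity, Windy, Rain).
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]


def predict_rain(temperature, humidity):
    """Apply the slide's single rule; return None when it does not fire."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return None


# Check the rule against the labeled records it covers.
covered = [r for r in training if predict_rain(r[1], r[2]) is not None]
rule_accuracy = sum(predict_rain(r[1], r[2]) == r[4] for r in covered) / len(covered)

# Unknown records from the slide: (Temperature, Humidity, Windy).
unknown = {"Muree": ("hot", "high", False), "Sibi": ("mild", "low", True)}
predictions = {city: predict_rain(t, h) for city, (t, h, _windy) in unknown.items()}
```

The rule covers Muree and predicts Rain = Yes; Sibi is not covered, so a complete classifier would need further rules learned from the labeled data.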
Data Mining Functionalities (2)
 Cluster Analysis
 Group data to form new classes from unlabeled data.
 Business decisions (class labels) are unknown (also called Unsupervised Learning).
 Example: Group cities based on the Temperature,
Humidity and Windy attributes.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

3 clusters
Data Mining Functionalities (3)
 Outlier Analysis
 Outlier: A data object that does not comply with the general behavior
of the data.
 It can be considered noise or an exception, but is quite useful in fraud
detection and rare events analysis.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

2 outliers
Are All the “Discovered” Patterns
Interesting?
 A data mining system/query may generate thousands of
patterns; not all of them are interesting.
 Suggested approach: Query-based, Constraint-based
mining.
 Interestingness Measures: A pattern is interesting if
it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm.
Can We Find All and Only Interesting
Patterns?
 Find all the interesting patterns: Completeness
 Can a data mining system find all the interesting patterns?
 Remember most of the problems in Data Mining are NP-Complete.
 There is no global best solution for any single problem.
 Search for only interesting patterns: Optimization
 Can a data mining system find only the interesting patterns?
 Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns: Constraint-based mining (give
threshold factors in mining).
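
The two approaches can be contrasted on the small transaction set from the Association Rule Mining slide. This is a sketch under assumptions: "interesting" is taken to mean a support count of at least 2, and the level-wise function is a simplified illustration of pushing a threshold into the search, not a full Apriori implementation.

```python
from itertools import combinations

DB = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Tea"},
    {"Butter", "Ice cream", "Cake"},
]
MIN_COUNT = 2  # assumed constraint: support count of at least 2
ITEMS = sorted(set().union(*DB))


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in DB if itemset <= t)


def generate_then_filter():
    """Approach 1: enumerate every itemset, then drop the infrequent ones."""
    every = [frozenset(c) for n in range(1, len(ITEMS) + 1)
             for c in combinations(ITEMS, n)]
    return {s for s in every if support(s) >= MIN_COUNT}, len(every)


def constraint_based():
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    level = {frozenset([i]) for i in ITEMS}
    found, examined = set(), 0
    while level:
        examined += len(level)
        frequent = {s for s in level if support(s) >= MIN_COUNT}
        found |= frequent
        # Only frequent itemsets are extended by one more item.
        level = {s | {i} for s in frequent for i in ITEMS if i not in s}
    return found, examined


filtered, total_candidates = generate_then_filter()
constrained, examined_candidates = constraint_based()
```

Both functions return the same set of patterns, but the second examines fewer candidates because the threshold is applied while patterns are being generated.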
Reading Assignment
 Book Chapter
 Chapter 1 of “Data Mining: Concepts and Techniques”
by Jiawei Han and Micheline Kamber.
Data Mining ------- Where?
 Some Nice Resources
 ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.
 Knowledge Discovery Nuggets www.kdnuggests.com.
 IEEE Transactions on Knowledge and Data Engineering –
http://www.computer.org/tkde/.
 IEEE Transactions on Pattern Analysis and Machine Intelligence –
http://www.computer.org/tpami/.
 Data Mining and Knowledge Discovery - Publisher: Springer
Science+Business Media B.V., formerly Kluwer Academic
Publishers B.V. http://www.kluweronline.com/issn/1384-5810/.
 Current and previous offerings of Data Mining courses at
Stanford, CMU, MIT and Helsinki.
