CSD479 Data Mining: Lecture # 2 Chapter # 1 of The Text Book

Data Mining Functionalities (1)
 Data Preprocessing
 Handling Missing and Noisy Data (Data Cleaning).
 Techniques we will cover.
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
(? = missing value)
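
The simplest of the listed techniques (mean, median and mode imputation) can be sketched as below. This is only an illustration under stated assumptions: the function name `impute` is made up for this example, missing values are represented as `None`, and the income column with a gap is hypothetical (in the table above it is Refund and Country that are missing).

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Return a copy of `values` with every None replaced by one statistic
    (mean, median or mode) computed from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Refund column from the table above (TIDs 2 and 4 are missing).
refund = ["Yes", None, "No", None, "No"]
refund_filled = impute(refund, "mode")   # categorical attribute: use the mode

# Hypothetical numeric column (in thousands) with one gap, for mean/median.
income_k = [125, 100, 70, None, 95]
income_mean = impute(income_k, "mean")
income_median = impute(income_k, "median")

print(refund_filled)
print(income_mean)
print(income_median)
```

Mean and median only apply to numeric attributes; for a categorical attribute such as Refund, the mode is the usual choice.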
 Data Transformation (Discretization and Normalization).
 With the help of data transformation, rules become more general and
compact.
 General and compact rules tend to increase the accuracy of
classification.
Age  Age (discretized)
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young

Child = (0 to 20)
Young = (21 to 47)
Old   = (48 to 120)
Before discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08
   then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09
   then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10
   then Buy_Computer = No.

After discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child
   then Buy_Computer = No.
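
A minimal sketch of the discretization step, using the three age ranges given on the slide. The names `AGE_BINS` and `discretize_age` are illustrative, and the rule tuples simply mirror the placeholder attributes (value1, value2) used above.

```python
# Age ranges from the slide: inclusive lower and upper bounds.
AGE_BINS = [(0, 20, "Child"), (21, 47, "Young"), (48, 120, "Old")]

def discretize_age(age):
    """Map a numeric age to its interval label."""
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside every bin")

ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]

# Three rules that differ only in the exact Age value (8, 9, 10) ...
raw_rules = [("value1", "value2", age, "No") for age in (8, 9, 10)]
# ... collapse into a single rule once Age is replaced by its label.
general_rules = {(a1, a2, discretize_age(age), buy)
                 for a1, a2, age, buy in raw_rules}

print(labels)
print(general_rules)
```

Collecting the rewritten rules in a set shows the compaction directly: three specific rules become one general rule.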
 Attribute Selection/Feature Selection
• Selection of those attributes which are most relevant to the data
mining task.
• Advantage 1: Decreases the processing time of the mining task.
• Advantage 2: Generalizes the rules.
 Example
• Suppose our mining goal is to find which countries have more Cheat
cases, and at which Taxable Income.
• Then obviously the Date attribute will not be an important factor
in our mining task.

Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/200   No      Australia  120K            Yes
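
As a rough sketch of what dropping an irrelevant attribute does, the snippet below keeps only the attributes related to the mining goal and counts Cheat cases per country and income. The helper name `select_attributes` is made up for this example, and the last Date value is copied exactly as it appears on the slide.

```python
from collections import Counter

rows = [
    {"Date": "11/02/2002", "Refund": "Yes", "Country": "USA",
     "Taxable Income": "125K", "Cheat": "No"},
    {"Date": "13/02/2002", "Refund": "Yes", "Country": "UK",
     "Taxable Income": "100K", "Cheat": "No"},
    {"Date": "16/02/2002", "Refund": "No", "Country": "Australia",
     "Taxable Income": "120K", "Cheat": "Yes"},
    {"Date": "21/03/200", "Refund": "No", "Country": "Australia",
     "Taxable Income": "120K", "Cheat": "Yes"},
]

def select_attributes(rows, keep):
    """Project every row onto the attributes listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]

selected = select_attributes(rows, ["Country", "Taxable Income", "Cheat"])

# Count Cheat = Yes per (Country, Taxable Income) pair.
cheats = Counter((r["Country"], r["Taxable Income"])
                 for r in selected if r["Cheat"] == "Yes")

# Without Date, the two Australian rows become identical patterns.
distinct_before = len({tuple(r.items()) for r in rows})
distinct_after = len({tuple(r.items()) for r in selected})

print(cheats)
print(distinct_before, distinct_after)
```

With Date removed, the number of distinct rows drops from 4 to 3, which is the sense in which attribute selection generalizes the rules.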
 We will cover the following Attribute/Feature Selection Technique
• Principal Component Analysis
 Association Rule Mining
 In the Association Rule Mining framework we have to find all the
rules in a transactional/relational dataset whose support
(frequency) is at least some minimum support (min_sup)
threshold (provided by the user).
 For example, with min_sup = 50% (at least 2 of the 4 transactions).

Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
5000            Butter, Ice cream, Cake

Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
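
The support counts above can be reproduced with a brute-force sketch that tests every possible itemset against the four transactions. This is not Apriori or FP-Growth, just the definition applied directly; the function names are illustrative.

```python
from itertools import combinations

transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)

def frequent_itemsets(transactions, min_sup):
    """Return {itemset: support} for every itemset whose support is at
    least min_sup (a fraction of the number of transactions)."""
    min_count = min_sup * len(transactions)
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = support(candidate, transactions)
            if count >= min_count:
                result[frozenset(candidate)] = count
    return result

frequent = frequent_itemsets(transactions, 0.5)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Besides the five itemsets in the table, the sketch also reports {Bread, Egg} and {Butter, Egg}, each with support 2, since both occur in transactions 2000 and 1000.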
 Association Rule Mining
 Topics we will cover
 Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
 Fault-Tolerant/Approximate Frequent Itemset Mining.
 N-Most Interesting Frequent Itemset Mining.
 Closed and Maximal Frequent Itemset Mining.
 Incremental Frequent Itemset Mining
 Sequential Patterns.
 Projects
• Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
• An application of Fault-Tolerant Frequent Patterns is Missing values
Imputation (Course Project).
 Classification and Prediction
 Finding models (functions) that describe and distinguish classes or
concepts for future prediction.
 Example: Classify rainy/non-rainy cities based on the Temperature,
Humidity and Windy attributes.
 The previous business decisions (class labels) must already be known
(Supervised Learning).
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes

Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.

Prediction of unknown record
City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?
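
A small sketch of how the rule on this slide would be applied to the two unknown records. The slide gives only one rule, so the fallback label "No" for records the rule does not cover is an assumption made here (it matches every non-matching training row); the function name `predict_rain` is illustrative.

```python
# Labelled training data from the slide: (city, temperature, humidity, windy, rain).
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

def predict_rain(temperature, humidity, windy, default="No"):
    """Rule from the slide: hot and high humidity implies rain.
    `default` is an assumed label for records the rule does not cover."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return default

# Check the rule against the known labels (supervised setting).
correct = sum(predict_rain(t, h, w) == rain for _, t, h, w, rain in training)
accuracy = correct / len(training)

# Unknown records to classify.
unknown = {"Muree": ("hot", "high", False), "Sibi": ("mild", "low", True)}
predictions = {city: predict_rain(*attrs) for city, attrs in unknown.items()}

print(accuracy)
print(predictions)
```

The rule ignores the Windy attribute, which is consistent with the training table: the two Islamabad rows differ only in Windy and share the same label.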
 Cluster Analysis
 Group data to form new classes from data whose class labels are
unknown.
 Business decisions are unknown (also called Unsupervised Learning).
 Example: Group cities based on the Temperature, Humidity and Windy
attributes.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

3 clusters
 Outlier Analysis
 Outlier: A data object that does not comply with the general behavior
of the data.
 It can be considered noise or an exception, but is quite useful in
fraud detection and rare events analysis.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

2 outliers
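
One possible way to make "does not comply with the general behavior" concrete is a distance-based check: a record is flagged when too few other records are similar to it. The slide does not say which method or which two records it has in mind, so the distance measure (number of differing attributes) and both thresholds below are assumptions chosen for illustration.

```python
cities = [
    ("Lahore", ("hot", "low", False)),
    ("Islamabad", ("hot", "high", True)),
    ("Islamabad", ("hot", "high", False)),
    ("Multan", ("mild", "low", False)),
    ("Karachi", ("cool", "normal", False)),
    ("Rawalpindi", ("hot", "high", True)),
]

def hamming(a, b):
    """Number of attribute positions in which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def find_outliers(records, max_distance=1, min_neighbors=2):
    """Flag records with fewer than `min_neighbors` other records
    within `max_distance` (both thresholds are assumed values)."""
    outliers = []
    for i, (name, attrs) in enumerate(records):
        neighbors = sum(
            1 for j, (_, other) in enumerate(records)
            if j != i and hamming(attrs, other) <= max_distance
        )
        if neighbors < min_neighbors:
            outliers.append(name)
    return outliers

outliers = find_outliers(cities)
print(outliers)
```

With these thresholds the sketch flags Multan and Karachi, the only mild and cool cities in the table. Different thresholds would flag different records.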
Are All the “Discovered” Patterns Interesting?
 A data mining system/query may generate thousands of
patterns; not all of them are interesting.
 Suggested approach: query-based, constraint-based
mining.
 Interestingness Measures: A pattern is interesting if
it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm.
Can We Find All and Only Interesting
Patterns?
 Find all the interesting patterns: Completeness
 Can a data mining system find all the interesting patterns?
 Remember that many problems in Data Mining are NP-hard.
 So for a given problem there is usually no efficient way to
guarantee the globally best solution.
 Search for only interesting patterns: Optimization
 Can a data mining system find only the interesting patterns?
 Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns: constraint-based mining (give
threshold factors in mining).
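
The difference between the two approaches can be sketched on the four-transaction example from the association rule slide, taking "interesting" to mean "support of at least 2" (an assumption for this illustration). The first function enumerates every itemset and filters afterwards; the second uses the threshold during generation, extending only itemsets that already passed it, which is the idea behind level-wise algorithms such as Apriori. Function names are illustrative.

```python
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Tea"},
    {"Butter", "Ice cream", "Cake"},
]
MIN_COUNT = 2  # assumed interestingness threshold (min_sup = 50%)
ITEMS = sorted(set().union(*transactions))

def count(itemset):
    """Number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter():
    """Approach 1: enumerate all itemsets, then keep the frequent ones."""
    kept, examined = set(), 0
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            examined += 1
            candidate = frozenset(combo)
            if count(candidate) >= MIN_COUNT:
                kept.add(candidate)
    return kept, examined

def constraint_based():
    """Approach 2: level-wise search; only frequent itemsets are extended."""
    kept, examined = set(), 0
    level = [frozenset([item]) for item in ITEMS]
    while level:
        survivors = []
        for candidate in level:
            examined += 1
            if count(candidate) >= MIN_COUNT:
                survivors.append(candidate)
        kept.update(survivors)
        # Join surviving itemsets that differ in exactly one item.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1})
    return kept, examined

all_first, examined_first = generate_then_filter()
only_interesting, examined_second = constraint_based()
print(examined_first, examined_second, all_first == only_interesting)
```

Both approaches return the same 7 itemsets, but the first examines 63 candidates and the second only 10, because no itemset containing Tea, Ice cream or Cake is ever extended.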
Reading Assignment
 Book Chapter
 Chapter 1 of “Data Mining: Concepts and Techniques” by
Jiawei Han and Micheline Kamber.
Data Mining ------- Where?
 Some Nice Resources
 ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD) – http://www.acm.org/sigs/sigkdd/.
 Knowledge Discovery Nuggets – www.kdnuggets.com.
 IEEE Transactions on Knowledge and Data Engineering –
http://www.computer.org/tkde/.
 IEEE Transactions on Pattern Analysis and Machine Intelligence –
http://www.computer.org/tpami/.
 Data Mining and Knowledge Discovery. Publisher: Springer
Science+Business Media B.V., formerly Kluwer Academic
Publishers B.V. – http://www.kluweronline.com/issn/1384-5810/.
 Current and previous offerings of Data Mining courses at
Stanford, CMU, MIT and Helsinki.
