Lecture 1-Introduction To Data Mining - M
Lecture # 1
Administrative Stuff
1. Automated data collection tools (e.g. web, sensor networks) and mature
database technology lead to tremendous amounts of data stored in
databases, data warehouses and other information repositories.
3. YouTube users upload 48 hours of video, Facebook users share 684,478 pieces
of content, Instagram users share 3,600 new photos, and Tumblr sees 27,778
new posts published.
Data mining: a misnomer?
Alternative names:
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archaeology, business intelligence, etc.
Data Mining (Example)
Random Guessing vs. Potential Knowledge
Suppose we have to forecast the probability of rain in Islamabad
city for any particular day.
Without any prior knowledge, the probability of rain would be 50%
(pure random guess).
If we had a lot of weather data, we could extract potential rules
using data mining, which could then forecast the chance of rain
better than random guessing.
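The contrast between guessing and using data can be sketched in a few lines of Python. The records below are invented purely for illustration (they are not real Islamabad weather data), and the attribute name humidity is an assumption borrowed from the classification example later in this lecture.

```python
from fractions import Fraction

# Hypothetical historical records, invented for illustration only:
# (humidity on that day, did it rain?)
history = [
    ("high", True), ("high", True), ("high", False), ("high", True),
    ("low", False), ("low", False), ("low", True), ("low", False),
]

def rain_probability(records, humidity=None):
    """Fraction of rainy days, optionally restricted to one humidity level."""
    matching = [rained for h, rained in records if humidity is None or h == humidity]
    if not matching:
        raise ValueError("no records match the given humidity")
    return Fraction(sum(matching), len(matching))

overall = rain_probability(history)            # no knowledge of the day's conditions
given_high = rain_probability(history, "high")  # knowing humidity is high
given_low = rain_probability(history, "low")    # knowing humidity is low
```

On this toy data the overall estimate is no better than a coin flip, while conditioning on one attribute already gives a sharper forecast.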
(Figure: the knowledge discovery process. Stages shown: Databases, Data Cleaning and Data Integration, Data Warehouse, Task-relevant Data, Data Mining.)
The Data Mining Process
• Step 0: Determine Business Objective/Learning the
application domain
- e.g. Forecasting the probability of rain
- Must have relevant prior knowledge and goals of application.
• Step 1: Creating a Target Data set/Prepare Data
- Data Selection
- Data Cleaning: handling noisy and missing values (may take 60% of
the effort!).
- Data Transformation (Normalization/Discretization).
- Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
- Classification, Clustering, Regression, Association Rules
• Step 3: Choosing The Mining Algorithm
- Selection of correct algorithm depending upon the quality of data.
- Selection of correct algorithm depending upon the density of data.
• Step 4: Data Mining
- Search for patterns of interest: a typical data mining algorithm can
mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
- Visualization/representation of interesting patterns, etc., and then
use of the discovered knowledge.
Data Mining and Business Intelligence
(Figure: layered pyramid, listed here from top to bottom. The potential to support business decisions increases toward the top; the role shown beside a layer is given in parentheses.)
• Making Decisions (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA (DBA)
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining: On What Kind of Data?
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
- Time-series data and temporal data
- Text databases
- Multimedia databases
- Data Stream (Sensor Networks Data)
- WWW
Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science and Other Disciplines.)
Data Mining vs SQL, EIS, and OLAP
• SQL is a query language that is difficult for business people
to use.
• EIS = Executive Information Systems. EIS systems
provide graphical interfaces that give executives a
pre-programmed (and therefore limited) selection of reports,
automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and
drill-down, therefore giving access to a vast array of analyses.
However, it requires manual navigation through scores of
reports, requiring the user to notice interesting patterns
themselves.
• Data Mining picks out interesting patterns. The user
can then use visualization tools to investigate further.
An Example of OLAP Analysis and its Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
(Figure, Step 1: Walking Sticks Sales by City. Cities: Karachi, Lahore, Islamabad. Values shown: 50, 10, 400.)
• Step 2: Noticing that Islamabad has high sales, you decide to
investigate further.
(Figure, Step 2: Walking Sticks Sales in Islamabad by Age.)
• (Before OLAP, you would have had to write a very complex SQL
query instead of simply clicking to drill down.)
• It seems that old people are responsible for most of these sales.
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to
decisions. (To be dealt with in Part 2 of this course)
(Figure: Rules and Data are the inputs to the Expert System, which produces Decisions.)
Difference between Machine Learning and
Data Mining
Machine learning techniques are designed to deal with a
limited amount of artificial intelligence data, whereas data
mining techniques deal with large amounts of database data.
Data Preprocessing
Handling Missing and Noisy Data (Data Cleaning).
Techniques we will cover:
Missing-value imputation using Mean, Median and Mode.
Missing-value imputation using K-Nearest Neighbor.
Missing-value imputation using Association Rules Mining.
Missing-value imputation using Fault-Tolerant Patterns.
Data Binning for Noisy Data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
(? = missing value)
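As a minimal sketch of the first technique, the table above can be imputed with Python's standard statistics module. The field names (refund, income_k, and so on) and the helper functions are hypothetical, chosen only for this illustration; the mode is used for the categorical Refund column, and the mean and median are computed for Taxable Income to show what would be filled in if an income were missing.

```python
from statistics import mean, median, multimode

# The five records from the table; None marks a missing value.
rows = [
    {"tid": 1, "refund": "Yes", "country": "USA", "income_k": 125, "cheat": "No"},
    {"tid": 2, "refund": None, "country": "UK", "income_k": 100, "cheat": "No"},
    {"tid": 3, "refund": "No", "country": "Australia", "income_k": 70, "cheat": "No"},
    {"tid": 4, "refund": None, "country": None, "income_k": 120, "cheat": "No"},
    {"tid": 5, "refund": "No", "country": "NZL", "income_k": 95, "cheat": "Yes"},
]

def observed(rows, column):
    """Values of a column with the missing entries left out."""
    return [r[column] for r in rows if r[column] is not None]

def impute(rows, column, fill_value):
    """Return copies of the rows with missing entries of one column filled in."""
    return [{**r, column: fill_value} if r[column] is None else dict(r) for r in rows]

# Categorical attribute: fill with the most frequent observed value (mode).
refund_modes = multimode(observed(rows, "refund"))
filled = impute(rows, "refund", refund_modes[0])

# Numeric attribute: the mean or median would be the fill value.
income_mean = mean(observed(rows, "income_k"))
income_median = median(observed(rows, "income_k"))

# Country has no single most frequent value here, so the mode is ambiguous.
country_modes = multimode(observed(rows, "country"))
```

The tie on Country is one reason the later techniques in the list (K-Nearest Neighbor, association rules) look at the other attributes of a record instead of a single column.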
Data Mining Functionalities (1)
Data Preprocessing
Data Transformation (Discretization and Normalization).
With the help of data transformation, rules become more general and
compact.
General and compact rules increase the accuracy of classification.
Age  Age (discretized)
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young
Bins: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
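The discretization above can be sketched as a lookup over the upper bounds of the three bins. The function name discretize_age and the use of bisect are choices made for this illustration; only the bin boundaries and labels come from the slide.

```python
from bisect import bisect_left

# Inclusive upper bound of each bin, taken from the slide:
# Child = 0 to 20, Young = 21 to 47, Old = 48 to 120.
UPPER_BOUNDS = [20, 47, 120]
LABELS = ["Child", "Young", "Old"]

def discretize_age(age):
    """Map a numeric age to its bin label."""
    if age < 0 or age > UPPER_BOUNDS[-1]:
        raise ValueError(f"age {age} is outside the range 0 to {UPPER_BOUNDS[-1]}")
    # bisect_left finds the first bin whose upper bound is >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]
```

After this step a rule can refer to three labels instead of many distinct ages, which is what makes the rules more general and compact.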
Data Preprocessing
We will cover the following Attribute/Feature Selection Techniques:
Principal Component Analysis
Wrapper Based
Filter Based
Data Mining Functionalities (2)
Association Rule Mining
In the Association Rule Mining framework, we have to find all the rules
in a transactional/relational dataset whose support
(frequency) is greater than some minimum support (min_sup)
threshold (provided by the user).
Itemset Support
{Butter} 4
{Bread} 3
{Egg} 2
{Bread,Butter} 3
{Bread, Butter, Egg} 2
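Support counting can be sketched by brute force: enumerate every subset of every transaction and count it. This is not Apriori or FP-Growth, only the naive baseline they improve on. The four transactions below are hypothetical, constructed so that the counts match the table, and the sketch keeps itemsets whose support is at least min_sup, which is an assumed reading of the threshold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, chosen to reproduce the support table above.
transactions = [
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter", "Egg"},
    {"Bread", "Butter"},
    {"Butter"},
]

def frequent_itemsets(transactions, min_sup):
    """Count every non-empty itemset and keep those with support >= min_sup."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_sup}

frequent = frequent_itemsets(transactions, min_sup=2)
```

Enumerating all subsets grows exponentially with transaction length, which is why dedicated frequent itemset mining algorithms are needed on real data.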
Data Mining Functionalities (2)
Association Rule Mining
Topics we will cover
Frequent Itemset Mining Algorithms (Apriori, FP-Growth,
Bit-vector).
Fault-Tolerant/Approximate Frequent Itemset Mining.
N-Most Interesting Frequent Itemset Mining.
Closed and Maximal Frequent Itemset Mining.
Incremental Frequent Itemset Mining
Sequential Patterns.
Projects
Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
An application of Fault-Tolerant Frequent Patterns is Missing-value
Imputation (Course Project).
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or
concepts for future prediction
Example: Classify rainy/non-rainy cities based on the Temperature,
Humidity and Windy attributes.
The previous business decisions (class labels) must be known (Supervised
Learning).
Training records:
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes
Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.
Prediction of unknown records:
City        Temperature  Humidity  Windy  Rain
Muree       hot          high      false  ?
Sibi        mild         low       true   ?
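The rule can be checked against the labelled records and then applied to the unknown ones. In this sketch the function predict_rain and the dictionary keys are hypothetical names; returning None when the rule does not fire is a design choice made here, since the slide gives only the rule for Rain = Yes.

```python
# Labelled (training) records from the slide.
training = [
    {"city": "Lahore", "temperature": "hot", "humidity": "low", "windy": False, "rain": "No"},
    {"city": "Islamabad", "temperature": "hot", "humidity": "high", "windy": True, "rain": "Yes"},
    {"city": "Islamabad", "temperature": "hot", "humidity": "high", "windy": False, "rain": "Yes"},
    {"city": "Multan", "temperature": "mild", "humidity": "low", "windy": False, "rain": "No"},
    {"city": "Karachi", "temperature": "cool", "humidity": "normal", "windy": False, "rain": "No"},
    {"city": "Rawalpindi", "temperature": "hot", "humidity": "high", "windy": True, "rain": "Yes"},
]

# Records whose class label is unknown.
unknown = [
    {"city": "Muree", "temperature": "hot", "humidity": "high", "windy": False},
    {"city": "Sibi", "temperature": "mild", "humidity": "low", "windy": True},
]

def predict_rain(record):
    """Apply the slide's rule; return None when the rule does not cover the record."""
    if record["temperature"] == "hot" and record["humidity"] == "high":
        return "Yes"
    return None

# Check the rule on the labelled data before trusting it on new records.
covered = [r for r in training if predict_rain(r) == "Yes"]
rule_is_consistent = all(r["rain"] == "Yes" for r in covered)

predictions = {r["city"]: predict_rain(r) for r in unknown}
```

The rule covers Muree but says nothing about Sibi, so a complete classifier would need further rules or a default class.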
Data Mining Functionalities (2)
Cluster Analysis
Group data to form new classes from data that has no class labels.
Business decisions are unknown (also called Unsupervised Learning).
Example: Group cities into clusters based on the Temperature, Humidity
and Windy attributes.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?
(Figure: the records grouped into 3 clusters.)
Data Mining Functionalities (3)
Outlier Analysis
Outlier: A data object that does not comply with the general
behavior of the data.
Book Chapter
Chapter 1 of “Data Mining: Concepts and Techniques”
by Jiawei Han and Micheline Kamber.
Data Mining: Where?
Some Nice Resources
ACM Special Interest Group on Knowledge Discovery and
Data Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.