You are on page 1of 40

*Slides edited from Han and Kamber’s online lecture

Data Mining (DM)

Lecture 2: Data Mining and


its Applications

Ms Ansif Arooj, University of Education, S & T, Township Campus Lahore


Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
2
What Is Data Mining?

 Data mining (knowledge discovery from data)


Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
Data mining: a misnomer?
 Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.

3
Knowledge Discovery (KDD) Process
 This is a view from typical database
systems and data warehousing
Pattern Evaluation
communities
 Data mining plays an essential role in
the knowledge discovery process
Data Mining

Task-relevant Data

Data Warehouse Selection

Data Cleaning

Data Integration

4 Databases
Cubic view of Data
Aggregation Hierarchies

6 © Prentice Hall
Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile”
William Inmon
 Operational Data: Data used in day to day needs of
company.
 Informational Data: Supports other functions such as
planning and forecasting.
 Data mining tools often access data warehouses rather than
operational data.

DM: May access data in warehouse.

7 © Prentice Hall
Operational vs. Informational
 
Operational Data Data Warehouse
Application OLTP OLAP
Use Precise Queries Ad Hoc
Temporal Snapshot Historical
Modification Dynamic Static
Orientation Application Business
Data Operational Values Integrated
Size Gigabits Terabits
Level Detailed Summarized
Access Often Less Often
Response Few Seconds Minutes
Data Schema Relational Star/Snowflake

8 © Prentice Hall
OLAP
 Online Analytic Processing (OLAP): provides more complex
queries than OLTP.
 OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
 Dimensional data; cube view
 Visualization of operations:
 Slice: examine sub-cube.
 Dice: rotate cube to look at another dimension.
 Roll Up/Drill Down

DM: May use OLAP queries.

9 © Prentice Hall
OLAP Operations

Roll Up

Drill Down

Single Cell Multiple Cells Slice Dice

10 © Prentice Hall
Data Mining in Business Intelligence

Increasing potential
to support
business decisions End User
Decision
Making

Data Presentation Business


Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst

Data Exploration
Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses


DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
11
KDD Process: A Typical View from ML and Statistics

Input Data Data Pre- Data Post-


Processing Mining Processing

Data integration Pattern discovery Pattern evaluation


Normalization Association & correlation Pattern selection
Feature selection Classification Pattern interpretation
Clustering
Dimension reduction Pattern visualization
Outlier analysis
…………

 This is a view from typical machine learning and statistics communities

12
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social and
information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Descriptive (mining tasks characterize properties of the data in a
target data set.) vs. predictive data mining (mining tasks perform
induction on the current data in order to make predictions).
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
13
FUNCTION OF DATA MINING

14
Application of Data Mining
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Data Mining Function: (1) Generalization

 Information integration and data warehouse construction


Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
 Multidimensional concept description:
 Characterization and discrimination
Generalize, summarize,
and contrast data characteristics
17
Data Mining Function: (2) Association and
Correlation Analysis
 Frequent patterns (or frequent item sets)
 What items are frequently purchased together in your shopping
mall?
 Association, correlation vs. causality
A typical association rule
 Butter, Bread  Milk [20%, 100%] [support, confidence)
 RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop")
[s=2% ,c=55%]
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large datasets?
 How to use such patterns for classification, clustering, and other
applications?
18
Support(x->y)=P(XUY)
Confidence(x->y)=XUY/X
Example: Butter, Bread  Milk
Small Data Mining Example
X  Y
 Butter, Bread  Milk
 [20%, 100%] (support, confidence)
Support
the proportion of transactions in the data set which
contain the itemset.
1/5
Confidence

1/1
CLASS ACTIVITY
Rule 1) Coke, burger Diapers
Rule 2) Coke, burger, Potatoes  bread
Rule 3) Coke, burger, potatoes onion, bread
Rule 4) burger, potatoes, onion coke

Transaction Coke Burger Potatoes Onion Diapers Bread


ID
1 1 1 1 1
2 1 1 1
3 1 1 1 1
4 1 1 1 1 1
5 1
6 1 1 1
7 1 1 1 1
8 1 1 1 1 1
9 1 1
10 1 1 1 1 1 1
Take home activity
Data Mining Function: (3) Classification
 Classification and label prediction
Construct models (functions) based on some training examples
Also named as supervised classification
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
24
Classification
Data Mining Function: (4) Cluster Analysis

 Unsupervised learning (i.e., Class label is unknown)


 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing
interclass similarity
 Many methods and applications

27
Clustering
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
Outlier: A data object that does not comply with the
general behavior of the data
Noise or exception? ― One person’s garbage could be
another person’s treasure
Methods: by product of clustering or regression analysis,

Useful in fraud detection, rare events analysis

29
Data Mining Function: (6) Prediction
The major idea is to use a large number of past
values to consider probable future values.
Forecasting and predicting the unavailable data
values or a class label for some data.
Evaluation of Knowledge
 Are all mined knowledge interesting?
One can mine tremendous amount of “patterns”
Some may fit only certain dimension space (time,
location, …)
Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
31
Data Mining: Confluence of Multiple Disciplines

Machine Pattern Statistics


Learning Recognition

Applications Data Mining Visualization

Algorithm Database High-Performance


Technology Computing

32
Summary

 Data mining: Discovering interesting patterns and knowledge from massive


amount of data
 A natural evolution of science and information technology, in great demand,
with wide applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed in a variety of data

 Data mining functionalities: characterization, discrimination, association,


classification, clustering, trend and outlier analysis, etc.
 Data mining technologies and applications

 Major issues in data mining

33
Class Activity
Discuss whether or not each of the following activities is a
data mining task.

A) Dividing the customers of a company according to


their gender.
No. This is a simple database query.
B) Dividing the customers of a company according to
their profitability.
No. This is an accounting calculation, followed by the
application of a threshold. However, predicting the
profitability of a new customer would be data mining.
Data Mining yes/no?
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification
numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the die is fair, this is a probability calculation. If the die
were not fair, and we needed to estimate the probabilities of each
outcome from the data, then this is more like the problems
considered by data mining. However, in this specific case, solutions
to this problem were developed by mathematicians a long time ago,
and thus, we wouldn’t consider it to be data mining.
Data Mining yes/no?
(f)Predicting the future stock price of a company using
historical records.
Yes. We would attempt to create a model that can
predict the continuous value of the stock price. This is
an example of the area of data mining known as
predictive modelling. We could use regression for this
modelling, although researchers in many fields have
developed a wide variety of techniques for predicting
time series.
Data Mining yes/no?
(g) Monitoring the heart rate of a patient for
abnormalities.
Yes. We would build a model of the normal behavior
of heart rate and raise an alarm when an unusual heart
behavior occurred. This would involve the area of data
mining known as anomaly detection. This could also
be considered as a classification problem if we had
examples of both normal and abnormal heart behavior.
Data Mining yes/no?
(h) Monitoring seismic waves for earthquake
activities.
Yes. In this case, we would build a model of different
types of seismic wave behavior associated with
earthquake activities and raise an alarm when one of
these different types of seismic activity was observed.
This is an example of the area of data mining known as
classification.

Extracting the frequencies of a sound wave.


No. This is signal processing.
Data Mining and Data Privacy
For each of the following data sets, explain whether or not data
privacy is an important issue.
(a) Census data collected from 1900–1950.
No
(b) IP addresses and visit times of Web users who visit your Website.
Yes
(c) Images from Earth-orbiting satellites.
No
(d) Names and addresses of people from the telephone book.
No
(e) Names and email addresses collected from the Web.
No
Recommended Reference Books

 E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011


 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001
 J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3 rd ed. , 2011
 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 nd
ed., Springer, 2009
 B. Liu, Web Data Mining, Springer 2006
 T. M. Mitchell, Machine Learning, McGraw Hill, 1997
 Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,
40 Morgan Kaufmann, 2 nd ed. 2005

You might also like