
Data Mining

January 21, 2024 Data Mining 1


Introduction
 Why data mining?

 What is Data Mining / Knowledge Discovery in Databases (KDD)?

 Origins of Data Mining

 Potential Applications

 Data Mining: On what kind of data?

 Data Mining Functionalities

 OLAP Mining System



Why Data Mining:
Trends leading to Data Flood
More data is generated:
 Bank, telecom, and other business transactions
 Scientific data: astronomy, biology, etc.
 Web, text, and e-commerce



Scale Of Data



Data Growth Rate
 Twice as much information was created in
2002 as in 1999 (~30% growth rate)
 Other growth rate estimates even higher
 And THE PROBLEM IS:
 Very little data will ever be looked at by a
human
 We are drowning in data, but starving for
knowledge
 Knowledge Discovery is NEEDED to make
sense and use of data.
Why Mine Data?
 There is often information “hidden” in the data that is not readily
evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all





What Is Data Mining:
Many Names of Data Mining
 Data Fishing, Data Dredging: 1960-
 used by statisticians (as a pejorative)
 Data Mining: 1990-
 used by the database and business communities
 in 2003, a bad image because of TIA (Total Information Awareness)
 Knowledge Discovery in Databases: 1989-
 used by the AI and machine learning communities
 also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery in Databases (KDD) are used interchangeably
Knowledge Discovery in Databases (KDD)
Knowledge Discovery in Data is the non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com, etc.)
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems



Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
What is Data Mining: A KDD Process

Data mining: the core of the Knowledge Discovery in Databases process.

[Figure: the KDD process as a pipeline. Databases, then data cleaning and data integration, then data warehouse, then selection, then task-relevant data, then data mining, then pattern evaluation]
Steps of a KDD Process
1. Learning the application domain
 relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of effort!)
4. Data reduction and transformation
 Find useful features, dimensionality/variable reduction,
invariant representation.
5. Choosing functions of data mining
 summarization, classification, regression, association,
clustering.
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns,
etc.
9. Use of discovered knowledge
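
Steps 2 to 4 and 7 above can be sketched on a toy data set. This is a minimal illustration in plain Python, not a prescribed implementation: the records, field names, and helper functions (`select`, `clean`, `min_max_scale`, `summarize`) are all hypothetical, and dropping incomplete records is only one of several possible cleaning policies.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 30000, "segment": "A"},
    {"id": 2, "age": None, "income": 52000, "segment": "B"},
    {"id": 3, "age": 40, "income": 61000, "segment": "B"},
    {"id": 4, "age": 35, "income": 45000, "segment": "A"},
]

def select(records, fields):
    """Step 2: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Step 3: drop records with missing values (one simple cleaning policy)."""
    return [r for r in records if all(v is not None for v in r.values())]

def min_max_scale(records, field):
    """Step 4: rescale a numeric field to [0, 1]; assumes the values are not all equal."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]

def summarize(records, group_field, value_field):
    """Step 7 (summarization): mean of value_field within each group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[value_field])
    return {g: sum(vs) / len(vs) for g, vs in groups.items()}

data = select(raw, ["age", "income", "segment"])
data = clean(data)
data = min_max_scale(data, "income")
summary = summarize(data, "segment", "income")
print(summary)
```

Each step is a small function that takes and returns a list of records, so steps can be reordered or swapped without touching the others.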
Data Mining and Business Intelligence
Layers, from top to bottom, with decreasing potential to support business decisions (typical role in parentheses):
 Making Decisions (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
 Data Warehouses / Data Marts: OLAP, MDA (DBA)
 Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
Architecture of a Typical Data Mining System
Layers, from top to bottom:
 Graphical user interface
 Pattern evaluation
 Data mining engine
 Database or data warehouse server
 Data cleaning, data integration, and filtering
 Databases and data warehouse
 Knowledge base (drawn beside the layers in the original diagram)
What Tasks Can Data Mining Accomplish?

The most common data mining tasks are:


 Description
 Classification
 Estimation
 Prediction
 Clustering
 Association



Task 1: Description
 Find ways to describe patterns and trends
lying within data.
 For example:
 A pollster can uncover evidence that those who
have been laid off are less likely to support the
present incumbent in the presidential election.
 Such descriptions often suggest an explanation: those who have been
laid off are now less well off financially than
before the incumbent was elected, and so would
tend to prefer an alternative.



Task 1: Description
 The models should be as transparent
as possible.
 High-quality description can often be
accomplished by exploratory data
analysis, a graphical method of
exploring data in search of patterns and
trends.



Task 2: Classification
The data mining model examines a large set of records, each record
containing information on the target variable as well as a set of
input or predictor variables.
 For example, consider an excerpt from a data set in which income bracket is the target variable.
 After it “learns” from the data, the algorithm can classify new records
for which no information about income bracket is available.



Task 2: Classification
Examples of classification tasks in business and research include:
 Determining whether a particular credit card transaction is
fraudulent
 Placing a new student into a particular track with regard to
special needs
 Assessing whether a mortgage application is a good or bad
credit risk
 Diagnosing whether a particular disease is present
 Determining whether a will was written by the actual deceased,
or fraudulently by someone else
 Classifying type of drug a patient should be prescribed, based
on certain patient characteristics.
 Etc.



Task 2: Classification
 Common data mining methods
used for classification are:
 k-nearest neighbor
 decision tree
 neural network
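
As an illustration of the first method listed above, here is a minimal sketch of k-nearest-neighbor classification in plain Python. The toy records (age, years of education) and the income-bracket labels are invented for illustration and are not the slides' data set; `knn_classify` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training records nearest to query.

    train: list of (features, label) pairs, where features is a tuple of numbers.
    """
    # Sort the training records by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Let the k closest records vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled records: (age, years of education) -> income bracket.
train = [
    ((25, 12), "low"), ((30, 12), "low"), ((22, 10), "low"),
    ((45, 18), "high"), ((50, 16), "high"), ((40, 20), "high"),
]

# Classify new records for which the income bracket is unknown.
print(knn_classify(train, (48, 17)))
print(knn_classify(train, (24, 11)))
```

The sketch uses raw feature values. When features are on very different scales, they are usually rescaled first so that no single feature dominates the distance.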



Task 3: Estimation
 Similar to classification except that the target
variable is numerical rather than categorical.
 Models are built using “complete” records,
which provide the value of the target variable
as well as the predictors.
 Then, for new observations, estimates of the
value of the target variable are made, based
on the values of the predictors.



Task 3: Estimation
Examples of estimation tasks in business and research include:
 Estimating the amount of money a randomly chosen family of
four will spend for back-to-school shopping this fall.
 Estimating the percentage decrease in rotary-movement
sustained by a National Football League running back with a
knee injury.
 Estimating the number of points per game that Patrick Ewing
will score when double-teamed in the playoffs.
 Estimating the grade-point average (GPA) of a graduate
student, based on that student’s undergraduate GPA.
 Estimating a person’s yearly income from descriptive and
personal data, e.g. age, job, home address, etc.
 Etc.



Task 3: Estimation
 Common data mining methods used for
estimation are:
 Statistical analysis:
 Point estimation
 Confidence interval estimations
 Simple linear regression
 Multiple regression
 Correlation
 Neural networks
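
Simple linear regression, one of the methods listed above, can be sketched in a few lines. The example estimates a graduate GPA from an undergraduate GPA, echoing one of the estimation tasks mentioned earlier. The GPA values are invented and deliberately lie on a straight line so the result is easy to check by hand; `fit_simple_linear_regression` is a hypothetical helper name.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical "complete" records: undergraduate GPA -> graduate GPA.
undergrad = [2.0, 2.5, 3.0, 3.5, 4.0]
grad = [2.6, 2.9, 3.2, 3.5, 3.8]

intercept, slope = fit_simple_linear_regression(undergrad, grad)

# Estimate the target for a new observation where only the predictor is known.
estimate = intercept + slope * 3.2
print(round(intercept, 2), round(slope, 2), round(estimate, 2))
```

Real data would scatter around the fitted line, which is where the confidence interval estimates listed above come in.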



Task 4: Prediction
Similar to classification and estimation, except that for
prediction, the results lie in the future.
 For example, predicting the price of a stock three
months in the future.



Task 4: Prediction
Examples of prediction tasks in business and research
include:
 Predicting the price of a stock three months into the
future
 Predicting the percentage increase in traffic deaths
next year if the speed limit is increased
 Predicting the winner of this fall’s baseball World
Series, based on a comparison of team statistics
 Predicting whether a particular molecule in drug
discovery will lead to a profitable new drug for a
pharmaceutical company
Task 4: Prediction
 Any of the methods and techniques
used for classification and estimation
may also be used for prediction. These
include:
 Statistical methods
 Neural Networks
 Decision tree
 k-nearest neighbor
Task 5: Clustering
 Grouping of records, observations, or cases into
classes of similar objects.
 A cluster is a collection of records that are similar to
one another, and dissimilar to records in other
clusters.
 The clustering task does not try to classify, estimate,
or predict the value of a target variable.
 It seeks to segment the entire data set into relatively
homogeneous subgroups or clusters.



Task 5: Clustering
 For example, the PRIZM segmentation system
describes every U.S. zip code area in terms of
distinct lifestyle types.
 For illustration, the clusters for zip code 90210,
Beverly Hills, California, are:
 Cluster 01: Blue Blood Estates
 Cluster 10: Bohemian Mix
 Cluster 02: Winner’s Circle
 Cluster 07: Money and Brains
 Cluster 08: Young Literati



Task 5: Clustering
Examples of clustering tasks in business and research
include:
 Target marketing of a niche product for a small-
capitalization business that does not have a large
marketing budget
 For accounting auditing purposes, to segment financial
behavior into benign and suspicious categories
 As a dimension-reduction tool when the data set has
hundreds of attributes
 For gene expression clustering, where very large
quantities of genes may exhibit similar behavior



Task 5: Clustering
Common data mining methods used for
clustering are:
 Hierarchical clustering (AGNES, DIANA, etc.)
 Partitional clustering (k-means, PAM, etc.)
 DBSCAN
 Kohonen networks
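
To make partitional clustering concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The points are invented, and using the first k points as initial centroids is an assumption made only to keep the example deterministic; practical implementations usually choose the starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, clusters)."""
    # Assumption for this sketch: the first k points serve as initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional records forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No target variable appears anywhere in the code: the groups emerge only from the distances between records.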



Task 6: Association
 Finding which attributes “go together.”
 Most prevalent in the business world.
 Also known as affinity analysis or
market basket analysis.
 The task of association seeks to
uncover rules for quantifying the
relationship between two or more
attributes.
Task 6: Association
 For example, a particular supermarket may
find that of the 1000 customers shopping on a
Thursday night, 200 bought diapers, and of
those 200 who bought diapers, 50 bought
beer.
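 Thus, the association rule would be “If buy
diapers, then buy beer”. Under the usual definition its
support is 50/1000 = 5% (the share of all customers who
bought both items) and its confidence is 50/200 = 25%.
The antecedent alone (diapers) appears in 200/1000 = 20%
of the baskets.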
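
The counts in the supermarket example can be checked with a few lines of Python. The sketch reports three figures: the share of customers who bought diapers (the rule's antecedent), the share who bought both items (what the association-rule literature usually calls the support of the rule), and the confidence. The function name and dictionary keys are hypothetical.

```python
def rule_metrics(n_total, n_antecedent, n_both):
    """Support and confidence figures for the rule "if antecedent then consequent"."""
    return {
        "antecedent_support": n_antecedent / n_total,  # share who bought diapers
        "rule_support": n_both / n_total,              # share who bought diapers and beer
        "confidence": n_both / n_antecedent,           # beer buyers among diaper buyers
    }

# Counts from the Thursday-night example: 1000 customers, 200 bought diapers,
# and 50 of those 200 also bought beer.
metrics = rule_metrics(n_total=1000, n_antecedent=200, n_both=50)
print(metrics)
```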



Task 6: Association
Examples of association tasks in business and research
include:
 Examining the proportion of children whose parents read to
them who are themselves good readers
 Predicting degradation in telecommunications networks
 Finding out which items in a supermarket are purchased
together and which items are never purchased together
 Determining the proportion of cases in which a new drug
will exhibit dangerous side effects
 Cross-selling analysis of the products.
 Optimizing the performance of online banner advertisements
that present discount offers on various investment
products
Task 6: Association
Common data mining methods used for
association are:
 Apriori Algorithm
 FP-Tree
 Generalized Rule Induction Method
 Etc.
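
The Apriori algorithm listed above finds frequent itemsets level by level. It relies on the fact that an itemset can only be frequent if all of its subsets are frequent. The following is a simplified sketch written for readability rather than speed. The baskets and the `min_support` threshold are invented, and support here means the fraction of baskets that contain every item in the itemset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        frequent.update({s: support(s) for s in current})
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical market baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "milk"},
    {"bread", "beer"},
]
frequent = apriori(baskets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Association rules are then read off the frequent itemsets by computing the confidence of each candidate rule.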



Potential Applications
 Database analysis and decision support
 Market analysis and management
 target marketing, customer relationship management, market
basket analysis, cross-selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
 Fraud detection and management
 Other Applications
 Text mining (news, email, documents) and Web analysis.
 Intelligent query answering
Successful e-commerce – Case Study



Other Applications
 Sports
 IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York Knicks
and Miami Heat
 Astronomy
 JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
 Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preference and behavior
pages, analyzing effectiveness of Web marketing, improving Web site
organization, etc.
 Detecting the spread of diseases, pandemics, epidemics, and plagues.



Data Mining: On What Kind of Data?
 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Spatial databases
 Time-series data and temporal data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW
Thanks
