Data Warehousing & Data Mining Chapter 4

Chapter 4:
Data Mining
TBS 2020-2021
Olfa Dridi & Afef Ben Brahim

What is data mining
▪ To understand the term ‘data mining’ it is useful
to look at the literal meaning of the word ‘mine’
▪ To mine means to extract

▪ The association of this word with data suggests
an in-depth search to find additional information
which previously went unnoticed in the mass of
data available
What is data mining
▪ Data mining can produce the following types of models:
▪ Models that explain the data (e.g., a single
function).
▪ Models that predict future data instances.
▪ Models that summarize the data.
▪ Models that extract the most prominent
features of the data.
What is data mining
▪ Data mining is used today by companies with a strong
consumer focus - retail, financial, communication, and
marketing organizations.
▪ It enables these companies to determine relationships
among "internal" factors such as price, product
positioning, or staff skills, and "external" factors such as
economic indicators, competition, and customer
demographics.
▪ It enables them to determine the impact on sales,
customer satisfaction, and corporate profits.
[Bill Palace, 1996]
What is data mining
▪ With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful means
for analysis and perhaps interpretation of such data and
for the extraction of interesting knowledge that could
help in decision-making.
▪ Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data in databases.
While data mining and knowledge discovery in databases
(or KDD) are frequently treated as synonyms, data
mining is actually part of the knowledge discovery
process.
What is data mining
The following figure shows data mining
as a step in an iterative knowledge
discovery process.
The Knowledge Discovery in Databases (KDD)
The Knowledge Discovery in Databases process comprises
a few steps leading from raw data collections to some
form of new knowledge.
The iterative process consists of the following steps:
▪ Data cleaning: also known as data cleansing, it is a
phase in which noisy and irrelevant data are
removed from the collection.
▪ Data integration: at this stage, multiple data sources,
often heterogeneous, may be combined in a common
source.
▪ Data selection: at this step, the data relevant to the
analysis is decided on and retrieved from the data
collection.
What is data mining
▪ Data transformation: also known as data
consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining
procedure.
▪ Data mining: it is the crucial step in which clever
techniques are applied to extract potentially useful
patterns.
▪ Pattern evaluation: in this step, strictly interesting
patterns representing knowledge are identified based
on given measures.
▪ Knowledge representation: the final phase, in which
the discovered knowledge is visually represented to
the user. This essential step uses visualization
techniques to help users understand and interpret the
data mining results.
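The steps listed above can be sketched as a small chain of functions. This is only an illustrative sketch: the two record sources, the field names (`age`, `spend`), the threshold-based "mining" step and every function name are assumptions made for the example, not part of the KDD definition.

```python
# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "age": 25, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
store_b = [{"id": 3, "age": 41, "spend": 300.0}, {"id": 4, "age": 38, "spend": 20.0}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field, threshold=0.5):
    """Data mining (toy stand-in): split records into 'high' and 'low' groups."""
    return {
        "high": [r for r in records if r[field] >= threshold],
        "low": [r for r in records if r[field] < threshold],
    }

def evaluate(patterns, min_size=1):
    """Pattern evaluation: keep only groups with at least min_size members."""
    return {name: group for name, group in patterns.items() if len(group) >= min_size}

def present(patterns):
    """Knowledge representation: a textual summary (a chart in a real tool)."""
    return [f"{name}: {len(group)} record(s)" for name, group in sorted(patterns.items())]

def run_kdd():
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, ["age", "spend"])
    data = transform(data, "spend")
    return evaluate(mine(data, "spend"))
```

Calling `present(run_kdd())` returns one summary line per group. In practice the process is iterative, so the output of a later step often sends the analyst back to an earlier one.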
Steps of a KDD Process
▪ Learning the application domain:
– relevant prior knowledge and goals of application
▪ Creating a target data set: data selection
▪ Data cleaning and preprocessing: (may take 60% of effort!)
▪ Data reduction and transformation:
– Find useful features, dimensionality/variable reduction,
invariant representation.
▪ Choosing the data mining function
– summarization, classification, regression, association, clustering.
▪ Choosing the mining algorithm(s)
▪ Data mining: search for patterns of interest
▪ Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
▪ Use of discovered knowledge
Main data mining tasks
▪ Classification:
mining patterns that can classify future data
into known classes.
▪ Association rule mining
mining any rule of the form X → Y, where X
and Y are sets of data items.
▪ Clustering
identifying a set of similarity groups in the data
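As an illustration of a rule X → Y over sets of items, the sketch below computes the two usual measures of such a rule, support and confidence. The transactions and the names `support` and `confidence` are hypothetical and chosen only for the example.

```python
# Hypothetical market-basket transactions; each one is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / support_x
```

Here `support({"bread", "milk"}, transactions)` is 0.5, and `confidence({"bread"}, {"milk"}, transactions)` is 2/3, because two of the three baskets containing bread also contain milk.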
Main data mining tasks
▪ Sequential pattern mining:
A sequential rule A → B says that event A
will be immediately followed by event B with
a certain confidence
▪ Deviation detection:
discovering the most significant changes in
data
▪ Data visualization:
using graphical methods to show
patterns in data.
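The confidence of a sequential rule A → B can be estimated by counting. The sketch below assumes one simple definition (the fraction of occurrences of A that are immediately followed by B); the event sequences and the function name are hypothetical.

```python
def sequential_confidence(sequences, a, b):
    """Fraction of occurrences of event a that are immediately followed by b."""
    occurrences = 0
    followed = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                occurrences += 1
                if i + 1 < len(seq) and seq[i + 1] == b:
                    followed += 1
    return followed / occurrences if occurrences else 0.0

# Hypothetical click sequences, one list of events per user session.
sessions = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy"],
]
```

With these sessions, "search" occurs three times and is immediately followed by "buy" twice, so the rule search → buy has confidence 2/3.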
Why is data mining necessary?
▪ Make use of your data assets
▪ There is a big gap between stored data and
knowledge, and the transition won’t occur
automatically.
▪ Many interesting things you want to find
cannot be found using database queries:
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
Data mining applications
▪ Marketing
customer profiling and retention, identifying
potential customers, market segmentation.
▪ Fraud detection
identifying credit card fraud, intrusion detection
▪ Scientific data analysis
▪ Text and web mining
▪ Any application that involves a large
amount of data …
Data mining functions
▪ Association rules
▪ Sequence mining
▪ Classification (decision trees, etc.)
▪ Clustering
▪ Deviation detection
Data mining techniques
Many methods, such as
▪ Decision trees
▪ K-nearest neighbours
▪ Neural networks
▪ Genetic algorithms
▪ Hidden Markov models
▪ Time series
▪ Bayesian networks
▪ Rough and fuzzy sets
Predictive modeling
▪ A “black box” that makes predictions about
the future based on information from the
past and present
[Figure: the inputs Age, Salary, and CarType feed a Model that outputs High/Low Risk]
▪ Large number of inputs available
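To make the "black box" idea concrete, the sketch below learns a lookup table from past records and uses it to predict the risk of a new record. It is a deliberately simple stand-in (a majority vote on one attribute), not the model behind the slide; the records, the field names and the `fit`/`predict` functions are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical past records with a known risk label.
past = [
    {"age": 23, "salary": 20000, "car_type": "sports", "risk": "High"},
    {"age": 45, "salary": 60000, "car_type": "family", "risk": "Low"},
    {"age": 31, "salary": 35000, "car_type": "sports", "risk": "High"},
    {"age": 52, "salary": 80000, "car_type": "family", "risk": "Low"},
    {"age": 38, "salary": 50000, "car_type": "sports", "risk": "Low"},
]

def fit(records, attribute, target):
    """Learn the most frequent target value for each value of one attribute."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[attribute]][r[target]] += 1
    # Fallback for attribute values never seen in the past data.
    default = Counter(r[target] for r in records).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return table, default

def predict(model, record, attribute):
    """Predict the target of a new record from the learned table."""
    table, default = model
    return table.get(record[attribute], default)

model = fit(past, "car_type", "risk")
```

For example, `predict(model, {"car_type": "sports"}, "car_type")` returns "High" because two of the three past sports-car records are high risk. A real predictive model would combine many inputs rather than a single one.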
Models
• Some models are better than others
– Accuracy
– Understandability
• Models range from easy to understand to
incomprehensible (listed here from easier to harder)
– Decision trees
– Rule induction
– Regression models
– Neural networks
Supervised vs. Unsupervised Learning
[Figure: Data Mining divided into unsupervised systems and supervised systems]
Supervised vs. Unsupervised Learning
▪ Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
▪ Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
Supervised Learning
▪ Discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the
values of the target attribute in future data
instances.
▪ Supervised techniques
• Decision trees
• Bayesian networks
• Classification rules
• Neural networks
Supervised Learning
The data and the goal
▪ Data: A set of data records (also called examples,
instances or cases) described by
– k attributes: A1, A2, … Ak.
– a class: each example is labelled with a
predefined class.
▪ Goal: To learn a classification model from the data
that can be used to predict the classes of new
(future, or test) cases/instances.
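The setting above (examples described by attributes A1 … Ak plus a class label, used to classify new cases) can be sketched with a k-nearest-neighbours classifier, one of the techniques named earlier. The training points, the class names C1 and C2 as used here, and the function name are hypothetical; Euclidean distance and majority voting are assumptions of this sketch.

```python
import math
from collections import Counter

def knn_predict(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training examples.

    training is a list of (attributes, label) pairs, where attributes is a
    tuple of numeric values (A1, ..., Ak).
    """
    by_distance = sorted(training, key=lambda ex: math.dist(ex[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training set with two attributes per example.
training = [
    ((1.0, 1.0), "C1"),
    ((1.2, 0.8), "C1"),
    ((0.9, 1.1), "C1"),
    ((5.0, 5.0), "C2"),
    ((5.2, 4.9), "C2"),
    ((4.8, 5.1), "C2"),
]
```

A new case such as `(1.1, 1.0)` is assigned to C1, because its three nearest training examples all carry that label.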
Supervised Learning
[Figure: an object O, described by attributes (variables) A1, A2, …, Ak, is passed to a supervised method that assigns it to one of the classes C1, C2, C3, …, Cn]
Unsupervised Learning
▪ Goal: To find some kind of structure in the data.

▪ TASK – vaguely defined
▪ No EXPERIENCE
▪ No PERFORMANCE (but there are some
evaluation metrics)
▪ The data have no target attribute.
▪ We want to explore the data to find some intrinsic
structures in them.
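Finding structure without a target attribute can be illustrated with a minimal k-means clustering sketch on one-dimensional data. The values, the initial centers and the function name are hypothetical, and a fixed number of iterations is used instead of a convergence test to keep the example short.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numeric values around the given initial centers (k-means, 1-D)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabelled measurements.
values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No labels are supplied, yet the two groups around 1.5 and 10.5 emerge from the data alone. The result depends on the initial centers, which is one reason the task is only vaguely defined.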
[Figure: an object O, described by attributes (variables) A1, A2, …, Ak, is passed to an unsupervised method; the results, marked "????" on the slide, are assessed with measures]
▪ Unsupervised techniques
• Clustering
• Association rules
• Neural networks
Recap: Data mining process (KDD)
[Figure: Original Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation → Knowledge]
Recap: Data mining goals
▪ Prediction
– What? Opaque
▪ Description
– Why? Transparent
▪ Data mining vs. statistics
– Discover rather than check
▪ Data mining vs. machine learning
– Manipulating huge databases rather than a "small"
training set
“If you torture the data long enough, it will confess”
Ronald Coase
Nobel Prize in Economics, 1991