
Chapter 4: Data Mining

TBS 2020-2021

Olfa Dridi & Afef Ben Brahim


What is data mining
▪ To understand the term ‘data mining’ it is useful
to look at the literal meaning of the word ‘mine’

▪ To mine means to extract


▪ The association of this word with data suggests
an in-depth search to find additional information
which previously went unnoticed in the mass of
data available

▪ We can have the following types of models:
– Models that explain the data (e.g., a single function)
– Models that predict future data instances
– Models that summarize the data
– Models that extract the most prominent features of the data

▪ Data mining is used today by companies with a strong
consumer focus - retail, financial, communication, and
marketing organizations.

▪ It enables these companies to determine relationships
among "internal" factors such as price, product
positioning, or staff skills, and "external" factors such as
economic indicators, competition, and customer
demographics.

▪ It enables them to determine the impact on sales,
customer satisfaction, and corporate profits.

[Bill Palace, 1996]


▪ With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful means
for analysis and perhaps interpretation of such data and
for the extraction of interesting knowledge that could
help in decision-making.

▪ Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and
potentially useful information from data in databases.
While data mining and knowledge discovery in databases
(or KDD) are frequently treated as synonyms, data
mining is actually part of the knowledge discovery
process.
The following figure shows data mining
as a step in an iterative knowledge
discovery process.

The Knowledge Discovery in Databases (KDD)
The Knowledge Discovery in Databases process comprises
a few steps leading from raw data collections to some
form of new knowledge.
The iterative process consists of the following steps:
▪ Data cleaning: also known as data cleansing, it is a
phase in which noisy data and irrelevant data are
removed from the collection.
▪ Data integration: at this stage, multiple data sources,
often heterogeneous, may be combined in a common
source.
▪ Data selection: at this step, the data relevant to the
analysis is decided on and retrieved from the data
collection.

▪ Data transformation: also known as data
consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining
procedure.
▪ Data mining: it is the crucial step in which clever
techniques are applied to extract potentially useful
patterns.
▪ Pattern evaluation: in this step, strictly interesting
patterns representing knowledge are identified based
on given measures.
▪ Knowledge representation: the final phase, in which
the discovered knowledge is visually represented to
the user. This essential step uses visualization
techniques to help users understand and interpret the
data mining results.
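
The steps listed above can be sketched as a small pipeline. This is a minimal illustration in plain Python, not a production workflow: the two toy sources, the field names (`id`, `age`, `region`, `spend`), the age bands, and the choice of "average spend per age band" as the mined pattern are all assumptions made for the example.

```python
from collections import defaultdict

# Two hypothetical, heterogeneous sources (assumed for illustration).
crm = [
    {"id": 1, "age": 23, "region": "north"},
    {"id": 2, "age": 47, "region": "south"},
    {"id": 3, "age": None, "region": "north"},   # noisy record: missing age
    {"id": 4, "age": 52, "region": "south"},
]
sales = {1: 120.0, 2: 340.0, 3: 80.0, 4: 410.0}  # id -> total spend


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(records, spend_by_id):
    """Data integration: combine both sources into one record per customer."""
    return [{**r, "spend": spend_by_id[r["id"]]}
            for r in records if r["id"] in spend_by_id]


def select(records, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: discretise age into bands suitable for mining."""
    return [{"band": "under_40" if r["age"] < 40 else "40_plus",
             "spend": r["spend"]} for r in records]


def mine(records):
    """Data mining: here, simply the average spend per age band."""
    groups = defaultdict(list)
    for r in records:
        groups[r["band"]].append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}


def evaluate(patterns, min_avg):
    """Pattern evaluation: keep only patterns that pass a given measure."""
    return {band: avg for band, avg in patterns.items() if avg >= min_avg}


records = transform(select(integrate(clean(crm), sales), ["age", "spend"]))
knowledge = evaluate(mine(records), min_avg=200.0)

# Knowledge representation: present the result to the user.
for band, avg in knowledge.items():
    print(f"{band}: average spend {avg:.2f}")
```

Each function corresponds to one step of the list, so the iterative nature of the process shows up as re-running the chain with different cleaning rules, selected fields, or evaluation thresholds.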
Steps of a KDD Process
▪ Learning the application domain:
– relevant prior knowledge and goals of application
▪ Creating a target data set: data selection
▪ Data cleaning and preprocessing: (may take 60% of effort!)
▪ Data reduction and transformation:
– Find useful features, dimensionality/variable reduction,
invariant representation.
▪ Choosing functions of data mining
– summarization, classification, regression, association, clustering.
▪ Choosing the mining algorithm(s)
▪ Data mining: search for patterns of interest
▪ Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
▪ Use of discovered knowledge
Main data mining tasks
▪ Classification:
mining patterns that can classify future data
into known classes.
▪ Association rule mining:
mining any rule of the form X → Y, where X
and Y are sets of data items.
▪ Clustering:
identifying a set of similarity groups in the data

▪ Sequential pattern mining:
A sequential rule A → B says that event A
will be immediately followed by event B with
a certain confidence.
▪ Deviation detection:
discovering the most significant changes in
data
▪ Data visualization:
using graphical methods to show
patterns in data.
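
A rule X → Y from the task list above is commonly scored by its support (the fraction of transactions that contain both X and Y) and its confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both. The basket data and the function names `support` and `confidence` are assumptions made for illustration.

```python
# Hypothetical market-basket transactions (assumed for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Estimate of P(Y | X): support(X and Y together) / support(X).

    Assumes X occurs in at least one transaction.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

Here bread and milk appear together in 2 of 4 baskets, and bread appears in 3, so the rule bread → milk has support 0.50 and confidence of about 0.67.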
Why is data mining necessary?
▪ Make use of your data assets
▪ There is a big gap from stored data to
knowledge; and the transition won’t occur
automatically.
▪ Many interesting things you want to find
cannot be found using database queries:
– “Find me people likely to buy my products”
– “Who is likely to respond to my promotion?”

Data mining applications
▪ Marketing:
customer profiling and retention, identifying
potential customers, market segmentation.
▪ Fraud detection:
identifying credit card fraud, intrusion detection.
▪ Scientific data analysis
▪ Text and web mining
▪ Any application that involves a large
amount of data …

Data mining functions
▪ Association rules
▪ Sequence mining
▪ Classification (decision trees, etc.)
▪ Clustering
▪ Deviation detection

Data mining techniques
Many methods, such as
▪ Decision trees
▪ K-nearest neighbours
▪ Neural networks
▪ Genetic algorithms
▪ Hidden Markov models
▪ Time series
▪ Bayesian networks
▪ Rough and fuzzy sets
Predictive modeling
▪ A “black box” that makes predictions about
the future based on information from the
past and present

Age, Salary, CarType → Model → High/Low Risk

▪ A large number of inputs is available
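
From the caller's side, the “black box” above is just a function from input attributes to a prediction. The sketch below is a hand-written stand-in, not a learned model: the thresholds, the `sports` car type, and the function name `predict_risk` are invented purely for illustration.

```python
def predict_risk(age: int, salary: float, car_type: str) -> str:
    """Toy stand-in for a trained model: (Age, Salary, CarType) -> risk class.

    The rules and thresholds are made up; a real model would learn them
    from past data.
    """
    if car_type == "sports" and age < 25:
        return "High"
    if salary < 20_000:
        return "High"
    return "Low"


# Hypothetical applicants: (age, salary, car type).
applicants = [
    (22, 45_000.0, "sports"),
    (40, 60_000.0, "family"),
    (35, 15_000.0, "family"),
]
predictions = [predict_risk(*a) for a in applicants]
print(predictions)
```

The user of the model only sees inputs and the High/Low output. Whether the inside is a few readable rules, as here, or something much less transparent is the understandability question.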

Models
• Some models are better than others
– Accuracy
– Understandability
• Models range from easy to understand to
incomprehensible (listed from easier to harder):
– Decision trees
– Rule induction
– Regression models
– Neural networks
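
Accuracy, the first criterion above, can be measured as the fraction of predictions that match the known class labels. A minimal sketch follows. The label lists and the two "models" being compared are invented for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the true class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical true labels and the outputs of two hypothetical models.
actual = ["High", "Low", "Low", "High", "Low"]
model_a = ["High", "Low", "Low", "Low", "Low"]
model_b = ["Low", "Low", "High", "High", "Low"]

acc_a = accuracy(model_a, actual)
acc_b = accuracy(model_b, actual)
print(f"model A: {acc_a:.2f}, model B: {acc_b:.2f}")
```

On this toy data model A is the more accurate of the two. Accuracy alone says nothing about understandability, which has to be judged separately.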

Supervised vs. Unsupervised Learning
[Figure: data mining, with supervised and unsupervised systems]
▪ Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
▪ Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data

Supervised Learning
▪ Discover patterns in the data that relate data
attributes to a target (class) attribute.
• These patterns are then used to predict the
values of the target attribute in future data
instances.
▪ Supervised techniques
• Decision Tree
• Bayesian networks
• Classification rules
• Neural networks

The data and the goal
▪ Data: A set of data records (also called examples,
instances or cases) described by
– k attributes: A1, A2, … Ak.
– a class: Each example is labelled with a predefined
class.
▪ Goal: To learn a classification model from the data
that can be used to predict the classes of new
(future, or test) cases/instances.
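
The data-and-goal setup above can be illustrated with one of the simplest possible classifiers. The sketch below uses a 1-nearest-neighbour rule as a stand-in for "a classification model". The attribute values, the class names, and the function name `classify` are assumptions made for the example.

```python
import math

# Labelled training records: (attributes A1..Ak, class). Values are invented.
training = [
    ((25.0, 30.0), "High"),
    ((30.0, 35.0), "High"),
    ((50.0, 80.0), "Low"),
    ((55.0, 90.0), "Low"),
]


def classify(attributes, training):
    """Predict the class of a new case as the class of its nearest record."""
    nearest = min(training, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


# New (future, or test) cases whose class is not known in advance.
new_cases = [(26.0, 31.0), (52.0, 85.0)]
predicted = [classify(x, training) for x in new_cases]
print(predicted)
```

Here k = 2 attributes and there are two predefined classes. The "learning" is trivial (the model simply stores the labelled examples), but the roles of training data, class labels, and new cases are the same as for decision trees or neural networks.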

[Figure: Object O, described by attributes (variables) A1, A2, …, AK → supervised method → classes C1, C2, C3, …, Cn]

Unsupervised Learning

▪ Goal: To find some kind of structure in the data.


▪ TASK – vaguely defined
▪ No EXPERIENCE
▪ No PERFORMANCE (but there are some
evaluation metrics)
▪ The data have no target attribute.
▪ We want to explore the data to find some intrinsic
structures in them.
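
As a hedged illustration of finding structure without a target attribute, the sketch below runs a plain k-means clustering on unlabelled points. k-means is only one possible clustering method. The points, the starting centroids, and the function name `kmeans` are assumptions made for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabelled data: no class attribute is given (values are invented).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No labels are supplied at any point: the two groups emerge only from the distances between the points, which is the "intrinsic structure" the slide refers to.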

[Figure: Object O, described by attributes (variables) A1, A2, …, AK → unsupervised method → measures → results (????)]

▪ Unsupervised techniques
• Clustering
• Association rules
• Neural networks

Recap: Data mining process (KDD)

[Figure: Original Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation → Knowledge]

Recap: Data mining goals
▪ Prediction
– What? Opaque
▪ Description
– Why? Transparent
▪ Data mining vs. statistics
– Discover rather than check
▪ Data mining vs. machine learning
– Manipulating huge databases rather than "small"
training sets
“If you torture the data long enough, it will confess”
Ronald Coase
Nobel Prize in Economics, 1991
