DM Chapter 1

Introduction to Data Mining & Machine Learning
(INSY3051)
Chapter 1
Introduction to Data Mining
Compiled by: Abinet T. 04/11/2024

Contents
To define DM
Database Processing vs. Data Mining
Data warehouse and Data Mining
Issues of Data mining
Architecture of DM
Motivation of DM

What is data?
• The world is data rich but information poor.

How can we analyze these data?

GOAL
1. Data repository
2. Making analysis
3. Discovering knowledge from data

What is data mining?
Data is growing at a phenomenal rate. At the same
time, users expect more sophisticated information.
• A marketing manager is no longer satisfied with a simple listing
of marketing contacts, but wants detailed information about
customers' past purchasing behavior and predictions of future
purchases.
Data mining steps in to meet such needs.
• How? Data mining uncovers hidden patterns in a database.
Data mining is a technology that uses various
techniques to discover hidden knowledge from
heterogeneous and distributed historical data stored
in large databases, warehouses and other massive
information repositories.
What is Data Mining?
• Exploration and analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover meaningful
patterns
• Definition (Fayyad et al.): The non-trivial discovery of novel,
valid, comprehensible and potentially useful patterns from data.
• What is a pattern? A relationship in the data. E.g.,
– On Thursday nights, people who buy diapers also tend to buy beer
– People with good credit ratings are less likely to have accidents
– Male consumers, 37+, income bracket 50K-75K → spend between $25 and $50
per catalog order

Cont...
Data mining is the process of analyzing large
databases using various techniques to find patterns in
data that are:
– valid: the patterns hold in general; they not only represent the
current state but also hold on new data with some certainty
– novel: the patterns are non-obvious and were not known
beforehand; they emerge as new facts
– useful: it should be possible to act on the patterns and
devise actions from them
– understandable: humans should be able to interpret and
comprehend the patterns
Why DM Now?
A. The competitive pressure is very strong
How to gain competitive advantage?
How to control the volatile market?
How to satisfy customers' (prosumers') needs?
How to manage the high turnover rate of professionals?
B. Massive data collection: Lots of data is being
collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions
Too much data & too little knowledge
• There is a need to extract knowledge (useful information)
from the massive data.
• Competitive pressures are strong, which creates a need for
useful information for prediction, classification,
clustering, association and linking.
• Faced with enormous volumes of data, human analysts
with no special tools can no longer make sense of it.
C. Powerful computers: The computing power is available
and is also affordable
• The need for improved computational engines can now
be met in a cost-effective manner with parallel
multiprocessor computer technology.
Why DM Now: DM algorithms
To provide rational decisions for business and other
applications (to classify, cluster, associate and
predict future values from data), data mining
tools and techniques are necessary.
Commercial products (for data mining) are
available
• Data mining algorithms have matured, and there
are reliable tools that often outperform older
statistical methods.
• New ideas in machine learning/statistics:
boosting, SVMs, decision trees, non-parametric Bayes, text models, etc.

Database Processing vs. Data Mining as a Data Analytics Process

What is (not) Data Mining?
• What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
• What is Data Mining?
– Certain names are more prevalent in certain locations (O’Brien,
O’Rourke, O’Reilly… in the Boston area)
– Discover groups of similar documents on the Web

Query Examples
• Database
– Find all credit applicants with first name ‘Alex’.
– Identify customers who have purchased more than Birr
10,000 in the last month.
– Find all customers who have purchased bread.
• Data Mining
– Find all credit applicants who have no credit risks.
(classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased with
bread. (association rules)
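
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the transactions, customer IDs and the `min_count` threshold are made up. The database-style function returns records that match a condition stated in advance, while the mining-style function counts which items co-occur with bread without being told which items to look for.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per customer transaction.
transactions = [
    {"customer": "C1", "items": {"bread", "butter", "milk"}},
    {"customer": "C2", "items": {"bread", "butter"}},
    {"customer": "C3", "items": {"milk", "sugar"}},
    {"customer": "C4", "items": {"bread", "butter", "sugar"}},
]

def customers_who_bought(item):
    """Database-style query: return customers matching an explicit condition."""
    return [t["customer"] for t in transactions if item in t["items"]]

def items_bought_with(item, min_count=2):
    """Mining-style question: which other items often co-occur with `item`?"""
    counts = Counter()
    for t in transactions:
        if item in t["items"]:
            counts.update(t["items"] - {item})
    # Keep only items that co-occur at least `min_count` times (assumed threshold).
    return {other: n for other, n in counts.items() if n >= min_count}

bread_buyers = customers_who_bought("bread")
bread_partners = items_bought_with("bread")
```

Here `bread_buyers` lists the matching customers, and `bread_partners` keeps only the items that appear with bread at least twice.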
What is a Data Warehouse?
• A single, complete and consistent
store of data obtained from a variety of
different sources, made available to end
users in a way they can understand and
use in a business context.
[Barry Devlin]

Cont...
▶ Defined in many different ways, but none of them is a
rigorous definition.
▶ A decision support database that is maintained
separately from the organization’s operational database
▶ Supports information processing by providing a solid
platform of consolidated, historical data for analysis.
▶ A short and more comprehensive definition is given
by Inmon:
▶ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process”.
Issues in Data Mining
• Many data mining issues have been addressed in recent
data mining research and development to a certain extent
and are now considered data mining requirements. The
issues are:
A. Mining Methodology: various aspects of mining
methodology are:
• Mining various and new kinds of knowledge
• Mining knowledge in multidimensional space
• Data mining
• Handling uncertainty, noise, or incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
Cont...
B. User Interaction: Interesting areas of research
include how to interact with a data mining system,
how to incorporate a user’s background knowledge in
mining, and how to visualize and comprehend data
mining results.
C. Efficiency and Scalability
As data amounts continue to multiply, these two
factors are especially critical.
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
Reading assignment
D. Diversity of Database Types
E. Data Mining and Society

Data Mining (DM) vs. Knowledge Discovery in
Databases (KDD)
• KDD is often used as a synonym for data mining.
• Some authors define KDD as the whole process involving:
data selection → data pre-processing (cleaning) → data
transformation → data mining → result evaluation →
visualization/representation
• Data mining is a step in the KDD process that applies data
analysis and discovery algorithms.
• KDD is the process of extracting previously unknown, valid, and
actionable (understandable) information from large databases.
It is the process of finding useful information and
patterns in data.
• DM is the use of algorithms to extract hidden patterns &
knowledge in data.
Stages in data mining: The KDD process
• Knowledge discovery in databases (KDD) is the non-trivial process of identifying
valid, potentially useful and ultimately understandable patterns in data

Cont...
• Selection: Obtain data from various heterogeneous sources such
as databases, data warehouses, files, non-electronic records, etc.
• Preprocessing: Cleanse inconsistent and incorrect data; fill in
incomplete records; predict missing values; correct erroneous and
anomalous data.
• Transformation: Convert data from different sources into a
common new format. Apply data reduction and data
categorization/binning to ease data mining.
• Mining: Apply classification or clustering techniques to obtain
predictive or descriptive models.
• Interpretation/Evaluation: Present results to the user in a meaningful
manner using various visualization and GUI strategies.
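
The five stages can be sketched as a chain of small Python functions. This is a toy illustration under stated assumptions, not a reference implementation: the two sources, the field names, the income threshold and the majority-label "model" are all hypothetical, and the evaluation reuses the same records the model was built from, which a real study would avoid.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"age": 25, "income": 3000, "buys": "no"},
            {"age": None, "income": 5200, "buys": "yes"}]
source_b = [{"age": 47, "income": 8000, "buys": "yes"},
            {"age": 33, "income": None, "buys": "no"}]

def select(*sources):
    """Selection: gather (copies of) records from several sources into one list."""
    return [dict(r) for src in sources for r in src]

def preprocess(records, fields=("age", "income")):
    """Preprocessing: fill missing numeric values with the field mean."""
    for f in fields:
        fill = mean(r[f] for r in records if r[f] is not None)
        for r in records:
            if r[f] is None:
                r[f] = fill
    return records

def transform(records, threshold=5000):
    """Transformation: bin income into two categories (assumed threshold)."""
    for r in records:
        r["income_bin"] = "high" if r["income"] >= threshold else "low"
    return records

def mine(records):
    """Mining: for each income bin, keep the most common class label."""
    model = {}
    for b in {r["income_bin"] for r in records}:
        labels = [r["buys"] for r in records if r["income_bin"] == b]
        model[b] = max(set(labels), key=labels.count)
    return model

def evaluate(model, records):
    """Evaluation: fraction of records whose label the model reproduces."""
    hits = sum(model[r["income_bin"]] == r["buys"] for r in records)
    return hits / len(records)

data = transform(preprocess(select(source_a, source_b)))
model = mine(data)
accuracy = evaluate(model, data)
```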

Data Mining Process Model
➢ SEMMA, CRISP-DM and Hybrid DM models
The most common standard in data mining is the Cross-
Industry Standard Process for Data Mining (CRISP-DM)
• CRISP-DM has the following steps:
• Business/research understanding (learning the
application domain)
• Data understanding (data selection for the problem)
• Data preparation, which involves
– collecting, cleaning, consolidating and amalgamating records,
summarizing fields, checking for data integrity, detecting
irregularities and illegal attributes, filling in missing values,
trimming outliers.
Cont...
• Data modeling, which involves
– selecting data mining tools, transforming the data if the
tools require it, generating samples for training and
testing the model and finally using the tools to build
and select a model
• Evaluating the model
• Deploying the model
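
Two of the activities named in the steps above, filling in missing values and trimming outliers during data preparation, and generating training and test samples during modeling, can be sketched as follows. The `ages` values, the plausible range 0 to 120, the 25% test fraction and the fixed seed are assumptions chosen only for illustration; CRISP-DM itself does not prescribe any particular technique.

```python
import random
from statistics import median

# Hypothetical field values; None marks a missing value, 990 is an outlier.
ages = [23, 25, None, 31, 29, 990, 27, None, 30, 26]

def fill_missing(values):
    """Replace missing values with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def trim_outliers(values, low, high):
    """Drop values outside an assumed plausible range [low, high]."""
    return [v for v in values if low <= v <= high]

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into training and test samples."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

prepared = trim_outliers(fill_missing(ages), low=0, high=120)
train, test = train_test_split(prepared)
```

The median is used for filling here because, unlike the mean, it is not pulled upward by the outlier that has not yet been trimmed.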

Cross-Industry Standard Process for Data Mining
(CRISP-DM)

Phases and tasks

Cont...
• Origins of Data Mining

DM Architecture
• Database, data warehouse, WWW or other
information repository (store data)
• Database or data warehouse server (fetch and
combine data)
• Knowledge base (turn data into meaningful groups
according to domain knowledge)
• Data mining engine (perform mining tasks)
• Pattern evaluation module (find interesting
patterns)
• User interface (interact with the user)
DM Architecture

DM Tasks and Models

Cont...
Data mining tasks are generally divided into two major categories:
• Predictive tasks - Use some attributes to predict unknown or future values of other
attributes.
– Predictive data mining tasks perform inference on the current data in
order to make predictions about new or future data.
» Classification
» Regression
» Deviation Detection
• Descriptive tasks - Find human-interpretable patterns that describe the data.
– Descriptive data mining tasks characterize the general properties of the
data in a database.
» Association Discovery
» Clustering
» etc.
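
As a small illustration of a descriptive task, the sketch below groups unlabeled values with a very simple one-dimensional k-means. The spending figures and the two starting centers are hypothetical, and k-means is used here only as one common clustering technique; the slides do not name a specific algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then
    move each center to the mean of the values assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # A center with no assigned values keeps its previous position.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spending amounts; no class labels are given.
spending = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(spending, centers=[0.0, 100.0])
```

No attribute is being predicted here; the output simply describes the data as a group of low spenders and a group of high spenders.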

Predictive Data Mining (Supervised Learning)
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find ("learn") a model for the class attribute as a
function of the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
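
The setting described above can be sketched with a 1-nearest-neighbour classifier, chosen here only because it is short; it is one of many possible models. The training records, the two attributes (age, income in thousands) and the test records are hypothetical.

```python
# Hypothetical training set: (age, income in thousands) -> class label.
training_set = [
    ((25, 20), "no"),
    ((30, 25), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def predict(record, training=training_set):
    """1-nearest-neighbour: return the class of the closest training record."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda pair: squared_distance(pair[0], record))
    return label

def accuracy(test_set):
    """Fraction of previously unseen records that receive the correct class."""
    correct = sum(predict(attrs) == label for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical unseen records with known classes, used only for evaluation.
test_set = [((28, 22), "no"), ((48, 85), "yes"), ((40, 30), "yes")]
test_accuracy = accuracy(test_set)
```

The third test record is misclassified, which illustrates why the goal is stated as "as accurately as possible" rather than perfectly.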

The nature of real world data and data
formats
• The Data Set (input to DM process)
A dataset can be seen as a collection of data
objects/examples and their attributes, representing a
concept
▶ An attribute is a property or characteristic of an object
▶ Examples: eye color of a person, temperature, etc.
▶ An attribute is also known as a variable, field, characteristic, or
feature
▶ A collection of attributes describes an object
▶ An object is also known as a record, point, case, sample,
entity, or instance
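
One simple way to hold such a data set in Python is a list of dictionaries, where each dictionary is an object (record) and each key is an attribute. The values below are hypothetical, and the helper names are introduced only for this illustration.

```python
# Hypothetical data set: each dict is one object (record); each key is an attribute.
dataset = [
    {"eye_color": "brown", "temperature": 36.6},
    {"eye_color": "blue", "temperature": 37.1},
    {"eye_color": "brown", "temperature": 38.4},
]

def attribute_names(records):
    """Return the attributes that describe the objects, in first-seen order."""
    names = []
    for record in records:
        for name in record:
            if name not in names:
                names.append(name)
    return names

def column(records, name):
    """Return the values of one attribute across all objects."""
    return [record[name] for record in records]

names = attribute_names(dataset)
temperatures = column(dataset, "temperature")
```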
Are All the “Discovered” Patterns Interesting?
• A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
• An interesting pattern represents knowledge
• Measures of interestingness
– Two types (objective vs. subjective)
• Objective: based on statistics and structures of patterns, e.g., support,
confidence, error (mean squared error, absolute error, etc.), similarity
measure, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness
(contradicting a user’s belief), novelty, actionability, etc.
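
Three of the objective measures named above can be computed directly. The sketch below uses the standard definitions of support and confidence for an association rule and of mean squared error; the transactions and the numeric values are hypothetical.

```python
# Hypothetical transactions used to score the rule {diapers} -> {beer}.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

In this toy data, two of the five transactions contain both diapers and beer, and two of the three diaper transactions also contain beer.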

Mining Large Data Sets - Motivation
• There is often information “hidden” in the
data that is not readily evident
• Human analysts may take weeks to discover
useful information
• Much of the data is never analyzed at all

▶ Are all the discovered patterns interesting?
▶ Enumerate the differences between predictive and descriptive mining.
▶ Discuss the steps in KDD.
▶ Describe how database processing differs from data mining as a
data analytics process.
▶ Why data mining?
▶ How can data mining steps solve business needs?

Thank you
