
Topics to be covered

Topics and areas covered:

1. Introduction: data mining, motivation, challenges, other issues, data warehousing, and the KDD/DM process model
2. Data preparation: major tasks in data preprocessing; data cleaning; data integration; data reduction; data transformation
3. Classification: data mining problem definition, classification concepts and algorithms, and model evaluation
4. Association discovery: problem definition, frequent itemset generation, rule generation, etc.
5. Clustering analysis: introduction and different clustering algorithms
6. Web mining: web mining applications and techniques
Tools
• Weka
• R
• MATLAB
• Python
• Rminer
Datasets
1. UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

1. Introduction

Cont..
Data: a set of discrete facts about events (conceptualization).
• No meaning is attached to it, so it may have multiple meanings.
• Example: what does “Alex” mean?
Information: an aggregation of data (contextualization) that makes decision making easier.
• Meaning is attached and contextualized.
• Answers the questions: what, who, when, where.
Knowledge: know-how and understanding gained through experience. It includes facts about real-world entities and the relationships between them.
• Answers the “how” question.
Cont..

Wisdom: embodies principles, insight, and morals by integrating knowledge.
• Answers the “why” question.
Truth: making the mind think and believe in doing what is true for all, not only for a narrow few.
Motivation: Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  • Data collection and data availability
    • Automated data collection tools, database systems, the Web, a computerized society
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Cont..
• Too much data and too little knowledge: there is a need to extract knowledge (useful information) from the massive data.
• Competitive pressure is strong, which creates demand for useful information for prediction.
• Faced with enormous volumes of data, human analysts without special tools can no longer make sense of it.
• Data mining can automate the process of finding patterns and relationships in raw data, and the results can be used for decision support. That is why data mining is used, especially in science and business.
• If we know how to reveal the valuable knowledge hidden in raw data, data might be one of our most valuable assets.
• Data mining involves retrospective analysis: it extracts diamonds of knowledge from historical data and predicts future outcomes.
CONT..

The world is data rich but information poor.

Data mining: searching for knowledge (interesting patterns) in data.

What is Data Mining?
Data mining (knowledge discovery from data) is a technology that uses various techniques to discover hidden knowledge in heterogeneous and distributed historical data stored in large databases, warehouses, and other massive information repositories. The goal is to find patterns in data that are:
• valid: they not only represent the current state, but also hold on new data with some certainty
• novel: non-obvious to the system, and generated as new facts
• useful: it should be possible to act on the item or problem
• understandable: humans should be able to interpret the pattern

Knowledge Discovery (KDD) Process

• Data mining is the core of the knowledge discovery process.

[Figure: the KDD process. Databases → data cleaning and data integration → data warehouse → data selection → task-relevant data → data mining → pattern evaluation]
Why DM Now: The need for Business Intelligence

• Business intelligence is needed to control the complex and dynamic business environment:
  • How to gain competitive advantage under such strong competitive pressure?
  • How to control the volatile market (product, price, promotion, and place)?
  • How to satisfy the needs of users (such as customers and consumers) who are professional?
  • How to manage the high turnover rate of professionals, which diminishes the organization's experience?
Why DM Now: Massive data collection
• Massive data collection: large databases (data warehouses) are growing at unprecedented rates to manage the explosive growth in stored data.
• Examples of massive data sets:
  • Google: on the order of 10 billion Web pages indexed; hundreds of millions of site visitors per day
  • MEDLINE text database: 17 million published articles
  • Retail transaction data (eBay, Amazon, Wal-Mart): on the order of 100 million transactions per day
  • Visa, MasterCard: similar or larger numbers
Why DM Now: DM algorithms
• Commercial data mining products are available.
• Data mining algorithms have matured, and there are reliable tools that consistently outperform older statistical methods.
• New ideas in machine learning and statistics: boosting, SVMs, decision trees, non-parametric Bayes, text models, etc.
• Around 20-30 mining tool vendors exist.
• Many embedded products exist, for example in:
  • Fraud detection
  • Customer relationship management
  • Health care
  • E-commerce applications
Example: Why Data Mining
• Customer relationship management:
  • Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
• Credit ratings/targeted marketing:
  • Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  • Identify likely responders to sales promotions.
• Fraud detection/network intrusion detection:
  • Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
• Data mining helps extract such useful information.

Database Processing vs. Data Mining

Query
• Database processing: well defined; uses Structured Query Language (SQL)
• Data mining: poorly defined; no precise query language
• Comment: the data miner might not know exactly what he wants to see

Data
• Database processing: operational data
• Data mining: non-operational data
• Comment: the data have been cleansed and modified to better support the mining process

Output
• Database processing: precise; a subset of the database
• Data mining: not a subset of the database
• Comment: the output is some hidden useful patterns and knowledge in the database
Query Examples
• Database
  • Find all credit applicants with the first name “Alex”.
  • Identify customers who have purchased more than Birr 10,000 in the last month.
  • Find all customers who have purchased bread.
• Data mining
  • Find all credit applicants who have no credit risks. (classification)
  • Identify customers with similar buying habits. (clustering)
  • Find all items that are frequently purchased with bread. (association rules)
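
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the transactions, the `min_share` threshold, and all variable names are assumptions made up for the example, not part of the course material. The first query returns a subset of the stored rows; the second summarizes a pattern that is not itself a row in the data.

```python
from collections import Counter

# Hypothetical transactions (made up for illustration): each set is one basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

# Database-style query: the condition is precise and the answer is a subset of the rows.
bread_baskets = [t for t in transactions if "bread" in t]

# Mining-style question: which items co-occur with bread often enough to be interesting?
co_counts = Counter(item for t in bread_baskets for item in t if item != "bread")
min_share = 0.5  # assumed threshold: the item appears in at least half of the bread baskets
frequent_with_bread = sorted(
    item for item, n in co_counts.items() if n / len(bread_baskets) >= min_share
)

print(len(bread_baskets))
print(frequent_with_bread)
```

The threshold is the part a database query does not have: the miner has to decide what counts as "frequently", which is one reason mining queries are described as poorly defined.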
Data Mining works with Data Warehouse
• The data warehouse provides the enterprise with a memory.
• Data mining provides the enterprise with intelligence.
Data Warehouse
• A data warehouse is a decision support database that is maintained separately from the organization’s operational database.
• A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.
• It focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Cont..
• Integrated: a centralized, consolidated database that integrates data derived from the entire organization.
  • Consolidates data from multiple, diverse sources with diverse formats.
  • Constructed by integrating multiple heterogeneous data sources: relational databases, flat files, on-line transaction records.
  • Data cleaning and data integration techniques are applied.
    • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
    • E.g., hotel price: currency, tax, whether breakfast is covered, etc.
    • When data is moved to the warehouse, it is converted.
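
As a minimal sketch of the hotel-price example, the conversion step might look like the following. Everything here is an assumption for illustration: the `SourceRecord` fields, the exchange rates in `RATES_TO_USD`, and the warehouse convention (USD, tax included) are hypothetical, not taken from any real system.

```python
from dataclasses import dataclass

# Assumed exchange rates to a single warehouse currency (USD); illustrative figures only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.2}

@dataclass
class SourceRecord:
    hotel: str
    price: float
    currency: str
    tax_included: bool
    tax_rate: float  # e.g. 0.10 for 10%

def to_warehouse_price(rec: SourceRecord) -> float:
    """Convert a source price to the assumed warehouse convention: USD, tax included."""
    price = rec.price if rec.tax_included else rec.price * (1 + rec.tax_rate)
    return round(price * RATES_TO_USD[rec.currency], 2)

# Two sources quote "100" but mean different things.
records = [
    SourceRecord("A", 100.0, "USD", tax_included=False, tax_rate=0.10),
    SourceRecord("B", 100.0, "EUR", tax_included=True, tax_rate=0.10),
]
converted = {r.hotel: to_warehouse_price(r) for r in records}
print(converted)
```

Both sources store the same number, yet after conversion the prices differ, which is why the attribute measures must be made consistent before the data are loaded.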
• Subject-oriented: the data warehouse contains data organized by topic.
  • E.g., sales, marketing, finance, etc.
• Time-variant: in contrast to the operational database, which focuses on current transactions, the data warehouse represents the flow of data through time.
  • The data warehouse contains data that reflect what happened last week, last month, over the past five years, and so on.

Cont..
• Nonvolatile: once data enter the data warehouse, they are never removed, because the data in the warehouse represent the company’s entire history.
  • Operational updates of data do not occur in the data warehouse environment.
  • It does not require transaction processing, recovery, or concurrency control mechanisms.
  • It requires only two data access operations: initial loading of data and access of data.
  • Because data is added all the time, the warehouse keeps growing.

Data Warehouse Stores Heterogeneous Data

[Figure: relational databases, hierarchical databases, network databases, flat files, and spreadsheets feed a data extraction process and a data cleanup process, which load the data warehouse; end users access it through query and analysis tools.]
Data Mart
• A subset of a data warehouse for small and medium-size businesses or departments within larger companies.
• Its scope is confined to specific, selected groups, such as a marketing data mart.

• Motivation
  • Only a small portion of data cells may be “above the water”.
  • Only “interesting” data cells, those above a certain threshold.

Data Warehouse Back-End Tools and Utilities
• Data extraction: get data from multiple, heterogeneous, and external sources
• Data cleaning: detect errors in the data and rectify them when possible
• Data transformation: convert data from legacy or host format to warehouse format
• Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh: propagate the updates from the data sources to the warehouse

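
The back-end steps listed above can be sketched as a chain of small Python functions. This is a toy illustration under stated assumptions: the source rows, the field names `id` and `amount`, and the in-memory `warehouse` dictionary are all hypothetical stand-ins for real sources and a real warehouse.

```python
# Hypothetical raw rows from two sources; field names and values are made up.
source_a = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "n/a"}]
source_b = [{"id": 3, "amount": "7"}]

def extract(*sources):
    """Extraction: gather rows from several sources into one list."""
    return [row for src in sources for row in src]

def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Transformation: convert to the warehouse format (numeric amount)."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    """Load: append rows, keep them sorted by id, and maintain a summary total."""
    warehouse["rows"] = sorted(warehouse["rows"] + rows, key=lambda r: r["id"])
    warehouse["total"] = sum(r["amount"] for r in warehouse["rows"])
    return warehouse

warehouse = load({"rows": [], "total": 0.0}, transform(clean(extract(source_a, source_b))))
# Refresh: propagate a new source row through the same steps.
warehouse = load(warehouse, transform(clean(extract([{"id": 4, "amount": "2.5"}]))))
print(warehouse["total"])
```

Here cleaning simply discards the unparseable row; a real cleaning step would try to rectify errors where possible, as the list above says.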
Database & data warehouse: Differences
• The data warehouse and operational environments are separated. The data warehouse receives its data from operational databases.
  • The data warehouse environment is characterized by read-only transactions over very large data sets.
  • The operational environment is characterized by numerous update transactions on a few data entities at a time.
  • The data warehouse contains historical data over a long time horizon.
• Ultimately, information is created from data warehouses. Such information becomes the basis for rational decision making.
• The data in the data warehouse are analyzed to discover previously unknown data characteristics, relationships, dependencies, or trends.
Data Mining in Business Intelligence
• Business intelligence (BI) is information about a company's past performance that is used to help predict the company's future performance.
  • It can reveal emerging trends from which the company might profit.
• BI takes advantage of data mining and data warehousing to help organizations gather their information in a more timely and more valuable manner.
• BI:
  • keeps the organization informed about market trends,
  • alerts it to new market potential,
  • helps it determine how competitors are doing.
• Without such information and knowledge, the organization may suffer false growth or setbacks.
Data Mining in Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases from bottom to top. Layers, top to bottom:]
• Decision making
• Data presentation: visualization techniques
• Data mining: information discovery
• Data exploration: statistical summary, querying, and reporting
• Data preprocessing/integration, data warehouses
• Data sources: paper, files, Web documents, scientific experiments, database systems
Roles shown alongside the layers, top to bottom: end user, business analyst, data analyst, DBA.
Data Mining vs. Knowledge Discovery in Databases
• KDD is often used as a synonym for data mining.
• Some authors define KDD as the whole process: data selection → data pre-processing (cleaning) → data transformation → mining → result evaluation → visualization.
• Data mining, on the other hand, refers to the modeling step, which uses various techniques to extract useful information/patterns from the data.
• KDD is the process model for finding useful information and patterns in a database.
• DM is the use of algorithms to extract hidden patterns and knowledge from data sets.
Stages in data mining: The KDD process
KDD Process: Several Key Steps
• Learning the application domain
  • Relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
  • Find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining
  • Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  • Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

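
To make the order of the key steps concrete, here is a minimal sketch that chains selection, cleaning, transformation, mining, and pattern evaluation on a toy data set. The records, the function names, and the `min_count` threshold are all assumptions for illustration; pair counting stands in for whatever mining algorithm would actually be chosen.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer, basket). None marks a missing basket.
raw = [
    ("c1", ["Bread", "milk "]),
    ("c2", None),
    ("c3", ["bread", "Milk"]),
    ("c4", ["bread", "jam"]),
]

def select(records):
    """Data selection: keep only the field relevant to the task (the basket)."""
    return [basket for _, basket in records]

def clean(baskets):
    """Data cleaning: drop missing baskets."""
    return [b for b in baskets if b is not None]

def transform(baskets):
    """Transformation: normalize item names into one canonical set per basket."""
    return [frozenset(item.strip().lower() for item in b) for b in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(sorted(b), 2))
    return counts

def evaluate(pair_counts, min_count=2):
    """Pattern evaluation: keep only pairs seen at least min_count times."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

patterns = evaluate(mine(transform(clean(select(raw)))))
print(patterns)
```

Without the cleaning and transformation steps, "Bread", "bread", and "milk " would be counted as different items, which is one reason preprocessing takes so much of the effort.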
Are All the “Discovered” Patterns Interesting?

• Data mining may generate thousands of patterns; not all of them are interesting.
  • Suggested approach: human-centered, query-based, focused mining
• Interestingness measures
  • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
• Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
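
The two objective measures named above, support and confidence, can be computed directly from a list of transactions. The sketch below uses the standard definitions (support is the fraction of transactions containing an itemset; confidence of A → C is support(A ∪ C) / support(A)); the transactions themselves are made up for illustration.

```python
# Hypothetical transactions, made up for illustration.
transactions = [
    {"diaper", "beer", "chips"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A | C) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(f"diaper -> beer [support={rule_support:.0%}, confidence={rule_confidence:.0%}]")
```

A rule can score well on both measures and still be uninteresting to the user, which is why the subjective measures are listed alongside the objective ones.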

Data Mining: Concepts


Hybrid Knowledge Discovery Process
Data Mining implementation issues
• Scalability
  • Data mining techniques must perform well on massive real-world data sets.
  • Techniques should also work regardless of the amount of available main memory.
• Real-world data
  • Real-world data are noisy and have many missing attribute values.
  • Algorithms should be able to work even in the presence of these problems.
• Updates
  • A database cannot be assumed to be static; the data change frequently.
  • However, many data mining algorithms work with static data sets, which requires that the algorithm be completely rerun any time the database changes.
Data Mining implementation issues
• High dimensionality
  • A conventional database schema may be composed of many different attributes. The problem is that not all attributes may be needed to solve a given DM problem.
  • The use of unnecessary attributes may increase the overall complexity and decrease the efficiency of an algorithm.
  • The solution is dimensionality reduction (reducing the number of attributes). But determining which attributes are not needed is a tough task!
• Overfitting
  • The size and representativeness of the dataset determine whether the model built from a given database state also fits future database states.
  • Overfitting occurs when the model does not fit future states, which can be caused by a small or unbalanced training database.
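
A small sketch can show the gap between fitting the current data and fitting future data. The "model" below is deliberately extreme: it memorizes the training examples and falls back to the majority label for anything unseen. The data sets and the memorizing model are assumptions chosen to make the effect visible, not a method from the course.

```python
# Hypothetical data: the true label is 1 when x >= 5.
# The training set is small, unbalanced, and contains one noisy label.
train = [(1, 0), (2, 0), (6, 1), (7, 0)]  # (7, 0) is noise
test = [(0, 0), (3, 0), (4, 0), (5, 1), (7, 1), (8, 1), (9, 1)]

def fit_memorizer(data):
    """A deliberately overfit model: memorize every training example exactly."""
    table = dict(data)
    labels = [y for _, y in data]
    default = max(set(labels), key=labels.count)  # majority label for unseen x
    return lambda x: table.get(x, default)

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

model = fit_memorizer(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(train_acc, test_acc)
```

The model is perfect on the data it was built from and much worse on new data, which is the pattern the overfitting bullet describes.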
Data Mining implementation issues
• Ease of use of the DM tool
  • Since data mining problems are often not precisely stated, interfaces may be needed for both domain and technical experts.
  • Although some techniques may work well, they may not be accepted by users if they are difficult to use or understand.
Focus area
• Designing efficient DM algorithms and architectures that scale with the number of features and instances extracted from high-dimensional databases
• Data miners that can handle large, heterogeneous data, including multimedia data, spatial data, Web usage data, …
• Presentation of DM results: to easily view and understand the output of DM algorithms, there is a need for knowledge representation (decision trees, rules, equations, semantic networks) and visualization techniques (such as graphs, bar charts, etc.)
• Integration of DM functionality into traditional DBMSs in order to design an intelligent database
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, and high-performance computing]

Why Confluence of Multiple Disciplines?
• Tremendous amount of data
  • Algorithms must be highly scalable to handle, for example, terabytes of data
• High dimensionality of data
  • A micro-array may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks, and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text, and Web data
  • Software programs, scientific simulations
• New and sophisticated applications

Data Mining Functionalities
• Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
  • Diaper → Beer [support 0.5%, confidence 75%] (correlation or causality?)
• Classification and prediction
  • Construct models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown or missing numerical values

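
The "classify cars based on gas mileage" example can be sketched with the simplest possible model, a single learned threshold (a decision stump). The car data, the class names, and the helper names are assumptions made up for illustration; real classifiers covered later in the course are more capable.

```python
# Hypothetical labeled cars: (miles per gallon, class). Values are made up.
cars = [(15, "low"), (18, "low"), (22, "low"), (30, "high"), (35, "high"), (40, "high")]

def fit_stump(data):
    """Pick the mpg threshold that misclassifies the fewest training cars."""
    values = sorted(x for x, _ in data)
    # Candidate thresholds: midpoints between consecutive mpg values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(th):
        return sum(("high" if x > th else "low") != y for x, y in data)

    return min(candidates, key=errors)

threshold = fit_stump(cars)

def classify(mpg):
    """Predict the class of a new car from the learned threshold."""
    return "high" if mpg > threshold else "low"

print(threshold, classify(20), classify(33))
```

The model is constructed from labeled examples and then applied to cars it has not seen, which is what distinguishes classification from the clustering task described next.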
Data Mining Functionalities (2)
• Cluster analysis
  • Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  • Maximize intra-class similarity and minimize inter-class similarity
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Noise or exception? Useful in fraud detection and rare-event analysis
• Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory
  • Periodicity analysis
  • Similarity-based analysis
• Other pattern-directed or statistical analyses

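
Cluster analysis with unknown class labels can be illustrated with a two-cluster k-means on one-dimensional data. This is a minimal sketch, assuming made-up house prices, a fixed number of iterations, and at least two distinct values in the input; k-means is used here as one common clustering technique, not as the method the slides prescribe.

```python
# Hypothetical house prices (in thousands); values are made up.
prices = [100, 110, 120, 300, 310, 330]

def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on 1-D data, starting from the smallest and largest value.

    Assumes the input contains at least two distinct values.
    """
    centers = [min(values), max(values)]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min((0, 1), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

centers, clusters = kmeans_1d(prices)
print(centers, clusters)
```

No labels were supplied: the two groups emerge because values within a group are close to each other and far from the other group, which is the intra-class versus inter-class similarity criterion stated above.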
Why Data Mining? Potential Applications
• Data analysis and decision support
  • Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (newsgroups, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis