Data Mining

Data Mining: Introduction
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes

 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
June 30, 2020 Data Mining: Concepts and Techniques 2

Why Not Traditional Data Analysis?
 Tremendous amount of data
 Algorithms must be highly scalable to handle data volumes on the order of
terabytes
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

What Is Data Mining?
 Data mining (knowledge discovery from data)

 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”? No. The following are not data mining:
 Simple search and query processing
 (Deductive) expert systems

KDD Process: Several Key Steps
 Learning the application domain
 relevant prior knowledge and goals of application
 Creating a target data set: data selection
 data cleaning (to remove noise and inconsistent data)
 data integration (where multiple data sources may be combined)
 data selection (where data relevant to the analysis task are retrieved
from the database)
 data transformation (where data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations)

KDD Process: Several Key Steps
 data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
 pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
 knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge
to the user)
Data mining is the core of the knowledge discovery process

Knowledge Discovery (KDD) Process
 Data mining: the core of the knowledge discovery process
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration →
Data Warehouse → Data Selection → Task-relevant Data → Data Mining →
Pattern Evaluation]
Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical User Interface;
Pattern Evaluation; Data Mining Engine, with the Knowledge-Base alongside;
Database or Data Warehouse Server; data cleaning, integration, and selection;
Database, Data Warehouse, World-Wide Web, Other Info Repositories]

Architecture: Typical Data Mining System
 Database, data warehouse, World Wide Web: This is one or a set of
databases, data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and integration techniques
may be performed on the data.
 Database or data warehouse server: Responsible for fetching the
relevant data, based on the user's data mining request.
 Knowledge base: This is the domain knowledge used to guide the
search or evaluate the interestingness of resulting patterns.
 Data mining engine: Essential to the data mining system and ideally
consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.

Architecture: Typical Data Mining System
 Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining modules
so as to focus the search toward interesting patterns.
 User interface: This module communicates between users and the
data mining system, allowing the user to
 interact with the system by specifying a data mining query or task
 provide information to help focus the search
 perform exploratory data mining based on the intermediate
results
 browse database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in
different forms

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology,
Statistics, Machine Learning, Visualization, Pattern Recognition,
Algorithms, and Other Disciplines]

Data Mining: Classification Schemes
 General functionality
 Descriptive data mining: Characterize the general
properties of the data in the database.
 Predictive data mining: Perform inference on the
current data in order to make predictions.
 Different views lead to different classifications
 Data view: Kinds of data to be mined
 Knowledge view: Kinds of knowledge to be discovered
 Method view: Kinds of techniques utilized
 Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

Data Mining Functionalities
 Concept/Class description: Characterization and discrimination
 Mining Frequent patterns, association, correlation
 Classification and prediction
 Cluster analysis
 Outlier analysis
 Trend and evolution analysis

 Concept/Class description: Characterization and discrimination
Data Characterization: A summarization of the general characteristics or
features of a target class of data. For example, a data mining system should
be able to produce a description summarizing the characteristics of customers.
Example: The characteristics of customers who spend more than $1000 a year
at a store called AllElectronics. The result can be a general profile
of these customers, such as age, employment status or credit ratings.
Data Discrimination: A comparison of the general features of target class
data objects with the general features of objects from one or a set of
contrasting classes. The user can specify the target and contrasting classes.
Example: The user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose
sales decreased by about 30% in the same period.

 Mining Frequent patterns, association, correlation

Frequent Patterns: as the name suggests, patterns that occur frequently in
data.
Association Analysis: from a marketing perspective, determining which items are
frequently purchased together within the same transaction.
Example: A rule mined from the AllElectronics transactional database.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
 X represents a customer.
 confidence = 50% means that if a customer buys a computer, there is a 50% chance
that he/she will buy software as well.
 support = 1% means that 1% of all the transactions under analysis showed that
computer and software were purchased together.
Another example:
age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD player”) [support = 2%,
confidence = 60%]
Among customers between 20 and 29 years of age with an income of $20,000 to $29,000,
there is a 60% chance that they will purchase a CD player, and 2% of all the
transactions under analysis showed customers in this age group and income range
buying a CD player.
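
The support and confidence figures in the rules above can be computed directly from a list of transactions. The following is a minimal sketch, assuming each transaction is a set of item names; the function name `support_confidence` and the four toy transactions are illustrative and not taken from the AllElectronics data.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item on the left-hand side.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy data (hypothetical): four transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]
support, confidence = support_confidence(transactions, {"computer"}, {"software"})
print(support, confidence)  # 0.25 0.5
```

Here one of four transactions contains both items (support 25%), and one of the two computer purchases also includes software (confidence 50%).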

 Classification and prediction
Classification
is the process of finding a model that describes and distinguishes data classes
or concepts for the purpose of being able to use the model to predict the
class of objects whose class label is unknown.
Classification model can be represented in various forms such as

 IF-THEN Rules
 A decision tree
 Neural network
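
As an illustration of the IF-THEN form, a classification model can be written as an ordered list of rules with a default class. This is a minimal sketch only: the attributes (`age`, `income`, `student`), the rules themselves, and the class labels are hypothetical and would normally be learned from labeled training data.

```python
def classify(customer):
    """Predict a class label with hand-written IF-THEN rules (hypothetical)."""
    # IF age < 30 AND income is high THEN buys_computer
    if customer["age"] < 30 and customer["income"] == "high":
        return "buys_computer"
    # IF age >= 30 AND customer is a student THEN buys_computer
    if customer["age"] >= 30 and customer["student"]:
        return "buys_computer"
    # Default rule when no other rule fires.
    return "does_not_buy"


new_customer = {"age": 25, "income": "high", "student": False}
print(classify(new_customer))  # buys_computer
```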


 Clustering Analysis
 Clustering analyses data objects without consulting a known class
label.
 Given a collection of objects, put objects into groups based on
similarity.
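
Grouping by similarity can be sketched with a small k-means loop on one-dimensional values. This is one clustering technique among many and is chosen here only for brevity; the function name `kmeans_1d`, the starting centroids, and the toy ages are assumptions for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids (plain k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins its nearest centroid.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean (unchanged if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return clusters, centroids


ages = [21, 23, 25, 48, 50, 52]  # toy data, no class labels
clusters, centroids = kmeans_1d(ages, centroids=[20, 60])
print(clusters)   # [[21, 23, 25], [48, 50, 52]]
print(centroids)  # [23.0, 50.0]
```

No class label is consulted at any point; the two groups emerge from the distances alone.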

 Outlier Analysis
 Outlier analysis: A database may contain data objects that do not
comply with the general behavior or model of the data. These data
objects are outliers.
 Example: Finding fraudulent usage of credit cards. Outlier
analysis may uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account. Outlier
values may also be detected with respect to the location and type of
purchase or the purchase frequency.
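
The credit card example can be sketched as a comparison of each charge against the account's typical charge. This is a simplified sketch, not a production fraud detector: the rule (flag anything above `factor` times the median) and the default `factor=5.0` are assumptions made for illustration, as are the charge amounts.

```python
from statistics import median


def flag_large_charges(charges, factor=5.0):
    """Return charges more than `factor` times the account's median charge."""
    typical = median(charges)
    return [amount for amount in charges if amount > factor * typical]


# Toy charges for a single account (hypothetical values).
account_charges = [25.0, 40.0, 32.5, 18.0, 2900.0, 41.0]
print(flag_large_charges(account_charges))  # [2900.0]
```

The median is used instead of the mean so that the extreme charge does not inflate the baseline it is compared against.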

 Trend and evolution analysis
 Trend and deviation: e.g., regression analysis

 Sequential pattern mining: e.g., digital camera → large SD
memory
 Periodicity analysis
 Similarity-based analysis

Learning
To learn: to gain knowledge through study, experience, or being
taught.
Types of learning:
Supervised learning: As the name indicates, a supervisor acts as a
teacher: the algorithm is trained on data that is already labeled
with the correct answer. E.g.: Classification
Unsupervised learning: The training of a machine using information
that is neither classified nor labeled, allowing the algorithm to
act on that information without guidance. E.g.: Clustering

Major Issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex
and data is not always available at one place. It needs to be integrated from
various heterogeneous data sources. These factors also create some issues.
Here, we will discuss the major issues regarding −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
[Figure: diagram of the major issues]

Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing knowledge: knowledge fusion
 User interaction
 Data mining query languages
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy

Data Mining Applications
Data mining is highly useful in the following domains −
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration, sports,
astronomy, and Internet Web Surf-Aid.

Market Analysis and Management
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people
buy what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the
best products for different customers. It uses prediction to find the factors
that may attract new customers.
Cross Market Analysis − Data mining performs association/correlation
analysis between product sales.

Market Analysis and Management
 Target Marketing − Data mining helps to find clusters of model
customers who share the same characteristics such as interests, spending
habits, income, etc.
 Determining Customer Purchasing Patterns − Data mining helps in
determining customer purchasing patterns.
 Providing Summary Information − Data mining provides us various
multidimensional summary reports.

Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis
and prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources
and spending.
Competition − It involves monitoring competitors and market directions.

Fraud Detection
 Data mining is also used in credit card services and
telecommunications to detect fraud. For fraudulent telephone calls,
it helps analyze the destination of the call, the duration of the
call, the time of day or week, etc. It also analyzes patterns that
deviate from expected norms.

 Fraudulent pattern analysis and the identification of unusual patterns
 Identify potentially fraudulent users and their atypical usage
patterns
 Detect attempts to gain fraudulent entry to customer accounts
 Discover unusual patterns which may need special attention
 Multidimensional association and sequential pattern analysis
 Find usage patterns for a set of communication services by
customer group, by month, etc.
 Promote the sales of specific services
 Improve the availability of particular services in a region
 Use of visualization tools in telecommunication data analysis

Biomedical Data Analysis
 DNA sequences: 4 basic building blocks (nucleotides): adenine (A),
cytosine (C), guanine (G), and thymine (T).
 Gene: a sequence of hundreds of individual nucleotides arranged in a
particular order
 Humans have around 30,000 genes
 Tremendous number of ways that the nucleotides can be ordered and
sequenced to form distinct genes
 Semantic integration of heterogeneous, distributed genome databases
 Current: highly distributed, uncontrolled generation and use of a wide
variety of DNA data

 Data cleaning and data integration methods developed in data mining
will help

DNA Analysis: Examples
 Similarity search and comparison among DNA sequences
 Compare the frequently occurring patterns of each class (e.g., diseased
and healthy)
 Identify gene sequence patterns that play roles in various diseases
 Association analysis: identification of co-occurring gene sequences

 Most diseases are not triggered by a single gene but by a combination of
genes acting together

 Association analysis may help determine the kinds of genes that are
likely to co-occur in target samples

 Path analysis: linking genes to different disease development stages
 Different genes may become active at different stages of the disease
 Develop pharmaceutical interventions that target the different stages
separately
 Visualization tools and genetic data analysis

Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
 e.g., occupation=“ ”
 noisy: containing errors or outliers

 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or

names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning

Multi-Dimensional Measure of Data Quality
 A well-accepted multidimensional view:

 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
 Broad categories:
 Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data

Forms of Data Preprocessing

Mining Data Descriptive Characteristics
 Motivation
 To better understand the data: central tendency, variation
and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
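
The dispersion measures listed above (median, max, min, quantiles, outliers, variance) can be computed with Python's standard `statistics` module. This is a minimal sketch: the `describe` helper and the sample values are illustrative, and the 1.5 × IQR outlier rule is a common boxplot convention assumed here rather than something stated on the slide.

```python
from statistics import median, pvariance, quantiles


def describe(values):
    """Return a boxplot-style summary of a list of numbers."""
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4)  # quartile cut points
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
    outliers = [v for v in ordered if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "max": ordered[-1],
        "variance": pvariance(ordered),
        "outliers": outliers,
    }


summary = describe([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120])
print(summary["median"], summary["outliers"])  # 24 [120]
```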
Data Cleaning
 Importance
 “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball

 “Data cleaning is the number one problem in data
warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
Missing Data
 Data is not always available

 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 history or changes of the data were not registered
 Missing data may need to be inferred.

How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of
missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: inference-based such as Bayesian formula
or decision tree
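
Filling a missing value with the attribute mean of samples in the same class can be sketched as follows. This is a minimal sketch, assuming records are dictionaries and missing values are stored as `None`; the function name, the field names `income` and `risk`, and the records are hypothetical. It assumes at least one known value exists for the attribute.

```python
from statistics import mean


def fill_with_class_mean(rows, attribute, label):
    """Replace None in `attribute` with the mean over rows sharing the same class label."""
    known_by_class = {}
    for row in rows:
        if row[attribute] is not None:
            known_by_class.setdefault(row[label], []).append(row[attribute])
    # Plain attribute mean, used as a fallback for classes with no known values.
    overall = mean(v for values in known_by_class.values() for v in values)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input records untouched
        if new_row[attribute] is None:
            known = known_by_class.get(new_row[label])
            new_row[attribute] = mean(known) if known else overall
        filled.append(new_row)
    return filled


customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
print([row["income"] for row in filled])  # [30000, 50000, 40000, 90000, 90000]
```

Using only the overall mean would assign 56,667 (rounded) to both missing incomes; the class-based version keeps the low-risk and high-risk groups apart, which is why the slide calls it "smarter".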

Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems that require data cleaning

 duplicate records
 incomplete data
 inconsistent data

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency or
equal-width) bins
 then one can smooth by bin means, smooth by bin
medians, smooth by bin boundaries, etc.

 Regression
 Data can be smoothed by fitting the data into
regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection

 detect suspicious values and check by human (e.g.,
deal with possible outliers)

Simple Discretization Methods: Binning
 Equal-width (distance) partitioning

 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B - A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
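
The two partitioning schemes can be contrasted in a short sketch. The function names and the sample price values are illustrative assumptions; the equal-width version applies W = (B - A)/N directly, and the equal-depth version splits the sorted values into groups of near-equal size.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of near-equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(equal_depth_bins(prices, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With equal width, the bins hold 3, 3, and 6 values; with equal depth, each holds exactly 4, which is why equal-depth partitioning copes better with skewed data.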
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
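
The two smoothing results above can be reproduced with a short sketch. The function names are illustrative; bin means are rounded to the nearest integer to match the slide, and ties between the two boundaries go to the lower one, which is an arbitrary choice made here.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```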

Regression
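[Figure: scatter plot with fitted line y = x + 1; the observed value Y1 at X1
is replaced by the value Y1’ on the line]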
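
Smoothing by regression can be sketched with an ordinary least-squares line fit: each noisy value is replaced by the value the fitted line predicts. This is a minimal sketch assuming a single predictor; the function name `fit_line` and the sample points (scattered around y = x + 1) are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy points scattered around y = x + 1 (hypothetical values).
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.1, 4.9, 6.0]

slope, intercept = fit_line(xs, ys)
# Smoothed values: the points on the fitted line at each x.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))  # 0.97 1.09
```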

Cluster Analysis
