
Data Mining:

Introduction

Why Data Mining?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube

June 30, 2020 Data Mining: Concepts and Techniques 2


Why Not Traditional Data Analysis?
* Tremendous amount of data
  - Algorithms must be highly scalable to handle terabytes of data
* High dimensionality of data
  - Microarray data may have tens of thousands of dimensions
* High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
* New and sophisticated applications
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems



What Is Data Mining?

* Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
* Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
* Watch out: Is everything “data mining”? These are not:
  - Simple search and query processing
  - (Deductive) expert systems



KDD Process: Several Key Steps
* Learning the application domain
  - Relevant prior knowledge and goals of the application
* Creating a target data set
  - Data cleaning (to remove noise and inconsistent data)
  - Data integration (where multiple data sources may be combined)
  - Data selection (where data relevant to the analysis task are retrieved from the database)
  - Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)



KDD Process: Several Key Steps (continued)
* Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
* Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
* Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Data mining is the core of the knowledge discovery process.
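
The KDD steps listed on these slides can be sketched as a chain of small functions. This is only a toy illustration: the records, field names, thresholds, and the "mining" step (picking the age decade with the highest total spend) are all assumptions made for the example and are not from the slides.

```python
# Hypothetical toy records; field names are illustrative assumptions.
store_a = [{"id": 1, "age": 25, "spend": 120.0},
           {"id": 2, "age": None, "spend": 80.0},   # incomplete record
           {"id": 3, "age": 41, "spend": -10.0}]    # noisy record
store_b = [{"id": 4, "age": 33, "spend": 300.0},
           {"id": 5, "age": 28, "spend": 150.0}]

def clean(records):
    """Data cleaning: drop records with missing or impossible values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records):
    """Data selection: keep only the attributes relevant to the task."""
    return [(r["age"], r["spend"]) for r in records]

def transform(rows):
    """Data transformation: aggregate spend by age decade."""
    totals = {}
    for age, spend in rows:
        decade = (age // 10) * 10
        totals[decade] = totals.get(decade, 0.0) + spend
    return totals

def mine(totals):
    """Data mining (stand-in): find the decade with the highest spend."""
    return max(totals.items(), key=lambda kv: kv[1])

def evaluate(pattern, min_spend=200.0):
    """Pattern evaluation: keep the pattern only if it passes a threshold."""
    return pattern if pattern[1] >= min_spend else None

task_data = transform(select(integrate(clean(store_a), clean(store_b))))
pattern = evaluate(mine(task_data))

# Knowledge presentation: report the surviving pattern to the user.
if pattern is not None:
    print(f"Customers in their {pattern[0]}s spend the most: {pattern[1]:.2f}")
```

In a real system each stage would be far richer; the point here is only the order of the steps and the fact that mining is one stage among several.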



Knowledge Discovery (KDD) Process

* Data mining: core of the knowledge discovery process

[Figure: the KDD process as a pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

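Architecture: Typical Data Mining System

[Figure: layered architecture. From top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, and Other Info Repositories. A Knowledge Base sits alongside the Pattern Evaluation and Data Mining Engine layers.]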



Architecture: Typical Data Mining System
* Database, data warehouse, World Wide Web: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and integration techniques may be performed on the data.
* Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
* Knowledge base: the domain knowledge used to guide the search or evaluate the interestingness of resulting patterns.
* Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.



Architecture: Typical Data Mining System

* Pattern evaluation module: typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
* User interface: communicates between users and the data mining system, allowing the user to
  - interact with the system by specifying a data mining query or task
  - provide information to help focus the search
  - perform exploratory data mining based on the intermediate results
  - browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms



Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]



Data Mining: Classification Schemes
* General functionality
  - Descriptive data mining: characterize the general properties of the data in the database.
  - Predictive data mining: perform inference on the current data in order to make predictions.
* Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted
Data Mining: On What Kinds of Data?
* Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
* Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World-Wide Web



Data Mining Functionalities
 Concept/Class description: Characterization and discrimination
 Mining Frequent patterns, association, correlation
 Classification and prediction
 Cluster analysis
 Outlier analysis
 Trend and evolution analysis



Data Mining Functionalities
* Concept/class description: characterization and discrimination

Data characterization: a data mining system should be able to produce a description summarizing the characteristics of a target class of data, such as a group of customers.

Example: The characteristics of customers who spend more than $1000 a year at a store called AllElectronics. The result can be a general profile of these customers, such as age, employment status or credit rating.

Data discrimination: a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.

Example: The user may want to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% over the same period.



Data Mining Functionalities

* Mining frequent patterns, association, correlation

Frequent patterns: as the name suggests, patterns that occur frequently in data.
Association analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example: a rule mined from the AllElectronics transactional database:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
  - X represents a customer.
  - Confidence = 50% means that if a customer buys a computer, there is a 50% chance that he/she will buy software as well.
  - Support = 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
Another example:
age(X, “20...29”) ^ income(X, “20K...29K”) => buys(X, “CD player”) [support = 2%, confidence = 60%]
For customers aged 20 to 29 with an income of $20,000 to $29,000, there is a 60% chance that they will purchase a CD player, and 2% of all the transactions under analysis showed that customers in this age and income group bought a CD player.
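
Support and confidence as described above can be computed directly from a list of transactions. The sketch below uses a made-up four-transaction database (an assumption for illustration, so the numbers differ from the 1% and 50% on the slide); the function names are likewise hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical toy database: each transaction is a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
print(f"buys(X, computer) => buys(X, software) "
      f"[support = {s:.0%}, confidence = {c:.0%}]")
```

Here one of four transactions contains both items (support 25%), and one of the two computer purchases also includes software (confidence 50%).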



Data Mining Functionalities
* Classification and prediction

Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

A classification model can be represented in various forms, such as
  - IF-THEN rules
  - a decision tree
  - a neural network



Data Mining Functionalities
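
A classification model in IF-THEN rule form can be sketched as a plain function. The attributes (age group, income) and the class labels A, B, C below are hypothetical, chosen only to show the shape of such a model; a decision tree would encode the same tests as nested branches.

```python
def classify(age_group, income):
    """Apply a small, hand-written set of IF-THEN rules (illustrative only)."""
    # IF age = youth AND income = high THEN class = A
    if age_group == "youth" and income == "high":
        return "A"
    # IF age = youth AND income = low THEN class = B
    if age_group == "youth" and income == "low":
        return "B"
    # IF age = middle_aged OR age = senior THEN class = C
    if age_group in ("middle_aged", "senior"):
        return "C"
    # No rule fires: the model cannot label this object.
    return "unknown"

# Predict the class of objects whose label is not known in advance.
for customer in [("youth", "high"), ("youth", "low"), ("senior", "high")]:
    print(customer, "->", classify(*customer))
```

In practice the rules are learned from labeled training data rather than written by hand.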



Data Mining Functionalities
 Clustering Analysis
 Clustering analyses data objects without consulting a known class
label.
 Given a collection of objects, put objects into groups based on
similarity.
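
Grouping objects by similarity without class labels can be sketched with k-means, used here only as one familiar clustering method; the slide does not name an algorithm. The 2-D points, the choice of k = 2, and seeding the centroids with the first k points are all assumptions for illustration.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means on 2-D points using squared Euclidean distance."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return clusters, centroids

# Hypothetical data: two visibly separate groups, no labels given.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
clusters, centroids = kmeans(points, k=2)
for i, members in enumerate(clusters):
    print(f"cluster {i}: {sorted(members)}")
```

No class label is consulted at any point; the groups emerge from distances alone.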



Data Mining Functionalities
* Outlier analysis
  - A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
  - Example: detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase or the purchase frequency.
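
The credit card example can be sketched as a per-account check: a purchase is flagged when it is far larger than that account's typical charge. The median-based rule, the factor of 5, and the account data are assumptions for illustration, not a method taken from the slides.

```python
from statistics import median

def flag_large_purchases(charges_by_account, factor=5.0):
    """Flag charges more than `factor` times the account's median charge."""
    flagged = []
    for account, amounts in charges_by_account.items():
        typical = median(amounts)  # robust to a single extreme value
        for amount in amounts:
            if typical > 0 and amount > factor * typical:
                flagged.append((account, amount))
    return flagged

# Hypothetical charge histories for two accounts.
charges = {
    "acct-001": [25.0, 40.0, 31.5, 28.0, 2400.0],   # one extreme purchase
    "acct-002": [900.0, 1100.0, 1000.0, 950.0],     # large but regular charges
}
flagged = flag_large_purchases(charges)
print(flagged)
```

Comparing each charge with the same account's history is what keeps the regularly large charges of the second account from being flagged.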



Data Mining Functionalities
* Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera -> large SD memory
  - Periodicity analysis
  - Similarity-based analysis



Learning

To learn: to gain knowledge through study, experience, or being taught.

Types of learning:
* Supervised learning: as the name indicates, a supervisor acts as a teacher; the training data come with known class labels. E.g., classification.
* Unsupervised learning: the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. E.g., clustering.



Major Issues in Data Mining
Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors create the major issues discussed here, which fall into three groups:
* Mining methodology and user interaction
* Performance issues
* Diverse data types issues

[Figure: diagram of the major issues]



Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing one: knowledge fusion
 User interaction
 Data mining query languages
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy



Data Mining Applications
Data mining is highly useful in the following domains −

 Market Analysis and Management

 Corporate Analysis & Risk Management

 Fraud Detection

Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration, sports,
astronomy, and Internet Web Surf-Aid.



Market Analysis and Management
Listed below are the market areas where data mining is used:

* Customer profiling: data mining helps determine what kind of people buy what kind of products.
* Identifying customer requirements: data mining helps identify the best products for different customers. It uses prediction to find the factors that may attract new customers.
* Cross-market analysis: data mining finds associations and correlations between product sales.



Market Analysis and Management
* Target marketing: data mining helps find clusters of model customers who share the same characteristics, such as interests, spending habits, income, etc.
* Determining customer purchasing patterns: data mining helps determine customer purchasing patterns.
* Providing summary information: data mining provides various multidimensional summary reports.



Corporate Analysis and Risk Management

Data mining is used in the following fields of the corporate sector:

* Finance planning and asset evaluation: involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
* Resource planning: involves summarizing and comparing resources and spending.
* Competition: involves monitoring competitors and market directions.



Fraud Detection

* Data mining is also used in credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to analyze the destination of the call, the duration of the call, the time of day or week, etc. It also analyzes patterns that deviate from expected norms.



 Fraudulent pattern analysis and the identification of unusual patterns
 Identify potentially fraudulent users and their atypical usage
patterns
 Detect attempts to gain fraudulent entry to customer accounts
 Discover unusual patterns which may need special attention
 Multidimensional association and sequential pattern analysis
 Find usage patterns for a set of communication services by
customer group, by month, etc.
 Promote the sales of specific services
 Improve the availability of particular services in a region
 Use of visualization tools in telecommunication data analysis



Biomedical Data Analysis
* DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)
* Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
* Humans have around 30,000 genes
* Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
* Semantic integration of heterogeneous, distributed genome databases
  - Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  - Data cleaning and data integration methods developed in data mining will help



DNA Analysis: Examples
* Similarity search and comparison among DNA sequences
  - Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  - Identify gene sequence patterns that play roles in various diseases
* Association analysis: identification of co-occurring gene sequences
  - Most diseases are not triggered by a single gene but by a combination of genes acting together
  - Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
* Path analysis: linking genes to different disease development stages
  - Different genes may become active at different stages of the disease
  - Develop pharmaceutical interventions that target the different stages separately
* Visualization tools and genetic data analysis



Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=“ ”
  - noisy: containing errors or outliers
    e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    e.g., Age=“42” Birthday=“03/07/1997”
    e.g., was rating “1, 2, 3”, now rating “A, B, C”
    e.g., discrepancy between duplicate records

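
The three kinds of dirty data on this slide can each be caught by a simple check. The sketch below reuses the slide's sample values; the record layout, the function name, and the reference date passed as `as_of` are assumptions for illustration.

```python
from datetime import date

def find_problems(record, as_of):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: an attribute value is missing or blank.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be right, such as a negative salary.
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the stored birthday.
    born = record["birthday"]
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    expected_age = as_of.year - born.year - (0 if had_birthday else 1)
    if record["age"] != expected_age:
        problems.append(
            f"inconsistent: age {record['age']} but birthday implies {expected_age}"
        )
    return problems

record = {"occupation": " ", "salary": -10, "age": 42,
          "birthday": date(1997, 3, 7)}
problems = find_problems(record, as_of=date(2020, 6, 30))
for p in problems:
    print(p)
```

Checks like these only detect problems; deciding how to repair each one is the job of the cleaning steps covered on the following slides.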


Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning



Multi-Dimensional Measure of Data Quality

* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
* Broad categories:
  - Intrinsic, contextual, representational, and accessibility



Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data



Forms of Data Preprocessing



Mining Data Descriptive Characteristics

 Motivation
 To better understand the data: central tendency, variation
and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
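
The quantile and boxplot analysis mentioned on this slide can be sketched with Python's standard `statistics` module (`quantiles` needs Python 3.8 or later). The 1.5 x IQR outlier rule is a common boxplot convention assumed here, not something the slide specifies, and the value 120 is appended to the sample prices purely to give the rule something to flag.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)  # default "exclusive" method
    return min(values), q1, q2, q3, max(values)

def boxplot_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Sample prices plus one artificial extreme value (120) for illustration.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
summary = five_number_summary(prices)
outliers = boxplot_outliers(prices)
print("min, Q1, median, Q3, max:", summary)
print("outliers:", outliers)
```

Other quantile definitions give slightly different quartiles on small samples, so the exact cut-offs depend on the method chosen.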
Data Cleaning
* Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
Missing Data

* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.



How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious and often infeasible
* Fill it in automatically with
  - a global constant: e.g., “unknown” (which may act as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
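
Two of the automatic strategies above, the attribute mean and the per-class attribute mean, can be sketched as follows. The rows, the `income` attribute, and the `risk` class label are hypothetical; missing values are represented as `None`, and the sketch assumes every class has at least one known value.

```python
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean over all known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]})
            for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the row's own class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    means = {c: mean(v) for c, v in by_class.items()}
    return [dict(r, **{attr: means[r[label]] if r[attr] is None else r[attr]})
            for r in rows]

# Hypothetical customer rows with a class label and some missing incomes.
rows = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None,  "risk": "high"},
]
global_filled = fill_with_mean(rows, "income")
class_filled = fill_with_class_mean(rows, "income", "risk")
print([r["income"] for r in global_filled])
print([r["income"] for r in class_filled])
```

The global mean gives both incomplete rows the same value, while the class mean gives the low-risk and high-risk rows different values, which is why the slide calls it smarter.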



Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
* Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data



How to Handle Noisy Data?
* Binning
  - first sort the data and partition it into (equal-frequency or equal-width) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
* Regression
  - smooth the data by fitting it to regression functions
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)



Simple Discretization Methods: Binning

* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
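
The two partitioning schemes can be compared on the same data. This sketch applies W = (B - A)/N for equal-width bins and splits the sorted values into equally sized runs for equal-depth bins; the function names are hypothetical, and the equal-width version assumes the values are not all identical (so W is not zero).

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of width W = (B - A) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value would land in bin n, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print("equal-width:", equal_width_bins(prices, 3))
print("equal-depth:", equal_depth_bins(prices, 3))
```

With these twelve prices the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 each, which shows how uneven data fills equal-width intervals unevenly.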
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
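
The smoothed values above can be reproduced with a few lines of code. The function names are hypothetical; bin means are rounded to whole dollars as on the slide, and a value exactly halfway between the two boundaries is sent to the lower one (an assumption, since the slide's data contains no such tie).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# The equal-frequency bins from the slide.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print("bin means:     ", by_means)
print("bin boundaries:", by_boundaries)
```

The unrounded bin means are 9, 22.75, and 29.25, which is where the slide's 9, 23, and 29 come from.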



Regression

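[Figure: regression line y = x + 1 plotted against x; an observed value Y1 at X1 is smoothed to the fitted value Y1' on the line]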
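
Smoothing by regression can be sketched with an ordinary least-squares line fit: each observed value is replaced by the value the fitted line predicts at the same x. The sample points are made up to lie roughly along y = x + 1, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.1, 4.9, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # each Y1 becomes Y1'
print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
print("smoothed values:", [round(v, 2) for v in smoothed])
```

The fit requires at least two distinct x values; with more than one input attribute the same idea extends to multiple linear regression.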



Cluster Analysis

