Unit-3: Data Mining

What is Data mining?

Data mining is a collection of techniques for the efficient, automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they can be used in an enterprise's decision-making process. It is also known as knowledge discovery. Data mining refers to the extraction of hidden, predictive information patterns from large databases.
[Figure: raw information -> data mining -> hidden information patterns]
Need for data mining

Data mining has found many applications in the last few years, for a number of reasons:
Growth in OLTP data
Growth in data due to cards
Growth in data due to the web
Growth in data due to telephone transactions, banking, and medical data
Growth in data storage capacity
Decline in the cost of processing
Availability of software and tools
Data mining process

1. Requirement analysis: clearly define the goals and the business problem
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualization
CRISP-DM (Cross-Industry Standard Process for Data Mining) model
Data mining VS data warehouse

OLAP
What is happening in the enterprise
Summary data
Limited dimensions
Small number of attributes
User-driven, interactive analysis
Multidimensional, drill-down, and slice-and-dice
Mature and widely used

DATA MINING
Predict the future based on why this is happening
Detailed transaction-level data
Large dimensions
Many dimension attributes
Data-driven automatic knowledge discovery
Prepare data, mining tools
Still emerging
Relationship of data warehouse and data mining

Data mining algorithms need a large amount of detailed data, and a data warehouse contains data at the lowest level of detail. Data mining needs integrated and cleansed data, and a data warehouse contains data of this kind, which makes it suitable for data mining. The infrastructure of a data warehouse is robust, with parallel processing technology and relational database systems, which data mining also needs.
Data mining techniques

Association rules mining or market basket analysis
Supervised classification
Cluster analysis
Web data mining
Search engines
Association rules mining or market basket analysis
Transaction   Items bought
1             bread, milk, cheese
2             bread, cheese
3             jam, milk
4             milk, ghee
Here the most frequent combination is bread and cheese, which appear together in transactions 1 and 2.
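The observation about bread and cheese can be checked by counting how often each pair of items occurs together. The following is a minimal sketch in Python, not a full association rule miner; the function name pair_counts is only illustrative, and the data are the four transactions from the table.

```python
from collections import Counter
from itertools import combinations

# The four transactions from the table (transaction id -> items bought).
transactions = {
    1: {"bread", "milk", "cheese"},
    2: {"bread", "cheese"},
    3: {"jam", "milk"},
    4: {"milk", "ghee"},
}

def pair_counts(transactions):
    """Count in how many transactions each pair of items appears together."""
    counts = Counter()
    for items in transactions.values():
        # Sort the items so that each pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
best_pair, best_count = counts.most_common(1)[0]
print(best_pair, best_count)  # ('bread', 'cheese') 2
```

Every other pair occurs only once, so bread and cheese is the only combination bought together more than once.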
Supervised classification
This data mining technique originates from machine learning. It helps in predicting whether an individual is likely to respond to a direct mailing, and in identifying good risks for granting loans or insurance.
Example rule for insurance: IF sex = female AND 19 <= age <= 43 THEN life insurance = yes
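A classification rule like the insurance rule above can be written directly as a predicate. This is only a sketch of how a learned rule might be applied to new records; the function name life_insurance_rule and the encoding of sex as a lowercase string are assumptions made for illustration.

```python
def life_insurance_rule(sex, age):
    """Apply the example rule: sex = female and 19 <= age <= 43 -> life insurance = yes."""
    return sex == "female" and 19 <= age <= 43

print(life_insurance_rule("female", 30))  # True
print(life_insurance_rule("female", 50))  # False
print(life_insurance_rule("male", 30))    # False
```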
Cluster Analysis
Grouping data into disjoint sets whose members are similar in some respect. It also attempts to place dissimilar data in different clusters. For example, in the context of supermarket data, clustering sale items to organize shelf space effectively is a typical application.
Web data mining

Web data mining has an impact on the way we search for and find information at home and at work.
Evaluation of learning sites
Example: a student portal
Check login
Notes
Submit online test
Chat page for clarifying doubts
Search engines
A search engine is a huge database of web pages together with a software package for indexing and retrieving pages, which enables users to find information. Ranking helps the user choose the best result.
Data mining application

Customer segmentation
Market basket analysis
Risk management
Fraud detection
Demand prediction
Delinquency tracking
Looking for knowledge

The explosive growth of data
The World Wide Web
Business: e-commerce, transactions, stocks, ...
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube, forums, blogs, ...
Google & Co.
We are drowning in data, but starving for knowledge!
Avoid data tombs
Necessity is the mother of invention: data mining, the automated analysis of massive data sets
What is Data Mining?

Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Are simple search engines data mining?
Are queries data mining?
Are expert systems data mining?
Knowledge Discovery (KDD) Process

[Figure: the KDD process. Data sources -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation]
DATA MINING AND BUSINESS INTELLIGENCE
[Figure: pyramid of layers, annotated with "Increasing potential to support business decisions" and "Quantity of data"]
Layers, from top to bottom:
Decision making
Data presentation: visualization techniques
Data mining: information discovery
Data exploration: statistical summary, querying, and reporting
Data preprocessing/integration, data warehouses
Data sources: paper, files, Web documents, scientific experiments, database systems
Roles, from top to bottom: end user, business analyst, data analyst, DBA
Data Mining: confluence of multiple disciplines

[Figure: data mining at the center, drawing on database technology, statistics, machine learning, pattern recognition, visualization, algorithms, and other disciplines]
Why Data Mining?
Why is Data Mining so complex? A matter of data dimensions

Tremendous amount of data
Walmart: customer buying patterns; a data warehouse 7.5 terabytes large in 1995
VISA: detecting credit card interoperability issues; 6800 payment transactions per second
High-dimensionality of data
Many dimensions to be combined together
Data cube example: time, location, product sales
High complexity of data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Spatial, spatiotemporal, multimedia, text and Web data
What does Data Mining provide me with? (1)

Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Characterization describes things in the same class; discrimination describes how to separate different classes
Frequent patterns, association, correlation vs. causality
Wine -> Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
Is this correlation or not?
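The two bracketed numbers attached to an association rule are usually read as its support (the share of all baskets that contain every item of the rule) and its confidence (the share of baskets containing the left-hand side that also contain the right-hand side). A minimal sketch of both measures follows, assuming the four market basket transactions from the earlier bread and cheese example; the function names support and confidence are illustrative.

```python
# The four market basket transactions used earlier in this unit.
baskets = [
    {"bread", "milk", "cheese"},
    {"bread", "cheese"},
    {"jam", "milk"},
    {"milk", "ghee"},
]

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)

def confidence(baskets, left, right):
    """Fraction of the baskets containing `left` that also contain `right`."""
    return support(baskets, left | right) / support(baskets, left)

# Rule: bread -> cheese
print(support(baskets, {"bread", "cheese"}))       # 0.5
print(confidence(baskets, {"bread"}, {"cheese"}))  # 1.0
```

A high confidence only says that the items occur together; it does not by itself show that buying one causes buying the other.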
What does Data Mining provide me with? (2)

Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing interclass similarity
What does Data Mining provide me with? (3)

Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Fraud detection is the main application area
Noise or exception?
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera -> large SD memory
Periodicity analysis
Similarity-based analysis
Applications of Data Mining: Market Analysis and Management

Data sources:
credit card transactions, loyalty cards, smart cards, discount coupons, ...
Target marketing
Find clusters of model customers who share the same characteristics:
Geographics (lives in Rome, lives in Trentino)
Demographics (married, between 21-35, at least one child, family income more than 40.000/year)
Psychographics (likes new products, consistently uses the Web)
Behaviors (searches info on the Internet, always defends her decisions)
Determine customer purchasing patterns over time
Cross-market analysis
Find associations between product sales, and predict based on such associations
Compare the sales in the US and in Italy, find associations in old products and predict whether new ones will succeed
Customer profiling
What types of customers buy what products
Customers aged between 20 and 30 with income > 20K will buy product A
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Applications of Data Mining: Corporate Analysis

Finance Planning and Asset Evaluation
Cash flow prediction and analysis
Cross-sectional and time-series analysis (financial ratio, trend analysis)
Resource Planning
Summarize and compare the resources and spending
Competition
Monitor competitors and market directions
Group customers into classes and a class-based pricing procedure
Set pricing strategy in a highly competitive market
Other examples?
Data Preprocessing
Why Data Preprocessing?

Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation= , birthdate=31/12/2099
noisy: containing errors or outliers
e.g., Salary=-10
inconsistent: containing discrepancies in codes or names
e.g., Age=42, Birthday=03/07/1997 (we are in 2007!!)
e.g., was rating 1, 2, 3, now rating A, B, C
e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay 200.000, in the second copy A does not have to pay anything
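The three kinds of dirty data can be detected with simple validity checks. The sketch below is illustrative only: the function name find_problems, the field names and the combination of all three examples in one record are assumptions, and the birthday is read as day/month/year.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data quality problems found in one record."""
    problems = []
    # Incomplete: a required attribute value is missing or empty.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is missing")
    # Noisy: a value that cannot be correct, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age does not match the age implied by the birthday.
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        expected = today.year - birthday.year - (0 if had_birthday else 1)
        if age != expected:
            problems.append(f"inconsistent: age is {age}, birthday implies {expected}")
    return problems

record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
for problem in find_problems(record, today=date(2007, 12, 31)):
    print(problem)
```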
Why is data dirty?

Incomplete data may come from
Not applicable data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modifying some linked data)
Why Is Data Preprocessing Important?
Data Preprocessing 1. Data cleaning: missing values
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
Fill in missing values
Name=John, Occupation=Lawyer, Age=28, Salary=
Ignore the record (is it always feasible?)
Manually fill in missing attributes
Automatically insert a constant
Automatically insert the mean value (relative to the record class)
Most probable value: make some inference!
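One of the options above, inserting the mean value relative to the record class, can be sketched as follows. The function name fill_with_class_mean is illustrative, and all records except John's are hypothetical values invented for the example; Occupation is assumed to play the role of the record class.

```python
from collections import defaultdict

def fill_with_class_mean(records, class_key, field):
    """Return copies of the records where a missing `field` is replaced
    by the mean of that field over records of the same class."""
    totals = defaultdict(lambda: [0.0, 0])  # class -> [sum, count]
    for record in records:
        if record.get(field) is not None:
            totals[record[class_key]][0] += record[field]
            totals[record[class_key]][1] += 1
    filled = []
    for record in records:
        copy = dict(record)
        total, count = totals[copy[class_key]]
        # Leave the gap if no record of the same class has a value.
        if copy.get(field) is None and count > 0:
            copy[field] = total / count
        filled.append(copy)
    return filled

records = [
    {"Name": "John", "Occupation": "Lawyer", "Age": 28, "Salary": None},
    {"Name": "Anna", "Occupation": "Lawyer", "Age": 35, "Salary": 60000},   # hypothetical
    {"Name": "Marc", "Occupation": "Lawyer", "Age": 41, "Salary": 80000},   # hypothetical
    {"Name": "Lisa", "Occupation": "Teacher", "Age": 30, "Salary": 40000},  # hypothetical
]
filled = fill_with_class_mean(records, "Occupation", "Salary")
print(filled[0]["Salary"])  # 70000.0, the mean salary of the other lawyers
```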
Data Preprocessing 1. Data cleaning: binning
Handle noisy data
Binning, clustering, regression (no details here)
Binning
1. Sort data by price: 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9
Bin 2: 15, 21, 21
Bin 3: 24, 25, 26
3. Smoothing by bin means:
Bin 1: 7, 7, 7
Bin 2: 19, 19, 19
Bin 3: 25, 25, 25
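The three binning steps can be reproduced with a few lines of Python. This is a minimal sketch; the function name smooth_by_bin_means and the bin_size parameter are illustrative, and the input is the price list from the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins of `bin_size`
    values, and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
print(smooth_by_bin_means(prices, bin_size=3))
# [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```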
Data Preprocessing 1. Data cleaning: clustering
[Figure: clustered data points, with the points labeled as noise]
Data Preprocessing 2. Integration and transformation

Data integration combines data from multiple sources (D1, D2, D3) into a coherent store (D1,2,3)
Schema integration
Integrate metadata from different sources, e.g., A.cust-id = B.cust-number
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)

Data integration can lead to redundant attributes
Same object (A.house = B.residence)
Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
Redundant attributes can be discovered via correlation analysis
A mathematical method for detecting the correlation between two attributes
Correlation coefficient (Pearson's product moment coefficient): the higher its absolute value, the stronger the correlation between the attributes
Chi-square test
No details on these methods here
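As a rough illustration of how correlation analysis can expose a redundant attribute, the sketch below computes Pearson's product moment coefficient for two numeric attributes. The function name pearson and the sample values are hypothetical; the annual income is simply twelve times the monthly salary, so the coefficient comes out at (approximately) 1.

```python
import math

def pearson(xs, ys):
    """Pearson's product moment correlation coefficient of two equally long
    lists of numbers. Assumes that neither list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical values: annual_income is derived from salary, so it is redundant.
salary = [1500, 2000, 2600, 3100, 4000]
annual_income = [12 * s for s in salary]
print(round(pearson(salary, annual_income), 6))  # 1.0
```

A coefficient close to +1 or -1 suggests that one of the two attributes can be dropped; a value near 0 means that no linear relationship was found.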

Aggregation:
Sum the sales of different branches (in different data sources) to compute the company sales
Generalization:
Concept hierarchy climbing
From integer attribute age to classes of age (children, adult, old)
Normalization: scaled to fall within a small, specified range
Change the range from [-∞, +∞] to [-1, +1]
{-13, -6, -3, 10, 100} -> {-0.13, -0.06, -0.03, 0.1, 1}
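The normalization example is consistent with dividing every value by the largest absolute value in the set (here 100). The sketch below assumes that reading; the function name normalize_by_max_abs is illustrative, and other schemes such as min-max normalization would give different numbers.

```python
def normalize_by_max_abs(values):
    """Scale the values into [-1, +1] by dividing by the largest absolute value.
    Assumes that at least one value is non-zero."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

print(normalize_by_max_abs([-13, -6, -3, 10, 100]))
# [-0.13, -0.06, -0.03, 0.1, 1.0]
```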
Data Preprocessing 3. Data reduction

Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Different reduction types (dimensionality, numerosity, discretization)
Dimensionality: Attribute subset selection
Example with a decision tree (left branches True, right branches False)
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree that tests A4? at the root and A1? and A6? below it, with leaves Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}

Dimensionality: Principal Components Analysis
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Works for numeric data only
Used when the number of dimensions is large
Numerosity: Clustering
Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
[Figure: 2 clusters]
Sparse data leads to many clusters, which is not effective

Numerosity: Sampling
Obtain a small sample s to represent the whole data set N
Problem: how to select a representative sampling set
Random sampling is not enough: representative samples should be preserved
Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
[Figure: random sampling vs. stratified sampling, with the annotation "No samples from here"]
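Stratified sampling can be sketched as drawing the same fraction from every class, so that small classes are not lost. The function name stratified_sample, the class_of parameter and the example data (90 records of class "a", 10 of class "b") are hypothetical; the fixed seed only makes the run repeatable.

```python
import random
from collections import defaultdict

def stratified_sample(records, class_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from every class (stratum),
    keeping at least one record per class."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[class_of(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical data set: class "a" is common, class "b" is rare.
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, class_of=lambda r: r[0], fraction=0.1)
print(sum(1 for r in sample if r[0] == "a"), sum(1 for r in sample if r[0] == "b"))  # 9 1
```

With plain random sampling of 10 records, the rare class "b" could be missed entirely; the stratified version keeps its share in the sample.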
Data Preprocessing 4. Discretization - concept hierarchy
Three types of attributes
Nominal: values from an unordered set (color, profession)
Ordinal: values from an ordered set (military or academic rank)
Continuous: numbers (integer or real numbers)
Discretization
Divide the range of a continuous attribute into intervals
Reduces data size and its complexity
Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
Some useful methods:
Binning, clustering (already presented)
Entropy-based discretization (no details here)
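Discretization with fixed interval boundaries can be sketched as a lookup of the interval that contains each value. The class labels children, adult and old come from the generalization example earlier in this unit; the cut points 18 and 65 and the function name discretize_age are assumptions chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical cut points: below 18 -> children, 18 to 64 -> adult, 65 and above -> old.
CUT_POINTS = [18, 65]
LABELS = ["children", "adult", "old"]

def discretize_age(age):
    """Map a continuous age to the label of the interval it falls into."""
    return LABELS[bisect_right(CUT_POINTS, age)]

print([discretize_age(a) for a in [5, 17, 18, 42, 64, 65, 80]])
# ['children', 'children', 'adult', 'adult', 'adult', 'old', 'old']
```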
Concept hierarchy generation
For categorical data
Specification of an ordering between attributes (schema level)
street < city < state < country
Specification of a hierarchy of values (data level)
{Urbana, Champaign, Chicago} < Illinois
Automatic generation using the number of distinct values
For the set of attributes: {street, city, state, country}
IF: |street| = 600.000, |city| = 3.000, |state| = 300, |country| = 15
THEN: street < city < state < country
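The automatic generation rule amounts to sorting the attributes by their number of distinct values: the attribute with the most distinct values goes to the lowest level of the hierarchy. A minimal sketch using the counts from the example follows; the function name hierarchy_from_distinct_counts is illustrative.

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attributes from the lowest hierarchy level (most distinct values)
    to the highest level (fewest distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)

# Distinct value counts from the example, given in arbitrary order.
distinct_counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
order = hierarchy_from_distinct_counts(distinct_counts)
print(" < ".join(order))  # street < city < state < country
```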