You are on page 1of 25

DATA MINING

Prepared by: Srinivasan 109CE21 Manas 109CE22 Karan 109CE23

Data explosion problem


Automated data collection tools and mature database

technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing Extraction of interesting knowledge (rules, regularities,

patterns, constraints) from data in large databases

Lots of data is being collected and warehoused


Web data, e-commerce purchases at department/

grocery stores Bank/Credit Card transactions

Computers have become cheaper and more powerful Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in

Customer Relationship Management)

Data collected and stored at enormous speeds (GB/hour)


remote sensors on a satellite telescopes scanning the skies microarrays generating gene

expression data scientific simulations generating terabytes of data


Traditional techniques infeasible for raw data Data mining may help scientists
in classifying and segmenting data in Hypothesis Formation

Data mining (knowledge discovery in databases):


Extraction of interesting (non-trivial, implicit, previously

unknown and potentially useful) information or patterns from data in large databases

Alternative names and their inside stories:


Data mining: a misnomer? Knowledge discovery(mining) in databases (KDD),

knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.

Decisions in

data mining

Kinds of databases to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted
Data

mining tasks

Descriptive data mining Predictive data mining

Databases to be mined Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc. Knowledge to be mined Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. Multiple/integrated functions and mining at multiple levels Techniques utilized Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc. Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock

market analysis, Web mining, Weblog analysis, etc.

Prediction Tasks
Use some variables to predict unknown or future values of

other variables

Description Tasks
Find human-interpretable patterns that describe the data.

Common data mining tasks


Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive]

In terms of software and the marketing thereof Data Mining != Data Analysis Data Mining implies software uses some intelligence over simple grouping and partitioning of data to infer new information. Data Analysis is more in line with standard statistical software (ie: web stats). These usually present information about subsets and relations within the recorded data set (ie: browser/search engine usage, average visit time, etc. )

CLUSTERING

Given

a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one

another. Data points in separate clusters are less similar to one another.
Similarity

Measures:

Euclidean Distance if attributes are continuous. Other Problem-specific Measures.

Market

Segmentation:

Goal: subdivide a market into distinct subsets of

customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach:
x Collect different attributes of customers based on their geographical and lifestyle related information. x Find clusters of similar customers. x Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Document

Clustering:

Goal: To find groups of documents that are

similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Given a set of records each of which contain some number of items from a given collection;
Produce dependency rules which will predict occurrence of an

item based on occurrences of other items.

TID

Items

1 2 3 4 5

Bread, Coke, Milk Beer, Bread Beer, Coke, Newspaper, Milk Beer, Bread, Newspaper, Milk Coke, Newspaper, Milk

Rules Discovered:
{Milk} --> {Coke} {Newspaper, Milk} --> {Beer}

Marketing

and Sales Promotion:

Let the rule discovered be

{Milk, } --> {Coke} Coke as consequent => Can be used to determine what should be done to boost its sales. Milk in the antecedent => Can be used to see which products would be affected if the store discontinues selling milk. Milk in antecedent and Coke in consequent => Can be used to see what products should be sold with milk to promote sale of coke!

Supermarket

shelf management.

Goal: To identify items that are bought

together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule -x If a customer buys Newspaper and milk, then he is very likely to buy beer:

ADVANTAGES AND DISADVANTAGES OF DATA MINING

significant deviations from normal behavior Applications:


Detect

Credit Card Fraud Detection

Network Intrusion

Detection

Induction vs Deduction

Deductive reasoning is truth-preserving:


1. All horses are mammals 2. All mammals have lungs 3. Therefore, all horses have lungs

Induction reasoning adds information:


1. All horses observed so far have lungs. 2. Therefore, all horses have lungs.

Pattern Evaluation

Data mining: the core of

knowledge discovery process.


Data Selection Data Preprocessing

Data Mining

Task-relevant Data Data Warehouse


Data Cleaning Data Integration

Databases

Learning the application domain:


relevant prior knowledge and goals of application

Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation:
Find useful features, dimensionality/variable reduction,

invariant representation.

Choosing functions of data mining


summarization, classification, regression, association,

clustering.

Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns,

etc.

Use of discovered knowledge

Statistical Analysis:
Ill-suited for Nominal and Structured Data Types Completely data driven - incorporation of domain knowledge not possible Interpretation of results is difficult and daunting Requires expert user guidance

Data Mining:
Large Data sets Efficiency of Algorithms is important Scalability of Algorithms is important Real World Data Lots of Missing Values Pre-existing data - not user generated Data not static - prone to updates Efficient methods for data retrieval available for use

AI/Machine Learning Combinatorial/Game Data Mining Good for analyzing winning strategies to games, and thus developing intelligent AI opponents. (ie: Chess) Business Strategies Market Basket Analysis Identify customer demographics, preferences, and purchasing patterns. Risk Analysis Product Defect Analysis Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.

User Behavior Validation Fraud Detection In the realm of cell phones Comparing phone activity to calling records. Can help detect calls made on cloned phones. Similarly, with credit cards, comparing purchases with historical purchases. Can detect activity with stolen cards.

THANK YOU

You might also like