Data Mining Presentation

DATA MINING
Prepared by: Srinivasan 109CE21 Manas 109CE22 Karan 109CE23
Data explosion problem

Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Lots of data is being collected and warehoused

Web data, e-commerce purchases at department/
grocery stores Bank/Credit Card transactions
Computers have become cheaper and more powerful Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Data collected and stored at enormous speeds (GB/hour)

remote sensors on a satellite telescopes scanning the skies microarrays generating gene
expression data scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data Data mining may help scientists
in classifying and segmenting data in Hypothesis Formation
Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:

Data mining: a misnomer? Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.
Decisions in
data mining
Kinds of databases to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted
Data
mining tasks
Descriptive data mining Predictive data mining
Databases to be mined Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc. Knowledge to be mined Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. Multiple/integrated functions and mining at multiple levels Techniques utilized Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc. Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Prediction Tasks
Use some variables to predict unknown or future values of
other variables
Description Tasks
Find human-interpretable patterns that describe the data.
Common data mining tasks

Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive]
In terms of software and the marketing thereof Data Mining != Data Analysis Data Mining implies software uses some intelligence over simple grouping and partitioning of data to infer new information. Data Analysis is more in line with standard statistical software (ie: web stats). These usually present information about subsets and relations within the recorded data set (ie: browser/search engine usage, average visit time, etc. )
CLUSTERING
Given
a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one
another. Data points in separate clusters are less similar to one another.
Similarity
Measures:
Euclidean Distance if attributes are continuous. Other Problem-specific Measures.
Market
Segmentation:
Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach:
x Collect different attributes of customers based on their geographical and lifestyle related information. x Find clusters of similar customers. x Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Document
Clustering:
Goal: To find groups of documents that are
similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Given a set of records each of which contain some number of items from a given collection;
Produce dependency rules which will predict occurrence of an
item based on occurrences of other items.
TID
Items
1 2 3 4 5
Bread, Coke, Milk Beer, Bread Beer, Coke, Newspaper, Milk Beer, Bread, Newspaper, Milk Coke, Newspaper, Milk
Rules Discovered:
{Milk} --> {Coke} {Newspaper, Milk} --> {Beer}
Marketing
and Sales Promotion:
Let the rule discovered be
{Milk, } --> {Coke} Coke as consequent => Can be used to determine what should be done to boost its sales. Milk in the antecedent => Can be used to see which products would be affected if the store discontinues selling milk. Milk in antecedent and Coke in consequent => Can be used to see what products should be sold with milk to promote sale of coke!
Supermarket
shelf management.
Goal: To identify items that are bought
together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule -x If a customer buys Newspaper and milk, then he is very likely to buy beer:
ADVANTAGES AND DISADVANTAGES OF DATA MINING
significant deviations from normal behavior Applications:

Detect
Credit Card Fraud Detection
Network Intrusion
Detection
Induction vs Deduction
Deductive reasoning is truth-preserving:

1. All horses are mammals 2. All mammals have lungs 3. Therefore, all horses have lungs
Induction reasoning adds information:

1. All horses observed so far have lungs. 2. Therefore, all horses have lungs.
Pattern Evaluation
Data mining: the core of
knowledge discovery process.

Data Selection Data Preprocessing
Data Mining
Task-relevant Data Data Warehouse

Data Cleaning Data Integration
Databases
Learning the application domain:

relevant prior knowledge and goals of application
Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation:
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining

summarization, classification, regression, association,
clustering.

Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Statistical Analysis:
Ill-suited for Nominal and Structured Data Types Completely data driven - incorporation of domain knowledge not possible Interpretation of results is difficult and daunting Requires expert user guidance
Data Mining:
Large Data sets Efficiency of Algorithms is important Scalability of Algorithms is important Real World Data Lots of Missing Values Pre-existing data - not user generated Data not static - prone to updates Efficient methods for data retrieval available for use
AI/Machine Learning Combinatorial/Game Data Mining Good for analyzing winning strategies to games, and thus developing intelligent AI opponents. (ie: Chess) Business Strategies Market Basket Analysis Identify customer demographics, preferences, and purchasing patterns. Risk Analysis Product Defect Analysis Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.
User Behavior Validation Fraud Detection In the realm of cell phones Comparing phone activity to calling records. Can help detect calls made on cloned phones. Similarly, with credit cards, comparing purchases with historical purchases. Can detect activity with stolen cards.
THANK YOU

Data Mining Presentation

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Presentation

Uploaded by

Copyright:

Available Formats

DATA MINING

Prepared by: Srinivasan 109CE21 Manas 109CE22 Karan 109CE23

Data explosion problem

patterns, constraints) from data in large databases

Lots of data is being collected and warehoused

grocery stores Bank/Credit Card transactions

Customer Relationship Management)

Data collected and stored at enormous speeds (GB/hour)

expression data scientific simulations generating terabytes of data

Data mining (knowledge discovery in databases):

Alternative names and their inside stories:

knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.

Descriptive data mining Predictive data mining

market analysis, Web mining, Weblog analysis, etc.

Common data mining tasks

Euclidean Distance if attributes are continuous. Other Problem-specific Measures.

Goal: subdivide a market into distinct subsets of

Goal: To find groups of documents that are

item based on occurrences of other items.

and Sales Promotion:

Let the rule discovered be

Goal: To identify items that are bought

ADVANTAGES AND DISADVANTAGES OF DATA MINING

significant deviations from normal behavior Applications:

Credit Card Fraud Detection

Deductive reasoning is truth-preserving:

Induction reasoning adds information:

Data mining: the core of

knowledge discovery process.

Task-relevant Data Data Warehouse

Learning the application domain:

Choosing functions of data mining

Use of discovered knowledge

You might also like