DATA MINING
Raw Information
Data Mining
Limited dimensions: a small number of attributes.
User-driven, interactive analysis.
Multidimensional: drill-down and slice-and-dice.
Mature and widely used.
DATA MINING
Predicts the future based on why something is happening.
Detailed, transaction-level data.
Large dimensions: many dimension attributes.
Data-driven, automatic knowledge discovery.
Requires data preparation and mining tools.
Still emerging.
techniques
Association rules mining or market basket analysis
Transaction Items bought
1 2 3 4
Supervised classification
This data mining technique originates from machine learning. It helps predict whether an individual is likely to respond to a direct mailing, and it identifies good risks for granting loans or insurance.
Example rule for insurance: if sex = female and 19 <= age <= 43, then life insurance = yes.
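The insurance rule above can be written as a small function. This is a minimal sketch: the function name and the "unknown" fallback are assumptions, since the source states only the one rule and says nothing about the other cases.

```python
def predict_life_insurance(sex: str, age: int) -> str:
    """Apply the single rule: female and 19 <= age <= 43 gives 'yes'."""
    if sex == "female" and 19 <= age <= 43:
        return "yes"
    # Assumption: the source gives no rule for other cases, so report "unknown".
    return "unknown"


# Hypothetical applicants, invented for illustration.
applicants = [("female", 30), ("female", 50), ("male", 30)]
predictions = [predict_life_insurance(sex, age) for sex, age in applicants]
print(predictions)  # ['yes', 'unknown', 'unknown']
```

A real classifier would learn many such rules from labeled training data rather than having one written by hand.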
Cluster Analysis
Groups data into disjoint sets whose members are similar in some respect, while placing dissimilar data in different clusters. For example, with supermarket data, a typical application is clustering sale items to organize shelf space effectively.
Search engines
A search engine combines a huge database of web pages with a software package for indexing and retrieving pages, enabling users to find information. Ranking helps the user choose the best result.
Google & Co
We are drowning in data, but starving for knowledge! Avoid data tombs.
Necessity is the mother of invention: data mining is the automated analysis of massive data sets.
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Are simple search engines data mining? Are queries data mining? Are expert systems data mining?
Figure: knowledge discovery process, from data sources through data cleaning and data integration to the data warehouse, then selection of task-relevant data, then data mining.
Figure: layers from top to bottom (axis label: quantity of data):
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Statistics
Machine Learning
Pattern Recognition
Data Mining
Visualization
Algorithms
Other Disciplines
High-dimensionality of data
Many dimensions must be combined.
Data cube example: time, location, product sales.
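To make the data cube example concrete, the sketch below sums a sales measure over chosen dimensions. All rows, names, and amounts are hypothetical, and treating sales as the measure over the dimensions time, location, and product is an assumption about how the example is meant to be read.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, location, product, sales amount).
facts = [
    ("Q1", "Rome", "wine", 100),
    ("Q1", "Rome", "pasta", 50),
    ("Q1", "Trento", "wine", 70),
    ("Q2", "Rome", "wine", 120),
    ("Q2", "Trento", "pasta", 30),
]
DIMENSIONS = ("time", "location", "product")


def roll_up(rows, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]
    return dict(totals)


by_time = roll_up(facts, ["time"])
by_time_location = roll_up(facts, ["time", "location"])
grand_total = roll_up(facts, [])

print(by_time)      # {('Q1',): 220, ('Q2',): 150}
print(grand_total)  # {(): 370}
```

Each choice of `keep` corresponds to one view of the cube, so three dimensions already give eight possible aggregations. That growth is why high dimensionality is a challenge.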
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
Characterization describes things in the same class; discrimination describes how to separate different classes.
Frequent patterns, association, correlation vs. causality
Wine Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought] Is this correlation or not?
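The two percentages in the bracket correspond to what association rule mining calls support (the share of all baskets that contain the items) and confidence (the share of baskets with the antecedent that also contain the consequent). The sketch below computes both on a handful of hypothetical baskets. The data and function names are invented for illustration and do not reproduce the figures in the slide.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"tomato sauce", "spaghetti", "wine"},
    {"tomato sauce", "spaghetti"},
    {"tomato sauce", "bread"},
    {"wine", "bread"},
    {"tomato sauce", "spaghetti", "bread"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent transactions that also contain the consequent.

    Assumes the antecedent occurs at least once.
    """
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


rule_support = support({"tomato sauce", "spaghetti"}, baskets)
rule_confidence = confidence({"tomato sauce"}, {"spaghetti"}, baskets)
print(rule_support, rule_confidence)  # 0.6 0.75
```

A high confidence alone does not establish causality. It only says the items co-occur often in the recorded baskets.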
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage.
Predict some unknown or missing numerical values.
Cluster analysis
The class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns.
Principle: maximize intra-class similarity and minimize inter-class similarity.
Target marketing
Find clusters of model customers who share the same characteristics:
Geographics (lives in Rome, lives in Trentino)
Demographics (married, between 21-35, at least one child, family income more than 40.000/year)
Psychographics (likes new products, consistently uses the Web)
Behaviors (searches for information on the Internet, always defends her decisions)
Resource Planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and use a class-based pricing procedure
set pricing strategy in a highly competitive market
Other examples?
Data Preprocessing
Binning
1. Sort data by price: 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9
Bin 2: 15, 21, 21
Bin 3: 24, 25, 26
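The two binning steps can be sketched as follows. The function name is an assumption, and so is the way leftover values are spread across the first bins when the count does not divide evenly; the source example divides evenly, so it does not show that case.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Assumption: the first `extra` bins each take one leftover value.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# The prices from the example, given here in unsorted order.
prices = [21, 4, 26, 8, 15, 24, 9, 25, 21]
bins = equal_frequency_bins(prices, 3)
print(bins)  # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
```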
noise
Generalization:
Concept hierarchy climbing, e.g., from the integer attribute age to age classes (children, adult, old).
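A minimal sketch of this generalization step maps each integer age to one of the three classes. The cut-off ages 18 and 65 are assumptions chosen for illustration; the source names only the classes.

```python
# Assumed cut-offs: the source names the classes but gives no boundaries.
ADULT_FROM = 18
OLD_FROM = 65


def generalize_age(age: int) -> str:
    """Climb one level of the concept hierarchy: integer age to age class."""
    if age < ADULT_FROM:
        return "children"
    if age < OLD_FROM:
        return "adult"
    return "old"


ages = [7, 18, 43, 64, 65, 80]
classes = [generalize_age(a) for a in ages]
print(classes)  # ['children', 'adult', 'adult', 'adult', 'old', 'old']
```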
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Figure: decision tree with test nodes A1? and A6? and leaves Class 1 and Class 2
Reduced attribute set: {A1, A4, A6}
Numerosity: Clustering
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
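Storing only a cluster representation can be sketched as below, assuming the points have already been assigned to clusters. The points are hypothetical, and "diameter" is taken here as the largest pairwise distance within a cluster, which is one common definition and an assumption, since the source does not define it.

```python
from itertools import combinations
from math import dist


def summarize(cluster):
    """Replace a cluster's points by its size, centroid, and diameter."""
    n = len(cluster)
    centroid = tuple(sum(coords) / n for coords in zip(*cluster))
    # Assumption: diameter = largest distance between any two points.
    diameter = max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)
    return {"size": n, "centroid": centroid, "diameter": diameter}


# Hypothetical, already-clustered 2-D points.
clusters = {
    "A": [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)],
    "B": [(10.0, 10.0), (10.0, 12.0)],
}
summary = {name: summarize(points) for name, points in clusters.items()}
print(summary["B"])  # {'size': 2, 'centroid': (10.0, 11.0), 'diameter': 2.0}
```

The reduction comes from keeping `summary` and discarding `clusters`: each cluster shrinks to a fixed-size record regardless of how many points it held.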
2 clusters
Random sampling
Stratified sampling
Figure annotation: no samples from here
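The difference between the two sampling schemes can be sketched as follows. The data, function names, and the rule of keeping at least one row per stratum are assumptions for illustration. The point is that simple random sampling can draw nothing from a small group, while stratified sampling draws from every group.

```python
import random


def simple_random_sample(rows, k, seed=0):
    """Draw k rows without replacement, ignoring any group structure."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from each stratum, keeping at least one row."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 90 rows in one group, 10 in another.
rows = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
srs = simple_random_sample(rows, 20)
strat = stratified_sample(rows, key=lambda r: r[0], fraction=0.2)

rare_in_strat = sum(1 for group, _ in strat if group == "rare")
print(len(srs), len(strat), rare_in_strat)  # 20 20 2
```

The stratified sample always contains two rows from the small group, whereas the share in the simple random sample depends on the draw.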
Nominal: values from an unordered set (color, profession)
Ordinal: values from an ordered set (military or academic rank)
Continuous: numbers (integer or real numbers)
Discretization
Divide the range of a continuous attribute into intervals.
Reduces data size and complexity.
Some data mining algorithms do not support continuous types; in those cases discretization is mandatory.
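One simple way to divide a range into intervals is equal-width partitioning, sketched below. The choice of equal-width intervals, the function names, and the sample ages are assumptions; the source does not name a specific discretization method.

```python
def equal_width_edges(values, n_intervals):
    """Return the n_intervals + 1 boundaries of equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    return [low + i * width for i in range(n_intervals + 1)]


def discretize(value, edges):
    """Map a value to its interval index; the maximum falls in the last one."""
    last = len(edges) - 2
    for i in range(last + 1):
        if value < edges[i + 1]:
            return i
    return last


# Hypothetical continuous attribute.
ages = [0, 12, 25, 37, 50, 63, 80]
edges = equal_width_edges(ages, 4)
labels = [discretize(a, edges) for a in ages]
print(edges)   # [0.0, 20.0, 40.0, 60.0, 80.0]
print(labels)  # [0, 0, 1, 1, 2, 3, 3]
```

After this step, the seven distinct ages are represented by four interval labels, which an algorithm limited to categorical attributes can handle.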