
Unit III

Introduction to Data Mining

1
What Is Data Mining?

 Data mining (knowledge discovery from data)
 Extraction of interesting (implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Alternative name
 Knowledge discovery in databases (KDD)
 Watch out: Is everything “data mining”? Not necessarily, e.g.:
 Query processing
 Expert systems or statistical programs

2
Data Mining: A KDD Process

 Data mining is the core of the knowledge discovery process
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
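
The stages named on this slide (cleaning, integration, selection, mining, pattern evaluation) can be sketched as a chain of small functions. This is only an illustrative toy, not a real mining system: the records, the function names, and the "count item purchases" mining step are all assumptions made for the example.

```python
# Hypothetical toy records from two "databases"; all names and values are illustrative.
sales_db = [
    {"customer": "a", "item": "bread", "amount": 2.5},
    {"customer": "b", "item": "milk", "amount": None},  # missing value
    {"customer": "a", "item": "milk", "amount": 1.2},
]
web_db = [
    {"customer": "c", "item": "bread", "amount": 2.4},
    {"customer": "c", "item": "milk", "amount": 1.3},
]

def clean(records):
    """Data cleaning: drop records that contain a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one store."""
    return [r for src in sources for r in src]

def select(records, wanted_fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in wanted_fields} for r in records]

def mine(records):
    """Data mining (stand-in step): count how often each item was bought."""
    counts = {}
    for r in records:
        counts[r["item"]] = counts.get(r["item"], 0) + 1
    return counts

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that pass an interestingness threshold."""
    return {k: v for k, v in patterns.items() if v >= min_count}

warehouse = integrate(clean(sales_db), clean(web_db))
task_data = select(warehouse, ["customer", "item"])
knowledge = evaluate(mine(task_data), min_count=2)
print(knowledge)
```

Each function corresponds to one box in the figure, so a different mining step can be swapped in without touching the preparation stages.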
3
Data Mining: On What Kinds of Data?
 Relational database
 Data warehouse
 Transactional database
 Advanced database and information repository
 Spatial and temporal data

 Time-series data

 Stream data

 Multimedia database

 Text databases & WWW

7
Why Data Mining?—Potential Applications

 Data analysis and decision support


 Market analysis and management
 Target marketing, customer relationship
management (CRM), market basket analysis,
market segmentation
 Risk analysis and management
 Forecasting, customer retention, quality control,
competitive analysis
 Fraud detection and detection of unusual patterns
(outliers)
8
Market Analysis and Management

 Where does the data come from?


 Credit card transactions, discount coupons, customer
complaint calls
 Target marketing
 Find clusters of “model” customers who share the
same characteristics: interest, income level, spending
habits etc.
 Determine customer purchasing patterns over time

9
Market Analysis and Management

 Cross-market analysis
 Associations/correlations between product sales, and prediction based on such associations
 Customer profiling
 What types of customers buy what products
 Customer requirement analysis
 Identifying the best products for different customers
 Predict what factors will attract new customers
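
As a rough sketch of how associations between product sales can be quantified, the snippet below counts co-purchased pairs and computes support and confidence, two standard association measures that the slide itself does not name. The baskets are made-up data and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Illustrative baskets (assumed data, not taken from the slides).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Count how often each pair of products is sold together.
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

print(pair_counts.most_common())
print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

A rule such as "bread implies milk" with high support and confidence is the kind of association a prediction could be based on.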

10
Fraud Detection & Mining Unusual Patterns

 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service,
telecomm.
 Medical insurance
 Professional patients, and rings of doctors
 Unnecessary or correlated screening tests
 Telecommunications:
 Phone call model: destination of the call, duration, time of day
or week. Analyze patterns that deviate from an expected norm
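
One simple way to flag calls that deviate from an expected norm is to score each new call against the mean and standard deviation of past behaviour. The sketch below assumes call duration is the only feature and uses a z-score threshold of 3; the data, the threshold, and the function name are all illustrative choices, not part of the slides.

```python
from statistics import mean, stdev

# Assumed past call durations (minutes) for one subscriber: the "expected norm".
history_minutes = [3, 4, 5, 4, 3, 5, 4, 4]
# New calls to be checked against that norm.
new_calls = [4, 5, 60]

norm_mean = mean(history_minutes)
norm_sd = stdev(history_minutes)  # sample standard deviation

def deviates(duration, threshold=3.0):
    """True if the duration is more than `threshold` standard deviations from the norm."""
    return abs(duration - norm_mean) / norm_sd > threshold

flagged = [d for d in new_calls if deviates(d)]
print(flagged)
```

A real phone call model would combine several features (destination, duration, time of day or week), but the idea of comparing against a learned norm is the same.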

11
Architecture: Typical Data Mining System

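[Figure: layered architecture, top to bottom. Graphical user interface → Pattern evaluation → Data mining engine → Database or data warehouse server → Data cleaning, data integration, and filtering → Databases and Data Warehouse. A Knowledge-base sits alongside the pattern evaluation and data mining engine layers.]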

12
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]

13
Data Mining: Classification Schemes

 Different views, different classifications


 Kinds of data to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted

14
Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
 Multiple/integrated functions and mining at multiple
levels

15
Multi-Dimensional View of Data Mining
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining,
etc.

16
Major Issues in Data Mining (1)

 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results

17
Major Issues in Data Mining (2)

 Efficiency and Scalability


 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

18
Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

19
Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents: term-frequency vector
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example term-frequency vectors (document data):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Milk
4    Beer, Bread, Milk
5    Coke, Milk
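
To make the term-frequency vector idea concrete, here is a minimal sketch that turns short texts into count vectors over a fixed vocabulary. The vocabulary, the two sample texts, and the function name are assumptions for illustration; real systems would also handle punctuation, stemming, and much larger vocabularies.

```python
from collections import Counter

# Assumed vocabulary and documents, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
documents = {
    "doc1": "team play team score play team",
    "doc2": "coach ball coach",
}

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vectors = {
    name: term_frequency_vector(text, vocabulary)
    for name, text in documents.items()
}
for name, vec in vectors.items():
    print(name, vec)
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.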

20
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
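
The row/column correspondence can be sketched in code: each database row becomes one object, and each column becomes a named attribute of that object. The `Customer` class and the sample rows below are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass
class Customer:
    # Each field corresponds to one database column (attribute).
    customer_id: int
    name: str
    address: str

# Assumed rows, as they might come back from a customer table.
rows = [
    (1, "Ann", "12 Oak St"),
    (2, "Bob", "9 Elm Rd"),
]

# Each row becomes one data object.
customers = [Customer(*row) for row in rows]
attribute_names = [f.name for f in fields(Customer)]

print(attribute_names)
print(customers[0])
```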
21
Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled

22
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {black, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, rankings

23
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order and can be positive or negative.
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 A numeric attribute with an inherent zero-point.
 We can speak of a value as being a multiple of another value (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts
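
A small worked example may help show why ratios are meaningful on a ratio scale but not on an interval scale. It compares 5 °C and 10 °C with the same temperatures in Kelvin; the variable names are illustrative.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to Kelvin (Kelvin has a true zero-point)."""
    return c + 273.15

c_low, c_high = 5.0, 10.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences agree on both scales: interval scales support subtraction.
difference_celsius = c_high - c_low
difference_kelvin = k_high - k_low

# Ratios do not agree: 10 °C is not "twice as hot" as 5 °C.
ratio_celsius = c_high / c_low   # numerically 2.0, but physically meaningless
ratio_kelvin = k_high / k_low    # about 1.018, the physically meaningful ratio

print(difference_celsius, difference_kelvin)
print(ratio_celsius, ratio_kelvin)
```

The Celsius ratio changes if the zero of the scale is moved, which is why only differences are interpreted for interval-scaled attributes.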

24
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
25
Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

26
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 Median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
27
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
 Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Median:
 Middle value if odd number of values, or average of the middle two values otherwise
 Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
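
The measures on this slide can be computed with a few lines of Python. The sketch below uses the standard `statistics` module where a function exists and small hand-written helpers (hypothetical names) for the weighted mean, trimmed mean, and grouped-data median. The data values and the grouped intervals are made up for illustration. In `grouped_median`, `lower` plays the role of L1 and `cumulative` is the sum of frequencies of the intervals below the median interval.

```python
from statistics import mean, median, multimode

# Assumed sample data.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(xs)
    return mean(ordered[k:len(ordered) - k])

def grouped_median(intervals):
    """Interpolated median for grouped data.

    `intervals` is a list of (lower_bound, width, frequency), sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

print(mean(values))             # arithmetic mean
print(median(values))           # average of the two middle values (even count)
print(multimode(values))        # all most-frequent values; two here, so bimodal
print(trimmed_mean(values, 1))  # drops 30 and 110
print(weighted_mean([80, 90], [1, 3]))
print(grouped_median([(0, 10, 5), (10, 10, 8), (20, 10, 12), (30, 10, 5)]))
```

Note how the single large value 110 pulls the mean above the median, while the trimmed mean moves back toward the middle of the data.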
28
Symmetric vs. Skewed Data

 Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]
Data Mining: Concepts and Techniques, 7/15/21
29
Measuring the Dispersion of Data
(Refer PPTs uploaded on MOODLE)

 Quartiles: Q1 (25th percentile), Q3 (75th percentile)


 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of variance s² (or σ²)


 Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
 Histogram: the x-axis shows values, the y-axis shows frequencies.
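
The dispersion measures above can be checked numerically. The sketch below computes the five-number summary, the 1.5 × IQR outlier fences, and the variance in both its definitional and its one-pass "shortcut" form. The data are made up, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption: other conventions give slightly different Q1 and Q3.

```python
from math import sqrt
from statistics import quantiles

# Assumed sample data.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(values)

# Quartiles and five-number summary.
q1, med, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
five_number = (min(values), q1, med, q3, max(values))

# Values beyond the 1.5 x IQR fences are treated as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < low_fence or x > high_fence]

# Sample variance: definition vs. the form that needs only sum(x) and sum(x^2).
mean_x = sum(values) / n
s2_definition = sum((x - mean_x) ** 2 for x in values) / (n - 1)
s2_shortcut = (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

# Population variance, treating the same values as a whole population.
sigma2 = sum(x * x for x in values) / n - mean_x ** 2

s = sqrt(s2_definition)  # sample standard deviation

print(five_number, iqr, outliers)
print(s2_definition, s2_shortcut, sigma2, s)
```

The shortcut form is what makes the computation scalable: the two running sums can be accumulated in a single pass over the data, or per partition and then combined.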