Professional Documents
Culture Documents
http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
shahid.awan@umt.edu.pk
University of Management and Technology
Spring 2016
Data Mining:
Concepts and Techniques
— Chapter 2 —
5
Knowing such basic statistics regarding each attribute makes it easier to fill in
missing values,
smooth noisy values,
and spot outliers during data preprocessing.
Knowledge of the attributes and attribute values can also help in fixing
inconsistencies incurred during data integration.
Plotting the measures of central tendency shows us if the data are symmetric
or skewed.
Quantile plots, histograms, and scatter plots are other graphic displays of
basic statistical descriptions.
These can help identify relations, trends, and biases “hidden” in unstructured
data sets.
6
Data First, Data Mining Later
7
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
8
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
timeout
season
coach
game
score
team
ball
lost
pla
Document data: text documents: term-frequency
wi
n
y
vector
Transaction data Document 1 3 0 5 0 2 6 0 2 0 2
Graph and network
Document 2 0 7 0 2 1 0 0 3 0 0
World Wide Web
Social or information networks Document 3 0 1 0 0 1 2 2 0 3 0
Molecular Structures
Ordered
Video data: sequence of images TID Items
Temporal data: time-series 1 Bread, Coke, Milk
Sequential Data: transaction sequences 2 Beer, Bread
Genetic sequence data
3 Beer, Coke, Diaper, Milk
Spatial, image and multimedia:
4 Beer, Bread, Diaper, Milk
Spatial data: maps
5 Coke, Diaper, Milk
Image data:
Video data:
9
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
10
Data Objects
11
Attributes
Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
E.g., customer _ID, name, address
observations.
Univariate, bivariate,
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
12
Attribute Types
Nominal: Relating to Names
categories, states, or “names of things”
13
Attribute Types
Binary
14
Attribute Types
Ordinal
15
Numeric Attribute Types
Measurable Quantity (integer or real-valued)
Interval Scaled
Inherent zero-point
Start from ‘0’
collection of documents
18
Discrete vs. Continuous Attributes
Continuous Attribute
19