
Data Mining:

Concepts and Techniques

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix, crosstabs
 Document data: text documents represented as term-frequency vectors

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

 Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

 Graph and network
 World Wide Web
 Social or information networks
 Molecular structures
 Ordered
 Video data: sequence of images
 Temporal data: time series
 Sequential data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia
 Spatial data: maps
 Image data
 Video data
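
A term-frequency vector counts how often each vocabulary term occurs in a document. The sketch below is one minimal way to build such a vector. The vocabulary, the sample sentence, and the function name `term_frequency_vector` are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return the count of each vocabulary term in `text`, in vocabulary order."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "the team will play and play until the coach stops the team"

vector = term_frequency_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.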

Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric: quantitative
   Interval-scaled
   Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).
 E.g., temperature in Kelvin, length, counts, monetary quantities
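
The difference between the interval and ratio scales can be checked with a little arithmetic. The sketch below uses two illustrative temperatures (the values are arbitrary). Differences agree on both scales, but only the Kelvin ratio is meaningful, because Celsius has no true zero-point.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


# Two illustrative temperatures in degrees Celsius.
warm_c, cool_c = 10.0, 5.0

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# The Celsius ratio is 2.0, but that does not mean "twice as hot".
naive_ratio = warm_c / cool_c

# On the Kelvin (ratio) scale the same two temperatures differ by under 2%.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, round(difference_k, 6), naive_ratio, round(kelvin_ratio, 4))
```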
Discrete vs. Continuous Attributes
 Discrete attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: binary attributes are a special case of discrete attributes
 Continuous attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
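
The finite-precision point has a practical consequence for floating-point attributes: exact equality tests can fail. A minimal sketch, with values chosen only for illustration:

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary floating-point representation.
total = 0.1 + 0.2

exactly_equal = total == 0.3               # False: rounding error
close_enough = math.isclose(total, 0.3)    # True: compares within a tolerance

print(exactly_equal, close_enough)
```

When comparing continuous attribute values, a tolerance-based comparison is usually the safer choice.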

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Data Pre-processing (Introduction)

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

The Data Analysis Pipeline
 Mining is not the only step in the analysis process.
 Pre-processing: real data is noisy, incomplete, and inconsistent. Data cleaning is required to make sense of the data.
 Techniques: sampling, dimensionality reduction, feature selection
 It is dirty work, but it is often the most important step of the analysis.
 Post-processing: make the data actionable and useful to the user.
 Statistical analysis of importance
 Visualization

Data Quality
Examples of data quality problems:
 Noise and outliers

 Missing values

 Duplicate data
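
The three problems can be detected with simple checks. The sketch below runs on a tiny made-up table. The records, the field names, and the plausible height range are all assumptions for illustration, and a fixed range is only one crude way to flag outliers.

```python
# Made-up records: one missing value, one implausible value, one duplicate.
records = [
    {"id": 1, "height_cm": 172.0},
    {"id": 2, "height_cm": None},
    {"id": 3, "height_cm": 1720.0},
    {"id": 1, "height_cm": 172.0},
    {"id": 4, "height_cm": 165.5},
]

# Assumed plausible range for a human height, in centimetres.
PLAUSIBLE_RANGE_CM = (50.0, 250.0)
low, high = PLAUSIBLE_RANGE_CM

# Missing values: the attribute was never recorded.
missing = [r for r in records if r["height_cm"] is None]

# Outliers: recorded values outside the assumed plausible range.
outliers = [
    r for r in records
    if r["height_cm"] is not None and not (low <= r["height_cm"] <= high)
]

# Duplicates: records identical to one seen earlier.
seen = set()
duplicates = []
for r in records:
    key = (r["id"], r["height_cm"])
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(missing), len(outliers), len(duplicates))
```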

Sampling
 Sampling is the main technique employed for data selection.
 It is often used for both the preliminary investigation of the data and the final data analysis.
 Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
 Example: What is the average height of a person in Ioannina?
   We cannot measure the height of everybody.
 Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
 Example: We have 1M documents. What fraction has at least 100 words in common?
   Computing the number of common words for all pairs requires on the order of 10^12 comparisons.
 Example: What fraction of tweets in a year contain the word “Greece”?
   300M tweets per day, if 100 characters on average, 86.5TB to store all tweets
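
The pair count in the document example can be checked directly, and the same sketch shows how a random sample of pairs sidesteps the full computation. The function name `sample_pairs` and the sample size of 10,000 are assumptions for illustration.

```python
import math
import random

n_documents = 1_000_000

# Unordered pairs of distinct documents: n * (n - 1) / 2, about 5 * 10^11.
unordered_pairs = math.comb(n_documents, 2)

# Counting ordered pairs (every document against every document) gives 10^12.
ordered_pairs = n_documents ** 2


def sample_pairs(n_items, n_pairs, seed=0):
    """Draw `n_pairs` random pairs of distinct item indices."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n_items), 2)) for _ in range(n_pairs)]


# Compare only a sample of pairs instead of all of them.
pairs = sample_pairs(n_documents, 10_000)
print(unordered_pairs, ordered_pairs, len(pairs))
```

The fraction of sampled pairs that share at least 100 words then serves as an estimate of the fraction over all pairs.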

Sampling
 The key principle for effective sampling is the following:
 Using a sample will work almost as well as using the entire data set, if the sample is representative.
 A sample is representative if it has approximately the same property (of interest) as the original set of data.
 Otherwise, we say that the sample introduces some bias.
 What happens if we take a sample from the university campus to compute the average height of a person in Ioannina?
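
The campus question can be explored with a small simulation. Everything in the sketch below is synthetic. The group sizes and the height distributions are made-up numbers, and the assumption that people on campus are taller on average is only there to produce a visible bias.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic population (made-up numbers): 20% "campus" members with heights
# drawn around 175 cm, 80% other residents with heights drawn around 168 cm.
population = (
    [("campus", rng.gauss(175, 7)) for _ in range(20_000)]
    + [("other", rng.gauss(168, 7)) for _ in range(80_000)]
)
true_mean = statistics.fmean(h for _, h in population)

# Representative sample: every person is equally likely to be picked.
random_sample = rng.sample(population, 1_000)
random_estimate = statistics.fmean(h for _, h in random_sample)

# Biased sample: drawn from the campus group only.
campus_only = [p for p in population if p[0] == "campus"]
campus_sample = rng.sample(campus_only, 1_000)
campus_estimate = statistics.fmean(h for _, h in campus_sample)

print(round(true_mean, 1), round(random_estimate, 1), round(campus_estimate, 1))
```

Under these assumptions the random sample lands close to the population mean, while the campus-only sample overestimates it.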

Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any particular item.
 Sampling without replacement
 As each item is selected, it is removed from the population.
 Sampling with replacement
 Objects are not removed from the population as they are selected for the sample.
 The same object can be picked more than once. This makes analytical computation of probabilities easier.
 E.g., we have 100 people: 51 are women, P(W) = 0.51, and 49 are men, P(M) = 0.49. If I pick two people, what is the probability P(W,W) that both are women?
   Sampling with replacement: P(W,W) = 0.51^2
   Sampling without replacement: P(W,W) = 51/100 * 50/99
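
The two probabilities can be computed exactly and cross-checked by simulation. This is a minimal sketch. The helper name `simulate`, the trial count, and the seed are arbitrary choices.

```python
import random
from fractions import Fraction

women, total = 51, 100

# With replacement the two draws are independent: (51/100)^2 = 0.2601.
p_with = Fraction(women, total) ** 2

# Without replacement the second draw sees 50 women among 99 people.
p_without = Fraction(women, total) * Fraction(women - 1, total - 1)

population = ["W"] * women + ["M"] * (total - women)
rng = random.Random(0)


def simulate(draw, trials=100_000):
    """Estimate the probability that a two-person draw is two women."""
    hits = sum(1 for _ in range(trials) if draw() == ["W", "W"])
    return hits / trials


sim_with = simulate(lambda: rng.choices(population, k=2))   # with replacement
sim_without = simulate(lambda: rng.sample(population, 2))   # without replacement

print(float(p_with), float(p_without), sim_with, sim_without)
```

`random.choices` draws with replacement and `random.sample` draws without, so the two calls mirror the two sampling schemes.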

Types of Sampling
 Stratified sampling
 Split the data into several groups, then draw random samples from each group.
 Ensures that every group is represented.
 Example 1. I want to understand the differences between legitimate and fraudulent credit card transactions. 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random?
 I get 1 fraudulent transaction (in expectation). Not enough to draw any conclusions.
 Solution: sample 1000 legitimate and 1000 fraudulent transactions.
 Example 2. I want to answer the question: Do web pages that are linked have on average more words in common than those that are not? I have 1M pages and 1M links. What happens if I select 10K pairs of pages at random?
 Most likely I will not get any linked pairs.
 Solution: sample 10K random pairs and 10K linked pairs.
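
Example 1 can be sketched in a few lines. The transaction table below is synthetic, with exactly 0.1% of rows marked fraudulent. The function name `stratified_sample` and its parameters are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` random records from each group defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Sample without replacement inside each stratum.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Synthetic transactions as (id, is_fraud); every 1000th one is fraudulent.
transactions = [(i, i % 1000 == 0) for i in range(1_000_000)]

# Simple random sampling: about 1 fraudulent transaction in expectation.
simple = random.Random(0).sample(transactions, 1_000)
fraud_in_simple = sum(1 for _, is_fraud in simple if is_fraud)

# Stratified sampling: 1000 legitimate and 1000 fraudulent transactions.
stratified = stratified_sample(transactions, key=lambda t: t[1], per_stratum=1_000)
fraud_in_stratified = sum(1 for _, is_fraud in stratified if is_fraud)

print(fraud_in_simple, fraud_in_stratified, len(stratified))
```

Because the strata are sampled at different rates, any estimate for the whole population has to reweight the groups by their true proportions.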

Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion, graphical displays
 Data visualization: map data onto graphical primitives
 Measuring data similarity
 The steps above are the beginning of data preprocessing.
 Many methods have been developed, but this is still an active area of research.
