ITS632 Lecture2 Data

ITS-632 Introduction to Data Mining
Kwang Lee, Ph.D.

Computer and Information Science
Cumberland University
Getting to Know Your Data
Lecture 2
Announcement!!!
Assignment #1: 11:59pm, Saturday
Copyright © Prof. Kwang Lee All rights reserved.

Lecture Overview
◼ Learn about data attribute types:
◼ Nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Study many types of data sets,
◼ e.g., numerical, text, graph, Web, image
◼ Learn about data visualization:
◼ map data onto graphical primitives
◼ Measure data similarity
◼ Review all steps of data preprocessing
◼ Explore the many methods that have been developed; this is still
an active area of research
◼ Know data quality issues
Chapter 3: Nature of Data, Statistical Modeling
and Visualization
◼ Data Types, Objects, Attributes
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Data Quality
◼ Summary
1. Types of Data Sets
◼ Data is the lowest level of abstraction from which
information and knowledge are derived
◼ Data is the source for information and knowledge
◼ Data is a collection of facts
◼ usually obtained as the result of experiences,
observations, or experiments
◼ Data may consist of numbers, words, images, …
◼ Thus, data quality and data integrity → critical to
analytics
◼ Data analytics is the process of finding patterns and
correlations within large data sets to predict outcomes,
so knowing the types and properties of the data is important
◼ A simple taxonomy of data:
◼ Here, we group data sets into four categories, namely:
◼ Record Data
◼ Graph and Network Data
◼ Ordered Data
◼ Spatial, Image, and Multimedia Data
1.1 Record Data
◼ In data science, a record (also called a structure, struct, or
compound data) is a basic data structure.
◼ Records in a database or spreadsheet are usually called
"rows"; the fields of a record are the "columns".
◼ Relational records
◼ Transaction data
◼ Document data: text documents
◼ Data matrix, e.g., numerical matrix, crosstabs
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
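The transaction table above can also be viewed as a data matrix. The following is a minimal sketch, assuming each transaction is stored as a set of item names; the helper name to_binary_matrix is hypothetical and not part of the lecture.

```python
# Transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def to_binary_matrix(transactions):
    """Return (sorted item list, {tid: row of 0/1 flags}) for a dict of item sets."""
    items = sorted(set().union(*transactions.values()))
    matrix = {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }
    return items, matrix


items, matrix = to_binary_matrix(transactions)
print(items)
for tid, row in matrix.items():
    print(tid, row)
```

Each row is one record and each column is one asymmetric binary attribute (item present or absent), which is the form many mining algorithms expect.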
1.2 Graph and Network Data
◼ Graph and network data represent objects and the links
between them, often in connection with human-related data
that exhibit social properties
◼ e.g., patterns in a graph from which human behavioral
patterns can be analyzed and mined for valuable
information.
◼ World Wide Web, social or information networks, molecular
structures
1.3 Ordered Data
◼ Ordered data is data in which the records or values have a
meaningful order, for example in time or in position within
a sequence (not to be confused with ordinal attributes)
◼ Temporal data: time-series
◼ Sequential data: transaction sequences
◼ Video data: sequence of images
1.4 Spatial, Image, and Multimedia Data
◼ Spatial data, also known as geospatial data (e.g., maps), is
information about a physical object that can be represented
by numerical values in a geographic coordinate system.
◼ Multimedia data refers to data that combines multiple
types of media to capture information and experiences
related to objects and events.
◼ E.g., image data and video data
2. Data Objects
◼ A data object is a region of storage that contains a value or
group of values. A data object can represent an entity
described by several attributes.
◼ Examples:
◼ Sales database: customers, store items, sales
◼ Medical database: patients, treatments
◼ University database: students, professors, courses
◼ As the examples above show, data sets are made up of data
objects
2.1 Attributes
◼ Data objects are described by attributes. An attribute
is a data field representing a property, characteristic, or
feature of a data object
◼ E.g., customer_ID, name, address
◼ Attribute types:
◼ Nominal
◼ Binary
◼ Ordinal
◼ Numeric
◼ Discrete/continuous
2.2 Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary: nominal attribute with only 2 states (0 and 1,
true and false)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important
◼ e.g., medical test (positive vs. negative)
◼ e.g., COVID-19 test (positive vs. negative)
2.2 Attribute Types
◼ Ordinal: a categorical, statistical data type where the
variables have natural, ordered categories and the
distances between the categories are not known
◼ Values have a meaningful order (ranking), but the magnitude
of the difference between successive values is not known.
◼ E.g., size = {small, medium, large}, grades = {A, B, C, D, F},
army rankings
◼ Customer satisfaction has the following ordinal
categories:
◼ 4: very satisfied
◼ 3: satisfied
◼ 2: neutral
◼ 1: somewhat dissatisfied
◼ 0: very dissatisfied
2.2 Attribute Types
◼ Numeric:
◼ A quantity, either integer or real-valued
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ Ratio
◼ We can speak of a value as being a multiple (or ratio) of
another value (10 K is twice as high as 5 K).
◼ Inherent zero-point
◼ E.g., Kelvin temperature scale, length, counts,
monetary quantities
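The difference between interval and ratio scales can be checked numerically. The sketch below is illustrative only: the two temperatures are made-up values, and the helper celsius_to_kelvin is a hypothetical name.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15


c_low, c_high = 5.0, 10.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences are meaningful on both scales and agree with each other.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratios are only meaningful on the ratio scale (Kelvin has a true zero).
ratio_c = c_high / c_low  # 2.0, yet 10 °C is not "twice as hot" as 5 °C
ratio_k = k_high / k_low  # close to 1.018

print(diff_c, diff_k, ratio_c, ratio_k)
```

The Celsius ratio changes if the zero point is moved (for example by converting to Fahrenheit), which is why ratios are not meaningful for interval-scaled attributes.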
2.2 Attribute Types
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., zip codes, profession, or the set of words in a collection of
documents
◼ Sometimes, represented as integer variables
◼ Note: Binary attributes are a special case of discrete
attributes
2.2 Attribute Types
◼ Continuous Attribute
◼ Continuous attributes are typically represented as
floating-point variables
◼ It has real numbers as attribute values
◼ E.g., temperature, height, or weight
◼ Practically, real values can only be measured and
represented using a finite number of digits
3. Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data
onto graphical primitives as follows:
◼ Provide qualitative overview of large data sets
◼ Help find interesting regions and suitable
parameters for further quantitative analysis
◼ Provide visual proof of the derived computer representations
◼ Help search for patterns, trends, structure, irregularities,
relationships among data
3. Data Visualization
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations
3.1 Pixel-Oriented Visualization Techniques
◼ The basic idea of pixel-oriented visualization techniques
is to represent as many data objects as possible on the
screen at the same time by mapping each data value to a
pixel of the screen and arranging the pixels adequately.
[Figure: pixel windows for (a) income, (b) credit limit, (c) transaction volume, (d) age]
3.1 Pixel-Oriented Visualization Techniques
◼ For a data set of n dimensions, create n windows on the
screen, one for each dimension
◼ The n dimension values of a record are mapped to n pixels
at the corresponding positions in the windows
◼ E.g., the colors of the pixels reflect the corresponding
values
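The window construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: records are dictionaries of numeric attributes, pixels are laid out row by row in a roughly square grid, and each value is min-max scaled to a shade in [0, 1]. The function name pixel_windows and the customer values are hypothetical.

```python
import math


def pixel_windows(records, sort_key):
    """Build one 2-D grid of shades in [0, 1] per attribute.

    Records are sorted by sort_key, so a given record occupies the same
    pixel position in every window.
    """
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    width = math.ceil(math.sqrt(n))  # roughly square window
    windows = {}
    for attr in ordered[0]:
        values = [r[attr] for r in ordered]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        shades = [(v - lo) / span for v in values]
        # Row-major layout: chunk the shade list into rows of `width` pixels.
        windows[attr] = [shades[i:i + width] for i in range(0, n, width)]
    return windows


customers = [
    {"income": 30, "credit_limit": 2, "age": 25},
    {"income": 90, "credit_limit": 9, "age": 50},
    {"income": 60, "credit_limit": 5, "age": 41},
    {"income": 45, "credit_limit": 4, "age": 33},
]
windows = pixel_windows(customers, sort_key="income")
for attr, grid in windows.items():
    print(attr, grid)
```

Sorting by one attribute makes correlations visible: if another window shows a similar gradient, that attribute tends to rise with the sort attribute.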
(1) Laying Out Pixels in Circle Segments
◼ To save space and show the connections among multiple
dimensions, space filling is often done in a segment
[Figure: (a) representing a data record in a circle segment; (b) laying out pixels in a hexagon segment]
(1) Laying Out Pixels in Circle Segments
◼ Therefore, we can display large information on the small
screen interface
3.2 Geometric Projection Visualization
Techniques
◼ Geometric projection techniques help users to find
interesting projections of multidimensional data sets.
◼ Visualization of geometric transformations and projections
of the data
◼ A scatter plot displays 2-D data points using Cartesian coordinates.
◼ A third dimension can be added by using different colors or shapes
to represent different data points
3.2 Geometric Projection Visualization
Techniques
◼ Geometric projection methods:
◼ Direct data visualization
◼ Scatterplot matrices
◼ Landscapes
◼ Parallel coordinates
◼ Projection pursuit technique: Help users find meaningful
projections of multidimensional data
◼ Prosection views
◼ Hyperslice
(1) Direct Data Visualization
◼ Direct visualizations of image data make use of the images in their
original visible format
[Figure: Ribbons with Twists Based on Vorticity. Source: Data Mining: Concepts and Techniques]

(2) Scatterplot Matrices
◼ A scatter plot matrix is a grid or matrix of scatter plots used to
visualize bivariate relationships between combinations of variables.
Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k² − k)/2 distinct scatterplots]
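A scatterplot matrix needs one panel per unordered pair of attributes, and k attributes have k(k − 1)/2 such pairs. The sketch below enumerates them with the standard library; the attribute names are illustrative and the helper scatterplot_pairs is a hypothetical name.

```python
from itertools import combinations


def scatterplot_pairs(attributes):
    """Return every unordered pair of attributes, one per scatterplot panel."""
    return list(combinations(attributes, 2))


attrs = ["income", "credit_limit", "transaction_volume", "age"]
pairs = scatterplot_pairs(attrs)
k = len(attrs)

for x_attr, y_attr in pairs:
    print(f"panel: {x_attr} vs. {y_attr}")
print(len(pairs), "panels for k =", k)
```

A full k-by-k grid shows each pair twice (mirrored across the diagonal), so only the upper or lower triangle carries distinct information.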

(3) Landscapes
◼ Visualization of the data as a perspective landscape
◼ The data needs to be transformed into a 2-D spatial representation,
which is then drawn like the visible features of an area of land
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
(4) Parallel Coordinates
◼ A parallel coordinates plot allows us to compare the features
of several individual observations on a set of numeric
variables
◼ It is a visualization technique used to plot individual data
elements across many performance measures.
◼ Each axis is scaled to the [minimum, maximum] range of the
corresponding attribute
[Figure: k parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
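The per-axis scaling step of a parallel coordinates plot can be sketched as below, assuming plain min-max scaling so that every axis spans [0, 1]. The rows are made-up values standing in for three observations with three numeric attributes, and scale_axes is a hypothetical helper name.

```python
def scale_axes(rows):
    """Min-max scale each attribute (column) to [0, 1].

    Each returned row is the polyline of one observation: its height on
    axis 1, axis 2, ..., axis k.
    """
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # constant attribute: keep everything at 0
        scaled_columns.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_columns)]


rows = [
    (18.0, 8, 3504.0),
    (30.0, 4, 2000.0),
    (24.0, 6, 2752.0),
]
polylines = scale_axes(rows)
for line in polylines:
    print(line)
```

Lines that cross between two neighboring axes suggest a negative relationship between those attributes; lines that stay roughly parallel suggest a positive one.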

(4) Parallel Coordinates
◼ Seven columns from the cars table. The lines are color
encoded by the origin countries of the cars
3.3 Icon-Based Visualization Techniques
◼ Uses icons to represent multidimensional data values
◼ General techniques:
◼ Shape coding: Use shape to represent certain information
encoding
◼ Color icons: Use color icons to encode more information
◼ Tile bars: Use small icons to represent the relevant feature
vectors in document retrieval
3.3 Icon-Based Visualization Techniques
◼ This is the visualization of large multi-variate data values
as features of icons
◼ It is still a challenging task, especially when we
consider the exploration of a variety of attributes in one
representation
◼ Typical visualization methods
◼ Chernoff faces
◼ Stick figures
(1) Chernoff Faces
◼ A way to display variables on a two-dimensional surface,
e.g., let x be eyebrow slant, y be eye size, z be nose
length, etc.
◼ The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening)
(2) Stick Figure
◼ A stick figure is a very simple drawing of a person or
animal, composed of a few lines, curves, and dots.
◼ E.g., A census data figure showing age, income,
gender, education, etc.
◼ E.g., Family stick figure
3.4 Hierarchical Visualization Techniques
◼ Hierarchical visualization techniques show data whose
structure is a hierarchy, or impose a hierarchy on the data.
◼ Visualization of the data using a hierarchical
partitioning into subspaces
◼ Methods
◼ Dimensional stacking
◼ Worlds-within-Worlds
◼ Tree-map
◼ Cone trees
◼ InfoCube
(1) Dimensional Stacking
[Figure: dimensional stacking with axes labeled attribute 1, attribute 2, attribute 3, and attribute 4]
◼ Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
◼ Partitioning of the attribute value ranges into classes. The important
attributes should be used on the outer levels.
◼ Adequate for data with ordinal attributes of low
cardinality, but difficult to display more than nine
dimensions
◼ Important to map dimensions appropriately
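The stacking of 2-D subspaces can be sketched as an index calculation. The example below assumes four attributes already discretized into class indices, with attributes 1 and 2 on the outer level and attributes 3 and 4 on the inner level; that assignment and the function name stacked_position are assumptions made for illustration.

```python
from itertools import product


def stacked_position(bins, sizes):
    """Map four class indices to a (column, row) cell of the stacked grid.

    bins  = (a1, a2, a3, a4): class index of each attribute
    sizes = (n1, n2, n3, n4): number of classes per attribute
    Attribute 1 picks the outer column block and attribute 3 the column
    inside it; attributes 2 and 4 do the same for rows.
    """
    a1, a2, a3, a4 = bins
    n1, n2, n3, n4 = sizes
    column = a1 * n3 + a3
    row = a2 * n4 + a4
    return column, row


sizes = (2, 2, 3, 2)
cell = stacked_position((1, 0, 2, 1), sizes)
print(cell)

# Every combination of classes lands in its own cell of a 6 x 4 grid.
all_cells = {
    stacked_position(b, sizes)
    for b in product(*(range(n) for n in sizes))
}
print(len(all_cells))
```

Because the outer attributes determine the coarse block structure, they dominate the visual impression, which is why the important attributes belong on the outer levels.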
(2) Worlds-within-Worlds
◼ Assign the function and the two most important parameters to
the innermost world. Fix all other parameters at constant
values, then draw other (1-, 2-, or 3-dimensional) worlds
choosing these as the axes
◼ Software N–vision: Dynamic interaction through data glove and
stereo displays, including rotation, scaling (inner) and translation
(inner/outer)
(3) Tree-Map
◼ A tree-map is a method for displaying hierarchical data
using nested figures, usually rectangles
◼ The information is displayed as a cluster of rectangles
varying in size and color, depending on their data value
(3) Tree-Map
◼ A screen-filling method can be used in a hierarchical
partitioning of the tiles into regions depending on the
attribute values
◼ The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
[Figure: MSR Netscan image. Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg]
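The alternating partition of the x- and y-dimensions can be sketched with the classic slice-and-dice layout. This is one simple tree-map layout, not necessarily the one used to produce the figure; the tuple format (name, size, children) and the function name slice_and_dice are assumptions for illustration.

```python
def slice_and_dice(node, x, y, w, h, split_x=True, out=None):
    """Lay out the leaves of a tree as rectangles (name, x, y, w, h).

    node is (name, size, children). Children share the parent's rectangle
    in proportion to their size; the split direction alternates per level.
    """
    if out is None:
        out = []
    name, _size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent already used
    for child in children:
        frac = child[1] / total
        if split_x:
            slice_and_dice(child, x + offset * w, y, w * frac, h, False, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, h * frac, True, out)
        offset += frac
    return out


tree = ("root", 8, [
    ("A", 4, []),
    ("B", 4, [("B1", 1, []), ("B2", 3, [])]),
])
tiles = slice_and_dice(tree, 0.0, 0.0, 8.0, 4.0)
for tile in tiles:
    print(tile)
```

Each leaf's area is proportional to its size, so the tile areas always add up to the area of the screen region given to the root.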
(4) InfoCube
◼ InfoCube is a 3-D visualization technique where
hierarchical information is displayed as nested
semi-transparent cubes
◼ The outermost cube corresponds to the top-level data,
the lower-level data is represented as smaller cubes inside
the outermost cube, and so on
(5) Three-D Cone Trees
◼ A cone tree is a 3-D visualization technique that works well for
up to a thousand nodes or so
◼ First build a 2-D circle tree that arranges its nodes in
concentric circles centered on the root node
◼ Cannot avoid overlaps when projected to 2-D
◼ A 3-D cone tree is used for visualizing hierarchical
information structures
3.5 Which Chart or Graph Should You Use?
Figure 3.21 A Taxonomy of Charts and Graphs.
Source: Adapted from Abela, A. (2008). Advanced Presentations by Design: Creating
Communication That Drives Action. New York: Wiley.
4. Similarity and Dissimilarity
◼ Similarity is a numerical measure of how alike two data
objects are, and dissimilarity is a numerical measure of
how different two data objects are.
◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0,1]
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects
are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
◼ Proximity refers to a similarity or dissimilarity
p and q are the attribute values for two data objects.
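For single attributes, dissimilarity is usually defined per attribute type. The sketch below uses widely taught textbook definitions rather than anything stated on the slide: a 0/1 mismatch for nominal values, normalized rank difference for ordinal values, absolute difference for numeric values, and simple matching or Jaccard-style distance for binary vectors. All function names are hypothetical.

```python
def nominal_dissimilarity(p, q):
    """0 if the two category labels match, 1 otherwise."""
    return 0.0 if p == q else 1.0


def ordinal_dissimilarity(p, q, n_levels):
    """Rank difference scaled to [0, 1]; p and q are ranks 0..n_levels-1."""
    return abs(p - q) / (n_levels - 1)


def numeric_dissimilarity(p, q):
    """Absolute difference for interval- or ratio-scaled values."""
    return abs(p - q)


def binary_dissimilarity(x, y, symmetric=True):
    """Distance between two equal-length 0/1 vectors.

    Symmetric: fraction of mismatching positions (simple matching).
    Asymmetric: mismatches over positions that are not 0 in both vectors
    (Jaccard-style), so shared absences are ignored.
    """
    mismatches = sum(a != b for a, b in zip(x, y))
    if symmetric:
        return mismatches / len(x)
    both_zero = sum(a == 0 and b == 0 for a, b in zip(x, y))
    considered = len(x) - both_zero
    return mismatches / considered if considered else 0.0


def similarity_from_dissimilarity(d):
    """Convert a dissimilarity in [0, 1] to a similarity in [0, 1]."""
    return 1.0 - d


# Customer satisfaction ranks from the ordinal example: 4 vs. 2 on a 5-level scale.
d_ordinal = ordinal_dissimilarity(4, 2, n_levels=5)
print(d_ordinal, similarity_from_dissimilarity(d_ordinal))
print(binary_dissimilarity([1, 0, 0, 1], [1, 1, 0, 0], symmetric=False))
```

The asymmetric variant suits attributes such as test results, where two patients both testing negative says little about how alike they are.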

5. Data Quality
◼ Data quality is a measure of the condition of data based
on factors such as accuracy, completeness, consistency,
reliability and whether it's up to date.
◼ Examples of data quality problems:

◼ Noise
◼ Outliers
◼ Missing values
◼ Duplicate data
(1) Noise
◼ Noise refers to the modification of original values
◼ Examples: distortion of a person’s voice when talking
on a poor phone connection and “snow” on a television screen
[Figure: Two Sine Waves; Two Sine Waves + Noise]
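The sine-wave illustration can be reproduced with the standard library. This is a minimal sketch: the frequencies, sample count, and noise level are arbitrary choices, Gaussian noise is an assumption about the kind of noise shown, and the function names are hypothetical.

```python
import math
import random


def two_sine_waves(n_samples, f1=3.0, f2=7.0):
    """Sum of two sine waves sampled at n_samples points over one period."""
    return [
        math.sin(2 * math.pi * f1 * i / n_samples)
        + math.sin(2 * math.pi * f2 * i / n_samples)
        for i in range(n_samples)
    ]


def add_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    return [v + rng.gauss(0.0, sigma) for v in signal]


clean = two_sine_waves(200)
noisy = add_noise(clean, sigma=0.5)

# Average absolute distortion introduced by the noise.
distortion = sum(abs(a - b) for a, b in zip(clean, noisy)) / len(clean)
print(round(distortion, 3))
```

The larger sigma is relative to the signal's amplitude, the harder it becomes to recover the two underlying waves from the noisy values.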

(2) Outliers
◼ Outliers are data objects with characteristics that are
considerably different from most of the other data objects
in the data set
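One common way to flag outliers in a single numeric attribute is the 1.5 × IQR rule. The lecture does not prescribe a method, so the rule, the threshold k, the function name iqr_outliers, and the sample ages are all assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [21, 22, 23, 24, 25, 26, 27, 95]
print(iqr_outliers(ages))
```

Whether a flagged value is an error or a genuinely unusual object is a judgment call; unlike noise, outliers can be the most interesting objects in the data set.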
(3) Missing Values
◼ Reasons for missing values
◼ Information is not collected (e.g., people decline to
give their age and weight)
◼ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
◼ Handling missing values:
◼ Eliminate data objects
◼ Estimate missing values
◼ Ignore the missing value during analysis
◼ Replace with all possible values (weighted by their
probabilities)
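The first three strategies above can be sketched as follows, assuming records are dictionaries and a missing value is stored as None. Estimating with the attribute mean is only one possible estimator, and the function names and sample records are hypothetical.

```python
def drop_incomplete(records):
    """Eliminate data objects that have any missing value."""
    return [r for r in records if None not in r.values()]


def fill_with_mean(records, attr):
    """Estimate missing values of attr with the mean of the known values."""
    known = [r[attr] for r in records if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in records]


def mean_ignoring_missing(records, attr):
    """Ignore missing values during analysis (here: computing a mean)."""
    known = [r[attr] for r in records if r[attr] is not None]
    return sum(known) / len(known)


people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 80.0},
    {"age": 40, "weight": None},
]

print(drop_incomplete(people))
print(fill_with_mean(people, "age"))
print(mean_ignoring_missing(people, "weight"))
```

Eliminating objects is simplest but discards information; here it would leave only one of three records.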
(4) Duplicate Data
◼ A data set may include data objects that are duplicates, or
almost duplicates, of one another
◼ Major issue when merging data from heterogeneous
sources
◼ Examples:
◼ Same person with multiple email addresses
◼ Requires data cleaning, the process of detecting and
resolving issues such as duplicate data
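The email example can be sketched as a small merge step. This assumes, purely for illustration, that two records refer to the same person when their names match after lowercasing and collapsing whitespace; real deduplication usually needs stronger evidence. The function names and records are hypothetical.

```python
def normalize_name(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group email addresses under one normalized name per person."""
    merged = {}
    for record in records:
        key = normalize_name(record["name"])
        merged.setdefault(key, set()).add(record["email"].lower())
    return merged


records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee", "email": "A.Lee@example.org"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]

people = merge_duplicates(records)
for name, emails in people.items():
    print(name, sorted(emails))
```

Matching on name alone will wrongly merge different people who share a name, so in practice the key would combine several attributes.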

Summary
◼ Learned about data attribute types:
◼ Nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Studied many types of data sets,
◼ e.g., numerical, text, graph, Web, image.
◼ Learned about data visualization:
◼ map data onto graphical primitives
◼ Measured data similarity
◼ Reviewed all steps of data preprocessing.
◼ Explored the many methods that have been developed; this is still
an active area of research.
◼ Learned about data quality issues.
Note and Thank you!!!
Assignment #1: 11:59pm, Saturday
Thank You!

Assignment #1 - 2/3
Open the assignment #1 MS word file and answer the
following questions.
6. Briefly outline how to compute the dissimilarity between objects
described by the following:
(a) Nominal attributes
(b) Binary attributes
(c) Numeric attributes
7. Briefly outline how each of the following visualization techniques
works:
(a) Pixel-oriented
(b) Geometric-based
(c) Parallel coordinates