
ITS-632 Introduction to Data Mining

Kwang Lee, Ph.D.


Computer and Information Science
Cumberland University

Getting to Know Your Data

Lecture 2

Announcement!!!
Assignment #1: 11:59pm, Saturday

Copyright © Prof. Kwang Lee All rights reserved.


Lecture Overview
◼ Learn about data attribute types:
◼ Nominal, binary, ordinal, interval-scaled, ratio-scaled

◼ Study many types of data sets,


◼ e.g., numerical, text, graph, Web, image.

◼ Learn about data visualization:


◼ map data onto graphical primitives

◼ Measure data similarity

◼ Review all steps of data preprocessing


◼ Explore the many methods that have been developed; this is still
an active area of research
◼ Know data quality issues

Chapter 3: Nature of Data, Statistical Modeling
and Visualization

◼ Data Types, Objects, Attributes

◼ Data Visualization

◼ Measuring Data Similarity and Dissimilarity

◼ Data Quality

◼ Summary

1. Types of Data Sets
◼ Data is the lowest level of abstraction from which
information and knowledge are derived
◼ Data is the source for information and knowledge
◼ Data is a collection of facts
◼ usually obtained as the result of experiences, observations, or experiments
◼ Data may consist of numbers, words, images, …
◼ Thus, data quality and data integrity → critical to
analytics

1. Types of Data Sets
◼ Data analytics is the process of finding patterns and
correlations within large data sets to predict outcomes,
so knowing the types and properties of the data is important
◼ A simple taxonomy of data:

1. Types of Data Sets
◼ Here, we group data sets into four categories:
◼ Record Data
◼ Graph and Network Data
◼ Ordered Data
◼ Spatial, Image, and Multimedia Data

1.1 Record Data
◼ In data science, a record (also called a structure, struct, or
compound data) is a basic data structure.
◼ Records in a database or spreadsheet are usually called
"rows"; their fields are called "columns".
◼ Relational record forms

◼ Transaction data

◼ Document data: text documents

◼ Data matrix, e.g., numerical matrix, crosstabs

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
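
As an illustration, the transaction table above could be held in memory as a list of item sets, one per transaction. This is only a minimal sketch; the variable names `transactions` and `item_counts` are assumptions and not part of the lecture.

```python
from collections import Counter

# Each transaction (TID 1..5) is stored as a set of items, as in the table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Count in how many transactions each item appears.
item_counts = Counter(item for t in transactions for item in t)

for item, count in sorted(item_counts.items()):
    print(f"{item}: {count} of {len(transactions)} transactions")
```

Storing each transaction as a set reflects that only the presence of an item matters here, not its order or quantity.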
1.2 Graph and Network Data
◼ Graph and network data represent objects and the links between
them, often for human-related data that exhibits social
properties
◼ e.g., patterns in a graph from which human behavioral patterns can be analyzed and mined for valuable information
◼ World Wide Web, social or information networks, molecular
structures

1.3 Ordered Data
◼ Ordered data is data in which the order of the values or
records, such as their order in time or their position in a
sequence, carries meaning
◼ Temporal data: time-series

◼ Sequential data: transaction sequences

◼ Video data: sequence of images

1.4 Spatial, Image, and Multimedia Data
◼ Spatial data, known as geospatial data (map), is
information about a physical object that can be represented
by numerical values in a geographic coordinate system.
◼ Multimedia data refers to data that uses multiple
types of media to capture information and experiences
related to objects and events.
◼ E.g., image data and video data

2. Data Objects
◼ A data object is a region of storage that contains a value or
group of values. A data object can represent an entity
described by several attributes.
◼ Examples:
◼ Sales database: customers, store items, sales
◼ Medical database: patients, treatments
◼ University database: students, professors, courses
◼ As the examples above show, these data sets are made up of data
objects

2.1 Attributes
◼ Data objects are described by attributes. An attribute
is a data field representing a characteristic or feature of
a data object
◼ An attribute is a property or characteristic of an object
◼ E.g., customer_ID, name, address

◼ Attribute types:
◼ Nominal
◼ Binary
◼ Ordinal
◼ Numeric
◼ Discrete/continuous
2.2 Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {black, blond, brown, grey, red, white}

◼ marital status, occupation, ID numbers, zip codes

◼ Binary: nominal attribute with only 2 states (0 and 1, true and false)
◼ Symmetric binary: both outcomes equally important

◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important

◼ e.g., medical test (positive vs. negative)


◼ e.g., COVID-19 test (positive vs. negative)

2.2 Attribute Types
◼ Ordinal: a categorical, statistical data type where the
variables have natural, ordered categories and the
distances between the categories are not known
◼ Values have a meaningful order (ranking), but the magnitude of the
difference between successive values is not known.
◼ E.g., size = {small, medium, large}, grades = {A, B, C, D, F},
army rankings
◼ Customer satisfaction has the following ordinal
categories:
◼ 4: very satisfied
◼ 3: satisfied
◼ 2: neutral
◼ 1: somewhat dissatisfied
◼ 0: very dissatisfied

2.2 Attribute Types
◼ Numeric:
◼ A quantity, represented as an integer or real value
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ Ratio
◼ We can speak of a value as being a multiple of another value (10 K is twice as high as 5 K)
◼ Inherent zero-point
◼ E.g., Kelvin temperature scale, length, counts, monetary quantities
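
A short sketch of why ratios are meaningful on a ratio scale but not on an interval scale. It assumes the usual conversion K = °C + 273.15; the helper name `celsius_to_kelvin` is hypothetical and used only for illustration.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# On the Kelvin scale the zero point is inherent, so ratios are meaningful.
ratio_kelvin = 10.0 / 5.0  # 10 K is twice as high as 5 K

# On the Celsius scale the zero point is arbitrary: 10 °C is not twice as
# hot as 5 °C. Converting both values to Kelvin shows the physical ratio.
true_ratio = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences are meaningful on both scales because the units are equal-sized.
diff_celsius = 10.0 - 5.0
diff_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)

print(ratio_kelvin)
print(round(true_ratio, 3))  # close to 1, not 2
print(diff_celsius, round(diff_kelvin, 6))
```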

2.2 Attribute Types
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values

◼ E.g., zip codes, profession, or the set of words in a collection of documents
◼ Sometimes, represented as integer variables
◼ Note: Binary attributes are a special case of discrete
attributes

2.2 Attribute Types
◼ Continuous Attribute
◼ Continuous attributes are typically represented as
floating-point variables
◼ It has real numbers as attribute values
◼ E.g., temperature, height, or weight
◼ Practically, real values can only be measured and
represented using a finite number of digits

3. Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data
onto graphical primitives as follows:
◼ Provide qualitative overview of large data sets
◼ Help find interesting regions and suitable
parameters for further quantitative analysis
◼ Provide a visual proof of computer representations derived
◼ Help search for patterns, trends, structure, irregularities,
relationships among data

3. Data Visualization
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations

3.1 Pixel-Oriented Visualization Techniques
◼ The basic idea of pixel-oriented visualization techniques
is to represent as many data objects as possible on the
screen at the same time by mapping each data value to a
pixel of the screen and arranging the pixels adequately.

Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age
3.1 Pixel-Oriented Visualization Techniques
◼ For a data set of n dimensions, create n windows on the screen, one for each dimension
◼ The n dimension values of a record are mapped to n pixels at the corresponding positions in the windows
◼ E.g., the colors of the pixels reflect the corresponding values
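
The window-per-dimension idea can be sketched as follows. This is a minimal illustration under stated assumptions: pixels are laid out row by row, each value is mapped to a gray level from 0 to 255 by min-max scaling, and the function name `build_windows` and the sample records are hypothetical.

```python
def build_windows(records, width):
    """Return one pixel grid (list of rows) per dimension.

    Record i is placed at the same (row, col) in every window; the pixel's
    gray level (0-255) encodes that record's value in the window's dimension.
    """
    n_dims = len(records[0])
    height = -(-len(records) // width)  # ceiling division
    windows = []
    for d in range(n_dims):
        column = [r[d] for r in records]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        grid = [[None] * width for _ in range(height)]
        for i, value in enumerate(column):
            row, col = divmod(i, width)
            grid[row][col] = round(255 * (value - lo) / span)
        windows.append(grid)
    return windows

# Hypothetical records: (income, credit_limit, transaction_volume, age)
records = [
    (30_000, 2_000, 15, 25),
    (55_000, 5_000, 40, 38),
    (80_000, 9_000, 22, 51),
    (120_000, 15_000, 60, 64),
]
windows = build_windows(records, width=2)
for name, grid in zip(["income", "credit limit", "volume", "age"], windows):
    print(name, grid)
```

Because a record occupies the same position in every window, comparing the same pixel across windows shows how its attribute values relate.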

(1) Laying Out Pixels in Circle Segments
◼ To save space and show the connections among multiple
dimensions, space filling is often done in a segment

Figure panels: (a) representing a data record in circle segment, (b) laying out pixels in hexagon segment
(1) Laying Out Pixels in Circle Segments
◼ Therefore, we can display a large amount of information on a small
screen
3.2 Geometric Projection Visualization
Techniques
◼ Geometric projection techniques help users to find
interesting projections of multidimensional data sets.
◼ Visualization of geometric transformations and projections
of the data
◼ A scatter plot displays 2-D data points using Cartesian coordinates.
◼ A third dimension can be added by using different colors or shapes to represent
different data points

3.2 Geometric Projection Visualization
Techniques
◼ Geometric projection methods:
◼ Direct data visualization
◼ Scatterplot matrices
◼ Landscapes
◼ Parallel coordinates
◼ Projection pursuit technique: Help users find meaningful
projections of multidimensional data
◼ Prosection views
◼ Hyperslice

(1) Direct Data Visualization
◼ Direct visualizations of image data make use of the images in their
original visible format
Figure: Ribbons with Twists Based on Vorticity


(2) Scatterplot Matrices
◼ A scatter plot matrix is a grid or matrix of scatter plots used to
visualize bivariate relationships between combinations of variables.

Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k² − k)/2 distinct scatterplots]
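
As a quick check of the plot count: each unordered pair of distinct attributes gives one scatterplot, so k attributes give k(k − 1)/2 = (k² − k)/2 distinct plots. The sketch below enumerates the pairs; the attribute names are hypothetical.

```python
from itertools import combinations

attributes = ["income", "credit_limit", "age", "height"]  # hypothetical names
k = len(attributes)

# Every unordered pair of distinct attributes yields one scatterplot.
pairs = list(combinations(attributes, 2))

print(len(pairs), "distinct scatterplots for k =", k)
for x, y in pairs:
    print(f"{x} vs {y}")
```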


(3) Landscapes
Used by permission of B. Wright, Visible Decisions Inc.

Figure: news articles visualized as a landscape

◼ Visualization of the data as perspective landscape


◼ The data needs to be transformed into a 2-D spatial representation
that resembles the visible features of an area of land
(4) Parallel Coordinates
◼ A parallel coordinates plot allows comparing the features
of several individual observations on a set of numeric
variables
◼ It is a visualization technique used to plot individual data
elements across many performance measures.
◼ Each axis is scaled to the [minimum, maximum] range of the
corresponding attribute

Figure: k parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k
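
The per-axis [minimum, maximum] scaling can be sketched as below. This is a minimal illustration that assumes each axis is normalized to [0, 1]; the function name `scale_to_axes` and the sample rows are hypothetical.

```python
def scale_to_axes(rows):
    """Min-max scale each attribute (column) to [0, 1].

    Each returned row is the list of heights at which that observation's
    polyline crosses the k parallel axes.
    """
    columns = list(zip(*rows))
    bounds = [(min(c), max(c)) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.5  # constant column: mid-axis
            for v, (lo, hi) in zip(row, bounds)
        ])
    return scaled

# Hypothetical observations with k = 3 numeric attributes.
rows = [
    (18.0, 8, 3504),
    (26.0, 4, 2200),
    (34.0, 4, 1800),
]
for polyline in scale_to_axes(rows):
    print([round(p, 2) for p in polyline])
```

Scaling every axis to the same range keeps attributes with large raw values from dominating the picture.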


(4) Parallel Coordinates
◼ Seven columns from the cars table. The lines are color
encoded by the origin countries of the cars

3.3 Icon-Based Visualization Techniques
◼ Uses icons to represent multidimensional data values
◼ General techniques:
◼ Shape coding: Use shape to represent certain information
encoding
◼ Color icons: Use color icons to encode more information
◼ Tile bars: Use small icons to represent the relevant feature
vectors in document retrieval

3.3 Icon-Based Visualization Techniques
◼ This is the visualization of large multivariate data values
as features of icons
◼ It is still a challenging task, especially when we consider the exploration of a variety of attributes in one representation
◼ Typical visualization methods

◼ Chernoff faces
◼ Stick figures

(1) Chernoff Faces
◼ A way to display variables on a two-dimensional surface,
e.g., let x be eyebrow slant, y be eye size, z be nose
length, etc.
◼ The figure shows faces produced using 10 characteristics: head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening

(2) Stick Figure
◼ A stick figure is a very simple drawing of a person or
animal, composed of a few lines, curves, and dots.
◼ E.g., a census data figure showing age, income, gender, education, etc.
◼ E.g., family stick figure

3.4 Hierarchical Visualization Techniques
◼ Hierarchical visualization techniques show data that is organized
as a hierarchy.
◼ Visualization of the data using a hierarchical partitioning into subspaces


◼ Methods
◼ Dimensional stacking

◼ Worlds-within-Worlds

◼ Tree-map

◼ Cone trees

◼ InfoCube

(1) Dimensional Stacking

Figure axes: attribute 1, attribute 2, attribute 3, attribute 4
◼ Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
◼ Partitioning of the attribute value ranges into classes. The important
attributes should be used on the outer levels.
◼ Adequate for data with ordinal attributes of low
cardinality, but difficult to display more than nine
dimensions
◼ Important to map dimensions appropriately
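
One way to picture the stacking is as a computation of grid cells: the outer attribute pair selects a large block and the inner pair selects a cell inside it. The sketch below is an assumption-laden illustration, not the lecture's algorithm: attribute values are assumed to be already partitioned into class indices, attributes 1 and 3 are assumed to share the x direction and attributes 2 and 4 the y direction, and the name `stacked_cell` is hypothetical.

```python
def stacked_cell(values, cards):
    """Map a record's class indices to an (x, y) grid cell.

    values: class index of each attribute, outermost pair first,
            e.g. (a1, a2, a3, a4); attributes 1, 3, ... go to x and
            attributes 2, 4, ... go to y.
    cards:  number of classes of each attribute, in the same order.
    """
    x = y = 0
    for i, (v, c) in enumerate(zip(values, cards)):
        if not 0 <= v < c:
            raise ValueError("class index out of range")
        if i % 2 == 0:
            x = x * c + v  # attributes 1, 3, ... refine the x position
        else:
            y = y * c + v  # attributes 2, 4, ... refine the y position
    return x, y

cards = (3, 3, 2, 2)  # hypothetical: attributes 1-2 have 3 classes, 3-4 have 2
print(stacked_cell((0, 0, 0, 0), cards))
print(stacked_cell((1, 2, 1, 0), cards))
print(stacked_cell((2, 2, 1, 1), cards))
```

With these cardinalities the display is a 6 x 6 grid, which shows why low-cardinality attributes suit the technique: the grid size is the product of the class counts along each direction.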
(2) Worlds-within-Worlds
◼ Assign the function and two most important parameters to
the innermost world. Fix all other parameters at constant
values and draw other (1-, 2-, or 3-dimensional) worlds,
choosing these as the axes
◼ Software N–vision: Dynamic interaction through data glove and
stereo displays, including rotation, scaling (inner) and translation
(inner/outer)

(3) Tree-Map
◼ A tree-map is a method for displaying hierarchical data
using nested figures, usually rectangles
◼ The information is displayed as a cluster of rectangles
varying in size and color, depending on their data value

(3) Tree-Map
◼ A screen-filling method can be used in a hierarchical
partitioning of the tiles into regions depending on the
attribute values
◼ The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
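
The alternating x/y partitioning can be sketched with the simple "slice-and-dice" layout. This is one possible layout under stated assumptions, not necessarily the one used for the figure; the tuple-based tree format and the names `slice_and_dice` and `weight` are hypothetical.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a hierarchy as nested rectangles.

    node: (label, size) for a leaf or (label, [children]) for a group.
    The split direction alternates with depth: x at even depths, y at odd.
    Returns a list of (label, x, y, w, h) for the leaves.
    """
    if out is None:
        out = []
    label, payload = node
    if not isinstance(payload, list):      # leaf: emit its rectangle
        out.append((label, x, y, w, h))
        return out
    total = sum(weight(child) for child in payload)
    offset = 0.0
    for child in payload:
        share = weight(child) / total
        if depth % 2 == 0:                 # partition along x
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:                              # partition along y
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

def weight(node):
    """Size of a leaf, or the summed size of all leaves under a group."""
    _, payload = node
    if isinstance(payload, list):
        return sum(weight(child) for child in payload)
    return payload

# Hypothetical hierarchy with leaf sizes.
tree = ("root", [("A", 2), ("B", [("B1", 1), ("B2", 1)])])
for rect in slice_and_dice(tree, 0.0, 0.0, 100.0, 100.0):
    print(rect)
```

Each leaf's rectangle area is proportional to its size, so A covers half of the screen and B1 and B2 a quarter each.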

MSR Netscan Image

Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
(4) InfoCube
◼ InfoCube is a 3-D visualization technique where
hierarchical information is displayed as nested semi-transparent cubes
◼ The outermost cube corresponds to the top-level data, the lower-level data is represented as smaller cubes inside the outermost cube, and so on

(5) Three-D Cone Trees
◼ A cone tree is a 3-D visualization technique that works well for
up to a thousand nodes or so
◼ First build a 2-D circle tree that arranges its nodes in concentric circles centered on the root node
◼ Cannot avoid overlaps when projected to 2-D
◼ A 3-D cone tree is used for visualizing hierarchical
information structures

3.5 Which Chart or Graph Should You Use?
Figure 3.21 A Taxonomy of Charts and Graphs.

Source: Adapted from Abela, A. (2008). Advanced Presentations by Design: Creating Communication That Drives Action. New York: Wiley.
4. Similarity and Dissimilarity
◼ Similarity is a numerical measure of how alike two data
objects are, and dissimilarity is a numerical measure of
how different two data objects are.

4. Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data objects are

◼ Value is higher when objects are more alike

◼ Often falls in the range [0,1]

◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike

◼ Minimum dissimilarity is often 0

◼ Upper limit varies

◼ Proximity refers to a similarity or dissimilarity
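
A minimal sketch of these properties, assuming Euclidean distance as the dissimilarity and the transformation 1 / (1 + d) as one of several possible ways to obtain a similarity; both choices and the function names are assumptions made for illustration.

```python
import math

def euclidean(p, q):
    """Dissimilarity: 0 for identical objects, no fixed upper limit."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity(p, q):
    """Map a distance d >= 0 to a similarity in (0, 1] via 1 / (1 + d)."""
    return 1.0 / (1.0 + euclidean(p, q))

a = (1.0, 2.0)
b = (1.0, 2.0)
c = (4.0, 6.0)

print(euclidean(a, b), similarity(a, b))  # identical objects
print(euclidean(a, c), similarity(a, c))  # objects that differ
```

Identical objects get the minimum dissimilarity and the maximum similarity, and the similarity shrinks toward 0 as the distance grows.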

4. Similarity and Dissimilarity

p and q are the attribute values for two data objects.
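
As a hedged sketch of one common textbook convention for single-attribute proximity, with p and q as defined above: nominal values are compared for equality, ordinal values by normalized rank difference, and numeric values by absolute difference. The function names and the conversion s = 1 − d are assumptions for illustration.

```python
def nominal_dissimilarity(p, q):
    """0 if the two values match, 1 otherwise."""
    return 0 if p == q else 1

def ordinal_dissimilarity(p, q, n_values):
    """|p - q| / (n - 1), with values mapped to ranks 0 .. n-1."""
    return abs(p - q) / (n_values - 1)

def numeric_dissimilarity(p, q):
    """Absolute difference for interval or ratio attributes."""
    return abs(p - q)

def to_similarity(d):
    """One possible conversion for dissimilarities already in [0, 1]."""
    return 1 - d

# Satisfaction ranks 0 (very dissatisfied) .. 4 (very satisfied).
print(nominal_dissimilarity("black", "brown"))
print(ordinal_dissimilarity(4, 2, n_values=5))
print(numeric_dissimilarity(170.0, 182.5))
print(to_similarity(ordinal_dissimilarity(4, 2, n_values=5)))
```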


5. Data Quality
◼ Data quality is a measure of the condition of data based
on factors such as accuracy, completeness, consistency,
reliability and whether it's up to date.

◼ Examples of data quality problems:


◼ Noise

◼ Outliers

◼ Missing values

◼ Duplicate data
(1) Noise
◼ Noise refers to modification of original values
◼ Examples: distortion of a person’s voice when talking on a poor phone connection and “snow” on a television screen

Figure panels: Two Sine Waves; Two Sine Waves + Noise
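
The effect of noise on two summed sine waves can be sketched as follows. The frequencies, the additive Gaussian noise model, and its standard deviation are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n = 200
t = [i / n for i in range(n)]

# Original signal: the sum of two sine waves (frequencies are assumptions).
clean = [math.sin(2 * math.pi * 3 * x) + math.sin(2 * math.pi * 7 * x) for x in t]

# Noise modifies the original values; here, additive Gaussian noise.
noisy = [v + rng.gauss(0.0, 0.5) for v in clean]

# Root-mean-square difference between recorded and original values.
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(noisy, clean)) / n)
print(f"RMS distortion introduced by noise: {rms_error:.3f}")
```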


(2) Outliers
◼ Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
(3) Missing Values
◼ Reasons for missing values
◼ Information is not collected (e.g., people decline to
give their age and weight)
◼ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

◼ Handling missing values:
◼ Eliminate data objects
◼ Estimate missing values
◼ Ignore the missing value during analysis
◼ Replace with all possible values (weighted by their probabilities)
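
The first three handling strategies can be sketched as below. The sample records are hypothetical, and using the mean of the observed values as the estimate is only one possible choice.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
people = [
    {"name": "Ann", "age": 34, "weight": 61.0},
    {"name": "Bob", "age": None, "weight": 82.5},
    {"name": "Cy", "age": 45, "weight": None},
    {"name": "Dee", "age": 29, "weight": 70.0},
]

# 1. Eliminate data objects that have any missing value.
complete = [p for p in people if None not in p.values()]

# 2. Estimate missing values, here with the mean of the observed values.
observed_ages = [p["age"] for p in people if p["age"] is not None]
mean_age = mean(observed_ages)
imputed = [dict(p, age=mean_age if p["age"] is None else p["age"]) for p in people]

# 3. Ignore the missing value during analysis: average only what is present.
observed_weights = [p["weight"] for p in people if p["weight"] is not None]
mean_weight = mean(observed_weights)

print(len(complete), mean_age, round(mean_weight, 2))
```

Eliminating objects is simplest but discards half of this small data set, which is why estimating or ignoring individual values is often preferred.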
(4) Duplicate Data
◼ Data set may include data objects that are duplicates, or
almost duplicates, of one another
◼ Major issue when merging data from heterogeneous sources
◼ Examples:
◼ Same person with multiple email addresses
◼ Needs data cleaning, which is a process of dealing with duplicate data issues
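
A minimal sketch of finding duplicates after merging sources, using the example of one person with multiple email addresses. Matching on a normalized name alone is an assumption made for brevity; the records and the helper `normalize` are hypothetical.

```python
# Hypothetical customer records merged from two sources.
records = [
    {"name": "Jane Doe",  "email": "jane.doe@example.com"},
    {"name": "jane doe ", "email": "jdoe@example.org"},
    {"name": "John Roe",  "email": "john.roe@example.com"},
    {"name": "Jane Doe",  "email": "jane.doe@example.com"},
]

def normalize(name):
    """Crude matching key: lowercase with collapsed whitespace."""
    return " ".join(name.lower().split())

# Group records that share a normalized name and collect their emails.
merged = {}
for rec in records:
    key = normalize(rec["name"])
    merged.setdefault(key, set()).add(rec["email"])

for name, emails in merged.items():
    print(name, sorted(emails))
```

The exact duplicate collapses automatically, while the near duplicate is caught only because the key ignores case and spacing.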


Summary
◼ Learned about data attribute types:
◼ Nominal, binary, ordinal, interval-scaled, ratio-scaled

◼ Studied many types of data sets,


◼ e.g., numerical, text, graph, Web, image.

◼ Learned about data visualization:


◼ map data onto graphical primitives

◼ Measured data similarity

◼ Reviewed all steps of data preprocessing.


◼ Explored the many methods that have been developed; this is still
an active area of research.
◼ Learned about data quality issues.

Note and Thank you!!!
Assignment #1: 11:59pm, Saturday

Thank You!



Assignment #1 - 2/3
Open the assignment #1 MS word file and answer the
following questions.

6. Briefly outline how to compute the dissimilarity between objects described by the following:
(a) Nominal attributes
(b) Binary attributes
(c) Numeric attributes

7. Briefly outline how to construct the visualization techniques described by the following:
(a) Pixel-oriented
(b) Geometric-based
(c) Parallel coordinates