Getting to Know Your Data
Lecture 2
Announcement!!!
Assignment #1: 11:59pm, Saturday
Chapter 3: Nature of Data, Statistical Modeling
and Visualization
◼ Data Visualization
◼ Data Quality
◼ Summary
1. Types of Data Sets
◼ Data is the lowest level of abstraction from which
information and knowledge are derived
◼ Data is the source for information and knowledge
◼ Data is a collection of facts
◼ usually obtained as the result of experiences,
observations, or experiments
◼ Data may consist of numbers, words, images, …
◼ Thus, data quality and data integrity are critical to
analytics
◼ Data analytics is the process of finding patterns and
correlations within large data sets to predict outcomes,
so knowing the type and properties of the data is important
◼ A simple taxonomy of data:
◼ Here, we group data sets into four categories:
◼ Record Data
◼ Graph and Network Data
◼ Ordered Data
◼ Spatial, Image, and Multimedia Data
1.1 Record Data
◼ In data science, a record (also called a structure, struct, or
compound data) is a basic data structure.
◼ Records in a database or spreadsheet are usually called
"rows"; the fields of a record are called "columns".
◼ Relational record forms
◼ Transaction data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
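The transaction table above can be held as one set of items per TID. The following is a minimal sketch; the helper name `support` and the choice of Python sets are assumptions made for illustration, not something the slides prescribe.

```python
from collections import Counter

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count how many transactions contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset` (a set)."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(item_counts["Milk"])          # 4
print(support({"Diaper", "Milk"}))  # 0.6
```

Storing each transaction as a set makes the "contains all of these items" check a single subset comparison.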
1.2 Graph and Network Data
◼ Graph and network data represent objects and the links
between them, often for human-related data that exhibits
social properties
◼ e.g., patterns in graph from which human behavioral
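As a rough illustration of objects and links, a small undirected graph can be stored as an adjacency list. The names and friendship links below are hypothetical, and node degree is only one of many properties one might compute.

```python
from collections import defaultdict

# Hypothetical friendship links; the names are invented for illustration.
links = [("Ann", "Bob"), ("Ann", "Cy"), ("Bob", "Cy"), ("Cy", "Dee")]

# Build an undirected adjacency list: each object maps to its set of neighbours.
graph = defaultdict(set)
for a, b in links:
    graph[a].add(b)
    graph[b].add(a)

# Degree (number of links) is one simple structural property of a node.
degree = {node: len(neighbours) for node, neighbours in graph.items()}
print(degree["Cy"])  # 3
```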
1.3 Ordered Data
◼ Ordered data is a categorical, statistical data type where
the variables have natural, ordered categories and the
distances between the categories are not known
◼ Temporal data: time-series
1.4 Spatial, Image, and Multimedia Data
◼ Spatial data, also known as geospatial data (e.g., maps), is
information about a physical object that can be represented
by numerical values in a geographic coordinate system.
◼ Multimedia data refers to data representing multiple
types of media to capture information and experiences
related to objects and events.
◼ E.g., image data and video data
2. Data Objects
◼ A data object is a region of storage that contains a value or
group of values. A data object can represent an entity
described by several attributes.
◼ Examples:
2.1 Attributes
◼ Data objects are described by attributes. An attribute
is a data field representing a characteristic or feature of
a data object
◼ An attribute is a property or characteristic of an object
◼ Attribute types:
◼ Nominal
◼ Binary
◼ Ordinal
◼ Numeric
◼ Discrete/continuous
2.2 Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {black, blond, brown, grey, red, white}
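◼ Binary: a nominal attribute with only two states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important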
◼ Ordinal: a categorical, statistical data type where the
variables have natural, ordered categories and the
distances between the categories are not known
◼ Values have a meaningful order ranking but magnitude
between successive values is not known.
◼ E.g., size = {small, medium, large}, grades = {A, B, C, D, F},
army rankings
◼ Customer satisfaction has the following ordinal
categories:
◼ 4: very satisfied
◼ 3: satisfied
◼ 2: neutral
◼ 1: somewhat dissatisfied
◼ 0: very dissatisfied
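The satisfaction scale above can be encoded as rank codes. This sketch assumes a handful of invented survey responses and uses the median, which depends only on order; a mean of the codes would wrongly assume equal distances between categories.

```python
import statistics

# Rank codes copied from the satisfaction scale above.
SATISFACTION = {
    "very dissatisfied": 0,
    "somewhat dissatisfied": 1,
    "neutral": 2,
    "satisfied": 3,
    "very satisfied": 4,
}

# Hypothetical survey responses, invented for illustration.
responses = ["satisfied", "neutral", "very satisfied", "satisfied", "very dissatisfied"]

codes = sorted(SATISFACTION[r] for r in responses)
# median_low always returns one of the observed codes, so it maps back to a label.
median_code = statistics.median_low(codes)
label_of = {code: label for label, code in SATISFACTION.items()}
print(codes)                  # [0, 2, 3, 3, 4]
print(label_of[median_code])  # satisfied
```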
◼ Numeric:
◼ A quantity, i.e., integer- or real-valued
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ Ratio
◼ We can speak of values as being an order of magnitude
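A small sketch of the practical difference between the two numeric scales: differences are meaningful on an interval scale, but ratios are meaningful only when the scale has a true zero. The kelvin conversion and the sample temperatures are assumptions added for illustration and do not come from the slides.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius (interval scale) to kelvins (ratio scale)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0   # two hypothetical temperatures in degrees Celsius

# Differences are meaningful on an interval scale: the units are equal-sized.
diff = b_c - a_c

# The Celsius ratio is an artifact of where zero was placed:
# 20 degrees C is not "twice as hot" as 10 degrees C.
naive_ratio = b_c / a_c

# Kelvins have a true zero, so their ratio is meaningful.
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff, naive_ratio, round(true_ratio, 3))  # 10.0 2.0 1.035
```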
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ Continuous Attribute
◼ Continuous attributes are typically represented as
floating-point variables
◼ It has real numbers as attribute values
◼ E.g., temperature, height, or weight
◼ Practically, real values can only be measured and
represented using a finite number of digits
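The last point can be seen directly in floating-point arithmetic. The values below are hypothetical measurements chosen because they expose the rounding; the comparison with a tolerance is one common workaround, not the only one.

```python
import math

height_m = 0.1 + 0.2   # hypothetical measurements, in metres

# Binary floating point stores only finitely many digits, so the sum is
# not exactly 0.3.
print(height_m == 0.3)              # False
print(math.isclose(height_m, 0.3))  # True: compare with a tolerance instead

# Rounding to the precision of the instrument makes the finite
# representation explicit.
print(round(height_m, 2))           # 0.3
```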
3. Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data
onto graphical primitives as follows:
◼ Provide qualitative overview of large data sets
◼ Help find interesting regions and suitable
parameters for further quantitative analysis
◼ Provide a visual proof of computer representations derived
◼ Help search for patterns, trends, structure, irregularities,
relationships among data
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations
3.1 Pixel-Oriented Visualization Techniques
◼ The basic idea of pixel-oriented visualization techniques
is to represent as many data objects as possible on the
screen at the same time, mapping each data value to a
pixel of the screen and arranging the pixels adequately.
[Figure: pixel-oriented views of (a) income, (b) credit limit, (c) transaction volume, (d) age]
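The idea can be sketched in text form, with one character standing in for one pixel. The customer records, the character ramp `SHADES`, and the window width are all invented for illustration; a real implementation would color actual screen pixels.

```python
# Hypothetical customer records, invented for illustration.
customers = [
    {"income": 30, "credit_limit": 5, "age": 25},
    {"income": 90, "credit_limit": 20, "age": 52},
    {"income": 60, "credit_limit": 12, "age": 40},
    {"income": 45, "credit_limit": 9, "age": 31},
]
SHADES = " .:*#"   # later character = larger value

def to_shade(value, lo, hi):
    """Map a value in [lo, hi] to one character of SHADES."""
    if hi == lo:
        return SHADES[0]
    index = round((value - lo) / (hi - lo) * (len(SHADES) - 1))
    return SHADES[index]

def pixel_window(records, attribute, width=2):
    """One window for one attribute: one 'pixel' per record, row by row."""
    values = [r[attribute] for r in records]
    lo, hi = min(values), max(values)
    pixels = [to_shade(v, lo, hi) for v in values]
    return ["".join(pixels[i:i + width]) for i in range(0, len(pixels), width)]

# Sort once by income so the same position refers to the same customer
# in every window.
ordered = sorted(customers, key=lambda r: r["income"])
for attribute in ("income", "credit_limit", "age"):
    print(attribute, pixel_window(ordered, attribute))
```

Because every window uses the same record order, a pattern at the same position in two windows hints at a relationship between the two attributes.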
◼ For a data set of n dimensions, create n windows on the
screen, one for each dimension
◼ The n-dimension values
(1) Laying Out Pixels in Circle Segments
◼ To save space and show the connections among multiple
dimensions, space filling is often done in a segment
3.2 Geometric Projection Visualization Techniques
◼ Geometric projection methods:
◼ Direct data visualization
◼ Scatterplot matrices
◼ Landscapes
◼ Parallel coordinates
◼ Projection pursuit technique: Help users find meaningful
projections of multidimensional data
◼ Prosection views
◼ Hyperslice
(1) Direct Data Visualization
◼ Direct visualizations of image data make use of the images in their
original visible format
[Figure: ribbons with twists based on vorticity]
[Figure: news articles visualized as a landscape]
3.3 Icon-Based Visualization Techniques
◼ Uses icons to represent multidimensional data values
◼ General techniques:
◼ Shape coding: Use shape to represent certain information
encoding
◼ Color icons: Use color icons to encode more information
◼ Tile bars: Use small icons to represent the relevant feature
vectors in document retrieval
◼ This is the visualization of large multi-variate data values
as features of icons
◼ It is still a challenging task, especially when we
◼ Chernoff faces
◼ Stick figures
(1) Chernoff Faces
◼ A way to display variables on a two-dimensional surface,
e.g., let x be eyebrow slant, y be eye size, z be nose
length, etc.
◼ The figure shows faces produced using 10 characteristics: head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening
(2) Stick Figure
◼ A stick figure is a very simple drawing of a person or
animal, composed of a few lines, curves, and dots.
◼ E.g., A census data figure showing age, income,
3.4 Hierarchical Visualization Techniques
◼ Hierarchical data visualization covers methods for
showing hierarchy in data.
◼ Visualization of the data using a hierarchical
◼ Worlds-within-Worlds
◼ Tree-map
◼ Cone trees
◼ InfoCube
(1) Dimensional Stacking
[Figure: dimensional stacking; axis labels: attribute 1, attribute 2, attribute 3, attribute 4]
◼ Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
◼ Partitioning of the attribute value ranges into classes. The important
attributes should be used on the outer levels.
◼ Adequate for data with ordinal attributes of low
cardinality, but difficult to display more than nine
dimensions
◼ Important to map dimensions appropriately
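The stacking of 2-D subspaces can be sketched for four attributes: the outer pair selects a block of the grid and the inner pair selects a cell inside that block. The function name, the argument order, and the cardinalities below are assumptions made for illustration.

```python
def stacked_position(values, cardinalities):
    """Map four class indices to one (x, y) grid cell.

    values        -- (outer_x, outer_y, inner_x, inner_y) class indices
    cardinalities -- number of classes of each attribute, same order
    """
    outer_x, outer_y, inner_x, inner_y = values
    _, _, n_inner_x, n_inner_y = cardinalities
    # The outer pair selects a block; the inner pair selects a cell in it.
    x = outer_x * n_inner_x + inner_x
    y = outer_y * n_inner_y + inner_y
    return x, y

cards = (2, 2, 3, 3)   # hypothetical cardinalities
print(stacked_position((1, 0, 2, 1), cards))  # (5, 1)
```

The grid needs one cell per combination of classes, so its size is the product of all cardinalities; this growth is one way to see why low-cardinality attributes suit the technique.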
(2) Worlds-within-Worlds
◼ Assign the function and the two most important parameters to
the innermost world. Fix all other parameters at constant
values, and draw other (1-, 2-, or 3-dimensional) worlds
choosing these as the axes
◼ Software N–vision: Dynamic interaction through data glove and
stereo displays, including rotation, scaling (inner) and translation
(inner/outer)
(3) Tree-Map
◼ A tree-map is a method for displaying hierarchical data
using nested figures, usually rectangles
◼ The information is displayed as a cluster of rectangles
varying in size and color, depending on their data value
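One simple way to compute such nested rectangles is the slice-and-dice layout, which alternates between splitting the width and splitting the height at each level. The sketch below is one possible layout, not necessarily the one behind the figures in these slides; the tree and its sizes are invented, and color is left out.

```python
def size_of(node):
    """Total size of a node: its own value, or the sum over its children."""
    _, payload = node
    if isinstance(payload, list):
        return sum(size_of(child) for child in payload)
    return payload

def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """Lay out a tree as nested rectangles with area proportional to size.

    node is (name, size) for a leaf or (name, [children]) for a branch.
    Returns a list of (name, x, y, w, h) for the leaves.
    """
    if out is None:
        out = []
    name, payload = node
    if not isinstance(payload, list):   # leaf: emit its rectangle
        out.append((name, x, y, w, h))
        return out
    total = size_of(node)
    offset = 0.0
    for child in payload:
        share = size_of(child) / total
        if vertical:    # split the width among the children
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:           # split the height among the children
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out

# Hypothetical hierarchy, invented for illustration.
tree = ("root", [("a", 6), ("b", [("b1", 3), ("b2", 1)]), ("c", 2)])
for rect in slice_and_dice(tree, 0, 0, 12, 6):
    print(rect)
```

On a 12 by 6 canvas with a total size of 12, each unit of size receives 6 units of area, so leaf "a" (size 6) covers half of the canvas.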
◼ A screen-filling method can be used in a hierarchical
partitioning of the tiles into regions depending on the
attribute values
◼ The x- and y-dimension of the screen are partitioned
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
(4) InfoCube
◼ InfoCube is a 3-D visualization technique where
hierarchical information is displayed as nested
semi-transparent cubes
◼ The outermost cube corresponds to the top-level data,
(5) Three-D Cone Trees
◼ A cone tree is a 3-D visualization technique that works well for
up to a thousand nodes or so
◼ First build a 2-D circle tree that arranges its nodes in
3.5 Which Chart or Graph Should You Use?
Figure 3.21 A Taxonomy of Charts and Graphs.
4. Similarity and Dissimilarity
◼ Similarity is a numerical measure of how alike two data
objects are, and dissimilarity is a numerical measure of
how different two data objects are.
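As a concrete instance, Euclidean distance can serve as a dissimilarity, and 1 / (1 + d) is one common way to turn a distance d into a similarity. Both choices, and the sample objects, are assumptions made for illustration; the slides do not name a specific measure here.

```python
import math

def euclidean(p, q):
    """Dissimilarity: straight-line distance between two numeric objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity(p, q):
    """One common conversion from a distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean(p, q))

x, y, z = (1.0, 2.0), (1.0, 2.0), (4.0, 6.0)   # hypothetical objects

print(euclidean(x, y), similarity(x, y))   # 0.0 1.0
print(euclidean(x, z), similarity(x, z))   # 5.0 0.16666666666666666
```

Identical objects get the smallest dissimilarity (0) and the largest similarity (1); the two measures move in opposite directions.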
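◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Higher when objects are more alike
◼ Dissimilarity
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike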
5. Data Quality
◼ Data quality is a measure of the condition of data based
on factors such as accuracy, completeness, consistency,
reliability and whether it's up to date.
◼ Outliers
◼ Missing values
◼ Duplicate data
(1) Noise
◼ Noise refers to modification of original values
◼ Examples: distortion of a person’s voice when talking
sources
◼ Examples:
◼ Same person with multiple email addresses
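The email example can be sketched as a grouping step. The records are invented, and matching people by a normalized name is an assumed rule chosen only to keep the example short; real deduplication usually needs stronger evidence than a name.

```python
from collections import defaultdict

# Hypothetical records merged from two sources; all values are invented.
records = [
    {"name": "Ana Ruiz", "email": "ana@example.com"},
    {"name": "ana ruiz ", "email": "a.ruiz@example.org"},
    {"name": "Ben Cho", "email": "ben@example.com"},
]

def person_key(record):
    """Assumed matching rule: same name after trimming and lower-casing."""
    return " ".join(record["name"].lower().split())

# Group the email addresses of records that refer to the same person.
emails_by_person = defaultdict(set)
for record in records:
    emails_by_person[person_key(record)].add(record["email"])

# Keep only the people who appear with more than one address.
duplicates = {k: v for k, v in emails_by_person.items() if len(v) > 1}
print(duplicates)
```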
Summary
◼ Learned about data attribute types:
◼ Nominal, binary, ordinal, interval-scaled, ratio-scaled
Note and Thank you!!!
Assignment #1: 11:59pm, Saturday
Thank You!