Lecture: Visualization Building Blocks
(Marks and Channels)
DATA VISUALIZATION
SPRING 2019

Dr. Muhammad Faisal Cheema


COMSATS UNIVERSITY
GOALS FOR TODAY
• Learn the basic visual primitives of visualizations
(marks and channels)
• Understand how marks and channels are
assembled to make visualizations
• Learn which marks and channels are most
effective for a given task (“perceptual ordering”)

In-class exercise: Critique & Redesign
Describe: What do you see?
Analyze: How is the work organized? What are the visual encodings?
Task: What is the purpose of the visualization?
Decide: Is this a successful (effective) visualization?

1. Who is the intended audience?
2. What information does this visualization represent?
3. How many data dimensions does it encode?
4. List several tasks, comparisons or evaluations it enables
5. What principles of excellence best describe why it is good / bad?
6. Can you suggest any improvements?
7. Why do you like / dislike this visualization?

http://www.theatlantic.com/past/docs/images/issues/200709/win.jpg
MARKS & CHANNELS

Visualization Building Blocks

MARK = basic graphical element in an image

Points Lines Areas

Munzner, “Visualization Analysis and Design” (2014)


Visualization Building Blocks
CHANNEL = way to control the appearance of marks,
independent of the dimensionality of the geometric primitive
Position Color
Horizontal Vertical Both

Shape Tilt

Size
Length Area Volume

Munzner, “Visualization Analysis and Design” (2014)
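One way to make the mark/channel vocabulary concrete is to describe a chart as one mark type plus a mapping from channels to data attributes. The sketch below is only an illustration: the class name `Encoding`, the channel keys, and the attribute names (`mileage`, `price`, `origin`, `horsepower`) are hypothetical and not part of the lecture.

```python
from dataclasses import dataclass, field

# Vocabulary from the slides: three mark types and the listed channels.
MARKS = {"point", "line", "area"}
CHANNELS = {"x", "y", "color", "shape", "tilt", "size"}


@dataclass
class Encoding:
    """One mark type plus a mapping from channel to data attribute."""
    mark: str
    channels: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.mark not in MARKS:
            raise ValueError(f"unknown mark: {self.mark}")
        unknown = set(self.channels) - CHANNELS
        if unknown:
            raise ValueError(f"unknown channels: {sorted(unknown)}")

    def attributes_encoded(self):
        """Number of distinct data attributes the chart shows."""
        return len(set(self.channels.values()))


# Scatterplot: point marks, horizontal and vertical position.
scatter = Encoding("point", {"x": "mileage", "y": "price"})

# Same plot with color and size added as extra channels.
bubble = Encoding("point", {"x": "mileage", "y": "price",
                            "color": "origin", "size": "horsepower"})

print(scatter.attributes_encoded(), bubble.attributes_encoded())
```

Counting the distinct attributes in the mapping mirrors the "# of attributes encoded" question asked on the example slides.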


Visualization Building Blocks
[Figures: a series of example charts. For each one, the build slides identify the mark type (points, lines, or areas), the channels used (position, color, shape, tilt, size), and the number of attributes encoded. The counts given for the successive examples are 2, 2, 3, 4, 1, 1, 3, and 3 (4 with position in 3D space).]
Kindlmann (2004)
Visualization Building Blocks
Marks as Items/Nodes
Points Lines Areas

Marks as Links
Containment Connection

Munzner, “Visualization Analysis and Design” (2014)


How do I pick which marks or channels to use?
Bertin’s Semiology of Graphics
[Figure: three points A, B, C along a line]
1. A, B, C are distinguishable
2. B is between A and C.
3. BC is twice as long as AB.
Encode quantitative variables

"Resemblance, order and proportion are the three signifieds in graphics." - Bertin
Visual encoding variables
Position (x 2)
Size
Value
Texture
Color
Orientation
Shape
Characteristics of Visual Variables

Selective
Is a mark distinct from other marks?
Can we make out the difference between two marks?
Associative
Does it support grouping?
Quantitative
Can we quantify the difference between two marks?
Characteristics of Visual Variables

Order
Can we see a change in order?
Length
How many unique marks can we make?
Position
• Strongest visual variable
• Suitable for all data types
• Problems:
  • Sometimes not available
  • Cluttering
Size & Length
• Good visual variable
• Easy to see whether one is bigger
• Grouping works
• Judging differences
  • Good for aligned bars (position)
  • OK for changes in length
  • Bad for changes in area
Shape

Great to recognize many classes.


No ordering.
Value
Good for quantitative data when length & size
are used.
Not very many shades recognizable
Supports grouping
Is preattentive (stands out) if sufficiently different
Color
Good for qualitative data
Limited number of classes!
Not good for quantitative data!
Is preattentive if sufficiently different.
Lots of pitfalls! Be careful!
Information in color and value
Value is perceived as ordered
Encode ordinal variables (O)

Encode continuous variables (Q) [not as well]

Hue is normally perceived as unordered


Encode nominal variables (N) using color
Bertin’s “Levels of Organization”
N = Nominal, O = Ordered, Q = Quantitative
Note: Q < O < N

Position     N O Q
Size         N O Q
Value        N O Q
Texture      N O
Color        N
Orientation  N
Shape        N
“Ordering of Elemental Perceptual Tasks”
TASK: Which segment/bar is the maximum, and what is its percentage/value?
Cleveland & McGill (1984)

This is why pie charts are bad!
https://www.washingtonpost.com/news/wonk/wp/2013/06/17/the-usefulness-of-pie-charts-in-two-pie-charts/

“Ordering of Elemental Perceptual Tasks”: Cleveland & McGill’s Results
[Figure: error by elementary perceptual task; horizontal axis "Log Error", ticks from 1.0 to 3.0]
Cleveland & McGill (1984)
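The "Log Error" axis can be made concrete. Cleveland & McGill scored each judgment as log2(|judged percent - true percent| + 1/8), so a perfect answer scores -3 and larger values mean worse accuracy. A minimal sketch of that measure follows; the function name `cm_log_error` is hypothetical.

```python
import math


def cm_log_error(judged_percent, true_percent):
    """Cleveland & McGill accuracy score for one judgment.

    The 1/8 offset keeps the logarithm finite when the answer is exact.
    """
    return math.log2(abs(judged_percent - true_percent) + 0.125)


print(cm_log_error(50, 50))  # exact judgment
print(cm_log_error(42, 50))  # off by 8 percentage points
```

Averaging this score over many judgments per chart type is what allows the perceptual tasks to be placed on a single ordered axis.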


Channel Ranking by Data Type
[Figure: channel rankings for Quantitative, Ordinal, and Categorical data, with build steps highlighting (Categorical) and AREA]
Mackinlay (1986)
Cleveland & McGill’s Results vs. Crowdsourced Results
[Figure: log error (axis from 1.0 to 3.0) for positions, angles, circular areas, and rectangular areas (aligned or in a treemap)]
Heer & Bostock (2010)
Expressiveness and Effectiveness
Effectiveness principle: the importance of the attribute should match the salience of the channel, that is, its noticeability (i.e., encode the most important attributes with the highest-ranked channels).
Expressiveness principle: the visual encoding should express all of, and only, the information in the dataset attributes (i.e., data characteristics should match the channel).
Mackinlay (1986)
Channels: Expressiveness Types and Effectiveness Ranks

Magnitude Channels: Ordered Attributes (ranked from most to least effective)
1. Position on common scale
2. Position on unaligned scale
3. Length (1D size)
4. Tilt/angle
5. Area (2D size)
6. Depth (3D position)
7. Color luminance
8. Color saturation
9. Curvature
10. Volume (3D size)

Identity Channels: Categorical Attributes (ranked from most to least effective)
1. Spatial region
2. Color hue
3. Motion
4. Shape
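The two principles can be read as a simple assignment procedure: expressiveness picks the list (magnitude channels for ordered attributes, identity channels for categorical ones), and effectiveness hands the best remaining channel to the most important attribute. The sketch below is a deliberately simplified illustration; the function name `assign_channels` and the attribute names are hypothetical, and each channel is used only once even though a real chart can place two attributes on a common-scale position (x and y).

```python
# Rankings copied from the slide, most effective first.
MAGNITUDE_CHANNELS = [
    "position on common scale", "position on unaligned scale",
    "length", "tilt/angle", "area", "depth",
    "color luminance", "color saturation", "curvature", "volume",
]
IDENTITY_CHANNELS = ["spatial region", "color hue", "motion", "shape"]


def assign_channels(attributes):
    """Greedy channel assignment.

    attributes: list of (name, kind) ordered from most to least important,
    where kind is "ordered" or "categorical".
    """
    free = {
        "ordered": list(MAGNITUDE_CHANNELS),        # expressiveness: match type
        "categorical": list(IDENTITY_CHANNELS),
    }
    plan = {}
    for name, kind in attributes:
        if not free[kind]:
            raise ValueError(f"no {kind} channel left for {name}")
        plan[name] = free[kind].pop(0)              # effectiveness: best first
    return plan


plan = assign_channels([
    ("price", "ordered"),
    ("origin", "categorical"),
    ("mileage", "ordered"),
])
for attribute, channel in plan.items():
    print(attribute, "->", channel)
```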


Lecture: Visualization of Multidimensional Data

DATA VISUALIZATION
SPRING 2019

Dr. Muhammad Faisal Cheema


COMSATS UNIVERSITY
Recap
The Visualization Pipeline (InfoVis)
[Figure: Raw data (information) → Data tables → Visual structures → Visualization (views). The three arrows are labeled data transformations, visual mappings, and view transformations. Task and user interaction appear as inputs to the pipeline.]
Why are steps 2 and 3 highly correlated?

Data dictates visualization design!


How does data influence visualization design?

• Visualization of Multi Dimensional Data
• Visualization of High Dimensional Data
• Visualization of Hierarchical Data (Trees)
• Visualization of Graph and Network Data
• Visualization of Spatial Data
• Visualization of Time series Data
• Visualization of Textual Data (Text)
• Etc.
Multidimensional / Multivariate Data
Beyond Tables and Charts
Data sets of dimensions 1, 2, 3 are common
Number of variables per class:
1 - Univariate data
2 - Bivariate data
3 - Trivariate data
>3 - Hypervariate/Multivariate data
Univariate Data

Representations
[Figure: example univariate representations, including a Tukey box plot marking low, middle 50%, high, and mean]
Bivariate Data

Representations
Scatter plot is common
[Figure: scatter plot of price against mileage]
Trivariate Data

Representations
3D scatter plot is possible
[Figure: 3D scatter plot with axes price, horsepower, and mileage]
Trivariate

3D scatterplot, spin plot
2D plot + size (or color…)
Multi-Dimensional / Multivariate Data

Each attribute defines a dimension
Small # of dimensions easy
Data mapping
What about many-dimensional (n-D) data?
What does 10-D space look like?
Map n-D space onto 2-D screen

Visual representations:
Complex glyphs
E.g. star glyphs, faces, embedded visualization, …
Multiple views of different dimensions
E.g. small multiples, plot matrices, brushing histograms, …
Non-orthogonal axes
E.g. Parallel coords, star coords, …
Tabular layout
E.g. TableLens, …
Interactions:
Dynamic Queries
Brushing & Linking
Selecting for details, …
Combinations (combine multiple techniques)
Glyphs
Glyphs: Chernoff Faces

Encode different variables’ values in characteristics of human face
Glyphs: Stars
[Figure: star glyph with seven axes, d1 through d7]
Non-orthogonal axis
Parallel Coordinates (2D)
• Encode variables along a horizontal row
• Vertical line specifies values

Dataset in a Cartesian graph Same dataset in parallel coordinates


Parallel Coordinates (4D)

Forget about Cartesian orthogonal axes
[Figure: the point (0, 1, -1, 2) drawn as a polyline across four parallel axes x, y, z, w, each with 0 marked]
Parallel Coordinates Example
[Figures: basic, grayscale, and color variants]
Parallel Coordinates

Visualize up to ~two dozen dimensions at once


1. Draw parallel axes for each variable
2. For each tuple, connect points on each axis
Between adjacent axes: line crossings imply neg.
correlation, shared slopes imply pos. correlation.
Full plot can be cluttered. Interactive selection
can be used to assess multivariate relationships.
Highly sensitive to axis scale and ordering.
Expertise required to use effectively!
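The two construction steps above can be sketched in a few lines: place one vertical axis per variable, scale each variable to its own axis, and connect each tuple's points into a polyline. This is a minimal sketch that only computes geometry (no drawing); the function name `parallel_coordinates` and the sample columns `mpg` and `hp` are hypothetical, and it assumes at least two columns.

```python
def parallel_coordinates(rows, columns, width=1.0, height=1.0):
    """Return one polyline (list of (x, y) points) per data tuple."""
    n = len(columns)
    # Step 1: evenly spaced parallel axes, one per variable.
    axis_x = [i * width / (n - 1) for i in range(n)]
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}

    polylines = []
    for r in rows:
        line = []
        for x, c in zip(axis_x, columns):
            span = hi[c] - lo[c]
            # Each axis has its own scale; constant columns sit mid-axis.
            t = (r[c] - lo[c]) / span if span else 0.5
            line.append((x, t * height))
        # Step 2: the tuple is the polyline through its point on each axis.
        polylines.append(line)
    return polylines


rows = [{"mpg": 10, "hp": 200}, {"mpg": 30, "hp": 100}]
lines = parallel_coordinates(rows, ["mpg", "hp"])
for line in lines:
    print(line)
```

In this toy data the two polylines cross between the axes, which is the line-crossing pattern the slide associates with negative correlation. Reordering `columns` or changing the per-axis scaling changes the picture, which is why the technique is sensitive to both.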
Radar Plot / Star Graph

“Parallel” dimensions in polar coordinate space


Best if same units apply to each axis
Chord Diagram
Multiple views of different dimensions
Scatterplot Matrix (SPLOM)
Scatter plots
for pairwise
comparison
of each data
dimension.
Scatterplot Matrix

http://noppa5.pc.helsinki.fi/koe/3d3.html
Multiple Views

Give each variable its own display

     A  B  C  D  E
1    4  1  8  3  5
2    6  3  4  2  1
3    5  7  2  4  3
4    2  6  3  1  5

[Figure: one small display per variable A to E]
Small Multiples
Trellis Plots

A trellis plot subdivides space to enable


comparison across multiple plots.
Typically nominal or ordinal variables are used
as dimensions for subdivision.
Example Simple Plot
Trellis Plots
Tabular Layout
Table Lens

Idea: Make the text more visual and symbolic
Just leverage basic bar chart idea

Characteristics
Can sort on any attribute (row)
Focus on an attribute value (show only cases having that value) by double-clicking on it
Can type in queries on different attributes to limit what is presented. Note that this is the main contribution: dynamic control (selection/change/querying/filtering) of individual attributes.
Lecture: Visualizing Graphs and Networks

DATA VISUALIZATION
SPRING 2019

Dr. Muhammad Faisal Cheema


COMSATS UNIVERSITY
Why graphs and networks?

http://www.sci.utah.edu/~miriah/cs6630/lectures/L13-trees-graphs.pdf
Graph and Network Uses
• In Information Visualization, any number
of data sets can be modeled as a graph
– Telephone system
– World Wide Web
– Distribution network for on-line retailer
– Call graph of a large software system
– Semantic map in an AI algorithm
– Set of connected friends
– Social Networks
Graphs are more complicated
than trees
Graph Terminology

• Graphs can have cycles


• Edges can be directed or undirected
• Degree of a vertex = # connected nodes
• In-degree and out-degree for directed graphs
• Graph edges can have values (weights)
• Nominal (N), ordinal (O), quantitative (Q)
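The degree definitions above can be checked on a small edge list. This is a minimal sketch with hypothetical function names; it counts incident edges, which equals the number of connected nodes as long as the graph has no self-loops or repeated edges.

```python
from collections import Counter


def degree(edges):
    """Degree of each vertex in an undirected graph given as (u, v) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def in_out_degree(edges):
    """In-degree and out-degree when each (u, v) is a directed edge u -> v."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(in_deg), dict(out_deg)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(degree(edges))
print(in_out_degree(edges))
```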
Graph Drawing considerations
Vertex Issues
• Shape
• Color
• Size
• Location
• Label
Edge Issues
• Color
• Size
• Label
• Form
• Polyline, straight line, orthogonal, grid,
curved, planar, upward/downward, ...
Edge Drawing Strategies

Label

Thickness

Color

Directed
Complexity Considerations
• Crossings-- minimize towards planar
• Total Edge Length-- minimize towards
proper scale
• Area-- minimize towards efficient use of
space
• Maximum Edge Length-- minimize longest
edge
• Uniform Edge Lengths-- minimize variances
• Total Bends– minimize orthogonal towards
straight-line
Graph Visualization Problems
• Graph layout and positioning
– Make a concrete rendering of abstract graph
• Scale
– Not too much of a problem for small graphs, but
large ones are much tougher
• Navigation/Interaction
– How to support user changing focus and moving
around the graph
Graph Layout
• How to position the nodes and edges?
• Avoid clutter
• Maintain appropriate relations
The Hairball Problem
• How to position the nodes and
edges?
• Avoid clutter
• Maintain appropriate relations
Layout Types
• Grid Layout
– Put nodes on a grid
• Force Directed Layout
– Model graph as set of masses connected by
springs
• Planar Layout
– Detect part of graph that can be laid out without
edge crossings
• Attribute Based Layouts
Layout Subproblems
• Rank Assignment
– Compute which nodes have large degree, put
those at center of clusters
• Crossing Minimization
– Swap nodes to rearrange edges
• Subgraph Extraction
– Pull out cluster of nodes
• Planarization
– Pull out a set of nodes that can lay out on plane
Planar Layouts
Starting simple: planar 3-vertex
connected graphs (what?)
Tutte Embedding
• Each node should be the average of its neighbors
• Aside from the boundary, which is user-specified
• This gives a linear system
• Theorem: if the graph is planar and 3-vertex-connected, and the boundary is fixed as a convex polygon, the embedding is crossing-free
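The linear system can be solved directly, but for illustration it is enough to iterate the averaging rule until the interior nodes settle (a Gauss-Seidel style relaxation, which is a choice made for this sketch and not something the slides prescribe). The function name `tutte_embedding` and the small example graph are hypothetical.

```python
def tutte_embedding(adjacency, boundary_pos, iterations=200):
    """Place interior nodes at the average of their neighbors.

    adjacency: dict node -> list of neighbor nodes
    boundary_pos: dict node -> (x, y) for the user-specified boundary
    """
    pos = dict(boundary_pos)
    interior = [v for v in adjacency if v not in boundary_pos]
    for v in interior:
        pos[v] = (0.0, 0.0)  # arbitrary starting guess
    for _ in range(iterations):
        for v in interior:
            nbrs = adjacency[v]
            pos[v] = (
                sum(pos[u][0] for u in nbrs) / len(nbrs),
                sum(pos[u][1] for u in nbrs) / len(nbrs),
            )
    return pos


# Triangular prism: the outer face p0..p3 is pinned to a unit square.
adjacency = {
    "p0": ["p1", "p3", "u"], "p1": ["p0", "p2", "v"],
    "p2": ["p1", "p3", "v"], "p3": ["p0", "p2", "u"],
    "u": ["p0", "p3", "v"], "v": ["p1", "p2", "u"],
}
boundary = {"p0": (0.0, 0.0), "p1": (1.0, 0.0),
            "p2": (1.0, 1.0), "p3": (0.0, 1.0)}
pos = tutte_embedding(adjacency, boundary)
print(pos["u"], pos["v"])
```

For this graph the two interior nodes settle near (0.25, 0.5) and (0.75, 0.5), which satisfies the averaging condition exactly.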
Tutte Embedding
Downsides

Follows Tutte Embedding correctly, but visually cluttered
http://www.cs.arizona.edu/~kpavlou/Tutte_Embedding.pdf
SUGIYAMA-TYPE LAYOUT
- great for graphs that
have an intrinsic ordering
- ‘depth’ in graph mapped
to one axis

UNIX ancestry
SUGIYAMA STEP 1
- create layering of graph
- from domain specific knowledge
- longest path from root

- algorithmically determine best layering (NP-Hard)

- dummy nodes for long edges


1

2 3 4

5 6 7 8

9 10 11
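The "longest path from root" rule for step 1 can be sketched as follows: a node's layer is one more than the deepest layer among its predecessors, and any edge that skips layers needs dummy nodes. This is a minimal sketch assuming the input is already acyclic; the function names and the small example graph are hypothetical.

```python
def longest_path_layering(nodes, edges):
    """Assign each node of a DAG to a layer (roots are layer 0)."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)

    layer = {}

    def depth(n):
        # Memoized longest path from any root to n.
        if n not in layer:
            layer[n] = 1 + max((depth(p) for p in preds[n]), default=-1)
        return layer[n]

    for n in nodes:
        depth(n)
    return layer


def dummy_nodes_needed(layer, edges):
    """Each edge spanning k layers needs k - 1 dummy nodes."""
    return sum(layer[v] - layer[u] - 1 for u, v in edges)


nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (1, 4)]
layer = longest_path_layering(nodes, edges)
print(layer, dummy_nodes_needed(layer, edges))
```

Here edge (1, 4) spans two layers, so it is the one long edge that receives a dummy node.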
SUGIYAMA STEP 2
- minimize crossings layer by layer (NP-hard)
- numerous heuristics available

2 3 4

8 6 5 7

11 9 10
SUGIYAMA STEP 3
- final assignment of x-coordinates
- routing of edges

2 3 4

8 6 5 7

11 9 10
Gansner 1993
SUGIYAMA
+ nice, readable top down flow
+ relatively fast (depending on heuristic
used for crossing minimization)

- not really suitable for graphs that don’t have an intrinsic top down structure
- hard to implement
Force directed layouts
FORCE DIRECTED LAYOUT
- no intrinsic layering, now what?
- physics model

- edges = springs

- nodes = repulsive particles


Force-directed Layouts
• We want edges to be neither too short nor too long
• Physical analogy: Springs compress or expand to achieve ideal length
• We don’t want vertices to bunch up together
• Physical analogy: Electric charges with the same sign don’t bunch up
AESTHETIC RESULTS

highschool
dating network
FORCE MODEL
- many variations, but usually a physical analogy:
- repulsion: $f_R(d) = C_R \, m_1 m_2 / d^2$
  - $m_1$, $m_2$ are node masses
  - $d$ is the distance between nodes
- attraction: $f_A(d) = C_A \, (d - L)$
  - $L$ is the rest length of the spring
  - i.e. Hooke’s Law
- total force on a node $x$ with position $x'$, where $y' - x'$ points from $x$ toward $y$:
  $F(x) = \sum_{y \in \mathrm{neighbors}(x)} f_A(\lVert x' - y' \rVert)\,(y' - x') \;-\; \sum_{y \neq x} f_R(\lVert x' - y' \rVert)\,(y' - x')$
ALGORITHM
- start from random layout
- (global) loop:
  - for every node pair compute repulsive force
  - for every edge compute attractive force
  - accumulate forces per node
  - update each node position in direction of accumulated force
- stop when layout is ‘good enough’
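The loop above translates almost line for line into code. The sketch below makes several simplifying assumptions that are not from the slides: all node masses are 1, forces act along unit direction vectors, a fixed iteration count stands in for the "good enough" test, and the constants (`c_rep`, `c_att`, `rest_length`, `step`) are arbitrary illustrative values.

```python
import math
import random


def force_directed_layout(nodes, edges, c_rep=0.05, c_att=1.0,
                          rest_length=1.0, step=0.05, iterations=300, seed=0):
    """Spring/charge layout; nodes is a list, edges a list of (u, v)."""
    rng = random.Random(seed)
    # Start from a random layout.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Repulsion for every node pair: C_R / d^2, pushing the pair apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = c_rep / (d * d)
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
                force[b][0] -= f * dx / d
                force[b][1] -= f * dy / d

        # Attraction for every edge: Hooke's law C_A * (d - L).
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            f = c_att * (d - rest_length)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d

        # Move each node a small step along its accumulated force.
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return pos


pos = force_directed_layout(["a", "b"], [("a", "b")])
dist = math.hypot(pos["a"][0] - pos["b"][0], pos["a"][1] - pos["b"][1])
print(round(dist, 3))
```

With two connected nodes the layout settles where spring pull and charge push balance, slightly beyond the rest length. The nested pair loop is the quadratic repulsion cost that spatial data structures are meant to reduce.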
FORCE DIRECTED LAYOUTS
+ very flexible, aesthetic layouts on many types of graphs
+ can add custom forces
+ relatively easy to implement
- repulsion loop is O(n^2) per iteration
  - can speed up to O(n log n) using quadtree or k-d tree
- prone to local minima
  - can use simulated annealing
mentionmap
Grid Layouts
Adjacency Matrix
Alternate to node-link diagram: adjacency matrix
Slide by Frank van Ham
Adjacency Matrix
Henry & Fekete (2006)
Adjacency Matrix
• Change network to tabular data
and use a matrix representation
• Derived data: nodes are keys,
edges are boolean values
• Task: lookup connections, find well-connected clusters
• Scalability: millions of edges
• Can encode edge weight, too
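The "network to tabular data" step is small enough to show directly: nodes become row and column keys, and each cell records whether an edge exists (or its weight). This is a minimal sketch; the function name `adjacency_matrix` and the sample graph are hypothetical.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Build a matrix from edges given as (u, v) or (u, v, weight)."""
    index = {n: i for i, n in enumerate(nodes)}   # nodes are the keys
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v, *rest in edges:
        weight = rest[0] if rest else 1           # 1 marks an unweighted edge
        matrix[index[u]][index[v]] = weight
        if not directed:
            matrix[index[v]][index[u]] = weight   # undirected: symmetric
    return matrix


nodes = ["a", "b", "c"]
matrix = adjacency_matrix(nodes, [("a", "b"), ("b", "c", 5)])
for name, row in zip(nodes, matrix):
    print(name, row)
```

The order of `nodes` fixes the row and column order, so reordering it changes which clusters show up as visible blocks.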
Cliques in Adjacency Matrices

Node link and Adjacency Matrix
(can coexist)

[McGuffin]
Adjacency Matrix
Pros:
• great for dense graphs
• visually scalable
• can spot clusters
Cons:
• row order affects what you can see
• abstract visualization
• hard to follow (multilink) paths
Node-Link or Adjacency Matrix?
• Empirical study: for most tasks, node-link is better for small graphs and adjacency matrix is better for large graphs
• Immediate connectivity or neighbors are OK; estimating size (nodes & edges) is also OK
• People tend to be more familiar with node-link diagrams

https://bost.ocks.org/mike/miserables/
Attribute driven layouts
ATTRIBUTE-DRIVEN LAYOUT
- large node-link diagrams get messy!
- are there additional structures we can exploit?
- idea: use data attributes to perform layout
  - e.g., scatterplot based on node values
- dynamic queries and/or brushing can be used to enhance perception of connectivity

[Figures: cerebral; metabolic network with pathway labels GLYCOLYSIS, PPP, and TCA, node labels GLU, G6P, R5P, G3P, PYR, and CIT, and a color scale from 0 to 1.00]
OTHER NODE LINK LAYOUTS
- orthogonal
- great for UML diagrams
- algorithmically complex

- circular layouts
- emphasizes ring topologies
- used in social network
diagrams
- nested layouts
- recursively apply layout
algorithms
- great for graphs with
hierarchical structure
More node link layouts - Arc Diagram

Lecture: Interaction Techniques in Visualization

DATA VISUALIZATION
SPRING 2019

Dr. Muhammad Faisal Cheema


COMSATS UNIVERSITY
Fundamental idea
• Interpret the state of elements in the UI as a clause
in a query. As UI changes, update data
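A minimal sketch of this idea: every slider or checkbox contributes one clause, and the query is the conjunction of all clauses, rebuilt whenever the UI state changes. The function name `build_query`, the widget dictionaries, and the sample records are hypothetical.

```python
def build_query(sliders, checkboxes):
    """Turn UI state into a predicate over data rows.

    sliders: dict attribute -> (low, high) range
    checkboxes: dict attribute -> set of allowed values
    """
    clauses = []
    for attr, (low, high) in sliders.items():
        clauses.append(lambda row, a=attr, lo=low, hi=high: lo <= row[a] <= hi)
    for attr, allowed in checkboxes.items():
        clauses.append(lambda row, a=attr, s=frozenset(allowed): row[a] in s)
    # A row is shown only if it satisfies every clause.
    return lambda row: all(clause(row) for clause in clauses)


cars = [
    {"name": "a", "price": 10, "origin": "US"},
    {"name": "b", "price": 25, "origin": "Japan"},
    {"name": "c", "price": 40, "origin": "US"},
]

query = build_query({"price": (0, 30)}, {"origin": {"US"}})
visible = [c["name"] for c in cars if query(c)]

# The user drags the price slider: rebuild the query and refilter the data.
query = build_query({"price": (0, 50)}, {"origin": {"US"}})
visible_after = [c["name"] for c in cars if query(c)]
print(visible, visible_after)
```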
Interaction Summary
Change over Time
Select
Navigate
- Item Reduction: Zoom (Geometric or Semantic), Pan/Translate, Constrained
- Attribute Reduction: Slice, Cut, Project

[Munzner (ill. Maguire), 2014]


Selection
Selection
• Selection is often used to initiate other changes
• User needs to select something to drive the next change
• What can be a selection target?
- Items, links, attributes, (views)
• How?
- mouse click, mouse hover, touch
- keyboard modifiers, right/left mouse click, force
• Selection modes:
- Single, multiple
- Contiguous?
Highlighting
Highlighting
• Selection is the user action
• Feedback is important!
• How? Change selected item's visual encoding
- Change color: want to achieve visual popout
- Add outline mark: allows original color to be preserved
- Change size (line width)
- Add motion: marching ants

Highlighting

[http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html]
Navigation
Navigation
• Fix the layout of all visual elements but provide methods for the
viewpoint to change
• Camera analogy: only certain features visible in a frame
- Zooming
- Panning (aka scrolling)
- Translating
- Rotating (rare in 2D, important in 3D)
Panning

https://www.google.com/finance?q=INDEXFTSE
Zooming

https://www.google.com/finance?q=INDEXFTSE
“Geometric” vs. “Semantic” Zooming

Geometric Zooming: just like a camera
Semantic Zooming: visual appearance of objects can change at different scales
http://bl.ocks.org/mbostock/3680957
Multiple Views
Multiple Views
• Facet (noun and verb)
- particular aspect or feature of something
- to split
• Partition visualization into views/layers
- Either juxtapose or superimpose
- Depends on data and encoding
Multiple Views
Juxtapose and Coordinate Multiple Side-by-Side Views

Linked Highlighting

Share Data: All/Subset/None

Share Navigation
Multiple Views

Encoding \ Share data   All          Subset                        None
Same                    Redundant    Overview/Detail               Small Multiples
Multiform               Multiform    Multiform, Overview/Detail    No Linkage
Multiple Views
Partition into Side-by-Side Views

Superimpose Layers
Multiform Views
• The same data visualized in different ways
• Does not need to be a totally different encoding (all choices need
not be disjoint), e.g. horizontal positions could be the same
• One view becomes cluttered with too many attributes
• Consumes more screen space
• Allows greater separability between channels
Multiple Views
Example of Facets
Small Multiples
• Same encoding, but different data in each view (e.g. SPLOM)
[Figure: scatterplot matrix of sepal length, sepal width, petal length, and petal width]
[http://bl.ocks.org/mbostock/4063663]
Brushing
[Figure: the same scatterplot matrix of sepal length, sepal width, petal length, and petal width, with a brushed selection]
[http://bl.ocks.org/mbostock/4063663]
Multiple Views
Example of Partitioned Views
Overview-Detail View

[Wikipedia]
Partitioned View
[Figure: grouped bar chart of population (0.0 to 10M) by age group, from Under 5 Years to 65 Years and Over, for CA, TX, NY, FL, IL, and PA]
[M. Bostock, http://bl.ocks.org/mbostock/3887051]
Matrix Alignment
[Becker et al., 1996]
Recursive Subdivision
[Figure: grid of London boroughs (Harrow, Barnet, Enfield, ..., Bromley), each cell subdivided by housing type: Flat, Ter, Semi, Det]
[Slingsby et al., 2009]
Multiple Views
Example of Superimposed layers
Superimposed Line Charts
[Figure: temperature (ºF, 20 to 80) from October to September for Austin, New York, and San Francisco, drawn as three superimposed lines]
[M. Bostock, http://bl.ocks.org/mbostock/3884955]
Focus + Context
Focus and Context
Provides detailed view of a subset within context of the full dataset
[Figure labels: Overview/Detail; Multiform, Overview/Detail]

Why? For large or complex data, a single view of the entire dataset cannot capture fine details
Brush & Link
Brushing & Linking: multiple views that are simultaneously visible and linked together such that actions in one view affect the others

Primary strategy: highlighting

Linked Highlighting
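Linked highlighting reduces to sharing one selection (a set of item ids) across views: the brush in one view produces the set, and every other view uses it to decide which marks to highlight. This is a minimal sketch without any drawing; the function names and the four sample records are hypothetical values, not real measurements.

```python
def brush(items, x_attr, y_attr, x_range, y_range):
    """Ids of items whose (x, y) falls inside the brushed rectangle."""
    (x0, x1), (y0, y1) = x_range, y_range
    return {
        item_id for item_id, item in items.items()
        if x0 <= item[x_attr] <= x1 and y0 <= item[y_attr] <= y1
    }


def render_view(items, x_attr, y_attr, selected):
    """Marks for one view as (x, y, highlighted) in id order."""
    return [
        (item[x_attr], item[y_attr], item_id in selected)
        for item_id, item in sorted(items.items())
    ]


# Hypothetical records with the four attributes from the SPLOM example.
flowers = {
    1: {"sepal_length": 5.0, "sepal_width": 3.5, "petal_length": 1.5, "petal_width": 0.2},
    2: {"sepal_length": 6.5, "sepal_width": 3.0, "petal_length": 4.5, "petal_width": 1.5},
    3: {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 6.0, "petal_width": 2.2},
    4: {"sepal_length": 5.2, "sepal_width": 3.4, "petal_length": 1.4, "petal_width": 0.3},
}

# Brush in the sepal view ...
selected = brush(flowers, "sepal_length", "sepal_width", (4.5, 5.5), (3.0, 4.0))
# ... and the petal view highlights the same items.
petal_view = render_view(flowers, "petal_length", "petal_width", selected)
print(selected, petal_view)
```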
Zoom Techniques
Provides detailed view of a subset within context of the full dataset
[Figure labels: Overview/Detail; Multiform, Overview/Detail]
Focus+Context
• Show everything at once but compress regions that are not the
current focus
- User shouldn't lose sight of the overall picture
- May involve some aggregation in non-focused regions
- "Nonliteral navigation" like semantic zooming
• Elision
• Superimposition: more directly tied than with layers
• Distortion

Focus+Context
Overview
Reduce: Filter, Aggregate, Embed
Embed: Elide Data, Superimpose Layer, Distort Geometry
Focus + Context
Elision
Elision
• There are a number of examples of elision, including in text, DOITrees, …
• Includes both filtering and aggregation, but the goal is to give an overall view of the data
• In visualization, usually correlated with focus regions
Elision: DOITrees
[Heer and Card, 2004]
Focus + Context
Superimposed layers
Superimposition with Interactive Lenses
(a) Alteration (b) Suppression
[ChronoLenses and Sampling Lens in Tominski et al., 2014]
Superimposition with Interactive Lenses
(c) Enrichment
[Extended Lens in Tominski et al., 2014]
Focus + Context
Distortion
Distortion: Fisheye Lens
[M. Bostock, http://bost.ocks.org/mike/fisheye/]
Fisheye Lens
Leung 1994
http://www.cs.umd.edu/class/fall2002/cmsc838s/tichi/fisheye.html
Distortion Choices
• How many focus regions?
- One
- Multiple
• Shape of the focus?
- Radial
- Rectangular
- Other
• Extent of the focus
- Constrained similar to magic lenses
- Entire view changes
• Type of interaction:
- Geometric, moveable lenses, rubber sheet
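To make geometric distortion concrete, here is a one-dimensional fisheye with a single focus where the entire view changes. It uses the graphical fisheye transform of Sarkar and Brown, g(x) = (d + 1)x / (dx + 1), which is an assumption chosen for this sketch and not necessarily the method behind the figures above. The function name `fisheye_1d` and the default distortion factor are hypothetical.

```python
def fisheye_1d(p, focus, lo, hi, d=3.0):
    """Distort coordinate p in [lo, hi] around focus; d = 0 means no distortion."""
    if p == focus:
        return p
    # Distortion is applied relative to the boundary on p's side of the focus.
    boundary = hi if p > focus else lo
    span = boundary - focus
    if span == 0:
        return p
    x = (p - focus) / span              # normalized distance from focus, 0..1
    g = (d + 1) * x / (d * x + 1)       # magnifies near the focus, compresses far away
    return focus + g * span


for p in (0.0, 0.4, 0.5, 0.6, 1.0):
    print(p, round(fisheye_1d(p, focus=0.5, lo=0.0, hi=1.0), 3))
```

Points near the focus are pushed apart while the boundaries stay fixed, so the whole extent stays visible. Applying the same function independently to x and y gives a rectangular (Cartesian) lens, and applying it to the distance from the focus gives a radial one.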
Examples of
Focus + Context
Stretch and Squish Navigation

[McLachlan et al., 2008]


20
Focus+Context in
Graph Exploration

(a) Bring (step 1) – Selecting a node fades out all graph elements but the node neighborhood.
(b) Bring (step 2) – Neighbor nodes are pulled close to the selected node.
(c) Go – After selecting a neighbor (the green node in Fig. 4(b)), a short animation brings the focus towards a new neighborhood.
1

Lecture : Color in Visualization Design

DATA VISUALIZATION
SPRING 2019

Dr. Muhammad Faisal Cheema


COMSATS UNIVERSITY
PERCEPTION

2
CONES & RODS

https://askabiologist.asu.edu/sites/default/files/resources/articles/seecolor/Light-though-eye-big.png 3
CONES & RODS

http://i.stack.imgur.com/wIbcE.jpg
10
http://thebrain.mcgill.ca/flash/a/a_02/a_02_m/a_02_m_vis/a_02_m_vis.html
CONES & RODS
Rods: 120 million
Cones: 5-6 million
This is why luminance (brightness) is a more effective encoding channel!

Cones:
64% red-sensitive
32% green-sensitive
2% blue-sensitive
This is why we are so sensitive to red!

http://arthistoryresources.net/visual-experience-2014/visual-experience-2014-images/red-green-blue-wavelengths+rods-big.jpg 11
PERIPHERAL VISION

https://en.wikipedia.org/wiki/Peripheral_vision 6
LOW-LEVEL FEATURE ANALYSIS
Shape

Color

Motion

Ware,VTFD 7
Use these “popout” effects to help design effective visualizations!

(E.g., draw the viewer’s attention to main points, effective redundant encodings, etc.)

Ware,VTFD 8
(A LITTLE MORE) PERCEPTION

9
“Get it right in black and white.”
-Maureen Stone

https://research.tableau.com/user/maureen-stone 10
Luminance
Luminance = the amount of visible light that comes to the eye from a surface

Lightness = the perceived intensity of reflected light (reflectance) from a surface

Brightness = the perceived intensity of emitted light


11
Lightness Constancy

The perception that the apparent brightness of light and dark surfaces
remains more or less the same under different luminance conditions
is called lightness constancy.

12
“Simultaneous Contrast”

13
“Simultaneous Contrast”

20
“Simultaneous Contrast”

Avoid gradients as backgrounds or bars!


Luminance Channel Summary
• No edges without lightness difference
• No shading without lightness variation
• Has higher spatial sensitivity than color channels
• Contrast defines legibility, attention, layering
• Controlling luminance is primary rule of design

26
COLOR

27
Why color…?
• Color for labeling and annotation
• Color for measuring (encoding sequential data)
• Color for encoding categories
• Color to encode meaning (conventions, representation)
• Color as beauty (aesthetics)
29
Why color…?
Functions of color:
Identify, Group, Layer, Highlight

Ware “InformationVisualization:Perception for Design” 30


“… avoiding catastrophe becomes
the first principle in bringing color
to information: above all, do no harm.”
-Edward Tufte

Tufte,“Envisioning Information” 31
CONES & RODS
Red, Green, Blue

trichromacy = possessing three independent channels for conveying color information
https://askabiologist.asu.edu/sites/default/files/resources/articles/seecolor/Light-though-eye-big.png 32
CONES & RODS - COLOR PERCEPTION
opponent-process model: the visual system detects differences between
the responses of the cones
3 opponent channels:
black vs. white (Luminance)
▸ combination of R & G
red vs. green
▸ difference between R & G
blue vs. yellow
▸ difference between L & B

NOTE: opposite colors are never perceived together (no reddish green or bluish yellow)
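The three opponent channels listed above can be written as a toy calculation. This is only a sketch under strong assumptions: the cone responses are given as plain numbers r, g, b, the channels are unweighted sums and differences, and the sign conventions are chosen here for illustration. Real physiological models use weighted combinations.

```python
def opponent_channels(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Toy opponent-process transform of three cone responses.

    Returns (luminance, red_green, blue_yellow).
    """
    luminance = r + g            # black vs. white: combination of R & G
    red_green = r - g            # red vs. green: difference between R & G
    blue_yellow = b - luminance  # blue vs. yellow: difference between L & B (sign is an assumption)
    return luminance, red_green, blue_yellow


if __name__ == "__main__":
    print("reddish input:  ", opponent_channels(1.0, 0.0, 0.0))
    print("yellowish input:", opponent_channels(1.0, 1.0, 0.0))
    print("bluish input:   ", opponent_channels(0.0, 0.0, 1.0))
```

Because red_green is a single signed number, it cannot be strongly "red" and strongly "green" at once, which is one way to read the note that opposite colors are never perceived together.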
33
Color
Constancy

34
35
“Simultaneous Contrast”

36
“Simultaneous Contrast”

37
“Simultaneous Contrast”

38
“Simultaneous Contrast”

Be careful with bars and scatter plot points - the colors may appear different against different background
colors and neighboring colors!

Be aware that colors in legends may appear different than on the plot!


“Simultaneous Contrast”

39
“Simultaneous Contrast”

39
“Simultaneous Contrast”

40
Small Area Effects

“Bezold Spreading Effect” 41


Small Area Effects

“Bezold Spreading Effect” 42


Small Area Effects
Be careful with colors in scatter plots!
Be aware of color changes when adding borders around bars and plots!
Be aware that colors in legends may appear different than on the plot!

“Bezold Spreading Effect” 42


Which area is larger
(green or red)?

43
Which area is larger
(green or red)?

Areas are equal(!).

Cleveland & McGill,“A Color-Caused Optical Illusion on a Statistical Graph”,1983 44


Color Vocabulary Summary

Luminance

Saturation

Hue

45
Color Deficiencies (Color Blindness)
Person with faulty cones (or faulty pathways):
normal

Protanope = faulty red cones


Deuteranope = faulty green cones

Tritanope = faulty blue cones

46
Color Deficiencies (Color Blindness)

47
48
Those with deuteranopia (red/green color blindness) will have difficulty seeing the numbers.
“Get it right in black and white.”

49
Color Deficiencies (Color Blindness)

http://www.vischeck.com/vischeck/vischeckImage.php 55
https://www.nytimes.com/interactive/2018/02/06/climate/flood-toxic-chemicals.html
Primary Colors?
• Red, Green, and Blue
• Red, Yellow, and Blue
• Orange, Green, and Violet
• Cyan, Magenta, and Yellow
• All of the above!

17
Color Addition and Subtraction
Color Spaces and Gamuts

[http://dot-color.com/2012/08/14/color-space-confusion/]
Color Spaces and Gamuts
• Color space: the organization of all colors in space
- Often human-specific, what we can see (e.g. CIELAB)
• Color gamut: a subset of colors
- Defined by corners in the color space
- What can be produced on a monitor (e.g. using RGB)
- What can be produced on a printer (e.g. using CMYK)
- The gamut of your monitor != the gamut of someone else's != the
gamut of a printer
Color Models
• A color model is a representation of color using some basis
• RGB uses three numbers (red, green, blue) to represent color
• Color space ~ color model, but there can be many color models
used in the same color space (e.g. OGV)
• Hue-Saturation-Lightness (HSL) is more intuitive and useful
- Hue captures pure colors
- Saturation captures the amount of white mixed with the color
- Lightness captures the amount of black mixed with a color
- HSL color pickers are often circular
• Hue-Saturation-Value (HSV) is similar (swap black with gray for the
final value), linearly related
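As a small illustration of these color models, the sketch below converts one RGB triple to HSL and HSV using Python's standard `colorsys` module. The helper name `describe` is hypothetical. Note that `colorsys.rgb_to_hls` returns its components in hue, lightness, saturation order.

```python
import colorsys


def describe(r: float, g: float, b: float) -> dict:
    """Return HSL and HSV descriptions of an RGB color (all channels in 0..1)."""
    h_hls, lightness, s_hls = colorsys.rgb_to_hls(r, g, b)  # note the H, L, S order
    h_hsv, s_hsv, value = colorsys.rgb_to_hsv(r, g, b)
    return {
        "hsl": (h_hls * 360.0, s_hls, lightness),  # hue in degrees
        "hsv": (h_hsv * 360.0, s_hsv, value),
    }


if __name__ == "__main__":
    print("pure red:", describe(1.0, 0.0, 0.0))
    print("pink:    ", describe(1.0, 0.5, 0.5))
```

Pure red and pink share the same hue, but the two models describe the difference differently: in HSL the lightness rises, while in HSV the saturation falls and the value stays at 1.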
Color Maps
Color Map = mapping between color and value

http://matplotlib.org/mpl_examples/color/colormaps_reference_05.png 57
Rainbow Color Map
Why this color map is a poor choice...
• No perceptual ordering (confusing)
• No luminance variation (obscures details)
• Viewers perceive sharp transitions in color as sharp
transitions in the data, even when this is not the case
(misleading)
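One way to probe the luminance behaviour of a rainbow-style map is to sweep the hue from red to blue and compute an approximate luminance for each color. This sketch assumes a simple HSV hue sweep as a stand-in for a rainbow colormap and applies the Rec. 709 luminance weights directly to the RGB values (ignoring gamma), so the numbers are indicative only.

```python
import colorsys


def relative_luminance(r: float, g: float, b: float) -> float:
    """Approximate luminance using Rec. 709 weights (RGB treated as linear)."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def rainbow_luminances(n: int = 5) -> list[float]:
    """Luminance of n colors sampled from red (hue 0) to blue (hue 2/3)."""
    values = []
    for i in range(n):
        hue = (2.0 / 3.0) * i / (n - 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        values.append(relative_luminance(r, g, b))
    return values


def is_monotonic(values: list[float]) -> bool:
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


if __name__ == "__main__":
    lum = rainbow_luminances()
    print([round(v, 3) for v in lum])
    print("monotonic:", is_monotonic(lum))
```

In this sweep the luminance rises and falls along the scale instead of changing steadily with the data value, which is consistent with the ordering problem described above. A gray ramp, by contrast, is monotonic by construction.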

Borland & Russell (2007) 59


Rainbow Color Map
• No perceptual ordering (confusing)

Borland & Russell (2007) 60


Rainbow Color Map
• No luminance variation (obscures details)
• Viewers perceive sharp transitions in color as sharp transitions in the data,
even when this is not the case (misleading)

61
Artifacts from Rainbow Colormaps
Colormap

• A colormap specifies a mapping between colors and data values


• Colormap should follow the expressiveness principle
• Types of colormaps: binary, categorical, diverging, sequential
Categorical vs. Ordered

• Hue has no implicit ordering: use for categorical data


• Saturation and luminance do: use for ordered data

Luminance

Saturation

Hue

[Munzner (ill. Maguire), 2014]


Color Maps
THREE MAIN TYPES:
Categorical: Does not imply magnitude differences (categorical/nominal data).
Distinct hues with similar emphasis.

Sequential: Best for ordered data that progresses from low to high (ordinal, quantitative data).
Luminosity channel effectively employed.

Diverging: For data with a “diverging” (mid) point (quantitative data).
Equal emphasis on mid-range critical values and extremes at both ends of the data range.

Brewer, Cynthia A. 1994. Color use guidelines for mapping and visualization. Chapter 7 (pp. 123-147) in Visualization in Modern Cartography
Color Maps
ALSO...
Bivariate: Displays two variables.
Combination of two sequential color schemes.
These are very difficult to design effectively, make
intelligible, and make color-blind friendly.

+ =

65
http://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/
Categorical Colormap Guidelines

• Don't use too many colors (~12)


• Remember your background has a color, too
• Nameable colors help
• Be aware of luminance (e.g. difference between blue and yellow)
• Think about other marks you might wish to use in the visualization
Categorical Colormaps

[colorbrewer2.org]
Categorical Colormaps

[colorbrewer2.org]
Number of distinguishable colors?

[Sinha & Meller, 2007]


13
Number of distinguishable colors?

[Sinha & Meller, 2007]


13
Discriminability
• Often, fewer colors are better
• Don't let viewers combine colors because they can't tell the
difference
• Make the combinations yourself
• Also, can use the "Other" category to reduce the number of colors
Ordered Colormaps
• Used for ordinal or quantitative attributes
• [0, N]: Sequential
• [-N, 0, N]: Diverging (has some meaningful midpoint)
• Can use hue, saturation, and luminance
• Remember hue is not a magnitude channel so be careful
• Can be continuous (smooth) or segmented (sharp boundaries)
- Segmented matches with ordinal attributes
- Can be used with quantitative data, too.
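The difference between sequential and diverging scales can be sketched as two normalizations followed by a color interpolation. All names, the example colors, and the linear RGB interpolation are assumptions made for illustration; in practice, carefully designed palettes (e.g. from Color Brewer) would replace the interpolation.

```python
Color = tuple[float, float, float]


def sequential_position(value: float, vmin: float, vmax: float) -> float:
    """Map a value in [vmin, vmax] to a position in [0, 1], clamping outliers."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    return min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))


def diverging_position(value: float, midpoint: float, half_range: float) -> float:
    """Map a value to [-1, 1] around a meaningful midpoint, clamping outliers."""
    if half_range <= 0:
        raise ValueError("half_range must be positive")
    return min(1.0, max(-1.0, (value - midpoint) / half_range))


def lerp(c0: Color, c1: Color, t: float) -> Color:
    """Linear interpolation between two RGB colors."""
    return tuple(a + (b - a) * t for a, b in zip(c0, c1))


def diverging_color(value: float, midpoint: float, half_range: float,
                    low: Color = (0.0, 0.0, 1.0),
                    mid: Color = (1.0, 1.0, 1.0),
                    high: Color = (1.0, 0.0, 0.0)) -> Color:
    """Pick a color from a two-sided scale that meets at a neutral midpoint."""
    t = diverging_position(value, midpoint, half_range)
    return lerp(mid, high, t) if t >= 0 else lerp(mid, low, -t)


if __name__ == "__main__":
    print(sequential_position(5, 0, 10))
    print(diverging_color(-5, 0, 10))
```

The sequential mapping has one direction of increase, while the diverging mapping gives the two sides of the midpoint equal emphasis.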
Continuous Colormap
Sequential Colormap
Color Maps

67
Color Maps

68
Color Maps

Sequential (wrong!) Diverging


Sequential rainbow(wrong!) https://www.research.ibm.com/people/l/lloydt/color/color.HTM 69
Color Brewer

http://colorbrewer2.org/ 70
Colorgorical

http://vrl.cs.brown.edu/color 71
Color Advice Summary
Use a limited hue palette
• Control color “pop out” with low-saturation colors
• Avoid clutter from too many competing colors
Use neutral backgrounds
• Control impact of color
• Minimize simultaneous contrast
Use Color Brewer for scales
Don’t forget aesthetics!
Based on slides by Hanspeter Pfister, Maureen Stone
Color Design Rules

Wang, et al.,“Color Design for IllustrativeVisualization” (2008) 73


Color Design Rules
R1: Vivid colors (bright, saturated colors) stand out. They guide attention to a particular feature, generating the pop-out effect.
R2: An excessive amount of vivid colors is perceived as unpleasant and overwhelming; use them between duller background tones.
R3: Foreground-background separation works best if the foreground color is bright and highly saturated, while the background is
de-saturated.
R4: Colors can be better discriminated if they differ simultaneously in hue, saturation and lightness.
R5: The low end lightness steps should be very small, while the high end requires larger steps (Weber’s Law).
R6: Discrimination is poorer for small objects. Hue, saturation and lightness discrimination all decrease.
R7: Complementary (opponent) colors are located opposite on the color wheel and have the highest chromatic contrast. When
mixing opponent colors they may cancel each other, giving neutral grey.
R8: Some hues appear inherently more saturated than others. Yellow has the least number of perceived saturation steps (10). For
hues on both sides of yellow, the saturation steps increase linearly.
R9: An opposite effect of R8 is that the brightest lights fall in the yellow range, while blues, violets (purples) and reds are least
bright.
R10: For labeling, apart from black, white, grey, there are 4 primary colors (red, green, blue, yellow) and 4 secondary colors (brown,
orange, purple, pink). Also, the number of color labels should be ≤ 6-7.
R11: Warm colors (red, orange, yellow) excite emotions, grab attention. Cold colors (green to violet) create openness and
distance.
R12: Important for hue-based labeling is the fact that increasing the lightness (and saturation) does not change the perceived hue.
R13: Also important for labeling is that objects of similar hue are perceived as a group, while objects of different hues are
perceived as belonging to different groupings.
Wang, et al., “Color Design for Illustrative Visualization” (2008)
Color Design Rules
R5:The low end lightness steps should be very small, while the high end requires larger steps (Weber’s Law).

Weber’s Law
Our ability to detect a difference between two objects with a certain
attribute is related to the percent difference in the attribute, not the
absolute difference.

$$\frac{\Delta S}{S} = \text{constant}$$

where $S$ is the initial stimulus and $\Delta S$ the difference between stimuli
(“just noticeable difference”).

*Ratios are more important than magnitude differences. 75


Color Design Rules
R5:The low end lightness steps should be very small, while the high end requires larger steps (Weber’s Law).

Weber’s Law
Our ability to detect a difference between two objects with a certain
attribute is related to the percent difference in the attribute, not the
absolute difference.

$$\frac{\Delta S}{S} = \frac{\text{just-noticeable difference}}{\text{background intensity}} = \text{constant} \approx 1\%$$

where $S$ is the initial stimulus and $\Delta S$ the difference between stimuli
(“just noticeable difference”).

*Ratios are more important than magnitude differences. 75
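A short sketch of what Weber's Law implies for choosing steps: if each step must differ from the previous one by a constant fraction k, the levels form a geometric sequence, so absolute steps are small at the low end and large at the high end. The function name and the example values are assumptions; the slides cite k of roughly 1% for a just-noticeable difference.

```python
def weber_steps(start: float, stop: float, k: float = 0.01) -> list[float]:
    """Levels from start up to stop where each level is (1 + k) times the previous one."""
    if start <= 0 or k <= 0:
        raise ValueError("start and k must be positive")
    levels = [start]
    while levels[-1] * (1.0 + k) <= stop:
        levels.append(levels[-1] * (1.0 + k))
    return levels


if __name__ == "__main__":
    levels = weber_steps(10.0, 100.0, k=0.25)
    steps = [b - a for a, b in zip(levels, levels[1:])]
    print([round(v, 1) for v in levels])
    print([round(s, 1) for s in steps])  # absolute steps grow; ratios stay constant
```

Every pair of neighbouring levels has the same ratio, which matches the remark that ratios matter more than magnitude differences.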


Color Design Rules
R5:The low end lightness steps should be very small, while the high end requires larger steps (Weber’s Law).

Weber’s Law

We tend to perceive discrete steps in continuous variations in magnitude.
79
Color Design Rules
R6: Discrimination is poorer for small objects. Hue, saturation and lightness discrimination all decrease.

Szafir "Modeling Color Difference forVisualization Design" (2017) 80


More (Advanced) Color Picking Advice

If picking colors and making your own palette, make sure to transition through and pick discriminable
colors that vary in hue and brightness.
Wong "Points of view:Color coding" (2010) 81
More (Advanced) Color Picking Advice
COLOR SPACES
RGB: Great for monitor display. Not perceptually uniform.
HSL: Intuitive: Hue, Saturation, Lightness. Not perceptually uniform. (HSV is a variation on HSL.)
Lab: Perceptually uniform! (L approximates human perception of lightness; a = R/G and b = Y/B channel.)

Perceptually uniform: a change of the same amount in a color value
should produce a change of about the same visual importance
More (Advanced) Color Picking Advice
Luminance is tricky…

HSL

Lab

83
visualization: the visual representation of data
to reinforce human cognition
(the representation may be static or interactive; the data may be abstract or spatial)

4
Why visualize your data?
• RECORD information
• ANALYZE data to support reasoning
• CONFIRM hypotheses
• COMMUNICATE ideas to others

5
GOALS FOR TODAY
• Learn basic “do’s and don’ts” of visualization design in
order to be honest, have integrity, and be clear
• Learn Tufte’s “Graphical Integrity” principles
• Be aware that there is a fuzzy gray area of interpretation
and opinion on integrity

6
9
2/14/2019

Graphical Excellence
that which gives the viewer the greatest number of ideas
in the shortest time
with the least ink
in the smallest space
10
2/14/2019

Graphical Excellence
that which gives the viewer the greatest number of ideas
in the shortest time
with the least ink
in the smallest space
Minard’s map of Napoleon’s march to and from Moscow
11
Graphical Integrity
representation of numbers should be directly proportional
to the numerical quantities represented

graphics must not quote out of context


clear, detailed, and thorough labeling should
be used to defeat graphical distortion and
ambiguity
show data variation,
not design variation
16

Graphical Integrity
show data variation, not design variation

New York Times, 19 Dec 1978
19
2/14/2019

Graphical Integrity
clear, detailed, and thorough labeling should be used to defeat graphical distortion
and ambiguity

New York Times, 02 May 2010
20
2/14/2019

Graphical Integrity
clear, detailed, and thorough labeling should be used to defeat graphical distortion
and ambiguity

Washington Post
Some GRAPHICAL INTEGRITY
Principles in detail

21
“Graphical Integrity”

“Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity.
Write out explanations of the data on the graphic itself. Label important events in the data.”

Tufte, “Visual Display of Quantitative Information” (1983)


(Axes and axis labels, titles, annotations, legends, etc.)

Tufte,“Visual Displayof Quantitative Information” (1983) 17


“Clear, detailed, and thorough labeling should be used to defeat graphical
distortion and ambiguity. Write out explanations of the data on the
graphic itself. Label important events in the data.” Tufte,“Visual Displayof Quantitative Information” (1983) 18
“Distorted Scales”

$11,014
$3,549,385

y-axis
baseline?!
“Clear, detailed, and thorough labeling should be used to defeat graphical
distortion and ambiguity. Write out explanations of the data on the
graphic itself. Label important events in the data.” Tufte,“Visual Displayof Quantitative Information” (1983) 18
Interest Rates
[Figure: line chart of interest rates (Percent %), 2008-2012, with the y-axis truncated to the range 3.140-3.154]
Based on http://data.heapanalytics.com/how-to-lie-with-data-visualization
Interest Rates
[Figure: line chart of interest rates (Percent %), 2008-2012, with the y-axis running from 0.00 to 4.00]
Based on http://data.heapanalytics.com/how-to-lie-with-data-visualization
Interest Rates
[Figure: line chart of interest rates (Percent %), 2008-2012, with the y-axis running from 0.00 to 4.00; annotation: CONTEXT!]
Based on http://data.heapanalytics.com/how-to-lie-with-data-visualization
“Double the axes, double the mischief”
http://www.thefunctionalart.com/2015/10/double-axes-double-mischief.html
http://www.babynamewizard.com/voyager 23
“Graphical Integrity”

“The representation of numbers, as physically measured on the surface of the graphic itself,
should be directly proportional to the numerical quantities measured.”

Tufte, “Visual Display of Quantitative Information” (1983)


“The representation of numbers, as physically measured on the surface of
the graphic itself, should be directly proportional to the numerical
quantities measured.” Tufte,“Visual Displayof Quantitative Information” (1983) 25
Lie Factor
$$\text{Lie Factor} = \frac{\text{size of effect in graphic}}{\text{size of effect in data}}$$

Lie Factor > 1: overstating
Lie Factor = 1: accurate :)
Lie Factor < 1: understating

“The representation of numbers, as physically measured on the surface of


the graphic itself, should be directly proportional to the numerical
quantities measured.” Tufte,“Visual Displayof Quantitative Information” (1983) 26
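Lie Factor
$$\text{Lie Factor} = \frac{\text{size of effect in graphic}}{\text{size of effect in data}}$$
$$\text{Image} = \frac{5.3\,\text{in} - 0.6\,\text{in}}{0.6\,\text{in}} = 7.83 = 783\%$$
$$\text{Data} = \frac{27.5 - 18}{18} = 0.53 = 53\%$$
$$\text{Lie Factor} = \frac{783\%}{53\%} = 14.8$$
Lie Factor > 1: overstating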

“The representation of numbers, as physically measured on the surface of


the graphic itself, should be directly proportional to the numerical
quantities measured.” Tufte,“Visual Displayof Quantitative Information” (1983) 27
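The Lie Factor arithmetic can be checked with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the "size of effect" is taken to be the relative change between two values, as in the worked numbers above.

```python
def effect_size(first: float, second: float) -> float:
    """Relative change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return (second - first) / first


def lie_factor(graphic_first: float, graphic_second: float,
               data_first: float, data_second: float) -> float:
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


if __name__ == "__main__":
    # Measured lengths in the graphic (inches) versus the underlying data values.
    print(round(lie_factor(0.6, 5.3, 18, 27.5), 1))
```

Without the intermediate rounding to 783% and 53%, the ratio is about 14.84, which agrees with the 14.8 obtained from the rounded percentages.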
Lie Factor
IN-CLASS ACTIVITY: Calculate for yourself!
$$\text{Lie Factor} = \frac{\text{size of effect in graphic}}{\text{size of effect in data}}$$
$$\text{Data} = \frac{2 - 1}{1} = 1 = 100\%$$
[Figure: two graphics, each depicting the values 1 and 2]

“The representation of numbers, as physically measured on the surface of


the graphic itself, should be directly proportional to the numerical
quantities measured.” Tufte,“Visual Displayof Quantitative Information” (1983) 29
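Lie Factor
IN-CLASS ACTIVITY: Calculate for yourself!
$$\text{Lie Factor} = \frac{\text{size of effect in graphic}}{\text{size of effect in data}}$$
$$\text{Data} = \frac{2 - 1}{1} = 1 = 100\%$$
Left graphic (✓):
$$\text{Image} = \frac{2 - 1}{1} = 1 = 100\%, \qquad \text{Lie Factor} = \frac{100\%}{100\%} = 1$$
Right graphic (X):
$$\text{Image} = \frac{2^2 - 1^2}{1^2} = 3 = 300\%, \qquad \text{Lie Factor} = \frac{300\%}{100\%} = 3$$
Make sure area is proportional to data!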
“The representation of numbers, as physically measured on the surface of
the graphic itself, should be directly proportional to the numerical
quantities measured.” Tufte,“Visual Displayof Quantitative Information” (1983) 29
“Graphical Integrity”
Data Ink = the ink used to show data
$$\text{Data Ink Ratio} = \frac{\text{data-ink}}{\text{total ink in graphic}}$$
Tufte: maximize the data-ink ratio
[Figure: example with a low data-ink ratio next to an example with a high data-ink ratio]

Tufte,“Visual Displayof Quantitative Information” (1983) 30


“Graphical Integrity”

“The number of information-carrying (variable) dimensions depicted should not exceed the
number of dimensions in the data.”

Tufte, “Visual Display of Quantitative Information” (1983)


“No Unjustified 3D”

“The number of information-carrying (variable) dimensions depicted


should not exceed the number of dimensions in the data.” 33
“No Unjustified 3D”
Left plot: # Dimensions in data: 2, # Dimensions in plot: 3
Right plot: # Dimensions in data: 2, # Dimensions in plot: 2

“The number of information-carrying (variable) dimensions depicted


should not exceed the number of dimensions in the data.” 33
“No Unjustified 3D”
Occlusion!
Lie Factor!

http://help.infragistics.com/Help/Doc/WinForms/2014.2/CLR4.0/html/Images/Chart_Bar_Chart_03.png
http://img.brothersoft.com/screenshots/softimage/0/3d_charts-171418-1269568478.jpeg

“The number of information-carrying (variable) dimensions depicted


should not exceed the number of dimensions in the data.” 34
“No Unjustified 3D”
Unjustified 3D!
Lie factor!

http://stats.stackexchange.com/questions/109076/what-is-your-favorite-statistical-graph/109080 35
“No Unjustified 3D”

“The number of information-carrying (variable) dimensions depicted


should not exceed the number of dimensions in the data.” 36
“No Unjustified 3D”

This is not just a design principle, it has lots of


experimental and quantitative data to back it up!

“The number of information-carrying (variable) dimensions depicted


should not exceed the number of dimensions in the data.” 37
“The number of information-carrying (variable) dimensions depicted
should not exceed the number of dimensions in the data.” Tory,et al.(2007) 38
“Graphical Integrity”

To achieve graphical “excellence” according to Tufte:


1. Above all else show the data.
2. Maximize the data-ink ratio.
3. Erase non-data ink.
4. Erase redundant data ink.
5. Revise and edit.

Tufte,“Visual Displayof Quantitative Information” (1983) 39


“Graphical Integrity”

IN-CLASS ACTIVITY:
Use paper/pen to sketch
“Tufte” version!
40
“Graphical Integrity”

Percentage

IN-CLASS ACTIVITY:
Use paper/pen to sketch
“Tufte” version!
Month 40
“Chart Junk”

Bateman, et al.(2010) 41
Tufte,“Beautiful Evidence” (2006) 43
Not all “visual
embellishments”
are “chart junk”!

Tufte,“Beautiful Evidence” (2006) 43


“Chart Junk”

Chart junk can...
... persuade, help with memorability, engage
... bias, reduce data-ink ratio, clutter, degrade trust

Take-away: it depends on your audience, task, and context...

44
Similar advice of William Cleveland
(The Elements of Graphing Data, 1985)
• CLEAR VISION: Make clear visualizations, and ensure that the
data stands out.
• CLEAR UNDERSTANDING: Ensure that main points and
conclusions are graphically clear and represented.
• SCALES: Pick appropriate axes and tick-mark scales, and ensure
all the data is represented.
• GENERAL STRATEGY: Ensure all the data is represented. Design
your visualizations carefully and allow time to proofread.
46
Data Preparation as a step in the Knowledge Discovery Process
DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge
3
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily the data can be understood?

4
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
5
Forms of Data Preprocessing
Data Preprocessing
• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

8
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data deleted because it was inconsistent with other recorded data
– data not entered due to misunderstanding
– certain data not being considered important at the time of
entry
– failure to register history or changes of the data
• Missing data may need to be inferred
9
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant : e.g., “unknown”, or “- ” a new class?!
– the attribute mean (symmetric) or median (skewed)
– the attribute mean for all samples belonging to the same class,
e.g. average income in same credit_risk
– the most probable value: inference-based such as Bayesian
formula or decision tree
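The automatic fill strategies listed above can be sketched with the standard library alone. Missing values are represented as `None`, and the function names are assumptions chosen for this example.

```python
from statistics import mean, median


def fill_missing(values: list, strategy: str = "mean") -> list:
    """Replace None with the attribute mean (symmetric data) or median (skewed data)."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]


def fill_by_class(values: list, labels: list) -> list:
    """Replace None with the mean of the known values that share the same class label."""
    groups: dict = {}
    for v, label in zip(values, labels):
        if v is not None:
            groups.setdefault(label, []).append(v)
    return [mean(groups[label]) if v is None else v
            for v, label in zip(values, labels)]


if __name__ == "__main__":
    income = [10, None, 30, None, 50]
    risk = ["low", "low", "high", "high", "high"]
    print(fill_missing(income))
    print(fill_by_class(income, risk))
```

The class-based fill uses only the samples in the same class (here a hypothetical credit-risk label), so it can preserve differences between classes that a single global mean would blur.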
10
How to Handle Missing Data?
• Fill it in automatically with
– the most probable value:
• Inference-based such as Bayesian formula or decision tree

• Identify relationships among variables


– Linear regression, Multiple linear regression, Nonlinear regression

• Nearest-Neighbour estimator
– Find the k neighbours nearest to the point and fill in the most frequent value or
the average value
– Finding neighbours in a large dataset may be slow
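A brute-force sketch of the nearest-neighbour estimator: complete records are ranked by Euclidean distance to the record with the missing value, and the k closest supply either the average or the most frequent value. The data layout (pairs of feature tuple and known value) and the function name are assumptions made for this example.

```python
import math
from collections import Counter


def knn_impute(rows, query, k: int = 3, categorical: bool = False):
    """Estimate a missing value from the k complete rows nearest to query.

    rows  -- list of (features, value) pairs with known values
    query -- feature tuple of the record whose value is missing
    """
    # Sorting every row by distance is the slow part on a large dataset.
    nearest = sorted(rows, key=lambda row: math.dist(row[0], query))[:k]
    values = [value for _, value in nearest]
    if categorical:
        return Counter(values).most_common(1)[0][0]  # most frequent value
    return sum(values) / len(values)                 # average value


if __name__ == "__main__":
    rows = [((0, 0), 10), ((1, 0), 12), ((0, 1), 14), ((10, 10), 100)]
    print(knn_impute(rows, (0.1, 0.1), k=3))
```

The distant record at (10, 10) does not influence the estimate, which is the point of restricting the fill to the nearest neighbours.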

11
Nearest-Neighbour

60
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
13
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
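The binning steps can be sketched as follows, assuming the number of values divides evenly into the bins; the sample values are made up for illustration.

```python
def equal_frequency_bins(values: list, n_bins: int) -> list:
    """Sort the data and split it into n_bins bins of equal size."""
    data = sorted(values)
    if len(data) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    size = len(data) // n_bins
    return [data[i:i + size] for i in range(0, len(data), size)]


def smooth_by_means(bins: list) -> list:
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins: list) -> list:
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


if __name__ == "__main__":
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    bins = equal_frequency_bins(prices, 3)
    print(bins)
    print(smooth_by_means(bins))
    print(smooth_by_boundaries(bins))
```

Smoothing by means removes all variation within a bin, while smoothing by boundaries keeps each bin's range but snaps values to its extremes.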

14
How to Handle Noisy Data?
• Regression
– smooth by fitting the data into regression functions
– Linear regression involves finding “best” line to fit two attributes,
so that one attribute can be used to predict the other.
– Multiple Linear regression – more than two attributes involved
and data fit to a multidimensional surface
• Clustering
– detect and remove outliers
– Outliers – values outside of the set of clusters
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
Data Preprocessing
• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Data Integration
• Tuple Duplication
– The use of denormalized tables (improve performance by avoiding joins) creates
data redundancy
– Inconsistencies often arise between various duplicates, due to inaccurate data
entry
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units
– Hotel chain – price difference in currencies and services and taxes
– Attributes may differ on level of abstraction, e.g. total_sales – at branch level or
region level
Data Integration- Entity identification problem

• Data integration:
– Combines data from multiple sources into a coherent store
– Integrate metadata from different sources
• Entity identification problem:
– Schema integration and object matching: e.g., A.cust-id ≡ B.cust-#
– Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
– Metadata – name, meaning, data type, range, null rules
– Metadata can help avoid errors in schema integration
– Metadata may help transform the data
– When matching attributes from two databases, structure of data should be
checked
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
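As a rough illustration of detecting a redundant (derived) attribute by correlation analysis, the sketch below computes the Pearson correlation coefficient between attribute pairs. The attribute names, the toy values, and the 0.95 threshold are assumptions made for the example only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attributes from two integrated tables
monthly_revenue = [10.0, 12.0, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]   # derived attribute
employees = [5, 3, 8, 4, 6]

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, employees)

THRESHOLD = 0.95   # assumed cut-off for "probably redundant"
redundant = abs(r_derived) > THRESHOLD
```

A coefficient near +1 or -1 suggests one attribute can be derived from the other; a high correlation alone does not prove it, so flagged pairs still deserve a manual check.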
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Parametric - Regression and Log-Linear Models
• Non-parametric - Histograms, clustering, sampling
• Data cube aggregation
– Data compression
• Lossless - Reconstruction without any loss of information
• Lossy – reconstruct only an approximation of the original data
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space

(Figure: data plotted on axes x1 and x2)
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength. The principal components serve as new set of axes for the
data, giving important information on variance
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
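The steps above can be sketched for the first principal component only. This is a simplified illustration under stated assumptions: the data are centered but not rescaled, the toy points lie exactly on a line, and power iteration is swapped in for a full eigendecomposition of the covariance matrix (it finds only the strongest eigenvector and can fail if the start vector is orthogonal to it).

```python
import math

def center(rows):
    """Subtract the per-attribute mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
            for a in range(dims)]

def top_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    dims = len(matrix)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]   # toy data on y = 2x
centered = center(points)
cov = covariance_matrix(centered)
pc1 = top_eigenvector(cov)            # direction of largest variance
# One coordinate per point along pc1 replaces the two original attributes
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Because the toy points are perfectly collinear, the single score per point reconstructs the data exactly; with real data, dropping the weak components gives only an approximation.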
Principal Component Analysis
• Works for numeric data only
• PCA can be applied to ordered and unordered attributes and can
handle sparse and skewed data
• Multidimensional data (more than two dimensions) can be handled by reducing
the problem to two dimensions
• PCA handles sparse data better than wavelet transforms

Y1 and Y2 are first two principal components


Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
– Ex.: Log-linear models obtain the value at a point in n-dimensional
space as the product on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple Linear regression
– Allows a response variable Y to be modeled as a linear function
of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
– Consider each tuple as a point in an n-dimensional space
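To make the idea of parametric reduction concrete, the sketch below fits a straight line by least squares and then keeps only the two parameters instead of the data. The sample values are invented (roughly following y = x + 1) and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3, 4]
ys = [1.1, 1.9, 3.0, 4.1, 4.9]        # noisy observations, illustrative only

slope, intercept = fit_line(xs, ys)   # the only values that need to be stored

def predict(x):
    """Approximate the discarded data from the stored parameters."""
    return slope * x + intercept
```

Storing `slope` and `intercept` replaces five (x, y) pairs here; the saving grows with the data set, at the cost of losing whatever the line does not capture.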

Regression Analysis
• Regression analysis: A collective name for techniques for the modeling
and analysis of numerical data consisting of values of a dependent
variable (also called response variable or measurement) and of one or
more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares
method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships
(Figure: the line y = x + 1 on x and y axes, with the observed value Y1
and the fitted value Y1’ at X1)
Histogram Analysis
• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth): frequency of each bucket is constant
(Figure: histogram with x-axis from 10000 to 90000 and y-axis from 0 to 40)
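The two partitioning rules can be compared on a small, deliberately skewed list. This is a sketch with made-up values and hypothetical helper names; each bucket is reduced to a (count, average) pair, which is what would be stored in place of the raw data.

```python
def equal_width_buckets(values, num_buckets):
    """Split the value range into num_buckets intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # the maximum value falls on the last edge, so clamp it into the last bucket
        idx = min(int((v - low) / width), num_buckets - 1) if width else 0
        buckets[idx].append(v)
    return buckets

def equal_frequency_buckets(values, num_buckets):
    """Split the sorted values so each bucket holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
            for i in range(num_buckets)]

def summarize(buckets):
    """Keep only (count, average) per bucket; empty buckets get None."""
    return [(len(b), sum(b) / len(b) if b else None) for b in buckets]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one extreme value
width_summary = summarize(equal_width_buckets(values, 3))
freq_summary = summarize(equal_frequency_buckets(values, 3))
print(width_summary)  # [(9, 5.0), (0, None), (1, 100.0)]
print(freq_summary)   # [(3, 2.0), (3, 5.0), (4, 31.0)]
```

With the extreme value present, equal-width partitioning puts almost everything in one bucket, while equal-frequency partitioning keeps the counts balanced but lets the last bucket span a very wide range.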

Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
• There are many choices of clustering definitions and clustering
algorithms

Sampling

• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling
• Note: Sampling may not reduce database I/Os (page at a time)
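A minimal sketch of stratified sampling on skewed data, assuming a toy data set in which one class makes up 10% of the records. The function name, the record layout, and the rule of drawing at least one record per stratum are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every stratum separately."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))   # never skip a stratum
        sample.extend(rng.sample(group, size))
    return sample

# Skewed toy data: 90 "common" records and 10 "rare" ones
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
rare_in_sample = sum(1 for r in sample if r[0] == "rare")
```

A simple random sample of the same size could easily contain no rare records at all; sampling each stratum separately keeps every group represented in proportion.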

Data Reduction 3: Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered as
forms of data compression
Data Compression

(Figure: Original Data is turned into Compressed Data; lossless compression
recovers the Original Data, otherwise only Original Data Approximated)
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
• Methods
– Statistics: Descriptive and Distribution
– Smoothing: Remove noise from data – binning, regression, clustering
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, used in data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
Descriptive Statistics: Univariate
• Range, Min/Max
• Difference between minimum and maximum values in a data set
• Larger range usually (but not always) indicates a large spread or deviation in the
values of the data set.

• Average
• Sum of all values divided by the number of values in the data set.
• One measure of central location in the data set.

• Median
• The middle value in a sorted data set. Half the values are greater and half are less than
the median.

• Mode
• The most frequent occurring value.
• Another measure of central location in the data set.
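The four univariate measures above can be computed with Python's standard `statistics` module. The list of scores is invented for illustration.

```python
import statistics

scores = [45, 55, 62, 62, 68, 71, 75, 79, 83, 95]   # illustrative data

value_range = max(scores) - min(scores)   # difference between max and min
mean = statistics.mean(scores)            # sum of values / number of values
median = statistics.median(scores)        # middle of the sorted values
mode = statistics.mode(scores)            # most frequently occurring value

print(value_range, mean, median, mode)    # 50 69.5 69.5 62
```

With an even number of values, `statistics.median` averages the two middle values, which is why the median here is 69.5 rather than one of the scores.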
Distribution Statistics

• Variance
• One measure of dispersion (deviation from the mean) of a data set. The larger the
variance, the greater is the average deviation of each datum from the average value
• Standard Deviation
• The square root of the variance; roughly the typical deviation of a value from the
mean of a data set.

• Histograms and Normal Distribution

• Variance and SD are critical in analyzing your data distribution and
determining how “meaningful” the chosen average is
Distribution Statistics:
Normal and Skewed Distributions
• When data are skewed, the mean and SD can be misleading
• Skewness
sk = 3(mean − median)/SD
If |sk| > 1 then the distribution is non-symmetrical
• Negatively skewed
• Mean<Median
• Sk is negative
• Positively Skewed
• Mean>Median
• Sk is positive
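The skewness rule above can be checked on two small made-up lists. One assumption to note: the slide does not say whether SD is the population or the sample standard deviation; this sketch uses the population form (`statistics.pstdev`).

```python
import statistics

def skewness(values):
    """sk = 3 * (mean - median) / SD, using the population SD (an assumption)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)
    return 3 * (mean - median) / sd

right_tail = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value pulls the mean up
left_tail = [-v for v in right_tail]      # mirror image

sk_right = skewness(right_tail)   # positive: mean > median
sk_left = skewness(left_tail)     # negative: mean < median
```

Mirroring the data flips the sign of sk but not its magnitude, which matches the negative and positive cases listed above.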
Distribution Statistics:
Problems in reading distribution
• We can’t really tell much about this data set
• Even Min and Max are hard to see
(Figure: chart of Data Values, 0 to 120, against X-axis labels 1 to 20)

The data can be presented such that more statistical info can be
estimated from the chart (average, standard deviation).
Distribution Statistics:
Plotting the distribution
• Determine a frequency table (bins)
• A histogram is a column chart of the frequencies

Category Labels    Frequency
0-50               3
51-60              2
61-70              6
71-80              5
81-90              3
>90                1

(Figure: column chart of Frequency, 0 to 7, against Scores for the bins
0-50, 51-60, 61-70, 71-80, 81-90, >90)
Distribution Statistics: Histogram
• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data
• For small data sets, histograms can be misleading. Small changes in
the data or to the bucket boundaries can result in very different
histograms.
• For large data sets, histograms can be quite effective at illustrating
general properties of the distribution.

• Histograms effectively only work with 1 variable at a time


• Difficult to extend to 2 dimensions, not possible for >2
• So histograms tell us nothing about the relationships among variables
Normalization
• The measurement unit can affect the data analysis
• A smaller unit leads to a larger range of values and thus gives more
weight to an attribute
• Normalize data between [-1,1] or [0,1] to avoid dependence on
choice of measurement unit
• Min-max normalization: to [new_minA, new_maxA]
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then
$73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
– Min-max normalization preserves the relationships among the
original data values
Normalization
• Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
– Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
– Useful when the actual minimum and maximum values are unknown, or
when there are outliers that dominate the min-max normalization
– A variation replaces standard deviation by mean absolute deviation

• Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that Max(|v'|) < 1
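The three normalization methods can be sketched as small functions and checked against the income example from the slides. The function names and the two values used for decimal scaling are assumptions for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73600, 12000, 98000)   # about 0.716
income_z = z_score(73600, 54000, 16000)         # 1.225
scaled, j = decimal_scaling([-986, 917])        # hypothetical values, j = 3
```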
Discretization
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised (does not use class
information)
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-up
merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ2) analysis (supervised, bottom-up merge)
Discretization Without Using Class Labels
(Binning vs. Clustering)

(Figure, four panels: Data; Equal interval width (binning); Equal frequency
(binning); K-means clustering leads to better results)

Discretization by Histogram Analysis
• Histogram analysis is an unsupervised discretization technique as
it does not use class information
• Equal-width – values are partitioned into equal sized partitions
or ranges
• Equal frequency – values are partitioned so each partition
contains the same number of data tuples
• Histogram analysis algorithm can be applied recursively to each
partition to automatically generate multilevel concept hierarchy
• Histogram can be partitioned based on cluster analysis of the
data distribution
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
– Merge performed recursively, until a predefined stopping condition is met
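As an illustration of the supervised, top-down approach, the sketch below picks a single split point by minimizing the weighted class entropy of the two resulting intervals. It performs one split only (a full method would recurse on each interval), and the measurements and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one interval."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(values, labels):
    """Try every midpoint between distinct sorted values; keep the purest split."""
    pairs = sorted(zip(values, labels))
    best_point, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue   # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if weighted < best_entropy:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_entropy = weighted
    return best_point, best_entropy

# Hypothetical measurements with their class labels
sizes = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0]
classes = ["benign"] * 4 + ["cancerous"] * 3
split_point, split_entropy = best_split(sizes, classes)
```

Here the classes separate cleanly, so the chosen point (4.25) yields two pure intervals with zero entropy; on real data the best split usually leaves some impurity, and the recursion continues until a stopping condition is met.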

Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and
is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data at multiple granularities
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.

Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Specification of a set of attributes, but not of their partial ordering
– Concept hierarchy based on number of distinct values
– E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the
data set
– The attribute with the most distinct values is placed at the
lowest level of the hierarchy

country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
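The heuristic above amounts to sorting attributes by their number of distinct values. A minimal sketch, using the counts from this slide and a hypothetical function name:

```python
def hierarchy_from_counts(distinct_counts):
    """Order attributes from top level (fewest distinct values) to lowest level."""
    return sorted(distinct_counts, key=distinct_counts.get)

distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
levels = hierarchy_from_counts(distinct_counts)
print(" < ".join(reversed(levels)))   # street < city < province_or_state < country
```

The heuristic is only a starting point: an attribute such as a weekday has few distinct values yet does not belong above month or year, so generated hierarchies may need manual correction.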

