Lecture: Visualization Building Blocks (Marks and Channels)
DATA VISUALIZATION
SPRING 2019
Dr. Muhammad Faisal Cheema

COMSATS UNIVERSITY
GOALS FOR TODAY
• Learn the basic visual primitives of visualizations
(marks and channels)
• Understand how marks and channels are
assembled to make visualizations
• Learn which marks and channels are most
effective for a given task (“perceptual ordering”)
In-class exercise: Critique & Redesign
Describe: What do you see?
Analyze: How is the work organized? What are the visual encodings?
Task: What is the purpose of the visualization?
Decide: Is this a successful (effective) visualization?
1. Who is the intended audience?
2. What information does this visualization represent?
3. How many data dimensions does it encode?
4. List several tasks, comparisons or evaluations it enables
5. What principles of excellence best describe why it is good / bad?
6. Can you suggest any improvements?
7. Why do you like / dislike this visualization?
http://www.theatlantic.com/past/docs/images/issues/200709/win.jpg
MARKS & CHANNELS
Visualization Building Blocks
MARK = basic graphical element in an image
Points Lines Areas
Munzner, “Visualization Analysis and Design” (2014)

CHANNEL = way to control the appearance of marks,
independent of the dimensionality of the geometric primitive
Position (Horizontal, Vertical, Both)
Color
Shape
Tilt
Size (Length, Area, Volume)

[Figure sequence: example charts, each annotated with the number of attributes encoded (2, 3, or 4 with position), the mark type (points, lines, or areas), and the channels used (position, color, shape, tilt, size as length/area/volume, plus position in 3D space).]
Kindlmann (2004)
Marks as Items/Nodes: Points, Lines, Areas
Marks as Links: Containment, Connection
Channels: Position, Color, Shape, Tilt, Size (Length, Area, Volume)
How do I pick which marks or channels to use?
Bertin’s Semiology of Graphics
1. A, B, C are distinguishable
2. B is between A and C.
3. BC is twice as long as AB.
Encode quantitative variables
“Resemblance, order and proportion are the three signifieds in graphics.” - Bertin
Visual encoding variables
Position (x 2)
Size
Value
Texture
Color
Orientation
Shape
Characteristics of Visual Variables
Selective
Is a mark distinct from other marks?
Can we make out the difference between two marks?
Associative
Does it support grouping?
Quantitative
Can we quantify the difference between two marks?
Characteristics of Visual Variables
Order
Can we see a change in order?
Length
How many unique marks can we make?
Position
• Strongest visual variable
• Suitable for all data types
• Problems:
• Sometimes not available
• Cluttering
Size & Length
• Good visual variable
• Easy to see whether one is bigger
• Grouping works
• Judging differences
• Good for aligned bars (position)
• OK for changes in length
• Bad for changes in area
Shape
Great to recognize many classes.

No ordering.
Value
Good for quantitative data when length & size
are used.
Not very many shades recognizable
Supports grouping
Is preattentive (stands out) if sufficiently different
Color
Good for qualitative data
Limited number of classes!
Not good for quantitative data!
Is preattentive if sufficiently different.
Lots of pitfalls! Be careful!
Information in color and value
Value is perceived as ordered
Encode ordinal variables (O)
Encode continuous variables (Q) [not as well]
Hue is normally perceived as unordered

Encode nominal variables (N) using color
Bertin’s “Levels of Organization”
N = Nominal, O = Ordered, Q = Quantitative (Note: Q < O < N)
Position: N O Q
Size: N O Q
Value: N O Q
Texture: N O
Color: N
Orientation: N
Shape: N
“Ordering of Elemental Perceptual Tasks”
Cleveland & McGill (1984)

TASK: Which segment/bar is the maximum, and what is its percentage/value?

This is why pie charts are bad!
https://www.washingtonpost.com/news/wonk/wp/2013/06/17/the-usefulness-of-pie-charts-in-two-pie-charts/

Cleveland & McGill’s Results
[Figure: log error by encoding type, including positions (axis: Log Error, 1.0 to 3.0)]
Channel Ranking by Data Type
[Figure sequence: channel rankings for quantitative, ordinal, and categorical data, with area highlighted; Mackinlay (1986)]
Crowdsourced Results
[Figure: log error (1.0 to 3.0) for angles, circular areas, and rectangular areas (aligned or in a treemap); Heer & Bostock (2010)]
Expressiveness and Effectiveness
Effectiveness principle: the importance of the attribute should match the salience of the channel; that is, its noticeability.
(i.e., encode most important attributes with highest ranked channels)
Expressiveness principle: the visual encoding should express all of, and only, the information in the dataset attributes.
(i.e., data characteristics should match the channel)
Mackinlay (1986)
Channels: Expressiveness Types and Effectiveness Ranks
Magnitude Channels: Ordered Attributes (most to least effective)
1. Position on common scale
2. Position on unaligned scale
3. Length (1D size)
4. Tilt/angle
5. Area (2D size)
6. Depth (3D position)
7. Color luminance
8. Color saturation
9. Curvature
10. Volume (3D size)
Identity Channels: Categorical Attributes (most to least effective)
1. Spatial region
2. Color hue
3. Motion
4. Shape
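The effectiveness principle can be sketched as a greedy assignment: walk the attributes from most to least important and give each one the best channel still free for its type. This is only an illustration. The rankings are transcribed from the slide, while the function name `assign_channels`, the attribute names, and the "ordered"/"categorical" labels are assumptions. A real design would also let two attributes share position (horizontal and vertical), which this sketch ignores.

```python
# Channel rankings as listed on the slide (most effective first).
MAGNITUDE_CHANNELS = [
    "position on common scale",
    "position on unaligned scale",
    "length (1D size)",
    "tilt/angle",
    "area (2D size)",
    "depth (3D position)",
    "color luminance",
    "color saturation",
    "curvature",
    "volume (3D size)",
]
IDENTITY_CHANNELS = ["spatial region", "color hue", "motion", "shape"]


def assign_channels(attributes):
    """Greedily map attributes to channels (hypothetical helper).

    `attributes` is a list of (name, kind) pairs ordered from most to least
    important; kind is "ordered" (magnitude) or "categorical" (identity).
    """
    free = {
        "ordered": list(MAGNITUDE_CHANNELS),
        "categorical": list(IDENTITY_CHANNELS),
    }
    encoding = {}
    for name, kind in attributes:
        if not free[kind]:
            raise ValueError(f"no {kind} channel left for {name!r}")
        # Expressiveness: only channels of the matching type are considered.
        # Effectiveness: the highest-ranked free channel goes first.
        encoding[name] = free[kind].pop(0)
    return encoding


example = assign_channels([
    ("price", "ordered"),
    ("mileage", "ordered"),
    ("manufacturer", "categorical"),
])
print(example)
```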
Lecture: Visualization of Multidimensional Data
DATA VISUALIZATION
SPRING 2019

COMSATS UNIVERSITY
Recap
The Visualization Pipeline (InfoVis)
Raw data → Data tables → Visual structures → Visualization (views)
Steps between stages: data transformations, visual mappings, view transformations
The task and user interaction feed back into every step
Why are steps 2 and 3 highly correlated?
Data dictates visualization design!

How does data influence visualization design?
• Visualization of Multidimensional Data
• Visualization of High-Dimensional Data
• Visualization of Hierarchical Data (Trees)
• Visualization of Graph and Network Data
• Visualization of Spatial Data
• Visualization of Time Series Data
• Visualization of Textual Data (Text)
• Etc.
Multidimensions / Multivariate Data
Beyond Tables and Charts
Data sets of dimensions 1,2,3 are common
Number of variables per class
1 - Univariate data
2 - Bivariate data
3 - Trivariate data
>3 - Hypervariate/Multivariate data
Univariate Data
Representations
[Figure: univariate plot (label "Bill", axis 0 to 20) and a Tukey box plot marking low, middle 50%, high, and mean]
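The quantities a Tukey box plot draws can be computed directly. The sketch below is one common convention, not necessarily the one used in the slide's figure: quartiles from the standard library's inclusive method, whiskers at the most extreme points within 1.5 times the interquartile range, and the mean reported alongside. The function name and sample data are made up for illustration.

```python
import statistics


def tukey_summary(values):
    """Return the numbers needed to draw a Tukey box plot (illustrative)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": min(inside),    # "low"
        "q1": q1,                      # box start (middle 50%)
        "median": median,
        "q3": q3,                      # box end (middle 50%)
        "whisker_high": max(inside),   # "high"
        "mean": statistics.fmean(data),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


summary = tukey_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```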
Bivariate Data
Representations
Scatter plot is common (e.g., price vs. mileage)
Trivariate Data
Representations
3D scatter plot is possible (e.g., price, horsepower, mileage)
Trivariate
3D scatterplot, spin plot

2D plot + size (or color…)
Multi-Dimensional / Multivariate Data
Each attribute defines a dimension

Small # of dimensions easy
Data mapping
What about many dimensional data? (n-D)
What does 10-D space look like?
Map n-D space onto 2-D screen
Visual representations:
Complex glyphs
E.g. star glyphs, faces, embedded visualization, …
Multiple views of different dimensions
E.g. small multiples, plot matrices, brushing histograms, …
Non-orthogonal axes
E.g. Parallel coords, star coords, …
Tabular layout
E.g. TableLens, …
Interactions:
Dynamic Queries
Brushing & Linking
Selecting for details, …
Combinations (combine multiple techniques)
Glyphs
Glyphs: Chernoff Faces
Encode different variables’ values in characteristics of human face
Glyphs: Stars
[Figure: star glyph with axes d1 through d7]
Non-orthogonal axis
Parallel Coordinates (2D)
• Encode variables along a horizontal row
• Vertical line specifies values
Dataset in a Cartesian graph Same dataset in parallel coordinates

Parallel Coordinates (4D)
Forget about Cartesian orthogonal axes
[Figure: the point (0, 1, -1, 2) drawn as a polyline across parallel axes x, y, z, w]
Parallel Coordinates Example: Basic, Grayscale, Color
Parallel Coordinates
Visualize up to ~two dozen dimensions at once

1. Draw parallel axes for each variable
2. For each tuple, connect points on each axis
Between adjacent axes: line crossings imply neg.
correlation, shared slopes imply pos. correlation.
Full plot can be cluttered. Interactive selection
can be used to assess multivariate relationships.
Highly sensitive to axis scale and ordering.
Expertise required to use effectively!
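The claim that line crossings between adjacent axes signal negative correlation can be checked with a small count. Two tuples cross between a pair of axes exactly when their order on the left axis is reversed on the right axis. The sketch below assumes each axis is rescaled to [0, 1]; the function names and sample columns are hypothetical.

```python
from itertools import combinations


def normalize(column):
    """Rescale a column to [0, 1], as each parallel axis does."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.5 for v in column]


def count_crossings(left, right):
    """Count pairs of polylines that cross between two adjacent axes."""
    a, b = normalize(left), normalize(right)
    crossings = 0
    for i, j in combinations(range(len(a)), 2):
        # Opposite ordering on the two axes means the segments intersect.
        if (a[i] - a[j]) * (b[i] - b[j]) < 0:
            crossings += 1
    return crossings


mileage = [10, 20, 30, 40]
rising = [1, 2, 3, 4]    # positively correlated with mileage
falling = [8, 6, 4, 2]   # negatively correlated with mileage

crossings_rising = count_crossings(mileage, rising)
crossings_falling = count_crossings(mileage, falling)
print(crossings_rising, crossings_falling)
```

Because the count depends on which axes are adjacent, reordering the axes changes what is visible, which is one reason the plot is sensitive to axis ordering.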
Radar Plot / Star Graph
“Parallel” dimensions in polar coordinate space

Best if same units apply to each axis
Chord Diagram
Multiple views of different dimensions
Scatterplot Matrix (SPLOM)
Scatter plots for pairwise comparison of each data dimension.
Scatterplot Matrix
http://noppa5.pc.helsinki.fi/koe/3d3.html
Multiple Views
Give each variable its own display
    A  B  C  D  E
1   4  1  8  3  5
2   6  3  4  2  1
3   5  7  2  4  3
4   2  6  3  1  5
[Figure: one small display per variable A through E]
Small Multiples
Trellis Plots
A trellis plot subdivides space to enable comparison across multiple plots.
Typically nominal or ordinal variables are used as dimensions for subdivision.
Example Simple Plot
Trellis Plots
Tabular Layout
Table Lens
Table Lens
Idea: Make the text more visual and symbolic

Just leverage basic bar chart idea
Characteristics
Can sort on any attribute (row)
Focus on an attribute value (show only cases having that value) by double-clicking on it
Can type in queries on different attributes to limit what is presented. Note that this is the main contribution: dynamic control (selection/change/querying/filtering) of individual attributes.
Lecture: Visualizing Graphs and Networks
DATA VISUALIZATION
SPRING 2019

COMSATS UNIVERSITY
Why graphs and networks?
http://www.sci.utah.edu/~miriah/cs6630/lectures/L13-trees-graphs.pdf
Graph and Network Uses
• In Information Visualization, any number
of data sets can be modeled as a graph
– Telephone system
– World Wide Web
– Distribution network for on-line retailer
– Call graph of a large software system
– Semantic map in an AI algorithm
– Set of connected friends
– Social Networks
Graphs are more complicated than trees
Graph Terminology
• Graphs can have cycles

• Edges can be directed or undirected
• Degree of a vertex = # connected nodes
• In-degree and out-degree for directed graphs
• Graph edges can have values (weights)
• Nominal (N), ordinal (O), quantitative (Q)
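The degree definitions above can be made concrete with a few lines of code. This is a minimal sketch over an edge list; the function name `degrees` and the sample graph are illustrative, and a self-loop would count twice in the undirected case.

```python
from collections import defaultdict


def degrees(edges, directed=False):
    """Compute vertex degrees from a list of (source, target) edges.

    Undirected: returns {node: degree}.
    Directed: returns {node: (in_degree, out_degree)}.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(in_deg) | set(out_deg)
    if directed:
        return {n: (in_deg[n], out_deg[n]) for n in nodes}
    return {n: in_deg[n] + out_deg[n] for n in nodes}


sample_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
undirected_degrees = degrees(sample_edges)
directed_degrees = degrees(sample_edges, directed=True)
print(undirected_degrees)
print(directed_degrees)
```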
Graph Drawing considerations
Vertex Issues
• Shape
• Color
• Size
• Location
• Label
Edge Issues
• Color
• Size
• Label
• Form
• Polyline, straight line, orthogonal, grid,
curved, planar, upward/downward, ...
Edge Drawing Strategies
Label
Thickness
Color
Directed
Complexity Considerations
• Crossings: minimize towards planar
• Total Edge Length: minimize towards proper scale
• Area: minimize towards efficient use of space
• Maximum Edge Length: minimize longest edge
• Uniform Edge Lengths: minimize variances
• Total Bends: minimize orthogonal towards straight-line
Graph Visualization Problems
• Graph layout and positioning
– Make a concrete rendering of abstract graph
• Scale
– Not too much of a problem for small graphs, but
large ones are much tougher
• Navigation/Interaction
– How to support user changing focus and moving
around the graph
Graph Layout
• How to position the nodes and edges?
• Avoid clutter
• Maintain appropriate relations
The Hairball Problem
• How to position the nodes and
edges?
• Avoid clutter
• Maintain appropriate relations
Layout Types
• Grid Layout
– Put nodes on a grid
• Force Directed Layout
– Model graph as set of masses connected by
springs
• Planar Layout
– Detect part of graph that can be laid out without
edge crossings
• Attribute Based Layouts
Layout Subproblems
• Rank Assignment
– Compute which nodes have large degree, put
those at center of clusters
• Crossing Minimization
– Swap nodes to rearrange edges
• Subgraph Extraction
– Pull out cluster of nodes
• Planarization
– Pull out a set of nodes that can lay out on plane
Planar Layouts
Starting simple: planar 3-vertex-connected graphs (what?)
Tutte Embedding
• Each node should be the average of its neighbors
• Aside from the boundary, which is user-specified
• This gives a linear system
• Theorem: if the graph is planar and 3-vertex-connected, and the boundary is fixed as a convex polygon, the embedding is crossing-free
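The "average of its neighbors" condition can be sketched without a linear algebra library by repeatedly replacing each interior node with the mean of its neighbors (a Gauss-Seidel style iteration). This is a substitute for solving the linear system directly, chosen here only to keep the example dependency-free. The function name, the iteration count, and the sample graph are assumptions.

```python
def tutte_embedding(neighbors, boundary_positions, iterations=200):
    """Place interior nodes at the average of their neighbors.

    neighbors: dict node -> list of adjacent nodes.
    boundary_positions: dict node -> (x, y) for the user-specified boundary.
    """
    pos = {v: boundary_positions.get(v, (0.0, 0.0)) for v in neighbors}
    interior = [v for v in neighbors if v not in boundary_positions]
    for _ in range(iterations):
        for v in interior:
            xs = [pos[u][0] for u in neighbors[v]]
            ys = [pos[u][1] for u in neighbors[v]]
            pos[v] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return pos


# K4: a triangle boundary with one interior node connected to all corners.
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
}
boundary = {"a": (0.0, 0.0), "b": (3.0, 0.0), "c": (0.0, 3.0)}
embedding = tutte_embedding(graph, boundary)
print(embedding["d"])
```

In this example the single interior node lands on the centroid of the boundary triangle.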
Tutte Embedding
Downsides
Follows Tutte Embedding correctly, but visually cluttered
http://www.cs.arizona.edu/~kpavlou/Tutte_Embedding.pdf
SUGIYAMA-TYPE LAYOUT
- great for graphs that
have an intrinsic ordering
- ‘depth’ in graph mapped
to one axis
UNIX ancestry
SUGIYAMA STEP 1
- create layering of graph
- from domain specific knowledge
- longest path from root
- algorithmically determine best layering (NP-Hard)
- dummy nodes for long edges

[Figure: example layering with layers 1; 2 3 4; 5 6 7 8; 9 10 11]
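The "longest path from root" option for the layering step can be sketched as follows: a node's layer is one more than the deepest layer among its parents, and any edge that skips layers needs dummy nodes. The graph below is a made-up example, not the one in the figure, and both function names are hypothetical. The input is assumed to be acyclic.

```python
def longest_path_layers(successors):
    """Assign each node of a DAG to a layer by longest path from a root.

    successors: dict node -> list of child nodes.
    """
    nodes = set(successors) | {c for cs in successors.values() for c in cs}
    parents = {n: [] for n in nodes}
    for u, children in successors.items():
        for c in children:
            parents[c].append(u)

    layer = {}

    def depth(n):
        # Roots (no parents) get layer 0.
        if n not in layer:
            layer[n] = 1 + max((depth(p) for p in parents[n]), default=-1)
        return layer[n]

    for n in nodes:
        depth(n)
    return layer


def dummy_nodes_needed(successors, layer):
    """Count dummy nodes: one per layer that an edge skips over."""
    return sum(
        layer[c] - layer[u] - 1
        for u, children in successors.items()
        for c in children
    )


dag = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}
layers = longest_path_layers(dag)
dummies = dummy_nodes_needed(dag, layers)
print(layers, dummies)
```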
SUGIYAMA STEP 2
- minimize crossings layer by layer (NP-hard)
- numerous heuristics available
[Figure: layers reordered to 2 3 4; 8 6 5 7; 11 9 10]
SUGIYAMA STEP 3
- final assignment of x-coordinates
- routing of edges
[Figure: final layout of the reordered layers 2 3 4; 8 6 5 7; 11 9 10]
Gansner 1993
SUGIYAMA
+ nice, readable top down flow
+ relatively fast (depending on heuristic
used for crossing minimization)
- not really suitable for graphs that don’t have an intrinsic top down structure
- hard to implement
Force directed layouts
FORCE DIRECTED LAYOUT
- no intrinsic layering, now what?
- physics model
- edges = springs
- nodes = repulsive particles

Force-directed Layouts
• We want edges to be neither too short nor too long
• Physical analogy: Springs compress or expand to achieve ideal length
• We don’t want vertices to bunch up together
• Physical analogy: Electric charges with the same sign don’t bunch up
AESTHETIC RESULTS
[Figure: high school dating network]
FORCE MODEL
- many variations, but usually physical analogy:
- repulsion: $f_R(d) = C_R \, m_1 m_2 / d^2$
- $m_1$, $m_2$ are node masses
- $d$ is distance between nodes
- attraction: $f_A(d) = C_A \, (d - L)$
- $L$ is the rest length of the spring
- i.e. Hooke’s Law
- total force on a node $x$ with position $x'$:
- $\sum_{y \in \mathrm{neighbors}(x)} f_A(\lVert x' - y' \rVert)\,(x' - y') + -f_R(\lVert x' - y' \rVert)\,(x' - y')$
ALGORITHM
- start from random layout
- (global) loop:
- for every node pair compute repulsive force
- for every edge compute attractive force
- accumulate forces per node
- update each node position in direction of accumulated force
- stop when layout is ‘good enough’
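The loop above can be sketched in plain Python. Several choices here are assumptions rather than part of the slides: all node masses are 1, the constants are arbitrary, "good enough" is replaced by a fixed iteration count, and each move is capped (`max_move`) so that nearly coincident nodes do not fly apart. Repulsion is applied to every node pair and pushes nodes apart. Attraction is applied along edges and pulls endpoints together when the edge is longer than the rest length.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=300, c_rep=1.0, c_att=0.1,
                          rest_length=1.0, step=0.05, max_move=0.5, seed=0):
    """Spring/charge layout sketch; `nodes` is a list, `edges` a list of pairs."""
    rng = random.Random(seed)
    # Start from a random layout.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Repulsive force for every node pair: f_R(d) = C_R / d^2 (unit masses).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = c_rep / d ** 2
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
                force[b][0] -= f * dx / d
                force[b][1] -= f * dy / d

        # Attractive force for every edge: f_A(d) = C_A * (d - L).
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = c_att * (d - rest_length)
            force[a][0] -= f * dx / d
            force[a][1] -= f * dy / d
            force[b][0] += f * dx / d
            force[b][1] += f * dy / d

        # Move each node along its accumulated force, with a capped step.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            if mag == 0.0:
                continue
            move = min(step * mag, max_move)
            pos[n][0] += move * fx / mag
            pos[n][1] += move * fy / mag
    return pos


layout = force_directed_layout(["a", "b"], [("a", "b")], iterations=500)
edge_length = math.hypot(layout["a"][0] - layout["b"][0],
                         layout["a"][1] - layout["b"][1])
print(edge_length)
```

For a single edge the layout settles where repulsion and spring attraction balance, so the final edge is somewhat longer than the rest length.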
FORCE DIRECTED LAYOUTS
+ very flexible, aesthetic layouts on many types of graphs
+ can add custom forces
+ relatively easy to implement
- repulsion loop is O(n^2) per iteration
- can speed up to O(n log n) using quadtree or k-d tree
- prone to local minima
- can use simulated annealing
mentionmap
Grid Layouts
Adjacency Matrix
Alternate to node-link diagram: adjacency matrix
Slide by Frank van Ham
Adjacency Matrix
Henry & Fekete (2006)
Adjacency Matrix
• Change network to tabular data and use a matrix representation
• Derived data: nodes are keys, edges are boolean values
• Task: lookup connections, find well-connected clusters
• Scalability: millions of edges
• Can encode edge weight, too
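Deriving the matrix from an edge list is straightforward. The sketch below assumes an undirected graph, so the matrix is symmetric, and optionally stores an edge weight instead of a boolean. The function name and sample data are illustrative.

```python
def adjacency_matrix(nodes, edges, weighted=False):
    """Build a symmetric adjacency matrix for an undirected graph.

    nodes: list of node keys; their order fixes the row/column order.
    edges: list of (u, v) or (u, v, weight) tuples.
    """
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for u, v, *rest in edges:
        value = rest[0] if (weighted and rest) else 1
        matrix[index[u]][index[v]] = value
        matrix[index[v]][index[u]] = value
    return matrix


node_order = ["a", "b", "c", "d"]
edge_list = [("a", "b", 2), ("a", "c", 5), ("b", "c", 1), ("c", "d", 4)]

boolean_matrix = adjacency_matrix(node_order, edge_list)
weight_matrix = adjacency_matrix(node_order, edge_list, weighted=True)
for row in boolean_matrix:
    print(row)
```

The clique a, b, c shows up as a filled block in the top-left corner only because those nodes are adjacent in `node_order`; a different row order would scatter it.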
Cliques in Adjacency Matrices
Node link and Adjacency Matrix
(can co-exist together)
[McGuffin]
Adjacency Matrix
Pros:
• great for dense graphs
• visually scalable
• can spot clusters
Cons:
• row order affects what you can see
• abstract visualization
• hard to follow (multilink) paths
Node-Link or Adjacency Matrix?
• Empirical study: For most tasks, node-link is better for small graphs and adjacency matrix better for large graphs
• Immediate connectivity or neighbors are ok, estimating size (nodes & edges) also ok
• People tend to be more familiar with node-link diagrams
https://bost.ocks.org/mike/miserables/
Attribute driven layouts
ATTRIBUTE-DRIVEN LAYOUT
- large node-link diagrams get messy!
- are there additional structures we can exploit?
- idea: use data attributes to perform layout
- e.g., scatterplot based on node values
- dynamic queries and/or brushing can be used to enhance perception of connectivity
[Figures: cerebral; metabolic network with nodes GLU, G6P, R5P, G3P, PYR, CIT and pathways GLYCOLYSIS, PPP, TCA (color scale 0 to 1.00)]
OTHER NODE LINK LAYOUTS
- orthogonal
- great for UML diagrams
- algorithmically complex
- circular layouts
- emphasizes ring topologies
- used in social network diagrams
- nested layouts
- recursively apply layout algorithms
- great for graphs with hierarchical structure
More node link layouts - Arc Diagram
Lecture: Interaction Techniques in Visualization
DATA VISUALIZATION
SPRING 2019

COMSATS UNIVERSITY
Fundamental idea
• Interpret the state of elements in the UI as a clause in a query. As the UI changes, update the displayed data
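One way to read this idea in code: each widget contributes one clause, and the visible data is whatever satisfies all clauses. The widget names (`price_range`, `selected_makes`), the data, and the helper functions below are hypothetical, chosen only to illustrate the pattern.

```python
def build_query(ui_state):
    """Turn the current UI state into a row predicate (one clause per widget)."""
    clauses = []

    # Range slider -> range clause.
    lo, hi = ui_state["price_range"]
    clauses.append(lambda row: lo <= row["price"] <= hi)

    # Checkbox group -> membership clause (no boxes ticked means no filter).
    if ui_state["selected_makes"]:
        allowed = set(ui_state["selected_makes"])
        clauses.append(lambda row: row["make"] in allowed)

    return lambda row: all(clause(row) for clause in clauses)


def visible_rows(rows, ui_state):
    """Re-run the query; call this whenever a widget changes."""
    query = build_query(ui_state)
    return [row for row in rows if query(row)]


cars = [
    {"make": "A", "price": 12000},
    {"make": "B", "price": 18000},
    {"make": "C", "price": 15000},
    {"make": "A", "price": 30000},
]
state = {"price_range": (10000, 20000), "selected_makes": ["A", "B"]}
shown = visible_rows(cars, state)
print(shown)
```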
Interaction Summary
Change over Time
Select
Navigate
- Item Reduction: Zoom (Geometric or Semantic), Pan/Translate, Constrained
- Attribute Reduction: Slice, Cut, Project
[Munzner (ill. Maguire), 2014]

Selection
Selection
• Selection is often used to initiate other changes
• User needs to select something to drive the next change
• What can be a selection target?
- Items, links, attributes, (views)
• How?
- mouse click, mouse hover, touch
- keyboard modifiers, right/left mouse click, force
• Selection modes:
- Single, multiple
- Contiguous?
Highlighting
Highlighting
• Selection is the user action
• Feedback is important!
• How? Change selected item's visual encoding
- Change color: want to achieve visual popout
- Add outline mark: allows original color to be preserved
- Change size (line width)
- Add motion: marching ants
Highlighting
[http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html]
Navigation
Navigation
• Fix the layout of all visual elements but provide methods for the
viewpoint to change
• Camera analogy: only certain features visible in a frame
- Zooming
- Panning (aka scrolling)
- Translating
- Rotating (rare in 2D, important in 3D)
Panning
https://www.google.com/finance?q=INDEXFTSE
Zooming
https://www.google.com/finance?q=INDEXFTSE
“Geometric” Zooming vs. “Semantic” Zooming
Geometric Zooming: just like a camera
Semantic Zooming: visual appearance of objects can change at different scales
http://bl.ocks.org/mbostock/3680957
Multiple Views
Multiple Views
• Facet (noun and verb)
- particular aspect or feature of something
- to split
• Partition visualization into views/layers
- Either juxtapose or superimpose
- Depends on data and encoding
Multiple Views
Juxtapose and Coordinate Multiple Side-by-Side Views
Linked Highlighting
Share Data: All/Subset/None
Share Navigation
Multiple Views
Share Data: All | Subset | None
Same encoding: Redundant | Overview/Detail | Small Multiples
Multiform encoding: Multiform | Multiform, Overview/Detail | No Linkage
Multiple Views
Partition into Side-by-Side Views
Superimpose Layers
Multiform Views
• The same data visualized in different ways
• Does not need to be a totally different encoding (all choices need not be disjoint), e.g. horizontal positions could be the same
• One view becomes cluttered with too many attributes
• Consumes more screen space
• Allows greater separability between channels
Multiple Views
Example of Facets
Small Multiples
• Same encoding, but different data in each view (e.g. SPLOM)
[Figure: scatterplot matrix over sepal length, sepal width, petal length, and petal width]
[http://bl.ocks.org/mbostock/4063663]
Brushing
[Figure: brushing in a scatterplot matrix over sepal length, sepal width, petal length, and petal width]
[http://bl.ocks.org/mbostock/4063663]
Multiple Views
Example of Partitioned Views
Overview-Detail View
[Wikipedia]
Partitioned View
[Figure: grouped bar chart of population (0.0 to 10M) by age group, from Under 5 Years to 65 Years and Over, for CA, TX, NY, FL, IL, PA]
[M. Bostock, http://bl.ocks.org/mbostock/3887051]
Matrix Alignment
[Becker et al., 1996]

Recursive Subdivision
[Figure: recursive subdivision by London borough (Harrow, Barnet, Enfield, ..., Bromley), each cell split by housing type (Flat, Ter, Semi, Det)]
[Slingsby et al., 2009]
Multiple Views
Example of Superimposed layers
Superimposed Line Charts
[Figure: temperature (ºF, 20 to 80) from October through September for Austin, New York, and San Francisco]
[M. Bostock, http://bl.ocks.org/mbostock/3884955]
Focus + Context
Focus and Context
Provides detailed view of a subset within context of the full dataset
Why? For large or complex data, a single view of the entire dataset cannot capture fine details
Brush & Link
Brushing & Linking: multiple views that are simultaneously visible and linked together such that actions in one view affect the others
Primary strategy: highlighting
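Linked highlighting can be sketched as two steps: a brush in one view produces a set of selected item indices, and every other view re-styles its marks from that shared set. The field names, data, and helper functions here are hypothetical.

```python
def brush(rows, x_key, y_key, x_range, y_range):
    """Return the indices of rows that fall inside a rectangular brush."""
    return {
        i for i, row in enumerate(rows)
        if x_range[0] <= row[x_key] <= x_range[1]
        and y_range[0] <= row[y_key] <= y_range[1]
    }


def linked_styles(rows, selected):
    """Per-item style that any linked view can apply to its own marks."""
    return ["highlight" if i in selected else "dim" for i in range(len(rows))]


flowers = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0},
]

# Brush in the (sepal_length, sepal_width) view...
selection = brush(flowers, "sepal_length", "sepal_width", (6.0, 7.5), (3.0, 3.4))
# ...and reuse the same selection to style marks in every other view.
styles = linked_styles(flowers, selection)
print(selection, styles)
```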
Linked Highlighting
Zoom Techniques
Provides detailed view of a subset within context of the full dataset
Focus+Context
• Show everything at once but compress regions that are not the
current focus
- User shouldn't lose sight of the overall picture
- May involve some aggregation in non-focused regions
- "Nonliteral navigation" like semantic zooming
• Elision
• Superimposition: more directly tied than with layers
• Distortion
Focus+Context
Overview
Reduce: Filter, Aggregate, Embed
Embed: Elide Data, Superimpose Layer, Distort Geometry
Focus + Context
Elision
Elision
• There are a number of examples of elision, including in text, DOITrees, …
• Includes both filtering and aggregation but goal is to give overall view of the data
• In visualization, usually correlated with focus regions
Elision: DOITrees
[Heer and Card, 2004]

Focus + Context
Superimposed layers
Superimposition with Interactive Lenses
(a) Alteration (b) Suppression
[ChronoLenses and Sampling Lens in Tominski et al., 2014]
Superimposition with Interactive Lenses
(c) Enrichment
[Extended Lens in Tominski et al., 2014]
Focus + Context
Distortion
Distortion: Fisheye Lens
[M. Bostock, http://bost.ocks.org/mike/fisheye/]
Fisheye Lens
Leung 1994
http://www.cs.umd.edu/class/fall2002/cmsc838s/tichi/fisheye.html
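The slides do not give a distortion formula. One widely used choice, assumed here, is the graphical fisheye function g(t) = (d + 1) t / (d t + 1), where t is the normalized distance from the focus and d is the distortion factor. The one-dimensional sketch below applies it between the focus and the view boundary; the function and parameter names are illustrative.

```python
def fisheye_1d(x, focus, lo, hi, distortion=3.0):
    """Map coordinate x in [lo, hi] through a 1D fisheye centered on focus.

    Points near the focus are spread apart, points near the boundary are
    compressed, and the focus and both boundaries stay fixed.
    """
    if x == focus:
        return x
    boundary = hi if x > focus else lo
    span = boundary - focus
    if span == 0:
        return x
    t = (x - focus) / span  # normalized distance from focus, in (0, 1]
    g = (distortion + 1) * t / (distortion * t + 1)
    return focus + g * span


magnified = fisheye_1d(0.6, focus=0.5, lo=0.0, hi=1.0)
at_focus = fisheye_1d(0.5, focus=0.5, lo=0.0, hi=1.0)
at_edge = fisheye_1d(1.0, focus=0.5, lo=0.0, hi=1.0)
print(magnified, at_focus, at_edge)
```

Applying the same mapping independently to x and y gives a rectangular lens; applying it to the radial distance from the focus gives a radial one.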
Distortion Choices
• How many focus regions?
- One
- Multiple
• Shape of the focus?
- Radial
- Rectangular
- Other
• Extent of the focus
- Constrained similar to magic lenses
- Entire view changes
• Type of interaction:
- Geometric, moveable lenses, rubber sheet
Examples of Focus + Context
Stretch and Squish Navigation
[McLachlan et al., 2008]
Focus+Context in Graph Exploration
(a) Bring (step 1): Selecting a node fades out all graph elements but the node neighborhood.
(b) Bring (step 2): Neighbor nodes are pulled close to the selected node.
(c) Go: After selecting a neighbor (the green node in Fig. 4(b)), a short animation brings the focus towards a new neighborhood.
Lecture: Color in Visualization Design
DATA VISUALIZATION
SPRING 2019

COMSATS UNIVERSITY
PERCEPTION
CONES & RODS
https://askabiologist.asu.edu/sites/default/files/resources/articles/seecolor/Light-though-eye-big.png
CONES & RODS
http://i.stack.imgur.com/wIbcE.jpg
http://thebrain.mcgill.ca/flash/a/a_02/a_02_m/a_02_m_vis/a_02_m_vis.html
CONES & RODS
Rods: 120 million
Cones: 5-6 million
This is why luminance (brightness) is a more effective encoding channel!
Cones:
64% red-sensitive
32% green-sensitive
2% blue-sensitive
This is why we are so sensitive to red!
http://arthistoryresources.net/visual-experience-2014/visual-experience-2014-images/red-green-blue-wavelengths+rods-big.jpg
PERIPHERAL VISION
https://en.wikipedia.org/wiki/Peripheral_vision
LOW-LEVEL FEATURE ANALYSIS
Shape
Color
Motion
Ware, VTFD
Use these “popout” effects to help design effective visualizations!
(E.g., draw viewer’s attention to main points, effective redundant encodings, etc.)
Ware, VTFD
(A LITTLE MORE) PERCEPTION
“Get it right in black and white.”
-Maureen Stone
https://research.tableau.com/user/maureen-stone
Luminance
Luminance = the amount of visible light that comes to the eye from a surface
Lightness = the perceived intensity of reflected light (reflectance) from a surface
Brightness = the perceived intensity of emitted light
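A practical way to "get it right in black and white" is to compute how light each color is before relying on hue. The sketch below uses the standard relative-luminance formula for sRGB colors (linearize each channel, then weight red, green, and blue). That formula is brought in here as an assumption, since the slides define luminance only in words, and it is a physical measure rather than perceived lightness.

```python
def srgb_to_linear(channel):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    if channel <= 0.04045:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4


def relative_luminance(r, g, b):
    """Relative luminance in [0, 1] of an sRGB color given as 0-255 integers."""
    rl, gl, bl = (srgb_to_linear(v / 255) for v in (r, g, b))
    return 0.2126 * rl + 0.7152 * gl + 0.0722 * bl


lum_white = relative_luminance(255, 255, 255)
lum_black = relative_luminance(0, 0, 0)
lum_red = relative_luminance(255, 0, 0)
lum_green = relative_luminance(0, 255, 0)
lum_blue = relative_luminance(0, 0, 255)
print(lum_red, lum_green, lum_blue)
```

Pure red, green, and blue have very different luminance, so a palette that varies only in hue can still vary a lot in grayscale, and two distinct hues can collapse to the same gray.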
Lightness Constancy
The perception that the apparent brightness of light and dark surfaces remains more or less the same under different luminance conditions is called lightness constancy.
“Simultaneous Contrast”
Avoid gradients as backgrounds or bars!
Luminance Channel Summary
• No edges without lightness difference
• No shading without lightness variation
• Has higher spatial sensitivity than color channels
• Contrast defines legibility, attention, layering
• Controlling luminance is primary rule of design
COLOR
Why color…?
• Color for labeling and annotation
• Color for measuring (encoding sequential data)
• Color for encoding categories
• Color to encode meaning (conventions, representation)
• Color as beauty (aesthetics)
Why color…?
Functions of color:
Identify, Group, Layer, Highlight
Ware, “Information Visualization: Perception for Design”

“… avoiding catastrophe becomes the first principle in bringing color to information: above all, do no harm.”
-Edward Tufte
Tufte, “Envisioning Information”
CONES & RODS: Red, Green, Blue
trichromacy = possessing three independent channels for conveying color information
https://askabiologist.asu.edu/sites/default/files/resources/articles/seecolor/Light-though-eye-big.png
CONES & RODS - COLOR PERCEPTION
opponent-process model: visual system detects differences between the response of cones
3 opponent channels:
black vs. white (Luminance)
▸ combination of R & G
red vs. green
▸ difference between R & G
blue vs. yellow
▸ difference between L & B
NOTE: opposite colors are never perceived together (no reddish green or bluish yellow)
Color Constancy
Be careful with bars and scatter plot points - the colors may appear different with different background colors and neighboring colors!
Be aware that colors in legends may appear different than on the plot!
Small Area Effects
“Bezold Spreading Effect”
Small Area Effects
Be careful with colors in scatter plots!
Be aware of color changes when adding borders around bars and plots!
Be aware that colors in legends may appear different than on the plot!
Which area is larger (green or red)?
Areas are equal(!).
Cleveland & McGill, “A Color-Caused Optical Illusion on a Statistical Graph”, 1983

Color Vocabulary Summary
Luminance
Saturation
Hue
Color Deficiencies (Color Blindness)
Person with faulty cones (or faulty pathways):
Protanope = faulty red cones
Deuteranope = faulty green cones
Tritanope = faulty blue cones
Those with deuteranopia (red/green color blindness) will have difficulty seeing the numbers.
http://www.vischeck.com/vischeck/vischeckImage.php
https://www.nytimes.com/interactive/2018/02/06/climate/flood-toxic-chemicals.html
Primary Colors?
• Red, Green, and Blue
• Red, Yellow, and Blue
• Orange, Green, and Violet
• Cyan, Magenta, and Yellow
• All of the above!
Color Addition and Subtraction
Color Spaces and Gamuts
[http://dot-color.com/2012/08/14/color-space-confusion/]
Color Spaces and Gamuts
• Color space: the organization of all colors in space
- Often human-specific, what we can see (e.g. CIELAB)
• Color gamut: a subset of colors
- Defined by corners in the color space
- What can be produced on a monitor (e.g. using RGB)
- What can be produced on a printer (e.g. using CMYK)
- The gamut of your monitor != the gamut of someone else's != the gamut of a printer
Color Models
• A color model is a representation of color using some basis
• RGB uses three numbers (red, green, blue) to represent color
• Color space ~ color model, but there can be many color models
used in the same color space (e.g. OGV)
• Hue-Saturation-Lightness (HSL) is more intuitive and useful
- Hue captures pure colors
- Saturation captures the amount of white mixed with the color
- Lightness captures the amount of black mixed with a color
- HSL color pickers are often circular
• Hue-Saturation-Value (HSV) is similar (swap black with gray for the
final value), linearly related
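As a rough illustration of the color models above, the sketch below converts one RGB color to HSL and HSV with Python's standard `colorsys` module. The sample color is an arbitrary choice, not taken from the lecture; note that `colorsys` names the model HLS and returns lightness before saturation.

```python
import colorsys

# An arbitrary sample color: pure red, with channels scaled to [0, 1].
r, g, b = 1.0, 0.0, 0.0

# colorsys returns hue, lightness, saturation (HLS order).
h, l, s = colorsys.rgb_to_hls(r, g, b)
hsl = (h, s, l)  # reordered to hue, saturation, lightness

# HSV keeps the same hue but uses value instead of lightness.
hsv = colorsys.rgb_to_hsv(r, g, b)

print("HSL:", hsl)
print("HSV:", hsv)
```

For pure red the hue is the same in both models, while lightness (0.5) and value (1.0) differ.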
Color Maps
Color Map = mapping between color and value
http://matplotlib.org/mpl_examples/color/colormaps_reference_05.png 57
Rainbow Color Map
Why this color map is a poor choice...
• No perceptual ordering (confusing)
• No luminance variation (obscures details)
• Viewers perceive sharp transitions in color as sharp
transitions in the data, even when this is not the case
(misleading)
Borland & Russell (2007) 59
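One way to see the ordering problem is to estimate luminance along a rainbow-style ramp and along a grayscale ramp. This is a minimal sketch: the hue sweep stands in for a rainbow color map, and the Rec. 709 weights are applied directly to the RGB values (gamma is ignored), so the numbers are only approximate.

```python
import colorsys

def approx_luminance(rgb):
    """Rough luminance estimate using Rec. 709 weights (gamma is ignored)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Rainbow-style ramp: sweep hue from red to blue at full saturation.
rainbow = [colorsys.hsv_to_rgb(k / 6, 1.0, 1.0) for k in range(5)]
# Grayscale ramp: luminance is the only thing that changes.
gray = [(k / 4, k / 4, k / 4) for k in range(5)]

rainbow_lum = [approx_luminance(c) for c in rainbow]
gray_lum = [approx_luminance(c) for c in gray]

print("rainbow monotonic:", is_monotonic(rainbow_lum))
print("gray monotonic:", is_monotonic(gray_lum))
```

Under these assumptions the estimated luminance of the hue sweep rises and falls (yellow is much lighter than red or blue), so it gives the viewer no consistent low-to-high cue, while the grayscale ramp does.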

Artifacts from Rainbow Colormaps
Colormap
• A colormap specifies a mapping between colors and data values

• Colormap should follow the expressiveness principle
• Types of colormaps:
Binary Categorical
Diverging Sequential
Categorical vs. Ordered
• Hue has no implicit ordering: use for categorical data

• Saturation and luminance do: use for ordered data
Luminance
Saturation
Hue
[Munzner (ill. Maguire), 2014]

Color Maps
THREE MAIN TYPES:
Categorical: Does not imply magnitude differences (categorical/nominal data).
  Distinct hues with similar emphasis.
Sequential: Best for ordered data that progresses from low to high (ordinal, quantitative data).
  Luminosity channel effectively employed.
Diverging: For data with a “diverging” (mid) point (quantitative data).
  Equal emphasis on mid-range critical values and extremes at both ends of the data range.
Brewer, Cynthia A. 1994. Color use guidelines for mapping and visualization. Chapter 7 (pp. 123-147) in Visualization in Modern Cartography
Color Maps
ALSO...
Bivariate: Displays two variables
Combination of two sequential color schemes
These are very difficult to design effectively, make intelligible, and make color-blind friendly.
http://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/
Categorical Colormap Guidelines
• Don't use too many colors (~12)

• Remember your background has a color, too
• Nameable colors help
• Be aware of luminance (e.g. difference between blue and yellow)
• Think about other marks you might wish to use in the visualization
Categorical Colormaps
[colorbrewer2.org]
Number of distinguishable colors?
[Sinha & Meller, 2007]

13
Discriminability
• Often, fewer colors are better
• Don't let viewers combine colors because they can't tell the
difference
• Make the combinations yourself
• Also, can use the "Other" category to reduce the number of colors
Ordered Colormaps
• Used for ordinal or quantitative attributes
• [0, N]: Sequential
• [-N, 0, N]: Diverging (has some meaningful midpoint)
• Can use hue, saturation, and luminance
• Remember hue is not a magnitude channel so be careful
• Can be continuous (smooth) or segmented (sharp boundaries)
- Segmented matches with ordinal attributes
- Can be used with quantitative data, too.
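The sequential, diverging, and segmented cases can be sketched with simple linear interpolation between endpoint colors. The endpoint colors and function names below are hypothetical choices for illustration only; in practice a designed scale (for example from Color Brewer) would be a better starting point.

```python
def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB triples for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))

# Hypothetical endpoint colors, chosen only for illustration.
WHITE, DARK_BLUE, DARK_RED = (255, 255, 255), (8, 48, 107), (103, 0, 13)

def sequential(value, n):
    """Map a value in [0, N] onto a light-to-dark ramp."""
    return lerp_color(WHITE, DARK_BLUE, value / n)

def diverging(value, n):
    """Map a value in [-N, N] onto two ramps that meet at a neutral midpoint."""
    if value < 0:
        return lerp_color(WHITE, DARK_RED, -value / n)
    return lerp_color(WHITE, DARK_BLUE, value / n)

def segmented(value, n, steps):
    """Snap a value in [0, N] to one of `steps` discrete colors (steps >= 2)."""
    index = min(int(value / n * steps), steps - 1)
    return lerp_color(WHITE, DARK_BLUE, index / (steps - 1))

print(sequential(5, 10), diverging(-5, 5), segmented(3.9, 10, 5))
```

Both ramps here vary mainly in lightness, in line with the reminder that hue is not a magnitude channel; the diverging version uses hue only to tell the two sides of the midpoint apart.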
Continuous Colormap
Sequential Colormap
Color Maps
[Figure: the same data shown with a sequential color map (wrong!), a diverging color map, and a sequential rainbow color map (wrong!)]
https://www.research.ibm.com/people/l/lloydt/color/color.HTM
Color Brewer
http://colorbrewer2.org/ 70
Colorgorical
http://vrl.cs.brown.edu/color 71
Color Advice Summary
Use a limited hue palette
• Control color “pop out” with low-saturation colors
• Avoid clutter from too many competing colors
Use neutral backgrounds
• Control impact of color
• Minimize simultaneous contrast
Use Color Brewer for scales
Don’t forget aesthetics!
Based on slides by Hanspeter Pfister, Maureen Stone
Color Design Rules
Wang, et al., “Color Design for Illustrative Visualization” (2008)
R1: Vivid colors (bright, saturated colors) stand out. They guide attention to a particular feature, generating the pop-out effect.
R2: An excessive amount of vivid colors is perceived as unpleasant and overwhelming; use them between duller background tones.
R3: Foreground-background separation works best if the foreground color is bright and highly saturated, while the background is de-saturated.
R4: Colors can be better discriminated if they differ simultaneously in hue, saturation and lightness.
R5: The low end lightness steps should be very small, while the high end requires larger steps (Weber’s Law).
R6: Discrimination is poorer for small objects. Hue, saturation and lightness discrimination all decrease.
R7: Complementary (opponent) colors are located opposite on the color wheel and have the highest chromatic contrast. When mixing opponent colors they may cancel each other, giving neutral grey.
R8: Some hues appear inherently more saturated than others. Yellow has the least number of perceived saturation steps (10). For hues on both sides of yellow, the saturation steps increase linearly.
R9: An opposite effect of R8 is that the brightest lights fall in the yellow range, while blues, violets (purples) and reds are least bright.
R10: For labeling, apart from black, white, grey, there are 4 primary colors (red, green, blue, yellow) and 4 secondary colors (brown, orange, purple, pink). Also, the number of color labels should be ≤ 6-7.
R11: Warm colors (red, orange, yellow) excite emotions, grab attention. Cold colors (green to violet) create openness and distance.
R12: Important for hue-based labeling is the fact that increasing the lightness (and saturation) does not change the perceived hue.
R13: Also important for labeling is that objects of similar hue are perceived as a group, while objects of different hues are perceived as belonging to different groupings.
Color Design Rules
Weber’s Law
Our ability to detect a difference between two objects with a certain attribute is related to the percent difference in the attribute, not the absolute difference.
$$\frac{\Delta S}{S} = \text{constant} \approx 1\%$$
where S is the initial stimulus (background intensity) and ΔS the difference between stimuli (“just noticeable difference”).
*Ratios are more important than magnitude differences.
We tend to perceive discrete steps in continuous variations in magnitude.
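A small numeric sketch of Weber's Law, assuming the roughly 1% Weber fraction quoted on the slide. The function names are illustrative, and real thresholds vary with the attribute and viewing conditions.

```python
def just_noticeable_difference(stimulus, weber_fraction=0.01):
    """Smallest detectable change under Weber's Law: delta_S = k * S."""
    return weber_fraction * stimulus

def is_noticeable(s0, s1, weber_fraction=0.01):
    """True if the relative change from s0 to s1 reaches the Weber fraction."""
    return abs(s1 - s0) / s0 >= weber_fraction

def jnd_steps(start, count, weber_fraction=0.01):
    """Successive stimulus levels that are each one JND apart."""
    levels = [start]
    for _ in range(count):
        levels.append(levels[-1] * (1 + weber_fraction))
    return levels

# The same absolute change of 2 units is noticeable at 100 but not at 1000.
print(is_noticeable(100, 102), is_noticeable(1000, 1002))
print(jnd_steps(10, 3))
```

Because each step is a fixed ratio of the previous level, the absolute step size grows with the stimulus, which is the reasoning behind rule R5 (small steps at the low end, larger steps at the high end).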
Color Design Rules
R6: Discrimination is poorer for small objects. Hue, saturation and lightness discrimination all decrease.
Szafir, “Modeling Color Difference for Visualization Design” (2017)

More (Advanced) Color Picking Advice
If picking colors and making your own palette, make sure to transition through and pick discriminable colors that vary in hue and brightness.
Wong, “Points of view: Color coding” (2010)
COLOR SPACES
RGB: Great for monitor display. Not perceptually uniform.
HSL: Intuitive: Hue, Saturation, Lightness. Not perceptually uniform. (HSV is a variation on HSL.)
Lab: Perceptually uniform! (L approximates human perception of lightness.) a = R/G and b = Y/B channel.
Perceptually uniform: a change of the same amount in a color value should produce a change of about the same visual importance
Luminance is tricky…
[Figure: comparison of HSL and Lab]
visualization: the (static or interactive) visual representation of (abstract or spatial) data to reinforce human cognition
Why visualize your data?
• RECORD information
• ANALYZE data to support reasoning
• CONFIRM hypotheses
• COMMUNICATE ideas to others
5
GOALS FOR TODAY
• Learn basic “do’s and don’ts” of visualization design in
order to be honest, have integrity, and be clear
• Learn Tufte’s “Graphical Integrity” principles
• Be aware that there is a fuzzy gray area of interpretation
and opinion on integrity
6
9
2/14/2019
Graphical Excellence
that which gives the viewer the greatest number of ideas
in the shortest time
with the least ink
in the smallest space
Minard’s map of Napoleon’s march to and from Moscow
Graphical Integrity
representation of numbers should be directly proportional
to the numerical quantities represented
graphics must not quote out of context

clear, detailed, and thorough labeling should be
used to defeat graphical distortion and ambiguity
[Figure: example chart; y-axis from 0 to 100, x-axis years 1979 to 2004]
Graphical Integrity
show data variation, not design variation
New York Times, 19 Dec 1978
Graphical Integrity
clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity
Examples: New York Times, 02 May 2010; Washington Post
Some GRAPHICAL INTEGRITY
Principles in detail
21
“Graphical Integrity”
“Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.”
(Axes and axis labels, titles, annotations, legends, etc.)
Tufte, “Visual Display of Quantitative Information” (1983)
“Distorted Scales”
[Figure: chart comparing $11,014 and $3,549,385, annotated “y-axis baseline?!”]
[Figure: “Interest Rates” (Percent %), 2008 to 2012, with the y-axis truncated to 3.140 to 3.154]
Based on http://data.heapanalytics.com/how-to-lie-with-data-visualization
[Figure: “Interest Rates” (Percent %), 2008 to 2012, with the y-axis running from 0.00 to 4.00, annotated “CONTEXT!”]
“Double the axes, double the mischief”
http://www.thefunctionalart.com/2015/10/double-axes-double-mischief.html
http://www.babynamewizard.com/voyager 23
“The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities measured.”
Tufte, “Visual Display of Quantitative Information” (1983)
Lie Factor
$$\text{Lie Factor} = \frac{\text{Size of effect in graphic}}{\text{Size of effect in data}}$$
Lie Factor > 1: overstating
Lie Factor = 1: accurate :)
Lie Factor < 1: understating

Example (the graphic grows from 0.6” to 5.3” while the data grows from 18 to 27.5):
$$\text{Image} = \frac{5.3'' - 0.6''}{0.6''} = 7.83 = 783\%$$
$$\text{Data} = \frac{27.5 - 18}{18} = 0.53 = 53\%$$
$$\text{Lie Factor} = \frac{783\%}{53\%} = 14.8$$

IN-CLASS ACTIVITY:
Calculate for yourself! Make sure area is proportional to data!
$$\text{Data} = \frac{2 - 1}{1} = 1 = 100\%$$
✓ $$\text{Image} = \frac{2 - 1}{1} = 1 = 100\%, \qquad \text{Lie Factor} = \frac{100\%}{100\%} = 1$$
X $$\text{Image} = \frac{2^2 - 1^2}{1^2} = 3 = 300\%, \qquad \text{Lie Factor} = \frac{300\%}{100\%} = 3$$
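The Lie Factor arithmetic can be checked with a few lines of Python. This is only a sketch; the function and parameter names are illustrative and not part of Tufte's text.

```python
def relative_change(first, second):
    """Size of effect: change from first to second, relative to first."""
    return (second - first) / first

def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    return (relative_change(graphic_first, graphic_second)
            / relative_change(data_first, data_second))

# Worked example from the slide: 0.6" to 5.3" drawn for data going 18 to 27.5.
print(round(lie_factor(0.6, 5.3, 18, 27.5), 1))  # 14.8

# In-class activity: data doubles from 1 to 2.
print(lie_factor(1, 2, 1, 2))        # area doubles too: 1.0
print(lie_factor(1**2, 2**2, 1, 2))  # side length doubles, area quadruples: 3.0
```

The unrounded result for the first example is about 14.84; the slide rounds the two percentages first and gets 14.8 either way.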
Data Ink = the ink used to show data
Tufte: maximize the data-ink ratio
$$\text{Data-Ink Ratio} = \frac{\text{data-ink}}{\text{total ink in graphic}}$$
[Figure: low data-ink ratio vs. high data-ink ratio]

“No Unjustified 3D”
“The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.”

# Dimensions in data: 2, # Dimensions in plot: 3 (Occlusion! Lie Factor!)
# Dimensions in data: 2, # Dimensions in plot: 2
Image sources:
http://help.infragistics.com/Help/Doc/WinForms/2014.2/CLR4.0/html/Images/Chart_Bar_Chart_03.png
http://img.brothersoft.com/screenshots/softimage/0/3d_charts-171418-1269568478.jpeg

Unjustified 3D!
Lie factor!
http://stats.stackexchange.com/questions/109076/what-is-your-favorite-statistical-graph/109080 35

This is not just a design principle; it has lots of experimental and quantitative data to back it up!
Tory, et al. (2007)
To achieve graphical “excellence” according to Tufte:

1. Above all else show the data.
2. Maximize the data-ink ratio.
3. Erase non-data ink.
4. Erase redundant data ink.
5. Revise and edit.

IN-CLASS ACTIVITY:
Use paper/pen to sketch
“Tufte” version!
[Figure: chart of Percentage by Month]
“Chart Junk”
Bateman, et al. (2010)
Tufte, “Beautiful Evidence” (2006)
Not all “visual embellishments” are “chart junk”!
Chart junk can... persuade, help with memorability, engage
... bias, reduce data-ink ratio, clutter, degrade trust
Take-away: it depends on your audience, task, and context...
Similar advice of William Cleveland
(The Elements of Graphing Data, 1985)
• CLEAR VISION: Make clear visualizations, and ensure that the
data stands out.
• CLEAR UNDERSTANDING: Ensure that main points and
conclusions are graphically clear and represented.
• SCALES: Pick appropriate axes and tick-mark scales, and ensure
all the data is represented.
• GENERAL STRATEGY: Ensure all the data is represented. Design
your visualizations carefully and allow time to proofread.
46
Data Preparation as a step in the Knowledge Discovery Process
[Figure: knowledge discovery pipeline: DB, Cleaning and Integration, DW, Selection and Transformation, Data Mining, Evaluation and Presentation, Knowledge]
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view

– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily the data can be understood?
4
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
5
Forms of Data Preprocessing
Data Preprocessing
• Data Preprocessing: An Overview
– Data Quality
– Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
7
7
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
8
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of
entry
– history or changes of the data not registered
• Missing data may need to be inferred
9
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
– a global constant : e.g., “unknown”, or “- ” a new class?!
– the attribute mean (symmetric) or median (skewed)
– the attribute mean for all samples belonging to the same class,
e.g. average income in same credit_risk
– the most probable value: inference-based such as Bayesian
formula or decision tree
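The automatic fill-in strategies above (attribute mean or median, and the mean within the same class) can be sketched in a few lines. This is a minimal illustration using Python's `statistics` module; the income values and `credit_risk` labels are hypothetical.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def fill_by_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, label in zip(values, labels):
        if v is not None:
            by_class.setdefault(label, []).append(v)
    return [mean(by_class[label]) if v is None else v
            for v, label in zip(values, labels)]

# Hypothetical incomes (in thousands) with credit_risk labels.
incomes = [30, None, 50, 200, None]
risk = ["low", "low", "low", "high", "high"]

print(fill_missing(incomes, "mean"))
print(fill_missing(incomes, "median"))
print(fill_by_class_mean(incomes, risk))
```

With the skewed sample above, the overall mean is pulled up by the single large income, which is why the median or the per-class mean is often the safer fill value.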
10
How to Handle Missing Data?
• Fill it in automatically with
– the most probable value:
• Inference-based such as Bayesian formula or decision tree
• Identify relationships among variables

– Linear regression, Multiple linear regression, Nonlinear regression
• Nearest-Neighbour estimator
– Finding the k neighbours nearest to the point and fill in the most frequent value or
the average value
– Finding neighbours in a large dataset may be slow
11
Nearest-Neighbour
60
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
13
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
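Smoothing by bin means and by bin boundaries can be sketched like this. The price list is an illustrative data set, and the helper assumes the number of values divides evenly by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.

    Assumes len(values) is a multiple of n_bins.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```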
14
How to Handle Noisy Data?
• Regression
– smooth by fitting the data into regression functions
– Linear regression involves finding “best” line to fit two attributes,
so that one attribute can be used to predict the other.
– Multiple Linear regression – more than two attributes involved
and data fit to a multidimensional surface
• Clustering
– detect and remove outliers
– Outliers – values outside of the set of clusters
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
Data Preprocessing
– Data Quality
• Data Cleaning
• Data Reduction
• Summary
16
16
Data Integration
• Tuple Duplication
– The use of denormalized tables (improve performance by avoiding joins) creates
data redundancy
– Inconsistencies often arise between various duplicates, due to inaccurate data
entry
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units
– Hotel chain – price difference in currencies and services and taxes
– Attributes may differ on level of abstraction, e.g. total_sales – at branch level or
region level
Data Integration- Entity identification problem
• Data integration:
– Combines data from multiple sources into a coherent store
– Integrate metadata from different sources
• Entity identification problem:
– Schema integration and object matching: e.g., A.cust-id ≡ B.cust-#
– Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
– Metadata – name, meaning, data type, range, null rules
– Metadata can help avoid errors in schema integration
– Metadata may help transform the data
– When matching attributes from two databases, structure of data should be
checked
18
18
Handling Redundancy in Data Integration
• Redundant data occur often when integrating multiple databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
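As a sketch of correlation analysis for spotting redundant attributes, the Pearson correlation coefficient can be computed directly. The monthly and annual revenue figures are hypothetical; a coefficient near +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes from two sources: annual revenue is derived
# from monthly revenue, so the two are perfectly correlated.
monthly_revenue = [10, 20, 30, 40]
annual_revenue = [12 * m for m in monthly_revenue]

print(pearson(monthly_revenue, annual_revenue))
```

A high correlation only flags a candidate for removal; whether the attribute is truly redundant still depends on what it means in each source.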
19
19
Data Preprocessing
– Data Quality
• Data Cleaning
• Data Reduction
• Summary
20
20
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Parametric - Regression and Log-Linear Models
• Non-parametric - Histograms, clustering, sampling
• Data cube aggregation
– Data compression
• Lossless - Reconstruction without any loss of information
• Lossy – reconstruct only an approximation of the original data
21
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
22
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in data

• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space
x2
x1
23
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength. The principal components serve as new set of axes for the
data, giving important information on variance
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
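The steps above can be sketched for the two-dimensional case using only the standard library, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is an illustration under simplifying assumptions: the data is only mean-centered (the range-normalization step is skipped), and the sample points are made up.

```python
from math import sqrt, hypot

def pca_2d(points):
    """Return (variance, unit vector) for both components, plus the mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for the larger eigenvalue.
    if b != 0:
        v1 = (b, lam1 - a)
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal to v1
    return (lam1, v1), (lam2, v2), (mx, my)

def project(points, component, mean):
    """Coordinate of each point along one principal component."""
    return [(p[0] - mean[0]) * component[0] + (p[1] - mean[1]) * component[1]
            for p in points]

points = [(1, 1), (2, 2), (3, 3), (4, 4)]  # made-up data lying on y = x
(first_var, first_pc), (second_var, second_pc), center = pca_2d(points)
scores = project(points, first_pc, center)
print(first_pc, first_var, second_var)
print(scores)
```

Because the second component carries (almost) no variance here, keeping only the scores along the first component reduces the data from two dimensions to one with essentially no loss.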
24
Principal Component Analysis
• Works for numeric data only
• PCA can be applied to ordered and unordered attributes and can
handle sparse and skewed data
• Multidimensional data (more than two dimensions) can be handled by reducing the problem to two dimensions
• PCA handles sparse data better than wavelet transforms
Y1 and Y2 are first two principal components

Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
– Ex.: Log-linear models—obtain value at a point in n-
dimensional space as the product on appropriate marginal
subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
26
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple Linear regression
– Allows a response variable Y to be modeled as a linear function
of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
– Consider each tuple as a point in an n-dimensional space
27
Regression Analysis
• Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: data points around the line y = x + 1, marking an observed value Y1 and its fitted value Y1’ at x = X1]
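A least-squares line fit can be sketched in a few lines. The observations below are hypothetical values scattered around y = x + 1; once the slope and intercept are estimated, only those two parameters need to be stored.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [0, 1, 2, 3]
ys = [1.1, 1.9, 3.2, 3.8]  # hypothetical noisy observations near y = x + 1
slope, intercept = fit_line(xs, ys)

print(slope, intercept)
print("prediction at x = 4:", slope * 4 + intercept)
```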
Histogram Analysis
• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth): frequency of each bucket is constant
[Figure: histogram with x-axis from 10000 to 90000 and y-axis from 0 to 40]
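The two partitioning rules can be compared on a small, made-up data set with one extreme value. The function names are illustrative, and the equal-width helper assumes the values are not all identical.

```python
def equal_width_buckets(values, n_buckets):
    """Assign values to n_buckets intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets  # assumes hi > lo
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        index = min(int((v - lo) / width), n_buckets - 1)
        buckets[index].append(v)
    return buckets

def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values so each bucket holds about the same count."""
    ordered = sorted(values)
    size = len(ordered) / n_buckets
    return [ordered[round(i * size):round((i + 1) * size)]
            for i in range(n_buckets)]

def summarize(buckets):
    """Store only (count, average) per bucket instead of the raw values."""
    return [(len(b), sum(b) / len(b)) if b else (0, None) for b in buckets]

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up data with one extreme value
print(summarize(equal_width_buckets(values, 3)))
print(summarize(equal_frequency_buckets(values, 3)))
```

With the extreme value present, equal-width partitioning puts almost everything into the first bucket and leaves one bucket empty, while equal-frequency partitioning keeps the counts balanced.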
Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and clustering
algorithms
30
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling
• Note: Sampling may not reduce database I/Os (page at a time)
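Simple random sampling and stratified sampling can be contrasted with Python's `random` module. This is a sketch: the group labels, the sampling fraction, and the fixed seeds are assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement, ignoring any group structure."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one row each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
rows = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

srs = simple_random_sample(rows, 10)
strat = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)
print(sum(1 for r in srs if r[0] == "senior"), "seniors in the simple random sample")
print(sum(1 for r in strat if r[0] == "senior"), "seniors in the stratified sample")
```

The stratified sample is guaranteed to represent the small group, whereas a simple random sample of the same size may miss it entirely.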
31
Data Reduction 3: Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered as
forms of data compression
32
Data Compression
[Figure: Original Data and Compressed Data linked by a lossless path; a lossy path yields only Original Data Approximated]
Data Preprocessing
– Data Quality
• Data Cleaning
• Data Reduction
• Summary
34
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
• Methods
– Statistics: Descriptive and Distribution
– Smoothing: Remove noise from data – binning, regression, clustering
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, used in data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
35
Descriptive Statistics: Univariate
• Range, Min/Max
• Difference between minimum and maximum values in a data set
• Larger range usually (but not always) indicates a large spread or deviation in the
values of the data set.
• Average
• Sum of all values divided by the number of values in the data set.
• One measure of central location in the data set.
• Median
• The middle value in a sorted data set. Half the values are greater and half are less than
the median.
• Mode
• The most frequent occurring value.
• Another measure of central location in the data set.
Distribution Statistics
• Variance
• One measure of dispersion (deviation from the mean) of a data set. The larger the
variance, the greater is the average deviation of each datum from the average value
• Standard Deviation
• The square root of the variance; roughly the typical deviation from the mean of a data set.
• Histograms and Normal Distribution
• Variance and SD are critical in analyzing your data distribution and determining how “meaningful” the chosen average is
Distribution Statistics:
Normal and Skewed Distributions
• When data are
skewed, the mean and
SD can be misleading
• Skewness
sk = 3 (mean − median) / SD
If |sk| > 1 then the distribution is
non-symmetrical
• Negatively skewed
• Mean<Median
• Sk is negative
• Positively Skewed
• Mean>Median
• Sk is positive
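The skewness measure above can be computed with the standard `statistics` module. A minimal sketch, using the sample standard deviation and made-up data sets:

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Skewness estimate from the slide: 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 100]   # one large value pulls the mean up
left_tail = [-100, 1, 2, 3, 4]   # one small value pulls the mean down

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(name, mean(data), median(data), round(pearson_skewness(data), 2))
```

In the two skewed samples the mean moves far from the median, which illustrates why the mean and SD alone can be misleading for skewed data.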
Problems in reading distribution
• We can’t really tell much about this data set
• Even Min and Max are hard to see
[Figure: column chart of “Data Values” (0 to 120) against x-axis labels 1 to 20]
The data can be presented such that more statistical info can be
estimated from the chart (average, standard deviation).
Plotting the distribution
• Determine a frequency table (bins)
• A histogram is a column chart of the frequencies

Category Labels   Frequency
0-50              3
51-60             2
61-70             6
71-80             5
81-90             3
>90               1

[Figure: histogram of Frequency (0 to 7) for each Scores category]
Distribution Statistics: Histogram
• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data
• For small data sets, histograms can be misleading. Small changes in
the data or to the bucket boundaries can result in very different
histograms.
• For large data sets, histograms can be quite effective at illustrating
general properties of the distribution.
• Histograms effectively only work with 1 variable at a time

• Difficult to extend to 2 dimensions, not possible for >2
• So histograms tell us nothing about the relationships among variables
Normalization
• The measurement unit can affect the data analysis
• A smaller unit leads to a larger range and thus gives more weight to an
attribute
• Normalize data between [-1,1] or [0,1] to avoid dependence on
choice of measurement unit
• Min-max normalization: to [new_minA, new_maxA]
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then
$73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
– Min-max normalization preserves the relationships among the
original data values
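The min-max formula translates directly into code. This is a minimal sketch; the function and parameter names are assumptions, and the numbers are the income example from the slide.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the slide: range $12,000 to $98,000, target [0.0, 1.0].
income = min_max_normalize(73_600, 12_000, 98_000)
print(round(income, 3))  # 0.716
```

Because the mapping is linear and increasing, the order of the original values and their relative spacing are kept.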
Normalization
• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
– Ex. Let μ_A = 54,000, σ_A = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
– Useful when the actual minimum and maximum values are unknown, or
when there are outliers that dominate the min-max normalization
– A variation replaces standard deviation by mean absolute deviation
• Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
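Both z-score normalization and decimal scaling can be sketched briefly. The function names and the sample values [-986, 917] are hypothetical; the z-score call uses the slide's example. The decimal-scaling search starts at j = 0, which is an assumption for data whose largest absolute value is already below 1.

```python
def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in units of SD."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    that brings every scaled absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Slide example: mean 54,000 and standard deviation 16,000.
z = z_score(73_600, 54_000, 16_000)
print(round(z, 3))  # 1.225

# Hypothetical values: the largest magnitude is 986, so j = 3.
scaled, j = decimal_scaling([-986, 917])
print(j, scaled)
```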
Discretization
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Data Discretization Methods
• Typical methods (all can be applied recursively):
– Binning
• Top-down split, unsupervised (does not use class information)
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ²) analysis (supervised, bottom-up merge)
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: panels showing the original data, equal interval width (binning), equal frequency (binning), and K-means clustering]
K-means clustering leads to better results
Discretization by Histogram Analysis
• Histogram analysis is an unsupervised discretization technique as
it does not use class information
• Equal-width – values are partitioned into equal sized partitions
or ranges
• Equal frequency – values are partitioned so each partition
contains the same number of data tuples
• Histogram analysis algorithm can be applied recursively to each
partition to automatically generate multilevel concept hierarchy
• Histogram can be partitioned based on cluster analysis of the
data distribution
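The two partitioning rules can be compared on a small sample. This is a minimal sketch; the function names and the twelve sample values are hypothetical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; width would be zero")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k partitions of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical sample values.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
print("equal width:    ", width_bins)
print("equal frequency:", freq_bins)
```

Equal-width bins can end up with very different counts when the data are unevenly spread, while equal-frequency bins keep the counts balanced at the cost of unequal interval widths.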
Discretization by Classification & Correlation
Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
• Correlation analysis (e.g., Chi-merge: χ²-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
– Merge performed recursively, until a predefined stopping condition is met
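Both supervised ideas can be sketched on toy data. The first function picks the split point that minimizes the weighted class entropy of the two sides, which is one step of the top-down approach. The second computes the χ² statistic for two adjacent intervals, the quantity a Chi-merge style procedure would compare before merging. All names and sample data are hypothetical, and neither function is a full discretization algorithm.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best_score:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint
            best_score = score
    return best_point, best_score

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.
    Each argument lists the number of tuples per class in one interval."""
    total = sum(counts_a) + sum(counts_b)
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (counts_a[j] + counts_b[j]) / total
            if expected > 0:  # skip classes absent from both intervals
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical measurements with class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["benign"] * 3 + ["cancerous"] * 3
point, score = best_split(values, labels)
print(point, score)

same = chi2_adjacent([3, 0], [3, 0])       # identical class distributions
different = chi2_adjacent([3, 0], [0, 3])  # opposite class distributions
print(same, different)
```

Intervals with identical class distributions give a χ² of zero and are the natural candidates to merge.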
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and
is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data at multiple granularities
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.
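Replacing low-level numeric values by higher-level concepts can be sketched as a simple lookup. The labels youth, adult, and senior come from the slide; the age cutoffs 25 and 60 and all names are hypothetical, chosen only for illustration.

```python
# Hypothetical cutoffs: (exclusive upper bound, concept label).
AGE_LEVELS = [(25, "youth"), (60, "adult")]

def age_concept(age):
    """Replace a numeric age by a higher-level concept."""
    for upper, label in AGE_LEVELS:
        if age < upper:
            return label
    return "senior"

ages = [19, 24, 25, 47, 60, 83]
concepts = [age_concept(a) for a in ages]
print(concepts)
```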
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Specification of a set of attributes, but not of their partial ordering
– Concept hierarchy based on number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the
data set
– The attribute with the most distinct values is placed at the
lowest level of the hierarchy
country 15 distinct values
province_or_state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
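The distinct-value heuristic amounts to a sort. This is a minimal sketch using the counts from the slide; the function names and the three sample records are hypothetical.

```python
def distinct_value_counts(records):
    """Count the distinct values of each attribute in a list of dict records."""
    attributes = records[0].keys()
    return {a: len({r[a] for r in records}) for a in attributes}

def hierarchy_by_distinct_values(counts):
    """Order attributes from top level (fewest distinct values)
    to lowest level (most distinct values)."""
    return sorted(counts, key=counts.get)

# Counts from the slide.
slide_counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}
levels = hierarchy_by_distinct_values(slide_counts)
print(" < ".join(reversed(levels)))

# Hypothetical records, to show how the counts could be obtained from data.
records = [
    {"city": "Urbana", "state": "Illinois"},
    {"city": "Champaign", "state": "Illinois"},
    {"city": "Chicago", "state": "Illinois"},
]
record_counts = distinct_value_counts(records)
print(record_counts)
```

The resulting order is only a heuristic, so the generated hierarchy should be reviewed before use.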

