In-class exercise: Critique & Redesign
Describe: What do you see?
Analyze: How is the work organized? What are the visual encodings?
Task: What is the purpose of the visualization?
Decide: Is this a successful (effective) visualization?
1. Who is the intended audience?
2. What information does this visualization represent?
3. How many data dimensions does it encode?
4. List several tasks, comparisons, or evaluations it enables.
5. What principles of excellence best describe why it is good / bad?
6. Can you suggest any improvements?
7. Why do you like / dislike this visualization?
http://www.theatlantic.com/past/docs/images/issues/200709/win.jpg
MARKS & CHANNELS
Visualization Building Blocks

MARK: Points, Lines, Areas
CHANNEL: Position, Color, Shape, Tilt, Size (Length, Area, Volume)

A series of examples counts the number of attributes encoded in each chart: 1, 2, 3, or 4, depending on how many channels are applied to a mark. One 3-attribute example becomes 4 with position in 3D space.

Kindlmann (2004)
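As a concrete illustration of marks and channels (not taken from the slides), the sketch below maps data fields to channels on a point mark and counts the encoded attributes. The field names and the `encode` helper are hypothetical.

```python
# Map data fields to channels on a mark and count encoded attributes.
# A minimal sketch; the field names and spec format are illustrative only.

def encode(mark, **channels):
    """Build a tiny chart spec: one mark type plus field-to-channel mappings."""
    return {"mark": mark, "encoding": dict(channels)}

# A scatter plot encoding 4 attributes on point marks:
spec = encode(
    "point",
    x="horsepower",      # position (horizontal)
    y="mileage",         # position (vertical)
    color="origin",      # color hue for a categorical field
    size="price",        # area for a quantitative field
)

n_attributes = len(spec["encoding"])
print(n_attributes)  # 4 fields -> 4 encoded attributes
```

Adding a channel (e.g. shape) raises the count, which is exactly the "# of attributes encoded" counted on the slides.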
Visualization Building Blocks

Marks as Items/Nodes: Points, Lines, Areas
Marks as Links: Containment, Connection
How do I pick which marks or channels to use?

Bertin's Semiology of Graphics
1. A, B, C are distinguishable.
2. B is between A and C.
3. BC is twice as long as AB.
-> These properties let a channel encode quantitative variables.
Characteristics of Visual Variables
Selective: Is a mark distinct from other marks? Can we make out the difference between two marks?
Associative: Does it support grouping?
Quantitative: Can we quantify the difference between two marks?
Order: Can we see a change in order?
Length: How many unique marks can we make?
Position
• Strongest visual variable; suitable for all data types
• Problems:
  • Sometimes not available
  • Cluttering

Size & Length
• Good visual variable
• Easy to see whether one is bigger; grouping works
• Judging differences:
  • Good for aligned bars (position)
  • OK for changes in length
  • Bad for changes in area

Shape

Orientation
"Ordering of Elemental Perceptual Tasks"
https://www.washingtonpost.com/news/wonk/wp/2013/06/17/the-usefulness-of-pie-charts-in-two-pie-charts/

Channel Ranking by Data Type (Quantitative, Ordinal, Categorical)
Mackinlay (1986)
Cleveland & McGill's Results and Crowdsourced Results
[Chart of log error (roughly 1.0 to 3.0) per encoding: positions, angles, circular areas, rectangular areas (aligned or in a treemap)]
Expressiveness and Effectiveness
Mackinlay (1986)

Channels: Expressiveness Types and Effectiveness Ranks
Tilt/angle, Shape, Color luminance, Color saturation, Curvature
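The effectiveness rankings can be captured in a small lookup table. The sketch below is an illustration, not Mackinlay's actual algorithm, and the ranking lists are abbreviated: it picks the highest-ranked channel still available for a given data type.

```python
# Abbreviated channel rankings per data type, after Mackinlay (1986).
# Illustrative sketch: the lists are truncated, not the full rankings.
RANKING = {
    "quantitative": ["position", "length", "angle", "area", "color_luminance"],
    "ordinal":      ["position", "color_luminance", "color_saturation", "area"],
    "categorical":  ["position", "color_hue", "shape", "area"],
}

def best_channel(data_type, available):
    """Return the highest-ranked channel for data_type among available ones."""
    for channel in RANKING[data_type]:
        if channel in available:
            return channel
    return None

print(best_channel("quantitative", {"area", "length"}))  # length outranks area
print(best_channel("categorical", {"shape", "area"}))    # shape outranks area
```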
Lecture: Visualization of Multidimensional Data
DATA VISUALIZATION
SPRING 2019

The Visualization Pipeline (InfoVis)
[Pipeline diagram annotated with task and user interaction]
Why are Steps 2 and 3 highly correlated?
Representations
[Tukey box plot: low, middle 50%, high, mean]

Bivariate Data
Representations: scatter plot, e.g. price vs. mileage

Trivariate Data
Representations: e.g. horsepower vs. mileage plus a third channel
Trivariate
Visual representations:
Complex glyphs
E.g. star glyphs, faces, embedded visualization, …
Multiple views of different dimensions
E.g. small multiples, plot matrices, brushing histograms, …
Non-orthogonal axes
E.g. Parallel coords, star coords, …
Tabular layout
E.g. TableLens, …
Interactions:
Dynamic Queries
Brushing & Linking
Selecting for details, …
Combinations (combine multiple techniques)
Glyphs

Glyphs: Chernoff Faces
[Face glyph mapping dimensions d3 through d6 to facial features]

Non-orthogonal axis

Parallel Coordinates (2D)
• Encode variables along a horizontal row, one vertical axis per variable
• The height at which an item's line crosses each vertical axis specifies its value
[Example axes x, y, z, w with the point (0, 0, 0, 0)]
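A parallel-coordinates polyline can be computed by normalizing each variable to [0, 1] and pairing it with an evenly spaced axis position. A minimal sketch with illustrative variable names and no plotting library:

```python
# Compute parallel-coordinates polylines: each record becomes a sequence of
# (axis_x, normalized_value) points, one per variable. Illustrative data.

def polylines(records, variables):
    lo = {v: min(r[v] for r in records) for v in variables}
    hi = {v: max(r[v] for r in records) for v in variables}
    lines = []
    for r in records:
        pts = []
        for i, v in enumerate(variables):
            span = hi[v] - lo[v]
            y = (r[v] - lo[v]) / span if span else 0.5
            pts.append((i, y))  # axis i at x = i, value normalized to [0, 1]
        lines.append(pts)
    return lines

cars = [{"price": 10, "mileage": 30}, {"price": 30, "mileage": 10}]
lines = polylines(cars, ["price", "mileage"])
print(lines)  # [[(0, 0.0), (1, 1.0)], [(0, 1.0), (1, 0.0)]]
```

The crossing polylines are exactly how a negative correlation shows up in a parallel-coordinates plot.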
Parallel Coordinates Example
Basic, Grayscale, Color

Parallel Coordinates
http://noppa5.pc.helsinki.fi/koe/3d3.html
Multiple Views
[Table of four items with attributes A through E, shown alongside its visualization]

Small Multiples

Trellis Plots
Characteristics
Can sort on any attribute (row)
Focus on an attribute value (show only cases having that value) by double-clicking on it
Can type in queries on different attributes to limit what is presented. Note the main contribution: dynamic control (selection/change/querying/filtering) of individual attributes.
Lecture : Visualizing
Graphs and Networks
DATA VISUALIZATION
SPRING 2019
http://www.sci.utah.edu/~miriah/cs6630/lectures/L13-trees-graphs.pdf
Graph and Network Uses
• In Information Visualization, any number
of data sets can be modeled as a graph
– Telephone system
– World Wide Web
– Distribution network for on-line retailer
– Call graph of a large software system
– Semantic map in an AI algorithm
– Set of connected friends
– Social Networks
Graphs are more complicated
than trees
Graph Terminology
Label
Thickness
Color
Directed
Complexity Considerations
• Crossings: minimize towards planar
• Total Edge Length: minimize towards proper scale
• Area: minimize towards efficient use of space
• Maximum Edge Length: minimize the longest edge
• Uniform Edge Lengths: minimize variances
• Total Bends: minimize, from orthogonal towards straight-line
Graph Visualization Problems
• Graph layout and positioning
– Make a concrete rendering of abstract graph
• Scale
– Not too much of a problem for small graphs, but
large ones are much tougher
• Navigation/Interaction
– How to support user changing focus and moving
around the graph
Graph Layout
• How to position the nodes and edges?
• Avoid clutter
• Maintain appropriate relations
The Hairball Problem
Layout Types
• Grid Layout
– Put nodes on a grid
• Force Directed Layout
– Model graph as set of masses connected by
springs
• Planar Layout
– Detect part of graph that can be laid out without
edge crossings
• Attribute Based Layouts
Layout Subproblems
• Rank Assignment
– Compute which nodes have large degree, put
those at center of clusters
• Crossing Minimization
– Swap nodes to rearrange edges
• Subgraph Extraction
– Pull out cluster of nodes
• Planarization
– Pull out a set of nodes that can be laid out in the plane
Planar Layouts
Starting simple: planar 3-vertex-connected graphs (what?)

Tutte Embedding
• Fix the outer face; place each interior node at the average (barycenter) of its neighbors
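The barycentric condition can be solved by simple iteration: pin the outer-face nodes, then repeatedly move every interior node to the average of its neighbors. A minimal sketch on a hypothetical 5-node graph (square outer face plus one interior node):

```python
# Tutte-style barycentric embedding by iteration: outer-face nodes are fixed,
# interior nodes move to the average of their neighbors. Graph is illustrative.

def tutte_layout(adj, fixed, iters=100):
    pos = dict(fixed)
    for node in adj:
        pos.setdefault(node, (0.0, 0.0))  # start interior nodes anywhere
    for _ in range(iters):
        for node in adj:
            if node in fixed:
                continue
            nbrs = adj[node]
            pos[node] = (sum(pos[n][0] for n in nbrs) / len(nbrs),
                         sum(pos[n][1] for n in nbrs) / len(nbrs))
    return pos

# Square outer face (a, b, c, d) with one interior node e joined to all corners:
adj = {"a": ["b", "d", "e"], "b": ["a", "c", "e"],
       "c": ["b", "d", "e"], "d": ["a", "c", "e"],
       "e": ["a", "b", "c", "d"]}
fixed = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
pos = tutte_layout(adj, fixed)
print(pos["e"])  # interior node settles at the square's center (0.5, 0.5)
```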
UNIX ancestry
SUGIYAMA STEP 1
- create layering of graph
- from domain-specific knowledge
- longest path from root
[Layers: 2 3 4 | 5 6 7 8 | 9 10 11]

SUGIYAMA STEP 2
- minimize crossings layer by layer (NP-hard)
- numerous heuristics available
[Layers reordered: 2 3 4 | 8 6 5 7 | 11 9 10]

SUGIYAMA STEP 3
- final assignment of x-coordinates
- routing of edges
[Layers: 2 3 4 | 8 6 5 7 | 11 9 10]
Gansner 1993
SUGIYAMA
+ nice, readable top-down flow
+ relatively fast (depending on heuristic used for crossing minimization)

FORCE-DIRECTED LAYOUT
- edges = springs
[Example: a high-school dating network]
FORCE MODEL
- many variations, but usually a physical analogy:
- repulsion: f_R(d) = C_R * m1 * m2 / d^2
- attraction: f_A(d) = C_A * (d - L)
- L is the rest length of the spring
- i.e. Hooke's Law

Slide by Frank van Ham
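Those two formulas are enough for a toy force-directed step: every node pair repels with C_R * m1 * m2 / d^2, every edge attracts with C_A * (d - L). A minimal 2-node sketch with unit masses and illustrative constants:

```python
import math

# Toy force-directed layout using the slide's force model:
# repulsion f_R(d) = C_R * m1 * m2 / d^2, attraction f_A(d) = C_A * (d - L).
# Two unit-mass nodes joined by one spring; constants are illustrative.
C_R, C_A, L, STEP = 0.1, 0.1, 1.0, 0.05

pos = {0: [0.0, 0.0], 1: [1.2, 0.0]}
edges = [(0, 1)]

def distance(a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

for _ in range(2000):
    force = {n: [0.0, 0.0] for n in pos}
    for a in pos:                         # pairwise repulsion
        for b in pos:
            if a >= b:
                continue
            d = distance(a, b)
            f = C_R / d ** 2              # m1 = m2 = 1
            ux = (pos[a][0] - pos[b][0]) / d
            uy = (pos[a][1] - pos[b][1]) / d
            force[a][0] += f * ux; force[a][1] += f * uy
            force[b][0] -= f * ux; force[b][1] -= f * uy
    for a, b in edges:                    # spring attraction along edges
        d = distance(a, b)
        f = C_A * (d - L)
        ux = (pos[b][0] - pos[a][0]) / d
        uy = (pos[b][1] - pos[a][1]) / d
        force[a][0] += f * ux; force[a][1] += f * uy
        force[b][0] -= f * ux; force[b][1] -= f * uy
    for n in pos:
        pos[n][0] += STEP * force[n][0]
        pos[n][1] += STEP * force[n][1]

d = distance(0, 1)
# Equilibrium where C_A*(d - L) = C_R/d^2, i.e. d^3 - d^2 = 1, so d is about 1.47.
```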
Adjacency Matrix
Henry & Fekete (2006)

Adjacency Matrix
• Change network to tabular data and use a matrix representation
• Derived data: nodes are keys, edges are Boolean values
• Task: look up connections, find well-connected clusters
• Scalability: millions of edges

Node-link and Adjacency Matrix (can co-exist together)
[McGuffin]
Adjacency Matrix
Pros:
• great for dense graphs
• visually scalable
• can spot clusters
Cons:
• row order affects what you can see
• abstract visualization
• hard to follow (multilink) paths

Node-Link or Adjacency Matrix?
• Empirical study: for most tasks, node-link is better for small graphs and adjacency matrices are better for large graphs
• Immediate connectivity and neighbor tasks are OK in both, as is estimating size (number of nodes & edges)
• People tend to be more familiar with node-link diagrams
https://bost.ocks.org/mike/miserables/
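Building the Boolean matrix from an edge list is a line or two per edge. A minimal sketch with a hypothetical undirected graph:

```python
# Build a Boolean adjacency matrix from an edge list (undirected).
# Node names and edges are illustrative.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]

index = {n: i for i, n in enumerate(nodes)}
matrix = [[False] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = True
    matrix[index[v]][index[u]] = True   # symmetric: undirected edge

# Row sums support the "find well-connected nodes" task:
degree = {n: sum(matrix[index[n]]) for n in nodes}
print(degree)  # {'a': 2, 'b': 2, 'c': 2, 'd': 0}
```

Reordering `nodes` reorders rows and columns, which is exactly why row order affects which clusters you can see.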
Attribute-driven layouts

ATTRIBUTE-DRIVEN LAYOUT
- large node-link diagrams get messy!
- are there additional structures we can exploit?

[Cerebral: a metabolic network laid out by pathway (GLU, glycolysis, G6P, PPP, R5P, G3P, PYR, TCA, CIT), with expression values on a 0 to 1.00 color scale]
OTHER NODE-LINK LAYOUTS
- orthogonal
  - great for UML diagrams
  - algorithmically complex
- circular layouts
  - emphasize ring topologies
  - used in social network diagrams
- nested layouts
  - recursively apply layout algorithms
  - great for graphs with hierarchical structure

More node-link layouts: Arc Diagram
Lecture:
Interaction Techniques in Visualization
DATA VISUALIZATION
SPRING 2019
Select

Navigate
- Item Reduction: Zoom (geometric or semantic), Pan/Translate, Constrained
- Attribute Reduction: Slice, Cut, Project
Highlighting
• Selection is the user action
• Feedback is important!
• How? Change selected item's visual encoding
- Change color: want to achieve visual popout
- Add outline mark: allows original color to be preserved
- Change size (line width)
- Add motion: marching ants
Highlighting
[http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html]
Navigation
Navigation
• Fix the layout of all visual elements but provide methods for the
viewpoint to change
• Camera analogy: only certain features visible in a frame
- Zooming
- Panning (aka scrolling)
- Translating
- Rotating (rare in 2D, important in 3D)
Panning
https://www.google.com/finance?q=INDEXFTSE
Zooming
https://www.google.com/finance?q=INDEXFTSE
"Geometric" vs. "Semantic" Zooming
Linked Highlighting
Share Navigation

Multiple Views
- Same encoding: Redundant, Small Multiples
- Multiform: Multiform, Multiform Overview/Detail, Overview/Detail
- No Linkage
Multiple Views
Partition into Side-by-Side Views
Superimpose Layers
Multiform Views
• The same data visualized in different ways
• Does not need to be a totally different encoding (all choices need
not be disjoint), e.g. horizontal positions could be the same
• One view becomes cluttered with too many attributes
• Consumes more screen space
• Allows greater separability between channels
Multiple Views
Example of Facets

Small Multiples
• Same encoding, but different data in each view (e.g. SPLOM)
[SPLOM of the iris data: sepal length, sepal width, petal length, petal width]
[http://bl.ocks.org/mbostock/4063663]
Brushing
[Brushing in the iris SPLOM: a rectangular selection in one panel highlights the same items in all panels]
[http://bl.ocks.org/mbostock/4063663]
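Brushing & linking reduces to one selection shared by all views: a brush rectangle in one panel selects item indices, and every panel re-colors those indices. A minimal sketch with hypothetical data:

```python
# Brushing & linking: a brush rectangle in one view selects item indices;
# linked views highlight the same indices. Data and fields are illustrative.
items = [
    {"sepal_length": 5.1, "petal_length": 1.4},
    {"sepal_length": 7.0, "petal_length": 4.7},
    {"sepal_length": 6.3, "petal_length": 6.0},
]

def brush(items, x_field, y_field, x_range, y_range):
    """Indices of items inside the brush rectangle in one panel."""
    (x0, x1), (y0, y1) = x_range, y_range
    return {i for i, it in enumerate(items)
            if x0 <= it[x_field] <= x1 and y0 <= it[y_field] <= y1}

selected = brush(items, "sepal_length", "petal_length", (6.0, 7.5), (4.0, 7.0))
# Every linked panel colors the same selection:
colors = ["red" if i in selected else "gray" for i in range(len(items))]
print(colors)  # ['gray', 'red', 'red']
```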
Multiple Views
Example of Partitioned Views

Overview-Detail View
[Wikipedia]

Partitioned View
[Grouped bar chart: population (0 to 9.0M) by age group (Under 5, 5 to 13, 14 to 17, 18 to 24, 25 to 44 Years) for CA, TX, NY, FL, IL, PA]
[Small multiples of London boroughs (Hillingdon, Brent, Camden, Westminster, ...), each partitioned by housing type (Flat, Terraced, Semi, Detached)]
[Slingsby et al., 2009]
Multiple Views
Example of Superimposed Layers

Superimposed Line Charts
[Temperature (ºF) over one year, October through September, for Austin, New York, San Francisco]

Multiform, Overview/Detail; Overview/Detail
Linked Highlighting
Zoom Techniques
Provides a detailed view of a subset within the context of the full dataset

Multiform, Overview/Detail; Overview/Detail
Focus+Context
• Show everything at once but compress regions that are not the
current focus
- User shouldn't lose sight of the overall picture
- May involve some aggregation in non-focused regions
- "Nonliteral navigation" like semantic zooming
• Elision
• Superimposition: more directly tied than with layers
• Distortion
Focus+Context Overview
• Embed: Elide Data, Superimpose Layer, Distort Geometry
• Reduce: Filter, Aggregate
Focus + Context
Elision
• There are a number of examples of elision, including in text, DOITrees, …
• Includes both filtering and aggregation, but the goal is to give an overall view of the data
• In visualization, usually correlated with focus regions
Elision: DOITrees
(c) Enrichment
[Extended Lens in Tominski et al., 2014]
Focus + Context
Distortion
Distortion: Fisheye Lens
Leung 1994
http://www.cs.umd.edu/class/fall2002/cmsc838s/tichi/fisheye.html
Distortion Choices
• How many focus regions?
- One
- Multiple
• Shape of the focus?
- Radial
- Rectangular
- Other
• Extent of the focus
- Constrained similar to magic lenses
- Entire view changes
• Type of interaction:
- Geometric, moveable lenses, rubber sheet
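A common 1-D fisheye magnification (the Sarkar & Brown variant of Furnas's fisheye) maps a normalized distance x from the focus to g(x) = (d + 1) * x / (d * x + 1), where d is the distortion factor. A minimal sketch:

```python
# Sarkar-Brown style 1-D fisheye transform: g(x) = (d+1)*x / (d*x + 1)
# for x in [0, 1] measured from the focus point; d is the distortion factor.

def fisheye(x, d=3.0):
    return (d + 1) * x / (d * x + 1)

# Points near the focus are spread apart; the boundary stays fixed:
print(fisheye(0.0))   # 0.0 (the focus does not move)
print(fisheye(1.0))   # 1.0 (the edge of the view stays put)
print(fisheye(0.25))  # near-focus region is magnified (maps past 0.5)
```

Setting d = 0 recovers the identity, i.e. no distortion.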
Examples of Focus + Context

Stretch and Squish Navigation

(a) Bring (step 1): selecting a node fades out all graph elements but the node's neighborhood. (b) Bring (step 2): neighbor nodes are pulled close to the selected node. (c) Go: after selecting a neighbor (the green node in Fig. 4(b)), a short animation brings the focus towards a new neighborhood.
DATA VISUALIZATION
SPRING 2019

CONES & RODS
https://askabiologist.asu.edu/sites/default/files/resources/articles/seecolor/Light-though-eye-big.png
http://i.stack.imgur.com/wIbcE.jpg
http://thebrain.mcgill.ca/flash/a/a_02/a_02_m/a_02_m_vis/a_02_m_vis.html

CONES & RODS
This is why luminance (brightness) is a more effective encoding channel!
Rods: 120 million
Cones: 5-6 million
64% red-sensitive
32% green-sensitive
2% blue-sensitive
http://arthistoryresources.net/visual-experience-2014/visual-experience-2014-images/red-green-blue-wavelengths+rods-big.jpg
PERIPHERAL VISION
https://en.wikipedia.org/wiki/Peripheral_vision

LOW-LEVEL FEATURE ANALYSIS
Shape
Color
Motion
Ware, VTFD

Use these "popout" effects to help design effective visualizations!
Ware, VTFD

(A LITTLE MORE) PERCEPTION

"Get it right in black and white."
-Maureen Stone
https://research.tableau.com/user/maureen-stone
Luminance
Luminance = the amount of visible light that comes to the eye from a surface
Luminance (lightness)

"Simultaneous Contrast"
COLOR

Why color…?
• Color for labeling and annotation
• Color for measuring (encoding sequential data)
• Color for encoding categories
• Color for encoding meaning (conventions, representation)
• Color as beauty (aesthetics)

Why color…?
Functions of color: Identify, Group, Layer, Highlight
Tufte, "Envisioning Information"

CONES & RODS: Red, Green, Blue
NOTE: opposite colors are never perceived together (no reddish green or bluish yellow)

Color Constancy
"Simultaneous Contrast"
Be careful with bars and scatter plot points: the colors may appear different with different background colors and neighboring colors!
Small Area Effects
Which area is larger (green or red)?

Luminance
Saturation
Hue
Color Deficiencies (Color Blindness)
Person with faulty cones (or faulty pathways) vs. normal

Those with deuteranopia (red/green color blindness) will have difficulty seeing the numbers.

"Get it right in black and white."

Color Deficiencies (Color Blindness)
http://www.vischeck.com/vischeck/vischeckImage.php
https://www.nytimes.com/interactive/2018/02/06/climate/flood-toxic-chemicals.html
Primary Colors?
• Red, Green, and Blue
• Red, Yellow, and Blue
• Orange, Green, and Violet
• Cyan, Magenta, and Yellow
• All of the above!
Color Addition and Subtraction
Color Spaces and Gamuts
[http://dot-color.com/2012/08/14/color-space-confusion/]
Color Spaces and Gamuts
• Color space: the organization of all colors in space
- Often human-specific, what we can see (e.g. CIELAB)
• Color gamut: a subset of colors
- Defined by corners in the color space
- What can be produced on a monitor (e.g. using RGB)
- What can be produced on a printer (e.g. using CMYK)
- The gamut of your monitor != the gamut of someone else's != the gamut of a printer
Color Models
• A color model is a representation of color using some basis
• RGB uses three numbers (red, green, blue) to represent color
• Color space ~ color model, but there can be many color models used in the same color space (e.g. OGV)
• Hue-Saturation-Lightness (HSL) is more intuitive and useful
- Hue captures pure colors
- Saturation captures the amount of white mixed with the color
- Lightness captures the amount of black mixed with a color
- HSL color pickers are often circular
• Hue-Saturation-Value (HSV) is similar (swap black with gray for the final value); the two are linearly related
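Python's standard library can demonstrate the RGB/HSL relationship directly. Note that `colorsys` calls the model HLS and orders the components hue, lightness, saturation:

```python
import colorsys

# Convert pure red between RGB and HSL. colorsys names the model HLS and
# returns (hue, lightness, saturation), all in [0, 1].
h, l, s = colorsys.rgb_to_hls(1.0, 0.0, 0.0)
print(h, l, s)  # 0.0 0.5 1.0 -> hue at red, mid lightness, fully saturated

# Round-trip back to RGB:
r, g, b = colorsys.hls_to_rgb(h, l, s)
print(r, g, b)  # 1.0 0.0 0.0
```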
Color Maps
Color Map = mapping between color and value
http://matplotlib.org/mpl_examples/color/colormaps_reference_05.png

Rainbow Color Map
Why this color map is a poor choice...
• No perceptual ordering (confusing)
• No luminance variation (obscures details)
• Viewers perceive sharp transitions in color as sharp transitions in the data, even when this is not the case (misleading)

Artifacts from Rainbow Colormaps
Colormap
Binary, Categorical, Diverging, Sequential
Categorical vs. Ordered

[Combining color ramps that vary luminance, saturation, and hue into a bivariate scheme]
http://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/
Categorical Colormap Guidelines
[colorbrewer2.org]

Categorical Colormaps
[colorbrewer2.org]
Number of distinguishable colors?

Color Maps
http://colorbrewer2.org/

Colorgorical
http://vrl.cs.brown.edu/color
Color Advice Summary
Use a limited hue palette
• Control color "pop out" with low-saturation colors
• Avoid clutter from too many competing colors
Use neutral backgrounds
• Control impact of color
• Minimize simultaneous contrast
Use ColorBrewer for scales
Don't forget aesthetics!
Based on slides by Hanspeter Pfister, Maureen Stone
Color Design Rules

Weber's Law
Our ability to detect a difference between two objects with a certain attribute is related to the percent difference in the attribute, not the absolute difference.
ΔS / S = constant

If picking colors and making your own palette, make sure to transition through and pick discriminable colors that vary in hue and brightness.
Wong, "Points of view: Color coding" (2010)
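Weber's law makes a concrete prediction: the just-noticeable difference (JND) scales with the stimulus. A small sketch with an illustrative Weber fraction k = 0.02 (the constant depends on the attribute and the viewer):

```python
# Weber's law: delta_S / S = k, so the just-noticeable difference (JND)
# grows proportionally with the stimulus S. k = 0.02 is illustrative.
K = 0.02

def jnd(s):
    """Smallest detectable change at stimulus level s."""
    return K * s

# The same absolute change can be detectable against a dim background but
# invisible against a bright one:
print(jnd(10))   # small JND at a low stimulus level
print(jnd(100))  # ten times the stimulus -> ten times the JND
```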
More (Advanced) Color Picking Advice

COLOR SPACES
RGB: great for monitor display; not perceptually uniform
HSL: intuitive (Hue, Saturation, Lightness); not perceptually uniform (HSV is a variation on HSL)
Lab: perceptually uniform! (L approximates human perception of lightness; a = R/G and b = Y/B channels)

Perceptually uniform: a change of the same amount in a color value should produce a change of about the same visual importance.

More (Advanced) Color Picking Advice
Luminance is tricky…
HSL
Lab
(static or interactive) (abstract or spatial)

Why visualize your data?
• RECORD information
• ANALYZE data to support reasoning
• CONFIRM hypotheses
• COMMUNICATE ideas to others

GOALS FOR TODAY
• Learn basic "do's and don'ts" of visualization design in order to be honest, have integrity, and be clear
• Learn Tufte's "Graphical Integrity" principles
• Be aware that there is a fuzzy gray area of interpretation and opinion on integrity
2/14/2019

Graphical Excellence
that which gives the viewer the greatest number of ideas
in the shortest time
with the least ink
in the smallest space

Minard's map of Napoleon's march to and from Moscow

Graphical Integrity
representation of numbers should be directly proportional to the numerical quantities represented
[Example time series, 1979 to 2004, y-axis 0 to 60]

Graphical Integrity
show data variation, not design variation

Graphical Integrity
clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity
Washington Post

Some GRAPHICAL INTEGRITY Principles in detail
"Graphical Integrity"
[Bar chart with a truncated y-axis baseline: $11,014 vs. $3,549,385. y-axis baseline?!]
"Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data." Tufte, "Visual Display of Quantitative Information" (1983)

Interest Rates
[Line chart, Percent %, 2008 to 2012, y-axis from 3.140 to 3.154: the truncated axis exaggerates a tiny change]

Interest Rates
[The same data with a y-axis from 0.00 to 4.00: the line is nearly flat. CONTEXT!]
Based on http://data.heapanalytics.com/how-to-lie-with-data-visualization

"Double the axes, double the mischief"
http://www.thefunctionalart.com/2015/10/double-axes-double-mischief.html
[Bars of height 1 and 2 vs. squares of side 1 and 2]
Encoding a doubling (1 to 2) with length: image change = (2 - 1) / 1 = 1 = 100%. Lie Factor = 100% / 100% = 1 ✓
Encoding the same doubling with area: image change = (2² - 1²) / 1² = 3 = 300%. Lie Factor = 300% / 100% = 3 ✗
"The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities measured." Tufte, "Visual Display of Quantitative Information" (1983)
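The Lie Factor arithmetic generalizes to a one-line function: the size of the effect shown in the graphic divided by the size of the effect in the data, each measured as a relative change. A sketch:

```python
# Tufte's Lie Factor: (size of effect shown in graphic) / (size of effect in
# data), where each effect is measured as a relative (percent) change.

def relative_change(before, after):
    return (after - before) / before

def lie_factor(graphic_before, graphic_after, data_before, data_after):
    return (relative_change(graphic_before, graphic_after)
            / relative_change(data_before, data_after))

# Data doubles (1 -> 2). Length-encoded bars double too: honest.
print(lie_factor(1, 2, 1, 2))          # 1.0
# Area-encoded squares: side doubles, so area goes 1 -> 4, a 300% visual change.
print(lie_factor(1 ** 2, 2 ** 2, 1, 2))  # 3.0
```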
"Graphical Integrity"
Data-Ink = the ink used to show data. Tufte: maximize the data-ink ratio.
Data-Ink Ratio = data-ink / total ink in graphic
http://help.infragistics.com/Help/Doc/WinForms/2014.2/CLR4.0/html/Images/Chart_Bar_Chart_03.png
http://img.brothersoft.com/screenshots/softimage/0/3d_charts-171418-1269568478.jpeg
http://stats.stackexchange.com/questions/109076/what-is-your-favorite-statistical-graph/109080
"No Unjustified 3D"

IN-CLASS ACTIVITY:
Use paper/pen to sketch the "Tufte" version!

"Graphical Integrity"
[Chart of a percentage by month, drawn with heavy embellishment]
"Chart Junk"
Bateman, et al. (2010)
Tufte, "Beautiful Evidence" (2006)

Not all "visual embellishments" are "chart junk"!

Similar advice of William Cleveland (The Elements of Graphing Data, 1985)
• CLEAR VISION: Make clear visualizations, and ensure that the data stands out.
• CLEAR UNDERSTANDING: Ensure that main points and conclusions are graphically clear and represented.
• SCALES: Pick appropriate axes and tick-mark scales, and ensure all the data is represented.
• GENERAL STRATEGY: Ensure all the data is represented. Design your visualizations carefully and allow time to proofread.
Data Preparation as a Step in the Knowledge Discovery Process
DB -> Cleaning and Integration -> DW -> Selection and Transformation -> Data Mining -> Evaluation and Presentation -> Knowledge

Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
5
Forms of Data Preprocessing
Data Preprocessing
• Data Preprocessing: An Overview
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
7
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
8
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of
entry
– not register history or changes of the data
• Missing data may need to be inferred
9
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant, e.g., “unknown” (effectively a new class?!)
– the attribute mean (for symmetric data) or the median (for skewed data)
– the attribute mean for all samples belonging to the same class,
e.g., the average income within the same credit_risk class
– the most probable value: inference-based, such as a Bayesian
formula or a decision tree
10
How to Handle Missing Data?
• Fill it in automatically with
– the most probable value:
• Inference-based, such as a Bayesian formula or a decision tree
• Nearest-neighbour estimator
– Find the k neighbours nearest to the point and fill in the most frequent
value (categorical) or the average value (numeric)
– Finding neighbours in a large dataset may be slow
11
Nearest-Neighbour
[Figure: nearest-neighbour imputation illustration]
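The automatic fill-in strategies above can be sketched in plain Python; the records and field names below are made-up assumptions for illustration:

```python
# Sketch of three fill-in strategies: a global constant, the attribute
# mean, and the class-conditional mean. Hypothetical records.
from statistics import mean

records = [
    {"income": 40000, "credit_risk": "low"},
    {"income": 52000, "credit_risk": "low"},
    {"income": None,  "credit_risk": "low"},   # missing value
    {"income": 95000, "credit_risk": "high"},
]

# 1) Global constant (note: this effectively creates a new class)
filled_const = [r["income"] if r["income"] is not None else "unknown"
                for r in records]

# 2) Attribute mean (prefer the median for skewed data)
known = [r["income"] for r in records if r["income"] is not None]
filled_mean = [r["income"] if r["income"] is not None else mean(known)
               for r in records]

# 3) Class-conditional mean: average income within the same credit_risk class
def class_mean(cls):
    return mean(r["income"] for r in records
                if r["income"] is not None and r["credit_risk"] == cls)

filled_class = [r["income"] if r["income"] is not None
                else class_mean(r["credit_risk"]) for r in records]
# filled_class[2] is the mean of the two known "low" incomes
```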
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
13
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
14
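The binning procedure above can be sketched as follows; the nine sample prices are illustrative (not from these slides), chosen so the equal-frequency bins come out evenly:

```python
# Smoothing by binning: sort, partition into equal-frequency bins, then
# replace values by the bin mean or the nearest bin boundary.
def equal_frequency_bins(values, n_bins):
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # each value is replaced by the closer of the bin's two boundaries
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))      # bin means 9.0, 22.0, 29.0
print(smooth_by_boundaries(bins)) # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```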
How to Handle Noisy Data?
• Regression
– smooth by fitting the data into regression functions
– Linear regression: find the “best” line fitting two attributes,
so that one attribute can be used to predict the other
– Multiple linear regression: more than two attributes are involved,
and the data are fit to a multidimensional surface
• Clustering
– detect and remove outliers
– Outliers – values outside of the set of clusters
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
Data Preprocessing
• Data Preprocessing: An Overview
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
16
16
Data Integration
• Tuple Duplication
– The use of denormalized tables (to improve performance by avoiding joins)
creates data redundancy
– Inconsistencies often arise between various duplicates, due to inaccurate data
entry
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units
– Hotel chain – price difference in currencies and services and taxes
– Attributes may differ on level of abstraction, e.g. total_sales – at branch level or
region level
Data Integration: Entity Identification Problem
• Data integration:
– Combines data from multiple sources into a coherent store
– Integrate metadata from different sources
• Entity identification problem:
– Schema integration and object matching: e.g., A.cust-id ≡ B.cust-#
– Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
– Metadata – name, meaning, data type, range, null rules
– Metadata can help avoid errors in schema integration
– Metadata may help transform the data
– When matching attributes from two databases, structure of data should be
checked
18
Handling Redundancy in Data Integration
Data Preprocessing
• Data Preprocessing: An Overview
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
20
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Parametric - Regression and Log-Linear Models
• Non-parametric - Histograms, clustering, sampling
• Data cube aggregation
– Data compression
• Lossless - Reconstruction without any loss of information
• Lossy – reconstruct only an approximation of the original data
21
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
22
Principal Component Analysis (PCA)
[Figure: principal components of a 2-D data set (attributes x1, x2)]
23
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength. The principal components serve as new set of axes for the
data, giving important information on variance
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
24
Principal Component Analysis
• Works for numeric data only
• Can be applied to ordered and unordered attributes, and can
handle sparse and skewed data
• Multidimensional data are handled by reducing them to two dimensions
• PCA handles sparse data better than wavelet transforms
27
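A minimal sketch of the PCA steps listed earlier: center the data, compute orthonormal components (here via SVD of the centered matrix), keep the k strongest. The synthetic, nearly one-dimensional data set is an assumption for illustration:

```python
import numpy as np

def pca(X, k):
    """Return the k-component scores and the components themselves."""
    Xc = X - X.mean(axis=0)                         # normalize to zero mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # rows of Vt are orthonormal principal components, already sorted by
    # decreasing singular value, i.e., decreasing variance ("significance")
    return Xc @ Vt[:k].T, Vt[:k]

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=100)])  # nearly 1-D

scores, components = pca(X, k=1)
# keeping only the strongest component still reconstructs the centered
# data well, since the discarded component carries little variance
reconstruction = scores @ components
```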
Regression Analysis
[Figure: fitted line y = x + 1, with an observed value Y1 and its fitted value Y1′ at X1]
• Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
• The parameters are estimated so as to give a “best fit” of the data
• Most commonly the best fit is evaluated using the least squares method,
but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
28
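The “best fit by least squares” idea can be sketched directly; the five sample points are illustrative and lie exactly on the figure's line y = x + 1:

```python
# Minimal least-squares fit of a line y = a*x + b to paired data.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1, 2, 3, 4, 5]      # exactly y = x + 1
a, b = fit_line(xs, ys)
print(a, b)               # 1.0 1.0
```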
Histogram Analysis
• Divide data into buckets and store the
average (or sum) for each bucket
• Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth): the frequency of each
bucket is constant
[Figure: equal-width histogram of prices from 10,000 to 90,000]
29
Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and clustering
algorithms
30
Sampling
31
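The sampling slide's figure is not reproduced here; as a sketch, one common form is simple random sampling without replacement, where every tuple has an equal chance of being drawn and none is drawn twice:

```python
import random

def srs_without_replacement(data, n, seed=0):
    # simple random sample of n items; no item appears twice
    return random.Random(seed).sample(data, n)

population = list(range(100))
sample = srs_without_replacement(population, 10)
print(len(sample), len(set(sample)))   # 10 10
```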
Data Reduction 3: Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequences are not audio
– Typically short, and they vary slowly with time
• Dimensionality and numerosity reduction may also be considered as
forms of data compression
32
Data Compression
[Figure: Original Data compressed losslessly vs. a lossy, Approximated version]
33
Data Preprocessing
• Data Preprocessing: An Overview
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
34
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
• Methods
– Statistics: Descriptive and Distribution
– Smoothing: Remove noise from data – binning, regression, clustering
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, used in data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
35
Descriptive Statistics: Univariate
• Range, Min/Max
• Difference between minimum and maximum values in a data set
• Larger range usually (but not always) indicates a large spread or deviation in the
values of the data set.
• Average
• Sum of all values divided by the number of values in the data set.
• One measure of central location in the data set.
• Median
• The middle value in a sorted data set. Half the values are greater and half are less than
the median.
• Mode
• The most frequent occurring value.
• Another measure of central location in the data set.
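The univariate measures above, sketched with Python's statistics module (the seven scores are illustrative):

```python
from statistics import mean, median, mode

data = [55, 62, 62, 70, 71, 84, 93]

value_range = max(data) - min(data)  # range: max minus min
avg = mean(data)                     # average: sum / count
med = median(data)                   # middle value of the sorted data
most = mode(data)                    # most frequently occurring value
print(value_range, avg, med, most)   # 38 71 70 62
```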
Distribution Statistics
• Variance
• One measure of dispersion (deviation from the mean) of a data set. The larger the
variance, the greater is the average deviation of each datum from the average value
• Standard Deviation
• the average deviation from the mean of a data set.
[Figure: column chart of 20 raw data values (0 to 80); the average and deviation are hard to see]
The data can be presented such that more statistical information (average,
standard deviation) can be estimated from the chart.
Distribution Statistics:
Plotting the distribution
• Determine a frequency table (bins)
• A histogram is a column chart of the frequencies
Category (Scores) | Frequency
0-50              | 3
51-60             | 2
61-70             | 6
71-80             | 5
81-90             | 3
>90               | 1
[Figure: column chart of the frequencies above, by score range]
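Building the frequency table above from raw scores (the twenty scores below are hypothetical; only the bin edges come from the table):

```python
# Count how many scores fall into each bin; anything above the last
# bin edge goes into the ">90" overflow category.
bins = [(0, 50), (51, 60), (61, 70), (71, 80), (81, 90)]

def frequency_table(scores):
    freq = {f"{lo}-{hi}": 0 for lo, hi in bins}
    freq[">90"] = 0
    for s in scores:
        for lo, hi in bins:
            if lo <= s <= hi:
                freq[f"{lo}-{hi}"] += 1
                break
        else:
            freq[">90"] += 1
    return freq

scores = [10, 20, 30, 55, 58, 61, 63, 65, 67, 68,
          70, 71, 73, 75, 78, 80, 82, 85, 90, 95]
ft = frequency_table(scores)
print(ft)  # {'0-50': 3, '51-60': 2, '61-70': 6, '71-80': 5, '81-90': 3, '>90': 1}
```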
Distribution Statistics: Histogram
• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data
• For small data sets, histograms can be misleading. Small changes in
the data or to the bucket boundaries can result in very different
histograms.
• For large data sets, histograms can be quite effective at illustrating
general properties of the distribution.
Normalization: z-score
– Ex. Let μ = 54,000 and σ = 16,000. Then v′ = (73,600 − 54,000) / 16,000 = 1.225
– Useful when the actual minimum and maximum values are unknown, or
when there are outliers that dominate the min-max normalization
– A variation replaces the standard deviation by the mean absolute deviation
Normalization by decimal scaling
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
43
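The three normalization methods can be sketched as plain functions. The z-score numbers come from the example above; the min-max bounds (12,000 and 98,000) and the decimal-scaling values are illustrative assumptions:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # rescale v from [vmin, vmax] into [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # deviation from the mean, in units of the standard deviation
    return (v - mu) / sigma

def decimal_scaling(values):
    j = 0                                   # smallest j with max(|v'|) < 1
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score(73600, 54000, 16000))     # 1.225, matching the example above
print(min_max(73600, 12000, 98000))     # about 0.716 (illustrative min/max)
print(decimal_scaling([-986, 917]))     # [-0.986, 0.917] (illustrative values)
```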
Discretization
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
44
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised (does not use class
information)
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-up
merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)
45
Discretization Without Using Class Labels
(Binning vs. Clustering)
46
Discretization by Histogram Analysis
• Histogram analysis is an unsupervised discretization technique as
it does not use class information
• Equal-width – values are partitioned into equal sized partitions
or ranges
• Equal frequency – values are partitioned so each partition
contains the same number of data tuples
• Histogram analysis algorithm can be applied recursively to each
partition to automatically generate multilevel concept hierarchy
• Histogram can be partitioned based on cluster analysis of the
data distribution
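Equal-width and equal-frequency partitioning, as described above, in a short sketch (the twelve values are illustrative):

```python
# Equal-width: split the value range into k intervals of equal size.
def equal_width_edges(values, k):
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(k + 1)]

# Equal-frequency: each partition holds the same number of tuples.
def equal_frequency_partitions(values, k):
    data = sorted(values)
    n = len(data) // k
    return [data[i * n:(i + 1) * n] for i in range(k)]

vals = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_width_edges(vals, 3))           # [5.0, 75.0, 145.0, 215.0]
print(equal_frequency_partitions(vals, 3))  # three partitions of 4 values each
```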
Discretization by Classification & Correlation
Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
– Merge performed recursively, until a predefined stopping condition
48
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and
is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.
49
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} ⊂ Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Specification of a set of attributes, but not of their partial ordering
– Concept hierarchy based on number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
50
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the
data set
– The attribute with the most distinct values is placed at the
lowest level of the hierarchy
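The distinct-value-count heuristic can be sketched directly; the toy records below are assumptions for illustration:

```python
# Order attributes by their number of distinct values: the attribute with
# the most distinct values goes at the lowest level of the hierarchy
# (listed first here), the one with the fewest at the top.
records = [
    {"street": "1 Elm St",  "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "9 Oak Ave", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "5 Pine Rd", "city": "Chicago",     "state": "IL", "country": "USA"},
    {"street": "2 Main St", "city": "Columbus",    "state": "OH", "country": "USA"},
]

def hierarchy_by_distinct_counts(records, attrs):
    counts = {a: len({r[a] for r in records}) for a in attrs}
    return sorted(attrs, key=lambda a: -counts[a])

order = hierarchy_by_distinct_counts(records, ["street", "city", "state", "country"])
print(order)   # ['street', 'city', 'state', 'country']
```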