
1. Visualization Fundamentals and Design Principles

netquest.com 3
What is data visualization?

Data visualization is the process of acquiring, interpreting, and comparing data in order to clearly communicate complex ideas, thereby facilitating the identification and analysis of meaningful patterns.

Data visualization can be essential to strategic communication: it helps us interpret available data; detect patterns, trends, and anomalies; make decisions; and analyze inherent processes. All told, it can have a powerful impact on the business world.

Data visualization applications enable users to visualize data, draw insights, and understand it better. They allow people to organize and present information intuitively, since people grasp pictures more readily than tables of rows and columns.

Tableau, Roambi, Qlik, Salesforce Einstein Analytics, Highcharts, Google Charts, FusionCharts, Infogram, and Sisense are some well-known web applications for data visualization.
The data visualization process

Several different fields are involved in the data visualization process, with the aim of simplifying or revealing existing relationships, or discovering something new within a data set.

[Diagram: the visualization process1]

Filtering & processing. Refining and cleaning data to convert it into information through analysis, interpretation, contextualization, comparison, and research.

Translation & visual representation. Shaping the visual representation by defining graphic resources, language, context, and the tone of the representation, all of which are adapted for the recipient.

Perception & interpretation. Finally, the visualization becomes effective when it has a perceptive impact on the construction of knowledge.

1 Pérez, J. and Vialcanet, G. (2013). Guía de visualización de datos aplicada al marketing digital: Cómo transformar datos en conocimiento (p. 5-6).
Why is data visualization so important in reports and statements?

We live in the era of visual information, and visual content plays an important role in every moment of our lives. A study by SH!FT Disruptive Learning demonstrated that we typically process images 60,000 times faster than a table or a text, and that our brains typically do a better job remembering them in the long term. That same research detected that after three days, analyzed subjects retained between 10% and 20% of written or spoken information, compared with 65% of visual information.

All of this indicates that human beings are better at processing visual information, which is lodged in our long-term memory.

Consequently, for reports and statements, a visual representation that uses images is a much more effective way to communicate information than text or a table; it also takes up much less space. This means that data visuals are more attractive, simpler to take in, and easier to remember.

The rationale behind the power of visuals:

• The human mind can see an image for just 13 milliseconds and store the information, provided that it is associated with a concept. Our eyes can take in 36,000 visual messages per hour.
• 40% of nerve fibers are connected to the retina.

Try it for yourself. Take a look at this table:

Month: Jan  Feb  Mar  Apr  May  Jun
Sales:  45   56   36   58   75   62

Identifying the evolution of sales over the course of the year isn't easy. However, when we present the same information in a visual, the results are much clearer (see the graph below).

[Bar chart: monthly sales, Jan through Jun]

The graph takes what the numbers cannot communicate on their own and conveys it in a visible, memorable way. This is the real strength of data visualization.

"Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space."
- Edward Tufte (2001)
Data visualization chiefly helps in 3 key aspects of reports and statements:

1) Explaining

Visuals aim to lead the viewer down a path in order to describe situations, answer questions, support decisions, communicate information, or solve specific problems. When you attempt to explain something through data visualization, you start with a question, which interacts with the data set in such a way that enables viewers to make a decision and, subsequently, answer the question.

For example: the graphic below could clearly explain which country has the greatest demand for a certain product compared globally, in a concrete month.

[Bar chart: demand by country (United States, Russia, South Africa, Europe, Canada, Australia, Japan)]

2) Exploring

Some visuals are designed to lend a data set spatial dimensions, or to offer numerous subsets of data in order to raise questions, find answers, and discover opportunities. When the goal of a visual is to explore, viewers start by familiarizing themselves with the dataset, then identifying an area of interest, asking questions, exploring, and finding several solutions or answers.

For example: an interactive graphic from The Guardian2 invites us to explore how the linguistic standard of U.S. presidential addresses has declined over time. The visual is interactive and explanatory, in addition to indicating the readability score of various presidents' speeches.

3) Analyzing

Other visuals prompt viewers to inspect, distill, and transform the most significant information in a data set so that they can discover something new or predict upcoming situations.

For example: this interactive graphic about machine learning3 invites us to explore and discover information within the visual by scrolling through it. Using the machine learning method, the visual explains the patterns detected in the data in order to categorize characteristics.

We'll close this introduction with a 2012 reflection by Alberto Cairo, a specialist in information visualization and a leader in the world of data visualization. For the author, a good visual must provide clarity, highlight trends, uncover patterns, and reveal unseen realities:

"We create visuals so that users can analyze data and, from it, discover realities that not even the designer, in some instances, had considered."

2 Available at: https://www.fusioncharts.com/whitepapers/downloads/Principles-of-Data-Visualization.pdf
3 Available at: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
2. Data types, relationships, and visualization formats
There are a number of methods and approaches to creating visuals based on the nature and complexity of the data and the information. Different kinds of graphics are used in data visualizations, including representations of statistics, maps, and diagrams. These schematic, visual representations of content vary in their degree of abstraction.

In order to communicate effectively, it is important to understand different kinds of data and to establish visual relationships through the proper use of graphics. Enrique Rodríguez (2012), a data analyst at DataNauta, once explained in an interview that...

"A good graphic is one that synthesizes and contextualizes all of the information that's necessary to understand a situation and decide how to move forward."

2 kinds of data

Before we talk about visuals themselves, we must first understand the different kinds of data that can be visualized and how they relate to one another. The most common kinds of data are4:

1) Quantitative (numeric)

Data that can be quantified and measured. This kind of data explains a trend or the results of research through numeric values. This category of data can be further subdivided into:

• Discrete: Data that consists of whole numbers (0, 1, 2, 3...). For example, the number of children in a family.
• Continuous: Data that can take any value within an interval. For example, people's height (between 60 and 70 inches) or weight (between 90 and 110 pounds).

2) Qualitative (categoric)

This kind of data is divided into categories based on non-numeric characteristics. It may or may not have a logical order, and it measures qualities and generates categorical answers. It can be:

• Ordinal: Meaning it follows an order or sequence. That might be the alphabet or the months of the year.
• Categorical: Meaning it follows no fixed order. For example, varieties of products sold.

4 Source: Hubspot, Prezi, and Infogram (2018). Presenting Data People Can't Ignore: How to Communicate Effectively Using Data (p. 10 of 16). Available at: https://offers.hubspot.com/presenting-data-people-cant-ignore.
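A rough sense of this taxonomy can be captured in code. The sketch below is illustrative only: the function name `classify_column` and its rules are our own, not from this guide, and real tools use far richer type inference. It guesses quantitative vs. qualitative from a column of raw values; note that ordinal vs. categorical cannot be told apart from values alone, since order is domain knowledge.

```python
def classify_column(values):
    """Naive sketch (our own heuristic): guess the kind of data
    in a column. Integers -> quantitative discrete, any float ->
    quantitative continuous, everything else -> qualitative.
    Ordinal vs. categorical needs domain knowledge, so qualitative
    data is not subdivided here."""
    if all(isinstance(v, bool) for v in values):
        return "qualitative (ordinal or categorical)"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative (continuous)"
    return "qualitative (ordinal or categorical)"

print(classify_column([0, 1, 2, 3]))           # children per family
print(classify_column([64.2, 68.0, 70.5]))     # heights in inches
print(classify_column(["Jan", "Feb", "Mar"]))  # months of the year
```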
7 data relationships

Data relationships can be simple, like the progress of a single metric over time (such as visits to a blog over the course of 30 days or the number of users on a social network), or they can be complex, precisely comparing relationships, revealing structure, and extracting patterns from data. There are seven data relationships to consider:

• Ranking: A visualization that relates two or more values with respect to a relative magnitude. For example: a company's most sold products.
• Nominal comparisons: Visualizations that compare quantitative values from different subcategories. For example: product prices in various supermarkets.
• Series over time: Here we can trace the changes in the values of a constant metric over the course of time. For example: monthly sales of a product over the course of two years.
• Correlation: Data with two or more variables that can demonstrate a positive or negative correlation with one another. For example: salaries based on level of education.
• Deviation: Examines how each data point relates to the others and, particularly, to what point its value differs from the average. For example: the line of deviation for tickets to an amusement park sold on a rainy versus a normal day.
• Distribution: Visualization that shows the distribution of data spatially, often around a central value. For example: the heights of players on a basketball team.
• Partial and total relationships: Show a subset of data as compared with a larger total. For example: the percentage of clients that buy specific products.
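One of these relationships, correlation, is easy to quantify before it is ever plotted. A minimal sketch (not from the guide; the helper name `pearson` and the example figures are our own) computing the Pearson correlation coefficient in pure Python:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    series. Returns a value in [-1, 1]: positive means the
    variables rise together, negative means one falls as the
    other rises."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: years of education vs. salary (in thousands)
education = [10, 12, 14, 16, 18]
salary = [28, 33, 41, 50, 62]
print(round(pearson(education, salary), 3))
```

A strongly positive result here would suggest a scatter plot with an upward trend; values near zero suggest no linear relationship worth drawing.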
11 formats

There are two types of visualizations: static and interactive. Their use depends on the search and analysis dimension level. Static visuals can only analyze data in one dimension, whereas interactive visuals can analyze it in several.

As with any other form of communication, familiarity with the code and resources that are available to us is essential if we're going to use them successfully toward our goal. In the pages that follow, we present the different kinds of graphics that we can use to transform our data into information. This group of visualization types is listed in order of popularity in the "Visualization Universe" project by Google News Lab and Adioma, as of the publication of this report.

1. Bar chart

Bar charts are one of the most popular ways of visualizing data because they present a data set in a quickly understood format that enables viewers to identify highs and lows at a glance. They are very versatile, and they are typically used to compare discrete categories, to analyze changes over time, or to compare parts of a whole. The three variations on the bar chart are:

• Vertical column: used for chronological data, which should be in left-to-right format.
• Horizontal column: used to visualize categories.
• Full stacked column: used to visualize categories that collectively add up to 100%.

[Charts: vertical column, horizontal column, and full stacked column examples]
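The full stacked column variant requires one preparation step: normalizing each bar's raw counts so its segments add up to 100%. A minimal sketch of that step (function name and figures are our own, purely illustrative):

```python
def to_percentages(counts):
    """Convert raw counts for one bar into percentage segments
    that sum to 100, as used in a full stacked column chart."""
    total = sum(counts)
    return [100 * c / total for c in counts]

# Made-up monthly figures for three categories
january = {"Education": 300, "Entertainment": 500, "Health": 200}
shares = to_percentages(list(january.values()))
for name, share in zip(january, shares):
    print(f"{name}: {share:.0f}%")
```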
2. Histograms

Histograms represent a variable in the form of bars, where the surface of each bar is proportional to the frequency of the values represented. They offer an overview of the distribution of a population or sample with respect to a given characteristic. The two variations on the histogram are:

• Vertical columns
• Horizontal columns

[Charts: vertical-column and horizontal-column histogram examples]

3. Pie charts

Pie charts consist of a circle divided into sectors, each of which represents a portion of the total. They can be subdivided into no more than five data groups. They can be useful for comparing discrete or continuous data. The two variations on the pie chart are:

• Standard: Used to exhibit relationships between parts.
• Donut: A stylistic variation that facilitates the inclusion of a total value or a design element in the center.

[Charts: standard pie chart and donut pie chart examples]
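Behind every histogram is a binning step: sorting raw values into equal-width intervals and counting the frequency in each. A sketch in pure Python (names and data are our own; plotting libraries do this for you, often with smarter bin-width rules):

```python
def histogram(values, n_bins, lo=None, hi=None):
    """Count values into n_bins equal-width bins over [lo, hi].
    Values equal to hi fall into the last bin, mirroring how most
    plotting libraries close the final interval."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

# Made-up ages, binned into four 12.5-year-wide intervals
ages = [25, 27, 31, 33, 34, 38, 41, 45, 52, 58, 63]
print(histogram(ages, 4, lo=20, hi=70))
```

Each count then becomes the height (vertical variant) or length (horizontal variant) of one bar.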
4. Scatter plots

Scatter plots use the spread of points over a Cartesian coordinate plane to show the relationship between two variables. They also help us determine whether or not different groups of data are correlated.

[Charts: a scatter plot, and a scatter plot with grid]

5. Heat maps

Heat maps represent individual values from a data set on a matrix using variations in color or color intensity. They often use color to help viewers compare and distinguish between data in two different categories at a glance. They are useful for visualizing webpages, where the areas that users interact with most are represented with "hot" colors, and the pages that receive the fewest clicks are presented in "cold" colors. The two variations on the heat map are:

• Mosaic diagram
• Color map

[Charts: a mosaic diagram and a color map]
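The core operation of a heat map is mapping each cell's value to a color along a cold-to-hot scale. A minimal sketch (our own names and endpoint colors) using linear interpolation between a "cold" blue and a "hot" red in RGB:

```python
def heat_color(value, vmin, vmax,
               cold=(0, 0, 255), hot=(255, 0, 0)):
    """Map a value in [vmin, vmax] to an RGB color by linear
    interpolation between a cold and a hot endpoint color."""
    t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))  # clamp out-of-range values
    return tuple(round(c + t * (h - c)) for c, h in zip(cold, hot))

print(heat_color(0, 0, 100))    # coldest: pure blue
print(heat_color(100, 0, 100))  # hottest: pure red
print(heat_color(50, 0, 100))   # midpoint: in between
```

Applying `heat_color` to every cell of a matrix yields the color map variant; perceptually uniform scales in real libraries use more careful interpolation than straight RGB.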
6. Line charts

These are used to display changes or trends in data over a period of time. They are especially useful for showcasing relationships, acceleration, deceleration, and volatility in a data set.

[Chart: line chart example]

7. Bubble charts

These graphics display three-dimensional data and accentuate data in dispersion diagrams and maps. Their purpose is to highlight nominal comparisons and classification relationships. The size and color of the bubbles represent a dimension that, along with the data, is very useful for visually stressing specific values. The two variations on the bubble chart are:

• The bubble plot: used to show a variable in three dimensions, position coordinates (x, y) and size.
• The bubble map: used to visualize three-dimensional values for geographic regions.

8. Radar charts

These are a form of representation built around a regular polygon that is contained within a circle, where the radii that guide the vertices are the axes over which the values are represented. They are equivalent to graphics with parallel coordinates on polar coordinates. Typically, they are used to represent the behavior of a metric over the course of a set time cycle, such as the hours of the day, months of the year, or days of the week.

[Chart: radar chart example]
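The radar chart's construction, with axes laid out as the radii of a regular polygon, is just a polar-to-Cartesian conversion. An illustrative sketch (function name, axis orientation, and data are our own assumptions):

```python
import math

def radar_points(values, r_max=1.0):
    """Place one value per axis of a radar chart.

    Axis k of n sits at angle 90 degrees minus k * 360/n (first
    axis points up, subsequent axes proceed clockwise); each value
    is scaled so the largest reaches r_max. Returns the (x, y)
    vertices of the data polygon.
    """
    n = len(values)
    top = max(values)
    points = []
    for k, v in enumerate(values):
        angle = math.pi / 2 - 2 * math.pi * k / n
        r = r_max * v / top
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Made-up metric over the days of the week
visits = [120, 90, 75, 80, 95, 160, 150]
days = "Mon Tue Wed Thu Fri Sat Sun".split()
for (x, y), day in zip(radar_points(visits), days):
    print(f"{day}: ({x:+.2f}, {y:+.2f})")
```

Connecting the returned vertices in order (and closing the polygon) produces the familiar radar outline.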
9. Waterfall charts

These help us understand the cumulative effect of positive and negative values on variables in a sequential fashion.

[Chart: waterfall chart running from Start to End, with rises and falls]

10. Treemaps

Tree maps display hierarchical data (in a tree structure) as a set of nested rectangles that occupy surface areas proportional to the value of the variable they represent. Each tree branch is given a rectangle, which is later placed in a mosaic with smaller rectangles that represent secondary branches. The finished product is an intuitive, dynamic visual of a plane divided into areas that are proportional to hierarchical data, which has been sorted by size and given a color key.

[Diagram: a tree structure and its corresponding treemap]
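The "cumulative effect" a waterfall chart shows can be computed directly: each bar floats between the running total before and after its delta. A small sketch (our own names and numbers) producing the (bottom, top) of every floating bar:

```python
def waterfall(start, deltas):
    """Compute the (bottom, top) of each floating bar in a
    waterfall chart, plus the final cumulative total. Positive
    deltas rise from the running level; negative deltas fall."""
    bars, level = [], start
    for d in deltas:
        bottom, top = (level, level + d) if d >= 0 else (level + d, level)
        bars.append((bottom, top))
        level += d
    return bars, level

bars, end = waterfall(100, [30, -20, 50, -10])
print(bars)
print("End:", end)
```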
11. Area charts

These represent the relationship of a series over time, but unlike line charts, they can represent volume. The three variations on the area chart are:

• Standard area: used to display or compare a progression over time.
• Stacked area: used to visualize relationships as part of the whole, thus demonstrating the contribution of each category to the cumulative total.
• 100% stacked area: used to communicate the distribution of categories as part of a whole, where the cumulative total does not matter.

[Charts: standard area, stacked area, and 100% stacked area examples]

Selecting the right graphic to effectively communicate through our visualizations is no easy task. Stephen Few (2009), a specialist in data visualization, proposes taking a practical approach to selecting and using an appropriate graphic:

• Choose a graphic that is sure to capture the viewer's attention.
• Represent the information in a simple, clear, and precise way (avoid unnecessary flourishes).
• Make it easy to compare data; highlight trends and differences.
• Establish an order for the elements based on the quantity that they represent; that is, detect maximums and minimums.
• Give the viewer a clear way to explore the graphic and understand its goals; make use of guide tags.
3. Basic principles for data visualization
Graphics with an objective: seeking your mantra

The goal of data visualizations is to help us understand the object they represent. They are a medium for communicating stories and the results of research, as well as a platform for analyzing and exploring data. Therefore, having a sound understanding of how to create data visualizations will help us create meaningful and easy-to-remember reports, infographics, and dashboards. Creating suitable visuals helps us solve problems and analyze a study's objects in greater detail.

The first step in representing information is trying to understand the data being visualized.

Ben Shneiderman gave us a useful starting point in his text "The Visual Information-Seeking Mantra" (1996), which remains a touchstone work in the field. This author suggests a simple methodology for novice users to delve into the world of data visualization and experiment with basic visual representation tasks.5

Shneiderman introduces his famous mantra on how to approach the quest for visual information, which he breaks down into three tasks:

1. Overview first: This ensures viewers have a general understanding of the data set, as their starting point for exploration. This means offering them a visual snapshot of the different kinds of data, explaining their relationship in a single glance. This strategy helps us visualize the data, at all its different levels, at one time.

2. Zoom and filter: The second step involves supplementing the first so that viewers understand the data's underlying structure. The zoom in/zoom out mechanism enables us to select interesting subsets of data that meet certain criteria while maintaining the sense of position and context.

3. Details on demand: This makes it possible to select a narrower subset of data, enabling the user to interact with the information and use filters by hovering or clicking on the data to pull up additional information.

This sequence (overview first, zoom and filter, details on demand) summarizes the key points to designing such a graphic, with an eye to human visual perception, so that users can translate an idea into a set of physical attributes. These attributes are: structure, position, form, size, and color. When properly applied, these attributes can present information effectively and memorably.

5 Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations (p. 336). Available at: https://www.cs.umd.edu/~ben/papers/Shneiderman1996eyes.pdf
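Shneiderman's three tasks can be read as a query pipeline over a data set. A toy sketch of that reading (the table, field names, and helper are our own, purely illustrative of the idea, not an interactive implementation):

```python
# Toy data set: one row per product sale
sales = [
    {"product": "A", "region": "North", "units": 120},
    {"product": "B", "region": "North", "units": 45},
    {"product": "A", "region": "South", "units": 80},
    {"product": "C", "region": "South", "units": 150},
]

# 1. Overview first: a single-glance summary of the whole set
overview = {
    "rows": len(sales),
    "total_units": sum(r["units"] for r in sales),
}

# 2. Zoom and filter: narrow to an interesting subset,
#    keeping the same row structure (sense of context)
south = [r for r in sales if r["region"] == "South"]

# 3. Details on demand: pull up full information for one item,
#    as a hover or click would in an interactive visual
def details(rows, product):
    return [r for r in rows if r["product"] == product]

print(overview)
print(details(south, "C"))
```

In an interactive chart the same three stages map to the initial full view, the zoom/filter controls, and the tooltip or drill-down, respectively.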
Layout and design: communicative elements

In order to begin designing our reports and statements, it is essential to understand that visual representations are cognitive tools that complement and strengthen our mental ability to encode and decode information6. Meirelles (2014) notes that: "All graphic representation affects our visual perception, because the elements of transmission utilized act as external stimuli, which activate our emotional state and knowledge."

Thus, when our mind visualizes a representation, it transforms the information, merges it, and applies a hierarchical structure to it to facilitate interpretation.

For this reason, in order to have an efficient perceptive impact, it is important to adhere to a series of best practices when creating reports and infographics. As with any other form of communication, success depends largely on the business's familiarity with the established code and the resources available. Space, shapes, color, icons, and typography are a few of the essential elements of a striking visual with communicative power.

Structuring: the importance of layout

All visual representations begin with a blank dimensional space that will eventually hold the information which will be communicated. The process of spatial coding is a fundamental part of visual representation because it is the medium in which the results of our compositional decisions and the meaning of our visual statement will be visualized, thereby having an impact on the user.

Edward Tufte (1990) defines "layout" as a scheme for distributing visual elements in order to achieve organization and harmony in the final composition. Layout planning and design serve as a template for applying hierarchy and control to information at varying levels of detail.7 In his book Envisioning Information, Tufte offers several guidelines for information design:

• Have a properly chosen format.
• Give a broad visual tour and offer a focused reading at different detail levels.
• Use words, numbers, and drawings.
• Reflect a balance, a proportion, a sense of relevant scale, and a context.

Spatial encoding requires processing spatial proportions (position and size), which have a determining role in the organization of perception and memory.

Furthermore, the visual hierarchy of elements plays a role in this encoding process, because the elements' organization and distribution must have a well-defined hierarchical system in order to communicate effectively (Meirelles: 2014). In a sense, visualizations are paragraphs about data, and they should be treated as such. Words, images, and numbers are part of the information that will be visualized. When all of the elements are integrated in a single structure and visual hierarchy, the infographic or report will organize space properly and communicate effectively, according to your user's needs.

6 Meirelles, I. (2014). "La información en el diseño" (p. 21-22). Barcelona: Parramón.
7 Tufte, E. (1990). Envisioning Information. Cheshire: Graphics Press.
Visual variables and their semantics

Visual variables are the building blocks of visual representation. They conform to an order and spatial context in order to convey a quantitative message. These resources can be used to categorize meaningful properties and amplify the message being represented. Let's take a look at their semantics:

• Point: Has no dimensions and indicates a place.
• Line: Has one dimension and indicates length and direction.
• Plane: Has two dimensions and indicates space and scale.

Jacques Bertin, cited in Meirelles (2014), used the term "visual variables" for the first time in his book Sémiologie graphique, where he presented them as a system of perceptive variables with corresponding properties of meaning. He offered a guide for combining graphic elements in an appropriate way according to their order, position, orientation, size, texture, and value.

[Table: visual variables (position in two dimensions (X, Y), size, and value) applied to point, line, and area marks]
Using consistent and attractive color schemes

Color is one of the most powerful resources for data visualization, and it is essential if we are going to understand information properly.

Color can be used to categorize elements, quantify or represent values, and communicate cultural attributes associated with a specific color.

It dominates our perception and, in order to analyze it, we must first understand its three dimensions.

Hue: this is what we normally imagine when we picture colors. There is no order to colors; they can only be distinguished by their characteristics (blue, red, yellow, etc.).

Brightness: the color's luminosity. This is a relative measure that describes the amount of light reflected by one object with respect to another. Brightness is measured on a scale, and we can talk about brighter and darker values of a single hue.

Saturation: this refers to the intensity of a given color's hue. It varies based on brightness. Darker colors are less saturated, and the less saturated a color is, the closer it gets to gray. In other words, it gets closer to a neutral (hueless) color. The following graphic offers a brief summary of color application.

[Diagram: example color schemes (grayscale, double complementary, complementary, monochromatic, split complementary, cool colors, and saturated colors)]
Isabel Meirelles (2014) notes that selecting a color palette in order to visualize data is no easy task, and she recommends following Cynthia Brewer's advice, which uses three different kinds of color schemes, based on the nature of the data:

1. Monochromatic sequential palettes or their analogues

These palettes are great for ordering numeric data that progresses from small to large. It is best to use brighter color gradients for low values and darker ones for higher values.

Thus, brightness levels can be used as a visible, coherent aspect of a graphic scheme. Sequential color schemes make it possible to create a smooth, low-contrast design. This color scheme is better for an image than for data visualization.

TIP: To create a color hierarchy in a sequential scheme, choose one dominant color and use the others with moderation; alternatively, you can simply use two softer versions of the dominant color, which will naturally make them feel lower on the hierarchy.

2. Diverging palettes

These are more suitable for ordering categorical data, and they are more effective when the categorical division is in the middle of the sequence. The change in brightness highlights a critical value in the data, such as the mean or median, or a zero. Colors become darker to represent differences in both directions, based on this meaningful value in the middle of the data.

TIP: Try to emphasize the most important information using arrows and text, circles, rectangles, or contrasting colors. This way, when you visualize your data, your analysis will be more understandable.

3. Qualitative palettes

These are better for representing ordinal or categorical data to create primary visual differences between categories. Most qualitative schemes are based on differences in hue, with differences in brightness between the colors.

TIP: The qualitative color scheme is perfect for visualizing data because it affords a high degree of contrast and helps you draw attention to important points, especially if you use one predominant color and use the second as an accent in your design.

Finally, don't forget to use palettes that are comprehensible to people who can't see color. Color blindness is a disability or limited ability that makes it difficult to distinguish certain pairs of colors, such as blue and yellow, or red and green. One strategy for avoiding this problem is to adapt designs that use more than just hue to codify information; create schemes that slightly vary another channel, such as brightness or saturation.
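The sequential and diverging schemes above can be roughly approximated in code by varying brightness at a fixed hue, using Python's standard `colorsys` module (HSV's "value" component stands in for brightness here). This is our own sketch under those assumptions, not a substitute for Brewer's curated palettes:

```python
import colorsys

def sequential(hue, n):
    """Monochromatic sequential palette: one hue, brighter values
    for low data and darker values for high data (n >= 2)."""
    palette = []
    for i in range(n):
        value = 0.95 - 0.65 * i / (n - 1)  # brightness 0.95 -> 0.30
        palette.append(colorsys.hsv_to_rgb(hue, 0.8, value))
    return palette

def diverging(hue_low, hue_high, n):
    """Diverging palette: two sequential ramps meeting at a light
    midpoint (placed at a meaningful value such as a mean or zero);
    colors darken toward both extremes."""
    half = n // 2
    low = sequential(hue_low, half)[::-1]     # dark -> light
    high = sequential(hue_high, n - half - 1) # light -> dark
    mid = colorsys.hsv_to_rgb(0, 0, 0.95)     # near-white center
    return low + [mid] + high

blues = sequential(0.6, 5)        # 5 steps of a single blue hue
blue_red = diverging(0.6, 0.0, 7) # blue extreme -> white -> red extreme
print(len(blues), len(blue_red))
```

Varying the value channel like this also follows the accessibility advice above: the steps remain distinguishable by brightness alone, without relying on hue.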
Use icons and symbols to aid in understanding and limit unnecessary tagging

Symbols and icons are another avenue for visualizing information that goes beyond merely being decorative. They draw strength from their ability to exhibit a general context in an attractive, precise way. Icons illustrate concepts. Viewers can understand what the information is about by just glancing at the illustration.

Alexander Skorka (2018), chief evangelist for the Dapresy Group, recommends using symbols and icons because they simplify communication. Symbols are self-explanatory, and our mind can process icons more easily than text. It is important to consider that an icon's success depends largely on cultural context, so it is important to select universally understandable images.

That said, they certainly should not be complex illustrations. An icon with too many details could hinder viewers' understanding. Keep it simple: icons' meaning should be immediately clear, even when they're very small.

The ease with which we recognize icons enables us to process data faster than we can process information conveyed textually. Therefore, when designing information, it is wise to use both graphics and icons to convey proportions in greater detail.

[Example: the same survey results shown twice, once with icons and once as a plain table]

                    Singles  Couples  Families
Notebooks               82%      76%       63%
Entertainment           55%      64%       88%
Lifestyle products      77%      73%       54%
The typography in our reports: sense of tradition, security, history, integrity, author-
effective applications ity, integrity, and other such concepts. Sans-serif
fonts stand out because they have a more polished,
sophisticated feel; they convey a sense of modernity,
Typography plays an important role in the design order, cleanliness, elegance, avant-garde, and style.
of reports and statements. Selecting the right font strengthens your message and captures the audience's attention. Müller-Brockmann (1961), a graphic designer, defines typography as the proper visual element for composition. He notes that "the reader must be able to read the message from a text easily and comfortably. This depends largely on the size of the text, the length of the lines, and the spacing between the lines".8

Typography is an art form in and of itself, in which every font has its own characteristics, which should be strategically combined.

For people outside the world of graphic design, choosing a font and setting other typographical features can be tricky, but it doesn't have to be. Let's take a practical look at the steps you should take when determining your typography, and then consider the images and visual elements that best accompany your text. Considerations when setting your typography:

• Determine the goal of your report's content.
• Select a font that strengthens that goal. Fonts come in two types: with serifs or without (sans) serifs. Serif fonts have an extra stroke at the ends of their letters that lends a classic, traditional tone; sans-serif fonts read as cleaner and more modern.
• Pay attention to legibility. Remember that screen type does not appear in the same way as print type. It is best to choose a more responsive (sans-serif) font for on-screen texts, and fonts with serifs for printed reports. That said, there's an exception to every rule, and today there is a bounty of fonts that are perfectly suitable for both digital and print media.
• Watch your weight (light, regular, bold). When it comes to bolding your text, a value of two or three should be plenty. It is better to reserve the heaviest weight for headlines and then apply a stylistic hierarchy based on your content. Avoid fonts that only offer one weight or style, since their applications are limited.
• Don't forget that some fonts use more memory than others. Fonts with serifs generally monopolize more of your computer's brain power than sans-serif fonts. This is an important consideration in interactive reports, since a document that occupies more RAM will be less responsive.

Fonts have personalities that help us establish a more attractive visual tone for our audience. Familiarizing yourself with a few can go a long way. There are:

• Professional fonts • Handwritten fonts
• Fun fonts • Minimalist fonts

8 The Graphic Artist and his Design Problems (Gestaltungsprobleme des Grafikers), Teufen, 1961
Prioritize patterns in your visualizations: Gestalt

The basic elements of the visualization process also involve preattentive attributes. Preattentive attributes are visual
features that facilitate the rapid visual perception of a graphic in a space. Designers use these characteristics to
better uncover relevant information in visuals, because these characteristics attract the eye.

Colin Ware, Director of the Data Visualization Research Lab at the University of New Hampshire, has highlighted
that preattentive attributes can be used as resources for drawing viewers’ immediate attention to certain
parts of visual representations (2004). According to Ware, preattentive processing happens very quickly—typically in the first 10 milliseconds. This process is the mind's attempt to rapidly extract basic visual characteristics from the graphic (stage 1). These characteristics are then consciously processed, along with the perception of the object,
so that the mind can extract patterns (stage 2), ultimately enabling the information to move to the highest level of
perception (stage 3). This makes it possible to find answers to the initial visual question, utilizing the information
saved in our minds. Colin Ware, cited in Meirelles (2014), explains it as follows:

• Bottom-up information contributes to the pattern-creation process.
• Top-down processing reinforces relevant information.

Preattentive attributes enhance object perception and cognition processes, leveraging our mind’s visual capacities.
Good data visualizations deliberately make use of these attributes because they boost the mind’s discovery and rec-
ognition of patterns such as lines, planes, colors, movements, and spatial positioning.9

9 Dondis, D.A. (2015). La sintaxis de la imagen: introducción al alfabeto visual. Editorial Gustavo Gili: Barcelona
Meirelles, I. (2014). La información en el diseño. Barcelona: Parramón.

The visual below lists preattentive attributes that represent aspects of lines and planes when visualizing and analyzing graphic representation: shape, color, and spatial position.

• Shape: orientation, line length, line width, size, curvature, added marks, enclosure, sharpness, numerosity
• Color: intensity/value, hue
• Spatial position: 2-D position

Detecting patterns is fundamental to structuring and
organizing visual information. When we create visuals,
we often want to highlight certain patterns over others.
Preattentive attributes are the alphabet of visual lan-
guage; analytic patterns are the words that we write
by using them. When we see a good visualization, we
immediately detect the preattentive attributes and rec-
ognize analytic patterns in the visualization. The follow-
ing table summarizes a few basic analytic patterns:

Analytic patterns

We have seen how preattentive attributes and patterns make it possible to process and analyze visual information; they also enable us to improve pattern discovery and perceptive inferences and provide processes for solving visualization problems.

Gestalt's principles are the principles that enable us to understand the requirements posed by certain problems so that we see everything as an integral, coherent whole. They involve proximity, similarity, shared destiny, "pragnanz" or pithiness, closure, simplicity, familiarity, and discernment between figure and ground.

According to Dondis (2015), Gestalt's principles help describe the way we organize and merge elements in our minds. They quiet the noise of the graphics so that we relate, combine, and analyze them. These principles come into play whenever we analyze any sort of visualization. Only position and length can be used to accurately perceive quantitative data. The other attributes are useful for perceiving other sorts of data, such as categorical and relational data.

We'll close this section with one piece of practical advice on how to effectively visualize data. Colin Ware, in Visual Thinking for Design (2008), summarizes the importance of always being mindful of preattentive attributes and patterns when designing a visualization:

"Good design optimizes the visual thinking process. The choice of patterns and symbols is important so that visual queries can be efficiently processed by the intended viewer. This means choosing words and patterns each to their best advantage."

Gestalt’s principles

Session No: CO2-7
Session Topic: The Process of Visualization

DATA MODELING AND


VISUALIZATION TECHNIQUES
(Course code: 18CS3262)
Session Objective
• An ability to understand the stages involved in the process of visualization.
• An ability to apply the steps by creating a simple case study on a given dataset.
Poll Question-01
Is Charles Babbage called the father of the computer?
Option A: Yes
Option B: No
Option C: Can't determine
Option D: None of the above
The Process of Visualization

The process of understanding data begins with a set of numbers and a question.
The following steps form a path to the answer:
Acquire
Parse
Filter
Mine
Represent
Refine
Interact
The Process of Visualization…

Steps involved in the process of Visualization


1. Acquire
Obtain the data, whether from a file on a disk or a source over a network.
2. Parse
Provide some structure for the data’s meaning, and order it into categories.
3. Filter
Remove all but the data of interest.
4. Mine
Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context.
5. Represent
Choose a basic visual model, such as a bar graph, list, or tree.
6. Refine
Improve the basic representation to make it clearer and more visually engaging.
7. Interact
Add methods for manipulating the data or controlling what features are visible.
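The seven steps above can be sketched end to end in a few lines of code. This is a minimal illustration, not the Zipdecode implementation: the two in-memory records and the latitude cutoff are invented for the example.

```python
# Minimal sketch of the seven-stage pipeline on an in-memory,
# tab-delimited dataset (a stand-in for a downloaded file).
raw = "02139\tCambridge\t42.37\t-71.11\n99501\tAnchorage\t61.21\t-149.90"

# 1. Acquire: here the data is already in memory; normally you would
#    read it from a file on disk or fetch it over the network.
lines = raw.splitlines()

# 2. Parse: split each line on tabs and convert fields to useful types.
records = []
for line in lines:
    code, city, lat, lon = line.split("\t")
    records.append({"zip": code, "city": city,
                    "lat": float(lat), "lon": float(lon)})

# 3. Filter: keep only the records of interest (here, lower latitudes).
records = [r for r in records if r["lat"] < 50]

# 4. Mine: a simple statistic -- the latitude range of what remains.
lats = [r["lat"] for r in records]
lat_min, lat_max = min(lats), max(lats)

# 5-7. Represent / Refine / Interact would draw the points, style them,
#      and respond to user input; here we just report the mined result.
print(records[0]["city"], lat_min, lat_max)
```

The later slides walk through each of these stages in more detail on the zip code data.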
Example To Illustrate The Process of Visualization

• To illustrate the seven steps listed in the earlier slides, and how they contribute
to effective information visualization, let’s look at how the process can be
applied to understanding a simple data set.

• In this case, we’ll take the zip code numbering system that the U.S. Postal
Service uses.

• The application is not particularly advanced, but it provides a skeleton for how
the process works
Example To Illustrate The Process of Visualization (Acquire)
• The acquisition step involves obtaining the data
• A copy of the zip code listing can be found on the U.S. Census Bureau web site, as it is frequently used for
geographic coding of statistical data

Figure: Zip codes in the format provided by the U.S. Census Bureau
• Acquisition concerns how the user downloads your data as well as how you obtained the data in the first
place
• As you design the application, you have to take into account the time required to download data into the
browser.
• And because data downloaded to the browser is probably part of an even larger data set stored on the
server, you may have to structure the data on the server to facilitate retrieval of common subsets
Example To Illustrate The Process of Visualization (Parse)
• After you acquire the data, it needs to be parsed—changed into a format that tags each part of the data with
its intended use.
• Each line of the file must be broken along its individual parts; in this case, it must be delimited at each tab
character.
• Then, each piece of data needs to be converted to a useful format. Figure shows the layout of each line in
the census listing, which we have to understand to parse it and get out of it what we want.

Figure: Structure of acquired data


Example To Illustrate The Process of Visualization (Parse)…
Each field is formatted as a data type that we’ll handle in a conversion program:
String
A set of characters that forms a word or a sentence. Here, the city or town name is designated as a string. Because the zip
codes themselves are not so much numbers as a series of digits (if they were numbers, the code 02139 would be stored as
2139, which is not the same thing), they also might be considered strings.
Float
A number with decimal points (used for the latitudes and longitudes of each location). The name is short for floating point,
from programming nomenclature that describes how the numbers are stored in the computer’s memory.
Character
A single letter or other symbol. In this data set, a character sometimes designates special post offices.
Integer
A number without a fractional portion, and hence no decimal points (e.g., −14, 0, or 237).
Index
Data (commonly an integer or string) that maps to a location in another table of data. In this case, the index maps numbered
codes to the names and two-digit abbreviations of states.
This is common in databases, where such an index is used as a pointer into another table, sometimes as a way to compact
the data further
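The parsing of one line can be sketched with the data types just described. The column layout below is illustrative (the real Census Bureau record layout is shown in the figure above, not reproduced here), and the state lookup table is a hypothetical one-entry example.

```python
# Sketch: converting one tab-delimited, census-style line into typed
# fields. The column order here is invented for illustration.
line = "02139\tCambridge\t42.3736\t-71.1097\t\t25"

fields = line.split("\t")
record = {
    "zip":  fields[0],            # string: digits, not a number (keeps the leading 0)
    "city": fields[1],            # string
    "lat":  float(fields[2]),     # float (latitude)
    "lon":  float(fields[3]),     # float (longitude)
    "kind": fields[4] or None,    # character flag for special post offices, often empty
    "state_index": int(fields[5]) # integer index into a state table
}

# The index maps into another table, as in a database join.
states = {25: ("Massachusetts", "MA")}
record["state"] = states[record["state_index"]]
```

Note that the zip code stays a string: converting it to an integer would silently drop the leading zero.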
Example To Illustrate The Process of Visualization (Filter and Mine)
Filter
• The next step involves filtering the data to remove portions not relevant to our use.
• In this example, for the sake of keeping it simple, we’ll be focusing on the contiguous 48 states, so the
records for cities and towns that are not part of those states—Alaska, Hawaii, and territories such as Puerto
Rico—are removed.

Mine
• This step involves math, statistics, and data mining.
• The data in this case receives only a simple treatment: the program must figure out the minimum and
maximum values for latitude and longitude by running through the data (as shown in Figure ) so that it can
be presented on a screen at a proper scale.
• Most of the time, this step will be far more complicated than a pair of simple math operations.
Example To Illustrate The Process of Visualization (Mine)

Figure: Mining the data: just compare values to find the minimum and maximum
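The mining step described above (a single pass comparing values) can be written directly; the three coordinate pairs below are invented sample points.

```python
# Mining step: one pass over the data, comparing values to find the
# minimum and maximum latitude and longitude.
points = [(42.37, -71.11), (40.71, -74.01), (34.05, -118.24)]

lat_min = lon_min = float("inf")
lat_max = lon_max = float("-inf")
for lat, lon in points:
    lat_min, lat_max = min(lat_min, lat), max(lat_max, lat)
    lon_min, lon_max = min(lon_min, lon), max(lon_max, lon)
```

These four bounds are exactly what the Represent step needs to set the ends of each axis.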
Example To Illustrate The Process of Visualization (Represent)
• This step determines the basic form that a set of data will take.
• Some data sets are shown as lists, others are structured like trees, and so forth.
• In this case, each zip code has a latitude and longitude, so the codes can be mapped as a two-dimensional
plot, with the minimum and maximum values for the latitude and longitude used for the start and end of the
scale in each dimension. This is illustrated in Figure

Basic visual representation of zip code data


The Represent stage is the linchpin of a visualization project: it informs the single most important decision and can make you rethink earlier stages. How you choose to represent the data can influence the very first step (what data you acquire) and the third step (what particular pieces you extract).
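Mapping each latitude/longitude onto the two-dimensional plot, using the mined minimum and maximum values as the ends of each axis, can be sketched as a small helper (the function name and bounds tuple are illustrative):

```python
def to_screen(lat, lon, bounds, width, height):
    """Map a (lat, lon) pair onto pixel coordinates, using the mined
    minimum and maximum values as the start and end of each axis."""
    lat_min, lat_max, lon_min, lon_max = bounds
    x = (lon - lon_min) / (lon_max - lon_min) * width
    # Screen y grows downward, so invert the latitude axis.
    y = (1 - (lat - lat_min) / (lat_max - lat_min)) * height
    return x, y
```

For example, with bounds (30, 50, -120, -70) on a 100x100 plot, the southwest corner maps to (0.0, 100.0) and the northeast corner to (100.0, 0.0).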
Example To Illustrate The Process of Visualization (Refine)
• In this step, graphic design methods are used to further clarify the representation by calling more
attention to particular data (establishing hierarchy) or by changing attributes (such as color) that
contribute to readability.
• Hierarchy is established in Figure, for instance, by coloring the background deep gray and
displaying the selected points (all codes beginning with four) in white and the deselected points
in medium yellow.

Figure: Using color to refine the representation


Example To Illustrate The Process of Visualization (Interact)
• The next stage of the process adds interaction, letting the user control or explore the data.
• Interaction might cover things like selecting a subset of the data or changing the viewpoint.
• As another example of a stage affecting an earlier part of the process, this stage can also affect
the refinement step, as a change in viewpoint might require the data to be designed differently.
• In the Zipdecode project, typing a number selects all zip codes that begin with that number

Figure: The user can alter the display through choices (zip codes starting with 0)
Example To Illustrate The Process of Visualization (Interact)

(zip codes starting with 9) Honing in with two digits (02)


Example To Illustrate The Process of Visualization (Interact)
• In addition, users can enable a “zoom” feature that draws them closer to each
subsequent digit, revealing more detail around the area and showing a constant
rate of detail at each level.
• Because we’ve chosen a map as a representation, we could add more details of
state and county boundaries or other geographic features to help viewers
associate the “data” space of zip code points with what they know about the local
environment.

Figure: Honing in further with four digits (0213)


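The interaction described above, typing digits to select matching zip codes and zooming to the selection, can be sketched as two small functions (the function names and the three sample records are invented for illustration):

```python
def select_prefix(records, typed):
    """Interaction sketch: typing digits selects every zip code that
    begins with those digits, as in the Zipdecode project."""
    return [r for r in records if r["zip"].startswith(typed)]

def zoom_bounds(selected):
    """With 'zoom' enabled, recompute the view bounds from the current
    selection, so each subsequent digit reveals more local detail."""
    lats = [r["lat"] for r in selected]
    lons = [r["lon"] for r in selected]
    return min(lats), max(lats), min(lons), max(lons)

records = [{"zip": "02139", "lat": 42.37, "lon": -71.11},
           {"zip": "02474", "lat": 42.42, "lon": -71.16},
           {"zip": "90210", "lat": 34.10, "lon": -118.41}]
hit = select_prefix(records, "02")
```

Note how this stage feeds back into Refine: the recomputed bounds change what the Represent and Refine stages must draw.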
Iteration and Combination

Interactions between the seven stages


Sample video showing how effective visualization can be
References

1. https://www.oreilly.com/library/view/visualizing-data/9780596514556/ch01.html
Case Study
A Simple Case Study That performs all the Steps of Visualization

• As part of your case study, download the Superstoreus2015 data file from superdatascience.com.
• Find the type of data present in each column.
• Select any four columns using any one of the visualization tools (Python or Tableau).
• Find the maximum value in the column named Profit and replace it with either the maximum or minimum value, using Python or Tableau.
• Create a bar chart, pie chart, and histogram for any two of the columns you selected earlier, using any of the above tools.
• Create interactions between the selected columns, showing how they relate to each other, using data visualization tools.
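The first two Python steps of the case study can be sketched in plain Python. The three rows below are an invented in-memory stand-in for the downloaded file (the real steps would read it with `csv.DictReader`), and the column names are assumptions about its layout.

```python
# Stand-in rows for the downloaded case-study file; values are invented.
rows = [
    {"Order ID": "1", "Region": "East",  "Sales": "120.5", "Profit": "30.2"},
    {"Order ID": "2", "Region": "West",  "Sales": "75.0",  "Profit": "-4.1"},
    {"Order ID": "3", "Region": "South", "Sales": "410.9", "Profit": "88.7"},
]

# Step: find the type of data in each column (quantitative vs. categorical).
def column_type(name):
    try:
        for r in rows:
            float(r[name])      # every value parses as a number?
        return "quantitative"
    except ValueError:
        return "categorical"

types = {name: column_type(name) for name in rows[0]}

# Step: find the maximum value in Profit and replace it with the minimum.
profits = [float(r["Profit"]) for r in rows]
hi, lo = max(profits), min(profits)
for r in rows:
    if float(r["Profit"]) == hi:
        r["Profit"] = str(lo)
```

The charting and interaction steps would then use matplotlib or Tableau on the cleaned columns.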
Poll Question-02
What are the steps in the process of visualization?
Option A: Acquire
Option B: Parse
Option C: Mine
Option D: All of the above
9.
Data Abstraction in Visualization
The Big Picture
• The four basic dataset types are tables, networks, fields, and
geometry; other possible collections of items include clusters, sets,
and lists.
• These datasets are made up of different combinations of the five data
types: items, attributes, links, positions, and grids.
• For any of these dataset types, the full dataset could be available
immediately in the form of a static file, or it might be dynamic data
processed gradually in the form of a stream.
• The type of an attribute can be categorical or ordered, with a further
split into ordinal and quantitative.
• The ordering direction of attributes can be sequential, diverging, or
cyclic.
The Big Picture…
Figure shows the abstract types of what can be visualized.
Why Do Data Semantics and Types Matter?
• Many aspects of vis design are driven by the kind of data that you have at your disposal.
• What kind of data are you given?
• What information can you figure out from the data, versus the meanings that you must be told explicitly?
• What high-level concepts will allow you to split datasets apart into general and useful pieces?
• Suppose that you see the following data:
14, 2.6, 30, 30, 15, 100001
What does this sequence of six numbers mean? You can’t possibly know yet, without more information about
how to interpret each number. Is it locations for two points far from each other in three-dimensional space, 14,
2.6, 30 and 30, 15, 100001? Is it two points closer to each other in two-dimensional space, 14, 2.6 and 30, 30,
with the fifth number meaning that there are 15 links between these two points, and the sixth number
assigning the weight of ‘100001’ to that link?
• Similarly, suppose that you see the following data:
Basil, 7, S, Pear
These numbers and words could have many possible meanings. Maybe a food shipment of produce has arrived
in satisfactory condition on the 7th day of the month, containing basil and pears. Maybe the Basil Point
neighborhood of the city has had 7 inches of snow cleared by the Pear Creek Limited snow removal service.
Maybe the lab rat named Basil has made seven attempts to find his way through the south section of the maze,
lured by scent of the reward food for this trial, a pear.
Why Do Data Semantics and Types Matter?
• The type of the data is its structural or mathematical interpretation.
• At the data level, what kind of thing is it: an item, a link, an attribute?
• At the dataset level, how are these data types combined into a larger structure: a table, a tree, a field of
sampled values?
• At the attribute level, what kinds of mathematical operations are meaningful for it?
• For example, if a number represents a count of boxes of detergent, then its type is a quantity, and adding
two such numbers together makes sense.
• If the number represents a postal code, then its type is a code rather than a quantity—it is simply the name
for a category that happens to be a number rather than a textual name. Adding two of these numbers
together does not make sense.
• Sometimes types and semantics can be correctly inferred simply by observing the syntax of a data file or the
names of variables within it, but often they must be provided along with the dataset in order for it to be
interpreted correctly.
• Sometimes this kind of additional information is called metadata; the line between data and metadata is not clear, especially given that the original data is often derived and transformed.
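The postal-code example above is easy to demonstrate: treating a code as a quantity destroys information and makes arithmetic on it meaningless.

```python
# A postal code is the name of a category that happens to be digits.
# Treating it as a quantity loses information.
as_number = int("02139")
# The leading zero -- part of the name -- is gone: 02139 became 2139.

# Adding counts of detergent boxes is meaningful:
boxes = 3 + 4   # 7 boxes

# "Adding" two zip codes type-checks but means nothing:
nonsense = int("02139") + int("90210")
```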
Dataset Types

• A dataset is any collection of information that is the target of analysis.


• The four basic dataset types are tables, networks, fields, and geometry.
• Other ways to group items together include clusters, sets, and lists.
• In real-world situations, complex combinations of these basic types are common.
• Figure shows that these basic dataset types arise from combinations of the data types of items, attributes,
links, positions, and grids.
Dataset Types…
• Figure shows the internal structure of the four basic dataset types in detail.
• Tables have cells indexed by items and attributes, for either the simple flat case or the more complex
multidimensional case.
• In a network, items are usually called nodes, and they are connected with links;
• A special case of networks is trees.
• Continuous fields have grids based on spatial positions where cells contain attributes.
• Spatial geometry has only position information.
Dataset Types…

Tables
• Many datasets come in the form of tables that are made up of rows and columns, a familiar form to anybody who
has used a spreadsheet.
• For a simple flat table, the terms used here are that each row represents an item of data, and each column is
an attribute of the dataset. Each cell in the table is fully specified by the combination of a row and a column—an
item and an attribute—and contains a value for that pair.
• A multidimensional table has a more complex structure for indexing into a cell, with multiple keys.
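The flat versus multidimensional distinction can be sketched in code; the rows and keys below are invented illustrations.

```python
# A flat table: each cell is fully specified by one item (row) and
# one attribute (column).
flat = [
    {"name": "Basil", "attempts": 7, "section": "S"},   # item 0
    {"name": "Clara", "attempts": 3, "section": "N"},   # item 1
]
cell = flat[0]["attempts"]   # (item, attribute) -> value

# A multidimensional table: indexing into a cell needs multiple keys,
# e.g. (subject, trial) rather than a single row number.
multi = {("Basil", 1): "pear", ("Basil", 2): "apple", ("Clara", 1): "pear"}
reward = multi[("Basil", 2)]
```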
Dataset Types…
Networks
The dataset type of networks is well suited for specifying that there is some kind of relationship between two
or more items.
• An item in a network is often called a node.
• A link is a relation between two items.
• For example, in an articulated social network the nodes are people, and links mean friendship.
• In a gene interaction network, the nodes are genes, and links between them mean that these genes have
been observed to interact with each other.
• In a computer network, the nodes are computers, and the links represent the ability to send messages
directly between two computers using physical cables or a wireless connection.
Trees
• Networks with hierarchical structure are more specifically called trees.
• In contrast to a general network, trees do not have cycles: each child node has only one parent node
pointing to it.
• One example of a tree is the organization chart of a company, showing who reports to whom;
• Another example is a tree showing the evolutionary relationships between species in the biological tree of
life, where the child nodes of humans and monkeys both share the same parent node of primates.
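The network and tree structures above have natural in-code representations; the node names below are toy examples (the primates example echoes the text).

```python
# A network as an adjacency list: nodes are people, links mean friendship.
friends = {"ana": {"bo", "cy"}, "bo": {"ana"}, "cy": {"ana"}}

# A tree as parent pointers: no cycles, and each child node has
# exactly one parent node pointing to it.
parent = {"humans": "primates", "monkeys": "primates", "primates": None}

def depth(node):
    """Number of links from a node up to the root of the tree."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d
```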
Dataset Types…
Fields
• The field dataset type also contains attribute values associated with cells.
• Each cell in a field contains measurements or calculations from a continuous domain: there are conceptually
infinitely many values that you might measure, so you could always take a new measurement between any
two existing ones.
• Continuous phenomena that might be measured in the physical world or simulated in software include
temperature, pressure, speed, force, and density; mathematical functions can also be continuous.
For example, consider a field dataset representing a medical scan of a human body containing measurements
indicating the density of tissue at many sample points, spread regularly throughout a volume of 3D space. A
low-resolution scan would have 262,144 cells, providing information about a cubical volume of space with 64
bins in each direction. Each cell is associated with a specific region in 3D space. The density measurements
could be taken closer together with a higher resolution grid of cells, or further apart for a coarser grid.
• Continuous data requires careful treatment that takes into account the mathematical questions of sampling,
how frequently to take the measurements, and interpolation, how to show values in between the sampled
points in a way that does not mislead.
• Interpolating appropriately between the measurements allows you to reconstruct a new view of the data
from an arbitrary viewpoint that’s faithful to what you measured.
• These general mathematical problems are studied in areas such as signal processing and statistics.
Visualizing fields requires grappling extensively with these concerns
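The interpolation concern above can be made concrete with the simplest case: linear interpolation between samples taken at regular intervals along one dimension (the function and sample values are an illustrative sketch, not a recommended reconstruction filter).

```python
def interpolate(samples, spacing, x):
    """Linearly interpolate a continuous field from measurements taken
    at regular intervals (a uniform grid along one dimension)."""
    i = int(x // spacing)              # index of the sample left of x
    if i + 1 >= len(samples):
        return samples[-1]             # clamp at the last measurement
    t = (x - i * spacing) / spacing    # fraction of the way to the next sample
    return samples[i] * (1 - t) + samples[i + 1] * t

densities = [0.0, 10.0, 20.0]   # measurements at x = 0, 1, 2
```

Real field visualization uses more careful reconstruction (the signal-processing concerns mentioned above), but the principle of showing values between the sampled points is the same.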
Spatial Fields
• Continuous data is often found in the form of a spatial field, where the cell structure of the field is based on
sampling at spatial positions.
• Most datasets that contain inherently spatial data occur in the context of tasks that require understanding aspects
of its spatial structure, especially shape.
For example, with a spatial field dataset that is generated with a medical imaging instrument, the user’s task could be
to locate suspected tumors that can be recognized through distinctive shapes or densities. An obvious choice for
visual encoding would be to show something that spatially looks like an X-ray image of the human body and to use
color coding to highlight suspected tumors
Grid Types
• When a field contains data created by sampling at completely regular intervals, the cells form a uniform grid.
• There is no need to explicitly store the grid geometry in terms of its location in space, or the grid topology in
terms of how each cell connects with its neighboring cells.
• More complicated examples require storing different amounts of geometric and topological information about the
underlying grid.
• A rectilinear grid supports nonuniform sampling, allowing efficient storage of information that has high complexity in some areas and low complexity in others, at the cost of storing some information about the geometric location of each row.
• A structured grid allows curvilinear shapes, where the geometric location of each cell needs to be specified.
• Finally, unstructured grids provide complete flexibility, but the topological information about how the cells connect
to each other must be stored explicitly in addition to their spatial positions.
Geometry
• The geometry dataset type specifies information about the shape of items with explicit spatial positions.
• The items could be points, or one-dimensional lines or curves, or 2D surfaces or regions, or 3D volumes.
• Geometry datasets are intrinsically spatial, and like spatial fields they typically occur in the context of tasks
that require shape understanding.
• Spatial data often includes hierarchical structure at multiple scales. Sometimes this structure is provided
intrinsically with the dataset, or a hierarchy may be derived from the original data.
• Geometry datasets do not necessarily have attributes, in contrast to the other three basic dataset types.
• One classic example is when contours are derived from a spatial field.
• Another is when shapes are generated at an appropriate level of detail for the task at hand from raw
geographic data, such as the boundaries of a forest or a city or a country, or the curve of a road.
• Geometric data is sometimes shown alone, particularly when shape understanding is the primary task.
• In other cases, it is the backdrop against which additional information is overlaid.
Dataset Availability

Figure shows the two kinds of dataset availability: static or dynamic.

• The default approach to vis assumes that the entire dataset is available all at once, as a static file.
• However, some datasets are instead dynamic streams, where the dataset information trickles in over the
course of the vis session.
• One kind of dynamic change is to add new items or delete previous items. Another is to change the values
of existing items.
• This distinction in availability crosscuts the basic dataset types: any of them can be static or dynamic.
• Designing for streaming data adds complexity to many aspects of the vis process that are straightforward
when there is complete dataset availability up front.
Attribute Types
• Figure shows the attribute types.
• The major distinction is between categorical versus ordered.
• Within the ordered type is a further differentiation between ordinal versus quantitative.
• Ordered data might range sequentially from a minimum to a maximum value, or it might diverge in both
directions from a zero point in the middle of a range, or the values may wrap around in a cycle.
• Also, attributes may have hierarchical structure.
Semantics
• Knowing the type of an attribute does not tell us about its semantics, because these two questions are
crosscutting: one does not dictate the other.
• Different approaches to considering the semantics of attributes that have been proposed across the many fields
where these semantics are studied.
• The classification in this book is heavily focused on the semantics of keys versus values, and the related questions
of spatial and continuous data versus nonspatial and discrete data, to match up with the idiom design choice
analysis framework.
• One additional consideration is whether an attribute is temporal.
Key versus Value Semantics
• A key attribute acts as an index that is used to look up value attributes. The distinction between key and value attributes is important for the dataset types of tables and fields, as shown in Figure.
Semantics…
• Flat Tables, Multidimensional Tables, Fields, Scalar Fields, Vector Fields, Tensor Fields, Field Semantics
Temporal Semantics
• A temporal attribute is simply any kind of information that relates to time.
• Data about time is complicated to handle because of the rich hierarchical structure that we use to reason
about time, and the potential for periodic structure.
• The time hierarchy is deeply multiscale: the scale of interest could range anywhere from nanoseconds to
hours to decades to millennia. Even the common words time and date are a way to partially specify the scale
of temporal interest.
• Temporal analysis tasks often involve finding or verifying periodicity either at a predetermined scale or at
some scale not known in advance.
• Moreover, the temporal scales of interest do not all fit into a strict hierarchy; for instance, weeks do not fit
cleanly into months.
• Thus, the generic vis problems of transformation and aggregation are often particularly complex when
dealing with temporal data.
• One important idea is that even though the dataset semantics involves change over time, there are many
approaches to visually encoding that data—and only one of them is to show it changing over time in the
form of an animation.
Time-Varying Data
• A dataset has time-varying semantics when time is one of the key attributes, as opposed to when the
temporal attribute is a value rather than a key.
• As with other decisions about semantics, the question of whether time has key or value semantics requires
external knowledge about the nature of the dataset and cannot be made purely from type information.
• An example of a dataset with time-varying semantics is one created with a sensor network that tracks the
location of each animal within a herd by taking new measurements every second.
• Each animal will have new location data at every time point, so the temporal attribute is an independent
key and is likely to be a central aspect of understanding the dataset.
• In contrast, a horse-racing dataset covering a year’s worth of races could have temporal value attributes such
as the race start time and the duration of each horse’s run.
• These attributes do indeed deal with temporal information, but the dataset is not time-varying.
• A common case of temporal data occurs in a time-series dataset, namely, an ordered sequence of time–
value pairs.
• These datasets are a special case of tables, where time is the key. These time-value pairs are often but not
always spaced at uniform temporal intervals.
• Typical time-series analysis tasks involve finding trends, correlations, and variations at multiple time scales
such as hourly, daily, weekly, and seasonal.
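A time-series dataset, and the kind of multiscale temporal aggregation discussed above, can be sketched with the standard library; the dates and values below are invented.

```python
from collections import defaultdict
from datetime import date

# A time-series dataset: an ordered sequence of (time, value) pairs --
# a special case of a table where time is the key.
series = [("2015-01-05", 3.0), ("2015-01-06", 5.0),
          ("2015-01-12", 4.0), ("2015-01-13", 6.0)]

# Aggregating to a coarser scale of the time hierarchy: weekly means.
# ISO weeks avoid the "weeks do not fit cleanly into months" problem.
weekly = defaultdict(list)
for day, value in series:
    iso = date.fromisoformat(day).isocalendar()
    weekly[(iso[0], iso[1])].append(value)   # key on (ISO year, week number)

weekly_mean = {week: sum(v) / len(v) for week, v in weekly.items()}
```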
Reference
https://learning.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C002.xhtml
Filtering and Aggregation
Reduce Items and Attributes

Why Reduce?
• Reduction is one of the major strategies for managing complexity in visualizations. The reduction design choices are not mutually exclusive, and various combinations of them are common.
Filter
1. Item Filtering
• In item filtering, the goal is to eliminate items based on their values
with respect to specific attributes. Fewer items are shown, but the
number of attributes shown does not change
2. Attribute Filtering
• Attributes can also be filtered. With attribute filtering, the goal is to eliminate
attributes rather than items; that is, to show the same number of items, but
fewer attributes for each item.
• Item filtering and attribute filtering can be combined, with the result of showing
both fewer items and fewer attributes
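Both kinds of filtering are easy to show on a small table of items; the three city records below are invented sample data.

```python
items = [
    {"city": "Boston", "state": "MA", "pop": 675647, "lat": 42.36},
    {"city": "Austin", "state": "TX", "pop": 961855, "lat": 30.27},
    {"city": "Juneau", "state": "AK", "pop": 32255,  "lat": 58.30},
]

# Item filtering: eliminate items based on their values with respect
# to a specific attribute -- fewer items, same attributes.
large = [r for r in items if r["pop"] > 100000]

# Attribute filtering: keep the same items but fewer attributes each.
positions = [{"city": r["city"], "lat": r["lat"]} for r in items]
```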
Aggregation
• The other major reduction design choice is aggregation, so that a group of elements is
represented by a new derived element that stands in for the entire group.
• Elements are merged together with aggregation, as opposed to eliminated completely with
filtering.
• Aggregation and filtering can be used in conjunction with each other.
• As with filtering, aggregation can be used for both items and attributes.
1. Item Aggregation
Scatter Plot
Box plot
Solar plot
Clustering
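Histogram-style binning is one of the item-aggregation idioms listed above, and it shows the core idea directly: each bin is a derived element that stands in for every item inside it. The values and bin edges below are invented.

```python
def histogram(values, edges):
    """Item aggregation sketch: count the items falling into each bin;
    the bin counts stand in for the individual items."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

ages = [3, 17, 24, 31, 35, 62]
counts = histogram(ages, [0, 20, 40, 80])
```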
2. Spatial Aggregation
• The challenge of spatial aggregation is to take the spatial nature of data into account correctly when
aggregating it. In the cartography literature, the modifiable areal unit problem (MAUP) is a major concern:
changing the boundaries of the regions used to analyze data can yield dramatically different results.
• Even if the number of units and their size does not change, any change of spatial grouping can lead to a very
significant change in analysis results
Example: Geographically Weighted Boxplots
• Figure 13.11 shows a multivariate geographic dataset used to explore social issues in 19th century France.
The six quantitative attributes are population per crime against persons (x1), population per crime against
property (x2), percentage who can read and write (x3), donations to the poor (x4), population per
illegitimate birth (x5), and population per suicide (x6).
3. Attribute Aggregation: Dimensionality Reduction
• Just as attributes can be filtered, attributes can also be aggregated, where a new attribute is synthesized to
take the place of multiple original attributes.
• A very simple approach to aggregating attributes is to group them by some kind of similarity measure, and then synthesize the new attribute by calculating an average across that similar set.
• A more complex approach to aggregation is dimensionality reduction (DR), where the goal is to preserve the
meaningful structure of a dataset while using fewer attributes to represent the items.
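The simple grouping-and-averaging approach might look like this in plain Python (attribute names and the "similar" pair are invented; the grouping is assumed to come from some similarity measure):

```python
items = [
    {"math": 80, "physics": 90, "essay": 70, "age": 20},
    {"math": 60, "physics": 50, "essay": 85, "age": 22},
]

def aggregate_attributes(row, group, new_name):
    """Replace the attributes in `group` with one synthesized
    attribute: their average."""
    out = {k: v for k, v in row.items() if k not in group}
    out[new_name] = sum(row[k] for k in group) / len(group)
    return out

# Suppose a similarity measure grouped "math" and "physics" together;
# four attributes per item become three.
reduced = [aggregate_attributes(r, ("math", "physics"), "quant") for r in items]
```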
Interacting with Visualizations
Ware Chapter 10
University of Texas – Pan American
CSCI 6361, Spring 2014
Interacting with Visualizations - Introduction
The very big picture
• The best visualizations support productive interaction
  – Interactive visualizations
  – Not merely static representations of data
    • Though static certainly has its place
  – Allows, e.g.:
    • Inspection of underlying data from the visualization
    • Transformation of data
    • Filtering – removal of data by some criteria
  – E.g., the visual analytics systems we have seen clearly demonstrate use of highly interactive systems, indeed, across visual mappings (VxInsight, Sandia Labs)
• E.g., “Overview first, zoom and filter, then details on demand”
  – Shneiderman, 1996 (at class site)
  – Though in fact one may see interesting detail, zoom out, find others, zoom in, …
Recall, Amplifying Cognition
Norman, 1993
• Humans think by interleaving internal mental action with perceptual interaction with the world
• This interleaving is how human intelligence is expanded
  – Within a task (by external aids)
  – Across generations (by passing on techniques)
• External graphic (visual) representations are an important class of external aids
• Don Norman is an influential cognitive scientist
  – “The power of the unaided mind is highly overrated. Without external aids, memory, thought, and reasoning are all constrained. But human intelligence is highly flexible and adaptive, superb at inventing procedures and objects that overcome its own limits. The real powers come from devising external aids that enhance cognitive abilities. How have we increased memory, thought, and reasoning? By the invention of external aids:
  – It is things that make us smart.” (Norman, 1993, p. 43)
• External Cognition
Introduction and Overview
• Visualization as an “internal interface”
  – Interface between human and computer in a man-machine problem-solving system
    • Computer-based information system supports data gathering, calculation, and analysis
    • Augments investigator’s working memory
  – Provides visual markers for concepts
  – Reveals structural relationships between problem components
• Some models of visualization – different takes on the same thing!
  – Overview, zoom, filter, details (Shneiderman)
  – Visualization pipeline – North (from Card et al.)
  – Knowledge crystallization (Card et al.)
  – Ware
  – Model human processor (Card et al.)
    • Motor processor – e.g., Fitts’ law
• Viewing information spaces
  – Distortion techniques, fisheye views
• Navigation and Exploration
Example: “Overview, zoom and filter,
details on demand” - Shneiderman
• VxInsight demonstrates:
  – “Overview, zoom and filter, details on demand”
  – Saw earlier when talking about text representations (visual mappings)
  – Again, visual analytics systems provide this
• Developed by Sandia Labs to visualize databases
• Elements of the database can be “anything”
  – For IV, “abstract”
  – e.g., document relations, company profiles
• Example screens show grant proposals
  – Shows interactive capabilities
VxInsight: Overview
VxInsight
• Interaction paradigm (Shneiderman):
  – Overview
  – Zoom
  – Filter
  – Details on demand
  – Browse
  – Search query
• Or (Ware) …
  – Lowest level: data manipulation loop
  – Intermediate: exploration and navigation loop
  – Highest: problem-solving loop
VxInsight - Overview
• Interaction paradigm: Overview – Zoom – Filter – Details on demand – Browse – Search query

VxInsight - Zoom
• Interaction paradigm: Overview – Zoom – Filter – Details on demand – Browse – Search query

VxInsight - Details
• Interaction paradigm: Overview – Zoom – Filter – Details on demand – Browse – Search query

VxInsight - Query
• Interaction paradigm: Overview – Zoom – Filter – Details on demand – Browse – Search query
Recall, Visualization Pipeline:
Or, another take on interaction: Mapping Data to Visual Form

[Pipeline diagram: Raw Data → (Data Transformations) → Data Tables → (Visual Mappings) → Visual Structures → (View Transformations) → Views → Human perception and task; user interaction adjusts every stage]

• Most fundamentally – visualizations are:
  – “adjustable mappings from data to visual form to human perceiver”
• Series of data transformations
  – Multiple chained transformations
  – Human adjusts the transformations – interaction
• Entire pipeline comprises an information visualization
Visualization Pipeline:
Human might adjust any of the visualization stages

[Same pipeline diagram: Raw Data → Data Tables → Visual Structures → Views, with interaction feeding back into each transformation]

• Data transformations (rarely):
  – Map raw data (idiosyncratic form) into data tables (relational descriptions including metatags)
• Visual mappings (sometimes):
  – Transform data tables into visual structures that combine spatial substrates, marks, and graphical properties
  – E.g., table to graph
• View transformations (very often):
  – Create views of visual structures by specifying graphical parameters such as position, scaling, and clipping
  – E.g., zooming, …, changing viewpoint
Ware: Interactive Visualization:
Interlocking Feedback Loops – Quick Look
• Interactive visualization
  – Process made up of interlocking feedback loops (Data Manipulation, within Exploration and Navigation, within Problem Solving)
• Lowest level: Data manipulation loop
  – Objects selected and moved
  – Relies on eye-hand coordination
  – Requires delay-free interaction
• Intermediate: Exploration & navigation loop
  – User finds way in large visual space
  – Searching a large data space part by part
  – Building a cognitive map of the data/simulation
• Highest: Problem-solving loop
  – Forming and testing hypotheses about data
  – Refines hypotheses through augmented visualization
  – Repeat through cycles, revising or replacing visualization
    • New data added, problem reformulated, possible solutions identified
  – Visualization as external representation of problem
    • Extension of cognitive process
Interactive Visualization
Recall, Problem Solving, Cognitive Amplification, Knowledge Crystallization (Card et al.)
• Knowledge crystallization: gather knowledge, make sense of it, use it in a task

[Knowledge crystallization diagram: Task → Forage for data → Search for schema → Instantiate schema → Problem-solve → Write, decide, or act. Task operations include: Overview, Zoom, Filter, Details, Browse, Search query; Extract, Compose; Reorder, Cluster, Class, Average, Promote, Detect pattern, Abstract; Read fact, Read comparison, Read pattern; Manipulate, Create, Delete, Instantiate]
Again, Ware’s Interlocking Feedback Loops
• Interactive visualization
  – Process made up of interlocking feedback loops (Data Manipulation, within Exploration and Navigation, within Problem Solving)
• Lowest level: Data manipulation loop
  – Objects selected and moved
  – Relies on eye-hand coordination
  – Requires delay-free interaction
• Intermediate: Exploration & navigation loop
  – User finds way in large visual space
  – Searching a large data space part by part
  – Building a cognitive map of the data/simulation
• Highest: Problem-solving loop
  – Forming and testing hypotheses about data
  – Refines hypotheses through augmented visualization
  – Repeat through cycles, revising or replacing visualization
    • New data added, problem reformulated, possible solutions identified
  – Visualization as external representation of problem
    • Extension of cognitive process
1) Interacting Feedback Loops, and
2) Knowledge Crystallization, …
• Knowledge crystallization: gather knowledge, make sense of it, use it in a task
• The loops operate over different time spans
• Problem solving – outer loop
  – Longest time span
• Exploration and navigation
  – Primary use of data and information visualizations
  – Occurs for all elements of problem solving and knowledge crystallization
• Data manipulation
  – Motor, etc.
  – Again, for all elements of exploration and navigation

[Diagram overlays the feedback loops on the knowledge crystallization stages: Forage for data; Search for schema; Instantiate schema; Problem-solve; Write, decide, or act]
Interacting Feedback Loops – Another Way
Ware’s account with “gear” metaphor
• As “gears”: the data manipulation loop turns inside the exploration and navigation loop, which turns inside the problem solving (knowledge crystallization) loop

[Diagram: three nested “gears” – Data Manipulation, Exploration and Navigation, Problem Solving]
Lowest Level: Data Manipulation Loop

[Diagram: nested loops – Data Manipulation within Exploration and Navigation within Problem Solving]
Lowest Level: Data Manipulation Loop
• Visual-manual control loop
• Very carefully studied, for example …
  – Choice reaction time: Hick-Hyman law
    • Reaction time = a + b log2(C)
  – 2D positioning and selection: Fitts’ law – quickly, more later
    • Part of ISO standard 9241-9
      – Protocols for evaluating user performance and comfort when using pointing devices with visual display terminals
    • Selection time = a + b log2(D/W + 1.0)
    • Hitting smaller targets further away is harder
    • Adding latency severely increases difficulty
    • Fitts’ law, including lag
      – Mean time = a + b (Human Time + Machine Lag) log2(D/W + 1.0)
  – Control compatibility is important
    • Offset and scale is easy to deal with; rotation is hard
  – Reaction time in making choices
    • >= 160 msec per doubling of the number of choices
    • Faster if allowed to make mistakes
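The two laws on this slide can be written down directly. The coefficients below are illustrative placeholders, not measured values, and `fitts_with_lag` is one reading of the slide's lag-augmented formula:

```python
import math

def hick_hyman(choices, a=0.2, b=0.16):
    """Choice reaction time (s): RT = a + b * log2(C).
    a and b are illustrative; b = 0.16 s matches the slide's
    '>= 160 msec per doubling of the number of choices'."""
    return a + b * math.log2(choices)

def fitts_with_lag(distance, width, human_time=1.0, lag=0.0, a=0.1, b=0.1):
    """One reading of the slide's lag-augmented Fitts' law:
    MT = a + b * (human_time + lag) * log2(D/W + 1.0).
    With lag = 0 and human_time = 1 this reduces to the plain form."""
    return a + b * (human_time + lag) * math.log2(distance / width + 1.0)

# Each doubling of the number of choices adds b seconds to reaction time;
# added machine lag inflates the cost of the same pointing task.
delta_per_doubling = hick_hyman(8) - hick_hyman(4)
laggy = fitts_with_lag(256, 16, lag=1.0)
crisp = fitts_with_lag(256, 16)
```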
Model Human Processor + Attention
Recall
• A “useful” big picture – Card et al. ’83 plus attention
  – Senses/input → f(attention, processing) → motor/output
  – Notion of “processors”
    • Purely an engineering abstraction
• Detail next
Model Human Processor + Attention
• Sensory store
– Rapid decay “buffer” to hold
sensory input for later processing

• Perceptual processor
– Recognizes symbols, phonemes
– Aided by LTM

• Cognitive processor
– Uses recognized symbols
– Makes comparisons and
decisions
– Problem solving
– Interacts with LTM and WM

• Motor processor
– Input from cog. proc. for action
– Instructs muscles
– Feedback
• Results of muscles by senses

• Attention
– Allocation of resources
Model Human Processor
Recall

• Card et al. ’83

• An architecture with
parameters for cognitive
engineering …
– Will see visual image store, etc.
tonight

• Memory properties
– Decay time: how long memory lasts
– Size: number of things stored
– Encoding: type of things stored
Model Human Processor
Motor Processor

• Motor processor
– tM = 70 (range 30-70)
– For repetitive tasks without
feedback

• Tasks with feedback involve all:


– Perceptual processor
– Cognitive processor
– Motor processor
Motor Processing
• Motor processor can operate in two ways:

• 1. Open-loop control
– Motor processor runs a program by itself – no feedback about correctness

– Maximum rate, cycle time is tM = Tmotor ~ 70 ms

– Experiment: Scribble without looking and trying to stay in lines

• 2. Closed-loop control
– Experiment: Looking at lines, draw within the lines

– Muscle movements (or their effect on the world) are perceived by cognitive
system and compared with desired result

– Cycle time is Tprocess + Tcognitive + Tmotor ~ 240 ms
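Plugging in the nominal processor times from Card et al. reproduces the two cycle-time figures above (a back-of-envelope check; individual values vary, these are the standard textbook figures):

```python
# Nominal Model Human Processor cycle times (ms), after Card et al. 1983.
T_PERCEPTUAL = 100
T_COGNITIVE = 70
T_MOTOR = 70

# Open-loop control: the motor program runs by itself, no feedback.
open_loop = T_MOTOR                                 # ~70 ms per cycle

# Closed-loop control: perceive the result, decide, then correct.
closed_loop = T_PERCEPTUAL + T_COGNITIVE + T_MOTOR  # ~240 ms per cycle
```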


Fitts’s Law - demo
• Fitts’s Law
  – Fundamental law of human sensory-motor system
    • Fitts, P. M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47, 381-391.
  – E.g., for direct (reach) and mouse use
  – Demo:
    • Best – won’t run on class box
    • http://www.tele-actor.net/fitts/index.html
  – Demo:
    • OK – no line plotted
    • http://fww.few.vu.nl/hci/interactive/fitts/
Fitts’s Law - demo
• Fitts’s Law
  – Fundamental law of human sensory-motor system
• “tele-actor” results from demo:
Fitts’s Law
• Fitts’s Law
  – Fundamental law of human sensory-motor system
  – E.g., for direct (reach) and mouse use
  – The time to acquire a target is a function of distance to and width (size) of target
    • T = f(D, S)
• Time T to move your hand to a target of size S at distance D away:
  – T = ReactionT + MotorT = a + b * log2(2 * D/S)
  – Depends only on the index of difficulty, log2(2D/S)
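As a hedged sketch of the formula above (a and b are device- and user-specific constants that would normally be fit from data; the values here are purely illustrative):

```python
import math

def fitts_time(distance, size, a=0.1, b=0.1):
    """Movement time (s) via Fitts's law: T = a + b * log2(2 * D / S)."""
    return a + b * math.log2(2 * distance / size)

# Smaller or more distant targets take longer to acquire:
t_baseline = fitts_time(200, 40)
t_smaller = fitts_time(200, 10)
t_farther = fitts_time(400, 40)
```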


Explanation of Fitts’s Law
• Moving hand to a target is closed-loop control
– Vs. open-loop control we saw for Card et al. model

• Each (correction) cycle covers remaining distance D, with error εD


– Smaller correction in position as get closer
• (because there is less distance with which to correct)
– Slower velocity
• (because don’t go so fast with shorter distance)
Implications of Fitts’s Law
• Buttons, etc. should be reasonable size;
– hard to click small targets.

• Edges and corners of the computer display are easy to reach


– Mac single menubar better than multiple Windows menubars
– Also, pointer is "caught" at the edges

• Popup menus can usually be opened faster than pull-down menus


– User avoids movement

• Pie menu items are typically selected faster than linear menu items
– Small distance from the center of the menu
– Wedge-shaped target areas are large
Power Law of Practice
• Time to do a task decreases with
practice
– Obviously
– Involves all of perceptual-cognitive-
motor system

• Time Tn to do a task the nth time:
  – Power-law decay
  – Tn = T1 · n^(−α)
  – α is typically 0.2–0.6

• Example:
– Novices get rapidly better at task with
practice, but performance “levels off”
– Though still increasing performance
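The curve above can be sketched directly; T1 = 10 s and alpha = 0.4 are arbitrary illustrative choices, not measured values:

```python
def practice_time(t1, n, alpha=0.4):
    """Power law of practice: T_n = T_1 * n ** (-alpha),
    with alpha typically between 0.2 and 0.6."""
    return t1 * n ** (-alpha)

# A task taking 10 s on the first trial speeds up quickly,
# then levels off while still improving:
curve = [round(practice_time(10.0, n), 2) for n in (1, 2, 10, 100)]
# curve == [10.0, 7.58, 3.98, 1.58]
```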
Intermediate Level:
Exploration, View Refinement and Navigation

[Diagram: nested loops – Data Manipulation within Exploration and Navigation within Problem Solving]
Intermediate Level:
Exploration, View Refinement and Navigation
• View navigation important when data space is too large to fit on screen
  – Complex problem
  – Considers theories of pathfinding and map use, cognitive spatial metaphors, direct manipulation, visual feedback
• Basic navigation control loop (below)
  – Left is human – cognitive and spatial model with which user understands data space and progress through it
    • Maintaining data space for some time may become encoded in long-term memory
  – Right is system – visualization may be updated and refined from data mapped into spatial model
• Includes:
  – 3D locomotion and viewpoint control
  – Pathfinding
  – Focus + context
3D Locomotion and Viewpoint Control
Navigation in 3D
• Displaying data elements so the display looks like a 3D landscape, vs. a flat map, is often used
  – Follows from Gibsonian orientation
    • Affordances
    • Properties of the world perceived in terms of potential for action (physical model, direct perception)
    • Problem with generalization to user interfaces/interaction
    • Nevertheless, important and influential
  – Have examined depth cues
• Embed objects in space, navigate the space
  – Flying the viewpoint through the data space
  – Constrain user to useful parts of the space to reduce cognitive load of navigation
    • Surface of the ground
    • Walkways within a power plant
    • Particular paths of interest
• Examples
  – Web browser: Harmony
  – Clustering of text, Wise et al.
3D Locomotion and Viewpoint Control:
Spatial Metaphors
• Evaluation
  – Exploration and explanation
  – Cognitive and physical affordance
  – Task 1: Find areas of detail in the scene
  – Task 2: Make the best movie
  – 3D environments: hallway, extended terrain, closed object
• World-in-hand
  – Good for discrete objects
  – Poor affordances for scale changes – detail
  – Problem with center of rotation in extended scenes
• Eye-in-hand
  – Easiest under some circumstances
  – Poor physical affordances for many views
  – Subjects sometimes acted as if model were actually present
• Walking
• Flying vehicle control
  – Hardest to learn but most flexible
  – Non-linear velocity control
  – Spontaneous switch in mental model
  – The predictor as solution
3D Locomotion and Viewpoint Control:
Wayfinding, Cognitive, and Real Maps

• Worldlets
– Can be rotated to facilitate recognition
Frames of Reference
Egocentric, Exocentric
• Use of maps implies the ability to apply another perspective
  – To the physical, e.g., a road map (view from above), or the abstract
  – … another frame of reference
• Egocentric
  – View from the user
• Exocentric
  – View from outside the user
  – A road map is just one of many exocentric views
• Movement of body (vs. eyes) affects orientation most
  – Pan, tilt, …, but not rotation, so dof constrained in practice
Frames of Reference
Tethered view, world view

• Various views illustrated


Multiple Simultaneous Views
• Represent data space in different forms in different views

• E.g., “spiral calendar”


Focus, Context, and Scale
• Saw this earlier, here, in Ware
Focus, Context, and Scale
• Problem of finding detail in larger context
– Again, spatial navigation
– Wayfinding problem may be considered as
discovering specific objects in a larger context

• Addressed by multiple views at differing


spatial scales
– Movement between views at different scales
(and frames of reference)
– Changing spatial scale
• E.g., overview + detail

• Addressed, also, by changing structural


scale
– E.g., collapsing lines of code in display of
software systems
Focus, Context, and Scale:
Overview and Detail
• Fred Brooks’ GRIP project at UNC
  – Molecular structure solution, docking
  – Architectural walkthrough
• Users always going from detail to overview
  – Then overview to detail…
  – Then detail to overview…
• Options
  – Provide display of both
  – Provide easy, non-jarring switch between them
• Multiple-Window Zoom with Callouts …
Focus+Context: Fisheye Views, 1
• Detail + Overview
– Keep focus, while remaining aware
of context

• Fisheye views
– Physical, of course, also ..
– A distance function. (based on
relevance)
– Given a target item (focus)
– Less relevant other items are
dropped from the display
– Classic cover
• New Yorker’s idea of the world
Focus+Context: Fisheye Views, 2
• Detail + Overview
– Keep focus while remaining aware of context

• Fisheye views
– Physical, of course, also ..
– A distance function. (based on relevance)
– Given a target item (focus)
– Less relevant other items are dropped from
the display
– Or, are just physically smaller – distortion
Distortion Techniques, Generally
• Distort space = transform space
  – By various transformations
• “Built-in” overview and detail, and landmarks
  – Dynamic zoom
• Provides focus + context
  – Several examples follow
• Spatial distortion enables smooth variation
Focus + Context, 1
• Fisheye Views
• Keep focus while remaining aware of the context
• Fisheye views:
– A distance function (based on relevance)
– Given a target item (focus)
– Less relevant other items are dropped from the display.
• Demo of Fisheye Menus:
– http://www.cs.umd.edu/hcil/fisheyemenu/fisheyemenu-demo.shtml
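The "distance function based on relevance" behind fisheye views is usually credited to Furnas's degree-of-interest formulation: a priori importance minus distance from the focus. A minimal sketch (the menu items, importances, and threshold are all hypothetical):

```python
def degree_of_interest(api, distance):
    """Furnas-style degree of interest for a fisheye view:
    a priori importance (API) minus distance from the focus."""
    return api - distance

# Hypothetical menu items: (name, a priori importance, distance from focus).
menu = [("File", 5, 0), ("Open", 3, 1), ("Print", 2, 4), ("Macros", 1, 6)]

# Items whose DOI falls below a threshold are dropped from the display;
# low-importance items far from the focus disappear first.
visible = [name for name, api, d in menu if degree_of_interest(api, d) >= 0]
```

In a distortion-based variant, the same DOI value would instead scale each item's size rather than hide it outright.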
Focus + Context, 2
• Bifocal Lens
– Database navigation: An Office Environment for the Professional by R. Spence and M.
Apperley
Focus + Context, 3
• Distorted Views
– The Table Lens: Merging Graphical and Symbolic Representations in an Interactive
Focus + Context Visualization for Tabular Information by R. Rao and S. K. Card
– A Review and Taxonomy of Distortion Oriented Presentation Techniques by Y. K. Leung
and M. D. Apperley
Focus + Context, 4
• Distorted Views
  – Extending Distortion Viewing from 2D to 3D by M. Sheelagh T. Carpendale, D. J. Cowperthwaite, and F. David Fracchia
• Magnification and displacement:
Focus + Context, 5
• Alternate Geometry
– The Hyperbolic Browser: A Focus + Context
Technique for Visualizing Large Hierarchies by
J. Lamping and R. Rao

• Demo
Other Navigation Techniques:
GeoZui3D, Zooming + 2 dof rotations
• Translate point on surface to center
• Then scale
• Or translate and scale
View Refinement and Navigation
(optional, from 2nd ed.)

• Transparency:
– When there is the perception of
direct contact with the data, the
interface becomes transparent
– Big idea in interfaces
– Temporal feedback rapid (< 1/10
second)
– Response is compatible with
interaction method

• Interactive adjustment of ranges


– Zoom in on data area of interest
– Sometimes nonlinear mapping
brings area of interest into range
where patterns are easy to see
(logarithmic)
Interaction vs. Animation
• Ware comments:

• Exploration (interaction) vs. Presentation (animation)


– Flexibility vs. Efficiency

• Active vs. Passive Participation


– Immediacy of response and engagement
– Control promotes understanding
• Person moving learns more than partner watching
• Active control increases sense of presence
End
Choose Appropriate Visual
Encodings
Natural ordering
• Natural ordering and number of distinct values will indicate whether a visual property is best
suited to one of the main data types: quantitative, ordinal, categorical, or relational data.

• Spatial data is another common data type, and is usually best represented with some kind of
map

• Whether a visual property has a natural ordering is determined by whether the mechanics of our
visual system and the “software” in our brains automatically—unintentionally—assign an order,
or ranking, to different values of that property.

• For example, position has a natural ordering; shape doesn’t. Length has a natural ordering;
texture doesn’t (but pattern density does). Line thickness or weight has a natural ordering; line
style (solid, dotted, dashed) doesn’t

• Depending on the specifics of the visual property, its natural ordering may be well suited to
representing quantitative differences (27, 33, 41), or ordinal differences (small, medium, large,
enormous).
Natural ordering…
Color is not ordered
• Here’s a tricky one: Color (hue) is not naturally ordered in our brains. Brightness (lightness or
luminance, sometimes called tint) and intensity (saturation) are, but color itself is not.
DISTINCT VALUES
• The second main factor to consider when choosing a visual property is how many distinct values it
has that your reader will be able to perceive, differentiate, and possibly remember.
DISTINCT VALUES…
REDUNDANT ENCODING
• If you have the luxury of leftover, unused visual properties after you’ve encoded the main
dimensions of your data, consider using them to redundantly encode some existing, already-
encoded data dimensions
• The advantage of redundant encoding is that using more channels to get the same information
into your brain can make acquisition of that information faster, easier, and more accurate
DEFAULTS VERSUS INNOVATIVE FORMATS And READERS’
CONTEXT
DEFAULTS VERSUS INNOVATIVE FORMATS
• The choice comes down to a basic cost-benefit analysis. What is the expense to you and your
reader of creating and understanding a new encoding format, versus the value delivered by that
format?
• If you’ve got a truly superior solution (as evaluated by your reader, and not just your ego), then
by all means, use it.
• But if your job can be done (or done well enough) with a default format, save everyone the effort
and use a standard solution
READERS’ CONTEXT
• First, it’s important to point out that your audience will likely be composed of more than one
reader. And as these people are all individuals, they may be as different from each other as they
are from you, and will likely have very different backgrounds and levels of interest in your work.
• It may be impossible to take the preconceptions of all these readers into consideration at once.
• So choose the most important group, think of them as your core group, and design with them in
mind. Where it is possible to appeal to more of your potential audience without sacrificing
precision or efficiency, do so.
READERS’ CONTEXT…
• let’s get specific about some facets of the reader’s mindset that you need to take into account.
Titles, tags, and labels
• When selecting the actual terms you’ll use to label axes, tag visual elements, or title the piece (which
creates the mental framework within which to view it), consider your reader’s vocabulary and
familiarity with relevant jargon.
1. Is the reader from within your industry or outside of it? What about other readers
outside of the core audience group?
2. Is it worth using an industry term for the sake of precision (knowing that the reader may
have to look it up), or would a lay term work just as well?
3. Will the reader be able to decipher any unknown terms from context, or will a vocabulary
gap stand in the way?
• These are the kinds of questions you should ask yourself. Each and every single word in your
visualization needs to serve a specific purpose
Colors
• Another reader context to take into account is color choice. There is quite a bit of science about how
our brains perceive and process color that is somewhat universal, as we saw earlier in this chapter. But
it’s worth mentioning in the context of reader preconceptions the significant cultural associations that
color can carry.
READERS’ CONTEXT…
Color blindness
• Of course, we know that there are many variations in the way different people perceive color. This
is commonly called color blindness but is more properly referred to as color vision deficiency or
dyschromatopsia.
• A disorder of color vision may present in one of several specific ways.
• Although prevalence estimates vary among experts and for different ethnic and national groups,
about 7% of American men experience some kind of color perception disorder (women are much
more rarely affected: about 0.4% in America).
• Red-green deficiency is the most common by far, but yellow-blue deficiency also occurs. And
there are lots of people who have trouble distinguishing between close colors like blue and
purple.
Directional orientation
• Is the reader from a culture that reads left-to-right, right-to-left, or top-to-bottom? A person’s
habitual reading patterns will determine their default eye movements over a page, and the order
in which they will encounter the various visual elements in your design.
COMPATIBILITY WITH REALITY
• a large factor in your success is making life easier for your reader, and that’s largely based on
making encodings as easy to decode as possible.
• One way to make decoding easy is to make your encodings of things and relationships as well
aligned with the reality (or your reader’s reality) of those things and relationships as possible; this
alignment is called compatibility.
PATTERNS AND CONSISTENCY
• The human brain is amazingly good at identifying patterns in the world. We easily recognize
similarity in shapes, position, sound, color, rhythm, language, behavior, and physical routine, just
to name a few variables.
• This ability to recognize patterns is extremely powerful, as it enables us to identify stimuli that
we’ve encountered before, and predict behavior based on what happened the last time we
encountered a similar stimulus pattern
• Consequently, we also notice violations of patterns. When a picture is crooked, a friend sounds
troubled, a car is parked too far out into the street, or the mayonnaise smells wrong, the patterns
we expect are being violated and we can’t help but notice these exceptions.
• We notice them because they are exceptions to the norm. The first implication is that readers will
assume the patterns they perceive are intentional, whether you planned for the patterns to exist or
not. The second is that when they perceive patterns, readers will also expect pattern violations to be
meaningful.
• It all comes down to three simple rules:
1. Be consistent in membership, ordering, and other encodings.
2. Things that are the same should look the same.
3. Things that are different should look different.
Other Factors
• COMPARISONS NEED TO COMPARE
• SOME STRUCTURES ARE JUST INHERENTLY BAD
• SOME GOOD STRUCTURES ARE OFTEN ABUSED
• KEEP IT SIMPLE (OR YOU MIGHT LOOK STUPID)
Visualize It!
A Comprehensive Guide
to Data Visualization
What’s inside...

1 Introduction
  What is data visualization?
  The data visualization process
  Why is data visualization so important in reports and statements?

2 Data types, relationships, and visualization formats
  Two kinds of data
  Seven data relationships
  11 formats

3 Basic principles for data visualization
  Graphics with an objective: seeking your mantra
  Layout and design: communicative elements
  Prioritize patterns in your visualizations: Gestalt

4 Storytelling for social and market communication
  Data storytelling
  Scrollytelling
  A basic recipe for storytelling in your presentations and final reports
  Social-first data visualization

5 Trends in market research and data visualization
  Dashboards
  Virtual reality visualizations
  What does the future have in store?
1. Introduction

What is data visualization?

Data visualization is the process of acquiring, interpreting and comparing data in order to clearly communicate complex ideas, thereby facilitating the identification and analysis of meaningful patterns.

The ways we structure and visualize information are changing rapidly and getting more complex with each passing day. Thanks to the rise of social media, the ubiquity of mobile devices, and service digitalization, data is available on any human activity that utilizes technology. The generated information is hugely valuable and makes it possible to analyze trends and patterns, and to use big data to draw connections between events. Thus, data visualization can be an effective mechanism for presenting the end user with understandable information in real time.

Every company has data, be it to communicate with clients and senior managers or to help manage the organization itself. It is only through research and interpretation that this data can acquire meaning and be transformed into knowledge.

Data visualization can be essential to strategic communication: it helps us interpret available data; detect patterns, trends, and anomalies; make decisions; and analyze inherent processes. All told, it can have a powerful impact on the business world.

This ebook seeks to guide readers through a series of basic references in order to help them understand data visualization and its component parts, and to equip them with the tools and platforms they need to create interactive visuals and analyze data. In effect, it seeks to provide readers with a basic vocabulary and a crash course in the principles of design that govern data visualization so that they can create and analyze interactive market research reports.
Why is data visualization so important in reports and statements?

We live in the era of visual information, and visual content plays an important role in every moment of our lives. A study by SH!FT Disruptive Learning demonstrated that we typically process images 60,000 times faster than a table or a text, and that our brains typically do a better job remembering them in the long term. That same research detected that after three days, subjects retained between 10% and 20% of written or spoken information, compared with 65% of visual information.

The rationale behind the power of visuals:

• The human mind can see an image for just 13 milliseconds and store the information, provided that it is associated with a concept. Our eyes can take in 36,000 visual messages per hour.

• 40% of nerve fibers are connected to the retina.

All of this indicates that human beings are better at processing visual information, which is lodged in our long-term memory. Consequently, for reports and statements, a visual representation that uses images is a much more effective way to communicate information than text or a table; it also takes up much less space. This means that data visuals are more attractive, simpler to take in, and easier to remember.

Try it for yourself. Take a look at this table:

Month   Jan   Feb   Mar   Apr   May   Jun
Sales   45    56    36    58    75    62

Identifying the evolution of sales over the course of the year isn't easy. However, when we present the same information in a visual, the results are much clearer (see the graph below).

[Bar chart: the same monthly sales, Jan-Jun]

The graph takes what the numbers cannot communicate on their own and conveys it in a visible, memorable way. This is the real strength of data visualization.

"Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space." - Edward Tufte (2001)

Data visualization chiefly helps in 3 key aspects of reports and statements:

1) Explaining

Visuals aim to lead the viewer down a path in order to describe situations, answer questions, support decisions, communicate information, or solve specific problems. When you attempt to explain something through data visualization, you start with a question, which interacts with the data set in such a way that enables viewers to make a decision and, subsequently, answer the question.

For example: the graphic below could clearly show which country had the greatest demand for a certain product worldwide in a given month.

[Bar chart: product demand by country (United States, Russia, South Africa, Europe, Canada, Australia, Japan)]

2) Exploring

Some visuals are designed to lend a data set spatial dimensions, or to offer numerous subsets of data in order to raise questions, find answers, and discover opportunities. When the goal of a visual is to explore, the viewers start by familiarizing themselves with the dataset, then identifying an area of interest, asking questions, exploring, and finding several solutions or answers.

For example: an interactive graphic from The Guardian² invites us to explore how the linguistic standard of U.S. presidential addresses has declined over time. The visual is interactive and explanatory, in addition to indicating the readability score of various presidents' speeches.

3) Analyzing

Other visuals prompt viewers to inspect, distill, and transform the most significant information in a data set so that they can discover something new or predict upcoming situations.

For example: an interactive graphic about machine learning³ invites us to explore and discover information within the visual by scrolling through it. Using the machine learning method, the visual explains the patterns detected in the data in order to categorize characteristics.

We'll close this introduction with a 2012 reflection by Alberto Cairo, a specialist in information visualization and a leader in the world of data visualization. For the author, a good visual must provide clarity, highlight trends, uncover patterns, and reveal unseen realities:

"We create visuals so that users can analyze data and, from it, discover realities that not even the designer, in some instances, had considered."

2 Available at: https://www.fusioncharts.com/whitepapers/downloads/Principles-of-Data-Visualization.pdf
3 Available at: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

2.
Data types,
relationships, and
visualization formats

Data types, relationships, and visualization formats

There are a number of methods and approaches to creating visuals based on the nature and complexity of the data and the information. Different kinds of graphics are used in data visualizations, including representations of statistics, maps, and diagrams. These schematic, visual representations of content vary in their degree of abstraction.

In order to communicate effectively, it is important to understand different kinds of data and to establish visual relationships through the proper use of graphics. Enrique Rodríguez (2012), a data analyst at DataNauta, once explained in an interview that...

"A good graphic is one that synthesizes and contextualizes all of the information that's necessary to understand a situation and decide how to move forward."

2 kinds of data

Before we talk about visuals themselves, we must first understand the different kinds of data that can be visualized and how they relate to one another. The most common kinds of data are⁴:

1) Quantitative (numeric)

Data that can be quantified and measured. This kind of data explains a trend or the results of research through numeric values. This category of data can be further subdivided into:

• Discrete: Data that consists of whole numbers (0, 1, 2, 3...). For example, the number of children in a family.
• Continuous: Data that can take any value within an interval. For example, people's height (between 60 and 70 inches) or weight (between 90 and 110 pounds).

2) Qualitative (categoric)

This kind of data is divided into categories based on non-numeric characteristics. It may or may not have a logical order, and it measures qualities and generates categorical answers. It can be:

• Ordinal: Meaning it follows an order or sequence. That might be the alphabet or the months of the year.
• Categorical: Meaning it follows no fixed order. For example, varieties of products sold.

[Figure: quantitative vs. qualitative data]

4 Source: HubSpot, Prezi, and Infogram (2018). Presenting Data People Can't Ignore: How to Communicate Effectively Using Data (p. 10 of 16). Available at: https://offers.hubspot.com/presenting-data-people-cant-ignore.

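To make the taxonomy concrete, here is a minimal, illustrative Python sketch (the function name and the heuristic are our own, not from the source) that sorts a column of raw values into the categories above. Note that ordinal versus unordered qualitative data cannot be told apart from the values alone; that distinction needs domain knowledge.

```python
def classify_column(values):
    """Heuristically classify a column of raw values using the
    taxonomy above: discrete, continuous, or categorical."""
    # All non-numeric (or boolean) values -> qualitative data.
    if all(isinstance(v, bool) or not isinstance(v, (int, float)) for v in values):
        return "qualitative (categorical)"
    # All whole numbers -> quantitative discrete (e.g. number of children).
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    # Otherwise numeric with fractional values -> quantitative continuous.
    return "quantitative (continuous)"

print(classify_column([0, 1, 2, 3]))        # number of children -> discrete
print(classify_column([64.2, 68.9, 70.5]))  # heights in inches -> continuous
print(classify_column(["red", "blue"]))     # product varieties -> categorical
```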
7 data relationships

Data relationships can be simple, like the progress of a single metric over time (such as visits to a blog over the course of 30 days or the number of users on a social network), or they can be complex, precisely comparing relationships, revealing structure, and extracting patterns from data. There are seven data relationships to consider:

• Ranking: A visualization that relates two or more values with respect to a relative magnitude. For example: a company's most sold products.
• Nominal comparisons: Visualizations that compare quantitative values from different subcategories. For example: product prices in various supermarkets.
• Series over time: Here we can trace the changes in the values of a constant metric over the course of time. For example: monthly sales of a product over the course of two years.
• Correlation: Data with two or more variables that can demonstrate a positive or negative correlation with one another. For example: salaries based on level of education.
• Deviation: Examines how each data point relates to the others and, particularly, to what point its value differs from the average. For example: the line of deviation for tickets to an amusement park sold on a rainy day versus a normal day.
• Distribution: A visualization that shows the distribution of data spatially, often around a central value. For example: the heights of players on a basketball team.
• Partial and total relationships: Show a subset of data as compared with a larger total. For example: the percentage of clients that buy specific products.

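Two of these relationships, correlation and deviation, boil down to simple computations. The plain-Python sketch below (all numbers are hypothetical) shows the quantities such visuals encode: Pearson's r, which runs from -1 to +1, and each point's deviation from the mean.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def deviations(xs):
    """Deviation relationship: how far each point sits from the average."""
    m = mean(xs)
    return [x - m for x in xs]

def pearson(xs, ys):
    """Correlation relationship: +1 perfect positive, -1 perfect negative."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: years of education vs. salary (in $1,000s)
education = [10, 12, 14, 16, 18]
salary = [30, 38, 45, 55, 66]
print(round(pearson(education, salary), 3))  # strongly positive
print(deviations([45, 56, 36, 58, 75, 62]))  # monthly sales vs. their average
```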
11 formats

There are two types of visualizations: static and interactive. Their use depends on the search and analysis dimension level. Static visuals can only analyze data in one dimension, whereas interactive visuals can analyze it in several.

As with any other form of communication, familiarity with the code and resources that are available to us is essential if we're going to use them successfully in pursuit of our goal. On this page, we present the different kinds of graphics that we can use to transform our data into information. This group of visualization types is listed in order of popularity in the "Visualization Universe" project by Google News Lab and Adioma, as of the publication of this report.

1. Bar chart

Bar charts are one of the most popular ways of visualizing data because they present a data set in a quickly understood format that enables viewers to identify highs and lows at a glance. They are very versatile, and they are typically used to compare discrete categories, to analyze changes over time, or to compare parts of a whole. The three variations on the bar chart are:

• Vertical column: Used for chronological data, which should run in left-to-right format.
• Horizontal column: Used to visualize categories.
• Full stacked column: Used to visualize categories that collectively add up to 100%.

[Chart examples: a vertical column chart of monthly values, a horizontal column chart of category percentages, and a full stacked column chart]

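The essence of a bar chart, lengths proportional to values, fits in a few lines of Python. This toy text renderer (our own sketch, not a charting library; the sales figures reuse the sample table from the introduction) shows why highs and lows pop out at a glance:

```python
def bar_chart(labels, values, width=40):
    """Render a horizontal bar chart in plain text: each bar's length
    is proportional to its value, scaled so the maximum fills `width`."""
    peak = max(values)
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>4} | {bar} {value}")
    return "\n".join(lines)

print(bar_chart(["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
                [45, 56, 36, 58, 75, 62]))
```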
2. Histograms

Histograms represent a variable in the form of bars, where the surface of each bar is proportional to the frequency of the values represented. They offer an overview of the distribution of a population or sample with respect to a given characteristic. The two variations on the histogram are:

• Vertical columns
• Horizontal columns

[Chart examples: a vertical-column histogram and a horizontal-column histogram]

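Underneath every histogram is a binning step: divide the value range into intervals and count how many observations fall into each. A minimal sketch with hypothetical ages:

```python
def histogram(data, bin_edges):
    """Count observations per bin; bin i covers [edges[i], edges[i+1]),
    and the last bin also includes the upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for x in data:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if bin_edges[i] <= x < bin_edges[i + 1] or (last and x == bin_edges[-1]):
                counts[i] += 1
                break
    return counts

ages = [23, 27, 31, 34, 35, 41, 42, 44, 47, 52, 58, 63]
print(histogram(ages, [20, 30, 40, 50, 60, 70]))  # [2, 3, 4, 2, 1]
```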
3. Pie charts

Pie charts consist of a circle divided into sectors, each of which represents a portion of the total. They should be subdivided into no more than five data groups. They can be useful for comparing discrete or continuous data. The two variations on the pie chart are:

• Standard: Used to exhibit the relationship between parts.
• Donut: A stylistic variation that facilitates the inclusion of a total value or a design element in the center.

[Chart examples: a standard pie chart and a donut pie chart]

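Each pie sector is simply a share of the full 360° circle. A small sketch of that computation, with hypothetical category shares:

```python
def pie_angles(values):
    """Convert parts of a whole into the angle (in degrees) of each
    pie-chart sector; the sectors always add up to a full 360 degrees."""
    total = sum(values)
    return [360 * v / total for v in values]

# Four product categories sharing one market (hypothetical shares)
shares = [50, 25, 15, 10]
print(pie_angles(shares))  # [180.0, 90.0, 54.0, 36.0]
```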
4. Scatter plots

Scatter plots use the spread of points over a Cartesian coordinate plane to show the relationship between two variables. They also help us determine whether or not different groups of data are correlated.

[Chart examples: a scatter plot and a scatter plot with grid]

5. Heat maps

Heat maps represent individual values from a data set on a matrix using variations in color or color intensity. They often use color to help viewers compare and distinguish between data in two different categories at a glance. They are useful for visualizing webpages, where the areas that users interact with most are represented with "hot" colors, and the areas that receive the fewest clicks are presented in "cold" colors. The two variations on the heat map are:

• Mosaic diagram
• Color map

[Chart examples: a mosaic diagram and a color map]

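The core of a heat map is a mapping from value to color intensity. This sketch substitutes a character ramp for color (the click counts are hypothetical), but the normalization step is the same idea:

```python
def heat_shade(value, lo, hi, ramp=" .:-=+*#%@"):
    """Map a value onto a character ramp from 'cold' (light) to 'hot'
    (dense) -- the text equivalent of mapping values to color intensity."""
    t = (value - lo) / (hi - lo)           # normalize to 0..1
    i = min(int(t * len(ramp)), len(ramp) - 1)
    return ramp[i]

# Clicks per page region (rows) and hour block (columns), hypothetical
matrix = [[2, 8, 5], [9, 3, 1], [4, 7, 6]]
for row in matrix:
    print("".join(heat_shade(v, 0, 10) for v in row))
```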
6. Line charts

These are used to display changes or trends in data over a period of time. They are especially useful for showcasing relationships, acceleration, deceleration, and volatility in a data set.

7. Bubble charts

These graphics display three-dimensional data and accentuate data in dispersion diagrams and maps. Their purpose is to highlight nominal comparisons and classification relationships. The size and color of the bubbles represent a dimension that, along with the data, is very useful for visually stressing specific values. The two variations on the bubble chart are:

• Bubble plot: used to show a variable in three dimensions, via position coordinates (x, y) and size.
• Bubble map: used to visualize three-dimensional values for geographic regions.

8. Radar charts

These are a form of representation built around a regular polygon that is contained within a circle, where the radii that guide the vertices are the axes over which the values are represented. They are equivalent to graphics with parallel coordinates plotted on polar coordinates. Typically, they are used to represent the behavior of a metric over the course of a set time cycle, such as the hours of the day, months of the year, or days of the week.

[Chart examples: a line chart and a radar chart]

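A radar chart places one axis per category around a circle and plots each value as a point on its axis. A sketch of that polar-to-Cartesian conversion (the axis order and unit-circle scaling are our own choices, not a fixed convention):

```python
from math import cos, sin, pi

def radar_points(values, max_value):
    """Place one axis per value around a circle (starting at 12 o'clock,
    going clockwise) and return the (x, y) vertex for each value."""
    n = len(values)
    points = []
    for i, v in enumerate(values):
        angle = pi / 2 - 2 * pi * i / n   # clockwise from the top
        r = v / max_value                 # radius 0..1 inside the unit circle
        points.append((r * cos(angle), r * sin(angle)))
    return points

# A metric measured over a 4-point cycle (e.g. quarters), hypothetical
pts = radar_points([10, 5, 10, 5], max_value=10)
for x, y in pts:
    print(f"({x:.2f}, {y:.2f})")
```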
9. Waterfall charts

These help us understand the cumulative effect of positive and negative values on variables in a sequential fashion.

[Chart example: a waterfall chart of rises and falls running from a Start bar to an End bar]

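Each floating bar in a waterfall chart spans the running total before and after one signed change. A sketch with hypothetical figures:

```python
def waterfall(start, changes):
    """Turn a starting value and signed changes into the (bottom, top)
    of each floating bar, plus the final cumulative total."""
    bars, level = [], start
    for change in changes:
        nxt = level + change
        bars.append((min(level, nxt), max(level, nxt)))  # rise or fall
        level = nxt
    return bars, level

bars, end = waterfall(100, [30, -20, 50, -10])
print(bars)  # [(100, 130), (110, 130), (110, 160), (150, 160)]
print(end)   # 150
```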
10. Tree maps

Tree maps display hierarchical data (in a tree structure) as a set of nested rectangles that occupy surface areas proportional to the value of the variable they represent. Each tree branch is given a rectangle, which is later placed in a mosaic with smaller rectangles that represent secondary branches. The finished product is an intuitive, dynamic visual of a plane divided into areas that are proportional to hierarchical data, which has been sorted by size and given a color key.

[Chart example: a tree map of a hierarchy with branches A = 200, B = 80, C = 120 and nested sub-branches D-H with values 30, 50, 20, 40, 60]

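The proportional-area rule can be sketched with the simplest "slice" layout, dividing a rectangle into strips whose areas match the values. Real tree maps apply this recursively per branch and usually use smarter squarified layouts; this is only the underlying arithmetic.

```python
def slice_treemap(x, y, w, h, values):
    """Divide the rectangle (x, y, w, h) into vertical strips whose
    areas are proportional to the values (one level of a tree map)."""
    total = sum(values)
    rects, offset = [], x
    for v in values:
        width = w * v / total
        rects.append((offset, y, width, h))
        offset += width
    return rects

# One level of a hierarchy with branch values 200, 80, and 120
for rect in slice_treemap(0, 0, 100, 50, [200, 80, 120]):
    print(rect)
```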
11. Area charts

These represent the relationship of a series over time, but unlike line charts, they can represent volume. The three variations on the area chart are:

• Standard area: used to display or compare a progression over time.
• Stacked area: used to visualize relationships as part of the whole, thus demonstrating the contribution of each category to the cumulative total.
• 100% stacked area: used to communicate the distribution of categories as part of a whole, where the cumulative total does not matter.

[Chart examples: a standard area chart, a stacked area chart, and a 100% stacked area chart]

Selecting the right graphic to effectively communicate through our visualizations is no easy task. Stephen Few (2009), a specialist in data visualization, proposes taking a practical approach to selecting and using an appropriate graphic:

• Choose a graphic that is sure to capture the viewer's attention.
• Represent the information in a simple, clear, and precise way (avoid unnecessary flourishes).
• Make it easy to compare data; highlight trends and differences.
• Establish an order for the elements based on the quantity that they represent; that is, detect maximums and minimums.
• Give the viewer a clear way to explore the graphic and understand its goals; make use of guide tags.

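The 100% stacked variant rescales every x position so the categories always total 100. A sketch with hypothetical series:

```python
def to_percent_stack(series):
    """Convert several series into 100%-stacked form: at each x position
    the categories are rescaled so their cumulative total is always 100."""
    totals = [sum(vals) for vals in zip(*series)]
    return [[100 * v / t for v, t in zip(s, totals)] for s in series]

# Three categories A, B, C measured at four points in time (hypothetical)
a, b, c = to_percent_stack([[2, 4, 4, 2], [1, 1, 4, 6], [1, 5, 2, 2]])
print(a)  # at every position, a + b + c now sums to 100
```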
3. Basic principles for
data visualization

Basic principles for data visualization

Graphics with an objective: seeking your mantra

The goal of data visualizations is to help us understand the object they represent. They are a medium for communicating stories and the results of research, as well as a platform for analyzing and exploring data. Therefore, having a sound understanding of how to create data visualizations will help us create meaningful and easy-to-remember reports, infographics, and dashboards. Creating suitable visuals helps us solve problems and analyze a study's objects in greater detail.

The first step in representing information is trying to understand that data visualization. Ben Shneiderman gave us a useful starting point in his text "The Visual Information-Seeking Mantra" (1996), which remains a touchstone work in the field. The author suggests a simple methodology for novice users to delve into the world of data visualization and experiment with basic visual representation tasks.⁵

Shneiderman introduces his famous mantra on how to approach the quest for visual information, which he breaks down into three tasks:

1. Overview first: This ensures viewers have a general understanding of the data set, as their starting point for exploration. This means offering them a visual snapshot of the different kinds of data, explaining their relationship in a single glance. This strategy helps us visualize the data, at all its different levels, at one time.

2. Zoom and filter: The second step involves supplementing the first so that viewers understand the data's underlying structure. The zoom in/zoom out mechanism enables us to select interesting subsets of data that meet certain criteria while maintaining the sense of position and context.

3. Details on demand: This makes it possible to select a narrower subset of data, enabling the user to interact with the information and use filters by hovering or clicking on the data to pull up additional information.

[Diagram: the mantra's three tasks in sequence: overview first, zoom and filter, details on demand]

The diagram above summarizes the key points to designing such a graphic, with an eye to human visual perception, so that users can translate an idea into a set of physical attributes. These attributes are: structure, position, form, size, and color. When properly applied, these attributes can present information effectively and memorably.

5 Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations (p. 336). Available at: https://www.cs.umd.edu/~ben/papers/Shneiderman1996eyes.pdf

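The three tasks can be sketched over a toy data set of (month, region, sales) records; the field names and numbers here are illustrative only, not from the source:

```python
records = [
    ("Jan", "North", 45), ("Jan", "South", 30),
    ("Feb", "North", 56), ("Feb", "South", 41),
    ("Mar", "North", 36), ("Mar", "South", 52),
]

# 1. Overview first: one summary number per month, visible in a glance
overview = {}
for month, _, sales in records:
    overview[month] = overview.get(month, 0) + sales

# 2. Zoom and filter: keep only the subset that meets our criteria
south = [r for r in records if r[1] == "South"]

# 3. Details on demand: pull up the full record behind one data point
detail = next(r for r in south if r[0] == "Feb")

print(overview)  # {'Jan': 75, 'Feb': 97, 'Mar': 88}
print(detail)    # ('Feb', 'South', 41)
```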
Layout and design: communicative elements

In order to begin designing our reports and statements, it is essential to understand that visual representations are cognitive tools that complement and strengthen our mental ability to encode and decode information.⁶ Meirelles (2014) notes that: "All graphic representation affects our visual perception, because the elements of transmission utilized act as external stimuli, which activate our emotional state and knowledge."

Thus, when our mind visualizes a representation, it transforms the information, merges it, and applies a hierarchical structure to it to facilitate interpretation. For this reason, in order to have an efficient perceptive impact, it is important to adhere to a series of best practices when creating reports and infographics. As with any other form of communication, success depends largely on the business's familiarity with the established code and the resources available. Space, shapes, color, icons, and typography are a few of the essential elements of a striking visual with communicative power.

Structuring: the importance of layout

All visual representations begin with a blank dimensional space that will eventually hold the information which will be communicated. The process of spatial coding is a fundamental part of visual representation because it is the medium in which the results of our compositional decisions and the meaning of our visual statement will be visualized, thereby having an impact on the user.

Edward Tufte (1990) defines "layout" as a scheme for distributing visual elements in order to achieve organization and harmony in the final composition. Layout planning and design serve as a template for applying hierarchy and control to information at varying levels of detail.⁷ In his book Envisioning Information, Tufte offers several guidelines for information design:

• Have a properly chosen format.
• Give a broad visual tour and offer a focused reading at different detail levels.
• Use words, numbers, and drawings.
• Reflect a balance, a proportion, a sense of relevant scale, and a context.

Spatial encoding requires processing spatial proportions (position and size), which have a determining role in the organization of perception and memory. Furthermore, the visual hierarchy of elements plays a role in this encoding process, because the elements' organization and distribution must have a well-defined hierarchical system in order to communicate effectively (Meirelles: 2014). In a sense, visualizations are paragraphs about data, and they should be treated as such. Words, images, and numbers are part of the information that will be visualized. When all of the elements are integrated in a single structure and visual hierarchy, the infographic or report will organize space properly and communicate effectively, according to your user's needs.

6 Meirelles, I. (2014). "La información en el diseño" (p. 21-22). Barcelona: Parramón.
7 Tufte, E. (1990). Envisioning Information. Cheshire: Graphics Press.

Visual variables and their semantics

Visual variables are the building blocks of visual representation. They conform to an order and spatial context in order to convey a quantitative message. These resources can be used to categorize meaningful properties and amplify the message being represented. Let's take a look at their semantics:

• Point: Has no dimensions and indicates a place.
• Line: Has one dimension and indicates length and direction.
• Plane: Has two dimensions and indicates space and scale.

Jacques Bertin, cited in Meirelles (2014), used the term "visual variables" for the first time in his book Sémiologie graphique, where he presented them as a system of perceptive variables with corresponding properties of meaning. He offered a guide for combining graphic elements in an appropriate way according to their order, position, orientation, size, texture, and value.

[Figure: visual variables (two-dimensional position, size, and value) applied to points, lines, and areas]
Using consistent and attractive color schemes

Color is one of the most powerful resources for data visualization, and it is essential if we are going to understand information properly. Color can be used to categorize elements, quantify or represent values, and communicate cultural attributes associated with a specific color. It dominates our perception and, in order to analyze it, we must first understand its three dimensions.

Hue: This is what we normally imagine when we picture colors. There is no order to colors; they can only be distinguished by their characteristics (blue, red, yellow, etc.).

Brightness: The color's luminosity. This is a relative measure that describes the amount of light reflected by one object with respect to another. Brightness is measured on a scale, and we can talk about brighter and darker values of a single hue.

Saturation: This refers to the intensity of a given color's hue. It varies based on brightness. Darker colors are less saturated, and the less saturated a color is, the closer it gets to gray; in other words, it gets closer to a neutral (hueless) color.

[Figure: common color scheme types: grayscale, monochromatic, complementary, split complementary, double complementary, cool colors, and saturated colors]

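These three dimensions map directly onto the HLS model in Python's standard colorsys module (colorsys calls brightness "lightness"). A small demonstration of adjusting one dimension at a time; the specific RGB values are arbitrary examples:

```python
import colorsys

# Decompose a red into its three dimensions (RGB channels on a 0..1 scale)
h, l, s = colorsys.rgb_to_hls(0.8, 0.2, 0.2)

# Raise brightness (lightness) while keeping hue and saturation fixed
brighter = colorsys.hls_to_rgb(h, min(l + 0.2, 1.0), s)

# Lower saturation: the color moves toward a neutral gray
desaturated = colorsys.hls_to_rgb(h, l, s * 0.3)

print(f"hue={h:.2f} lightness={l:.2f} saturation={s:.2f}")
print(brighter)
print(desaturated)
```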
Isabel Meirelles (2014) notes that selecting a color palette in order to visualize data is no easy task, and she recommends following Cynthia Brewer's advice to use three different kinds of color schemes, based on the nature of the data:

1. Monochromatic sequential palettes or their analogues

These palettes are great for ordering numeric data that progresses from small to large. It is best to use brighter color gradients for low values and darker ones for higher values. Thus, brightness levels can be used as a visible, coherent aspect of a graphic scheme. Sequential color schemes make it possible to create a smooth, low-contrast design. This color scheme is better for an image than for data visualization.

TIP: To create a color hierarchy in a sequential scheme, choose one dominant color and use the others with moderation; alternatively, you can simply use two softer versions of the dominant color, which will naturally make them feel lower on the hierarchy.

2. Diverging palettes

These are more suitable for ordering categorical data, and they are more effective when the categorical division is in the middle of the sequence. The change in brightness highlights a critical value in the data, such as the mean or median, or a zero. Colors become darker to represent differences in both directions, based on this meaningful value in the middle of the data.

TIP: Try to emphasize the most important information using arrows and text, circles, rectangles, or contrasting colors. This way, when you visualize your data, your analysis will be more understandable.

3. Qualitative palettes

These are better for representing ordinal or categorical data to create primary visual differences between categories. Most qualitative schemes are based on differences in hue, with differences in brightness between the colors.

TIP: The qualitative color scheme is perfect for visualizing data because it affords a high degree of contrast and helps you draw attention to important points, especially if you use one predominant color and use the second as an accent in your design.

Finally, don't forget to use palettes that are comprehensible to people who can't see color. Color blindness is a disability or limited ability that makes it difficult to distinguish certain pairs of colors, such as blue and yellow, or red and green. One strategy for avoiding this problem is to adapt designs that use more than just hue to codify information; create schemes that slightly vary another channel, such as brightness or saturation.

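Brewer-style sequential and diverging ramps can be sketched with the standard colorsys module. The step counts and lightness range below are arbitrary choices for illustration, not ColorBrewer's actual values:

```python
import colorsys

def sequential(hue, steps):
    """Monochromatic sequential palette: one hue, with lightness running
    from bright (low values) to dark (high values), as recommended above."""
    lights = [0.85 - 0.6 * i / (steps - 1) for i in range(steps)]
    return [colorsys.hls_to_rgb(hue, light, 0.6) for light in lights]

def diverging(hue_low, hue_high, steps):
    """Diverging palette: two sequential ramps meeting at a light,
    near-neutral midpoint that marks the critical value (steps odd)."""
    half = steps // 2
    low = sequential(hue_low, half + 1)[::-1]   # dark -> light
    high = sequential(hue_high, half + 1)       # light -> dark
    return low + high[1:]                        # share the light midpoint

blues = sequential(0.6, 5)                       # hue 0.6 is a blue
red_blue = diverging(0.0, 0.6, 9)                # red -> light middle -> blue
print(len(blues), len(red_blue))  # 5 9
```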
Use icons and symbols to aid in understanding and limit unnecessary tagging

Symbols and icons are another avenue for visualizing information that goes beyond merely being decorative. They draw strength from their ability to exhibit a general context in an attractive, precise way. Icons illustrate concepts. Viewers can understand what the information is about by just glancing at the illustration.

Alexander Skorka (2018), chief evangelist for the Dapresy Group, recommends using symbols and icons because they simplify communication. Symbols are self-explanatory, and our mind can process icons more easily than text. It is important to consider that an icon's success depends largely on cultural context, so it is important to select universally understandable images.

That said, they certainly should not be complex illustrations. An icon with too many details could hinder viewers' understanding. Keep it simple: icons' meaning should be immediately clear, even when they're very small.

The ease with which we recognize icons enables us to process data faster than we can process information conveyed textually. Therefore, when designing information, it is wise to use both graphics and icons to convey proportions in greater detail.

[Figure: the same survey results shown with icons versus as a plain table:
Notebooks - singles 82%, couples 76%, families 63%
Entertainment - singles 55%, couples 64%, families 88%
Lifestyle products - singles 77%, couples 73%, families 54%]

The typography in our reports: effective applications

Typography plays an important role in the design of reports and statements. Selecting the right font strengthens your message and captures the audience's attention. Müller-Brockmann (1961), a graphic designer, defines typography as the proper visual element for composition. He notes that "the reader must be able to read the message from a text easily and comfortably. This depends largely on the size of the text, the length of the lines, and the spacing between the lines."⁸

Typography is an art form in and of itself, in which every font has its own characteristics, which should be strategically combined. For people outside the world of graphic design, choosing a font and setting other typographical features can be tricky, but it doesn't have to be. Let's take a practical look at the steps you should take when determining your typography, and then consider the images and visual elements that best accompany your text. Considerations when setting your typography:

• Determine the goal of your report's content.
• Select a font that strengthens that goal. Fonts come in two types: with serifs or without (sans) serifs. Serif fonts have an extra stroke that conveys a sense of tradition, security, history, integrity, authority, and other such concepts. Sans-serif fonts stand out because they have a more polished, sophisticated feel; they convey a sense of modernity, order, cleanliness, elegance, avant-garde, and style.
• Pay attention to legibility. Remember that screen type does not appear in the same way as print type. It is best to choose a more responsive (sans-serif) font for on-screen texts, and fonts with serifs for printed reports. That said, there's an exception to every rule, and today there is a bounty of fonts that are perfectly suitable for both digital and print media.
• Watch your weight (light, regular, bold). When it comes to bolding your text, a value of two or three should be plenty. It is better to reserve the heaviest weight for headlines and then apply a stylistic hierarchy based on your content. Avoid fonts that only offer one weight or style, since their applications are limited.
• Don't forget that some fonts use more memory than others. Fonts with serifs generally demand more of your computer's resources than sans-serif fonts. This is an important consideration in interactive reports, since a document that occupies more RAM will be less responsive.

Fonts have personalities that help us establish a more attractive visual tone for our audience. Familiarizing yourself with a few can go a long way. There are:

• Professional fonts
• Fun fonts
• Handwritten fonts
• Minimalist fonts

8 The Graphic Artist and His Design Problems (Gestaltungsprobleme des Grafikers), Teufen, 1961.

Prioritize patterns in your visualizations: Gestalt

The basic elements of the visualization process also involve preattentive attributes. Preattentive attributes are visual features that facilitate the rapid visual perception of a graphic in a space. Designers use these characteristics to better uncover relevant information in visuals, because these characteristics attract the eye.

Colin Ware, Director of the Data Visualization Research Lab at the University of New Hampshire, has highlighted that preattentive attributes can be used as resources for drawing viewers' immediate attention to certain parts of visual representations (2004). According to Ware, preattentive processing happens very quickly, typically in the first 10 milliseconds. This process is the mind's attempt to rapidly extract basic visual characteristics from the graphic (stage 1). These characteristics are then consciously processed, along with the perception of the object, so that the mind can extract patterns (stage 2), ultimately enabling the information to move to the highest level of perception (stage 3). This makes it possible to find answers to the initial visual question, utilizing the information saved in our minds. Colin Ware, cited in Meirelles (2014), explains it as follows:

"Bottom-up information contributes to the pattern creation process. Top-down processing reinforces relevant information."

Preattentive attributes enhance object perception and cognition processes, leveraging our mind's visual capacities. Good data visualizations deliberately make use of these attributes because they boost the mind's discovery and recognition of patterns such as lines, planes, colors, movements, and spatial positioning.⁹

9 Dondis, D.A. (2015). La sintaxis de la imagen: introducción al alfabeto visual. Barcelona: Editorial Gustavo Gili.
Meirelles, I. (2014). La información en el diseño. Barcelona: Parramón.

netquest.com 25
The visual below lists preattentive attributes that represent aspects of lines and planes when visualizing and analyzing graphic representation: shape, color, and spatial position.

[Figure: preattentive attributes
Shape: orientation, line length, line width, size, shape, curvature, added marks, enclosure
Color: intensity/value, hue
Spatial position: 2-D position, sharpness, numerosity]
Detecting patterns is fundamental to structuring and organizing visual information. When we create visuals, we often want to highlight certain patterns over others. Preattentive attributes are the alphabet of visual language; analytic patterns are the words that we write by using them. When we see a good visualization, we immediately detect the preattentive attributes and recognize analytic patterns in the visualization. The following table summarizes a few basic analytic patterns:

[Table: basic analytic patterns]

We have seen how preattentive attributes and patterns make it possible to process and analyze visual information; they also enable us to improve pattern discovery and perceptive inferences and provide processes for solving visualization problems.

Gestalt’s principles are the principles that enable us to understand the requirements posed by certain problems so that we see everything as an integral, coherent whole. It involves proximity, similarity, shared destiny, “Prägnanz” or pithiness, closure, simplicity, familiarity, and discernment between figure and ground.

According to Dondis (2015), Gestalt’s principles help describe the way we organize and merge elements in our minds. They quiet the noise of the graphics so that we relate, combine, and analyze them. These principles come into play whenever we analyze any sort of visualization. Only position and length can be used to accurately perceive quantitative data. The other attributes are useful for perceiving other sorts of data, such as categorical and relational data.

We’ll close this section with one piece of practical advice on how to effectively visualize data. Colin Ware, in Visual Thinking for Design (2008), summarizes the importance of always being mindful of preattentive attributes and patterns when designing a visualization:

“Good design optimizes the visual thinking process. The choice of patterns and symbols is important so that visual queries can be efficiently processed by the intended viewer. This means choosing words and patterns each to their best advantage.”

Gestalt’s principles

4. Storytelling for social and
market communication

As we saw at the beginning of this ebook, our mind tends to visualize information in order to satisfy a basic need: telling a story. It is one of the most primitive forms of communication, and it is inherent in every human being.

We cannot live without communicating, without expressing our personalities, emotions, and moods, our worries and fears.

Paul MacLean, cited in María Alejandra Rendón (2009), proposes a “Triune brain” theory, which addresses the structure and behavior of the human mind. For MacLean, the mind consists of three inseparable parts (or distinct brains); none of the three functions independently or separately. They are the reptilian brain, the emotional brain, and the neocortex.

The reptilian brain is home to our unconscious, also known as our instinctive side. It manages survival and our body’s self-regulation. The second part, the emotional brain, is responsible for our emotional processes and basic motivations. Last but not least, the neocortex is our more rational, complex side. It is in charge of driving our systematic and logical thinking.

The triune model is a valuable tool for effectively communicating with our audience. It is one of many theories employed in neuromarketing to influence and persuade potential buyers. Understanding and mastering this theory enables us to extract information not just from the neocortex, but from the reptilian and emotional brains as well. This can be useful for qualitative market research methodology, since it utilizes a host of different techniques, including in-depth interviews, ethnographic research, and focus groups. This information is essential if we are aiming towards a scientific framework to talk about neuromarketing.

How, then, can we create stories that use data to communicate insights? Below, we explain three simple sequences for telling a story:

• Influencing people’s emotions by telling a story (drawing in their attention).
• Persuading them through benefits that cover specific needs (benefits/engagement).
• Moving on to concrete steps (call to action).

If you can successfully visualize this sequence, you understand the foundation of all narratives. What that means is that every story we try to tell has a beginning, a developed plot, and a resolution, all building up to the invaluable call to action. If you have a clear notion of how to include the “story” element in your reports, statements, and dashboards, you will successfully create stories that use your data to share insights.

netquest.com 30
Data storytelling

We all love good stories, and data is one of the best tools for telling them. Millions of pieces of data are generated every day. They could be converted into great stories, but instead they are left unused. It’s time to change all that. It’s time to start telling stories that draw their power from data.

So-called “data storytelling” is nothing more than placing a structured focus on the way we use data to communicate insights. It relies on three key elements: narrative, visualization, and data.

What do we get when we combine these elements?

Data + Narrative. Data can be insights; they are drawn from study and analysis. Their nature can propose the narrative context.

Visualization + Data. Visualization shines a light on our data by enabling us to rapidly process large volumes of data in a visual system. As more data series are represented, we rely less on the verbal and more on the visual. Thus, we can enlighten our audience with insights that they may not have otherwise seen.

Narrative + Visualization. The story must motivate. It must have a plot, highs and lows, and an arc of emotional connection in order to draw in and entertain our audience.

Data + Visualization + Narrative = successfully using our data to tell a story, wield influence, and effect the desired change.

[Figure: “The perfect combination”: a Venn diagram of Data, Visuals, and Narrative; overlaps labeled Explain, Enlighten, and Engage; center labeled CHANGE.]

A basic recipe for storytelling in your
presentations and final reports
In case you don’t have a clear notion of how to include the “story” element in your data, we’re going to outline a few points that will guide you, so that your presentations and reports manage to grab your audience’s attention and have a major impact:

1. Find the story in your data. Write, write, and write. Write about the highlights of your research in different roles. Worry about presentation later.

2. Define the perspective. Who are you talking to? What’s the best way to achieve your objective?

3. Create a hierarchy. What is the most important thing you are trying to convey? Establish different depths to your reading and data. Avoid irrelevant information.

4. Organize. Figure out the most suitable sequence for presenting your data. What relationships can you establish between different aspects of your data? What do some pieces of data mean relative to others? Are they the framework (data that reveals), the details (data that delves deeper), or the contrast (data that dramatizes differences)?

5. Plot. Generate interest; create tension. Depict the concept, crux, and resolution. Incentivize your audience to keep reading until the last page, so to speak. Establish relationships.

6. Use data to anchor your narrative. The story in your data ought to be simple; the vision drawn from the data comes with an implicit responsibility to be sincere and honest.

7. Design principles. Adhere to the best practices of design to visualize your data.

8. Review, review, review. Make sure that all of your analysis is precise.

9. Be familiar with your content and respect your audience.

10. Keep it short and sweet. Data-based storytelling is the product of hours of work. It’s best to keep presentations short, with concrete ideas adapted to the audience so that your message is conveyed efficiently and smoothly.

5. Trends in market research and
data visualization dashboards

Data visualization technologies and methods continue to evolve in important ways. This cutting-edge report reflects the most relevant alternatives available on the market that can be used to work in this field. In both the software industry and the academic sector, several paths for innovation and development are at the forefront, including scrollytelling, social-first data visualization, and virtual reality visualizations.

Scrollytelling

“Scrollytelling” is a technique that we’ve all experienced first-hand when viewing certain infographics or websites. As the name implies, it aims to tell a story as users scroll through a graphic. There are many libraries that can help generate this kind of visualization, including Waypoints, ScrollStory, ScrollMagic, and Graph-Scroll.js.

These tools enable users to interact with content as they scroll through a webpage or zoom in on content, which changes and tells a story. By interacting with the visualization of the data, users can see more details, a timeline, or new elements, such as text, icons, or graphics. Russell Goldenberg, an expert in data and visualization, once explained how to implement this technique on his blog, expounding on the characteristics, advantages, and disadvantages of various graphic libraries whose code is intended to act as the foundation for this sort of project.

Social-first data visualization

The landscape of data visualization is constantly changing, as we have seen throughout this report. Needless to say, social networks now play an important role, since the constant flow of information enables us to draw more and more narrative stories from data obtained by Instagram, Snapchat, and YouTube. RJ Andrews, in his work “Info We Trust”, notes that “It’s almost like content stumbles onto social right now.” He adds that “it is something that has yet to be figured out in a compelling way, but the community is slowly becoming aware. Social can mean a lot of things” for the market and for market research.

It seems that RJ is not alone. UX designer and data enthusiast Catherine Madden expects to see more “snackable” visual content in 2017: “I want to see something that can be consumed on an Instagram feed that might hook someone into learning more that is also easy to share.”

One great example of such an influencer is Mona Chalabi, Data Editor at Guardian U.S., who has over 50,000 Instagram followers. She has published a series of active content visualizations on gender, consumption, politics, and so on. Checking out her profile is “required reading” for anyone interested in learning about data visualization and social media.

The diversity of information has led to the emergence of tools that help us uncover—through published images—conversations about a given brand that are taking place online on a daily basis. One such tool is Brandwatch, a social intelligence platform for unearthing key ideas that are being discussed in the billions of conversations taking place online. This technique enables us to reveal key information about the consumer’s individualized online experience.

Virtual reality visualizations

Virtual reality has the potential to revolutionize data visualization, especially when it comes to big data. Even in a two-dimensional image, there is already too much data for the human eye to capture. Now imagine a three-dimensional data visualization, which allows the user to fully interact with data in a 360-degree field of vision.

Virtual reality data visualizations are highly interactive, computer-generated 3D projections. Although the concept of virtual reality is nothing new, the idea of immersive data exploration certainly is, and the exciting possibilities that it promises are endless.

What does the future have in store?

Visual data representation techniques and methods progress every day, as technology evolves and our body of theoretical knowledge grows. As this technology and this knowledge work in tandem, we will continue developing solutions for our problems and needs. From this report, we hope you have deduced that, in our current era, images are the most efficient language. We hope you now understand that tools and software can help us discover limitless graphic resources and develop new structures for communicating and conveying ideas. Consequently, we can confidently state that the applications of graphic representation are constantly expanding, and we must not forget that they are the objective of our communication strategies in market research.

Visualize It! A Comprehensive Guide to Data Visualization

Thank you for reading. Stay tuned with us.

ABOUT THE AUTHOR
Melissa Matias | Visual Data Designer
Focused on creating data visualizations and market research dashboards, leveraging data to enhance experiences. She is a passionate, curious person who enjoys collaborating in human-centered design projects around the world.

Project editing
Isabel Montero | Communication Specialist
Vanessa Castro | Global Communications Manager

Copy editing
Bernou Benne | Marketing Specialist

Graphic design
Nina Rojc | Graphic Designer
Anna Caballero | Global Brand Designer
Melissa Matias | Visual Data Designer

Visualizing Data


Chapter 1. The Seven Stages of Visualizing Data

The greatest value of a picture is when it forces us to notice what we never expected to see.

—John Tukey

What do the paths that millions of visitors take through a web site look like?
How do the 3.1 billion A, C, G, and T letters of the human genome compare to
those of the chimp or the mouse? Out of a few hundred thousand files on your
computer’s hard disk, which ones are taking up the most space, and how often do
you use them? By applying methods from the fields of computer science,
statistics, data mining, graphic design, and visualization, we can begin to answer
these questions in a meaningful way that also makes the answers accessible to
others.

All of the previous questions involve a large quantity of data, which makes it
extremely difficult to gain a “big picture” understanding of its meaning. The
problem is further compounded by the data’s continually changing nature, which
can result from new information being added or older information continuously
being refined. This deluge of data necessitates new software-based tools, and its
complexity requires extra consideration. Whenever we analyze data, our goal is
to highlight its features in order of their importance, reveal patterns, and
simultaneously show features that exist across multiple dimensions.

This book shows you how to make use of data as a resource that you might
otherwise never tap. You’ll learn basic visualization principles, how to choose the
right kind of display for your purposes, and how to provide interactive features
that will bring users to your site over and over again. You’ll also learn to program
in Processing, a simple but powerful environment that lets you quickly carry out
the techniques in this book. You’ll find Processing a good basis for designing
interfaces around large data sets, but even if you move to other visualization
tools, the ways of thinking presented here will serve you as long as human beings
continue to process information the same way they’ve always done.

Why Data Display Requires Planning


Each set of data has particular display needs, and the purpose for which you’re
using the data set has just as much of an effect on those needs as the data itself.
There are dozens of quick tools for developing graphics in a cookie-cutter
fashion in office programs, on the Web, and elsewhere, but complex data sets
used for specialized applications require unique treatment. Throughout this book,
we’ll discuss how the characteristics of a data set help determine what kind of
visualization you’ll use.

TOO MUCH INFORMATION


When you hear the term “information overload,” you probably know exactly
what it means because it’s something you deal with daily. In Richard Saul
Wurman’s book Information Anxiety (Doubleday), he describes how the New
York Times on an average Sunday contains more information than a Renaissance-era person had access to in his entire lifetime.

But this is an exciting time. For $300, you can purchase a commodity PC that has
thousands of times more computing power than the first computers used to
tabulate the U.S. Census. The capability of modern machines is astounding.
Performing sophisticated data analysis no longer requires a research laboratory,
just a cheap machine and some code. Complex data sets can be accessed,
explored, and analyzed by the public in a way that simply was not possible in the
past.

The past 10 years have also brought about significant changes in the graphic
capabilities of average machines. Driven by the gaming industry, high-end 2D
and 3D graphics hardware no longer requires dedicated machines from specific
vendors, but can instead be purchased as a $100 add-on card and is standard
equipment for any machine costing $700 or more. When not used for gaming,
these cards can render extremely sophisticated models with thousands of shapes,
and can do so quickly enough to provide smooth, interactive animation. And
these prices will only decrease—within a few years’ time, accelerated graphics
will be standard equipment on the aforementioned commodity PC.


DATA COLLECTION
We’re getting better and better at collecting data, but we lag in what we can do
with it. Most of the examples in this book come from freely available data
sources on the Internet. Lots of data is out there, but it’s not being used to its
greatest potential because it’s not being visualized as well as it could be. (More
about this can be found in Chapter 9, which covers places to find data and how to
retrieve it.)

With all the data we’ve collected, we still don’t have many satisfactory answers
to the sort of questions that we started with. This is the greatest challenge of our
information-rich era: how can these questions be answered quickly, if not
instantaneously? We’re getting so good at measuring and recording things, why
haven’t we kept up with the methods to understand and communicate this
information?

THINKING ABOUT DATA


We also do very little sophisticated thinking about information itself. When AOL
released a data set containing the search queries of millions of users that had been
“randomized” to protect the innocent, articles soon appeared about how people
could be identified by—and embarrassed by—information regarding their search
habits. Even though we can collect this kind of information, we often don’t know
quite what it means. Was this a major issue or did it simply embarrass a few AOL
users? Similarly, when millions of records of personal data are lost or accessed
illegally, what does that mean? With so few people addressing data, our
understanding remains quite narrow, boiling down to things like, “My credit card
number might be stolen” or “Do I care if anyone sees what I search?”

DATA NEVER STAYS THE SAME


We might be accustomed to thinking about data as fixed values to be analyzed,
but data is a moving target. How do we build representations of data that adjust
to new values every second, hour, or week? This is a necessity because most data
comes from the real world, where there are no absolutes. The temperature
changes, the train runs late, or a product launch causes the traffic pattern on a
web site to change drastically.

What happens when things start moving? How do we interact with “live” data?
How do we unravel data as it changes over time? We might use animation to play
back the evolution of a data set, or interaction to control what time span we’re
looking at. How can we write code for these situations?
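Controlling “what time span we’re looking at” can be as simple as filtering by a movable window. A minimal Python sketch, assuming the data arrives as (timestamp, value) pairs; that format is our assumption, not a convention from the book:

```python
def time_window(series, start, span):
    """Return the readings whose timestamps fall in [start, start + span).
    Driving 'start' from a slider is one simple way to let viewers
    scrub back and forth through data that changes over time."""
    return [(t, v) for t, v in series if start <= t < start + span]
```

Replaying the evolution of a data set is then a matter of advancing the window and re-rendering each slice.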

WHAT IS THE QUESTION?


As machines have enormously increased the capacity with which we can create
(through measurements and sampling) and store data, it becomes easier to
disassociate the data from the original reason for collecting it. This leads to an
all-too frequent situation: approaching visualization problems with the question,
“How can we possibly understand so much data?”

As a contrast, think about subway maps, which are abstracted from the complex
shape of the city and are focused on the rider’s goal: to get from one place to the
next. Limiting the detail of each shape, turn, and geographical formation reduces
this complex data set to answering the rider’s question: “How do I get from point
A to point B?”

Harry Beck invented the format now commonly used for subway maps in the
1930s, when he redesigned the map of the London Underground. Inspired by the
layout of circuit boards, the map simplified the complicated Tube system to a
series of vertical, horizontal, and 45° diagonal lines. While attempting to preserve
as much of the relative physical layout as possible, the map shows only the
connections between stations, as that is the only information that riders use to
decide their paths.

When beginning a visualization project, it’s common to focus on all the data that
has been collected so far. The amounts of information might be enormous—
people like to brag about how many gigabytes of data they’ve collected and how
difficult their visualization problem is. But great information visualization never
starts from the standpoint of the data set; it starts with questions. Why was the
data collected, what’s interesting about it, and what stories can it tell?

The most important part of understanding data is identifying the question that
you want to answer. Rather than thinking about the data that was collected, think
about how it will be used and work backward to what was collected. You collect
data because you want to know something about it. If you don’t really know why
you’re collecting it, you’re just hoarding it. It’s easy to say things like, “I want to
know what’s in it,” or “I want to know what it means.” Sure, but what’s
meaningful?

The more specific you can make your question, the more specific and clear the
visual result will be. When questions have a broad scope, as in “exploratory data
analysis” tasks, the answers themselves will be broad and often geared toward
those who are themselves versed in the data. John Tukey, who coined the term
Exploratory Data Analysis, said “. . . pictures based on exploration of data should force their messages upon us.”[1] Too many data problems are labeled
“exploratory” because the data collected is overwhelming, even though the
original purpose was to answer a specific question or achieve specific results.

One of the most important (and least technical) skills in understanding data is
asking good questions. An appropriate question shares an interest you have in the
data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is
defined by your audience’s ability to pick up on, and be excited about, your
insight.

Admittedly, you may have a rich set of data to which you want to provide
flexible access by not defining your question too narrowly. Even then, your goal
should be to highlight key findings. There is a tendency in the visualization field
to borrow from the statistics field and separate problems into exploratory and
expository, but for the purposes of this book, this distinction is not useful. The
same methods and process are used for both.


In short, a proper visualization is a kind of narrative, providing a clear answer to
a question without extraneous details. By focusing on the original intent of the
question, you can eliminate such details because the question provides a
benchmark for what is and is not necessary.

A COMBINATION OF MANY DISCIPLINES
Given the complexity of data, using it to provide a meaningful solution requires
insights from diverse fields: statistics, data mining, graphic design, and
information visualization. However, each field has evolved in isolation from the
others.

Thus, visual design—the field of mapping data to a visual form—typically does not address how to handle thousands or tens of thousands of items of data. Data
mining techniques have such capabilities, but they are disconnected from the
means to interact with the data. Software-based information visualization adds
building blocks for interacting with and representing various kinds of abstract
data, but typically these methods undervalue the aesthetic principles of visual
design rather than embrace their strength as a necessary aid to effective
communication. Someone approaching a data representation problem (such as a
scientist trying to visualize the results of a study involving a few thousand pieces
of genetic data) often finds it difficult to choose a representation and wouldn’t
even know what tools to use or books to read to begin.

PROCESS
We must reconcile these fields as parts of a single process. Graphic designers can
learn the computer science necessary for visualization, and statisticians can
communicate their data more effectively by understanding the visual design
principles behind data representation. The methods themselves are not new, but
their isolation within individual fields has prevented them from being used
together. In this book, we use a process that bridges the individual disciplines,
placing the focus and consideration on how data is understood rather than on the
viewpoint and tools of each individual field.

The process of understanding data begins with a set of numbers and a question.
The following steps form a path to the answer:

Acquire

Obtain the data, whether from a file on a disk or a source over a network.

Parse

Provide some structure for the data’s meaning, and order it into categories.

Filter

Remove all but the data of interest.

Mine

Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context.

Represent

Choose a basic visual model, such as a bar graph, list, or tree.

Refine

Improve the basic representation to make it clearer and more visually engaging.

Interact

Add methods for manipulating the data or controlling what features are
visible.

Of course, these steps can’t be followed slavishly. You can expect that they’ll be
involved at one time or another in projects you develop, but sometimes it will be
four of the seven, and at other times all of them.
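The stages can be sketched end to end on a toy data set. Everything below (the readings and the 20-degree threshold) is invented for illustration, not an example from the book:

```python
# A toy walk through the seven stages on a tiny inline data set.
raw = "mon 18.2\ntue 21.5\nwed 19.9\nthu 25.1\nfri 23.4"  # acquire: inline stand-in for a file

rows = [line.split() for line in raw.splitlines()]   # parse: break lines into fields
temps = [(day, float(t)) for day, t in rows]         # parse: tag each field with a type
warm = [(d, t) for d, t in temps if t >= 20.0]       # filter: keep only the data of interest
mean = sum(t for _, t in warm) / len(warm)           # mine: a basic statistic
bars = {d: "#" * round(t) for d, t in warm}          # represent: a crude text bar chart
# refine and interact would adjust scales, colors, and user controls in a real tool
```

Refine and Interact have no real analogue in a six-line script; in an actual tool they are where most of the design work happens.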

Part of the problem with the individual approaches to dealing with data is that the
separation of fields leads to different people each solving an isolated part of the
problem. When this occurs, something is lost at each transition—like a
“telephone game” in which each step of the process diminishes aspects of the
initial question under consideration. The initial format of the data (determined by
how it is acquired and parsed) will often drive how it is considered for filtering or
mining. The statistical method used to glean useful information from the data
might drive the initial presentation. In other words, the final representation
reflects the results of the statistical method rather than a response to the initial
question.

Similarly, a graphic designer brought in at the next stage will most often respond
to specific problems with the representation provided by the previous steps,
rather than focus on the initial question. The visualization step might add a
compelling and interactive means to look at the data filtered from the earlier
steps, but the display is inflexible because the earlier stages of the process are
hidden. Furthermore, practitioners of each of the fields that commonly deal with
data problems are often unclear about how to traverse the wider set of methods
and arrive at an answer.

This book covers the whole path from data to understanding: the transformation
of a jumble of raw numbers into something coherent and useful. The data under
consideration might be numbers, lists, or relationships between multiple entities.

It should be kept in mind that the term visualization is often used to describe the
art of conveying a physical relationship, such as the subway map mentioned near
the start of this chapter. That’s a different kind of analysis and skill from
information visualization, where the data is primarily numeric or symbolic (e.g.,
A, C, G, and T—the letters of genetic code—and additional annotations about
them). The primary focus of this book is information visualization: for instance, a
series of numbers that describes temperatures in a weather forecast rather than
the shape of the cloud cover contributing to them.

An Example
To illustrate the seven steps listed in the previous section, and how they contribute to effective information visualization, let’s look at how the process can
be applied to understanding a simple data set. In this case, we’ll take the zip code
numbering system that the U.S. Postal Service uses. The application is not particularly advanced, but it provides a skeleton for how the process works.
(Chapter 6 contains a full implementation of the project.)
WHAT IS THE QUESTION?
All data problems begin with a question and end with a narrative construct that
provides a clear answer. The Zipdecode project (described further in Chapter 6)
was developed out of a personal interest in the relationship of the zip code
numbering system to geographic areas. Living in Boston, I knew that numbers
starting with a zero denoted places on the East Coast. Having spent time in San
Francisco, I knew the initial numbers for the West Coast were all nines. I grew up
in Michigan, where all our codes were four-prefixed. But what sort of area does
the second digit specify? Or the third?

The finished application was initially constructed in a few hours as a quick way
to take what might be considered a boring data set (a long list of zip codes,
towns, and their latitudes and longitudes) and create something engaging for a
web audience that explained how the codes related to their geography.

Acquire

The acquisition step involves obtaining the data. Like many of the other steps,
this can be either extremely complicated (i.e., trying to glean useful data from a
large system) or very simple (reading a readily available text file).

A copy of the zip code listing can be found on the U.S. Census Bureau web site,
as it is frequently used for geographic coding of statistical data. The listing is a
freely available file with approximately 42,000 lines, one for each of the codes, a
tiny portion of which is shown in Figure 1-1.

Figure 1-1. Zip codes in the format provided by the U.S. Census Bureau

Acquisition concerns how the user downloads your data as well as how you
obtained the data in the first place. If the final project will be distributed over the
Internet, as you design the application, you have to take into account the time
required to download data into the browser. And because data downloaded to the
browser is probably part of an even larger data set stored on the server, you may
have to structure the data on the server to facilitate retrieval of common subsets.
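A rough sketch of the acquisition step in Python: cache the download locally so a large, rarely changing listing is fetched only once. The function and file handling are our own illustration, not the book’s code:

```python
import os
import urllib.request

def acquire(url, cache_path):
    """Fetch the remote file unless a cached copy already exists,
    then return its contents. Caching avoids re-downloading a
    ~42,000-line listing that rarely changes."""
    if not os.path.exists(cache_path):
        urllib.request.urlretrieve(url, cache_path)
    with open(cache_path, encoding="utf-8") as f:
        return f.read()
```

The same structure works whether the source is a local disk or a network server; only the slow branch differs.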

Parse

After you acquire the data, it needs to be parsed—changed into a format that tags
each part of the data with its intended use. Each line of the file must be broken
along its individual parts; in this case, it must be delimited at each tab character.
Then, each piece of data needs to be converted to a useful format. Figure 1-2
shows the layout of each line in the census listing, which we have to understand
to parse it and get out of it what we want.

Figure 1-2. Structure of acquired data

Each field is formatted as a data type that we’ll handle in a conversion program:

String

A set of characters that forms a word or a sentence. Here, the city or town
name is designated as a string. Because the zip codes themselves are not so
much numbers as a series of digits (if they were numbers, the code 02139
would be stored as 2139, which is not the same thing), they also might be
considered strings.

Float

A number with decimal points (used for the latitudes and longitudes of each
location). The name is short for floating point, from programming
nomenclature that describes how the numbers are stored in the computer’s
memory.

Character

A single letter or other symbol. In this data set, a character sometimes designates special post offices.

Integer
A number without a fractional portion, and hence no decimal points (e.g.,
−14, 0, or 237).

Index
Data (commonly an integer or string) that maps to a location in another table
of data. In this case, the index maps numbered codes to the names and two-
digit abbreviations of states. This is common in databases, where such an
index is used as a pointer into another table, sometimes as a way to compact
the data further (e.g., a two-digit code requires less storage than the full name
of the state or territory).

With the completion of this step, the data is successfully tagged and consequently
more useful to a program that will manipulate or represent it in some way.
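As a concrete sketch of this step (in Python rather than the Processing code used later in the book), the following parses one tab-delimited line into typed fields. The field order and names here (code, city, state, latitude, longitude) are assumptions for illustration, not the exact census layout:

```python
# Illustrative sketch, not the book's Processing code: parse one
# tab-delimited line into typed fields. The field order is an
# assumption for demonstration.
def parse_line(line):
    code, city, state, lat, lon = line.rstrip("\n").split("\t")
    return {
        "code": code,        # string: leading zeros matter (02139 is not 2139)
        "city": city,        # string: city or town name
        "state": state,      # two-letter state abbreviation (an index type)
        "lat": float(lat),   # float: latitude
        "lon": float(lon),   # float: longitude
    }

record = parse_line("02139\tCambridge\tMA\t42.3647\t-71.1042")
```

Keeping the code as a string preserves its leading zero, exactly the distinction drawn above between a series of digits and a number.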

Filter

The next step involves filtering the data to remove portions not relevant to our
use. In this example, for the sake of keeping it simple, we’ll be focusing on the
contiguous 48 states, so the records for cities and towns that are not part of those
states—Alaska, Hawaii, and territories such as Puerto Rico—are removed.
Another project could require significant mathematical work to place the data
into a mathematical model or normalize it (convert it to an acceptable range of
numbers).
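A minimal sketch of this filtering step (Python; the exclusion set is deliberately abbreviated for illustration, and the two-letter codes are standard postal abbreviations):

```python
# Sketch: keep only records in the contiguous 48 states. The exclusion
# set below is abbreviated for illustration; a real list would cover
# every non-contiguous state and territory code present in the data.
EXCLUDED = {"AK", "HI", "PR", "VI", "GU", "AS", "MP"}

def contiguous_only(records):
    return [r for r in records if r["state"] not in EXCLUDED]

sample = [
    {"code": "02139", "state": "MA"},
    {"code": "99501", "state": "AK"},   # Anchorage: filtered out
]
kept = contiguous_only(sample)
```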

Mine

This step involves math, statistics, and data mining. The data in this case receives
only a simple treatment: the program must figure out the minimum and
maximum values for latitude and longitude by running through the data (as
shown in Figure 1-3) so that it can be presented on a screen at a proper scale.
Most of the time, this step will be far more complicated than a pair of simple
math operations.

Figure 1-3. Mining the data: just compare values to find the minimum and
maximum
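In code, this mining step is a single pass over the records (a Python sketch, assuming each record carries latitude and longitude floats from the parse step):

```python
# Sketch: Zipdecode's "mining" is one pass over the data to find the
# bounding box of the latitudes and longitudes.
def bounds(records):
    lats = [r["lat"] for r in records]
    lons = [r["lon"] for r in records]
    return (min(lats), max(lats), min(lons), max(lons))

sample = [{"lat": 42.4, "lon": -71.1}, {"lat": 37.8, "lon": -122.4}]
box = bounds(sample)   # (37.8, 42.4, -122.4, -71.1)
```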

Represent

This step determines the basic form that a set of data will take. Some data sets are
shown as lists, others are structured like trees, and so forth. In this case, each zip
code has a latitude and longitude, so the codes can be mapped as a two-
dimensional plot, with the minimum and maximum values for the latitude and
longitude used for the start and end of the scale in each dimension. This is
illustrated in Figure 1-4.

Figure 1-4. Basic visual representation of zip code data
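A sketch of that mapping (Python; the axis flip reflects screen y growing downward while latitude grows northward):

```python
# Sketch: linear mapping from the data's bounding box to pixels.
# Screen y grows downward while latitude grows northward, so the
# latitude axis is flipped.
def to_screen(lat, lon, box, width, height):
    min_lat, max_lat, min_lon, max_lon = box
    x = (lon - min_lon) / (max_lon - min_lon) * width
    y = (max_lat - lat) / (max_lat - min_lat) * height
    return (x, y)

box = (37.8, 42.4, -122.4, -71.1)
northwest = to_screen(42.4, -122.4, box, 800, 500)  # top-left corner: (0.0, 0.0)
```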

The Represent stage is a linchpin that informs the single most important decision
in a visualization project and can make you rethink earlier stages. How you
choose to represent the data can influence the very first step (what data you
acquire) and the third step (what particular pieces you extract).

Refine

In this step, graphic design methods are used to further clarify the representation
by calling more attention to particular data (establishing hierarchy) or by
changing attributes (such as color) that contribute to readability.

Hierarchy is established in Figure 1-5, for instance, by coloring the background deep gray and displaying the selected points (all codes beginning with four) in white and the deselected points in medium yellow.

Figure 1-5. Using color to refine the representation

Interact

The next stage of the process adds interaction, letting the user control or explore
the data. Interaction might cover things like selecting a subset of the data or
changing the viewpoint. As another example of a stage affecting an earlier part of
the process, this stage can also affect the refinement step, as a change in
viewpoint might require the data to be designed differently.

In the Zipdecode project, typing a number selects all zip codes that begin with
that number. Figure 1-6 and Figure 1-7 show all the zip codes beginning with
zero and nine, respectively.

Figure 1-6. The user can alter the display through choices (zip codes
starting with 0)

Figure 1-7. The user can alter the display through choices (zip codes
starting with 9)

Another enhancement to user interaction (not shown here) enables the users to
traverse the display laterally and run through several of the prefixes. After typing
part or all of a zip code, holding down the Shift key allows users to replace the
last number typed without having to hit the Delete key to back up.

Typing is a very simple form of interaction, but it allows the user to rapidly gain
an understanding of the zip code system’s layout. Just contrast this sample
application with the difficulty of deducing the same information from a table of
zip codes and city names.
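The selection logic itself is tiny: because zip codes are stored as strings, "all codes beginning with the typed digits" is a prefix test (an illustrative Python sketch, not the book's Processing code):

```python
# Sketch: zip codes stored as strings make "all codes beginning with
# the typed digits" a simple prefix test, as in Zipdecode.
def select(records, typed):
    return [r for r in records if r["code"].startswith(typed)]

codes = [{"code": "02139"}, {"code": "02115"}, {"code": "94110"}]
boston_area = select(codes, "02")    # both 021xx codes
cambridge = select(codes, "02139")   # narrows to a single code
```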

The viewer can continue to type digits to see the area covered by each subsequent
set of prefixes. Figure 1-8 shows the region highlighted by the two digits 02,
Figure 1-9 shows the three digits 021, and Figure 1-10 shows the four digits
0213. Finally, Figure 1-11 shows what you get by entering a full zip code, 02139
—a city name pops up on the display.

Figure 1-8. Honing in with two digits (02)

Figure 1-9. Honing in with three digits (021)

In addition, users can enable a “zoom” feature that draws them closer to each
subsequent digit, revealing more detail around the area and showing a constant
rate of detail at each level. Because we’ve chosen a map as a representation, we
could add more details of state and county boundaries or other geographic
features to help viewers associate the “data” space of zip code points with what
they know about the local environment.

Figure 1-10. Honing in further with four digits (0213)

Figure 1-11. Honing in even further with the full zip code (02139)

Iteration and Combination


Figure 1-12 shows the stages in order and demonstrates how later decisions
commonly reflect on earlier stages. Each step of the process is inextricably linked
because of how the steps affect one another. In the Zipdecode application, for
instance:

The need for a compact representation on the screen led me to refilter the
data to include only the contiguous 48 states.

The representation step affected acquisition because after I developed the application I modified it so it could show data that was downloaded over a slow Internet connection to the browser. My change to the structure of the data allows the points to appear slowly, as they are first read from the data file, employing the data itself as a “progress bar.”

Interaction by typing successive numbers meant that the colors had to be modified in the visual refinement step to show a slow transition as points in the display are added or removed. This helps the user maintain context by preventing the updates on-screen from being too jarring.

Figure 1-12. Interactions between the seven stages

The connections between the steps in the process illustrate the importance of the
individual or team in addressing the project as a whole. This runs counter to the
common fondness for assembly-line style projects, where programmers handle
the technical portions, such as acquiring and parsing data, and visual designers
are left to choose colors and typefaces. At the intersection of these fields is a
more interesting set of properties that demonstrates their strength in combination.

When acquiring data, consider how it can change, whether sporadically (such as
once a month) or continuously. This expands the notion of graphic design that’s
traditionally focused on solving a specific problem for a specific data set, and
instead considers the meta-problem of how to handle a certain kind of data that
might be updated in the future.

In the filtering step, data can be filtered in real time, as in the Zipdecode application. During visual refinement, changes to the design can be applied across the entire system. For instance, a color change can be automatically applied to the thousands of elements that require it, rather than having to make such a tedious modification by hand. This is the strength of a computational approach, where tedious processes are minimized through automation.
Principles
I’ll finish this general introduction to visualization by laying out some ways of
thinking about data and its representation that have served me well over many
years and many diverse projects. They may seem abstract at first, or of minor
importance to the job you’re facing, but I urge you to return and reread them as
you practice visualization; they just may help you in later tasks.

EACH PROJECT HAS UNIQUE REQUIREMENTS


A visualization should convey the unique properties of the data set it represents.
This book is not concerned with providing a handful of ready-made
“visualizations” that can be plugged into any data set. Ready-made visualizations
can help produce a quick view of your data set, but they’re inflexible commodity
items that can be implemented in packaged software. Any bar chart or scatter plot
made with Excel will look like a bar chart or scatter plot made with Excel.
Packaged solutions can provide only packaged answers, like a pull-string toy that
is limited to a handful of canned phrases, such as “Sales show a slight increase in
each of the last five years!” Every problem is unique, so capitalize on that
uniqueness to solve the problem.

Chapters in this book are divided by types of data, rather than types of display. In
other words, we’re not saying, “Here’s how to make a bar graph,” but “Here are
several ways to show a correlation.” This gives you a more powerful way to
think about maximizing what can be said about the data set in question.

I’m often asked for a library of tools that will automatically make attractive
representations of any given data set. But if each data set is different, the point of
visualization is to expose that fascinating aspect of the data and make it self-
evident. Although readily available representation toolkits are useful starting
points, they must be customized during an in-depth study of the task.

Data is often stored in a generic format. For instance, databases used for
annotation of genomic data might consist of enormous lists of start and stop
positions, but those lists vary in importance depending on the situation in which
they’re being used. We don’t view books as long abstract sequences of words, yet
when it comes to information, we’re often so taken with the enormity of the
information and the low-level abstractions used to store it that the narrative is
lost. Unless you stop thinking about databases, everything looks like a table—
millions of rows and columns to be stored, queried, and viewed.

In this book, we use a small collection of simple helper classes as starting points.
Often, we’ll be targeting the Web as a delivery platform, so the classes are
designed to take up minimal time for download and display. But I will also
discuss more robust versions of similar tools that can be used for more in-depth
work.

This book aims to help you learn to understand data as a tool for human decision-
making—how it varies, how it can be used, and how to find what’s unique about
your data set. We’ll cover many standard methods of visualization and give you
the background necessary for making a decision about what sort of representation
is suitable for your data. For each representation, we consider its positive and
negative points and focus on customizing it so that it’s best suited to what you’re
trying to convey about your data set.

AVOID THE ALL-YOU-CAN-EAT BUFFET


Often, less detail will actually convey more information because the inclusion of
overly specific details causes the viewer to miss what’s most important or
disregard the image entirely because it’s too complex. Use as little data as
possible, no matter how precious it seems.

Consider a weather map, with curved bands of temperatures across the country.
The designers avoid giving each band a detailed edge (particularly because the
data is often fuzzy). Instead, they convey a broader pattern in the data.

Subway maps leave out the details of surface roads because the additional detail
adds more complexity to the map than necessary. Before maps were created in
Beck’s style, it seemed that knowing street locations was essential to navigating
the subway. Instead, individual stations are used as waypoints for direction
finding. The important detail is that your target destination is near a particular
station. Directions can be given in terms of the last few turns to be taken after
you exit the station, or you can consult a map posted at the station that describes
the immediate area aboveground.

It’s easy to collect data, and some people become preoccupied with simply
accumulating more complex data or data in mass quantities. But more data is not
implicitly better, and often serves to confuse the situation. Just because it can be
measured doesn’t mean it should. Perhaps making things simple is worth
bragging about, but making complex messes is not. Find the smallest amount of
data that can still convey something meaningful about the contents of the data
set. As with Beck’s underground map, focusing on the question helps define
those minimum requirements.

The same holds for the many “dimensions” that are found in data sets. Web site
traffic statistics have many dimensions: IP address, date, time of day, page
visited, previous page visited, result code, browser, machine type, and so on.
While each of these might be examined in turn, they relate to distinct questions.
Only a few of the variables are required to answer a typical question, such as
“How many people visited page x over the last three months, and how has that
figure changed each month?” Avoid trying to show a burdensome
multidimensional space that maps too many points of information.

KNOW YOUR AUDIENCE


Finally, who is your audience? What are their goals when approaching a
visualization? What do they stand to learn? Unless it’s accessible to your
audience, why are you doing it? Making things simple and clear doesn’t mean
assuming that your users are idiots and “dumbing down” the interface for them.

In what way will your audience use the piece? A mapping application used on a mobile device has to be designed with a completely different set of criteria than one used on a desktop computer. Although both applications use maps, they have little to do with each other. The focus of the desktop application may be finding locations and printing maps, whereas the focus of the mobile version is actively following the directions to a particular location.
Onward
In this chapter, we covered the process for attacking the common modern
problems of having too much data and having data that changes. In the next
chapter, we’ll discuss Processing, the software tool used to handle data sets in
this book.


Chapter 2 What: Data Abstraction - Visualization Analysis and Design https://learning.oreilly.com/library/view/visualization-analysis-and/9781...

Chapter 2

Figure 2.1 shows the abstract types of what can be visualized. The four basic dataset types are tables, networks, fields, and geometry; other possible collections of items include clusters, sets, and lists. These datasets are made up of different combinations of the five data types: items, attributes, links, positions, and grids. For any of these dataset types, the full dataset could be available immediately in the form of a static file, or it might be dynamic data processed gradually in the form of a stream. The type of an attribute can be categorical or ordered, with a further split into ordinal and quantitative. The ordering direction of attributes can be sequential, diverging, or cyclic.

1 of 18 7/22/2020, 8:04 PM

Figure 2.1.

What can be visualized: data, datasets, and attributes.

Many aspects of vis design are driven by the kind of data that you have at your
disposal. What kind of data are you given? What information can you figure out
from the data, versus the meanings that you must be told explicitly? What high-
level concepts will allow you to split datasets apart into general and useful
pieces?

Suppose that you see the following data:

14, 2.6, 30, 30, 15, 100001

What does this sequence of six numbers mean? You can’t possibly know yet,
without more information about how to interpret each number. Is it locations for
two points far from each other in three-dimensional space, 14, 2.6, 30 and 30, 15,
100001? Is it two points closer to each other in two-dimensional space, 14, 2.6
and 30, 30, with the fifth number meaning that there are 15 links between these
two points, and the sixth number assigning the weight of ‘100001’ to that link?

Similarly, suppose that you see the following data:

Basil, 7, S, Pear

These numbers and words could have many possible meanings. Maybe a food
shipment of produce has arrived in satisfactory condition on the 7th day of the
month, containing basil and pears. Maybe the Basil Point neighborhood of the
city has had 7 inches of snow cleared by the Pear Creek Limited snow removal service.


To move beyond guesses, you need to know two crosscutting pieces of information about these terms: their semantics and their types. The semantics of the data is its real-world meaning. For instance, does a word represent a human first name, or is it the shortened version of a company name where the full name can be looked up in an external list, or is it a city, or is it a fruit? Does a number represent a day of the month, or an age, or a measurement of height, or a unique code for a specific person, or a postal code for a neighborhood, or a position in space?

The type of the data is its structural or mathematical interpretation. At the data
level, what kind of thing is it: an item, a link, an attribute? At the dataset level,
how are these data types combined into a larger structure: a table, a tree, a field
of sampled values? At the attribute level, what kinds of mathematical operations
are meaningful for it? For example, if a number represents a count of boxes of
detergent, then its type is a quantity, and adding two such numbers together
makes sense. If the number represents a postal code, then its type is a code rather
than a quantity—it is simply the name for a category that happens to be a number
rather than a textual name. Adding two of these numbers together does not make
sense.
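This distinction can be made concrete in code: storing a postal code as a string rather than a number both preserves leading zeros and keeps accidental arithmetic from masquerading as a quantity (an illustrative Python sketch, not from the book):

```python
# Illustrative sketch: a numeric quantity supports arithmetic, while a
# numeric code such as a postal code is a categorical label, so only
# equality tests make sense. Storing codes as strings enforces this.
boxes_on_shelf = 14
boxes_in_back = 7
total_boxes = boxes_on_shelf + boxes_in_back   # meaningful: 21 boxes

home_zip = "02139"     # kept as a string, preserving the leading zero
work_zip = "02139"
same_area = home_zip == work_zip               # meaningful comparison
# home_zip + work_zip would concatenate, not add: a signal that
# arithmetic on codes is nonsense.
```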

Table 2.1 shows several more lines of the same dataset. This simple example table is tiny, with only ten rows and five columns. The exact semantics should be provided by the creator of the dataset; I give it with the column titles. In this case, each person has a unique identifier, a name, an age, a shirt size, and a favorite fruit.


Table 2.1.

A full table with column titles that provide the intended semantics of the attributes.

ID   Name      Age   Shirt Size   Favorite Fruit
1    Amy       8     S            Apple
2    Basil     7     S            Pear
3    Clara     9     M            Durian
4    Desmond   13    L            Elderberry
5    Ernest    12    L            Peach
6    Fanny     10    S            Lychee
7    George    9     M            Orange
8    Hector    8     L            Loquat
9    Ida       10    M            Pear
10   Amy       12    M            Orange

Sometimes types and semantics can be correctly inferred simply by observing the
syntax of a data file or the names of variables within it, but often they must be
provided along with the dataset in order for it to be interpreted correctly.
Sometimes this kind of additional information is called metadata; the line between data and metadata is not clear.


The classification below presents a way to think about dataset and attribute types
and semantics in a way that is general enough to cover the cases interesting in
vis, yet specific enough to be helpful for guiding design choices at the abstraction
and idiom levels.

Figure 2.2 shows the five basic data types discussed in this book: items, attributes, links, positions, and grids. An attribute is some specific property that can be measured, observed, or logged.* For example, attributes could be salary, price, number of sales, protein expression levels, or temperature. An item is an individual entity that is discrete, such as a row in a simple table or a node in a network. For example, items may be people, stocks, coffee shops, genes, or cities. A link is a relationship between items, typically within a network. A grid specifies the strategy for sampling continuous data in terms of both geometric and topological relationships between its cells. A position is spatial data, providing a location in two-dimensional (2D) or three-dimensional (3D) space. For example, a position might be a latitude–longitude pair describing a location on the Earth’s surface or three numbers specifying a location within the region of space measured by a medical scanner.

Figure 2.2.

The five basic data types: items, attributes, links, positions, and grids.

* Synonyms for attribute are variable and data dimension, or just dimension
for short. Since dimension has many meanings, in this book it is reserved for the
visual channels of spatial position as discussed in Section 6.3.

A dataset is any collection of information that is the target of analysis.* The four basic dataset types are tables, networks, fields, and geometry. Other ways to group items together include clusters, sets, and lists. In real-world situations, complex combinations of these basic types are common.

* The word dataset is singular. In vis the word data is commonly used as a singular mass noun as well, in contrast to the traditional usage in the natural sciences where data is plural.

Figure 2.3 shows that these basic dataset types arise from combinations of the
data types of items, attributes, links, positions, and grids.


Figure 2.3.

The four basic dataset types are tables, networks, fields, and geometry; other possible collections of items are clusters, sets, and lists. These datasets are made up of five core data types: items, attributes, links, positions, and grids.

Figure 2.4 shows the internal structure of the four basic dataset types in detail.
Tables have cells indexed by items and attributes, for either the simple flat case
or the more complex multidimensional case. In a network, items are usually
called nodes, and they are connected with links; a special case of networks is
trees. Continuous fields have grids based on spatial positions where cells contain
attributes. Spatial geometry has only position information.

Figure 2.4.

The detailed structure of the four basic dataset types.

Many datasets come in the form of tables that are made up of rows and columns, a familiar form to anybody who has used a spreadsheet. In this chapter, I focus on the concept of a table as simply a type of dataset that is independent of any particular visual representation; later chapters address the question of what visual representations are appropriate for the different types of datasets.

▶ Chapter 7 covers how to arrange tables spatially.

For a simple flat table, the terms used in this book are that each row represents
an item of data, and each column is an attribute of the dataset. Each cell in the
table is fully specified by the combination of a row and a column—an item and
an attribute—and contains a value for that pair. Figure 2.5 shows an example of
the first few dozen items in a table of orders, where the attributes are order ID,
order date, order priority, product container, product base margin, and ship date.


Figure 2.5.

In a simple table of orders, a row represents an item, a column represents an attribute, and their intersection is the cell containing the value for that pairwise combination.
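One way to see this structure in code is to model a flat table as a list of item records, so that a cell is reached by an (item, attribute) pair (a Python sketch using rows from Table 2.1):

```python
# Sketch: a flat table as a list of item records, each mapping an
# attribute name (column) to a value; a cell is fully specified by an
# (item, attribute) pair. Rows taken from Table 2.1.
table = [
    {"ID": 2, "Name": "Basil", "Age": 7, "Shirt Size": "S", "Favorite Fruit": "Pear"},
    {"ID": 3, "Name": "Clara", "Age": 9, "Shirt Size": "M", "Favorite Fruit": "Durian"},
]
cell = table[0]["Favorite Fruit"]   # the value for item Basil, attribute Favorite Fruit
```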

A multidimensional table has a more complex structure for indexing into a cell,
with multiple keys.

▶ Keys and values are discussed further in Section 2.6.1.

The dataset type of networks is well suited for specifying that there is some kind of relationship between two or more items.* An item in a network is often called a node.* A link is a relation between two items.* For example, in an articulated social network the nodes are people, and links mean friendship. In a gene interaction network, the nodes are genes, and links between them mean that these genes have been observed to interact with each other. In a computer network, the nodes are computers, and the links represent the ability to send messages directly between two computers using physical cables or a wireless connection.

* A synonym for networks is graphs. The word graph is also deeply overloaded in vis. Sometimes it is used to mean network as we discuss here, for instance in the vis subfield called graph drawing or the mathematical subfield called graph theory. Sometimes it is used in the field of statistical graphics to mean chart, as in bar graphs and line graphs.

* A synonym for node is vertex.

* A synonym for link is edge.

Network nodes can have associated attributes, just like items in a table. In addition, the links themselves could also be considered to have attributes associated with them; these may be partly or wholly disjoint from the node attributes.
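A minimal data-structure sketch of this idea (Python; the gene names and attribute values are invented for illustration):

```python
# Sketch: a network as separate node and link collections. Nodes and
# links each carry their own attributes, which may be partly or wholly
# disjoint. All names and values here are invented for illustration.
nodes = {
    "gene_a": {"expression_level": 3.2},
    "gene_b": {"expression_level": 1.1},
}
links = [
    ("gene_a", "gene_b", {"observed_interactions": 4}),
]
strength = links[0][2]["observed_interactions"]
```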

It is again important to distinguish between the abstract concept of a network and any particular visual layout of that network, where the nodes and edges have particular spatial positions.

Networks with hierarchical structure are more specifically called trees. In contrast to a general network, trees do not have cycles: each child node has only one parent node pointing to it. One example of a tree is the organization chart of a company, showing who reports to whom; another example is a tree showing the evolutionary relationships between species in the biological tree of life, where the child nodes of humans and monkeys both share the same parent node of primates.

The field dataset type also contains attribute values associated with cells.1 Each cell in a field contains measurements or calculations from a continuous domain: there are conceptually infinitely many values that you might measure, so you could always take a new measurement between any two existing ones. Continuous phenomena that might be measured in the physical world or simulated in software include temperature, pressure, speed, force, and density; mathematical functions can also be continuous.

For example, consider a field dataset representing a medical scan of a human body containing measurements indicating the density of tissue at many sample points, spread regularly throughout a volume of 3D space. A low-resolution scan would have 262,144 cells, providing information about a cubical volume of space with 64 bins in each direction. Each cell is associated with a specific region in 3D space. The density measurements could be taken closer together with a higher resolution grid of cells, or further apart for a coarser grid.

Continuous data requires careful treatment that takes into account the mathematical questions of sampling, how frequently to take the measurements, and interpolation, how to show values in between the sampled points in a way that does not mislead. Interpolating appropriately between the measurements allows you to reconstruct a new view of the data from an arbitrary viewpoint that’s faithful to what you measured. These general mathematical problems are studied in areas such as signal processing and statistics. Visualizing fields requires grappling extensively with these concerns.
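The simplest instance of interpolation is linear interpolation between two neighboring samples; real field visualization uses more careful signal-processing machinery, but this Python sketch shows the basic reconstruction idea:

```python
# Sketch: linear interpolation reconstructs a value between two
# sampled cells of a continuous field; t is the fraction of the way
# from the first sample to the second.
def lerp(v0, v1, t):
    return v0 + (v1 - v0) * t

# density halfway between two neighboring samples
midpoint = lerp(0.2, 0.6, 0.5)
```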

In contrast, the table and network datatypes discussed above are examples of discrete data where a finite number of individual items exist, and interpolation between them is not a meaningful concept. In the cases where a mathematical framework is necessary, areas such as graph theory and combinatorics provide relevant ideas.2

Continuous data is often found in the form of a spatial field, where the cell structure of the field is based on sampling at spatial positions. Most datasets that contain inherently spatial data occur in the context of tasks that require understanding aspects of its spatial structure, especially shape.

For example, with a spatial field dataset that is generated with a medical imaging instrument, the user’s task could be to locate suspected tumors that can be recognized through distinctive shapes or densities.


where the task is to compare the flow patterns in different regions. One possible
visual encoding would use the geometry of the wing as the spatial substrate,
showing the temperature and pressure using size-coded arrows.

The likely tasks faced by users who have spatial field data constrain many of the choices about the use of space when designing visual encoding idioms. Many of the choices for nonspatial data, where no information about spatial position is provided with the dataset, are unsuitable in this case.*

* A synonym for nonspatial data is abstract data.

Thus, the question of whether a dataset has the type of a spatial field or a nonspatial table has extensive and far-reaching implications for idiom design. Historically, vis diverged into areas of specialization based on this very differentiation. The subfield of scientific visualization, or scivis for short, is concerned with situations where spatial position is given with the dataset. A central concern in scivis is handling continuous data appropriately within the mathematical framework of signal processing. The subfield of information visualization, or infovis for short, is concerned with situations where the use of space in a visual encoding is chosen by the designer. A central concern in infovis is determining whether the chosen idiom is suitable for the combination of data and task, leading to the use of methods from human–computer interaction and design.

When a field contains data created by sampling at completely regular intervals,
as in the previous example, the cells form a uniform grid. There is no need to
explicitly store the grid geometry in terms of its location in space, or the grid
topology in terms of how each cell connects with its neighboring cells. More
complicated examples require storing different amounts of geometric and topo‐
logical information about the underlying grid. A rectilinear grid supports non‐
uniform sampling, allowing efficient storage of information that has high com‐
plexity in some areas and low complexity in others, at the cost of storing some
information about the geometric location of each row.
lows curvilinear shapes, where the geometric location of each cell needs to be
specified. Finally, unstructured grids provide complete flexibility, but the topo‐
logical information about how the cells connect to each other must be stored ex‐
plicitly in addition to their spatial positions.
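
The escalating storage costs across these grid types can be sketched with a few minimal Python classes. These classes and field names are purely illustrative, not from any particular visualization library:

```python
class UniformGrid:
    """Geometry fully implicit: origin + spacing reconstruct any cell center."""
    def __init__(self, origin, spacing, shape):
        self.origin, self.spacing, self.shape = origin, spacing, shape

    def cell_center(self, i, j):
        return (self.origin[0] + i * self.spacing[0],
                self.origin[1] + j * self.spacing[1])

class RectilinearGrid:
    """Nonuniform sampling: explicit coordinates per axis, topology still implicit."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def cell_center(self, i, j):
        return (self.xs[i], self.ys[j])

class UnstructuredGrid:
    """Full flexibility: every point position and every cell's connectivity
    must be stored explicitly."""
    def __init__(self, points, cells):
        self.points = points   # list of (x, y) positions
        self.cells = cells     # each cell lists the indices of its points

uni = UniformGrid(origin=(0.0, 0.0), spacing=(1.0, 2.0), shape=(10, 10))
rect = RectilinearGrid(xs=[0.0, 0.1, 0.5, 4.0], ys=[0.0, 1.0, 10.0])
```

The rectilinear grid's coordinate lists show the efficiency claim directly: tightly spaced samples where complexity is high, wide gaps where it is low.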

The geometry dataset type specifies information about the shape of items with
explicit spatial positions. The items could be points, or one-dimensional lines or
curves, or 2D surfaces or regions, or 3D volumes.

Geometry datasets are intrinsically spatial, and like spatial fields they typically
occur in the context of tasks that require shape understanding. Spatial data often
includes hierarchical structure at multiple scales. Sometimes this structure is pro‐
vided intrinsically with the dataset, or a hierarchy may be derived from the origi‐
nal data.

Geometry datasets do not necessarily have attributes, in contrast to the other
three basic dataset types. Many of the design issues in vis pertain to questions


the task at hand from raw geographic data, such as the boundaries of a forest or a
city or a country, or the curve of a road. The problem of how to create images
from a geometric description of a scene falls into another domain: computer
graphics. While vis draws on algorithms from computer graphics, it has different
concerns from that domain. Simply showing a geometric dataset is not an inter‐
esting problem from the point of view of a vis designer.

▶ Section 3.4.2.3 covers deriving data.

▶ Section 8.4 covers generating contours from scalar fields.

Geometric data is sometimes shown alone, particularly when shape understand‐
ing is the primary task. In other cases, it is the backdrop against which additional
information is overlaid.

Beyond tables, there are many ways to group multiple items together, including
sets, lists, and clusters. A set is simply an unordered group of items. A group of
items with a specified ordering could be called a list. * A cluster is a grouping
based on attribute similarity, where items within a cluster are more similar to
each other than to ones in another cluster.

* In computer science, array is often used as a synonym for list.

There are also more complex structures built on top of the basic network type. A
path through a network is an ordered set of segments formed by links connecting
nodes. A compound network is a network with an associated tree: all of the
nodes in the network are the leaves of the tree, and interior nodes in the tree pro‐
vide a hierarchical structure for the nodes that is different from network links be‐
tween them.
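
A compound network can be sketched in a few lines of Python; the node names and grouping here are made up for illustration:

```python
# Network links connect the leaf nodes directly.
network_links = {("a", "b"), ("b", "c"), ("c", "d")}

# A separate tree (parent -> children) groups the same leaves hierarchically;
# its interior nodes ("group1", "group2") do not appear in the network at all.
tree = {
    "root": ["group1", "group2"],
    "group1": ["a", "b"],
    "group2": ["c", "d"],
}

def leaves(tree, node):
    """Collect the leaf nodes beneath a tree node."""
    kids = tree.get(node)
    if not kids:
        return [node]
    out = []
    for k in kids:
        out.extend(leaves(tree, k))
    return out
```

The key property is that the tree's leaves are exactly the network's nodes, while the tree's structure is independent of the link structure.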

Many other kinds of data either fit into one of the previous categories or do so af‐
ter transformations to create derived attributes. Complex and hybrid combina‐
tions, where the complete dataset contains multiple basic types, are common in
real-world applications.

The set of basic types presented above is a starting point for describing the what
part of an analysis instance that pertains to data; that is, the data abstraction. In
simple cases, it may be possible to describe your data abstraction using only that
set of terms. In complex cases, you may need additional description as well. If
so, your goal should be to translate domain-specific terms into words that are as
generic as possible.

Figure 2.6 shows the two kinds of dataset availability: static or dynamic.

Figure 2.6.


The default approach to vis assumes that the entire dataset is available all at once,
as a static file. However, some datasets are instead dynamic streams, where the
dataset information trickles in over the course of the vis session. * One kind of
dynamic change is to add new items or delete previous items. Another is to
change the values of existing items.

* A synonym for dynamic is online, and a synonym for static is offline.

This distinction in availability crosscuts the basic dataset types: any of them can
be static or dynamic. Designing for streaming data adds complexity to many as‐
pects of the vis process that are straightforward when there is complete dataset
availability up front.
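
The two kinds of dynamic change can be shown on a toy item store (the identifiers and values are made up):

```python
# Hypothetical item store for a streaming dataset: id -> value.
items = {"a": 1, "b": 2}

items["c"] = 3   # one kind of change: a new item arrives in the stream
del items["a"]   # ...or a previous item is deleted
items["b"] = 20  # the other kind: an existing item's value changes
```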

Figure 2.7 shows the attribute types. The major distinction is between categorical
versus ordered. Within the ordered type is a further differentiation between ordi‐
nal versus quantitative. Ordered data might range sequentially from a minimum
to a maximum value, or it might diverge in both directions from a zero point in
the middle of a range, or the values may wrap around in a cycle. Also, attributes
may have hierarchical structure.

Figure 2.7.

Attribute types are categorical, ordinal, or quantitative. The direction of attribute
ordering can be sequential, diverging, or cyclic.

The first distinction is between categorical and ordered data. The type of cate‐
gorical data, such as favorite fruit or names, does not have an implicit ordering,
but it often has hierarchical structure. * Categories can only distinguish
whether two things are the same (apples) or different (apples versus oranges). Of
course, any arbitrary external ordering can be imposed upon categorical data.
Fruit could be ordered alphabetically according to its name, or by its price—but
only if that auxiliary information were available. However, these orderings are
not implicit in the attribute itself, the way they are with quantitative or ordered
data. Other examples of categorical attributes are movie genres, file types, and
city names.
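
The fruit example can be made concrete in a few lines of Python; the prices are invented auxiliary data:

```python
fruits = ["pear", "apple", "orange"]
prices = {"apple": 1.20, "orange": 0.95, "pear": 1.50}  # auxiliary attribute

# Two arbitrary external orderings imposed on the same categorical attribute:
by_name = sorted(fruits)                   # alphabetical by name
by_price = sorted(fruits, key=prices.get)  # ordering borrowed from prices
```

Neither ordering is implicit in the fruit names themselves; each depends on information outside the categorical attribute.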


All ordered data does have an implicit ordering, as opposed to unordered cate‐
gorical data. This type can be further subdivided. With ordinal data, such as shirt
size, we cannot do full-fledged arithmetic, but there is a well-defined ordering.
For example, large minus medium is not a meaningful concept, but we know that
medium falls between small and large. Rankings are another kind of ordinal data;
some examples are top-ten lists of movies or initial lineups for
sports tournaments based on past performance.
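
The shirt-size example can be sketched in Python; the explicit rank table is a hypothetical device for making the ordering computable:

```python
# Ordinal data: a well-defined ordering, but no meaningful arithmetic.
rank = {"small": 0, "medium": 1, "large": 2}

def between(lo, x, hi):
    """True if x falls between lo and hi in the shirt-size ordering."""
    return rank[lo] <= rank[x] <= rank[hi]

# "large - medium" is not a meaningful quantity, but betweenness is well defined.
```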

A subset of ordered data is quantitative data, namely, a measurement of magni‐
tude that supports arithmetic comparison. For example, the quantity of 68 inches
minus 42 inches is a meaningful concept, and the answer of 26 inches can be cal‐
culated. Other examples of quantitative data are height, weight, temperature,
stock price, number of calling functions in a program, and number of drinks sold
at a coffee shop in a day. Both integers and real numbers are quantitative data.

In this book, the ordered type is used often; the ordinal type is only occasionally
mentioned, when the distinction between it and the quantitative type matters.

Ordered data can be either sequential, where there is a homogeneous range from
a minimum to a maximum value, or diverging, which can be deconstructed into
two sequences pointing in opposite directions that meet at a common zero point.
For instance, a mountain height dataset is sequential, when measured from a min‐
imum point of sea level to a maximum point of Mount Everest. A bathymetric
dataset is also sequential, with sea level on one end and the lowest point on the
ocean floor at the other. A full elevation dataset would be diverging, where the
values go up for mountains on land and down for undersea valleys, with the zero
value of sea level being the common point joining the two sequential datasets.
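
The elevation example can be sketched as a normalization that treats a diverging attribute as two sequential halves meeting at zero. The function and the elevation values below are illustrative, not from any library:

```python
def diverging_norm(value, vmin, vmax, center=0.0):
    """Map a diverging value to [-1, 1], with the zero point at 0."""
    if value >= center:
        span = vmax - center
        return (value - center) / span if span else 0.0
    span = center - vmin
    return -(center - value) / span if span else 0.0

# Made-up full elevation dataset: undersea valleys are negative, mountains
# positive, with sea level as the common zero point.
elevations = [-11000, -5500, 0, 4424, 8848]
normed = [diverging_norm(v, -11000, 8848) for v in elevations]
```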

Ordered data may be cyclic, where the values wrap around back to a starting
point rather than continuing to increase indefinitely. Many kinds of time mea‐
surements are cyclic, including the hour of the day, the day of the week, and the
month of the year.
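
Cyclic attributes change what even simple operations mean. For example, distance between two hours of the day must wrap around the cycle, as this small sketch shows:

```python
def cyclic_distance(a, b, period=24):
    """Distance between two values on a cycle, e.g. hours of the day."""
    d = abs(a - b) % period
    return min(d, period - d)

# 23:00 and 01:00 are 2 hours apart on the clock, not 22.
```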

There may be hierarchical structure within an attribute or between multiple
attributes. The daily stock prices of companies collected over the course of a
decade is an example of a time-series dataset, where one of the attributes is time.
In this case, time can be aggregated hierarchically, from individual days up to
weeks, up to months, up to years. There may be interesting patterns at multiple
temporal scales, such as very strong weekly variations for weekday versus week‐
end, or more subtle yearly patterns showing seasonal variations in summer versus
winter. Many kinds of attributes might have this sort of hierarchical structure: for
example, the geographic attribute of a postal code can be aggregated up to the
level of cities or states or entire countries.
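
One level of this hierarchical aggregation, from days up to months, can be sketched with the standard library alone; the dates and prices are made up:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical daily stock prices.
daily = {
    date(2020, 1, 10): 100.0, date(2020, 1, 20): 110.0,
    date(2020, 2, 5): 90.0,   date(2020, 2, 25): 94.0,
}

# Aggregate each day's value up to its (year, month) bucket.
monthly = defaultdict(list)
for d, price in daily.items():
    monthly[(d.year, d.month)].append(price)

monthly_mean = {ym: mean(vals) for ym, vals in monthly.items()}
```

The same pattern repeats at each level of the hierarchy: days into weeks, weeks into months, months into years.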

▶ Section 13.4 covers hierarchical aggregation in more detail, and Section 7.5
covers the visual encoding of attribute hierarchies.


these two questions are crosscutting: one does not dictate the other. Different
approaches to considering the semantics of attributes have been proposed
across the many fields where these semantics are studied. The classification in
this book is heavily focused on the semantics of keys versus values, and the re‐
lated questions of spatial and continuous data versus nonspatial and discrete data,
to match up with the idiom design choice analysis framework. One additional
consideration is whether an attribute is temporal.

A key attribute acts as an index that is used to look up value attributes. * The
distinction between key and value attributes is important for the dataset types of
tables and fields, as shown in Figure 2.8.

* A synonym for key attribute is independent attribute. A synonym for
value attribute is dependent attribute. The language of independent and depen‐
dent is common in statistics. In the language of data warehouses, a synonym for
independent is dimension, and a synonym for dependent is measure.

Figure 2.8.

Key and value semantics for tables and fields.

A simple flat table has only one key, where each item corresponds to a row in
the table, and any number of value attributes. In this case, the key might be com‐
pletely implicit, where it’s simply the index of the row. It might be explicit,
where it is contained within the table as an attribute. In this case, there must not
be any duplicate values within that attribute. In tables, keys may be categorical or
ordinal attributes, but quantitative attributes are typically unsuitable as keys be‐
cause there is nothing to prevent them from having the same values for multiple
items.
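
The uniqueness requirement is easy to check mechanically. This sketch uses a hypothetical row-of-dicts table echoing the Name example from the text:

```python
def is_valid_key(rows, attribute):
    """A column can act as an explicit key only if it has no duplicate values."""
    values = [row[attribute] for row in rows]
    return len(values) == len(set(values))

# Made-up table: two people share the name "Amy", so Name cannot be the key.
table = [
    {"name": "Amy",   "age": 8},
    {"name": "Basil", "age": 7},
    {"name": "Amy",   "age": 12},
]
```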

For example, in Table 2.1, Name is a categorical attribute that might appear to be
a reasonable key at first, but the last line shows that two people have the same
name, so it is not a good choice. Favorite Fruit is clearly not a key, despite being
categorical, because Pear appears in two different rows. The quantitative at‐
tribute of Age has many duplicate values, as does the ordinal attribute of Shirt


tion, or categorical, if it’s simply treated as a unique code.

Figure 2.9 shows the order table from Figure 2.5 where each attribute is colored
according to its type. There is no explicit key: even the Order ID attribute has du‐
plicates, because orders consist of multiple items with different container sizes,
so it does not act as a unique identifier. This table is an example of using an im‐
plicit key that is the row number within the table.

Figure 2.9.

The order table with the attribute columns colored by their type; none of them is
a key.

The more complex case is a multidimensional table, where multiple keys are re‐
quired to look up an item. The combination of all keys must be unique for each
item, even though an individual key attribute may contain duplicates. For exam‐
ple, a common multidimensional table from the biology domain has a gene as
one key and time as another key, so that the value in each cell is the activity level
of a gene at a particular time.

The information about which attributes are keys and which are values may not be
available; in many instances determining which attributes are independent keys
versus dependent values is the goal of the vis process, rather than its starting
point. In this case, the successful outcome of analysis using vis might be to recast
a flat table into a more semantically meaningful multidimensional table.
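
The gene/time example can be sketched as such a recasting; the activity values are invented:

```python
# A flat table: neither gene nor time alone is unique, but each row is.
flat = [
    {"gene": "g1", "time": 0, "activity": 0.1},
    {"gene": "g1", "time": 1, "activity": 0.4},
    {"gene": "g2", "time": 0, "activity": 0.9},
]

# Recast as a multidimensional table: the combination of both keys
# (gene, time) uniquely identifies each cell value.
multidim = {(row["gene"], row["time"]): row["activity"] for row in flat}
```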

Although fields differ from tables in a fundamental way because they represent con‐
tinuous rather than discrete data, keys and values are still central concerns.
(Different vocabulary for the same basic idea is more common with spatial field
data, where the term independent variable is used instead of key, and dependent
variable instead of value.)

Fields are structured by sampling in a systematic way so that each grid cell is
spanned by a unique range from a continuous domain. In spatial fields, spatial
position acts as a quantitative key, in contrast to a nonspatial attribute in the case
of a table that is categorical or ordinal. The crucial difference between fields and
tables is that useful answers for attribute values are returned for locations


multidimensional structure depends on the number of keys. The standard multi‐
dimensional cases are 2D and 3D fields for static measurements taken in two or
three spatial dimensions, 5 and fields with three or four keys, in the case where
these measurements are time-varying. A field can be both multidimensional and
multivariate if it has multiple keys and multiple values. The standard classifica‐
tion according to multivariate structure is that a scalar field has one attribute per
cell, a vector field has two or more attributes per cell, and a tensor field has
many attributes per cell. *

* These definitions of scalar, vector, and tensor follow the common usage in
vis. In a strict mathematical sense, these distinctions are not technically correct,
since scalars and vectors are included as a degenerate case of tensors. Mapping
the mathematical usage to the vis usage, scalars mean mathematical tensors of
order 0, vectors mean mathematical tensors of order 1, and tensors mean mathe‐
matical tensors of order 2 or more.

A scalar field is univariate, with a single value attribute at each point in space.
One example of a 3D scalar field is the time-varying medical scan above; another
is the temperature in a room at each point in 3D space. The geometric intuition is
that each point in a scalar field has a single value. A point in space can have sev‐
eral different numbers associated with it; if there is no underlying connection be‐
tween them then they are simply multiple separate scalar fields.

A vector field is multivariate, with a list of multiple attribute values at each
point. The geometric intuition is that each point in a vector field has a direction
and magnitude, like an arrow that can point in any direction and that can be any
length. The length might mean the speed of a motion or the strength of a force. A
concrete example of a 3D vector field is the velocity of air in the room at a spe‐
cific time point, where there is a direction and speed for each item. The dimen‐
sionality of the field determines the number of components in the direction vec‐
tor; its length can be computed directly from these components, using the stan‐
dard Euclidean distance formula. The standard cases are two, three, or four com‐
ponents, as above.
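
Computing the length from the components is a one-liner for any dimensionality, as this sketch shows:

```python
from math import sqrt

def magnitude(components):
    """Length of a vector-field sample via the standard Euclidean formula."""
    return sqrt(sum(c * c for c in components))

# Made-up velocity samples: a 2D and a 3D vector.
velocity_2d = (3.0, 4.0)
velocity_3d = (1.0, 2.0, 2.0)
```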

A tensor field has an array of attributes at each point, representing a more com‐
plex multivariate mathematical structure than the list of numbers in a vector. A
physical example is stress, which in the case of a 3D field can be defined by nine
numbers that represent forces acting in three orthogonal directions. The geomet‐
ric intuition is that the full information at each point in a tensor field cannot be
represented by just an arrow and would require a more complex shape such as an
ellipsoid.

This categorization of spatial fields requires knowledge of the attribute semantics
and cannot be determined from type information alone. If you are given a field
with multiple measured values at each point and no further information, there is


A temporal attribute is simply any kind of information that relates to time. Data
about time is complicated to handle because of the rich hierarchical structure that
we use to reason about time, and the potential for periodic structure. The time hi‐
erarchy is deeply multiscale: the scale of interest could range anywhere from
nanoseconds to hours to decades to millennia. Even the common words time and
date are a way to partially specify the scale of temporal interest. Temporal analy‐
sis tasks often involve finding or verifying periodicity either at a predetermined
scale or at some scale not known in advance. Moreover, the temporal scales of
interest do not all fit into a strict hierarchy; for instance, weeks do not fit cleanly
into months. Thus, the generic vis problems of transformation and aggregation
are often particularly complex when dealing with temporal data. One important
idea is that even though the dataset semantics involves change over time, there
are many approaches to visually encoding that data—and only one of them is to
show it changing over time in the form of an animation.

▶ Section 3.4.2.3 introduces the problem of data transformation. Section 13.4
discusses the question of aggregation in detail.

Temporal attributes can have either value or key semantics. Examples of tempo‐
ral attributes with dependent value semantics are a duration of elapsed time or the
date on which a transaction occurred. In both spatial fields and abstract tables,
time can be an independent key. For example, a time-varying medical scan can
have the independent keys of x, y, z, t to cover spatial position and time, with the
dependent value attribute of density for each combination of four indices to look
up position and time. A temporal key attribute is usually considered to have a
quantitative type, although it’s possible to consider it as ordinal data if the dura‐
tion between events is not interesting.

▶ Vision versus memory is discussed further in Section 6.5.

A dataset has time-varying semantics when time is one of the key attributes, as
opposed to when the temporal attribute is a value rather than a key. As with other
decisions about semantics, the question of whether time has key or value seman‐
tics requires external knowledge about the nature of the dataset and cannot be
made purely from type information. An example of a dataset with time-varying
semantics is one created with a sensor network that tracks the location of each
animal within a herd by taking new measurements every second. Each animal
will have new location data at every time point, so the temporal attribute is an in‐
dependent key and is likely to be a central aspect of understanding the dataset. In
contrast, a horse-racing dataset covering a year’s worth of races could have tem‐
poral value attributes such as the race start time and the duration of each horse’s
run. These attributes do indeed deal with temporal information, but the dataset is
not time-varying.

A common case of temporal data occurs in a time-series dataset, namely, an or‐
dered sequence of time–value pairs. These datasets are a special case of tables,
where time is the key. These time-value pairs are often but not always spaced at
uniform temporal intervals. Typical time-series analysis tasks involve finding
trends, correlations, and variations at multiple time scales such as hourly, daily,
weekly, and seasonal.


a dataset has stream type, in contrast to an unchanging file that can be loaded all
at once. In this latter sense, items and attributes can be added or deleted and their
values may change during a running session of a vis tool. I carefully distinguish
between these two meanings here.

▶ The dataset types of dynamic streams versus static files are discussed in
Section 2.4.6.

The Big Picture:

The framework presented here was inspired in part by the many taxonomies of
data that have been previously proposed, including the synthesis chapter at the
beginning of an early collection of infovis readings [Card et al. 99], a taxonomy
that emphasizes the division between continuous and discrete data [Tory and
Möller 04a], and one that emphasizes both data and tasks [Shneiderman 96].

Field Datasets:

Several books discuss the spatial field dataset type in far more detail, including
two textbooks [Telea 07, Ward et al. 10], a voluminous handbook [Hansen and
Johnson 05], and the vtk book [Schroeder et al. 06].

Attribute Types:

The attribute types of categorical, ordered, and quantitative were proposed in the
seminal work on scales of measurement from the psychophysics literature
[Stevens 46]. Scales of measurement are also discussed extensively in the book
The Grammar of Graphics [Wilkinson 05] and are used as the foundational axes
of an influential vis design space taxonomy [Card and Mackinlay 97].

Key and Value Semantics:

The Polaris vis system, which has been commercialized as Tableau, is built
around the distinction between key attributes (independent dimensions) and value
attributes (dependent measures) [Stolte et al. 02].

Temporal Semantics:

A good resource for time-oriented data vis is a recent book, Visualization of
Time-Oriented Data [Aigner et al. 11].

1 My use of the term field is related to but not identical to its use in the mathe‐
matics literature, where it denotes a mapping from a domain to a range. In this
case, the domain is a Euclidean space of one, two, or three dimensions, and the
adjective modifying field is a statement about the range: scalars, vectors, or ten‐
sors. Although the term field by itself is not commonly found in the literature,
when I use it without an adjective I’m emphasizing the continuous nature of the
domain, rather than specifics of the ranges of scalars, vectors, or tensors.

2 Technically, all data stored within a computer is discrete rather than continu‐
ous; however, the interesting question is whether the underlying semantics of the
bits that are stored is continuous or discrete.


3 The quantitative type can be further split into interval versus ratio data [Stevens 46]; this distinction is typically not useful when designing
a visual encoding, so in this book these types remain collapsed together into this
single category.

4 It’s common to store the key attribute in the first column, for understandabil‐
ity by people and ease of building data structures by computers.

5 It’s also possible for a spatial field to have just one key.

Chapter 13 Reduce Items and Attributes - Visualization Analysis and Design https://learning.oreilly.com/library/view/visualization-analysis-and/9781...

Chapter 13

Figure 13.1 shows the set of design choices for reducing—or increasing—what is
shown at once within a view. Filtering simply eliminates elements, whereas ag‐
gregation combines many together. Either choice can be applied to both items or
attributes.

Figure 13.1.

1 of 22 7/15/2020, 7:29 AM

Reduction is one of five major strategies for managing complexity in visualiza‐
tions; as pointed out before, these five choices are not mutually exclusive, and
various combinations of them are common.

Typically, static data reduction idioms only reduce what is shown, as the name
suggests. However, in the dynamic case, the outcome of changing a parameter or
a choice may be an increase in the number of visible elements. Thus, many of the
idioms covered in this chapter are bidirectional: they may serve to either reduce
or increase the number of visible elements. Nevertheless, they are all named after
the reduction action for brevity.

▶ Deriving new data is covered in Chapter 3, changing a view over time is cov‐
ered in Chapter 11, faceting data into multiple views is covered in Chapter 12,
and embedding focus and contextual information together within one view is
covered in Chapter 14.

Reducing the amount of data shown in a view is an obvious way to reduce its vis‐
ual complexity. Of course, the devil is in the details, where the challenge is to
minimize the chances that information important to the task is hidden from the
user. Reduction can be applied to both items and attributes; the word element
will be used to refer to either items or attributes when design choices that apply
to both are discussed. Filtering simply eliminates elements, whereas aggregation
creates a single new element that stands in for multiple others that it replaces. It’s
useful to consider the tradeoffs between these two alternatives explicitly when
making design choices: filtering is very straightforward for users to understand,
and typically also to compute. However, people tend to have an “out of sight, out
of mind” mentality about missing information: they tend to forget to take into ac‐
count elements that have been filtered out, even when their absence is the result
of quite recent actions. Aggregation can be somewhat safer from a cognitive
point of view because the stand-in element is designed to convey information
about the entire set of elements that it replaces. However, by definition, it cannot
convey all omitted information; the challenge with aggregation is how and what
to summarize in a way that matches well with the dataset and task.
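
The tradeoff can be seen by applying both choices to the same list of items. The values and the summary fields below are made up for illustration:

```python
from statistics import mean

values = [3, 8, 1, 9, 4, 7]

# Filtering: elements outside the range of interest are simply eliminated.
filtered = [v for v in values if v >= 5]

# Aggregation: one stand-in element summarizes the entire set it replaces,
# but by definition it cannot convey all of the omitted information.
aggregate = {"count": len(values), "mean": mean(values),
             "min": min(values), "max": max(values)}
```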

The design choice of filtering is a straightforward way to reduce the number of
elements shown: some elements are simply eliminated. Filtering can be applied
to both items and attributes. A straightforward approach to filtering is to allow
the user to select one or more ranges of interest in one or more of the elements.
The range might mean what to show or what to leave out.
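
The core computation behind such a range selection is tiny; the movie items here are invented, and only the idiom matters:

```python
# Made-up items with one quantitative attribute.
movies = [
    {"title": "A", "length": 95},
    {"title": "B", "length": 142},
    {"title": "C", "length": 171},
]

def range_filter(items, attribute, lo, hi):
    """Keep items whose value for the attribute falls inside [lo, hi]."""
    return [it for it in items if lo <= it[attribute] <= hi]

shown = range_filter(movies, "length", 90, 150)
```

The design challenge discussed next is not this computation but the interface around it: how the user specifies lo and hi, and how quickly the display responds.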

The idea of filtering is very obvious; the challenge comes in designing a vis sys‐
tem where filtering can be used to effectively explore a dataset. Consider the sim‐
ple case of filtering the set of items according to their values for a single quanti‐
tative attribute. The goal is to select a range within it in terms of minimum and
maximum numeric values and eliminate the items whose values for that attribute
fall outside of the range. From the programmer’s point of view, a very simple
way to support this functionality would be to simply have the user enter two
numbers, a minimum and maximum value. From the user’s point of view, this ap‐


In an interactive vis context, filtering is often accomplished through dynamic
queries, where there is a tightly coupled loop between visual encoding and inter‐
action, so that the user can immediately see the results of the intervention. In this
design choice, a display showing a visual encoding of the dataset is used in con‐
junction with controls that support direct interaction, so that the display updates
immediately when the user changes a setting. Often these controls are standard
graphical user interface widgets such as sliders, buttons, comboboxes, and text
fields. Many extensions of off-the-shelf widgets have also been proposed to bet‐
ter support the needs of interactive vis.

In item filtering, the goal is to eliminate items based on their values with respect
to specific attributes. Fewer items are shown, but the number of attributes shown
does not change.

Example: FilmFinder

Figure 13.2 shows the FilmFinder system [Ahlberg and Shneiderman 94] for ex‐
ploring a movie database. The dataset is a table with nine value attributes: genre,
year made, title, actors, actresses, directors, rating, popularity, and length. The
visual encoding features an interactive scatterplot where the items are movies
color coded by genre, with scatterplot axes of year made versus movie popular‐
ity; Figure 13.2(a) shows the full dataset. The interaction design features filter‐
ing, with immediate update of the visual display to filter out or add back items as
sliders are moved and buttons are pressed. The visual encoding adapts to the
number of items to display; the marks representing movies are automatically en‐
larged and labeled when enough of the dataset has been filtered away that there is
enough room to do so, as in Figure 13.2(b). The system uses multiform
overview–detail views, where clicking on any mark brings up a popup detail
view with more information about that movie, as in Figure 13.2(c).


Figure 13.2.

FilmFinder features tightly coupled interactive filtering, where the result of mov‐
ing sliders and pressing buttons is immediately reflected in the visual encoding.
(a) Exploration begins with an overview of all movies in the dataset. (b) Moving
the actor slider to select Sean Connery filters out most of the other movies, leav‐
ing enough room to draw labels. (c) Clicking on the mark representing a movie
brings up a detail view. From [Ahlberg and Shneiderman 94, Color Plates 1, 2,
and 3].

FilmFinder is a specific example of the general dynamic queries approach, where
browsing using tightly coupled visual encoding and interaction is an alternative
to searching by running queries, as for example with a database. All of the items
in a database are shown at the start of a session to provide an overview, and di‐


Figure 13.2 shows the use of two augmented slider types, a dual slider for movie
length that allows the user to select both a minimum and maximum value, and
several alpha sliders that are tuned for selection with text strings rather than
numbers.

System FilmFinder

What: Data Table: nine value attributes.

How: Encode Scatterplot; detail view with text/images.

How: Facet Multiform, overview–detail.

How: Reduce Item filtering.

Standard widgets for filtering controls can be augmented by concisely visually encoding information about the dataset, but in the part of the screen normally
thought of as the control panel rather than a separate display area. The idea is to
do so while using no or minimal additional screen real estate, in order to create
displays that have high information density. These augmented widgets are called
scented widgets [Willett et al. 07], alluding to the idea of information scent:
cues that help a searcher decide whether there is value in drilling down further
into a particular information source, versus looking elsewhere [Pirolli 07]. Figure
13.3 shows several examples. One way to add information is by inserting a con‐
cise statistical graphic, such as a bar or line chart. Another choice is by inserting
icons or text labels. A third choice is to treat some part of the existing widget as a
mark and encode more information into that region using visual channels such as
hue, saturation, and opacity.

Figure 13.3.

The scented widget idiom adds visual encoding information directly to standard
graphical widgets to make filtering possible with high information density dis‐
plays. From [Willett et al. 07, Figure 2].

The Improvise system shown in Figure 12.7 is another example of the use of fil‐
tering. The checkbox list view in the lower middle part of the screen is a simple
filter controlling whether various geographic features are shown. The multiform


Attributes can also be filtered. With attribute filtering, the goal is to eliminate
attributes rather than items; that is, to show the same number of items, but fewer
attributes for each item.

Item filtering and attribute filtering can be combined, with the result of showing
both fewer items and fewer attributes.

Example: DOSFA

Figure 13.4 shows an example of the Dimensional Ordering, Spacing, and Filtering Approach (DOSFA) idiom [Yang et al. 03a]. As the name suggests, the idiom features attribute filtering. * Figure 13.4 shows DOSFA on a dataset of 215 attributes representing word counts and 298 points representing documents
in a collection of medical abstracts. DOSFA can be used with many visual encod‐
ing approaches; this figure shows it in use with star plots. In Figure 13.4(a) the
plot axes are so densely packed that little structure can be seen. Figure 13.4(b)
shows the plots after the dimensions are ordered by similarity and filtered by
both similarity and importance thresholds. The filtered display does show clear
visual patterns.

* Many idioms for attribute filtering and aggregation use the alternative term
dimension rather than attribute in their names.

▶ For more on star plots, see Section 7.6.3.

Figure 13.4.

The DOSFA idiom shown on star glyphs with a medical records dataset of 215
dimensions and 298 points. (a) The full dataset is so dense that patterns cannot be
seen. (b) After ordering on similarity and filtering on both similarity and impor‐
tance, the star glyphs show structure. From [Yang et al. 03a, Figures 3a and 3d].


System DOSFA

What: Data Table: many value attributes.

How: Encode Star plots.

How: Facet Small multiples with matrix alignment.

How: Reduce Attribute filtering.

Attribute filtering is often used in conjunction with attribute ordering. * If attributes can be ordered according to a derived attribute that measures the similarity between them, then all of the high-scoring ones or low-scoring ones can be easily filtered out interactively. A similarity measure for an attribute creates a quantitative or ordered value for each attribute based on all of the data item values for that attribute. * One approach is to calculate the variance of an attribute: to what extent the values within that attribute are similar to or different from each other. There are many ways to calculate a similarity measure between attributes; some focus on global similarity, and others search for partial matches [Ankerst et al. 98].
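As an illustration of the variance approach, the following sketch (data and function names invented, not from the book) ranks a small column-oriented table's attributes by variance and then filters the flattest one out:

```python
import statistics

def order_attributes_by_variance(table):
    """Derive a variance score per attribute and return the attribute
    names ordered from most to least variable."""
    scores = {name: statistics.pvariance(col) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)

def filter_attributes(table, keep):
    """Attribute filtering: keep all items but only the chosen attributes."""
    return {name: table[name] for name in keep}

# Toy table: attribute name -> values across items.
table = {
    "a": [1, 1, 1, 1],        # zero variance: likely uninteresting
    "b": [0, 10, 0, 10],
    "c": [2, 3, 2, 3],
}
ranked = order_attributes_by_variance(table)      # ["b", "c", "a"]
reduced = filter_attributes(table, ranked[:2])    # drop the flattest attribute
```

The same structure works with any other per-attribute similarity score swapped into the scoring step.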

* A synonym for attribute ordering is dimensional ordering.

* A synonym for similarity measure is similarity metric. Although I am combining the ideas of measure and metric here for the purposes of this discussion, in
many specialized contexts such as mathematics and business analysis they are
carefully distinguished with different definitions.

The other major reduction design choice is aggregation, so that a group of ele‐
ments is represented by a new derived element that stands in for the entire group.
Elements are merged together with aggregation, as opposed to eliminated com‐
pletely with filtering. Aggregation and filtering can be used in conjunction with
each other. As with filtering, aggregation can be used for both items and at‐
tributes.

▶ An example of a complex combination of both aggregation and filtering is cartographic generalization, discussed in Section 8.3.1.

Aggregation typically involves the use of a derived attribute. A very simple ex‐


Anscombe’s Quartet example shown in Figure 1.3 exactly illustrates the diffi‐
culty of adequately summarizing data, and thus the limits of static visual encod‐
ing idioms that use aggregation. Aggregation is nevertheless a powerful design
choice, particularly when used within interactive idioms where the user can
change the level of aggregation on the fly to inspect the dataset at different levels
of detail.

The most straightforward use of item aggregation is within static visual encoding
idioms; its full power and flexibility can be harnessed by interactive idioms
where the view dynamically changes.

Example: Histograms

The idiom of histograms shows the distribution of items within an original at‐
tribute. Figure 13.5 shows a histogram of the distribution of weights for all of the
cats in a neighborhood, binned into 5-pound blocks. The range of the original at‐
tribute is partitioned into bins, and the number of items that fall into each bin is
computed and saved as a derived ordered attribute. The visual encoding of a his‐
togram is very similar to bar charts, with a line mark that uses spatial position in
one direction and the bins distributed along an axis in the other direction. One
difference is that histograms are sometimes shown without space between the
bars to visually imply continuity, whereas bar charts conversely have spaces be‐
tween the bars to imply discretization. Despite their visual similarity, histograms
are very different from bar charts. They do not show the original table directly;
rather, they are an example of an aggregation idiom that shows a derived table
that is more concise than the original dataset. The number of bins in the his‐
togram can be chosen independently of the number of items in the dataset. The
choice of bin size is crucial and tricky: a histogram can look quite different de‐
pending on the discretization chosen. One possible solution to the problem is to
compute the number of bins based on dataset characteristics; another is to pro‐
vide the user with controls to easily change the number of bins interactively, to
see how the histogram changes.
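The derivation just described — partition the attribute range into fixed-width bins, then count the items falling into each — can be sketched as follows (the weight values are illustrative):

```python
def histogram(values, bin_width, lo):
    """Aggregate items into a derived table of per-bin counts."""
    counts = {}
    for v in values:
        b = int((v - lo) // bin_width)     # which bin this item falls into
        counts[b] = counts.get(b, 0) + 1
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

# Hypothetical cat weights in pounds, binned into 5-pound blocks.
weights = [4, 7, 8, 9, 11, 12, 13, 14, 16, 22]
bins = histogram(weights, bin_width=5, lo=0)   # [1, 3, 4, 1, 1]
```

Note that the size of the derived table depends on bin_width rather than on the number of items, which is exactly why the choice of bin size can change the histogram's apparent shape.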

Figure 13.5.

The histogram idiom aggregates an arbitrary number of items into a concise rep‐
resentation of their distribution.


Idiom Histograms

What: Data Table: one quantitative value attribute.

What: Derived Derived table: one derived ordered key attribute (bin), one derived quantitative value attribute (item count per bin).

How: Encode Rectilinear layout. Line mark with aligned position to express derived value attribute. Position: key attribute.

Example: Continuous Scatterplots

Another example of aggregation is continuous scatterplots, where the problem of occlusion on scatterplots is solved by plotting an aggregate value at each pixel
rather than drawing every single item as an individual point. Occlusion can be a
major readability problem with scatterplots, because many dots could be over‐
plotted on the same location. Size coding exacerbates the problem, as does the
use of text labels. Continuous scatterplots use color coding at each pixel to indi‐
cate the density of overplotting, often in conjunction with transparency.
Conceptually, this approach uses a derived attribute, overplot density, which can
be calculated after the layout is computed. Practically, many hardware accelera‐
tion techniques sidestep the need to do this calculation explicitly.

Figure 13.6 shows a continuous scatterplot of a tornado air-flow dataset, with the
magnitude of the velocity on the horizontal and the z-direction velocity on the
vertical. The density is shown with a log-scale sequential colormap with mono‐
tonically increasing luminance. It starts with dark blues at the low end, continues
with reds in the middle, and has yellows and whites at the high end.
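A minimal CPU-side sketch of deriving the overplot-density attribute follows (real systems typically compute this on graphics hardware, and the points here are invented):

```python
def density_grid(points, width, height, xlim, ylim):
    """Derive overplot density: count items per pixel of a width x height grid."""
    (x0, x1), (y0, y1) = xlim, ylim
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        px = min(int((x - x0) / (x1 - x0) * width), width - 1)
        py = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[py][px] += 1   # overplotted items accumulate instead of occluding
    return grid

pts = [(0.10, 0.10), (0.12, 0.11), (0.90, 0.90)]
g = density_grid(pts, width=4, height=4, xlim=(0.0, 1.0), ylim=(0.0, 1.0))
```

Each per-pixel count would then be mapped through a colormap such as the log-scale sequential one described above.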

Figure 13.6.

The continuous scatterplot idiom uses color to show the density at each location,


Scatterplots began as an idiom for discrete, categorical data. They have been generalized to a mathematical framework of density functions for continuous data,
giving rise to continuous scatterplots in the 2D case and continuous histograms in
the 1D case [Bachthaler and Weiskopf 08]. Continuous scatterplots use a dense,
space-filling 2D matrix alignment, where each pixel is given a different color.
Although the idiom of continuous scatterplots has a similar name to the idiom of
scatterplots, analysis via the framework of design choices shows that the ap‐
proach is in fact very different.

Idiom Continuous Scatterplots

What: Data Table: two quantitative value attributes.

What: Derived Derived table: two ordered key attributes (x, y pixel locations), one quantitative attribute (overplot density).

How: Encode Dense space-filling 2D matrix alignment, sequential categorical hue + ordered luminance colormap.

How: Reduce Item aggregation.

Example: Boxplot Charts

The visually concise idiom of boxplots shows an aggregate statistical summary of all the values that occur within the distribution of a single quantitative attribute. It uses five derived variables carefully chosen to provide information
about the attribute’s distribution: the median (50% point), the lower and upper
quartiles (25% and 75% points), and the upper and lower fences (chosen values
near the extremes, beyond which points should be counted as outliers). Figure
13.7(a) shows the visual encoding of these five numbers using a simple glyph
that relies on vertical spatial position. The eponymous box stretches between the
lower and upper quartiles and has a horizontal line at the median. The whiskers
are vertical lines that extend from the core box to the fences marked with horizontal lines. * Outliers beyond the range of the chosen fence cutoff are shown explicitly as discrete dots, just as in scatterplots or dot charts.

* Boxplots are also known as box-and-whisker diagrams.
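A sketch of deriving the five numbers follows; the quantile interpolation scheme and the fence rule both vary between implementations, and Tukey's common 1.5 × IQR convention is assumed here:

```python
def five_numbers(values, k=1.5):
    """Derive (lower fence, lower quartile, median, upper quartile, upper fence)."""
    s = sorted(values)
    def quantile(q):                      # linear interpolation between ranks
        pos = q * (len(s) - 1)
        i, frac = int(pos), pos - int(pos)
        return s[i] + frac * (s[min(i + 1, len(s) - 1)] - s[i])
    q1, med, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1
    return (q1 - k * iqr, q1, med, q3, q3 + k * iqr)

def outliers(values, k=1.5):
    """Items beyond the fences are drawn as discrete dots."""
    lo, _, _, _, hi = five_numbers(values, k)
    return [v for v in values if v < lo or v > hi]
```

For example, five_numbers(list(range(1, 10))) gives (-3.0, 3.0, 5.0, 7.0, 13.0): box from 3 to 7, median line at 5, fences well beyond the data so no outlier dots are drawn.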


Figure 13.7.

The boxplot is an idiom presenting summary statistics for the distribution of a quantitative attribute, using five derived values. These plots illustrate four kinds
of distributions: normal (n), skewed (s), peaked (k), and multimodal (mm). (a)
Standard box plots. (b) Vase plots, which use horizontal spatial position to show
density directly. From [Wickham and Stryjewski 12, Figure 5].

A boxplot is similar in spirit to an individual bar in a bar chart in that only a sin‐
gle spatial axis is used to visually encode data, but boxplots show five numbers
through the use of a glyph rather than the single number encoded by the linear
mark in a bar chart. A boxplot chart features multiple boxplots within a single
shared frame to contrast different attribute distributions, just as bar charts show
multiple bars along the second axis. In Figure 13.7, the quantitative value at‐
tribute is mapped to the vertical axis and the categorical key attribute to the hori‐
zontal one.

The boxplot can be considered an item reduction idiom that provides an aggre‐
gate view of a distribution through the use of derived data. Boxplots are highly
scalable in terms of aggregating the target quantitative attribute from what could
be an arbitrarily large set of values down to five numbers; for example, it could
easily handle from thousands to millions of values within that attribute. The spa‐
tial encoding of these five numbers along the central axis requires only a moder‐
ate amount of screen space, since we have high visual acuity with spatial posi‐
tion. Each boxplot requires only a very small amount of screen space along the
secondary axis, leading to a high level of scalability in terms of the number of
categorical values that can be accommodated in a boxplot chart; roughly hun‐
dreds.

Boxplots directly show the spread, namely, the degree of dispersion, with the ex‐
tent of the box. They show the skew of the distribution compared with a normal
distribution with the peak at the center by the asymmetry between the top and
bottom sections of the box. Standard boxplots are designed to handle unimodal
data, where there is only one value that occurs the most frequently. There are
many variants of boxplots that augment the basic visual encoding with more in‐
formation. Figure 13.7(b) shows a variable-width variant called the vase plot that
uses an additional spatial dimension within the glyph by altering the width of the
central box according to the density, allowing a visual check if the distribution is
instead multimodal, with multiple peaks. The variable-width variants require
more screen space along the secondary axis than the simpler version, in an exam‐
ple of the classic cost–benefit trade-off where conveying more information re‐
quires more room.


Idiom Boxplot Charts

What: Data Table: many quantitative value attributes.

What: Derived Five quantitative attributes for each original attribute, representing its distribution.

Why: Tasks Characterize distribution; find outliers, extremes, averages; identify skew.

How: Encode One glyph per original attribute expressing derived attribute values using vertical spatial position, with 1D list alignment of glyphs separated by horizontal spatial position.

How: Reduce Item aggregation.

Scale Items: unlimited. Attributes: dozens.

Many of the interesting uses of aggregation in vis involve dynamically changing


sets: the mapping between individual items and the aggregated visual mark
changes on the fly. The simple case is to allow the user to explicitly request ag‐
gregation and deaggregation of item sets. More sophisticated approaches do these
operations automatically as a result of higher-level interaction and navigation,
usually based on spatial proximity.

Example: SolarPlot

Figure 13.8 shows the example of SolarPlot, a radial histogram with an interac‐
tively controllable aggregation level [Chuah 98]. The user directly manipulates
the size of the base circle that is the radial axis of the chart. This change of radius
indirectly changes the number of available histogram bins, and thus the aggrega‐
tion level. Like all histograms, the SolarPlot aggregation operator is count: the
height of the bar represents the number of items in the set. The dataset shown is
ticket sales over time, starting from the base of the circle and progressing coun‐
terclockwise to cover 30 years in total. The small circle in Figure 13.8(a) is heav‐
ily aggregated. It does show an increase in ticket sales over the years. The larger


Figure 13.8.

The SolarPlot circular histogram idiom provides indirect control of aggregation level by changing the circle size. (a) The small circle shows the increase in ticket
sales over time. (b) Enlarging the circle shows seasonal patterns in addition to the
gradual increase. From [Chuah 98, Figures 1 and 2].

Idiom SolarPlot

What: Data Table: one quantitative attribute.

What: Derived Derived table: one derived ordered key attribute (bin), one derived quantitative value attribute (item count per bin). Number of bins interactively controlled.

How: Encode Radial layout, line marks. Line length: express derived value attribute; angle: key attribute.

How: Reduce Item aggregation.

Scale Original items: unlimited. Derived bins: proportional to screen space allocated.
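The indirection from circle size to aggregation level can be sketched as follows; the pixels-per-bin budget and the sales timestamps are invented for illustration:

```python
import math

def solarplot_counts(timestamps, radius, span, px_per_bin=4):
    """The circumference of the base circle fixes how many bins fit;
    items (here, event times in years) are then counted into those bins."""
    n_bins = max(1, int(2 * math.pi * radius / px_per_bin))
    counts = [0] * n_bins
    for t in timestamps:
        counts[min(int(t / span * n_bins), n_bins - 1)] += 1
    return counts

sales = [0.5, 1.0, 15.0, 29.0]   # ticket-sale times over a 30-year span
coarse = solarplot_counts(sales, radius=2, span=30)   # few bins: heavy aggregation
fine = solarplot_counts(sales, radius=40, span=30)    # enlarged circle: more bins
```

Dragging the circle larger only changes radius; the count aggregation operator itself never changes.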

The general design choice of hierarchical aggregation is to construct the derived data of a hierarchical clustering of items in the original dataset and allow
the user to interactively control the level of detail to show based on this hierar‐
chy. There are many specific examples of idioms that use variants of this design


The idiom of hierarchical parallel coordinates [Fua et al. 99] uses interactively
controlled aggregation as a design choice to increase the scalability of the basic
parallel coordinates visual encoding to hundreds of thousands of items. The
dataset is transformed by computing derived data: a hierarchical clustering of the
items. Several statistics about each cluster are computed, including the number of
points it contains; the mean, minimum, and maximum values; and the depth in
the hierarchy. A cluster is represented by a band of varying width and opacity,
where the mean is in the middle and width at each axis depends on the minimum
and maximum item values for that attribute within the cluster. Thus, in the limit,
a cluster of a single item is shown by a single line, just as with the original idiom.
The cluster bands are colored according to their proximity in the cluster hierar‐
chy, so that clusters far away from each other have very different colors.

The level of detail displayed at a global level for the entire dataset can be interac‐
tively controlled by the user using a single slider. The parameter controlled by
that slider is again a derived variable that varies the aggregate level of detail
shown in a smooth and continuous way. Figure 13.9 shows a dataset with eight
attributes and 230,000 items at different levels of detail. Figure 13.9(a) is the
highest-level overview showing the single top-level cluster, with very broad
bands of green. Figure 13.9(b) is the mid-level view showing several clusters,
where the extents of the tan cluster are clearly distinguishable from the now-
smaller green one. Figure 13.9(c) is a more detailed view with dozens of clusters
that have tighter bands; the proximity-based coloring mitigates the effect of oc‐
clusion.
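The per-cluster derived data behind each band can be sketched as follows (the item tuples and function name are illustrative):

```python
def cluster_band(items, depth):
    """Derive the statistics for one cluster band: item count, hierarchy
    depth, and per-attribute (min, mean, max) controlling the band's
    position and width at each parallel-coordinates axis."""
    n_attrs = len(items[0])
    band = []
    for a in range(n_attrs):
        col = [item[a] for item in items]
        band.append((min(col), sum(col) / len(col), max(col)))
    return {"count": len(items), "depth": depth, "band": band}

# A two-item cluster over two attributes draws a band; a one-item cluster
# collapses to min == mean == max, i.e., a single polyline.
wide = cluster_band([(1, 4), (3, 8)], depth=2)
line = cluster_band([(5, 5)], depth=3)
```

In the real idiom these statistics are computed once per node of the cluster hierarchy, so changing the level-of-detail slider only selects which precomputed bands to draw.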

Figure 13.9.

Hierarchical parallel coordinates provide multiple levels of detail. (a) The single
top cluster has large extent. (b) When several clusters are shown, each has a
smaller extent. (c) When many clusters are shown, the proximity-based coloring
helps them remain distinguishable from each other. From [Fua et al. 99, Figure
4].


Idiom Hierarchical Parallel Coordinates

What: Data Table.

What: Derived Cluster hierarchy atop original table of items. Five per-cluster attributes: count, mean, min, max, depth.

How: Encode Parallel coordinates. Color clusters by proximity in hierarchy.

How: Reduce Interactive item aggregation to change level of detail.

Scale Items: 10,000–100,000. Clusters: one dozen.

The challenge of spatial aggregation is to take the spatial nature of data into ac‐
count correctly when aggregating it. In the cartography literature, the modifiable
areal unit problem (MAUP) is a major concern: changing the boundaries of the
regions used to analyze data can yield dramatically different results. Even if the
number of units and their size does not change, any change of spatial grouping
can lead to a very significant change in analysis results. Figure 13.10 shows an
example, where the same location near the middle of the map has a different den‐
sity level depending on the region boundaries: high in Figure 13.10(a), medium
in Figure 13.10(b), and low in Figure 13.10(c). Moreover, changing the scale of
the units also leads to different results. The problem of gerrymandering, where
the boundaries of voting districts are manipulated for political gain, is the in‐
stance of the MAUP best known to the general public.
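A one-dimensional toy illustration of the MAUP: identical point locations, two region schemes, and different apparent density patterns (all values invented):

```python
def regional_counts(points, edges):
    """Aggregate point locations into the regions delimited by edges."""
    return [sum(e0 <= p < e1 for p in points) for e0, e1 in zip(edges, edges[1:])]

pts = [1, 2, 3, 9, 10, 11]
even = regional_counts(pts, [0, 6, 12])     # two regions: density looks uniform
gap = regional_counts(pts, [0, 4, 8, 12])   # regrouped: a sparse middle appears
```

The underlying data never changes; only the aggregation boundaries do, yet one scheme suggests uniform density and the other suggests a gap.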

Figure 13.10.

Modifiable Areal Unit Problem (MAUP) example, showing how different bound‐


Example: Geographically Weighted Boxplots

The geowigs family of idioms, namely, geographically weighted interactive graphics, provides sophisticated support for spatial aggregation using geographically weighted regression and geographically weighted summary statistics
[Dykes and Brunsdon 07]. Figure 13.11 shows a multivariate geographic dataset
used to explore social issues in 19th century France. The six quantitative at‐
tributes are population per crime against persons (x1), population per crime
against property (x2), percentage who can read and write (x3), donations to the
poor (x4), population per illegitimate birth (x5), and population per suicide (x6).

Figure 13.11.

Geowigs are geographically weighted interactive graphics. (a) A choropleth map showing attribute x1. (b) The set of gw-boxplots for all six attributes at two
scales. (c) Weighting maps showing the scales: local and larger. (d) A gw-mean
map at the larger scale. From [Dykes and Brunsdon 07, Figures 7a and 2].

Figure 13.11(a) shows a standard choropleth map colored by personal crime at‐
tribute x1, with the interactively selected region Creuse (23) highlighted. Figure
13.11(b) shows gw-boxplots for all six attributes, at two scales. The gw-boxplot,
a geographically weighted boxplot geowig, supports comparison between the
global distribution and the currently chosen spatial scale using the design choice
of superimposed layers. The global statistical distribution is encoded by the gray
boxplot in the background, and the local statistics for the interactively chosen
scale are encoded by a foreground boxplot in green. Figure 13.11(c) shows the
weighting maps for the currently chosen scale of each gw-boxplot set: very local
on top, and a larger scale on the bottom. Figure 13.11(d) shows a gw-mean map,
a geographically weighted mean geowig, weighted according to the same larger
scale.
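The geographically weighted mean underlying such a map can be sketched with a distance kernel whose bandwidth plays the role of the interactively chosen scale; the Gaussian kernel and the data here are assumptions for illustration, not the paper's exact formulation:

```python
import math

def gw_mean(areas, target, bandwidth):
    """Weighted mean of area values, downweighting areas far from target."""
    num = den = 0.0
    for (x, y), value in areas:
        d2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        weight = math.exp(-d2 / (2 * bandwidth ** 2))   # distance kernel
        num += weight * value
        den += weight
    return num / den

# (location, value) pairs standing in for areas and one attribute.
areas = [((0, 0), 10.0), ((1, 0), 20.0), ((5, 0), 90.0)]
local = gw_mean(areas, target=(0, 0), bandwidth=0.5)    # dominated by nearby areas
broad = gw_mean(areas, target=(0, 0), bandwidth=10.0)   # nears the global mean of 40
```

Widening the bandwidth smoothly interpolates between a purely local statistic and the global one, which is exactly the comparison the gw-boxplot's superimposed layers support.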

▶ Choropleth maps are covered in Section 8.3.1.


ing to a larger scale, that attribute’s distribution is close to the global one in both
the boxplot, matching the mid-range color in the gw-mean map geowig in Figure
13.11(d).

Idiom Geographically Weighted Boxplots

What: Data Geographic geometry with area boundaries. Table: key attribute (area), several quantitative value attributes. Table: five-number statistical summary distributions for each original attribute.

What: Derived Multidimensional table: key attribute (area), key attribute (scale), quantitative value attributes (geographically weighted statistical summaries for each area at multiple scales).

How: Encode Boxplot.

How: Facet Superimposed layers: global boxplot as gray background, current-scale boxplot as green foreground.

How: Reduce Spatial aggregation.

Just as attributes can be filtered, attributes can also be aggregated, where a new
attribute is synthesized to take the place of multiple original attributes. A very
simple approach to aggregating attributes is to group them by some kind of similarity measure and then synthesize the new attribute by calculating an average across that similar set. A more complex approach to aggregation is dimensionality reduction (DR), where the goal is to preserve the meaningful structure of a
dataset while using fewer attributes to represent the items.

In the family of idioms typically called dimensionality reduction, the goal is to


dundancy in the original dataset because the underlying latent variables could not
be measured directly.

Nonlinear methods for dimensionality reduction are used when the new dimen‐
sions cannot be expressed in terms of a straightforward combination of the origi‐
nal ones. The multidimensional scaling (MDS) family of approaches includes
both linear and nonlinear variants, where the goal is to minimize the differences
in distances between points in the high-dimensional space versus the new lower-
dimensional space.
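The quantity that MDS variants minimize can be sketched as raw stress, the sum of squared pairwise-distance differences; normalization details differ across specific methods, and the points here are illustrative:

```python
import math

def stress(hi, lo):
    """Sum of squared differences between all pairwise distances in the
    original high-dimensional space and the reduced low-dimensional one."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    n = len(hi)
    return sum((dist(hi[i], hi[j]) - dist(lo[i], lo[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

hi = [(0, 0, 0), (1, 0, 0), (0, 2, 0)]           # items already lying in a plane
perfect = stress(hi, [(0, 0), (1, 0), (0, 2)])   # distances preserved exactly
worse = stress(hi, [(0, 0), (5, 0), (0, 2)])     # a distorted layout scores higher
```

An MDS algorithm searches for low-dimensional point positions that drive this score toward zero.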

* The words attribute aggregation, attribute synthesis, and dimensionality reduction (DR) are all synonyms. The term dimensionality reduction is very
common in the literature. I use attribute aggregation as a name to show where
this design choice fits into the taxonomy of the book; it is not a typical usage by
other authors. Although the term dimensionality reduction might logically seem
to include attribute filtering, it is more typically used to mean attribute synthesis
through aggregation.

▶ Deriving new data is discussed in Section 3.4.2.3.

Example: Dimensionality Reduction for Document Collections

A situation where dimensionality reduction is frequently used is when users are faced with the need to analyze a large collection of text documents, ranging from
thousands to millions or more. Although we typically read text when confronted
with only a single document, document collection vis is typically used in situa‐
tions where there are so many documents in the collection that simply reading
each one is impractical. Document collections are not directly visualizeable, but
they can be transformed into a dataset type that is: a derived high-dimensional ta‐
ble.

Text documents are usually transformed by ignoring the explicit linear ordering
of the words within the document and treating it as a bag of words: the number
of times that each word is used in the document is simply counted. The result is a
large feature vector, where the elements in the vector are all of the words in the
entire document collection. Very common words are typically eliminated, but
these vectors can still contain tens of thousands of words. However, these vectors
are very sparse, where the overwhelming number of values are simply zero: any
individual document contains only a tiny fraction of the possible words.

The result of this transformation is a derived table with a huge number of quanti‐
tative attributes. The documents are the items in the table, and the attribute value
for a particular word contains the number of times that word appears in the docu‐
ment. Looking directly at these tables is not very interesting.
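The bag-of-words derivation can be sketched as follows; the tiny stopword set stands in for real common-word elimination, and the documents are invented:

```python
def bag_of_words(docs, stopwords=frozenset({"the", "a", "of"})):
    """Derive the document-term table: one row per document, one column
    per vocabulary word, each cell counting occurrences."""
    vocab = sorted({w for d in docs for w in d.lower().split()} - stopwords)
    index = {w: i for i, w in enumerate(vocab)}
    table = []
    for d in docs:
        row = [0] * len(vocab)      # mostly zeros: the vectors are sparse
        for w in d.lower().split():
            if w in index:
                row[index[w]] += 1
        table.append(row)
    return vocab, table

docs = ["the cat sat", "the cat saw a cat"]
vocab, table = bag_of_words(docs)   # vocab: ["cat", "sat", "saw"]
```

With a real collection the vocabulary would run to tens of thousands of words, which is why a sparse representation and a subsequent dimensionality reduction step are needed in practice.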

This enormous table is then transformed into a much more compact one by deriv‐
ing a much smaller set of new attributes that still represents much of the structure
in the original table using dimensionality reduction. In this usage, there are two
stages of constructing derived data: from a document collection to a table with a
huge number of attributes, and then a second step to get down to a table with the
same number of items but just a few attributes.

The bag-of-words DR approach is suitable when the goal is to analyze the differ‐


previously unknown cluster structure.

Images, videos, and other multimedia documents are usually transformed to cre‐
ate derived attributes in a similar spirit to the transformations done to text docu‐
ments. One major question is how to derive new attributes that compactly repre‐
sent an image as a set of features. The features in text documents are relatively
easy to identify because they’re based on the words; even in this case, natural
language processing techniques are often used to combine synonyms and words
with the same stem together. Image features typically require even more complex
computations, such as detecting edges within the image or the set of colors it
contains. Processing individual videos to create derived feature data can take into
account temporal characteristics such as interframe coherence.

A typical analysis scenario is complex enough that it is useful to break it down into a chained sequence, rather than just analyzing it as a single instance. In the
first step, a low-dimensional table is derived from the high-dimensional table us‐
ing multidimensional scaling. In the second step, the low-dimensional data is en‐
coded as a color-coded scatterplot, according to a conjectured clustering. The
user’s goal is a discovery task, to verify whether there are visible clusters and
identify those that have semantic meaning given the documents that comprise
them. Figure 13.12 shows a scatterplot view of a real-world document collection
dataset, dimensionally reduced with the Glimmer multidimensional scaling
(MDS) algorithm [Ingram et al. 09]. In this scenario, the user can interactively
navigate within the scatterplot, and selecting a point shows document keywords
in a popup display and the full text of the document in another view. In the third
step, the user’s goal is to produce annotations by adding text labels to the verified
clusters. Figure 13.13 summarizes this what–why–how analysis.

Figure 13.12.

Dimensionality reduction of a large document collection using Glimmer for multidimensional scaling. The results are laid out in a single 2D scatterplot, allowing
the user to verify that the conjectured clustering shown with color coding is par‐
tially supported by the spatial layout. From [Ingram et al. 09, Figure 8].


Figure 13.13. A chained sequence of what–why–how analysis instances for the scenario of dimensionality reduction of document collection data.

Idiom          Dimensionality Reduction for Document Collections
What: Data     Text document collection.
What: Derived  Table with 10,000 attributes.
What: Derived  Table with two attributes.
How: Encode    Scatterplot, colored by conjectured clustering.
How: Reduce    Attribute aggregation (dimensionality reduction) with MDS.
Scale          Original attributes: 10,000. Derived attributes: two. Items: 100,000.
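The reduce step of the chained sequence above can be sketched in code. The following is a minimal classical (Torgerson) MDS, the foundational technique [Torgerson 52] from this family, not the Glimmer algorithm the scenario actually uses; the input sizes and data are invented for illustration, standing in for the 10,000-attribute document table.

```python
import numpy as np

def classical_mds(X, k=2):
    """Classical (Torgerson) MDS: derive k synthetic attributes whose
    pairwise distances approximate those of the high-dimensional table X."""
    # Squared Euclidean distance matrix between all pairs of items.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    n = sq.shape[0]
    # Double-center the squared distances to recover inner products.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ sq @ J
    # The top-k eigenpairs of B give the low-dimensional coordinates.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# A small stand-in for the derived high-dimensional table.
rng = np.random.default_rng(0)
Z = rng.random((12, 2))                      # true 2D structure
X = np.hstack([Z, np.zeros((12, 8))])        # embedded in 10 dimensions
Y = classical_mds(X, k=2)                    # table with two attributes
```

The resulting two-attribute table `Y` would then feed the encode step, a scatterplot colored by the conjectured clustering.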

With standard dimensionality reduction techniques, the user chooses the number of synthetic attributes to create. When the target number of new attributes is two, the dimensionally reduced data is most often shown as a scatterplot. When more than two synthetic attributes are created, a scatterplot matrix (SPLOM) may be a good choice. Although in general scatterplots are often used to check for correlation between two attributes, the most common tasks with scatterplots of dimensionally reduced data are to verify or find whether the data items form meaningful clusters and to verify or check whether the new synthetic attributes are meaningful.

For both of these tasks, an important function is to be able to select a low-dimensional point in the scatterplot and inspect the high-dimensional data that it represents. Typically, the user investigates by clicking on points and seeing if the spatial layout implied by the low-dimensional positions of the points seems to properly reflect the high-dimensional space.

Sometimes the dataset has no additional information, and the scatterplot is simply encoding two-dimensional position. In many cases there is a conjectured categorization of the points, which are colored according to those categories. The task is then to check whether the patterns of colors match up well with the patterns of the spatial clusters of the reduced data, as shown in Figure 13.12.

When you are interpreting dimensionally reduced scatterplots it is important to remember that only relative distances matter. The absolute position of clusters is not meaningful; most techniques create layouts where the image would have the same meaning if it is rotated in an arbitrary direction, reflected with mirror symmetry across any line, or rescaled to a different size.*

* In mathematical terminology, the layouts are affine invariant.

Another caution is that this inspection should be used only to find or verify large-scale cluster structure. The fine-grained structure in the lower-dimensional plots should not be considered strongly reliable because some information is lost in the reduction. That is, it is safe to assume that major differences in the distances between points are meaningful, but minor differences in distances may not be a reliable signal.

Empirical studies have shown that two-dimensional scatterplots or SPLOMs are the safest idiom choices for inspecting dimensionally reduced data [Sedlmair et al. 13]. While the idiom of three-dimensional scatterplots has been proposed many times, they are susceptible to all of the problems with 3D representations. As discussed in Section 6.3, they are an example of the worst possible case for accurate depth perception, an abstract cloud of points floating in three-dimensional space. Although some systems use the idiom of 3D landscapes for dimensionally reduced data, this approach is similarly problematic, for reasons also covered in Section 6.3.

Filtering:

Early work in dynamic queries popularized filtering with tightly coupled views
and extending standard widgets to better support these queries [Ahlberg and
Shneiderman 94].

Scented Widgets:

Scented widgets [Willett et al. 07] allude to the idea of information scent proposed in the theory of information foraging.

Boxplots:

The boxplot was first presented in Tukey's influential book on Exploratory Data Analysis [Tukey 77]. A recent survey paper discusses the many variants of boxplots that have been proposed in the past 40 years [Wickham and Stryjewski 12].

Hierarchical Aggregation:

A general conceptual framework for analyzing hierarchical aggregation is presented in a recent paper [Elmqvist and Fekete 10]. Earlier work presented hierarchical parallel coordinates [Fua et al. 99].

Spatial Aggregation:

The Modifiable Areal Unit Problem is covered in a recent handbook chapter [Wong 09]; a seminal booklet lays out the problem in detail [Openshaw 84]. Geographically weighted interactive graphics, or geowigs for short, support exploratory analysis that explicitly takes scale into account [Dykes and Brunsdon 07].

Attribute Reduction:

DOSFA [Yang et al. 03a] is one of many approaches to attribute reduction from
the same group [Peng et al. 04, Yang et al. 04, Yang et al. 03b]. The DimStiller
system proposes a general framework for attribute reduction [Ingram et al. 10].
An extensive exploration of similarity metrics for dimensional aggregation was
influential early work [Ankerst et al. 98].

Dimensionality Reduction:

The foundational ideas behind multidimensional scaling were first proposed in the 1930s [Young and Householder 38], then further developed in the 1950s [Torgerson 52]. An early proposal for multidimensional scaling (MDS) in the vis literature used a stochastic force simulation approach [Chalmers 96]. The Glimmer system exploits the parallelism of graphics hardware for MDS; that paper also discusses the history and variants of MDS in detail [Ingram et al. 09]. Design guidelines for visually encoding dimensionally reduced data suggest avoiding the use of 3D scatterplots [Sedlmair et al. 13].
Designing Data Visualizations

Chapter 4. Choose Appropriate Visual Encodings

Choosing Appropriate Visual Encodings


As we discussed in Data, once you know the “shape” of your data, you can
encode its various dimensions with appropriate visual properties. Different visual
properties vary—or may be modified—in different ways, which makes them
good for encoding different types of data. Two key factors are whether a visual
property is naturally ordered, and how many distinct values of this property the
reader can easily differentiate. Natural ordering and number of distinct values
will indicate whether a visual property is best suited to one of the main data
types: quantitative, ordinal, categorical, or relational data. (Spatial data is
another common data type, and is usually best represented with some kind of
map.)

NATURAL ORDERING
Whether a visual property has a natural ordering is determined by whether the mechanics of our visual system and the “software” in our brains automatically—unintentionally—assign an order, or ranking, to different values of that property. The “software” that makes these judgments is deeply embedded in our brains and evaluates relative order independent of language, culture, convention, or other learned factors; it’s not optional and you can’t design around it.[6]

For example, position has a natural ordering; shape doesn’t. Length has a natural
ordering; texture doesn’t (but pattern density does). Line thickness or weight has
a natural ordering; line style (solid, dotted, dashed) doesn’t. Depending on the
specifics of the visual property, its natural ordering may be well suited to
representing quantitative differences (27, 33, 41), or ordinal differences (small,
medium, large, enormous).

Natural orderings are not to be confused with properties for which we have
learned or social conventions about their ordering. Social conventions are
powerful, and you should be aware of them, but you cannot depend on them to
be interpreted in the same way as naturally-ordered properties—which are not
social and not learned, and the interpretation of which is not optional.

Color is not ordered

Here’s a tricky one: Color (hue) is not naturally ordered in our brains. Brightness
(lightness or luminance, sometimes called tint) and intensity (saturation) are, but
color itself is not. We have strong social conventions about color, and there is an
ordering by wavelength in the physical world, but color does not have a non-
negotiable natural ordering built into the brain. You can’t depend on everyone to
agree that yellow follows purple in the way that you can depend on them to agree
that four follows three.

The misuse of color to imply order is rampant; don’t fall into this common trap. In contexts where you’re tempted to use “ordered color” (elevation, heat maps, etc.), consider varying brightness along one, or perhaps two, axes. For example, elevation can be represented by increasing the darkness of browns, rather than cycling through the rainbow (see Figure 4-1[7] and Figure 4-2[8]).

Figure 4-1. A rainbow encoding leads to a map that is very difficult to understand. Does red mean the Alps are hotter than the rest of Europe?

Figure 4-2. In this example the colors diverge from one point, clearly indicating low, medium, and high elevations.

NOTE
For help in choosing appropriate color palettes, a great tool is ColorBrewer 2.0, at http://colorbrewer2.org.
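The brightness-over-rainbow advice can be made concrete with a small sketch of ours (not from the book): build an ordered palette by varying lightness at a single fixed hue, using only Python's standard library. The specific hue and saturation values are arbitrary choices for illustration.

```python
import colorsys

def sequential_palette(hue=0.08, n=5, lightness=(0.85, 0.25), saturation=0.6):
    """Build a naturally ordered palette by varying lightness at one fixed
    hue, instead of cycling hue through the rainbow (which has no order)."""
    lo, hi = lightness
    colors = []
    for i in range(n):
        t = i / (n - 1)
        # colorsys takes hue, lightness, saturation (HLS order).
        r, g, b = colorsys.hls_to_rgb(hue, lo + (hi - lo) * t, saturation)
        colors.append('#%02x%02x%02x' % (round(r * 255),
                                         round(g * 255),
                                         round(b * 255)))
    return colors

# Five browns, light to dark: suitable for, e.g., increasing elevation.
palette = sequential_palette()
```

Because only lightness varies, every reader perceives the same ranking, which is exactly the property a rainbow scale lacks.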

DISTINCT VALUES
The second main factor to consider when choosing a visual property is how many
distinct values it has that your reader will be able to perceive, differentiate, and
possibly remember. For example, there are a lot of colors in the world, but we
can’t tell them apart if they’re too similar. We can more easily differentiate a
large number of shapes, a huge number of positions, and an infinite number of
numbers. When choosing a visual property, select one that has a number of useful
differentiable values and an ordering similar to that of your data (see Figure 4-3).

Figure 4-3. Use this table of common visual properties to help you select an
appropriate encoding for your data type.

Figure 4-4 shows another way to think about visual properties, depending on
what kind of data you need to encode. As you can see, many visual properties
may be used to encode multiple data types. Position and placement, as well as
text, can be used to encode any type of data—which is why every visualization
you design needs to begin with careful consideration of how you’ll use them (see
Chapter 5).

Figure 4-4. Visual properties grouped by the types of data they can be used to encode.

REDUNDANT ENCODING
If you have the luxury of leftover, unused visual properties after you’ve encoded the main dimensions of your data, consider using them to redundantly encode some existing, already-encoded data dimensions. The advantage of redundant encoding is that using more channels to get the same information into your brain can make acquisition of that information faster, easier, and more accurate.[9]

For example, if you’ve got lines differentiated by ending (arrows, dots, etc.), consider also changing the line style (dotted, dashed, etc.) or color. If you’ve got values encoded by placement, consider redundantly encoding the value with brightness, or grouping regions with color, as in Figure 4-5.[10]

Figure 4-5. Color redundantly encodes the position of groups of companies in this graph.

To be totally accurate, in Figure 4-5, adding color more strongly defined the
groupings that weren’t strongly defined before, but those groups are a subset of
the information already provided by position. For that reason, in this case color
adds slightly more informational value beyond mere redundancy.
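A tiny sketch of the idea (our own illustration, not from the book): give each group both a marker shape and a color, so membership reaches the reader through two channels at once. The group names, markers, and colors are all arbitrary stand-ins.

```python
# Candidate channels; group i gets the pair at index i.
MARKERS = ['o', 's', '^', 'D']   # circle, square, triangle, diamond
COLORS = ['#1b9e77', '#d95f02', '#7570b3', '#e7298a']

def redundant_style(groups):
    """Map each item's group to a (marker, color) pair, redundantly
    encoding the same membership information in two visual properties."""
    order = sorted(set(groups))
    if len(order) > len(MARKERS):
        raise ValueError('too many groups for the available channels')
    style = {g: (MARKERS[i], COLORS[i]) for i, g in enumerate(order)}
    return [style[g] for g in groups]
```

With a plotting library, each (marker, color) pair would then be passed to the scatter call for the corresponding group of points.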

DEFAULTS VERSUS INNOVATIVE FORMATS


It is worth noting that there are a lot of good default encodings and encoding
conventions in the world, and with good reason. Designing new encoding formats
can cost you a lot of time and effort, and may make your reader expend a lot of
time and effort to learn. Knowing the expected defaults for your industry, data
type, or reader can save you a lot of work when it comes to both figuring out how
to best encode your data, and how to explain it to your readers. However, if we
all used existing defaults all the time, not much progress would be made. So
when should you use a default, and when should you innovate?

In writing, we often advise each other to stay away from clichés; don’t use a pat
phrase, but try to find new ways to say things instead. The reason is that we want
the reader to think about what we’re saying, and clichés tend to make readers turn
their brains off. In visualization, however, that kind of brainlessness can be a help
instead of a hindrance—since it makes comprehension more efficient—so
conventions can be our friends.

NOTE
Purposely turning visual convention on its head may cause
the reader’s brain to “throw an exception,” if you will, and this
technique can be used strategically; but please, use it
sparingly.

The choice comes down to a basic cost-benefit analysis. What is the expense to
you and your reader of creating and understanding a new encoding format, versus
the value delivered by that format? If you’ve got a truly superior solution (as
evaluated by your reader, and not just your ego), then by all means, use it. But if
your job can be done (or done well enough) with a default format, save everyone
the effort and use a standard solution.

READERS’ CONTEXT
In Chapter 2, we discussed how important it is to recognize that you are creating a visualization for someone other than yourself—and that the reader may show up with a mindset or way of viewing the world different from yours.

First, it’s important to point out that your audience will likely be composed of
more than one reader. And as these people are all individuals, they may be as
different from each other as they are from you, and will likely have very different
backgrounds and levels of interest in your work. It may be impossible to take the
preconceptions of all these readers into consideration at once. So choose the most
important group, think of them as your core group, and design with them in
mind. Where it is possible to appeal to more of your potential audience without
sacrificing precision or efficiency, do so. But, going forward, let us be clear that
when we say reader, what we really mean is a representative reader from within
your core audience.

Okay, now that we’ve cleared that up, let’s get specific about some facets of the
reader’s mindset that you need to take into account.

Titles, tags, and labels

When selecting the actual terms you’ll use to label axes, tag visual elements, or
title the piece (which creates the mental framework within which to view it),
consider your reader’s vocabulary and familiarity with relevant jargon.

Is the reader from within your industry or outside of it? What about other
readers outside of the core audience group?

Is it worth using an industry term for the sake of precision (knowing that the
reader may have to look it up), or would a lay term work just as well?

Will the reader be able to decipher any unknown terms from context, or will
a vocabulary gap obscure the meaning of all or part of the information
presented?

These are the kinds of questions you should ask yourself. Each and every single
word in your visualization needs to serve a specific purpose. For each one, ask
yourself: why use this word in this place? Determine whether there is another
word that would serve the purpose any better (or whether you can get away
without one at all), and if so, make the change.

Related to this, consider any spelling preferences a reader might have. Especially
within the English language, there may be more than one way to spell a word
depending on which country one is in. Don’t make the reader’s brain do extra
work having to parse “superfluous” or “missing” letters.

Colors

Another reader context to take into account is color choice. There is quite a bit of
science about how our brains perceive and process color that is somewhat
universal, as we saw earlier in this chapter. But it’s worth mentioning in the
context of reader preconceptions the significant cultural associations that color
can carry.

Depending on the culture in question, some colors may be lucky, some unlucky;
some may carry positive or negative connotations; some may be associated with
life events like weddings, funerals, or newborn children.

Some colors don’t mean much on their own, but take on meaning when paired or
grouped with other colors: in the United States, red and royal blue to Republicans
and Democrats; pink and light blue often refer to boys and girls; red, yellow, and
green to traffic signals. The colors red, white, and green may signal Christmas in
Canada, but patriotism in Italy. The colors red, white, and blue are patriotic in
multiple places: they will make both an American and a Frenchman think of
home.

Colors may also take on special significance when paired with certain shapes. A
[11]
red octagon means stop in many places (see Figure 4-6 ), but not all.

Figure 4-6. This stop sign from Montreal is labeled in French, but no
English speaker is likely to be confused about its meaning.

Color blindness

Of course, we know that there are many variations in the way different people
perceive color. This is commonly called color blindness but is more properly
referred to as color vision deficiency or dyschromatopsia. A disorder of color
vision may present in one of several specific ways.

Although prevalence estimates vary among experts and for different ethnic and national groups, about 7% of American men experience some kind of color perception disorder (women are much more rarely affected: about 0.4 percent in America).[12] Red-green deficiency is the most common by far, but yellow-blue deficiency also occurs. And there are lots of people who have trouble distinguishing between close colors like blue and purple.

NOTE
A great resource for help in choosing color palettes friendly to those with color blindness is the Color Laboratory at http://colorlab.wickline.org/colorblind/colorlab/. There you can select color swatches into a group (or enter custom RGB values) and simulate how they are perceived with eight types of dyschromatopsia. Note: the simulation assumes that you yourself have typical color vision.
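One cheap, partial check you can automate (an illustration of ours, not a substitute for a full simulator): compute each color's WCAG relative luminance and verify the palette stays distinguishable even if all hue information is lost. The 0.1 gap threshold is an arbitrary choice for this sketch.

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB color like '#rrggbb'."""
    def linearize(c8):
        c = c8 / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)

def distinguishable_in_grayscale(palette, min_gap=0.1):
    """True if every pair of colors stays apart in luminance alone,
    i.e. the palette does not rely purely on hue differences."""
    lums = sorted(relative_luminance(c) for c in palette)
    return all(b - a >= min_gap for a, b in zip(lums, lums[1:]))
```

A palette that passes this check degrades gracefully for most forms of dyschromatopsia, since luminance differences survive when hue differences do not.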

Directional orientation

Is the reader from a culture that reads left-to-right, right-to-left, or top-to-bottom? A person’s habitual reading patterns will determine their default eye movements over a page, and the order in which they will encounter the various visual elements in your design.

It will also affect what the reader perceives as “earlier” and “later” in a timeline,
where the edge that is read from will be “earlier” and time will be assumed to
progress in the same direction as your reader typically reads text.

This may also pertain to geographic maps: many of us are used to seeing the
globe split somewhere along the Pacific, with north oriented upward. This suits
North Americans just fine, since—scanning from left to right and starting from
the top of the page—we encounter our own country almost immediately. The
convention came about thanks to European cartographers, who designed maps
over hundreds of years with their own continent as the center of the world.

Occasionally, other map makers have chosen to orient the world map differently,
often for the same purpose of displaying their homeland with prominence (such
as Stuart McArthur’s “South-Up Map,” which puts his native Australia toward
the center-top) or simply for the purpose of correcting the distortion effect that
causes Europe to look bigger than it really is (such as R. Buckminster Fuller’s
“Dymaxion Map”).

COMPATIBILITY WITH REALITY


As with so many suggestions in this chapter, a large factor in your success is
making life easier for your reader, and that’s largely based on making encodings
as easy to decode as possible. One way to make decoding easy is to make your
encodings of things and relationships as well aligned with the reality (or your
reader’s reality) of those things and relationships as possible; this alignment is
called compatibility. This can have many different aspects, including taking cues
from the physical world and from cultural conventions.

Things in the world are full of inherent properties. These are physical properties
that are not (usually) subject to interpretation or culture, but exist as properties
you can point to or measure. Some things are larger than others, have specific
colors, well-known locations, and other identifying characteristics. If your
encodings conflict with or don’t reflect these properties, if they are not
compatible, you’re once again asking your reader to spend extra time decoding
and wondering why things are “wrong;” why they don’t look like they’re
expected to (for example, see the boats and airplanes in Figure 4-7).

Figure 4-7. The visual placement of boats above airplanes is jarring, since
they don’t appear that way in the physical world.

Figure 4-8 shows an example from http://html5readiness.com/.

Figure 4-8. Representation of browser capabilities.

Notice how the colors they’ve chosen map to the browser icons, as shown in
Figure 4-9.

Figure 4-9. The representative colors differ greatly from the colors in the browser icons. Other choices would better reflect the icons’ colors.

The encodings they’ve chosen aren’t very compatible with the reality of the browsers’ icons and branding. IE, with a blue and yellow icon, is shown in shades of purple. Firefox, with a blue and orange icon, is shown in blue—which is fine, but curious, given the other browser icons that also contain blue and might be better contenders for the blue encoding. Safari, with a blue icon, is encoded with yellow. Chrome—which has red, blue, green, and yellow, but no orange in its icon—is orange. Opera, with its red icon and corresponding red label, has the only encoding that makes sense. An improved set of encodings that more closely match the reality of the browser icons is shown in the last column of Figure 4-9.

Beyond physical or natural conventions, there are learned, cultural conventions that must also be respected. These may not be as easy to point to, but are no less important. Note that, as we advised in the section on natural ordering, you should not rely on social or cultural conventions to convey information. However, these conventions can be very powerful, and you should be aware that your reader brings them to the table. Making use of them, when possible, to reinforce your message will help you convey information efficiently. Avoid countering conventions where possible in order to avoid creating cognitive dissonance, a clash of habitual interpretation with the underlying message you are sending.

To use colors as an example of some of these learned conventions, red and green have strong connotations for bad and good, or stop and go. (See the Color section in Chapter 6 for more on common color associations.) Beyond color, consider cultural conventions about spatial representations, such as what left and right mean politically, or the significance of above and below. Also consider cultural conventions about the meaning of square versus round, and bright versus dark.

All sorts of metaphorical interpretations are culturally ingrained. An astute designer will think about these possible interpretations and work with them, rather than against them.

Direction and reality

Direction is an interesting property to consider because it has both inherent and learned conventions. How many times have you looked at an emergency exit map in a hallway, and realized that the exit, displayed to the left on the map, was to your right in reality, because the map was upside down relative to the direction you were facing?[13] You may also run into maps that, for various reasons, don’t put north at the top of the map. Even though the map may be fully accurate and not violating compatibility with physical reality, this violation of cultural convention can be enormously disorienting.

PATTERNS AND CONSISTENCY


The human brain is amazingly good at identifying patterns in the world. We
easily recognize similarity in shapes, position, sound, color, rhythm, language,
behavior, and physical routine, just to name a few variables. This ability to
recognize patterns is extremely powerful, as it enables us to identify stimuli that
we’ve encountered before, and predict behavior based on what happened the last
time we encountered a similar stimulus pattern. This is the foundation of
language, communication, and all learning. The ability to recognize patterns and
learn from them allows us to notice and respond when we hear the sound of our
name, to run down a set of stairs without hurting ourselves, and to salivate when
we smell food cooking.

Consequently, we also notice violations of patterns. When a picture is crooked, a friend sounds troubled, a car is parked too far out into the street, or the mayonnaise smells wrong, the patterns we expect are being violated and we can’t help but notice these exceptions. Flashing lights and safety vests are intentionally designed to stand out from the background—we notice them because they are exceptions to the norm.

Practically speaking, this pattern and pattern-violation recognition has two major implications for design. The first is that readers will notice patterns and assume they are intentional, whether you planned for the patterns to exist or not. The second is that when they perceive patterns, readers will also expect pattern violations to be meaningful.

As designers, we must be extremely deliberate about the patterns and pattern violations we create. Don’t arbitrarily assign positions or colors or connections or fonts with no rhyme or reason to your choices, because your reader will always assume that you meant something by it. If you change the order or membership of a list of items, either in text or in placement, it will be perceived as meaningful. If you change the encoding of items, by position, shape, color, or other methods, it will be perceived as meaningful.

So how should you avoid the potential trap of implying meaning where none is
intended? It all comes down to three simple rules.

Be consistent in membership, ordering, and other encodings.

Things that are the same should look the same.

Things that are different should look different.

These sound simple, and yet violations of these rules are everywhere. You can
probably think of a few already, and will probably start to notice more examples
in your daily life. Maintaining consistency and intention when encoding will
greatly enhance the accessibility and efficiency of your visualization, and, as with
any good habit, will make your life easier in the long run.
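In code, the cheapest way to honor these three rules is a single shared mapping used by every chart, so identical things are guaranteed to look identical. This is a sketch of ours; the category names and colors are hypothetical.

```python
# One source of truth for category -> color, shared by every chart.
CATEGORY_COLORS = {
    'revenue': '#1b9e77',
    'costs': '#d95f02',
    'profit': '#7570b3',
}

def color_for(category):
    """Fail loudly on an unknown category rather than silently inventing
    a color the reader would assume was meaningful."""
    try:
        return CATEGORY_COLORS[category]
    except KeyError:
        raise ValueError(f'no encoding defined for {category!r}') from None
```

Routing every chart through one mapping makes consistency a property of the code rather than of the designer's memory.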

Selecting Structure
Just as we don’t write PhD dissertations in sonnet form, or thank-you notes like
legal briefs complete with footnote citations, it’s important that the structure of
your visualization be appropriate to your data.

The structure of a visualization should reveal something about the underlying data. Take, for example, one of the most classic data visualizations: the Periodic Table of the Elements (Figure 4-10[14]). This is arguably one of the most elegant visualizations ever made. It takes a complex dataset and makes it simple, organized, and transparent. The elements are laid out in order by atomic number, and by wrapping the rows at strategic points, the table reveals that elements in various categories occur at regular intervals, or periods. The table makes it easier to understand the nature of each element—both individually, and in relation to the other elements we know of.

Figure 4-10. This rendition of the classic table makes good use of color and
line.

Perhaps because it is so elegant and iconic, the Periodic Table is also one of the most frequently imitated visualizations out there. Designers and satirists are constantly repurposing its familiar rows and columns to showcase collections of everything from typefaces to video game controllers, and, ironically, visualization methods. This phenomenon is a particular peeve to your authors precisely because it violates the important principle of selecting an appropriate structure. With the possible (yet questionable) exception of Andrew Plotkin’s Periodic Table of Desserts,[15] copycat designers are using a periodic structure to display data that is not periodic. They are just so many derivative attempts at cleverness.

WARNING
If you’re using a particular structure just to be cute or clever,
you’re doing it wrong.

If you are tempted to use a periodic table format for your non-periodic data, consider instead a two-axis scatter plot or table, where the axes are well matched to the important aspects of your data. This will lead you to a more accurate, and less derivative, final product.[16]

NOTE
For another chemistry-oriented example of a specific structure with an entirely different purpose, check out the Table of Nuclides: http://en.wikipedia.org/wiki/Table_of_nuclides

Beyond that, we must refer you to other tomes (we suggest the books by Yau and Kosslyn listed in Appendix A to begin with, and Bertin for more dedicated readers) to help you select just the right structure for your particular circumstance; as you can see from Figure 4-11, there are too many to address each one directly within the scope of this short book. But here are some general principles and common pitfalls to guide your selection process.

COMPARISONS NEED TO COMPARE


If you intend to allow comparison of values, set the representations up in
equivalent ways, and then put them close together. You wouldn’t ask people to
look at two versions of a photo in different rooms; you’d put them side-by-side.
The same goes for visualizations, particularly with quantitative measures. If you
want people to be able to meaningfully compare values, put them as near to each
other as possible.

Another important comparison principle is that of preservation. Just as you would isolate variables in a clinical trial by comparing a test group to a control group—which is similar to the test group except for one variable—you need to isolate visual changes by preserving other conditions, so that the change may be easily and fairly interpreted.

A good example of this is in comparing two graphs. Beware of what scales you
use on your axes so that the reader can fairly interpret the graph data. If one
graph has a scale of 0 to 10 and the other has a scale of 0 to 5 (Figure 4-11), the
slopes displayed on the graphs will be very different for the same data. Using
unequal scales for data you are attempting to compare makes comparison much
more difficult.

Figure 4-11. The same data appears flatter (top) or steeper (bottom)
depending on the scales chosen. If we were attempting to compare these
data sets, the unequal axes would introduce distortion that made
comparison more difficult.

SOME STRUCTURES ARE JUST INHERENTLY BAD


Some formats are just bad, and should never be used under any circumstances.
Many of the formats that fall into this category do so because they distort
proportion. There are certain things that our brains are and aren’t good at: for
example, we are terrible at comparing lengths of curved lines and the surface
areas of irregularly-shaped fields. For this reason, concentric circle graphs (see,
for example, http://michaelvandaniker.com/blog/2009/10/31/visualizing-historic-browser-statistics-with-axiis/)
are one of the worst offenders in the world of data presentation structures.[17]

If I show you a section of the ring in the middle that represents a huge
percentage, it still looks objectively shorter than a section of the outer ring that
may represent a much smaller percentage. Also, having all of these lines wrapped
in a circle makes it difficult to compare their lengths anyway. The only way you
can really grasp the information represented in this graph is to read the
percentage numbers in the labels. In this case, we may as well just have a table of
numbers—it would be faster to read and easier to make comparisons with.
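The "just use a table" alternative takes only a few lines. This sketch uses placeholder browser-share figures (not the actual data behind the linked graphic): sorting descending and aligning the numbers makes the comparisons the circular layout obscures immediate.

```python
# Placeholder browser-share percentages; not the actual data from
# the linked concentric-circle graphic.
shares = {"Browser A": 41.2, "Browser B": 27.5,
          "Browser C": 18.9, "Browser D": 12.4}

# Sort descending so rank order is obvious, then right-align the
# numbers so magnitudes line up for easy scanning.
rows = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
for name, pct in rows:
    print(f"{name:<10} {pct:5.1f}%")
```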

Similarly, the ringed pie graph format known as Nightingale’s Rose (for its
creator, Florence—see Figure 4-12), is almost completely useless. Comparing the
areas of the sliced pie wedges is nearly impossible to do accurately. Line graphs
or stacked bar graphs would have served much better.

Figure 4-12. Nightingale’s Rose.

Unfortunately, this format continues to be reinvented in all sorts of modern
contexts. See Figure 4-13 for an equally useless implementation using the same
variously sized pie wedges.

Figure 4-13. A radial layout distorts the data and renders this disk usage
map totally ineffective for all but the coarsest comparisons.

SOME GOOD STRUCTURES ARE OFTEN ABUSED


There are bad formats, and then there are good formats frequently misused. Like
the Periodic Table, pie graphs are useful for a very specific purpose, but quickly
devolve into unhelpful parody when drafted into extended service.

The specialty of a pie graph is comparison—specifically, comparison of a few
parts to a larger whole. We’ve already established above in our discussion of
concentric circle graphs and Nightingale’s Roses that the human brain is lousy at
comparing the lengths and surface areas of curved or irregularly-shaped fields;
pie graphs fall directly into this category.

Another common pitfall is the use of a geographic map for any and all data that
includes a location dimension. Sometimes the use of a map will actually distort
your message—such as when the surface area of each region fails to correspond
to your population data (see the section on physical reality in Chapter 5). If your
data is tied to population but your display is based on regional size, the
proportionally larger surface areas of some regions may inflate the appearance of
trends in those regions. Consider using a table or bar graph instead.
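One way to avoid that distortion is to normalize the location-tagged counts by population before deciding how to display them at all. A small sketch with invented region names and figures: the region with the most raw events turns out to have the weakest per-capita trend, exactly the signal a size-weighted map display would bury.

```python
# Invented example data: raw event counts and populations by region.
counts = {"Region A": 900, "Region B": 300, "Region C": 120}
population = {"Region A": 3_000_000, "Region B": 500_000, "Region C": 150_000}

# Convert raw counts to rates per 100,000 people. Region A leads on raw
# counts, but after normalization the smaller regions show the stronger
# trends, which is what a display weighted by regional size would hide.
per_100k = {r: counts[r] / population[r] * 100_000 for r in counts}
for region, rate in sorted(per_100k.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region}: {rate:.0f} per 100,000")
```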

NOTE
If you wish to show regional trends, remember that you don’t
have to position states or countries alphabetically; it’s okay to
group them by region or along some other appropriate axis.

KEEP IT SIMPLE (OR YOU MIGHT LOOK STUPID)


We talked about careful selection of visual content in Chapter 3, and will talk
about selecting and applying encodings well in Chapter 6. But editing (in the
sense of minimizing noise to maximize signal) is also a key concept to bear in
mind for selecting a useful structure (and keeping it useful).

Consider Figure 4-14, which shows an organization chart developed in 2010 by
the Joint Economic Committee minority, Republicans. The chart, titled “Your
New Health Care System,” depicts the Democratic party’s proposed health care
system, and displays a bewildering array of new government agencies,
regulations, and mandates, represented by a tangled web of shapes and lines.

Figure 4-14. This rendition of the healthcare plan clearly revels in and aims
to exaggerate the system’s complexity.

It’s fairly obvious that political motivations dominated the design choices for this
visualization; it clearly falls into the category of persuasive visualization (rather
than informative). The chart itself doesn’t leave the reader with any actual
information other than, “Wow, this system is complicated.” When we consider
the title of the press release in which this was unveiled—“America’s New Health
Care System Revealed”—we know those responsible to be disingenuous.

A citizen designer, Robert Palmer, took it upon himself to make a different,
cleaner visual representation of the same proposed health care plan (Figure 4-15).[18]
His chart is strikingly different from the one created by the Joint
Economic Committee minority.

Palmer explained his motivation in an open letter to Rep. John Boehner (R-OH)
on Flickr (http://www.flickr.com/photos/robertpalmer/3743826461/):

By releasing your chart, instead of meaningfully educating the public, you
willfully obfuscated an already complicated proposal. There is no simple
proposal to solve this problem. You instead chose to shout “12! 16! 37! 9!
24!” while we were trying to count something.[19]

Figure 4-15. Palmer’s representation of the same healthcare plan doesn’t
oversimplify, but is much easier to parse.

There is no doubt that national healthcare is a complex matter, and this is evident
in both designs. But Palmer’s rendition clearly aims to pare down that complexity
to its essential nature, for the purpose of making things easier to understand,
rather than purposefully clouding what is happening under the abstracted layer.
This is the hallmark of effective editing.

Sometimes a designer will make the visualization more complicated than it needs
to be, not because he is trying to make the data look bad, but for precisely the
opposite reason: he wants the data to look as good as possible. This is an equally
bad mistake.

Your data is important and meaningful all on its own; you don’t have to make it
special by trying to get fancy. Every dot, line and word should serve a
communicative purpose: if it is extraneous or outside the scope of the
visualization’s goals, it must go. Edit ruthlessly. Don’t decorate your data.

[6] Or shouldn’t try to: that way madness lies.

[7] European Soil Bureau. Copyright © 1995–2011, European Union. Used with stated authorization to reproduce, with acknowledgment. http://eusoils.jrc.ec.europa.eu/

[8] Center for International Earth Science Information Network (CIESIN) (2007). Copyright © 2007, The Trustees of Columbia University in the City of New York. Columbia University. Population, Landscape, and Climate Estimates (PLACE). Used under the Creative Commons Attribution License. http://sedac.ciesin.columbia.edu/place/

[9] Ware, Information Visualization: Perception for Design (Morgan Kaufmann), p. 179.

[10] Tableau Software Public Gallery. Copyright © 2003–2011 Tableau Software. http://www.tableausoftware.com/learn/gallery/company-performance

[11] Christian Caron (2011). Copyright © 2011, Christian Caron.

[12] Montgomery, Geoffrey, for Howard Hughes Medical Institute. Seeing, Hearing, and Smelling the World. Chevy Chase, MD: 1995.

[13] Your authors take particular interest in examining information design in the world, take every opportunity to do so, and hope that everyone else will start to do the same.

[14] Michael Dayah (1997). Copyright © 1997 Michael Dayah. http://www.ptable.com

[15] http://eblong.com/zarf/periodic/index.html

[16] Astute readers will note that the periodic table is also a two-axis layout with carefully chosen axes that reflect, and facilitate access to, the relevant properties of the data.

[17] We care so much about this issue that we dedicate a section in Chapter 5 to good and bad uses of circular layouts.

[18] Robert Palmer (2010). Copyright © 2010, Robert Palmer. http://rp-network.com/

[19] http://www.flickr.com/photos/robertpalmer/3743826461/
