netquest.com 3
What is Data Visualization
Tableau, Roambi, Qlik, Salesforce Einstein Analytics, Highcharts, Google Charts,
FusionCharts, Infogram, and Sisense are some of the web applications of data
visualization.
The data visualization process
[Figure: Visualization process1]
• The human mind can see an image for just 13 milliseconds and store the information, provided that it is associated with a concept. Our eyes can take in 36,000 visual messages per hour.
• 40% of nerve fibers are connected to the retina.
Data visualization chiefly helps in 3 key aspects of reports and statements:
1) Explaining
Visuals aim to lead the viewer down a path in order to describe situations, answer questions, support decisions, communicate information, or solve specific problems. When you attempt to explain something through data visualization, you start with a question, which interacts with the data set in such a way that enables viewers to make a decision and, subsequently, answer the question.
For example: the graphic below could clearly show which country has the greatest demand for a certain product, compared globally, in a given month.
[Figure: bar chart of demand by region (United States, Russia, South Africa, Europe, Canada, Australia, Japan); vertical axis from 0 to 500]
2) Exploring
Some visuals are designed to lend a data set spatial dimensions, or to offer numerous subsets of data in order to raise questions, find answers, and discover opportunities. When the goal of a visual is to explore, the viewers start by familiarizing themselves with the dataset, then identifying an area of interest, asking questions, exploring, and finding several solutions or answers.
For example: an interactive graphic from The Guardian2 invites us to explore how the linguistic standard of U.S. presidential addresses has declined over time. The visual is interactive and explanatory, in addition to indicating the readability score of various presidents’ speeches.
3) Analyzing
Other visuals prompt viewers to inspect, distill, and transform the most significant information in a data set so that they can discover something new or predict upcoming situations.
For example: this interactive graphic about machine learning3 invites us to explore and discover information within the visual by scrolling through it. Using the machine learning method, the visual explains the patterns detected in the data in order to categorize characteristics.
We’ll close this introduction with a 2012 reflection by Alberto Cairo, a specialist in information visualization and a leader in the world of data visualization. For the author, a good visual must provide clarity, highlight trends, uncover patterns, and reveal unseen realities:
“We create visuals so that users can analyze data and, from it, discover realities that not even the designer, in some instances, had considered.”
2 Available at: https://www.fusioncharts.com/whitepapers/downloads/Principles-of-Data-Visualization.pdf
3 Available at: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
2. Data types, relationships, and visualization formats
2 kinds of data
Before we talk about visuals themselves, we must first understand the different kinds of data that can be visualized and how they relate to one another. The most common kinds of data are4:
5 Source: HubSpot, Prezi, and Infogram (2018). Presenting Data People Can’t
Ignore: How to Communicate Effectively Using Data. |p.10 of 16 |Available at:
https://offers.hubspot.com/presenting-data-people-cant-ignore.
7 data relationships
Data relationships can be simple, like the progress of a single metric over time (such as visits to a blog over the course of 30 days or the number of users on a social network),
or they can be complex, precisely comparing relationships, revealing structure, and extracting patterns from data. There are seven data relationships to consider:
Ranking: A visualization that relates two or more values with respect to a relative magnitude. For example: a company’s most sold products.
Nominal comparisons: Visualizations that compare quantitative values from different subcategories. For example: product prices in various supermarkets.
Series over time: Here we can trace the changes in the values of a constant metric over the course of time. For example: monthly sales of a product over the course of two years.
Deviation: Examines how each data point relates to the others and, particularly, to what point its value differs from the average. For example: the line of deviation for tickets to an amusement park sold on a rainy versus a normal day.
Distribution: Visualization that shows the distribution of data spatially, often around a central value. For example: the heights of players on a basketball team.
Partial and total relationships: Show a subset of data as compared with a larger total. For example: the percentage of clients that buy specific products.
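Three of the relationships described above (ranking, deviation, and partial and total relationships) reduce to simple arithmetic on a set of values. The following is a minimal sketch in Python; the product names and sales figures are invented for illustration and do not come from the original text.

```python
# Hypothetical unit sales per product (illustrative data only).
sales = {"A": 120, "B": 300, "C": 80, "D": 100}

# Ranking: order items by relative magnitude, largest first.
ranking = sorted(sales, key=sales.get, reverse=True)

# Deviation: how far each value sits from the average.
mean = sum(sales.values()) / len(sales)
deviation = {name: value - mean for name, value in sales.items()}

# Partial and total relationship: each item's share of the whole.
total = sum(sales.values())
share = {name: value / total for name, value in sales.items()}

print(ranking)
print(deviation)
print(share)
```

Each result feeds a different chart family: the ranking suits a sorted bar chart, the deviations a chart centered on the average, and the shares a part-of-a-whole chart.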
11 formats
There are two types of visualizations: static and interactive. Their use depends on the search and analysis dimension level. Static visuals can only analyze data in one dimension, whereas interactive visuals can analyze it in several.
1. Bar chart
Bar charts are one of the most popular ways of visualizing data because they present a data set in a quickly understood format that enables viewers to identify highs and lows at a glance. They are very versatile, and they are typically used to compare discrete categories, to analyze changes over time, or to compare parts of a whole. The three variations on the bar chart are:
[Figure: bar chart variations; axis labels include months (Jan to May), categories (Entertainment, Health), values from 0 to 5,500, and percentages from 0% to 100%]
2. Histograms
• Vertical columns
3. Pie charts
4. Scatter plots
5. Heat maps
6. Line charts
These are used to display changes or trends in data over a period of time. They are especially useful for showcasing relationships, acceleration, deceleration, and volatility in a data set.
[Figure: Line chart]
7. Bubble charts
These graphics display three-dimensional data and accentuate data in dispersion diagrams and maps. Their purpose is to highlight nominal comparisons and classification relationships. The size and color of the bubbles represent a dimension that, along with the data, is very useful for visually stressing specific values. The two variations on the bubble chart are:
• The bubble plot: used to show a variable in three dimensions, position coordinates (x, y) and size.
• Bubble map: used to visualize three-dimensional values for geographic regions.
8. Radar charts
These are a form of representation built around a regular polygon that is contained within a circle, where the radii that guide the vertices are the axes over which the values are represented. They are equivalent to graphics with parallel coordinates on polar coordinates. Typically, they are used to represent the behavior of a metric over the course of a set time cycle, such as the hours of the day, months of the year, or days of the week.
[Figure: Radar chart]
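The geometry behind a radar chart can be sketched in a few lines: each value gets its own axis, the axes are spaced evenly around a circle, and the value sets the distance from the center. The function name `radar_vertices` and the scaling to a unit circle are assumptions made for this illustration, not part of the original text.

```python
import math

def radar_vertices(values, max_value):
    """Return (x, y) points for a radar chart, one axis per value."""
    n = len(values)
    points = []
    for i, v in enumerate(values):
        angle = 2 * math.pi * i / n  # axes evenly spaced around the circle
        r = v / max_value            # radius scaled so max_value lies on the unit circle
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Four equal values at full scale trace a square inscribed in the unit circle.
pts = radar_vertices([1, 1, 1, 1], max_value=1)
```

Joining the returned points in order, and closing the loop back to the first, gives the polygon that a plotting tool would draw.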
9. Waterfall charts
These help us understand the cumulative effect of positive and negative values on variables in a sequential fashion.
[Figure: waterfall chart running from Start through steps A to L to End, with falls and rises; vertical axis from 50K to 400K]
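The cumulative effect that a waterfall chart displays is a running total. As a rough sketch (the function name and the sample figures are hypothetical), each floating bar spans the level before and after one change:

```python
from itertools import accumulate

def waterfall_bars(start, changes):
    """Return (bottom, top, kind) for each change applied to a starting value."""
    # Running level after the start value and after each successive change.
    levels = list(accumulate(changes, initial=start))
    bars = []
    for before, after in zip(levels, levels[1:]):
        kind = "rise" if after >= before else "fall"
        bars.append((min(before, after), max(before, after), kind))
    return bars

bars = waterfall_bars(100, [50, -30, 20])
```

The last running level is the value of the End bar, so the chart always reconciles the start, the individual changes, and the final total.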
10. Treemaps
Tree maps display hierarchical data (in a tree structure) as a set of nested rectangles that occupy surface areas proportional to the value of the variable they represent. Each tree branch is given a rectangle, which is later placed in a mosaic with smaller rectangles that represent secondary branches. The finished prod-
[Figure: treemap example; node A (200) with children B (80) and C (120), plus nodes E and H]
11. Area charts
These represent the relationship of a series over time, but unlike line charts, they can represent volume. The three variations on the area chart are:
• Standard area: used to display or compare a progression over time.
• Stacked area: used to visualize relationships as part of the whole, thus demonstrating the contribution of each category to the cumulative total.
• 100% stacked area: used to communicate the distribution of categories as part of a whole, where the cumulative total does not matter.
[Figure: standard area chart examples; series A, B, C; horizontal axis from 0 to 6]
Selecting the right graphic to effectively communicate through our visualizations is no easy task. Stephen Few (2009), a specialist in data visualization, proposes taking a practical approach to selecting and using an appropriate graphic:
• Choose a graphic that will capture the viewer’s attention for sure.
• Represent the information in a simple, clear, and precise way (avoid unnecessary flourishes).
• Make it easy to compare data; highlight trends and differences.
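The difference between the stacked and 100% stacked area variations is easiest to see in the numbers that get plotted. The sketch below is one possible way to compute them; the function names are hypothetical, and it assumes every time step has a non-zero total.

```python
def stacked(series):
    """Upper boundary of each series at each time step (stacked area)."""
    tops = []
    running = [0] * len(series[0])
    for s in series:
        running = [r + v for r, v in zip(running, s)]
        tops.append(running)
    return tops

def stacked_100(series):
    """Same as stacked(), but each time step is rescaled to sum to 1."""
    totals = [sum(column) for column in zip(*series)]  # assumes no total is zero
    shares = [[v / t for v, t in zip(s, totals)] for s in series]
    return stacked(shares)

series = [[1, 2], [3, 2]]  # two categories observed at two time steps
```

In the stacked version the top boundary is the cumulative total; in the 100% version the top boundary is always 1, so only the distribution between categories remains visible.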
3. Basic principles for
data visualization
Graphics with an objective: seeking your mantra
The goal of data visualizations is to help us understand the object they represent. They are a medium for communicating stories and the results of research, as well as a platform for analyzing and exploring data. Therefore, having a sound understanding of how to create data visualizations will help us create meaningful and easy-to-remember reports, infographics, and dashboards. Creating suitable visuals helps us solve problems and analyze a study’s objects in greater detail.
The first step in representing information is trying to understand that data visualization.
Ben Shneiderman gave us a useful starting point in his text “The Visual Information-Seeking Mantra” (1996), which remains a touchstone work in the field. This author suggests a simple methodology for novice users to delve into the world of data visualization and experiment with basic visual representation tasks. 5
Shneiderman introduces his famous mantra on how to approach the quest for visual information, which he breaks down into three tasks:
1. Overview first: This ensures viewers have a general understanding of the data set, as their starting point for exploration. This means offering them a visual snapshot of the different kinds of data, explaining their relationship in a single glance. This strategy helps us visualize the data, at all its different levels, at one time.
2. Zoom and filter: The second step involves supplementing the first so that viewers understand the data’s underlying structure. The zoom in/zoom out mechanism enables us to select interesting subsets of data that meet certain criteria while maintaining the sense of position and context.
3. Details on demand: This makes it possible to select a narrower subset of data, enabling the user to interact with the information and use filters by hovering or clicking on the data to pull up additional information.
The chart on the right side summarizes the key points to designing such a graphic, with an eye to human visual perception, so that users can translate an idea into a set of physical attributes.
[Figure: four levels labeled OVERVIEW FIRST, ZOOM AND FILTER, and DETAILS ON DEMAND. 1. System Context: the system plus users and system dependencies. 2. Containers: the overall shape of the architecture and technology choices. 3. Components: logical components and their interactions within a container. 4. Classes: component or pattern implementation details.]
These attributes are: structure, position, form, size, and color. When properly applied, these attributes can present information effectively and memorably.
5 Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Visual Information Seeking Mantra (p. 336). Available at: https://www.cs.umd.edu/~ben/papers/Shneiderman1996eyes.pdf
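The three tasks of the mantra can be read as three successively narrower queries over the same data. The sketch below illustrates that reading only; the records, field order, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, product, units sold).
records = [
    ("North", "Notebook", 12),
    ("North", "Pen", 30),
    ("South", "Notebook", 7),
    ("South", "Pen", 18),
]

def overview(rows):
    """Overview first: total units per region, a snapshot of the whole set."""
    totals = Counter()
    for region, _, units in rows:
        totals[region] += units
    return dict(totals)

def zoom_and_filter(rows, region):
    """Zoom and filter: keep only the rows for one region of interest."""
    return [row for row in rows if row[0] == region]

def details_on_demand(rows, region, product):
    """Details on demand: the single record a user selected, if it exists."""
    for row in rows:
        if row[0] == region and row[1] == product:
            return row
    return None

print(overview(records))
print(zoom_and_filter(records, "South"))
print(details_on_demand(records, "South", "Pen"))
```

In an interactive visual these calls would be triggered by the initial render, by a zoom or filter control, and by a hover or click, respectively.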
Layout and design: communicative elements
In order to begin designing our reports and statements, it is essential to understand that visual representations are cognitive tools that complement and strengthen our mental ability to encode and decode information6. Meirelles (2014) notes that: “All graphic representation affects our visual perception, because the elements of transmission utilized act as external stimuli, which activate our emotional state and knowledge.”
Thus, when our mind visualizes a representation, it transforms the information, merges it, and applies a hierarchical structure to it to facilitate interpretation.
For this reason, in order to have an efficient perceptive impact, it is important to adhere to a series of best practices when creating reports and infographics. As with any other form of communication, success depends largely on the business’s familiarity with the established code and the resources available. Space, shapes, color, icons, and typography are a few of the essential elements of a striking visual with communicative power.
Structuring: the importance of layout
All visual representations begin with a blank dimensional space that will eventually hold the information which will be communicated. The process of spatial coding is a fundamental part of visual representation because it is the medium in which the results of our compositional decisions and the meaning of our visual statement will be visualized, thereby having an impact on the user.
Edward Tufte (1990) defines “layout” as a scheme for distributing visual elements in order to achieve organization and harmony in the final composition. Layout planning and design serve as a template for applying hierarchy and control to information at varying levels of detail.7 In his book Envisioning Information, Tufte offers several guidelines for information design:
• Have a properly chosen format.
• Give a broad visual tour and offer a focused reading at different detail levels.
• Use words, numbers, and drawings.
• Reflect a balance, a proportion, a sense of relevant scale, and a context.
Spatial encoding requires processing spatial proportions (position and size), which have a determining role in the organization of perception and memory.
Furthermore, the visual hierarchy of elements plays a role in this encoding process, because the elements’ organization and distribution must have a well-defined hierarchical system in order to communicate effectively (Meirelles: 2014). In a sense, visualizations are paragraphs about data, and they should be treated as such. Words, images, and numbers are part of the information that will be visualized. When all of the elements are integrated in a single structure and visual hierarchy, the infographic or report will organize space properly and communicate effectively, according to your user’s needs.
6 Meirelles, I. (2014). “La información en el diseño,” (p. 21-22). Barcelona: Parramón.
7 Tufte, E. (1990). Envisioning Information. Cheshire: Graphics Press.
Visual variables and their semantics
Using consistent and attractive color schemes
Saturation: this refers to the intensity of a given color’s hue. It varies based on brightness. Darker colors are less saturated, and the less saturated a color is, the closer it gets to gray. In other words, it gets closer to a neutral (hueless) color. The following graphic offers a brief summary of color application.
[Figure: summary of color application; labels include cool colors and saturated colors]
Isabel Meirelles (2014) notes that selecting a color palette in order to visualize data is no easy task, and she recommends following Cynthia Brewer’s advice to use three different kinds of color schemes, based on the nature of the data:
1. Monochromatic sequential palettes or their analogue
These palettes are great for ordering numeric data that progresses from small to large. It is best to use brighter color gradients for low values and darker ones for higher values.
2. Diverging palettes
These are more suitable for ordering categorical data, and they are more effective when the categorical division is in the middle of the sequence. The change in brightness highlights a critical value in the data, such as the mean or median, or a zero. Colors become darker to represent differences in both directions, based on this meaningful value in the middle of the data.
TIP: Try to emphasize the most important information using arrows and text, circles, rectangles, or contrasting colors. This way, when you visualize your data, your analysis will be more understandable.
TIP: The qualitative color scheme is perfect for visualizing data because it affords a high degree of contrast and helps you draw attention to important points, especially if you use one predominant color and use the second as an accent in your design.
Finally, don’t forget to use palettes that are comprehensible to people who can’t see color. Color blindness is a disability or limited ability that makes it difficult to distinguish certain pairs of colors, such as blue and yellow, or red and green. One strategy for avoiding this problem is to adopt designs that use more than just hue to codify information; create schemes that slightly vary another channel, such as brightness or saturation.
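The sequential and diverging schemes described above can be sketched as functions from a data value to a color. This example uses Python's standard `colorsys` module; the specific lightness range, hues, and saturation are arbitrary choices made for illustration and are not Brewer's published palettes.

```python
import colorsys

def sequential_lightness(value, vmin, vmax, light=0.9, dark=0.3):
    """Sequential scheme: low values map to light colors, high values to dark."""
    t = (value - vmin) / (vmax - vmin)
    return light + t * (dark - light)

def diverging_color(value, vmin, mid, vmax, hue_low=0.6, hue_high=0.0):
    """Diverging scheme: near-white at mid, darker toward both extremes."""
    if value >= mid:
        t, hue = (value - mid) / (vmax - mid), hue_high   # assumed hue: red side
    else:
        t, hue = (mid - value) / (mid - vmin), hue_low    # assumed hue: blue side
    lightness = 0.95 - 0.55 * t
    return colorsys.hls_to_rgb(hue, lightness, 0.8)

center = diverging_color(0, -1, 0, 1)
extreme = diverging_color(1, -1, 0, 1)
```

Because both functions vary lightness and not only hue, the resulting ordering stays readable for viewers who cannot distinguish the two hues, which is the strategy the text recommends.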
Use icons and symbols to aid in understanding and limit unnecessary tagging
Symbols and icons are another avenue for visualizing information that goes beyond merely being decorative. They draw strength from their ability to exhibit a general context in an attractive, precise way. Icons illustrate concepts. Viewers can understand what the information is about by just glancing at the illustration.
That said, they certainly should not be complex illustrations. An icon with too many details could hinder viewers’ understanding. Keep it simple: icons’ meaning should be immediately clear, even when they’re very small.
[Figure: percentages by product category (Notebooks, Entertainment, Lifestyle products) for Singles, Couples, and Families; for example, Lifestyle products: 77%, 73%, 54%]
The typography in our reports: effective applications
Typography plays an important role in the design of reports and statements. Selecting the right font strengthens your message and captures the audience’s attention. Müller-Brockmann (1961), a graphic designer, defines typography as the proper visual element for composition. He notes that “the reader must be able to read the message from a text easily and comfortably. This depends largely on the size of the text, the length of the lines, and the spacing between the lines”. 8
Typography is an art form in and of itself, in which every font has its own characteristics, which should be strategically combined.
For people outside the world of graphic design, choosing a font and setting other typographical features can be tricky, but it doesn’t have to be. Let’s take a practical look at the steps you should take when determining your typography, and then consider the images and visual elements that best accompany your text. Considerations when setting your typography:
• Determine the goal of your report’s content.
• Select a font that strengthens that goal. Fonts come in two types: with serifs or without (sans) serifs. Serif fonts have an extra stroke that conveys a sense of tradition, security, history, integrity, authority, and other such concepts. Sans-serif fonts stand out because they have a more polished, sophisticated feel; they convey a sense of modernity, order, cleanliness, elegance, avant-garde, and style.
• Pay attention to legibility. Remember that screen type does not appear in the same way as print type. It is best to choose a more responsive (sans-serif) font for on-screen texts, and fonts with serifs for printed reports. That said, there’s an exception to every rule, and today there is a bounty of fonts that are perfectly suitable for both digital and print media.
• Watch your weight (light, regular, bold). When it comes to bolding your text, a value of two or three should be plenty. It is better to reserve the heaviest weight for headlines and then apply a stylistic hierarchy based on your content. Avoid fonts that only offer one weight or style, since their applications are limited.
• Don’t forget that some fonts use more memory than others. Fonts with serifs generally monopolize more of your computer’s brain power than sans-serif fonts. This is an important consideration in interactive reports, since a document that occupies more RAM will be less responsive.
Fonts have personalities that help us establish a more attractive visual tone for our audience. Familiarizing yourself with a few can go a long way. There are:
• Professional fonts
• Handwritten fonts
Prioritize patterns in your visualizations: Gestalt
The basic elements of the visualization process also involve preattentive attributes. Preattentive attributes are visual
features that facilitate the rapid visual perception of a graphic in a space. Designers use these characteristics to
better uncover relevant information in visuals, because these characteristics attract the eye.
Colin Ware, Director of the Data Visualization Research Lab at the University of New Hampshire, has highlighted
that preattentive attributes can be used as resources for drawing viewers’ immediate attention to certain
parts of visual representations (2004). According to Ware, preattentive processing happens very quickly, typically
in the first 10 milliseconds. This process is the mind’s attempt to rapidly extract basic visual characteristics from
the graphic (stage 1). These characteristics are then consciously processed, along with the perception of the object,
so that the mind can extract patterns (stage 2), ultimately enabling the information to move to the highest level of
perception (stage 3). This makes it possible to find answers to the initial visual question, utilizing the information
saved in our minds. Colin Ware, cited in Meirelles (2014), explains it as follows:
Preattentive attributes enhance object perception and cognition processes, leveraging our mind’s visual capacities.
Good data visualizations deliberately make use of these attributes because they boost the mind’s discovery and rec-
ognition of patterns such as lines, planes, colors, movements, and spatial positioning.9
9 Dondis, D.A. (2015). La sintaxis de la imagen: introducción al alfabeto visual. Editorial Gustavo Gili: Barcelona
Meirelles, I. (2014). La información en el diseño. Barcelona: Parramón.
The visual below lists preattentive attributes that represent
aspects of lines and planes when visualizing and analyzing
graphic representation: shape, color, and spatial position.
[Figure: preattentive attributes of shape: orientation, line length, line width, size]
Detecting patterns is fundamental to structuring and
organizing visual information. When we create visuals,
we often want to highlight certain patterns over others.
Preattentive attributes are the alphabet of visual lan-
guage; analytic patterns are the words that we write
by using them. When we see a good visualization, we
immediately detect the preattentive attributes and rec-
ognize analytic patterns in the visualization. The follow-
ing table summarizes a few basic analytic patterns:
Analytic patterns
We have seen how preattentive attributes and patterns make it possible to process and analyze visual information; they also enable us to improve pattern discovery and perceptive inferences and provide processes for solving visualization problems.
Gestalt’s principles are the principles that enable us to understand the requirements posed by certain problems so that we see everything as an integral, coherent whole. It involves proximity, similarity, shared destiny, “pragnanz” or pithiness, closure, simplicity, familiarity, and discernment between figure and ground.
According to Dondis (2015), Gestalt’s principles help describe the way we organize and merge elements in our minds. They quiet the noise of the graphics so that we relate, combine, and analyze them. These principles come into play whenever we analyze any sort of visualization. Only position and length can be used to accurately perceive quantitative data. The other attributes are useful for perceiving other sorts of data, such as categorical and relational data.
[Figure: Gestalt’s principles]
Session No: CO2-7
Session Topic: The Process of Visualization
The process of understanding data begins with a set of numbers and a question.
The following steps form a path to the answer:
Acquire
Parse
Filter
Mine
Represent
Refine
Interact
The Process of Visualization…
• To illustrate the seven steps listed in the earlier slides, and how they contribute
to effective information visualization, let’s look at how the process can be
applied to understanding a simple data set.
• In this case, we’ll take the zip code numbering system that the U.S. Postal
Service uses.
• The application is not particularly advanced, but it provides a skeleton for how
the process works.
Example To Illustrate The Process of Visualization (Acquire)
• The acquisition step involves obtaining the data.
• A copy of the zip code listing can be found on the U.S. Census Bureau web site, as it is frequently used for
geographic coding of statistical data.
Figure: Zip codes in the format provided by the U.S. Census Bureau
• Acquisition concerns how the user downloads your data as well as how you obtained the data in the first
place.
• As you design the application, you have to take into account the time required to download data into the
browser.
• And because data downloaded to the browser is probably part of an even larger data set stored on the
server, you may have to structure the data on the server to facilitate retrieval of common subsets.
Example To Illustrate The Process of Visualization (Parse)
• After you acquire the data, it needs to be parsed—changed into a format that tags each part of the data with
its intended use.
• Each line of the file must be broken along its individual parts; in this case, it must be delimited at each tab
character.
• Then, each piece of data needs to be converted to a useful format. Figure shows the layout of each line in
the census listing, which we have to understand to parse it and get out of it what we want.
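As a rough sketch of the parse step, the function below splits one tab-delimited line and converts each piece to a useful type. The column order (zip code, latitude, longitude, place name) and the sample record are assumptions made for illustration; the actual layout of the census listing may differ.

```python
def parse_line(line):
    """Split one tab-delimited record and tag each field with its intended use."""
    code, lat, lon, name = line.rstrip("\n").split("\t")
    return {
        "zip": code,         # kept as text: it is a code, not a quantity
        "lat": float(lat),   # converted so it can be compared and scaled
        "lon": float(lon),
        "name": name,
    }

# Illustrative line in the assumed four-column layout.
sample = "00210\t43.005895\t-71.013202\tPortsmouth, NH\n"
record = parse_line(sample)
print(record)
```

Keeping the zip code as a string preserves its leading zeros, which would be lost if it were converted to an integer.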
Example To Illustrate The Process of Visualization (Filter and Mine)
• This step involves math, statistics, and data mining.
• The data in this case receives only a simple treatment: the program must figure out the minimum and
maximum values for latitude and longitude by running through the data (as shown in Figure ) so that it can
be presented on a screen at a proper scale.
• Most of the time, this step will be far more complicated than a pair of simple math operations.
Figure: Mining the data: just compare values to find the minimum and maximum
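The mining described for this example amounts to finding the extremes of two columns. A minimal sketch, assuming the parsed data is available as a list of (latitude, longitude) pairs; the function name `bounds` and the sample points are hypothetical.

```python
def bounds(points):
    """Return ((min_lat, max_lat), (min_lon, max_lon)) for (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats), max(lats)), (min(lons), max(lons))

# Illustrative coordinates only.
points = [(10, -100), (40, -70), (25, -120)]
lat_range, lon_range = bounds(points)
```

These two ranges are exactly what a later step needs in order to scale the data to the screen.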
Example To Illustrate The Process of Visualization (Represent)
• This step determines the basic form that a set of data will take.
• Some data sets are shown as lists, others are structured like trees, and so forth.
• In this case, each zip code has a latitude and longitude, so the codes can be mapped as a two-dimensional
plot, with the minimum and maximum values for the latitude and longitude used for the start and end of the
scale in each dimension. This is illustrated in Figure
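Mapping each zip code onto a two-dimensional plot is a linear rescaling from the data range to the screen range. The sketch below shows one way to do it; the function names, the pixel dimensions, and the choice to place the maximum latitude at the top of the screen are assumptions for illustration.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from the range [lo, hi] to [out_lo, out_hi]."""
    return out_lo + (value - lo) / (hi - lo) * (out_hi - out_lo)

def to_screen(lat, lon, lat_range, lon_range, width, height):
    """Convert a (lat, lon) pair to (x, y) pixel coordinates."""
    x = scale(lon, lon_range[0], lon_range[1], 0, width)
    # Screen y grows downward, so the maximum latitude maps to the top (y = 0).
    y = scale(lat, lat_range[0], lat_range[1], height, 0)
    return x, y

top_left = to_screen(40, -120, (10, 40), (-120, -70), 500, 300)
bottom_right = to_screen(10, -70, (10, 40), (-120, -70), 500, 300)
```

Using the minimum and maximum of each coordinate as the ends of the scale guarantees that every point lands inside the plot area.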
Example To Illustrate The Process of Visualization (Interact)
Figure: The user can alter the display through choices (zip codes starting with 0)
1. https://www.oreilly.com/library/view/visualizing-data/9780596514556/ch01.html
Case Study
A Simple Case Study That performs all the Steps of Visualization
• As part of your case study, download the Superstoreus2015 data file from
superdatascience.com.
• Find the type of data present in each column.
• Select any four columns using one of the visualization tools (Python or
Tableau).
• Find the maximum value in the column named profit and replace it with either
the maximum or minimum value, using Python or Tableau.
• Create a bar chart, pie chart, and histogram for any two of the columns that you selected
earlier, using either of the above tools.
• Create interactions between the selected columns that show how they are related to each
other, using data visualization tools.
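A possible starting point for the first steps of the case study in Python is sketched below. It uses only the standard library and a tiny in-memory stand-in for the data file; the column names and values are assumptions, since the real file's layout is not shown here.

```python
import csv
import io

# Tiny stand-in for the downloaded file; column names and values are assumed.
sample = io.StringIO(
    "Category,Sales,Profit\n"
    "Furniture,261.96,41.91\n"
    "Office Supplies,14.62,6.87\n"
    "Technology,907.15,90.72\n"
)
rows = list(csv.DictReader(sample))

def column_type(values):
    """Guess 'number' if every value parses as a float, otherwise 'text'."""
    try:
        for v in values:
            float(v)
        return "number"
    except ValueError:
        return "text"

# Type of data present in each column.
types = {col: column_type([r[col] for r in rows]) for col in rows[0]}

# Maximum value in the (assumed) Profit column.
max_profit = max(float(r["Profit"]) for r in rows)

print(types)
print(max_profit)
```

With the real file, `sample` would be replaced by a file object opened with `open(path, newline="")`, and the column names adjusted to match its header row.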
Poll Question-02
What are the steps in the process of visualization?
Option A: Acquire
Option B: Parse
Option C: Mine
Option D: All of the above
Data Abstraction in Visualization
The Big Picture
• The four basic dataset types are tables, networks, fields, and
geometry; other possible collections of items include clusters, sets,
and lists.
• These datasets are made up of different combinations of the five data
types: items, attributes, links, positions, and grids.
• For any of these dataset types, the full dataset could be available
immediately in the form of a static file, or it might be dynamic data
processed gradually in the form of a stream.
• The type of an attribute can be categorical or ordered, with a further
split into ordinal and quantitative.
• The ordering direction of attributes can be sequential, diverging, or
cyclic.
The Big Picture…
Figure shows the abstract types of what can be visualized.
Why Do Data Semantics and Types Matter?
• Many aspects of vis design are driven by the kind of data that you have at your disposal.
• What kind of data are you given?
• What information can you figure out from the data, versus the meanings that you must be told explicitly?
• What high-level concepts will allow you to split datasets apart into general and useful pieces?
• Suppose that you see the following data:
14, 2.6, 30, 30, 15, 100001
What does this sequence of six numbers mean? You can’t possibly know yet, without more information about
how to interpret each number. Is it locations for two points far from each other in three-dimensional space, 14,
2.6, 30 and 30, 15, 100001? Is it two points closer to each other in two-dimensional space, 14, 2.6 and 30, 30,
with the fifth number meaning that there are 15 links between these two points, and the sixth number
assigning the weight of ‘100001’ to that link?
• Similarly, suppose that you see the following data:
Basil, 7, S, Pear
These numbers and words could have many possible meanings. Maybe a food shipment of produce has arrived
in satisfactory condition on the 7th day of the month, containing basil and pears. Maybe the Basil Point
neighborhood of the city has had 7 inches of snow cleared by the Pear Creek Limited snow removal service.
Maybe the lab rat named Basil has made seven attempts to find his way through the south section of the maze,
lured by scent of the reward food for this trial, a pear.
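The ambiguity of the six-number sequence can be made concrete by writing out the two interpretations suggested above as data structures. This is only an illustration; the variable names are hypothetical.

```python
raw = [14, 2.6, 30, 30, 15, 100001]

# Interpretation 1: two points in three-dimensional space.
points_3d = [tuple(raw[0:3]), tuple(raw[3:6])]

# Interpretation 2: two points in two-dimensional space,
# plus a count of links between them and a weight for that link.
points_2d = [tuple(raw[0:2]), tuple(raw[2:4])]
link = {"count": raw[4], "weight": raw[5]}

print(points_3d)
print(points_2d, link)
```

The raw values are identical in both cases; only the semantics supplied from outside the data decide which structure is the right one.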
Why Do Data Semantics and Types Matter?
• The type of the data is its structural or mathematical interpretation.
• At the data level, what kind of thing is it: an item, a link, an attribute?
• At the dataset level, how are these data types combined into a larger structure: a table, a tree, a field of
sampled values?
• At the attribute level, what kinds of mathematical operations are meaningful for it?
• For example, if a number represents a count of boxes of detergent, then its type is a quantity, and adding
two such numbers together makes sense.
• If the number represents a postal code, then its type is a code rather than a quantity—it is simply the name
for a category that happens to be a number rather than a textual name. Adding two of these numbers
together does not make sense.
• Sometimes types and semantics can be correctly inferred simply by observing the syntax of a data file or the
names of variables within it, but often they must be provided along with the dataset in order for it to be
interpreted correctly.
• Sometimes this kind of additional information is called metadata; the line between data and metadata is not
clear, especially given that the original data is often derived and transformed.
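One way to make the distinction between a quantity and a code explicit is to give each its own type and define only the operations that are meaningful. The classes below are a hypothetical sketch of that idea, not a standard library feature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A count or measurement: adding two of them makes sense."""
    value: float

    def __add__(self, other):
        return Quantity(self.value + other.value)

@dataclass(frozen=True)
class Code:
    """A category name that happens to look like a number."""
    label: str
    # No __add__ is defined on purpose: adding two postal codes is meaningless.

boxes = Quantity(12) + Quantity(30)   # two counts of detergent boxes
postal = Code("90210")                # stored as text, never summed
```

Attempting `Code("90210") + Code("10001")` raises a `TypeError`, which turns a semantic mistake into an immediate, visible failure.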
Dataset Types
Tables
• Many datasets come in the form of tables that are made up of rows and columns, a familiar form to anybody who
has used a spreadsheet.
• For a simple flat table, the terms used here are that each row represents an item of data, and each column is
an attribute of the dataset. Each cell in the table is fully specified by the combination of a row and a column
(an item and an attribute) and contains a value for that pair.
• A multidimensional table has a more complex structure for indexing into a cell, with multiple keys.
Networks
The dataset type of networks is well suited for specifying that there is some kind of relationship between two
or more items.
• An item in a network is often called a node.
• A link is a relation between two items.
• For example, in an articulated social network the nodes are people, and links mean friendship.
• In a gene interaction network, the nodes are genes, and links between them mean that these genes have
been observed to interact with each other.
• In a computer network, the nodes are computers, and the links represent the ability to send messages
directly between two computers using physical cables or a wireless connection.
Trees
• Networks with hierarchical structure are more specifically called trees.
• In contrast to a general network, trees do not have cycles: each child node has only one parent node
pointing to it.
• One example of a tree is the organization chart of a company, showing who reports to whom.
• Another example is a tree showing the evolutionary relationships between species in the biological tree of
life, where the child nodes of humans and monkeys both share the same parent node of primates.
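The difference between a general network and a tree can be sketched in a few lines of Python. The function name `is_tree` and the representation of links as (parent, child) pairs are assumptions made for this illustration, not part of the source material.

```python
def is_tree(nodes, links):
    """Return True if directed (parent, child) links form a single rooted tree."""
    parent_of = {}
    for parent, child in links:
        if child in parent_of:  # a second parent means this is not a tree
            return False
        parent_of[child] = parent
    roots = [node for node in nodes if node not in parent_of]
    if len(roots) != 1:  # a tree has exactly one root
        return False
    # Every node must reach the root without revisiting a node (no cycles).
    for node in nodes:
        seen = set()
        while node in parent_of:
            if node in seen:
                return False
            seen.add(node)
            node = parent_of[node]
    return True


species = ["primates", "humans", "monkeys"]
tree_links = [("primates", "humans"), ("primates", "monkeys")]

people = ["ann", "bob", "cy"]
cyclic_links = [("ann", "bob"), ("bob", "cy"), ("cy", "ann")]

print(is_tree(species, tree_links))   # the tree-of-life fragment
print(is_tree(people, cyclic_links))  # a network with a cycle
```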
Fields
• The field dataset type also contains attribute values associated with cells.
• Each cell in a field contains measurements or calculations from a continuous domain: there are conceptually
infinitely many values that you might measure, so you could always take a new measurement between any
two existing ones.
• Continuous phenomena that might be measured in the physical world or simulated in software include
temperature, pressure, speed, force, and density; mathematical functions can also be continuous.
For example, consider a field dataset representing a medical scan of a human body containing measurements
indicating the density of tissue at many sample points, spread regularly throughout a volume of 3D space. A
low-resolution scan would have 262,144 cells, providing information about a cubical volume of space with 64
bins in each direction. Each cell is associated with a specific region in 3D space. The density measurements
could be taken closer together with a higher resolution grid of cells, or further apart for a coarser grid.
• Continuous data requires careful treatment that takes into account the mathematical questions of sampling,
how frequently to take the measurements, and interpolation, how to show values in between the sampled
points in a way that does not mislead.
• Interpolating appropriately between the measurements allows you to reconstruct a new view of the data
from an arbitrary viewpoint that’s faithful to what you measured.
• These general mathematical problems are studied in areas such as signal processing and statistics.
Visualizing fields requires grappling extensively with these concerns.
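A small sketch of the sampling and interpolation ideas above, assuming a regularly sampled one-dimensional field and plain linear interpolation (one of many possible reconstruction schemes). The function name, sample values, and spacing are hypothetical.

```python
# Cell count for the low-resolution scan described above: 64 bins in each of 3 directions.
cells = 64 ** 3
print(cells)  # 262144


def lerp_sample(samples, spacing, x):
    """Linearly interpolate a regularly sampled 1D field at position x."""
    i = int(x // spacing)
    i = max(0, min(i, len(samples) - 2))  # clamp to the last valid interval
    t = (x - i * spacing) / spacing       # fractional position inside the interval
    return samples[i] * (1 - t) + samples[i + 1] * t


density = [10.0, 20.0, 40.0]  # hypothetical measurements taken every 2.0 units
print(lerp_sample(density, 2.0, 1.0))
print(lerp_sample(density, 2.0, 3.0))
```

Linear interpolation is only a simple choice; it assumes the phenomenon changes smoothly between samples, which is exactly the kind of assumption that can mislead if the sampling is too coarse.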
Spatial Fields
• Continuous data is often found in the form of a spatial field, where the cell structure of the field is based on
sampling at spatial positions.
• Most datasets that contain inherently spatial data occur in the context of tasks that require understanding aspects
of its spatial structure, especially shape.
For example, with a spatial field dataset that is generated with a medical imaging instrument, the user’s task could be
to locate suspected tumors that can be recognized through distinctive shapes or densities. An obvious choice for
visual encoding would be to show something that spatially looks like an X-ray image of the human body and to use
color coding to highlight suspected tumors.
Grid Types
• When a field contains data created by sampling at completely regular intervals, the cells form a uniform grid.
• There is no need to explicitly store the grid geometry in terms of its location in space, or the grid topology in
terms of how each cell connects with its neighboring cells.
• More complicated examples require storing different amounts of geometric and topological information about the
underlying grid.
• A rectilinear grid supports nonuniform sampling, allowing efficient storage of information that has high complexity
in some areas and low complexity in others, at the cost of storing some information about the geometric location
of each row.
• A structured grid allows curvilinear shapes, where the geometric location of each cell needs to be specified.
• Finally, unstructured grids provide complete flexibility, but the topological information about how the cells connect
to each other must be stored explicitly in addition to their spatial positions.
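The storage trade-off between grid types can be sketched as follows. In this illustration (function names and values are assumptions), a uniform grid derives a cell position from an origin and a spacing, while a rectilinear grid looks the position up in explicitly stored per-axis coordinates.

```python
def uniform_position(origin, spacing, index):
    """Uniform grid: geometry is implicit, computed from origin and spacing."""
    return tuple(o + s * i for o, s, i in zip(origin, spacing, index))


def rectilinear_position(axis_coords, index):
    """Rectilinear grid: one stored coordinate list per axis allows nonuniform spacing."""
    return tuple(coords[i] for coords, i in zip(axis_coords, index))


print(uniform_position((0.0, 0.0), (0.5, 2.0), (3, 1)))

# Hypothetical axes: dense sampling near 0 on x, regular sampling on y.
axes = ([0.0, 0.1, 0.5, 2.0], [0.0, 1.0])
print(rectilinear_position(axes, (2, 1)))
```

Structured and unstructured grids would need still more stored data: a position per cell, and for unstructured grids an explicit list of cell connections.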
Geometry
• The geometry dataset type specifies information about the shape of items with explicit spatial positions.
• The items could be points, or one-dimensional lines or curves, or 2D surfaces or regions, or 3D volumes.
• Geometry datasets are intrinsically spatial, and like spatial fields they typically occur in the context of tasks
that require shape understanding.
• Spatial data often includes hierarchical structure at multiple scales. Sometimes this structure is provided
intrinsically with the dataset, or a hierarchy may be derived from the original data.
• Geometry datasets do not necessarily have attributes, in contrast to the other three basic dataset types.
• One classic example is when contours are derived from a spatial field.
• Another is when shapes are generated at an appropriate level of detail for the task at hand from raw
geographic data, such as the boundaries of a forest or a city or a country, or the curve of a road.
• Geometric data is sometimes shown alone, particularly when shape understanding is the primary task.
• In other cases, it is the backdrop against which additional information is overlaid.
Dataset Availability
• The default approach to vis assumes that the entire dataset is available all at once, as a static file.
• However, some datasets are instead dynamic streams, where the dataset information trickles in over the
course of the vis session.
• One kind of dynamic change is to add new items or delete previous items. Another is to change the values
of existing items.
• This distinction in availability crosscuts the basic dataset types: any of them can be static or dynamic.
• Designing for streaming data adds complexity to many aspects of the vis process that are straightforward
when there is complete dataset availability up front.
Attribute Types
• Figure shows the attribute types.
• The major distinction is between categorical and ordered.
• Within the ordered type is a further differentiation between ordinal versus quantitative.
• Ordered data might range sequentially from a minimum to a maximum value, or it might diverge in both
directions from a zero point in the middle of a range, or the values may wrap around in a cycle.
• Also, attributes may have hierarchical structure.
Semantics
• Knowing the type of an attribute does not tell us about its semantics, because these two questions are
crosscutting: one does not dictate the other.
• Different approaches to considering the semantics of attributes have been proposed across the many fields
where these semantics are studied.
• The classification in this book is heavily focused on the semantics of keys versus values, and the related questions
of spatial and continuous data versus nonspatial and discrete data, to match up with the idiom design choice
analysis framework.
• One additional consideration is whether an attribute is temporal.
Key versus Value Semantics
• A key attribute acts as an index that is used to look up value attributes.
• The distinction between key and value attributes is important for the dataset types of tables and fields, as
shown in the figure.
• Flat Tables, Multidimensional Tables, Fields, Scalar Fields, Vector Fields, Tensor Fields, Field Semantics
Temporal Semantics
• A temporal attribute is simply any kind of information that relates to time.
• Data about time is complicated to handle because of the rich hierarchical structure that we use to reason
about time, and the potential for periodic structure.
• The time hierarchy is deeply multiscale: the scale of interest could range anywhere from nanoseconds to
hours to decades to millennia. Even the common words time and date are a way to partially specify the scale
of temporal interest.
• Temporal analysis tasks often involve finding or verifying periodicity either at a predetermined scale or at
some scale not known in advance.
• Moreover, the temporal scales of interest do not all fit into a strict hierarchy; for instance, weeks do not fit
cleanly into months.
• Thus, the generic vis problems of transformation and aggregation are often particularly complex when
dealing with temporal data.
• One important idea is that even though the dataset semantics involves change over time, there are many
approaches to visually encoding that data—and only one of them is to show it changing over time in the
form of an animation.
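The point that weeks do not fit cleanly into months can be checked with Python's standard `datetime` module. The sketch below aggregates one hypothetical reading per day over a single Monday-to-Sunday week and shows that the same seven days land in one ISO week but two calendar months.

```python
from collections import defaultdict
from datetime import date, timedelta

start = date(2024, 1, 29)  # a Monday
daily = {start + timedelta(days=i): 1 for i in range(7)}  # one reading per day

by_week = defaultdict(int)
by_month = defaultdict(int)
for day, value in daily.items():
    iso_year, iso_week, _ = day.isocalendar()
    by_week[(iso_year, iso_week)] += value
    by_month[(day.year, day.month)] += value

print(dict(by_week))   # a single week bucket
print(dict(by_month))  # split across January and February
```

Any aggregation that rolls weekly totals up into monthly totals therefore has to decide how to split such a week, which is one reason temporal aggregation is more complex than it first appears.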
Time-Varying Data
• A dataset has time-varying semantics when time is one of the key attributes, as opposed to when the
temporal attribute is a value rather than a key.
• As with other decisions about semantics, the question of whether time has key or value semantics requires
external knowledge about the nature of the dataset and cannot be made purely from type information.
• An example of a dataset with time-varying semantics is one created with a sensor network that tracks the
location of each animal within a herd by taking new measurements every second.
• Each animal will have new location data at every time point, so the temporal attribute is an independent
key and is likely to be a central aspect of understanding the dataset.
• In contrast, a horse-racing dataset covering a year’s worth of races could have temporal value attributes such
as the race start time and the duration of each horse’s run.
• These attributes do indeed deal with temporal information, but the dataset is not time-varying.
• A common case of temporal data occurs in a time-series dataset, namely, an ordered sequence of time–
value pairs.
• These datasets are a special case of tables, where time is the key. These time-value pairs are often but not
always spaced at uniform temporal intervals.
• Typical time-series analysis tasks involve finding trends, correlations, and variations at multiple time scales
such as hourly, daily, weekly, and seasonal.
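A time series can be sketched as a list of (time, value) pairs with time as the key. The helper below, whose name and sample readings are hypothetical, checks whether the pairs are spaced at uniform intervals, assuming the times are numeric and already sorted.

```python
def is_uniformly_spaced(series):
    """Return True if consecutive time keys are separated by one constant gap."""
    times = [t for t, _ in series]
    gaps = {later - earlier for earlier, later in zip(times, times[1:])}
    return len(gaps) <= 1


# Hypothetical temperature readings keyed by seconds since the start of recording.
regular = [(0, 20.5), (60, 20.7), (120, 21.0)]
irregular = [(0, 20.5), (60, 20.7), (150, 21.0)]

print(is_uniformly_spaced(regular))
print(is_uniformly_spaced(irregular))
```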
Reference
https://learning.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C002.xhtml
Filtering and Aggregation
Reduce Item And Attributes
Ware Chapter 10
• External Cognition
Introduction and Overview
• Visualization as an “internal interface”
– Interface between human and computer in a man-machine problem-solving system
• Computer-based information system supports data gathering, calculation, and analysis
• Augments investigator’s working memory
– Provides visual markers for concepts
– Reveals structural relationships between problem components
VxInsight
• Interaction paradigm (Shneiderman):
– Overview
– Zoom
– Filter
– Details on demand
– Browse
– Search query
• Or (Ware) …
– Lowest level
• Data manipulation loop
– Intermediate
• Exploration and navigation loop
– Highest
• Problem-solving loop
VxInsight - Overview
VxInsight - Zoom
VxInsight - Details
VxInsight - Query
Recall, Visualization Pipeline:
Or, another take on interaction: Mapping Data to Visual Form
[Figure: visualization pipeline. Raw Information → (Data Transformations) → Dataset → (Visual Mappings) → Visual Form → (View Transformations) → Views → (Visual Perception) → User and Task. The labels F and F⁻¹ mark the forward mapping and its perceptual inverse; Interaction feeds back from the user into the transformation stages.]
• Interactive visualization
– Process made up of interlocking feedback loops
[Figure: problem-solving cycle around a central Task, with the stages Forage for data, Search for schema, Instantiate schema, Problem-solve, and Write, decide, or act. Operations shown along the cycle: Overview, Zoom, Filter, Details, Browse, Search query; Reorder, Cluster, Class, Average, Promote, Detect pattern, Abstract; Instantiate; Read fact, Read comparison, Read pattern, Manipulate, Create, Delete; Extract, Compose.]
Again, Ware’s Interlocking Feedback Loops
• Interactive visualization
– Process made up of interlocking feedback loops
[Figure: three nested loops: Problem Solving (outer), Exploration and Navigation (middle), Data Manipulation (inner).]
Lowest Level: Data Manipulation Loop
• Visual-Manual Control Loop
• Detail next
Model Human Processor + Attention
• Sensory store
– Rapid decay “buffer” to hold sensory input for later processing
• Perceptual processor
– Recognizes symbols, phonemes
– Aided by long-term memory (LTM)
• Cognitive processor
– Uses recognized symbols
– Makes comparisons and decisions
– Problem solving
– Interacts with LTM and working memory (WM)
• Motor processor
– Input from cognitive processor for action
– Instructs muscles
– Feedback
• Results of muscle actions are perceived by the senses
• Attention
– Allocation of resources
Model Human Processor
Recall
• An architecture with parameters for cognitive engineering …
– Will see visual image store, etc. tonight
• Memory properties
– Decay time: how long memory lasts
– Size: number of things stored
– Encoding: type of things stored
Model Human Processor
Motor Processor
• Motor processor
– tM = 70 ms (range 30–100 ms)
– For repetitive tasks without feedback
• 1. Open-loop control
– Motor processor runs a program by itself – no feedback about correctness
• 2. Closed-loop control
– Experiment: Looking at lines, draw within the lines
– Muscle movements (or their effect on the world) are perceived by cognitive
system and compared with desired result
– Demo:
• Best – won’t run on class box
• http://www.tele-actor.net/fitts/index.html
– Demo:
• OK – no line plotted
• http://fww.few.vu.nl/hci/interactive/fitts/
Fitts’s Law - demo
• Fitts’s Law
– Fundamental law of human sensory-motor system
• Pie menu items are typically selected faster than linear menu items
– Small distance from the center of the menu
– Wedge-shaped target areas are large
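The slide does not state the formula, but Fitts's Law is commonly written in the Shannon formulation MT = a + b · log2(D/W + 1), where D is the distance to the target and W its width. The sketch below uses that form; the constants `a` and `b` and the menu dimensions are made-up values for illustration, since real values must be fitted to measurements.

```python
import math


def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Predicted movement time in seconds (Shannon formulation).

    a and b are hypothetical constants; in practice they are fitted per device and user.
    """
    index_of_difficulty = math.log2(distance / width + 1)
    return a + b * index_of_difficulty


# Pie menu item: short distance from the center, large wedge-shaped target.
pie_item = fitts_movement_time(distance=40, width=60)
# Linear menu item: farther down the list, narrow target.
linear_item = fitts_movement_time(distance=200, width=20)

print(pie_item < linear_item)
```

With these assumed numbers the pie menu item has the lower predicted time, which matches the two reasons given above: small distance and large target.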
Power Law of Practice
• Time to do a task decreases with practice
– Obviously
– Involves all of perceptual-cognitive-motor system
• Example:
– Novices get rapidly better at task with practice, but performance “levels off”
– Though still increasing performance
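The power law of practice is usually written as T_n = T_1 · n^(−α), where T_n is the time on the n-th trial. The sketch below assumes that form with a hypothetical exponent of 0.4 and shows why performance appears to level off: early trials bring large gains, later trials only small ones.

```python
def practice_time(first_trial_time, trial, alpha=0.4):
    """Predicted time on a given trial; alpha is a hypothetical learning exponent."""
    return first_trial_time * trial ** (-alpha)


early_gain = practice_time(10.0, 1) - practice_time(10.0, 2)
late_gain = practice_time(10.0, 100) - practice_time(10.0, 101)

print(early_gain > late_gain > 0)
```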
Intermediate Level:
Exploration, View Refinement and Navigation
• View navigation important when data space is too large to fit on screen
– Complex problem
– Considers theories of pathfinding and map use, cognitive spatial metaphors, direct manipulation, visual feedback
• Basic navigation control loop (below)
– Left is human – cognitive and spatial model with which user understands data space and progress through it
• A data space maintained for some time may become encoded in long-term memory
– Right is system – visualization may be updated and refined from data mapped into spatial model
• Includes:
– 3D Locomotion and viewpoint control
– Pathfinding
– Focus + context
3D Locomotion and Viewpoint Control
Navigation in 3D
• Examples
– Web browser: Harmony
– Clustering of text, Wise et al.
3D Locomotion and Viewpoint Control:
Spatial Metaphors
• Evaluation
– Exploration and Explanation
– Cognitive and Physical Affordance
– Task 1: Find areas of detail in the scene
– Task 2: Make the best movie
– 3D environments: Hallway, extended terrain, closed object.
• World-in-hand
– Good for discrete objects
– Poor affordances for looking scale changes – detail
– Problem with center of rotation in extended scenes
• Eye-in-hand
– Easiest under some circumstances
– Poor physical affordances for many views
– Subjects sometimes acted as if model were actually present
• Walking
• Worldlets
– Can be rotated to facilitate recognition
Frames of Reference
Egocentric, Exocentric
• Egocentric
– View from user
• Exocentric
– View from outside the user
– Road map just one of many exocentric views
• Options
– Provide display of both
– Provide easy, non-jarring switch between them
• Multiple-Window
Zoom with Callouts …
Focus+Context: Fisheye Views, 1
• Detail + Overview
– Keep focus, while remaining aware of context
• Fisheye views
– Physical, of course, also ..
– A distance function (based on relevance)
– Given a target item (focus)
– Less relevant other items are dropped from the display
– Classic cover
• New Yorker’s idea of the world
Focus+Context: Fisheye Views, 2
• Detail + Overview
– Keep focus while remaining aware of context
• Fisheye views
– Physical, of course, also ..
– A distance function (based on relevance)
– Given a target item (focus)
– Less relevant other items are dropped from the display
– Or, are just physically smaller – distortion
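One well-known way to make the "distance function based on relevance" concrete is Furnas's degree of interest: an item's a priori importance minus its distance from the focus. The sketch below assumes that definition; the outline items, importance scores, distance measure (position in a list), and threshold are all hypothetical.

```python
def degree_of_interest(importance, distance):
    """Degree of interest: a priori importance minus distance from the focus."""
    return importance - distance


def fisheye_filter(items, focus_index, threshold):
    """Keep only items whose degree of interest reaches the threshold.

    items is a list of (label, a_priori_importance) pairs; distance is measured
    as the number of positions between an item and the focus item.
    """
    visible = []
    for index, (label, importance) in enumerate(items):
        if degree_of_interest(importance, abs(index - focus_index)) >= threshold:
            visible.append(label)
    return visible


outline = [
    ("Chapter 1", 5), ("Section 1.1", 2), ("Section 1.2", 2),
    ("Chapter 2", 5), ("Section 2.1", 2), ("Section 2.2", 2),
]
print(fisheye_filter(outline, focus_index=4, threshold=1))
```

With the focus on Section 2.1, nearby sections and all chapter headings stay visible while distant sections are dropped. A distortion-based fisheye would instead shrink the low-interest items rather than remove them.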
Distortion Techniques, Generally
• Distort space = Transform space
– By various transformations
• Demo
Other Navigation Techniques:
GeoZui3D, Zooming + 2 dof rotations
• Then scale
• Transparency:
– When there is the perception of direct contact with the data, the interface becomes transparent
– Big idea in interfaces
– Temporal feedback rapid (< 1/10 second)
– Response is compatible with interaction method
Choose Appropriate Visual Encodings
Natural ordering
• Natural ordering and number of distinct values will indicate whether a visual property is best
suited to one of the main data types: quantitative, ordinal, categorical, or relational data.
• Spatial data is another common data type, and is usually best represented with some kind of
map.
• Whether a visual property has a natural ordering is determined by whether the mechanics of our
visual system and the “software” in our brains automatically—unintentionally—assign an order,
or ranking, to different values of that property.
• For example, position has a natural ordering; shape doesn’t. Length has a natural ordering;
texture doesn’t (but pattern density does). Line thickness or weight has a natural ordering; line
style (solid, dotted, dashed) doesn’t.
• Depending on the specifics of the visual property, its natural ordering may be well suited to
representing quantitative differences (27, 33, 41), or ordinal differences (small, medium, large,
enormous).
Color is not ordered
• Here’s a tricky one: Color (hue) is not naturally ordered in our brains. Brightness (lightness or
luminance, sometimes called tint) and intensity (saturation) are, but color itself is not.
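The examples above can be collected into a small lookup table. The sketch below records which visual properties the text describes as naturally ordered and suggests candidates for a data type. The rule that ordered data gets ordered properties and categorical data gets unordered ones is a simplifying assumption for illustration; the number of distinct values also matters.

```python
# True means the text above describes the property as naturally ordered.
NATURALLY_ORDERED = {
    "position": True,
    "length": True,
    "line weight": True,
    "pattern density": True,
    "brightness": True,
    "saturation": True,
    "shape": False,
    "texture": False,
    "line style": False,
    "hue": False,
}


def candidate_properties(data_type):
    """Suggest visual properties for a data type (simplified, assumed rule)."""
    if data_type in ("quantitative", "ordinal"):
        return sorted(p for p, ordered in NATURALLY_ORDERED.items() if ordered)
    if data_type == "categorical":
        return sorted(p for p, ordered in NATURALLY_ORDERED.items() if not ordered)
    raise ValueError(f"unsupported data type: {data_type}")


print(candidate_properties("categorical"))
```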
DISTINCT VALUES
• The second main factor to consider when choosing a visual property is how many distinct values it
has that your reader will be able to perceive, differentiate, and possibly remember.
REDUNDANT ENCODING
• If you have the luxury of leftover, unused visual properties after you’ve encoded the main
dimensions of your data, consider using them to redundantly encode some existing, already-encoded
data dimensions.
• The advantage of redundant encoding is that using more channels to get the same information
into your brain can make acquisition of that information faster, easier, and more accurate.
DEFAULTS VERSUS INNOVATIVE FORMATS AND READERS’ CONTEXT
DEFAULTS VERSUS INNOVATIVE FORMATS
• The choice comes down to a basic cost-benefit analysis. What is the expense to you and your
reader of creating and understanding a new encoding format, versus the value delivered by that
format?
• If you’ve got a truly superior solution (as evaluated by your reader, and not just your ego), then
by all means, use it.
• But if your job can be done (or done well enough) with a default format, save everyone the effort
and use a standard solution.
READERS’ CONTEXT
• First, it’s important to point out that your audience will likely be composed of more than one
reader. And as these people are all individuals, they may be as different from each other as they
are from you, and will likely have very different backgrounds and levels of interest in your work.
• It may be impossible to take the preconceptions of all these readers into consideration at once.
• So choose the most important group, think of them as your core group, and design with them in
mind. Where it is possible to appeal to more of your potential audience without sacrificing
precision or efficiency, do so.
• Let’s get specific about some facets of the reader’s mindset that you need to take into account.
Titles, tags, and labels
• When selecting the actual terms you’ll use to label axes, tag visual elements, or title the piece (which
creates the mental framework within which to view it), consider your reader’s vocabulary and
familiarity with relevant jargon.
1. Is the reader from within your industry or outside of it? What about other readers
outside of the core audience group?
2. Is it worth using an industry term for the sake of precision (knowing that the reader may
have to look it up), or would a lay term work just as well?
3. Will the reader be able to decipher any unknown terms from context, or will a vocabulary
gap get in the way of understanding?
• These are the kinds of questions you should ask yourself. Each and every single word in your
visualization needs to serve a specific purpose.
Colors
• Another reader context to take into account is color choice. There is quite a bit of science about how
our brains perceive and process color that is somewhat universal, as we saw earlier in this chapter. But
it’s worth mentioning in the context of reader preconceptions the significant cultural associations that
color can carry.
Color blindness
• Of course, we know that there are many variations in the way different people perceive color. This
is commonly called color blindness but is more properly referred to as color vision deficiency or
dyschromatopsia.
• A disorder of color vision may present in one of several specific ways.
• Although prevalence estimates vary among experts and for different ethnic and national groups,
about 7% of American men experience some kind of color perception disorder (women are much
more rarely affected: about 0.4 percent in America).
• Red-green deficiency is the most common by far, but yellow-blue deficiency also occurs. And
there are lots of people who have trouble distinguishing between close colors like blue and
purple.
Directional orientation
• Is the reader from a culture that reads left-to-right, right-to-left, or top-to-bottom? A person’s
habitual reading patterns will determine their default eye movements over a page, and the order
in which they will encounter the various visual elements in your design.
COMPATIBILITY WITH REALITY
• A large factor in your success is making life easier for your reader, and that’s largely based on
making encodings as easy to decode as possible.
• One way to make decoding easy is to make your encodings of things and relationships as well
aligned with the reality (or your reader’s reality) of those things and relationships as possible; this
alignment is called compatibility.
PATTERNS AND CONSISTENCY
• The human brain is amazingly good at identifying patterns in the world. We easily recognize
similarity in shapes, position, sound, color, rhythm, language, behavior, and physical routine, just
to name a few variables.
• This ability to recognize patterns is extremely powerful, as it enables us to identify stimuli that
we’ve encountered before, and predict behavior based on what happened the last time we
encountered a similar stimulus pattern.
• Consequently, we also notice violations of patterns. When a picture is crooked, a friend sounds
troubled, a car is parked too far out into the street, or the mayonnaise smells wrong, the patterns
we expect are being violated and we can’t help but notice these exceptions.
• We notice them because they are exceptions to the norm. This has two implications. The first is that
readers will notice patterns and assume they are intentional, whether you planned for the patterns to
exist or not. The second is that when they perceive patterns, readers will also expect pattern
violations to be meaningful.
• It all comes down to three simple rules.
1. Be consistent in membership, ordering, and other encodings.
2. Things that are the same should look the same.
3. Things that are different should look different.
Other Factors
• COMPARISONS NEED TO COMPARE
1 Introduction: What is data visualization?; The data visualization process
2 Data types, relationships, and visualization formats: Two kinds of data
3 Basic principles for data visualization: Graphics with an objective: seeking your mantra
4 Storytelling for social and market communication: Data storytelling
5 Trends in market research and data visualization: dashboards
1. Introduction
What is data visualization?
Data visualization is the process of acquiring, interpreting and comparing data in order to clearly communicate complex ideas, thereby facilitating the identification and analysis of meaningful patterns.
Data visualization can be essential to strategic communication: it helps us interpret available data; detect patterns, trends, and anomalies; make decisions; and analyze inherent processes. All told, it can have a powerful impact on the business world.
The ways we structure and visualize information are changing rapidly and getting more complex with each passing day. Thanks to the rise of social media, the ubiquity of mobile devices, and service digitalization, data is available on any human activity that utilizes technology. The generated information is hugely valuable and makes it possible to analyze trends and patterns, and to use big data to draw connections between events. Thus, data visualization can be an effective mechanism for presenting the end user with understandable information in real time.
Every company has data, be it to communicate with clients and senior managers or to help manage the organization itself. It is only through research and interpretation that this data can acquire meaning and be transformed into knowledge.
This ebook seeks to guide readers through a series of basic references in order to help them understand data visualization and its component parts, and to equip them with the tools and platforms they need to create interactive visuals and analyze data. In effect, it seeks to provide readers with a basic vocabulary and a crash course in the principles of design that govern data visualization so that they can create and analyze interactive market research reports.
The data visualization process
[Figure: Visualization process¹]
Why is data visualization so important in reports
All of this indicates that human beings are better at processing visual information, which is lodged in our long-term memory.
Identifying the evolution of sales over the course of the year isn’t easy. However, when we present the same information in a visual, the results are much clearer (see the graph below).
Data visualization chiefly helps in 3 key aspects of reports and statements:
1) Explaining
Visuals aim to lead the viewer down a path in order to describe situations, answer questions, support decisions, communicate information, or solve specific problems. When you attempt to explain something through data visualization, you start with a question, which interacts with the data set in such a way that enables viewers to make a decision and, subsequently, answer the question.
For example: the graphic below could clearly explain which country has the greatest demand for a certain product, compared globally, in a specific month.
[Figure: bar chart of demand (scale 0 to 500) for United States, Russia, South Africa, Europe, Canada, Australia, and Japan]
2) Exploring
Some visuals are designed to lend a data set spatial dimensions, or to offer numerous subsets of data in order to raise questions, find answers, and discover opportunities. When the goal of a visual is to explore, the viewers start by familiarizing themselves with the dataset, then identifying an area of interest, asking questions, exploring, and finding several solutions or answers.
For example: an interactive graphic from The Guardian² invites us to explore how the linguistic standard of U.S. presidential addresses has declined over time. The visual is interactive and explanatory, in addition to indicating the readability score of various presidents’ speeches.
3) Analyzing
Other visuals prompt viewers to inspect, distill, and transform the most significant information in a data set so that they can discover something new or predict upcoming situations.
For example: this interactive graphic about machine learning³ invites us to explore and discover information within the visual by scrolling through it. Using the machine learning method, the visual explains the patterns detected in the data in order to categorize characteristics.
We’ll close this introduction with a 2012 reflection by Alberto Cairo, a specialist in information visualization and a leader in the world of data visualization. For the author, a good visual must provide clarity, highlight trends, uncover patterns, and reveal unseen realities:
“We create visuals so that users can analyze data and, from it, discover realities that not even the designer, in some instances, had considered.”
2 Available at: https://www.fusioncharts.com/whitepapers/downloads/Principles-of-Data-Visualization.pdf
3 Available at: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
2. Data types, relationships, and visualization formats
2 kinds of data
Before we talk about visuals themselves, we must first understand the different kinds of data that can be visualized and how they relate to one another.
The most common kinds of data are⁴:
5 Source: Hubspot, Prezy, and Infogram (2018). Presenting Data People Can’t Ignore: How to Communicate Effectively Using Data. | p.10 of 16 | Available at: https://offers.hubspot.com/presenting-data-people-cant-ignore.
7 data relationships
Data relationships can be simple, like the progress of a single metric over time (such as visits to a blog over the course of 30 days or the number of users on a social network),
or they can be complex, precisely comparing relationships, revealing structure, and extracting patterns from data. There are seven data relationships to consider:
Ranking: A visualization that relates two or more values with respect to a relative magnitude. For example: a company’s most sold products.
Nominal comparisons: Visualizations that compare quantitative values from different subcategories. For example: product prices in various supermarkets.
Series over time: Here we can trace the changes in the values of a constant metric over the course of time. For example: monthly sales of a product over the course of two years.
Deviation: Examines how each data point relates to the others and, particularly, to what point its value differs from the average. For example: the line of deviation for tickets to an amusement park sold on a rainy versus a normal day.
Distribution: Visualization that shows the distribution of data spatially, often around a central value. For example: the heights of players on a basketball team.
Partial and total relationships: Show a subset of data as compared with a larger total. For example: the percentage of clients that buy specific products.
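Three of the relationships above (ranking, deviation, and partial and total) reduce to simple arithmetic on a set of values. The following is a minimal sketch in Python; the product names and sales figures are invented for illustration and do not come from the guide.

```python
from statistics import mean

# Hypothetical monthly unit sales per product (illustrative numbers only).
sales = {"A": 120, "B": 300, "C": 80, "D": 200}

# Ranking: order categories by relative magnitude, largest first.
ranking = sorted(sales, key=sales.get, reverse=True)

# Deviation: how far each value sits from the average.
average = mean(sales.values())
deviation = {name: value - average for name, value in sales.items()}

# Partial and total relationships: each value as a share of the whole.
total = sum(sales.values())
share = {name: value / total for name, value in sales.items()}

print(ranking)
print(deviation)
print(share)
```

Each result maps naturally onto a format: the ranking onto a sorted bar chart, the deviations onto bars around a zero line, and the shares onto a part-to-whole chart.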
11 formats
There are two types of visualizations: static and interactive. Their use depends on the search and analysis dimension level. Static visuals can only analyze data in one dimension, whereas interactive visuals can analyze it in several.

1. Bar chart
Bar charts are one of the most popular ways of visualizing data because they present a data set in a quickly understood format that enables viewers to identify highs and lows at a glance. They are very versatile, and they are typically used to compare discrete categories, to analyze changes over time, or to compare parts of a whole. The three variations on the bar chart are:
[Figure: three bar chart examples; labels include months (Jan to May), categories (Education, Entertainment, Health), a value axis from 0 to 5,500, and percentage axes from 0% to 100%]
2. Histograms
Histograms represent a variable in the form of bars,
where the surface of each bar is proportional to the
• Vertical columns
3. Pie charts
4. Scatter plots
5. Heat maps
6. Line charts
These are used to display changes or trends in data over a period of time. They are especially useful for showcasing relationships, acceleration, deceleration, and volatility in a data set.

7. Bubble charts
These graphics display three-dimensional data and accentuate data in dispersion diagrams and maps. Their purpose is to highlight nominal comparisons and classification relationships. The size and color of the bubbles represent a dimension that, along with the data, is very useful for visually stressing specific values. The two variations on the bubble chart are:
• The bubble plot: used to show a variable in three dimensions, position coordinates (x, y) and size.
• Bubble map: used to visualize three-dimensional values for geographic regions.

8. Radar charts
These are a form of representation built around a regular polygon that is contained within a circle, where the radii that guide the vertices are the axes over which the values are represented. They are equivalent to graphics with parallel coordinates on polar coordinates. Typically, they are used to represent the behavior of a metric over the course of a set time cycle, such as the hours of the day, months of the year, or days of the week.

[Figure: line chart and radar chart examples]
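The polar geometry behind a radar chart can be sketched in a few lines of Python. This is only an illustration under assumed conventions (axes spread evenly, first axis pointing up, values scaled to a unit circle); the function name and the weekday figures are hypothetical.

```python
import math

def radar_vertices(values, max_value):
    """Return (x, y) points for a radar chart drawn inside a unit circle.

    Each value gets its own axis; axes are spread evenly around the circle,
    starting at the top and moving clockwise.
    """
    n = len(values)
    points = []
    for i, value in enumerate(values):
        angle = math.pi / 2 - 2 * math.pi * i / n  # direction of axis i
        radius = value / max_value                 # distance along that axis
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

# Hypothetical visits per weekday, Monday to Sunday.
visits = [40, 55, 60, 58, 70, 90, 30]
points = radar_vertices(visits, max_value=100)
```

Joining the returned points in order, and closing the shape back to the first point, gives the polygon that a radar chart displays.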
9. Waterfall charts
These help us understand the cumulative effect of positive and negative values on variables in a sequential fashion.
[Figure: waterfall chart; horizontal axis from Start through A to L to End, vertical axis from 50K to 400K, legend: Fall, Rise]
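The cumulative effect that a waterfall chart shows is a running total. As a rough sketch in Python (the function name and the numbers are invented for illustration), each floating bar spans the total before and after one change:

```python
from itertools import accumulate

def waterfall_bars(start, changes):
    """Return (bottom, top, kind) for each floating bar, plus the end total."""
    levels = list(accumulate(changes, initial=start))  # running totals
    bars = []
    for before, after in zip(levels, levels[1:]):
        kind = "rise" if after >= before else "fall"
        bars.append((min(before, after), max(before, after), kind))
    return bars, levels[-1]

# Hypothetical starting value and sequence of positive and negative changes.
bars, end = waterfall_bars(200, [50, -30, 80, -120])
print(bars)
print(end)
```

The `initial` argument of `accumulate` requires Python 3.8 or later.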
10. Tree maps
Tree maps display hierarchical data (in a tree structure) as a set of nested rectangles that occupy surface areas proportional to the value of the variable they represent. Each tree branch is given a rectangle, which is later placed in a mosaic with smaller rectangles that represent secondary branches. The finished product is an intuitive, dynamic visual of a plane divided into areas that are proportional to hierarchical data, which has been sorted by size and given a color key.
[Figure: a tree diagram and its tree map; nodes A (200), B (80), C (120), D (30), E (50), F (20), G (40), H (60)]
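The idea that each rectangle's area is proportional to its value can be sketched with a simple slice-and-dice layout. This is one of several possible tree map layouts and not necessarily the one used for the guide's figure. The node values are the ones legible in the figure; the grouping of D and E under B, and of F, G, and H under C, is an assumption based on the sums matching.

```python
def slice_layout(node, x, y, width, height, horizontal=True, out=None):
    """Slice-and-dice layout: split a rectangle among children by value.

    `node` is (name, value, children). Returns {name: (x, y, width, height)}
    for the leaves. The split direction alternates at each level.
    """
    if out is None:
        out = {}
    name, value, children = node
    if not children:
        out[name] = (x, y, width, height)
        return out
    offset = 0.0
    for child in children:
        fraction = child[1] / value
        if horizontal:
            slice_layout(child, x + offset * width, y,
                         fraction * width, height, False, out)
        else:
            slice_layout(child, x, y + offset * height,
                         width, fraction * height, True, out)
        offset += fraction
    return out

# Assumed hierarchy (see the note above).
tree = ("A", 200, [
    ("B", 80, [("D", 30, []), ("E", 50, [])]),
    ("C", 120, [("F", 20, []), ("G", 40, []), ("H", 60, [])]),
])
rects = slice_layout(tree, 0, 0, 100, 100)
areas = {name: w * h for name, (x, y, w, h) in rects.items()}
```

On a 100 by 100 canvas, every leaf ends up with an area equal to its share of the root value, which is the property the paragraph above describes.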
11. Area charts
These represent the relationship of a series over time, but unlike line charts, they can represent volume. The three variations on the area chart are:
• Standard area: used to display or compare a progression over time.
• Stacked area: used to visualize relationships as part of the whole, thus demonstrating the contribution of each category to the cumulative total.
• 100% stacked area: used to communicate the distribution of categories as part of a whole, where the cumulative total does not matter.
[Figure: area chart examples, including a standard area chart; series A, B, and C]

Selecting the right graphic to effectively communicate through our visualizations is no easy task. Stephen Few (2009), a specialist in data visualization, proposes taking a practical approach to selecting and using an appropriate graphic:
• Choose a graphic that will capture the viewer’s attention for sure.
• Represent the information in a simple, clear, and precise way (avoid unnecessary flourishes).
• Make it easy to compare data; highlight trends and differences.
• Give the viewer a clear way to explore the graphic and understand its goals; make use of guide tags.
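The 100% stacked area variation described in this section comes down to normalizing each time step so that the categories sum to one. A minimal Python sketch follows; the function name and the series values are hypothetical.

```python
def to_percent_stack(series):
    """Convert parallel series into shares that sum to 1.0 at every time step."""
    names = list(series)
    steps = len(series[names[0]])
    shares = {name: [] for name in names}
    for t in range(steps):
        total = sum(series[name][t] for name in names)
        for name in names:
            # Guard against an all-zero step to avoid dividing by zero.
            shares[name].append(series[name][t] / total if total else 0.0)
    return shares

# Hypothetical values for three categories over three periods.
data = {"A": [10, 20, 30], "B": [30, 20, 10], "C": [10, 10, 10]}
shares = to_percent_stack(data)
```

Because the cumulative total is divided out, this form shows only how the mix changes, which is why the total "does not matter" in that variation.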
3. Basic principles for
data visualization
Graphics with an objective: seeking your mantra

The goal of data visualizations is to help us understand the object they represent. They are a medium for communicating stories and the results of research, as well as a platform for analyzing and exploring data. Therefore, having a sound understanding of how to create data visualizations will help us create meaningful and easy-to-remember reports, infographics, and dashboards. Creating suitable visuals helps us solve problems and analyze a study’s objects in greater detail.

The first step in representing information is trying to understand that data visualization.

Ben Shneiderman gave us a useful starting point in his text “The Visual Information-Seeking Mantra” (1996), which remains a touchstone work in the field. This author suggests a simple methodology for novice users to delve into the world of data visualization and experiment with basic visual representation tasks.5

Shneiderman introduces his famous mantra on how to approach the quest for visual information, which he breaks down into three tasks:

1. Overview first: This ensures viewers have a general understanding of the data set, as their starting point for exploration. This means offering them a visual snapshot of the different kinds of data, explaining their relationship in a single glance. This strategy helps us visualize the data, at all its different levels, at one time.

2. Zoom and filter: The second step involves supplementing the first so that viewers understand the data’s underlying structure. The zoom in/zoom out mechanism enables us to select interesting subsets of data that meet certain criteria while maintaining the sense of position and context.

3. Details on demand: This makes it possible to select a narrower subset of data, enabling the user to interact with the information and use filters by hovering or clicking on the data to pull up additional information.

The chart on the right side summarizes the key points to designing such a graphic, with an eye to human visual perception, so that users can translate an idea into a set of physical attributes.

These attributes are: structure, position, form size, and color. When properly applied, these attributes can present information effectively and memorably.

[Figure: four levels of detail labeled with the mantra (Overview first, Zoom and filter, Details on demand): 1. System Context: the system plus users and system dependencies. 2. Containers: the overall shape of the architecture and technology choices. 3. Components: logical components and their interactions within a container. 4. Classes: component or pattern implementation details.]

5 Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Visual Information Seeking Mantra (p. 336). Available at:
https://www.cs.umd.edu/~ben/papers/Shneiderman1996eyes.pdf
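Shneiderman's three tasks can be sketched against a small table of records. The Python below is only an illustration of the idea, not an implementation from his paper; the records, field names, and function names are all hypothetical.

```python
# Hypothetical survey records (illustrative only).
records = [
    {"id": 1, "country": "ES", "age": 34, "score": 7},
    {"id": 2, "country": "ES", "age": 51, "score": 4},
    {"id": 3, "country": "MX", "age": 28, "score": 9},
    {"id": 4, "country": "MX", "age": 45, "score": 6},
]

def overview(rows):
    """Overview first: one summary number (mean score) per country."""
    groups = {}
    for row in rows:
        groups.setdefault(row["country"], []).append(row["score"])
    return {country: sum(s) / len(s) for country, s in groups.items()}

def zoom_and_filter(rows, **criteria):
    """Zoom and filter: keep only the rows that match every criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

def details_on_demand(rows, record_id):
    """Details on demand: the full record behind a single mark, or None."""
    return next((r for r in rows if r["id"] == record_id), None)

print(overview(records))
print(zoom_and_filter(records, country="MX"))
print(details_on_demand(records, 3))
```

In an interactive dashboard, the first function would feed the initial chart, the second would respond to a filter control, and the third to a hover or click.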
Layout and design: communicative elements

In order to begin designing our reports and statements, it is essential to understand that visual representations are cognitive tools that complement and strengthen our mental ability to encode and decode information.6 Meirelles (2014) notes that: “All graphic representation affects our visual perception, because the elements of transmission utilized act as external stimuli, which activate our emotional state and knowledge.”

Thus, when our mind visualizes a representation, it transforms the information, merges it, and applies a hierarchical structure to it to facilitate interpretation.

For this reason, in order to have an efficient perceptive impact, it is important to adhere to a series of best practices when creating reports and infographics. As with any other form of communication, success depends largely on the business’s familiarity with the established code and the resources available. Space, shapes, color, icons, and typography are a few of the essential elements of a striking visual with communicative power.

Structuring: the importance of layout

All visual representations begin with a blank dimensional space that will eventually hold the information which will be communicated. The process of spatial coding is a fundamental part of visual representation because it is the medium in which the results of our compositional decisions and the meaning of our visual statement will be visualized, thereby having an impact on the user.

Edward Tufte (1990) defines “layout” as a scheme for distributing visual elements in order to achieve organization and harmony in the final composition. Layout planning and design serve as a template for applying hierarchy and control to information at varying levels of detail.7 In his book Envisioning Information, Tufte offers several guidelines for information design:

• Have a properly chosen format.
• Give a broad visual tour and offer a focused reading at different detail levels.
• Use words, numbers, and drawings.
• Reflect a balance, a proportion, a sense of relevant scale, and a context.

Spatial encoding requires processing spatial proportions (position and size), which have a determining role in the organization of perception and memory.

Furthermore, the visual hierarchy of elements plays a role in this encoding process, because the elements’ organization and distribution must have a well-defined hierarchical system in order to communicate effectively (Meirelles: 2014). In a sense, visualizations are paragraphs about data, and they should be treated as such. Words, images, and numbers are part of the information that will be visualized. When all of the elements are integrated in a single structure and visual hierarchy, the infographic or report will organize space properly and communicate effectively, according to your user’s needs.

6 Meirelles, I (2014). “La información en el diseño,” (p.21-22). Barcelona: Parramón.
7 Tufte, E. (1990). Envisioning Information. Cheshire: Graphics Press.
Visual variables
and their semantics
Using consistent and attractive
color schemes
Saturation: this refers to the intensity of a given color’s hue. It varies based on brightness. Darker colors are less saturated, and the less saturated a color is, the closer it gets to gray. In other words, it gets closer to a neutral (hueless) color. The following graphic offers a brief summary of color application.
[Figure: summary of color application; labels include Cool colors and Saturated colors]
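The claim that a less saturated color moves toward gray can be checked with Python's standard `colorsys` module. This sketch assumes the HSV color model, which is only one of several ways to define saturation; the `desaturate` helper and the sample color are hypothetical.

```python
import colorsys

def desaturate(rgb, amount):
    """Scale a color's saturation by (1 - amount), keeping hue and value.

    `rgb` holds floats in 0..1; amount=0 leaves the color unchanged and
    amount=1 removes all saturation, leaving a gray.
    """
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb(h, s * (1 - amount), v)

blue = (0.1, 0.4, 0.9)
half = desaturate(blue, 0.5)   # a washed-out blue
gray = desaturate(blue, 1.0)   # all three channels equal
print(half)
print(gray)
```

With the saturation removed entirely, the red, green, and blue channels become equal, which is what a neutral (hueless) color means in RGB terms.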
Isabel Meirelles (2014) notes that selecting a color palette in order to visualize data is no easy task, and she recommends following Cynthia Brewer’s advice, which uses three different kinds of color schemes, based on the nature of the data:

1. Monochromatic sequential palettes or their analogue

These palettes are great for ordering numeric data that progresses from small to large. It is best to use brighter color gradients for low values and darker ones for higher values.

2. Diverging palettes

These are more suitable for ordering categorical data, and they are more effective when the categorical division is in the middle of the sequence. The change in brightness highlights a critical value in the data, such as the mean or median, or a zero. Colors become darker to represent differences in both directions, based on this meaningful value in the middle of the data.

TIP: Try to emphasize the most important information using arrows and text, circles, rectangles, or contrasting colors. This way, when you visualize your data, your analysis will be more understandable.

TIP: The qualitative color scheme is perfect for visualizing data because it affords a high degree of contrast and helps you draw attention to important points, especially if you use one predominant color and use the second as an accent in your design.

Finally, don’t forget to use palettes that are comprehensible to people who can’t see color. Color blindness is a disability or limited ability that makes it difficult to distinguish certain pairs of colors, such as blue and yellow, or red and green. One strategy for avoiding this problem is to adapt designs that use more than just hue to codify information; create schemes that slightly vary another channel, such as brightness or saturation.
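The sequential and diverging schemes described above can be sketched as functions from a data value to a color. This is a rough illustration using the HLS model in Python's `colorsys` module, not Brewer's actual palettes; the hues, lightness ranges, and function names are assumptions, and the diverging version assumes `vmin < midpoint < vmax`.

```python
import colorsys

def sequential_color(value, vmin, vmax, hue=0.6):
    """Single-hue ramp: light for low values, dark for high values."""
    t = (value - vmin) / (vmax - vmin)
    lightness = 0.9 - 0.6 * t              # 0.9 (light) down to 0.3 (dark)
    return colorsys.hls_to_rgb(hue, lightness, 0.7)

def diverging_color(value, vmin, vmax, midpoint=0.0,
                    low_hue=0.6, high_hue=0.0):
    """Two-hue ramp: lightest at the midpoint, darker toward both extremes."""
    if value >= midpoint:
        t = (value - midpoint) / (vmax - midpoint)
        hue = high_hue
    else:
        t = (midpoint - value) / (midpoint - vmin)
        hue = low_hue
    lightness = 0.95 - 0.6 * t             # 0.95 at the midpoint, 0.35 at the ends
    return colorsys.hls_to_rgb(hue, lightness, 0.7)

low = sequential_color(0, 0, 10)
high = sequential_color(10, 0, 10)
center = diverging_color(0, -5, 5)
```

Because lightness changes along with hue, values remain distinguishable even when the two hues are hard to tell apart, which is the strategy the last paragraph recommends. HLS lightness is not perceptually uniform, so a production palette would need more careful tuning.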
Use icons and symbols to aid in understanding and limit unnecessary tagging

Symbols and icons are another avenue for visualizing information that goes beyond merely being decorative. They draw strength from their ability to exhibit a general context in an attractive, precise way. Icons illustrate concepts. Viewers can understand what the information is about by just glancing at the illustration.
[Figure: icon-based chart comparing Notebooks, Entertainment, and Lifestyle products across Singles, Couples, and Families, with percentage labels]
The typography in our reports: effective applications

Typography plays an important role in the design of reports and statements. Selecting the right font strengthens your message and captures the audience’s attention. Müller-Brockmann (1961), a graphic designer, defines typography as the proper visual element for composition. He notes that “the reader must be able to read the message from a text easily and comfortably. This depends largely on the size of the text, the length of the lines, and the spacing between the lines”.8

Typography is an art form in and of itself, in which every font has its own characteristics, which should be strategically combined.

For people outside the world of graphic design, choosing a font and setting other typographical features can be tricky, but it doesn’t have to be. Let’s take a practical look at the steps you should take when determining your typography, and then consider the images and visual elements that best accompany your text. Considerations when setting your typography:

• Determine the goal of your report’s content.
• Select a font that strengthens that goal. Fonts come in two types: with serifs or without (sans) serifs. Serif fonts have an extra stroke that conveys a sense of tradition, security, history, integrity, authority, and other such concepts. Sans-serif fonts stand out because they have a more polished, sophisticated feel; they convey a sense of modernity, order, cleanliness, elegance, avant-garde, and style.
• Pay attention to legibility. Remember that screen type does not appear in the same way as print type. It is best to choose a more responsive (sans-serif) font for on-screen texts, and fonts with serifs for printed reports. That said, there’s an exception to every rule, and today there is a bounty of fonts that are perfectly suitable for both digital and print media.
• Watch your weight (light, regular, bold). When it comes to bolding your text, a value of two or three should be plenty. It is better to reserve the heaviest weight for headlines and then apply a stylistic hierarchy based on your content. Avoid fonts that only offer one weight or style, since their applications are limited.
• Don’t forget that some fonts use more memory than others. Fonts with serifs generally monopolize more of your computer’s brain power than sans-serif fonts. This is an important consideration in interactive reports, since a document that occupies more RAM will be less responsive.

Fonts have personalities that help us establish a more attractive visual tone for our audience. Familiarizing yourself with a few can go a long way. There are:

• Professional fonts
• Handwritten fonts
• Fun fonts
• Minimalist fonts

8 The Graphic Artist and his Design Problems (Gestaltungsprobleme des Grafikers), Teufen, 1961
Prioritize patterns in your visualizations: Gestalt
The basic elements of the visualization process also involve preattentive attributes. Preattentive attributes are visual
features that facilitate the rapid visual perception of a graphic in a space. Designers use these characteristics to
better uncover relevant information in visuals, because these characteristics attract the eye.
Colin Ware, Director of the Data Visualization Research Lab at the University of New Hampshire, has highlighted
that preattentive attributes can be used as resources for drawing viewers’ immediate attention to certain
parts of visual representations (2004). According to Ware, preattentive processing happens very quickly, typically
in the first 10 milliseconds. This process is the mind’s attempt to rapidly extract basic visual characteristics from
the graphic (stage 1). These characteristics are then consciously processed, along with the perception of the object,
so that the mind can extract patterns (stage 2), ultimately enabling the information to move to the highest level of
perception (stage 3). This makes it possible to find answers to the initial visual question, utilizing the information
saved in our minds. Colin Ware, cited in Meirelles (2014), explains it as follows:
Preattentive attributes enhance object perception and cognition processes, leveraging our mind’s visual capacities.
Good data visualizations deliberately make use of these attributes because they boost the mind’s discovery and
recognition of patterns such as lines, planes, colors, movements, and spatial positioning.9
9 Dondis, D.A. (2015). La sintaxis de la imagen: introducción al alfabeto visual. Editorial Gustavo Gili: Barcelona
Meirelles, I. (2014). La información en el diseño. Barcelona: Parramón.
The visual below lists preattentive attributes that represent
aspects of lines and planes when visualizing and analyzing
graphic representation: shape, color, and spatial position.
[Figure: preattentive attributes of shape: Orientation, Line Length, Line Width, Size]
Detecting patterns is fundamental to structuring and organizing visual information. When we create visuals, we often want to highlight certain patterns over others. Preattentive attributes are the alphabet of visual language; analytic patterns are the words that we write by using them. When we see a good visualization, we immediately detect the preattentive attributes and recognize analytic patterns in the visualization. The following table summarizes a few basic analytic patterns:
Analytic patterns
We have seen how preattentive attributes and patterns make it possible to process and analyze visual information; they also enable us to improve pattern discovery and perceptive inferences and provide processes for solving visualization problems.

Gestalt’s principles are the principles that enable us to understand the requirements posed by certain problems so that we see everything as an integral, coherent whole. They involve proximity, similarity, shared destiny, “pragnanz” or pithiness, closure, simplicity, familiarity, and discernment between figure and ground.

According to Dondis (2015), Gestalt’s principles help describe the way we organize and merge elements in our minds. They quiet the noise of the graphics so that we relate, combine, and analyze them. These principles come into play whenever we analyze any sort of visualization. Only position and length can be used to accurately perceive quantitative data. The other attributes are useful for perceiving other sorts of data, such as categorical and relational data.

[Figure: Gestalt’s principles]
4. Storytelling for social and
market communication
As we saw at the beginning of this ebook, our mind
and moods, our worries and fears.

Paul Maclean, cited in María Alejandra Rendón (2009), proposes a “Triune brain” theory, which addresses the structure and behavior of the human mind. For Maclean, the mind consists of three inseparable parts (or distinct brains); none of the three functions independently or separately. They are the reptilian brain, the emotional brain, and the neocortex.

The reptilian brain is home to our unconscious, also known as our instinctive side. It manages survival and our body’s self-regulation. The second part, the emotional brain, is responsible for our emotional processes and basic motivations. Last but not least, the neocortex is our more rational, complex side. It is in charge of driving our systematic and logical thinking.

The triune model is a valuable tool for effectively com-

How, then, can we create stories that use data to communicate insights? Below, we explain three simple sequences for telling a story:

• Influencing people’s emotions by telling a story (drawing in their attention).
• Persuading them through benefits that cover specific needs (benefits/engagement).
• Moving on to concrete steps (call to action).

If you can successfully visualize this sequence, you understand the foundation of all narratives. What that means is that every story we try to tell has a beginning, a developed plot, and a resolution, all building up to the invaluable call to action. If you have a clear notion of how to include the “story” element in your reports, statements, and dashboards, you will successfully create stories that use your data to share insights.
Data storytelling

We all love good stories, and data is one of the best tools for telling them. Millions of pieces of data are generated every day. They could be converted into great stories, but instead they are left unused. It’s time to change all that. It’s time to start telling stories that draw their power from data.

So-called “data storytelling” is nothing more than placing a structured focus on the way we use data to communicate insights. It relies on three key elements: narrative, visualization, and data.

What do we get when we combine these elements?

Data + Narrative
Data can be insights; they are drawn from study and analysis. Their nature can propose the narrative context.

Visualization + Data

The story must motivate. It must have a plot, highs and lows, and an arc of emotional connection in order to draw in and entertain our audience.

Data + Visualization + Narration = Successfully using our data to tell a story, wield influence, and effect the desired change.
A basic recipe for storytelling in your presentations and final reports

In case you don’t have a clear notion of how to include the “story” element in your data, we’re going to outline a few points that will guide you, so that your presentations and reports manage to grab your audience’s attention and have a major impact:

1. Find the story in your data. Write, write, and write. Write about the highlights of your research in different roles. Worry about presentation later.

2. Define the perspective. Who are you talking to? What’s the best way to achieve your objective?

3. Create a hierarchy. What is the most important thing you are trying to convey? Establish different depths to your reading and data. Avoid irrelevant information.

4. Organize. Figure out the most suitable sequence for presenting your data. What relationships can you establish between different aspects of your data? What do some pieces of data mean relative to others? Are they the framework (data that reveals), the details (data that delves deeper), or the contrast (data that dramatizes differences)?

5. Plot. Generate interest; create tension. Depict the concept, crux, and resolution. Incentivize your audience to keep reading until the last page, so to speak. Establish relationships.

6. Use data to anchor your narrative. The story in your data ought to be simple; the vision drawn from the data comes with an implicit responsibility to be sincere and honest.

7. Design principles. Adhere to the best practices of design to visualize your data.

8. Review, review, review. Make sure that all of your analysis is precise.

9. Be familiar with your content and respect your audience.

10. Keep it short and sweet. Data-based storytelling is the product of hours of work. It’s best to keep presentations short, with concrete ideas adapted to the audience so that your message is conveyed efficiently and smoothly.
5. Trends in market research and
data visualization dashboards
Trends in market Scrollytelling Instagram, Snapchat, and YouTube. RJ Andrews, in his
Virtual reality visualizations

Virtual reality has the potential to revolutionize data visualization, especially when it comes to big data. Even in a two-dimensional image, there is already too much data for the human eye to capture. Now imagine a three-dimensional data visualization, which allows the user to fully interact with data in a 360-degree field of vision.

Virtual reality data visualizations are highly interactive, computer generated 3D projections. Although the concept of virtual reality is nothing new, the idea of immersive data exploration certainly is, and the exciting possibilities that it promises are endless.

What does the future have in store?

Visual data representation techniques and methods progress every day, as technology evolves and our body of theoretical knowledge grows. As this technology and this knowledge work in tandem, we will continue developing solutions for our problems and needs. From this report, we hope you have deduced that, in our current era, images are the most efficient language. We hope you now understand that tools and software can help us discover limitless graphic resources and develop new structures for communicating and conveying ideas. Consequently, we can confidently state that the applications of graphic representation are constantly expanding, and we must not forget that they are the objective of our communication strategies in market research.
Visualize It!
A Comprehensive Guide to Data Visualization

Thank you for reading. Stay tuned with us.

ABOUT THE AUTHOR
Focused on creating data visualizations and market research dashboards leveraging data to enhance experiences. She is a passionate, curious person who enjoys collaborating in human-centered design projects around the world.

Copy editing
Bernou Benne | Marketing Specialist

Graphic design
Nina Rojc | Graphic Designer
Anna Caballero | Global Brand Designer
Melissa Matias | Visual Data Designer
Visualizing Data
—John Tukey
What do the paths that millions of visitors take through a web site look like?
How do the 3.1 billion A, C, G, and T letters of the human genome compare to
those of the chimp or the mouse? Out of a few hundred thousand files on your
computer’s hard disk, which ones are taking up the most space, and how often do
you use them? By applying methods from the fields of computer science,
statistics, data mining, graphic design, and visualization, we can begin to answer
these questions in a meaningful way that also makes the answers accessible to
others.
All of the previous questions involve a large quantity of data, which makes it
extremely difficult to gain a “big picture” understanding of its meaning. The
problem is further compounded by the data’s continually changing nature, which
can result from new information being added or older information continuously
being refined. This deluge of data necessitates new software-based tools, and its
complexity requires extra consideration. Whenever we analyze data, our goal is
to highlight its features in order of their importance, reveal patterns, and
simultaneously show features that exist across multiple dimensions.
This book shows you how to make use of data as a resource that you might
otherwise never tap. You’ll learn basic visualization principles, how to choose the
right kind of display for your purposes, and how to provide interactive features
that will bring users to your site over and over again. You’ll also learn to program
in Processing, a simple but powerful environment that lets you quickly carry out
the techniques in this book. You’ll find Processing a good basis for designing
interfaces around large data sets, but even if you move to other visualization
tools, the ways of thinking presented here will serve you as long as human beings
continue to process information the same way they’ve always done.
But this is an exciting time. For $300, you can purchase a commodity PC that has
thousands of times more computing power than the first computers used to
tabulate the U.S. Census. The capability of modern machines is astounding.
Performing sophisticated data analysis no longer requires a research laboratory,
just a cheap machine and some code. Complex data sets can be accessed,
explored, and analyzed by the public in a way that simply was not possible in the
past.
The past 10 years have also brought about significant changes in the graphic
capabilities of average machines. Driven by the gaming industry, high-end 2D
and 3D graphics hardware no longer requires dedicated machines from specific
vendors, but can instead be purchased as a $100 add-on card and is standard
equipment for any machine costing $700 or more. When not used for gaming,
these cards can render extremely sophisticated models with thousands of shapes,
and can do so quickly enough to provide smooth, interactive animation. And
these prices will only decrease—within a few years’ time, accelerated graphics
will be standard equipment on the aforementioned commodity PC.
DATA COLLECTION
We’re getting better and better at collecting data, but we lag in what we can do
with it. Most of the examples in this book come from freely available data
sources on the Internet. Lots of data is out there, but it’s not being used to its
greatest potential because it’s not being visualized as well as it could be. (More
about this can be found in Chapter 9, which covers places to find data and how to
retrieve it.)
With all the data we’ve collected, we still don’t have many satisfactory answers
to the sort of questions that we started with. This is the greatest challenge of our
information-rich era: how can these questions be answered quickly, if not
instantaneously? We’re getting so good at measuring and recording things, why
haven’t we kept up with the methods to understand and communicate this
information?
What happens when things start moving? How do we interact with “live” data?
How do we unravel data as it changes over time? We might use animation to play
back the evolution of a data set, or interaction to control what time span we’re
looking at. How can we write code for these situations?
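As a rough illustration of the two ideas in that paragraph, the sketch below treats interaction as selecting a time span and animation as replaying the data one step at a time. It is written in Python for brevity rather than in Processing, and the sample data and function names are hypothetical.

```python
from datetime import date

# Hypothetical daily visit counts (illustrative only).
samples = [
    (date(2024, 1, 1), 120),
    (date(2024, 1, 2), 135),
    (date(2024, 1, 3), 90),
    (date(2024, 1, 4), 160),
]

def in_span(rows, start, end):
    """Interaction: keep only the points inside the chosen time span."""
    return [(day, value) for day, value in rows if start <= day <= end]

def playback(rows):
    """Animation: yield one growing frame per time step, oldest first."""
    ordered = sorted(rows)
    for i in range(1, len(ordered) + 1):
        yield ordered[:i]

window = in_span(samples, date(2024, 1, 2), date(2024, 1, 3))
frames = list(playback(samples))
```

A drawing loop would render one frame per tick for playback, or redraw `in_span(...)` whenever the user moves a range control.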
As a contrast, think about subway maps, which are abstracted from the complex
shape of the city and are focused on the rider’s goal: to get from one place to the
next. Limiting the detail of each shape, turn, and geographical formation reduces
this complex data set to answering the rider’s question: “How do I get from point
A to point B?”
Harry Beck invented the format now commonly used for subway maps in the
1930s, when he redesigned the map of the London Underground. Inspired by the
layout of circuit boards, the map simplified the complicated Tube system to a
series of vertical, horizontal, and 45° diagonal lines. While attempting to preserve
as much of the relative physical layout as possible, the map shows only the
connections between stations, as that is the only information that riders use to
decide their paths.
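The constraint to vertical, horizontal, and 45° diagonal lines can be expressed as snapping any direction to the nearest multiple of 45 degrees. The Python sketch below only illustrates that constraint; it is not a description of how Beck drew the map, and the function name is hypothetical.

```python
import math

def snap_direction(dx, dy):
    """Return the unit vector at the multiple of 45 degrees nearest to (dx, dy)."""
    angle = math.atan2(dy, dx)
    step = math.pi / 4                       # 45 degrees in radians
    snapped = round(angle / step) * step
    # Round to hide floating-point noise such as 6e-17 in place of 0.
    return (round(math.cos(snapped), 6), round(math.sin(snapped), 6))

print(snap_direction(10, 1))   # nearly horizontal
print(snap_direction(3, 4))    # closest to a 45 degree diagonal
print(snap_direction(0, -5))   # vertical
```

Applying this to each segment between two stations discards the true bearing and keeps only one of eight directions, which is the kind of simplification the map relies on.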
When beginning a visualization project, it’s common to focus on all the data that
has been collected so far. The amounts of information might be enormous—
people like to brag about how many gigabytes of data they’ve collected and how
difficult their visualization problem is. But great information visualization never
starts from the standpoint of the data set; it starts with questions. Why was the
data collected, what’s interesting about it, and what stories can it tell?
The most important part of understanding data is identifying the question that
you want to answer. Rather than thinking about the data that was collected, think
about how it will be used and work backward to what was collected. You collect
data because you want to know something about it. If you don’t really know why
you’re collecting it, you’re just hoarding it. It’s easy to say things like, “I want to
know what’s in it,” or “I want to know what it means.” Sure, but what’s
meaningful?
The more specific you can make your question, the more specific and clear the
visual result will be. When questions have a broad scope, as in “exploratory data
analysis” tasks, the answers themselves will be broad and often geared toward
those who are themselves versed in the data. John Tukey, who coined the term
Exploratory Data Analysis, said “. . . pictures based on exploration of data should
force their messages upon us.”[1] Too many data problems are labeled
“exploratory” because the data collected is overwhelming, even though the
original purpose was to answer a specific question or achieve specific results.
One of the most important (and least technical) skills in understanding data is
asking good questions. An appropriate question shares an interest you have in the
data, tries to convey it to others, and is curiosity-oriented rather than math-
oriented. Visualizing data is just like any other type of communication: success is
defined by your audience’s ability to pick up on, and be excited about, your
insight.
Admittedly, you may have a rich set of data to which you want to provide
flexible access by not defining your question too narrowly. Even then, your goal
should be to highlight key findings. There is a tendency in the visualization field
to borrow from the statistics field and separate problems into exploratory and
expository, but for the purposes of this book, this distinction is not useful. The
same methods and process are used for both.
In short, a proper visualization is a kind of narrative, providing a clear answer to
a question without extraneous details. By focusing on the original intent of the
question, you can eliminate such details because the question provides a
benchmark for what is and is not necessary.
A COMBINATION OF MANY DISCIPLINES
Given the complexity of data, using it to provide a meaningful solution requires
insights from diverse fields: statistics, data mining, graphic design, and
information visualization. However, each field has evolved in isolation from the
others.
PROCESS
We must reconcile these fields as parts of a single process. Graphic designers can
learn the computer science necessary for visualization, and statisticians can
communicate their data more effectively by understanding the visual design
principles behind data representation. The methods themselves are not new, but
their isolation within individual fields has prevented them from being used
together. In this book, we use a process that bridges the individual disciplines,
placing the focus and consideration on how data is understood rather than on the
viewpoint and tools of each individual field.
The process of understanding data begins with a set of numbers and a question.
The following steps form a path to the answer:
Acquire
Obtain the data, whether from a file on a disk or a source over a network.
Parse
Provide some structure for the data’s meaning, and order it into categories.
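Filter
Remove the portions of the data that are not relevant to the question.
Mine
Apply methods from math, statistics, and data mining to the data.
Represent
Choose the basic form that the data will take, such as a list, a tree, or a plot.
Refine
Use graphic design methods to further clarify the representation.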
Interact
Add methods for manipulating the data or controlling what features are
visible.
Of course, these steps can’t be followed slavishly. You can expect that they’ll be
involved at one time or another in projects you develop, but sometimes it will be
four of the seven, and at other times all of them.
Part of the problem with the individual approaches to dealing with data is that the
separation of fields leads to different people each solving an isolated part of the
problem. When this occurs, something is lost at each transition—like a
“telephone game” in which each step of the process diminishes aspects of the
initial question under consideration. The initial format of the data (determined by
how it is acquired and parsed) will often drive how it is considered for filtering or
mining. The statistical method used to glean useful information from the data
might drive the initial presentation. In other words, the final representation
reflects the results of the statistical method rather than a response to the initial
question.
Similarly, a graphic designer brought in at the next stage will most often respond
to specific problems with the representation provided by the previous steps,
rather than focus on the initial question. The visualization step might add a
compelling and interactive means to look at the data filtered from the earlier
steps, but the display is inflexible because the earlier stages of the process are
hidden. Furthermore, practitioners of each of the fields that commonly deal with
data problems are often unclear about how to traverse the wider set of methods
and arrive at an answer.
This book covers the whole path from data to understanding: the transformation
of a jumble of raw numbers into something coherent and useful. The data under
consideration might be numbers, lists, or relationships between multiple entities.
It should be kept in mind that the term visualization is often used to describe the
art of conveying a physical relationship, such as the subway map mentioned near
the start of this chapter. That’s a different kind of analysis and skill from
information visualization, where the data is primarily numeric or symbolic (e.g.,
A, C, G, and T—the letters of genetic code—and additional annotations about
them). The primary focus of this book is information visualization: for instance, a
series of numbers that describes temperatures in a weather forecast rather than
the shape of the cloud cover contributing to them.
An Example
To illustrate the seven steps listed in the previous section, and how they
contribute to effective information visualization, let’s look at how the process can
be applied to understanding a simple data set. In this case, we’ll take the zip code
numbering system that the U.S. Postal Service uses. The application is not
particularly advanced, but it provides a skeleton for how the process works.
(Chapter 6 contains a full implementation of the project.)
WHAT IS THE QUESTION?
All data problems begin with a question and end with a narrative construct that
provides a clear answer. The Zipdecode project (described further in Chapter 6)
was developed out of a personal interest in the relationship of the zip code
numbering system to geographic areas. Living in Boston, I knew that numbers
starting with a zero denoted places on the East Coast. Having spent time in San
Francisco, I knew the initial numbers for the West Coast were all nines. I grew up
in Michigan, where all our codes were four-prefixed. But what sort of area does
the second digit specify? Or the third?
The finished application was initially constructed in a few hours as a quick way
to take what might be considered a boring data set (a long list of zip codes,
towns, and their latitudes and longitudes) and create something engaging for a
web audience that explained how the codes related to their geography.
Acquire
The acquisition step involves obtaining the data. Like many of the other steps,
this can be either extremely complicated (e.g., trying to glean useful data from a
large system) or very simple (reading a readily available text file).
A copy of the zip code listing can be found on the U.S. Census Bureau web site,
as it is frequently used for geographic coding of statistical data. The listing is a
freely available file with approximately 42,000 lines, one for each of the codes, a
tiny portion of which is shown in Figure 1-1.
Figure 1-1. Zip codes in the format provided by the U.S. Census Bureau
Acquisition concerns how the user downloads your data as well as how you
obtained the data in the first place. If the final project will be distributed over the
Internet, as you design the application, you have to take into account the time
required to download data into the browser. And because data downloaded to the
browser is probably part of an even larger data set stored on the server, you may
have to structure the data on the server to facilitate retrieval of common subsets.
Parse
After you acquire the data, it needs to be parsed—changed into a format that tags
each part of the data with its intended use. Each line of the file must be broken
along its individual parts; in this case, it must be delimited at each tab character.
Then, each piece of data needs to be converted to a useful format. Figure 1-2
shows the layout of each line in the census listing, which we have to understand
to parse it and get out of it what we want.
Each field is formatted as a data type that we’ll handle in a conversion program:
String
A set of characters that forms a word or a sentence. Here, the city or town
name is designated as a string. Because the zip codes themselves are not so
much numbers as a series of digits (if they were numbers, the code 02139
would be stored as 2139, which is not the same thing), they also might be
considered strings.
Float
A number with decimal points (used for the latitudes and longitudes of each
location). The name is short for floating point, from programming
nomenclature that describes how the numbers are stored in the computer’s
memory.
Character
A single letter or other symbol.
Integer
A number without a fractional portion, and hence no decimal points (e.g.,
−14, 0, or 237).
With the completion of this step, the data is successfully tagged and consequently
more useful to a program that will manipulate or represent it in some way.
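As a rough sketch of this step in Python (not the implementation from Chapter 6), the function below splits one tab-delimited line and converts each piece to one of the types described above. The field order (code, latitude, longitude, town name), the Place type, and the sample line are assumptions made for illustration; the real layout is the one shown in Figure 1-2.

```python
from typing import NamedTuple


class Place(NamedTuple):
    code: str    # kept as a string so the leading zero survives
    lat: float
    lon: float
    name: str


def parse_line(line: str) -> Place:
    """Split one tab-delimited line and convert each field to its type.

    Assumed (hypothetical) field order: code, latitude, longitude, name.
    """
    code, lat, lon, name = line.rstrip("\n").split("\t")
    return Place(code, float(lat), float(lon), name)


# Illustrative line only; the coordinates are approximate.
sample = "02139\t42.36\t-71.10\tCambridge\n"
place = parse_line(sample)
print(place)
```

Keeping the code as a string is the point of the String entry above: converting 02139 to an integer would silently drop the leading zero.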
Filter
The next step involves filtering the data to remove portions not relevant to our
use. In this example, for the sake of keeping it simple, we’ll be focusing on the
contiguous 48 states, so the records for cities and towns that are not part of those
states—Alaska, Hawaii, and territories such as Puerto Rico—are removed.
Another project could require significant mathematical work to place the data
into a mathematical model or normalize it (convert it to an acceptable range of
numbers).
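A minimal sketch of this kind of filter in Python follows. It assumes each parsed record carries a two-letter state abbreviation under a "state" key, which is a hypothetical layout; the text does not say how the state is recorded in the listing.

```python
# Abbreviations excluded in this sketch: Alaska, Hawaii, Puerto Rico.
# Other territories would have to be added to this set as well.
EXCLUDED = {"AK", "HI", "PR"}


def keep_contiguous(records):
    """Return only the records whose state is not in EXCLUDED."""
    return [r for r in records if r["state"] not in EXCLUDED]


# Sample records for illustration.
records = [
    {"code": "02139", "state": "MA"},
    {"code": "99501", "state": "AK"},
    {"code": "96801", "state": "HI"},
    {"code": "94103", "state": "CA"},
]
contiguous = keep_contiguous(records)
print([r["code"] for r in contiguous])
```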
Mine
This step involves math, statistics, and data mining. The data in this case receives
only a simple treatment: the program must figure out the minimum and
maximum values for latitude and longitude by running through the data (as
shown in Figure 1-3) so that it can be presented on a screen at a proper scale.
Most of the time, this step will be far more complicated than a pair of simple
math operations.
Figure 1-3. Mining the data: just compare values to find the minimum and
maximum
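The comparison described here can be sketched in a few lines of Python. The function name and the sample points are illustrative assumptions; the only idea taken from the text is a single pass that keeps the smallest and largest latitude and longitude seen so far.

```python
def bounds(points):
    """Run through (lat, lon) pairs once, tracking the extremes."""
    min_lat = min_lon = float("inf")
    max_lat = max_lon = float("-inf")
    for lat, lon in points:
        min_lat = min(min_lat, lat)
        max_lat = max(max_lat, lat)
        min_lon = min(min_lon, lon)
        max_lon = max(max_lon, lon)
    return min_lat, max_lat, min_lon, max_lon


# Approximate sample coordinates, for illustration only.
points = [(42.36, -71.10), (37.77, -122.41), (42.28, -83.74)]
print(bounds(points))
```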
Represent
This step determines the basic form that a set of data will take. Some data sets are
shown as lists, others are structured like trees, and so forth. In this case, each zip
code has a latitude and longitude, so the codes can be mapped as a two-
dimensional plot, with the minimum and maximum values for the latitude and
longitude used for the start and end of the scale in each dimension. This is
illustrated in Figure 1-4.
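One way to sketch this mapping in Python is a plain linear rescale from the data range to the screen range. This is an assumption about how the plot could be drawn rather than the book's code: it uses no map projection, and the bounds, width, and height below are hypothetical values.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from the range [lo, hi] to [out_lo, out_hi]."""
    return out_lo + (value - lo) / (hi - lo) * (out_hi - out_lo)


def to_screen(lat, lon, bounds, width, height):
    """Convert a latitude/longitude pair to (x, y) pixel coordinates."""
    min_lat, max_lat, min_lon, max_lon = bounds
    x = scale(lon, min_lon, max_lon, 0, width)
    # Screen y grows downward, so the maximum latitude maps to y = 0.
    y = scale(lat, min_lat, max_lat, height, 0)
    return x, y


# Hypothetical bounds: (min_lat, max_lat, min_lon, max_lon).
example_bounds = (25.0, 50.0, -125.0, -65.0)
print(to_screen(37.5, -95.0, example_bounds, 600, 400))
```

The minimum and maximum values found in the Mine step become the ends of each scale, so the extreme points land exactly on the edges of the display.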
The Represent stage is a linchpin that informs the single most important decision
in a visualization project and can make you rethink earlier stages. How you
choose to represent the data can influence the very first step (what data you
acquire) and the third step (what particular pieces you extract).
Refine
In this step, graphic design methods are used to further clarify the representation
by calling more attention to particular data (establishing hierarchy) or by
changing attributes (such as color) that contribute to readability.
Figure 1-5. Using color to refine the representation
Interact
The next stage of the process adds interaction, letting the user control or explore
the data. Interaction might cover things like selecting a subset of the data or
changing the viewpoint. As another example of a stage affecting an earlier part of
the process, this stage can also affect the refinement step, as a change in
viewpoint might require the data to be designed differently.
In the Zipdecode project, typing a number selects all zip codes that begin with
that number. Figure 1-6 and Figure 1-7 show all the zip codes beginning with
zero and nine, respectively.
Figure 1-6. The user can alter the display through choices (zip codes
starting with 0)
Figure 1-7. The user can alter the display through choices (zip codes
starting with 9)
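The selection behavior can be approximated with a prefix match, sketched below in Python. The function name and the short list of codes are illustrative assumptions; the Zipdecode application itself is covered in Chapter 6.

```python
def select_by_prefix(codes, typed):
    """Return the codes that begin with the digits typed so far."""
    return [c for c in codes if c.startswith(typed)]


# A handful of sample codes, stored as strings to keep leading zeros.
codes = ["02139", "02116", "01002", "94103", "48104"]

print(select_by_prefix(codes, "0"))
print(select_by_prefix(codes, "9"))
```

Each additional digit narrows the selection, which is why storing the codes as strings rather than integers matters for this kind of interaction.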
Another enhancement to user interaction (not shown here) enables the users to
traverse the display laterally and run through several of the prefixes. After typing
part or all of a zip code, holding down the Shift key allows users to replace the
last number typed without having to hit the Delete key to back up.
Typing is a very simple form of interaction, but it allows the user to rapidly gain
an understanding of the zip code system’s layout. Just contrast this sample
application with the difficulty of deducing the same information from a table of
zip codes and city names.
The viewer can continue to type digits to see the area covered by each subsequent
set of prefixes. Figure 1-8 shows the region highlighted by the two digits 02,
Figure 1-9 shows the three digits 021, and Figure 1-10 shows the four digits
0213. Finally, Figure 1-11 shows what you get by entering a full zip code, 02139
—a city name pops up on the display.
Figure 1-9. Honing in with three digits (021)
In addition, users can enable a “zoom” feature that draws them closer to each
subsequent digit, revealing more detail around the area and showing a constant
rate of detail at each level. Because we’ve chosen a map as a representation, we
could add more details of state and county boundaries or other geographic
features to help viewers associate the “data” space of zip code points with what
they know about the local environment.
Figure 1-11. Honing in even further with the full zip code (02139)
The need for a compact representation on the screen led me to refilter the
data to include only the contiguous 48 states.
The connections between the steps in the process illustrate the importance of the
individual or team in addressing the project as a whole. This runs counter to the
common fondness for assembly-line style projects, where programmers handle
the technical portions, such as acquiring and parsing data, and visual designers
are left to choose colors and typefaces. At the intersection of these fields is a
more interesting set of properties that demonstrates their strength in combination.
When acquiring data, consider how it can change, whether sporadically (such as
once a month) or continuously. This expands the notion of graphic design that’s
traditionally focused on solving a specific problem for a specific data set, and
instead considers the meta-problem of how to handle a certain kind of data that
might be updated in the future.
In the filtering step, data can be filtered in real time, as in the Zipdecode
application. During visual refinement, changes to the design can be applied
across the entire system. For instance, a color change can be automatically
applied to the thousands of elements that require it, rather than having to make such a
tedious modification by hand. This is the strength of a computational approach,
where tedious processes are minimized through automation.
Principles
I’ll finish this general introduction to visualization by laying out some ways of
thinking about data and its representation that have served me well over many
years and many diverse projects. They may seem abstract at first, or of minor
importance to the job you’re facing, but I urge you to return and reread them as
you practice visualization; they just may help you in later tasks.
Chapters in this book are divided by types of data, rather than types of display. In
other words, we’re not saying, “Here’s how to make a bar graph,” but “Here are
several ways to show a correlation.” This gives you a more powerful way to
think about maximizing what can be said about the data set in question.
I’m often asked for a library of tools that will automatically make attractive
representations of any given data set. But if each data set is different, the point of
visualization is to expose that fascinating aspect of the data and make it self-
evident. Although readily available representation toolkits are useful starting
points, they must be customized during an in-depth study of the task.
Data is often stored in a generic format. For instance, databases used for
annotation of genomic data might consist of enormous lists of start and stop
positions, but those lists vary in importance depending on the situation in which
they’re being used. We don’t view books as long abstract sequences of words, yet
when it comes to information, we’re often so taken with the enormity of the
information and the low-level abstractions used to store it that the narrative is
lost. Unless you stop thinking about databases, everything looks like a table—
millions of rows and columns to be stored, queried, and viewed.
In this book, we use a small collection of simple helper classes as starting points.
Often, we’ll be targeting the Web as a delivery platform, so the classes are
designed to take up minimal time for download and display. But I will also
discuss more robust versions of similar tools that can be used for more in-depth
work.
This book aims to help you learn to understand data as a tool for human decision-
making—how it varies, how it can be used, and how to find what’s unique about
your data set. We’ll cover many standard methods of visualization and give you
the background necessary for making a decision about what sort of representation
is suitable for your data. For each representation, we consider its positive and
negative points and focus on customizing it so that it’s best suited to what you’re
trying to convey about your data set.
Consider a weather map, with curved bands of temperatures across the country.
The designers avoid giving each band a detailed edge (particularly because the
data is often fuzzy). Instead, they convey a broader pattern in the data.
Subway maps leave out the details of surface roads because the additional detail
adds more complexity to the map than necessary. Before maps were created in
Beck’s style, it seemed that knowing street locations was essential to navigating
the subway. Instead, individual stations are used as waypoints for direction
finding. The important detail is that your target destination is near a particular
station. Directions can be given in terms of the last few turns to be taken after
you exit the station, or you can consult a map posted at the station that describes
the immediate area aboveground.
It’s easy to collect data, and some people become preoccupied with simply
accumulating more complex data or data in mass quantities. But more data is not
implicitly better, and often serves to confuse the situation. Just because it can be
measured doesn’t mean it should. Perhaps making things simple is worth
bragging about, but making complex messes is not. Find the smallest amount of
data that can still convey something meaningful about the contents of the data
set. As with Beck’s underground map, focusing on the question helps define
those minimum requirements.
The same holds for the many “dimensions” that are found in data sets. Web site
traffic statistics have many dimensions: IP address, date, time of day, page
visited, previous page visited, result code, browser, machine type, and so on.
While each of these might be examined in turn, they relate to distinct questions.
Only a few of the variables are required to answer a typical question, such as
“How many people visited page x over the last three months, and how has that
figure changed each month?” Avoid trying to show a burdensome
multidimensional space that maps too many points of information.
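To make the point concrete, here is a small Python sketch that answers the monthly question using only two of the dimensions, the date and the page visited. The log entries, field names, and addresses are invented for illustration, and the sketch counts visits rather than distinct people.

```python
from collections import Counter
from datetime import date


def monthly_visits(log, page):
    """Count visits to one page, grouped by (year, month)."""
    counts = Counter()
    for entry in log:
        if entry["page"] == page:
            d = entry["date"]
            counts[(d.year, d.month)] += 1
    return dict(sorted(counts.items()))


# Hypothetical log; the "ip" and "browser" fields are carried but never used.
log = [
    {"ip": "192.0.2.1", "date": date(2024, 1, 5), "page": "/x", "browser": "A"},
    {"ip": "192.0.2.2", "date": date(2024, 1, 20), "page": "/x", "browser": "B"},
    {"ip": "192.0.2.1", "date": date(2024, 1, 21), "page": "/y", "browser": "A"},
    {"ip": "192.0.2.3", "date": date(2024, 2, 2), "page": "/x", "browser": "A"},
    {"ip": "192.0.2.2", "date": date(2024, 3, 10), "page": "/x", "browser": "B"},
    {"ip": "192.0.2.4", "date": date(2024, 3, 11), "page": "/x", "browser": "C"},
]
print(monthly_visits(log, "/x"))
```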
In what way will your audience use the piece? A mapping application used on a
mobile device has to be designed with a completely different set of criteria than
one used on a desktop computer. Although both applications use maps, they have
little to do with each other. The focus of the desktop application may be finding
[1] Tukey, John Wilder. Exploratory Data Analysis.
Reading, MA: Addison-Wesley, 1977.
Chapter 2 What: Data Abstraction - Visualization Analysis and Design https://learning.oreilly.com/library/view/visualization-analysis-and/9781...
Chapter 2
Figure 2.1 shows the abstract types of what can be visualized. The four basic
dataset types are tables, networks, fields, and geometry; other possible collections
of items include clusters, sets, and lists. These datasets are made up of different
combinations of the five data types: items, attributes, links, positions, and
grids. For any of these dataset types, the full dataset could be available immediately
in the form of a static file, or it might be dynamic data processed gradually
in the form of a stream. The type of an attribute can be categorical or ordered,
with a further split into ordinal and quantitative. The ordering direction of attributes
can be sequential, diverging, or cyclic.
Figure 2.1.
Many aspects of vis design are driven by the kind of data that you have at your
disposal. What kind of data are you given? What information can you figure out
from the data, versus the meanings that you must be told explicitly? What high-
level concepts will allow you to split datasets apart into general and useful
pieces?
14, 2.6, 30, 30, 15, 100001
What does this sequence of six numbers mean? You can’t possibly know yet,
without more information about how to interpret each number. Is it locations for
two points far from each other in three-dimensional space, 14, 2.6, 30 and 30, 15,
100001? Is it two points closer to each other in two-dimensional space, 14, 2.6
and 30, 30, with the fifth number meaning that there are 15 links between these
two points, and the sixth number assigning the weight of ‘100001’ to that link?
Basil, 7, S, Pear
These numbers and words could have many possible meanings. Maybe a food
shipment of produce has arrived in satisfactory condition on the 7th day of the
month, containing basil and pears. Maybe the Basil Point neighborhood of the
city has had 7 inches of snow cleared by the Pear Creek Limited snow removal
To move beyond guesses, you need to know two crosscutting pieces of information
about these terms: their semantics and their types. The semantics of the data
is its real-world meaning. For instance, does a word represent a human first
name, or is it the shortened version of a company name where the full name can
be looked up in an external list, or is it a city, or is it a fruit? Does a number represent
a day of the month, or an age, or a measurement of height, or a unique
code for a specific person, or a postal code for a neighborhood, or a position in
space?
The type of the data is its structural or mathematical interpretation. At the data
level, what kind of thing is it: an item, a link, an attribute? At the dataset level,
how are these data types combined into a larger structure: a table, a tree, a field
of sampled values? At the attribute level, what kinds of mathematical operations
are meaningful for it? For example, if a number represents a count of boxes of
detergent, then its type is a quantity, and adding two such numbers together
makes sense. If the number represents a postal code, then its type is a code rather
than a quantity—it is simply the name for a category that happens to be a number
rather than a textual name. Adding two of these numbers together does not make
sense.
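The distinction between a quantity and a code can be sketched in Python as follows. The variable names and values are invented for illustration; the choice to store the postal code as a string is one common convention, not something the text prescribes.

```python
# Quantitative attribute: counts of detergent boxes. Adding them is meaningful.
boxes_monday = 12
boxes_tuesday = 30
total_boxes = boxes_monday + boxes_tuesday

# Categorical attribute: postal codes are names that happen to use digits.
postal_a = "02139"
postal_b = "94103"
same_area = postal_a == postal_b   # testing equality is meaningful

# int(postal_a) + int(postal_b) would run, but the sum names no place.
print(total_boxes, same_area)
```

Storing the code as a string makes the meaningless arithmetic harder to do by accident, while equality tests and grouping still work.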
Table 2.1 shows several more lines of the same dataset. This simple example table
is tiny, with only ten rows and five columns. The exact semantics should be
provided by the creator of the dataset; I give it with the column titles. In this
case, each person has a unique identifier, a name, an age, a shirt size, and a favorite
fruit.
Table 2.1. A full table with column titles that provide the intended semantics
of the attributes.
ID   Name      Age   Shirt Size   Favorite Fruit
1    Amy       8     S            Apple
2    Basil     7     S            Pear
3    Clara     9     M            Durian
4    Desmond   13    L            Elderberry
5    Ernest    12    L            Peach
6    Fanny     10    S            Lychee
7    George    9     M            Orange
8    Hector    8     L            Loquat
9    Ida       10    M            Pear
10   Amy       12    M            Orange
Sometimes types and semantics can be correctly inferred simply by observing the
syntax of a data file or the names of variables within it, but often they must be
provided along with the dataset in order for it to be interpreted correctly.
Sometimes this kind of additional information is called metadata; the line between
data and metadata is not clear, especially given that the original data is of‐
The classification below presents a way to think about dataset and attribute types
and semantics in a way that is general enough to cover the cases interesting in
vis, yet specific enough to be helpful for guiding design choices at the abstraction
and idiom levels.
Figure 2.2 shows the five basic data types discussed in this book: items, attributes,
links, positions, and grids. An attribute is some specific property that
Figure 2.2.
The five basic data types: items, attributes, links, positions, and grids.
* Synonyms for attribute are variable and data dimension, or just dimension
for short. Since dimension has many meanings, in this book it is reserved for the
visual channels of spatial position as discussed in Section 6.3.
* The word dataset is singular. In vis the word data is commonly used as a
singular mass noun as well, in contrast to the traditional usage in the natural sciences
where data is plural.
Figure 2.3 shows that these basic dataset types arise from combinations of the
data types of items, attributes, links, positions, and grids.
Figure 2.3.
The four basic dataset types are tables, networks, fields, and geometry; other possible
collections of items are clusters, sets, and lists. These datasets are made up
of five core data types: items, attributes, links, positions, and grids.
Figure 2.4 shows the internal structure of the four basic dataset types in detail.
Tables have cells indexed by items and attributes, for either the simple flat case
or the more complex multidimensional case. In a network, items are usually
called nodes, and they are connected with links; a special case of networks is
trees. Continuous fields have grids based on spatial positions where cells contain
attributes. Spatial geometry has only position information.
Figure 2.4.
Many datasets come in the form of tables that are made up of rows and columns,
a familiar form to anybody who has used a spreadsheet. In this chapter, I focus on
the concept of a table as simply a type of dataset that is independent of any particular
visual representation; later chapters address the question of what visual
representations are appropriate for the different types of datasets.
For a simple flat table, the terms used in this book are that each row represents
an item of data, and each column is an attribute of the dataset. Each cell in the
table is fully specified by the combination of a row and a column—an item and
an attribute—and contains a value for that pair. Figure 2.5 shows an example of
the first few dozen items in a table of orders, where the attributes are order ID,
order date, order priority, product container, product base margin, and ship date.
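As an illustration of this item-and-attribute indexing, the Python sketch below stores the first three rows of Table 2.1 and looks up a single cell. The list-of-lists layout and the cell helper are assumptions chosen for brevity, not a structure the book defines.

```python
# Column titles (attributes) and the first three rows (items) of Table 2.1.
attributes = ["ID", "Name", "Age", "Shirt Size", "Favorite Fruit"]
items = [
    [1, "Amy", 8, "S", "Apple"],
    [2, "Basil", 7, "S", "Pear"],
    [3, "Clara", 9, "M", "Durian"],
]


def cell(row, attribute):
    """Look up the value for one (item, attribute) pair."""
    return items[row][attributes.index(attribute)]


print(cell(1, "Favorite Fruit"))
```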
Figure 2.5.
A multidimensional table has a more complex structure for indexing into a cell,
with multiple keys.
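One simple way to picture indexing with multiple keys is a Python dictionary keyed by tuples, sketched below. The keys (a city and a year) and the values are hypothetical and serve only to show that a cell is found by combining more than one key.

```python
# Hypothetical measurements indexed by two keys: (city, year).
table = {
    ("Boston", 2019): 3.1,
    ("Boston", 2020): 2.7,
    ("Detroit", 2019): 4.0,
}

# A single cell needs both keys.
value = table[("Boston", 2020)]

# Fixing one key yields a slice of the table along the other.
boston = {year: v for (city, year), v in table.items() if city == "Boston"}
print(value, boston)
```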
The dataset type of networks is well suited for specifying that there is some kind
* A synonym for networks is graphs. The word graph is also deeply overloaded
in vis. Sometimes it is used to mean network as we discuss here, for instance
in the vis subfield called graph drawing or the mathematical subfield
called graph theory. Sometimes it is used in the field of statistical graphics to
mean chart, as in bar graphs and line graphs.
Network nodes can have associated attributes, just like items in a table. In addition,
the links themselves could also be considered to have attributes associated
with them; these may be partly or wholly disjoint from the node attributes.
Networks with hierarchical structure are more specifically called trees. In contrast
to a general network, trees do not have cycles: each child node has only one
parent node pointing to it. One example of a tree is the organization chart of a
company, showing who reports to whom; another example is a tree showing the
evolutionary relationships between species in the biological tree of life, where
the child nodes of humans and monkeys both share the same parent node of primates.
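The tree property can be checked with a short Python sketch. It assumes the network is given as a mapping from each child to its single parent, a hypothetical representation that already enforces the one-parent rule, and then verifies that there is exactly one root and that no walk toward the root revisits a node.

```python
def is_tree(parent):
    """Check a child -> parent mapping for a single root and no cycles."""
    nodes = set(parent) | set(parent.values())
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        return False
    for node in parent:
        seen = set()
        while node in parent:      # walk up toward the root
            if node in seen:
                return False       # revisited a node, so there is a cycle
            seen.add(node)
            node = parent[node]
    return True


# The example from the text: humans and monkeys share the parent primates.
parent = {"humans": "primates", "monkeys": "primates"}
print(is_tree(parent))
```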
The field dataset type also contains attribute values associated with cells.
Each cell in a field contains measurements or calculations from a continuous domain:
there are conceptually infinitely many values that you might measure, so
you could always take a new measurement between any two existing ones.
Continuous phenomena that might be measured in the physical world or simulated
in software include temperature, pressure, speed, force, and density; mathematical
functions can also be continuous.
Continuous data requires careful treatment that takes into account the mathematical
questions of sampling, how frequently to take the measurements, and interpolation,
how to show values in between the sampled points in a way that does
not mislead. Interpolating appropriately between the measurements allows you to
reconstruct a new view of the data from an arbitrary viewpoint that’s faithful to
what you measured. These general mathematical problems are studied in areas
such as signal processing and statistics. Visualizing fields requires grappling extensively
with these concerns.
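As a small illustration of interpolation, the Python sketch below estimates a value between sampled points by linear interpolation. Linear interpolation is only one simple choice, picked here for brevity rather than taken from the text, and the sample positions and temperatures are invented.

```python
def lerp_samples(xs, ys, x):
    """Linearly interpolate a value at x from samples.

    Assumes xs is sorted in ascending order with no repeated positions.
    """
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range")
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])


# Hypothetical temperature samples taken at three positions.
positions = [0.0, 1.0, 2.0]
temps = [10.0, 14.0, 12.0]
print(lerp_samples(positions, temps, 0.5))
```

The sketch refuses to extrapolate beyond the sampled range, since a value invented outside the measurements is one way a display can mislead.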
In contrast, the table and network dataset types discussed above are examples of
discrete data, where a finite number of individual items exist and interpolation
between them is not a meaningful concept. In the cases where a mathematical
framework is necessary, areas such as graph theory and combinatorics provide
relevant ideas.
Continuous data is often found in the form of a spatial field, where the cell struc‐
ture of the field is based on sampling at spatial positions. Most datasets that con‐
tain inherently spatial data occur in the context of tasks that require understand‐
ing aspects of its spatial structure, especially shape.
For example, with a spatial field dataset that is generated with a medical imaging
instrument, the user’s task could be to locate suspected tumors that can be recog‐
nized through distinctive shapes or densities. An obvious choice for visual en‐
where the task is to compare the flow patterns in different regions. One possible
visual encoding would use the geometry of the wing as the spatial substrate,
showing the temperature and pressure using size-coded arrows.
The likely tasks faced by users who have spatial field data constrain many of the
choices about the use of space when designing visual encoding idioms. Many of
the choices for nonspatial data, where no information about spatial position is
Thus, the question of whether a dataset has the type of a spatial field or a nonspa‐
tial table has extensive and far-reaching implications for idiom design.
Historically, vis diverged into areas of specialization based on this very differen‐
tiation. The subfield of scientific visualization, or scivis for short, is concerned
with situations where spatial position is given with the dataset. A central concern
in scivis is handling continuous data appropriately within the mathematical
framework of signal processing. The subfield of information visualization, or
infovis for short, is concerned with situations where the use of space in a visual
encoding is chosen by the designer. A central concern in infovis is determining
whether the chosen idiom is suitable for the combination of data and task, lead‐
ing to the use of methods from human–computer interaction and design.
The geometry dataset type specifies information about the shape of items with
explicit spatial positions. The items could be points, or one-dimensional lines or
curves, or 2D surfaces or regions, or 3D volumes.
Geometry datasets are intrinsically spatial, and like spatial fields they typically
occur in the context of tasks that require shape understanding. Spatial data often
includes hierarchical structure at multiple scales. Sometimes this structure is pro‐
vided intrinsically with the dataset, or a hierarchy may be derived from the origi‐
nal data.
the task at hand from raw geographic data, such as the boundaries of a forest or a
city or a country, or the curve of a road. The problem of how to create images
from a geometric description of a scene falls into another domain: computer
graphics. While vis draws on algorithms from computer graphics, it has different
concerns from that domain. Simply showing a geometric dataset is not an inter‐
esting problem from the point of view of a vis designer.
Beyond tables, there are many ways to group multiple items together, including
sets, lists, and clusters. A set is simply an unordered group of items. A group of
There are also more complex structures built on top of the basic network type. A
path through a network is an ordered set of segments formed by links connecting
nodes. A compound network is a network with an associated tree: all of the
nodes in the network are the leaves of the tree, and interior nodes in the tree pro‐
vide a hierarchical structure for the nodes that is different from network links be‐
tween them.
Many other kinds of data either fit into one of the previous categories or do so af‐
ter transformations to create derived attributes. Complex and hybrid combina‐
tions, where the complete dataset contains multiple basic types, are common in
real-world applications.
The set of basic types presented above is a starting point for describing the what
part of an analysis instance that pertains to data; that is, the data abstraction. In
simple cases, it may be possible to describe your data abstraction using only that
set of terms. In complex cases, you may need additional description as well. If
so, your goal should be to translate domain-specific terms into words that are as
generic as possible.
Figure 2.6 shows the two kinds of dataset availability: static or dynamic.
Figure 2.6.
The default approach to vis assumes that the entire dataset is available all at once,
as a static file. However, some datasets are instead dynamic streams, where the
dataset information trickles in over the course of the vis session. * One kind of
dynamic change is to add new items or delete previous items. Another is to
change the values of existing items.
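The three kinds of dynamic change (adding items, deleting items, and changing values) can be sketched as events applied to an in-memory dataset. This is an illustrative sketch only: the event format `(kind, item_id, attributes)` and the function name are assumptions, not a format defined in the text.

```python
def apply_event(items, event):
    """Apply one stream event to a dict of items keyed by item id."""
    kind, item_id, attributes = event
    if kind == "add":
        items[item_id] = dict(attributes)
    elif kind == "delete":
        items.pop(item_id, None)
    elif kind == "change":
        items[item_id].update(attributes)
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return items


# A hypothetical stream trickling in during a session.
stream = [
    ("add", "a", {"x": 1}),
    ("add", "b", {"x": 2}),
    ("change", "a", {"x": 5}),
    ("delete", "b", {}),
]
dataset = {}
for event in stream:
    apply_event(dataset, event)
```

After the whole stream has been applied, only item `a` remains, with its updated value. A vis tool would need to redraw after each event rather than once at load time.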
This distinction in availability crosscuts the basic dataset types: any of them can
be static or dynamic. Designing for streaming data adds complexity to many as‐
pects of the vis process that are straightforward when there is complete dataset
availability up front.
Figure 2.7 shows the attribute types. The major distinction is between categorical
versus ordered. Within the ordered type is a further differentiation between ordi‐
nal versus quantitative. Ordered data might range sequentially from a minimum
to a maximum value, or it might diverge in both directions from a zero point in
the middle of a range, or the values may wrap around in a cycle. Also, attributes
may have hierarchical structure.
Figure 2.7.
The first distinction is between categorical and ordered data. The type of cate‐
gorical data, such as favorite fruit or names, does not have an implicit ordering,
All ordered data does have an implicit ordering, as opposed to unordered cate‐
gorical data. This type can be further subdivided. With ordinal data, such as shirt
size, we cannot do full-fledged arithmetic, but there is a well-defined ordering.
For example, large minus medium is not a meaningful concept, but we know that
medium falls between small and large. Rankings are another kind of ordinal data;
some examples of ordered data are top-ten lists of movies or initial lineups for
sports tournaments depending on past performance.
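The shirt-size example can be mirrored in code by a type that supports comparison but not arithmetic. The sketch below is one possible modeling choice, assuming a plain `Enum` with a hand-written ordering; the class name is illustrative.

```python
from enum import Enum


class ShirtSize(Enum):
    """An ordinal attribute: ordered, but with no arithmetic defined."""

    SMALL = 1
    MEDIUM = 2
    LARGE = 3

    def __lt__(self, other):
        if not isinstance(other, ShirtSize):
            return NotImplemented
        return self.value < other.value


# Sorting works because an ordering is defined.
ordered_sizes = sorted([ShirtSize.LARGE, ShirtSize.SMALL, ShirtSize.MEDIUM])
```

Comparisons such as `ShirtSize.SMALL < ShirtSize.MEDIUM` succeed, whereas `ShirtSize.LARGE - ShirtSize.MEDIUM` raises a `TypeError`, matching the observation that large minus medium is not a meaningful concept.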
In this book, the ordered type is used often; the ordinal type is only occasionally
mentioned, when the distinction between it and the quantitative type matters.
Ordered data can be either sequential, where there is a homogeneous range from
a minimum to a maximum value, or diverging, which can be deconstructed into
two sequences pointing in opposite directions that meet at a common zero point.
For instance, a mountain height dataset is sequential, when measured from a min‐
imum point of sea level to a maximum point of Mount Everest. A bathymetric
dataset is also sequential, with sea level on one end and the lowest point on the
ocean floor at the other. A full elevation dataset would be diverging, where the
values go up for mountains on land and down for undersea valleys, with the zero
value of sea level being the common point joining the two sequential datasets.
Ordered data may be cyclic, where the values wrap around back to a starting
point rather than continuing to increase indefinitely. Many kinds of time mea‐
surements are cyclic, including the hour of the day, the day of the week, and the
month of the year.
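One practical consequence of cyclic ordering is that the distance between two values must take the wraparound into account. A minimal sketch follows, with a function name chosen here for illustration.

```python
def cyclic_distance(a, b, period):
    """Shortest distance between two values on a cycle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


# Hour of the day: 23:00 and 01:00 are two hours apart, not twenty-two.
hours_apart = cyclic_distance(23, 1, 24)

# Day of the week, numbered 0 to 6: day 0 and day 6 are adjacent.
days_apart = cyclic_distance(0, 6, 7)
```

Treating the same attributes as sequential would report distances of 22 hours and 6 days, which is why the cyclic nature of an attribute needs to be identified before it is encoded.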
▶ Section 13.4 covers hierarchical aggregation in more detail, and Section 7.5
covers the visual encoding of attribute hierarchies.
these two questions are crosscutting: one does not dictate the other. Different
approaches to considering the semantics of attributes have been proposed
across the many fields where these semantics are studied. The classification in
this book is heavily focused on the semantics of keys versus values, and the re‐
lated questions of spatial and continuous data versus nonspatial and discrete data,
to match up with the idiom design choice analysis framework. One additional
consideration is whether an attribute is temporal.
A key attribute acts as an index that is used to look up value attributes. * The
distinction between key and value attributes is important for the dataset types of
tables and fields, as shown in Figure 2.8.
Figure 2.8.
A simple flat table has only one key, where each item corresponds to a row in
the table, and any number of value attributes. In this case, the key might be com‐
pletely implicit, where it’s simply the index of the row. It might be explicit,
where it is contained within the table as an attribute. In this case, there must not
be any duplicate values within that attribute. In tables, keys may be categorical or
ordinal attributes, but quantitative attributes are typically unsuitable as keys
because there is nothing to prevent them from having the same values for multiple
items.
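The requirement that an explicit key contain no duplicate values is easy to test. The sketch below assumes rows are stored as dictionaries; the function name and the sample rows are invented for illustration and are not the contents of Table 2.1.

```python
def is_candidate_key(rows, attribute):
    """Return True if no two rows share a value for the given attribute."""
    values = [row[attribute] for row in rows]
    return len(values) == len(set(values))


# Hypothetical rows, not taken from the book's tables.
people = [
    {"ID": 1, "Name": "Amy", "Age": 8},
    {"ID": 2, "Name": "Basil", "Age": 7},
    {"ID": 3, "Name": "Amy", "Age": 8},
]
```

Here `ID` passes the check, while `Name` and `Age` both fail because of duplicates. Passing the check on one sample is necessary but not sufficient: whether an attribute is truly a key is a question of semantics, not only of the values currently present.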
For example, in Table 2.1, Name is a categorical attribute that might appear to be
a reasonable key at first, but the last line shows that two people have the same
name, so it is not a good choice. Favorite Fruit is clearly not a key, despite being
categorical, because Pear appears in two different rows. The quantitative at‐
tribute of Age has many duplicate values, as does the ordinal attribute of Shirt
Figure 2.9 shows the order table from Figure 2.5 where each attribute is colored
according to its type. There is no explicit key: even the Order ID attribute has du‐
plicates, because orders consist of multiple items with different container sizes,
so it does not act as a unique identifier. This table is an example of using an im‐
plicit key that is the row number within the table.
Figure 2.9.
The order table with the attribute columns colored by their type; none of them is
a key.
The more complex case is a multidimensional table, where multiple keys are re‐
quired to look up an item. The combination of all keys must be unique for each
item, even though an individual key attribute may contain duplicates. For exam‐
ple, a common multidimensional table from the biology domain has a gene as
one key and time as another key, so that the value in each cell is the activity level
of a gene at a particular time.
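A multidimensional table can be sketched as a lookup structure indexed by the combination of its keys. The example below assumes a dictionary keyed by tuples; the function name, attribute names, and activity values are hypothetical.

```python
def build_multidimensional_table(rows, key_attributes, value_attribute):
    """Index rows by the tuple of their key attributes.

    Raises ValueError if a key combination is not unique.
    """
    table = {}
    for row in rows:
        key = tuple(row[name] for name in key_attributes)
        if key in table:
            raise ValueError(f"duplicate key combination: {key}")
        table[key] = row[value_attribute]
    return table


# Invented gene activity measurements: each key alone repeats,
# but every (gene, time) pair is unique.
measurements = [
    {"gene": "g1", "time": 0, "activity": 0.2},
    {"gene": "g1", "time": 1, "activity": 0.9},
    {"gene": "g2", "time": 0, "activity": 0.4},
]
activity = build_multidimensional_table(measurements, ["gene", "time"], "activity")
```

Looking up `activity[("g1", 1)]` returns the activity level of gene `g1` at time 1, which corresponds to reading one cell of the multidimensional table.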
The information about which attributes are keys and which are values may not be
available; in many instances determining which attributes are independent keys
versus dependent values is the goal of the vis process, rather than its starting
point. In this case, the successful outcome of analysis using vis might be to recast
a flat table into a more semantically meaningful multidimensional table.
Although fields differ from tables in a fundamental way because they represent
continuous rather than discrete data, keys and values are still central concerns.
(Different vocabulary for the same basic idea is more common with spatial field
data, where the term independent variable is used instead of key, and dependent
variable instead of value.)
Fields are structured by sampling in a systematic way so that each grid cell is
spanned by a unique range from a continuous domain. In spatial fields, spatial
position acts as a quantitative key, in contrast to a nonspatial attribute in the case
of a table that is categorical or ordinal. The crucial difference between fields and
tables is that useful answers for attribute values are returned for locations
three spatial dimensions, 5 and fields with three or four keys, in the case where
these measurements are time-varying. A field can be both multidimensional and
multivariate if it has multiple keys and multiple values. The standard classifica‐
tion according to multivariate structure is that a scalar field has one attribute per
cell, a vector field has two or more attributes per cell, and a tensor field has
* These definitions of scalar, vector, and tensor follow the common usage in
vis. In a strict mathematical sense, these distinctions are not technically correct,
since scalars and vectors are included as a degenerate case of tensors. Mapping
the mathematical usage to the vis usage, scalars mean mathematical tensors of
order 0, vectors mean mathematical tensors of order 1, and tensors mean mathe‐
matical tensors of order 2 or more.
A scalar field is univariate, with a single value attribute at each point in space.
One example of a 3D scalar field is the time-varying medical scan above; another
is the temperature in a room at each point in 3D space. The geometric intuition is
that each point in a scalar field has a single value. A point in space can have sev‐
eral different numbers associated with it; if there is no underlying connection be‐
tween them then they are simply multiple separate scalar fields.
A tensor field has an array of attributes at each point, representing a more com‐
plex multivariate mathematical structure than the list of numbers in a vector. A
physical example is stress, which in the case of a 3D field can be defined by nine
numbers that represent forces acting in three orthogonal directions. The
geometric intuition is that the full information at each point in a tensor field cannot be
represented by just an arrow and would require a more complex shape such as an
ellipsoid.
A temporal attribute is simply any kind of information that relates to time. Data
about time is complicated to handle because of the rich hierarchical structure that
we use to reason about time, and the potential for periodic structure. The time hi‐
erarchy is deeply multiscale: the scale of interest could range anywhere from
nanoseconds to hours to decades to millennia. Even the common words time and
date are a way to partially specify the scale of temporal interest. Temporal analy‐
sis tasks often involve finding or verifying periodicity either at a predetermined
scale or at some scale not known in advance. Moreover, the temporal scales of
interest do not all fit into a strict hierarchy; for instance, weeks do not fit cleanly
into months. Thus, the generic vis problems of transformation and aggregation
are often particularly complex when dealing with temporal data. One important
idea is that even though the dataset semantics involves change over time, there
are many approaches to visually encoding that data—and only one of them is to
show it changing over time in the form of an animation.
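The observation that weeks do not fit cleanly into months can be checked with the standard `datetime` module. The sketch below assumes weeks that start on Monday; the function name is illustrative.

```python
from datetime import date, timedelta


def months_touched_by_week(any_day):
    """Return the sorted calendar months covered by the Monday-based week of any_day."""
    monday = any_day - timedelta(days=any_day.weekday())
    return sorted({(monday + timedelta(days=i)).month for i in range(7)})


# The week containing 1 July 2020 starts on Monday 29 June, so it spans two months.
straddling = months_touched_by_week(date(2020, 7, 1))

# The week containing 15 July 2020 lies entirely within July.
contained = months_touched_by_week(date(2020, 7, 15))
```

Because some weeks belong to two months at once, aggregating daily data first by week and then by month does not give the same grouping as aggregating by month directly.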
Temporal attributes can have either value or key semantics. Examples of tempo‐
ral attributes with dependent value semantics are a duration of elapsed time or the
date on which a transaction occurred. In both spatial fields and abstract tables,
time can be an independent key. For example, a time-varying medical scan can
have the independent keys of x, y, z, t to cover spatial position and time, with the
dependent value attribute of density for each combination of four indices to look
up position and time. A temporal key attribute is usually considered to have a
quantitative type, although it’s possible to consider it as ordinal data if the dura‐
tion between events is not interesting.
A dataset has time-varying semantics when time is one of the key attributes, as
opposed to when the temporal attribute is a value rather than a key. As with other
decisions about semantics, the question of whether time has key or value seman‐
tics requires external knowledge about the nature of the dataset and cannot be
made purely from type information. An example of a dataset with time-varying
semantics is one created with a sensor network that tracks the location of each
animal within a herd by taking new measurements every second. Each animal
will have new location data at every time point, so the temporal attribute is an in‐
dependent key and is likely to be a central aspect of understanding the dataset. In
contrast, a horse-racing dataset covering a year’s worth of races could have tem‐
poral value attributes such as the race start time and the duration of each horse’s
run. These attributes do indeed deal with temporal information, but the dataset is
not time-varying.
a dataset has stream type, in contrast to an unchanging file that can be loaded all
at once. In this latter sense, items and attributes can be added or deleted and their
values may change during a running session of a vis tool. I carefully distinguish
between these two meanings here.
▶ The dataset types of dynamic streams versus static files are discussed in
Section 2.4.6.
The framework presented here was inspired in part by the many taxonomies of
data that have been previously proposed, including the synthesis chapter at the
beginning of an early collection of infovis readings [Card et al. 99], a taxonomy
that emphasizes the division between continuous and discrete data [Tory and
Möller 04a], and one that emphasizes both data and tasks [Shneiderman 96].
Field Datasets:
Several books discuss the spatial field dataset type in far more detail, including
two textbooks [Telea 07, Ward et al. 10], a voluminous handbook [Hansen and
Johnson 05], and the vtk book [Schroeder et al. 06].
Attribute Types:
The attribute types of categorical, ordered, and quantitative were proposed in the
seminal work on scales of measurement from the psychophysics literature
[Stevens 46]. Scales of measurement are also discussed extensively in the book
The Grammar of Graphics [Wilkinson 05] and are used as the foundational axes
of an influential vis design space taxonomy [Card and Mackinlay 97].
The Polaris vis system, which has been commercialized as Tableau, is built
around the distinction between key attributes (independent dimensions) and value
attributes (dependent measures) [Stolte et al. 02].
Temporal Semantics:
1 My use of the term field is related to but not identical to its use in the mathe‐
matics literature, where it denotes a mapping from a domain to a range. In this
case, the domain is a Euclidean space of one, two, or three dimensions, and the
adjective modifying field is a statement about the range: scalars, vectors, or ten‐
sors. Although the term field by itself is not commonly found in the literature,
when I use it without an adjective I’m emphasizing the continuous nature of the
domain, rather than specifics of the ranges of scalars, vectors, or tensors.
2 Technically, all data stored within a computer is discrete rather than continu‐
ous; however, the interesting question is whether the underlying semantics of the
sus ratio data [Stevens 46]; this distinction is typically not useful when designing
a visual encoding, so in this book these types remain collapsed together into this
single category.
4 It’s common to store the key attribute in the first column, for understandabil‐
ity by people and ease of building data structures by computers.
5 It’s also possible for a spatial field to have just one key.
Chapter 13 Reduce Items and Attributes - Visualization Analysis and Design https://learning.oreilly.com/library/view/visualization-analysis-and/9781...
Chapter 13
Figure 13.1 shows the set of design choices for reducing—or increasing—what is
shown at once within a view. Filtering simply eliminates elements, whereas ag‐
gregation combines many together. Either choice can be applied to both items or
attributes.
Figure 13.1.
Typically, static data reduction idioms only reduce what is shown, as the name
suggests. However, in the dynamic case, the outcome of changing a parameter or
a choice may be an increase in the number of visible elements. Thus, many of the
idioms covered in this chapter are bidirectional: they may serve to either reduce
or increase the number of visible elements. Nevertheless, they are all named after
the reduction action for brevity.
▶ Deriving new data is covered in Chapter 3, changing a view over time is cov‐
ered in Chapter 11, faceting data into multiple views is covered in Chapter 12,
and embedding focus and contextual information together within one view is
covered in Chapter 14.
Reducing the amount of data shown in a view is an obvious way to reduce its vis‐
ual complexity. Of course, the devil is in the details, where the challenge is to
minimize the chances that information important to the task is hidden from the
user. Reduction can be applied to both items and attributes; the word element
will be used to refer to either items or attributes when design choices that apply
to both are discussed. Filtering simply eliminates elements, whereas aggregation
creates a single new element that stands in for multiple others that it replaces. It’s
useful to consider the tradeoffs between these two alternatives explicitly when
making design choices: filtering is very straightforward for users to understand,
and typically also to compute. However, people tend to have an “out of sight, out
of mind” mentality about missing information: they tend to forget to take into ac‐
count elements that have been filtered out, even when their absence is the result
of quite recent actions. Aggregation can be somewhat safer from a cognitive
point of view because the stand-in element is designed to convey information
about the entire set of elements that it replaces. However, by definition, it cannot
convey all omitted information; the challenge with aggregation is how and what
to summarize in a way that matches well with the dataset and task.
The idea of filtering is very obvious; the challenge comes in designing a vis sys‐
tem where filtering can be used to effectively explore a dataset. Consider the sim‐
ple case of filtering the set of items according to their values for a single quanti‐
tative attribute. The goal is to select a range within it in terms of minimum and
maximum numeric values and eliminate the items whose values for that attribute
fall outside of the range. From the programmer’s point of view, a very simple
way to support this functionality would be to simply have the user enter two
numbers, a minimum and maximum value. From the user’s point of view, this ap‐
In item filtering, the goal is to eliminate items based on their values with respect
to specific attributes. Fewer items are shown, but the number of attributes shown
does not change.
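Item filtering on a single quantitative attribute can be sketched in a few lines. The function name and the sample items are invented for illustration; a real system would connect the minimum and maximum to interactive controls.

```python
def filter_items(items, attribute, minimum, maximum):
    """Keep only the items whose value for attribute lies within [minimum, maximum]."""
    return [item for item in items if minimum <= item[attribute] <= maximum]


# Hypothetical items with two attributes each.
films = [
    {"title": "A", "length": 85},
    {"title": "B", "length": 120},
    {"title": "C", "length": 200},
]
mid_length = filter_items(films, "length", 90, 150)
```

Only film `B` survives the filter, and it still carries both of its attributes: the number of items shrinks while the number of attributes per item is unchanged.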
Example: FilmFinder
Figure 13.2 shows the FilmFinder system [Ahlberg and Shneiderman 94] for ex‐
ploring a movie database. The dataset is a table with nine value attributes: genre,
year made, title, actors, actresses, directors, rating, popularity, and length. The
visual encoding features an interactive scatterplot where the items are movies
color coded by genre, with scatterplot axes of year made versus movie popular‐
ity; Figure 13.2(a) shows the full dataset. The interaction design features filter‐
ing, with immediate update of the visual display to filter out or add back items as
sliders are moved and buttons are pressed. The visual encoding adapts to the
number of items to display; the marks representing movies are automatically en‐
larged and labeled when enough of the dataset has been filtered away that there is
enough room to do so, as in Figure 13.2(b). The system uses multiform
overview–detail views, where clicking on any mark brings up a popup detail
view with more information about that movie, as in Figure 13.2(c).
Figure 13.2.
FilmFinder features tightly coupled interactive filtering, where the result of mov‐
ing sliders and pressing buttons is immediately reflected in the visual encoding.
(a) Exploration begins with an overview of all movies in the dataset. (b) Moving
the actor slider to select Sean Connery filters out most of the other movies, leav‐
ing enough room to draw labels. (c) Clicking on the mark representing a movie
brings up a detail view. From [Ahlberg and Shneiderman 94, Color Plates 1, 2,
and 3].
Figure 13.2 shows the use of two augmented slider types, a dual slider for movie
length that allows the user to select both a minimum and maximum value, and
several alpha sliders that are tuned for selection with text strings rather than
numbers.
System FilmFinder
Figure 13.3.
The scented widget idiom adds visual encoding information directly to standard
graphical widgets to make filtering possible with high information density dis‐
plays. From [Willett et al. 07, Figure 2].
The Improvise system shown in Figure 12.7 is another example of the use of fil‐
tering. The checkbox list view in the lower middle part of the screen is a simple
filter controlling whether various geographic features are shown. The multiform
Attributes can also be filtered. With attribute filtering, the goal is to eliminate
attributes rather than items; that is, to show the same number of items, but fewer
attributes for each item.
Item filtering and attribute filtering can be combined, with the result of showing
both fewer items and fewer attributes.
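Attribute filtering is the complementary operation: every item is kept, but only some of its attributes. A minimal sketch follows, with an invented function name and sample data.

```python
def filter_attributes(items, keep):
    """Return the same items, restricted to the attributes named in keep."""
    return [{name: item[name] for name in keep} for item in items]


# Hypothetical items with three attributes each.
records = [
    {"id": 1, "height": 1.7, "weight": 65.0},
    {"id": 2, "height": 1.8, "weight": 80.0},
]
slim_records = filter_attributes(records, ["id", "height"])
```

The result has the same two items, each reduced to the `id` and `height` attributes. Applying an item filter to `slim_records` afterward would combine the two reductions.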
Example: DOSFA
* Many idioms for attribute filtering and aggregation use the alternative term
dimension rather than attribute in their names.
Figure 13.4.
The DOSFA idiom shown on star glyphs with a medical records dataset of 215
dimensions and 298 points. (a) The full dataset is so dense that patterns cannot be
seen. (b) After ordering on similarity and filtering on both similarity and impor‐
tance, the star glyphs show structure. From [Yang et al. 03a, Figures 3a and 3d].
System DOSFA
ues for that attribute. * One approach is to calculate the variance of an at‐
tribute: to what extent the values within that attribute are similar to or different
from each other. There are many ways to calculate a similarity measure between
attributes; some focus on global similarity, and others search for partial matches
[Ankerst et al. 98].
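As a rough illustration of the variance-based approach, attributes can be ranked by how much their values vary, so that near-constant attributes become candidates for filtering. This sketch is an assumption-laden simplification and is not the DOSFA algorithm itself; the function name and data are invented, and population variance is used as one possible measure.

```python
from statistics import pvariance


def rank_attributes_by_variance(items, attributes):
    """Order attribute names from highest to lowest population variance."""
    variances = {
        name: pvariance([item[name] for item in items]) for name in attributes
    }
    return sorted(attributes, key=lambda name: variances[name], reverse=True)


# Hypothetical data: attribute "a" is constant, attribute "b" varies widely.
samples = [
    {"a": 1, "b": 0},
    {"a": 1, "b": 5},
    {"a": 1, "b": 10},
]
ranking = rank_attributes_by_variance(samples, ["a", "b"])
```

In this sample `b` ranks ahead of `a`. Raw variance depends on the units of each attribute, so in practice the values would usually be normalized before such a comparison.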
The other major reduction design choice is aggregation, so that a group of ele‐
ments is represented by a new derived element that stands in for the entire group.
Elements are merged together with aggregation, as opposed to eliminated com‐
pletely with filtering. Aggregation and filtering can be used in conjunction with
each other. As with filtering, aggregation can be used for both items and at‐
tributes.
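Item aggregation can be sketched as grouping items and replacing each group with one derived element. The example below assumes grouping by a categorical attribute and summarizing with a count and a mean; the function name, the attribute names, and the data are illustrative.

```python
from collections import defaultdict
from statistics import mean


def aggregate_items(items, group_attribute, value_attribute):
    """Replace each group of items with one derived element (count and mean)."""
    groups = defaultdict(list)
    for item in items:
        groups[item[group_attribute]].append(item[value_attribute])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }


# Hypothetical items.
movies = [
    {"genre": "drama", "length": 100.0},
    {"genre": "drama", "length": 120.0},
    {"genre": "comedy", "length": 90.0},
]
summary = aggregate_items(movies, "genre", "length")
```

Three items become two derived elements. Unlike filtering, every original item contributes to the result, but the individual lengths can no longer be recovered from the summary.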
Aggregation typically involves the use of a derived attribute. A very simple ex‐
Anscombe’s Quartet example shown in Figure 1.3 exactly illustrates the diffi‐
culty of adequately summarizing data, and thus the limits of static visual encod‐
ing idioms that use aggregation. Aggregation is nevertheless a powerful design
choice, particularly when used within interactive idioms where the user can
change the level of aggregation on the fly to inspect the dataset at different levels
of detail.
The most straightforward use of item aggregation is within static visual encoding
idioms; its full power and flexibility can be harnessed by interactive idioms
where the view dynamically changes.
Example: Histograms
The idiom of histograms shows the distribution of items within an original at‐
tribute. Figure 13.5 shows a histogram of the distribution of weights for all of the
cats in a neighborhood, binned into 5-pound blocks. The range of the original at‐
tribute is partitioned into bins, and the number of items that fall into each bin is
computed and saved as a derived ordered attribute. The visual encoding of a his‐
togram is very similar to bar charts, with a line mark that uses spatial position in
one direction and the bins distributed along an axis in the other direction. One
difference is that histograms are sometimes shown without space between the
bars to visually imply continuity, whereas bar charts conversely have spaces
between the bars to imply discretization. Despite their visual similarity, histograms
are very different from bar charts. They do not show the original table directly;
rather, they are an example of an aggregation idiom that shows a derived table
that is more concise than the original dataset. The number of bins in the his‐
togram can be chosen independently of the number of items in the dataset. The
choice of bin size is crucial and tricky: a histogram can look quite different de‐
pending on the discretization chosen. One possible solution to the problem is to
compute the number of bins based on dataset characteristics; another is to pro‐
vide the user with controls to easily change the number of bins interactively, to
see how the histogram changes.
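The binning step behind a histogram, and its sensitivity to bin size, can be sketched as follows. The sketch assumes fixed-width bins that start at a given origin and values that are not below that origin; the function name and the cat weights are invented.

```python
def histogram(values, bin_width, start=0.0):
    """Count how many values fall into each fixed-width bin.

    Bin i covers [start + i * bin_width, start + (i + 1) * bin_width).
    Assumes every value is greater than or equal to start.
    """
    indices = [int((v - start) // bin_width) for v in values]
    counts = [0] * (max(indices) + 1)
    for i in indices:
        counts[i] += 1
    return counts


# Hypothetical cat weights in pounds.
weights = [3.0, 7.5, 8.0, 12.0, 14.9, 21.0]
five_pound_bins = histogram(weights, 5)
ten_pound_bins = histogram(weights, 10)
```

With 5-pound bins the counts are `[1, 2, 2, 0, 1]`, which exposes a gap between 15 and 20 pounds; with 10-pound bins the counts are `[3, 2, 1]` and the gap disappears. The derived table is the list of counts, whose length depends on the bin width rather than on the number of items.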
Figure 13.5.
The histogram idiom aggregates an arbitrary number of items into a concise rep‐
resentation of their distribution.
Idiom | Histograms
What: Data | Table: one quantitative value attribute.
Figure 13.6 shows a continuous scatterplot of a tornado air-flow dataset, with the
magnitude of the velocity on the horizontal and the z-direction velocity on the
vertical. The density is shown with a log-scale sequential colormap with mono‐
tonically increasing luminance. It starts with dark blues at the low end, continues
with reds in the middle, and has yellows and whites at the high end.
Figure 13.6.
The continuous scatterplot idiom uses color to show the density at each location,
Scatterplots began as an idiom for discrete, categorical data. They have been
generalized to a mathematical framework of density functions for continuous data,
giving rise to continuous scatterplots in the 2D case and continuous histograms in
the 1D case [Bachthaler and Weiskopf 08]. Continuous scatterplots use a dense,
space-filling 2D matrix alignment, where each pixel is given a different color.
Although the idiom of continuous scatterplots has a similar name to the idiom of
scatterplots, analysis via the framework of design choices shows that the ap‐
proach is in fact very different.
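The idea of giving each pixel a color based on log-scaled density can be illustrated with a rough sketch. This is plain 2D binning, which only approximates the idea; it does not reproduce the density-function framework of [Bachthaler and Weiskopf 08]. The function name, the square grid, and the assumption that all points lie inside the given ranges are choices made here for illustration.

```python
import math


def log_density_grid(points, bins, x_range, y_range):
    """Count points per grid cell, then log-scale the counts into [0, 1]."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Points on the upper edge are clamped into the last cell.
        col = min(int((x - x_lo) / (x_hi - x_lo) * bins), bins - 1)
        row = min(int((y - y_lo) / (y_hi - y_lo) * bins), bins - 1)
        grid[row][col] += 1
    peak = max(max(r) for r in grid)
    if peak == 0:
        return [[0.0] * bins for _ in range(bins)]
    # log1p keeps empty cells at 0 and compresses very dense cells.
    return [[math.log1p(c) / math.log1p(peak) for c in r] for r in grid]
```

Each value in the result could then be passed to a sequential colormap with monotonically increasing luminance.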
What: Data   Table: two quantitative value attributes.
How: Reduce  Item aggregation.
zontal lines. Outliers beyond the range of the chosen fence cutoff are shown
explicitly as discrete dots, just as in scatterplots or dot charts.
Figure 13.7.
A boxplot is similar in spirit to an individual bar in a bar chart in that only a sin‐
gle spatial axis is used to visually encode data, but boxplots show five numbers
through the use of a glyph rather than the single number encoded by the linear
mark in a bar chart. A boxplot chart features multiple boxplots within a single
shared frame to contrast different attribute distributions, just as bar charts show
multiple bars along the second axis. In Figure 13.7, the quantitative value at‐
tribute is mapped to the vertical axis and the categorical key attribute to the hori‐
zontal one.
The boxplot can be considered an item reduction idiom that provides an aggre‐
gate view of a distribution through the use of derived data. Boxplots are highly
scalable in terms of aggregating the target quantitative attribute from what could
be an arbitrarily large set of values down to five numbers; for example, a boxplot could
easily handle from thousands to millions of values within that attribute. The spa‐
tial encoding of these five numbers along the central axis requires only a moder‐
ate amount of screen space, since we have high visual acuity with spatial posi‐
tion. Each boxplot requires only a very small amount of screen space along the
secondary axis, leading to a high level of scalability in terms of the number of
categorical values that can be accommodated in a boxplot chart: roughly
hundreds.
Boxplots directly show the spread, namely, the degree of dispersion, with the ex‐
tent of the box. They show the skew of the distribution compared with a normal
distribution with the peak at the center by the asymmetry between the top and
bottom sections of the box. Standard boxplots are designed to handle unimodal
data, where there is only one value that occurs the most frequently. There are
many variants of boxplots that augment the basic visual encoding with more in‐
formation. Figure 13.7(b) shows a variable-width variant called the vase plot that
uses an additional spatial dimension within the glyph by altering the width of the
central box according to the density, allowing a visual check of whether the distribution is
instead multimodal, with multiple peaks. The variable-width variants require
more screen space along the secondary axis than the simpler version, in an exam‐
ple of the classic cost–benefit trade-off where conveying more information re‐
quires more room.
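The derived data behind a single boxplot can be sketched as follows. The fence cutoff of 1.5 times the interquartile range is a common convention assumed here rather than one stated in the text, and the function and key names are hypothetical.

```python
import statistics


def boxplot_summary(values, fence=1.5):
    """Reduce any number of values to five numbers plus explicit outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - fence * iqr, q3 + fence * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "low_whisker": inside[0],    # smallest value within the fences
        "q1": q1,
        "median": median,
        "q3": q3,
        "high_whisker": inside[-1],  # largest value within the fences
        # Values beyond the fences are drawn as discrete dots.
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }
```

The size of the result is the same whether the input holds ten values or ten million, which is the sense in which the idiom scales.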
What: Data   Table: many quantitative value attributes.
How: Reduce  Item aggregation.
Example: SolarPlot
Figure 13.8 shows the example of SolarPlot, a radial histogram with an interac‐
tively controllable aggregation level [Chuah 98]. The user directly manipulates
the size of the base circle that is the radial axis of the chart. This change of radius
indirectly changes the number of available histogram bins, and thus the aggrega‐
tion level. Like all histograms, the SolarPlot aggregation operator is count: the
height of the bar represents the number of items in the set. The dataset shown is
ticket sales over time, starting from the base of the circle and progressing coun‐
terclockwise to cover 30 years in total. The small circle in Figure 13.8(a) is heav‐
ily aggregated. It does show an increase in ticket sales over the years. The larger
Figure 13.8.
Idiom        SolarPlot
What: Data   Table: one quantitative attribute.
How: Reduce  Item aggregation.
The idiom of hierarchical parallel coordinates [Fua et al. 99] uses interactively
controlled aggregation as a design choice to increase the scalability of the basic
parallel coordinates visual encoding to hundreds of thousands of items. The
dataset is transformed by computing derived data: a hierarchical clustering of the
items. Several statistics about each cluster are computed, including the number of
points it contains; the mean, minimum, and maximum values; and the depth in
the hierarchy. A cluster is represented by a band of varying width and opacity,
where the mean is in the middle and width at each axis depends on the minimum
and maximum item values for that attribute within the cluster. Thus, in the limit,
a cluster of a single item is shown by a single line, just as with the original idiom.
The cluster bands are colored according to their proximity in the cluster hierar‐
chy, so that clusters far away from each other have very different colors.
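The per-cluster statistics that determine a band can be sketched as follows. The hierarchical clustering itself is assumed to have been computed already; the function name and the representation of items as tuples with one value per attribute are hypothetical.

```python
def cluster_band(items):
    """Summarize one cluster of items as per-axis band statistics.

    items: non-empty list of equal-length tuples, one value per attribute.
    """
    n = len(items)
    columns = list(zip(*items))  # one column of values per axis
    return {
        "count": n,
        "mean": [sum(col) / n for col in columns],  # center of the band
        "min": [min(col) for col in columns],       # lower band edge
        "max": [max(col) for col in columns],       # upper band edge
    }
```

For a cluster holding a single item, the minimum, mean, and maximum coincide on every axis, so the band collapses to a single line.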
The level of detail displayed at a global level for the entire dataset can be interac‐
tively controlled by the user using a single slider. The parameter controlled by
that slider is again a derived variable that varies the aggregate level of detail
shown in a smooth and continuous way. Figure 13.9 shows a dataset with eight
attributes and 230,000 items at different levels of detail. Figure 13.9(a) is the
highest-level overview showing the single top-level cluster, with very broad
bands of green. Figure 13.9(b) is the mid-level view showing several clusters,
where the extents of the tan cluster are clearly distinguishable from the now-
smaller green one. Figure 13.9(c) is a more detailed view with dozens of clusters
that have tighter bands; the proximity-based coloring mitigates the effect of oc‐
clusion.
Figure 13.9.
Hierarchical parallel coordinates provide multiple levels of detail. (a) The single
top cluster has large extent. (b) When several clusters are shown, each has a
smaller extent. (c) When many clusters are shown, the proximity-based coloring
helps them remain distinguishable from each other. From [Fua et al. 99, Figure
4].
What: Data  Table.
The challenge of spatial aggregation is to take the spatial nature of data into ac‐
count correctly when aggregating it. In the cartography literature, the modifiable
areal unit problem (MAUP) is a major concern: changing the boundaries of the
regions used to analyze data can yield dramatically different results. Even if the
number of units and their size do not change, any change of spatial grouping
can lead to a very significant change in analysis results. Figure 13.10 shows an
example, where the same location near the middle of the map has a different den‐
sity level depending on the region boundaries: high in Figure 13.10(a), medium
in Figure 13.10(b), and low in Figure 13.10(c). Moreover, changing the scale of
the units also leads to different results. The problem of gerrymandering, where
the boundaries of voting districts are manipulated for political gain, is the in‐
stance of the MAUP best known to the general public.
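A one-dimensional toy version of the MAUP can be sketched as follows. The point positions and region boundaries are made up for illustration, and the function name is hypothetical.

```python
from bisect import bisect_right


def density_at(location, points, boundaries):
    """Points per unit length in the region that contains `location`.

    boundaries: sorted region edges; region i spans
    [boundaries[i], boundaries[i + 1]).
    """
    i = bisect_right(boundaries, location) - 1
    lo, hi = boundaries[i], boundaries[i + 1]
    count = sum(1 for p in points if lo <= p < hi)
    return count / (hi - lo)


points = [1, 5, 5.5, 6, 6.5, 11]

# Same points, same location, three regions each time; only the boundaries move.
high = density_at(6, points, [0, 4, 8, 12])    # region [4, 8) holds 4 points
medium = density_at(6, points, [0, 6, 9, 12])  # region [6, 9) holds 2 points
low = density_at(6, points, [0, 3, 6, 12])     # region [6, 12) holds 3 points
```

The underlying data never changes, yet the density reported for location 6 drops from high to low purely because of how the regions are drawn.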
Figure 13.10.
Modifiable Areal Unit Problem (MAUP) example, showing how different bound‐
Figure 13.11.
Figure 13.11(a) shows a standard choropleth map colored by personal crime at‐
tribute x1, with the interactively selected region Creuse (23) highlighted. Figure
13.11(b) shows gw-boxplots for all six attributes, at two scales. The gw-boxplot,
a geographically weighted boxplot geowig, supports comparison between the
global distribution and the currently chosen spatial scale using the design choice
of superimposed layers. The global statistical distribution is encoded by the gray
boxplot in the background, and the local statistics for the interactively chosen
scale are encoded by a foreground boxplot in green. Figure 13.11(c) shows the
weighting maps for the currently chosen scale of each gw-boxplot set: very local
on top, and a larger scale on the bottom. Figure 13.11(d) shows a gw-mean map,
a geographically weighted mean geowig, weighted according to the same larger
scale.
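The idea of a geographically weighted mean can be sketched as follows. The Gaussian distance-decay kernel and the bandwidth parameter are assumptions made for illustration; the text does not specify how the geowig weighting is defined, and the function name is hypothetical.

```python
import math


def gw_mean(target, locations, values, bandwidth):
    """Mean of `values`, weighted by closeness of each location to `target`.

    A small bandwidth gives a very local statistic; a large bandwidth
    approaches the global mean.
    """
    weights = [
        math.exp(-0.5 * (math.dist(target, loc) / bandwidth) ** 2)
        for loc in locations
    ]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Evaluating the function at the same target with two bandwidths mirrors the comparison of a very local scale against a larger one.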
ing to a larger scale, that attribute’s distribution is close to the global one in both
the boxplot, matching the mid-range color in the gw-mean map geowig in Figure
13.11(d).
How: Encode  Boxplot.
How: Reduce  Spatial aggregation.
Just as attributes can be filtered, attributes can also be aggregated, where a new
attribute is synthesized to take the place of multiple original attributes. A very
simple approach to aggregating attributes is to group them by some kind of
similarity measure, and then synthesize the new attribute by calculating an average
across that similar set. A more complex approach to aggregation is dimensional‐
ity reduction (DR), where the goal is to preserve the meaningful structure of a
dataset while using fewer attributes to represent the items.
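The simple averaging approach can be sketched as follows. The grouping is assumed to be given, since the choice of similarity measure is left open in the text; the function name and the row-as-dictionary layout are hypothetical.

```python
def aggregate_attributes(rows, groups):
    """Replace each group of similar attributes with their average.

    rows:   list of dicts mapping attribute name to numeric value
    groups: dict mapping new attribute name to the original names it replaces
    """
    grouped = {name for names in groups.values() for name in names}
    result = []
    for row in rows:
        # Attributes outside every group pass through unchanged.
        new_row = {k: v for k, v in row.items() if k not in grouped}
        for new_name, names in groups.items():
            new_row[new_name] = sum(row[n] for n in names) / len(names)
        result.append(new_row)
    return result
```

The number of items is unchanged; only the number of attributes shrinks.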
dundancy in the original dataset because the underlying latent variables could not
be measured directly.
Nonlinear methods for dimensionality reduction are used when the new dimen‐
sions cannot be expressed in terms of a straightforward combination of the origi‐
nal ones. The multidimensional scaling (MDS) family of approaches includes
both linear and nonlinear variants, where the goal is to minimize the differences
in distances between points in the high-dimensional space versus the new lower-
dimensional space.
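The quantity being minimized can be sketched as a sum of squared differences between pairwise distances. This raw form and the name stress are assumptions made for illustration; MDS variants differ in exactly which function they minimize, and the optimization step itself is not shown.

```python
import math
from itertools import combinations


def stress(high, low):
    """Sum of squared differences between pairwise distances.

    high: points in the original high-dimensional space
    low:  the same points, in the same order, in the reduced space
    """
    return sum(
        (math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
        for i, j in combinations(range(len(high)), 2)
    )
```

A layout that preserves every pairwise distance scores zero; any distortion raises the score.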
Text documents are usually transformed by ignoring the explicit linear ordering
of the words within the document and treating it as a bag of words: the number
of times that each word is used in the document is simply counted. The result is a
large feature vector, where the elements in the vector are all of the words in the
entire document collection. Very common words are typically eliminated, but
these vectors can still contain tens of thousands of words. However, these vectors
are very sparse, with the overwhelming majority of values simply zero: any
individual document contains only a tiny fraction of the possible words.
The result of this transformation is a derived table with a huge number of quanti‐
tative attributes. The documents are the items in the table, and the attribute value
for a particular word contains the number of times that word appears in the docu‐
ment. Looking directly at these tables is not very interesting.
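The first stage, from documents to a derived table of word counts, can be sketched as follows. The tiny stop-word list and the simple tokenizing pattern are placeholders chosen for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of very common words to eliminate.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def bag_of_words(documents):
    """Return (vocabulary, table): one row per document, one column per word."""
    counts = [
        Counter(
            w for w in re.findall(r"[a-z']+", doc.lower())
            if w not in STOP_WORDS
        )
        for doc in documents
    ]
    # The attributes are all words used anywhere in the collection.
    vocabulary = sorted(set().union(*counts))
    # Word order within each document is ignored; absent words count as 0.
    table = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, table
```

Even in a toy collection most cells are zero, which hints at how sparse the table becomes with a realistic vocabulary.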
This enormous table is then transformed into a much more compact one by deriv‐
ing a much smaller set of new attributes that still represents much of the structure
in the original table using dimensionality reduction. In this usage, there are two
stages of constructing derived data: from a document collection to a table with a
huge number of attributes, and then a second step to get down to a table with the
same number of items but just a few attributes.
The bag-of-words DR approach is suitable when the goal is to analyze the differ‐
Images, videos, and other multimedia documents are usually transformed to cre‐
ate derived attributes in a similar spirit to the transformations done to text docu‐
ments. One major question is how to derive new attributes that compactly repre‐
sent an image as a set of features. The features in text documents are relatively
easy to identify because they’re based on the words; even in this case, natural
language processing techniques are often used to combine synonyms and words
with the same stem together. Image features typically require even more complex
computations, such as detecting edges within the image or the set of colors it
contains. Processing individual videos to create derived feature data can take into
account temporal characteristics such as interframe coherence.
Figure 13.12.
Figure 13.13.
What: Derived  Table with 10,000 attributes.
What: Derived  Table with two attributes.
How: Encode    Scatterplot, colored by conjectured clustering.
With standard dimensionality reduction techniques, the user chooses the number
of synthetic attributes to create. When the target number of new attributes is two,
the dimensionally reduced data is most often shown as a scatterplot. When more
than two synthetic attributes are created, a scatterplot matrix (SPLOM) may be a
good choice. Although in general scatterplots are often used to check for correla‐
Sometimes the dataset has no additional information, and the scatterplot is sim‐
ply encoding two-dimensional position. In many cases there is a conjectured cat‐
egorization of the points, which are colored according to those categories. The
task is then to check whether the patterns of colors match up well with the pat‐
terns of the spatial clusters of the reduced data, as shown in Figure 13.12.
Another caution is that this inspection should be used only to find or verify large-
scale cluster structure. The fine-grained structure in the lower-dimensional plots
should not be considered strongly reliable because some information is lost in the
reduction. That is, it is safe to assume that major differences in the distances be‐
tween points are meaningful, but minor differences in distances may not be a reli‐
able signal.
Filtering:
Early work in dynamic queries popularized filtering with tightly coupled views
and extending standard widgets to better support these queries [Ahlberg and
Shneiderman 94].
Scented Widgets:
Scented widgets [Willett et al. 07] allude to the idea of information scent pro‐
ential book on Exploratory Data Analysis [Tukey 77]. A recent survey paper dis‐
cusses the many variants of boxplots that have been proposed in the past 40 years
[Wickham and Stryjewski 12].
Hierarchical Aggregation:
Spatial Aggregation:
Attribute Reduction:
DOSFA [Yang et al. 03a] is one of many approaches to attribute reduction from
the same group [Peng et al. 04, Yang et al. 04, Yang et al. 03b]. The DimStiller
system proposes a general framework for attribute reduction [Ingram et al. 10].
An extensive exploration of similarity metrics for dimensional aggregation was
influential early work [Ankerst et al. 98].
Dimensionality Reduction:
Designing Data Visualizations
NATURAL ORDERING
Whether a visual property has a natural ordering is determined by whether the
mechanics of our visual system and the “software” in our brains automatically—
unintentionally—assign an order, or ranking, to different values of that property.
The “software” that makes these judgments is deeply embedded in our brains and
evaluates relative order independent of language, culture, convention, or other
learned factors; it’s not optional and you can’t design around it.[6]
For example, position has a natural ordering; shape doesn’t. Length has a natural
ordering; texture doesn’t (but pattern density does). Line thickness or weight has
a natural ordering; line style (solid, dotted, dashed) doesn’t. Depending on the
specifics of the visual property, its natural ordering may be well suited to
representing quantitative differences (27, 33, 41), or ordinal differences (small,
medium, large, enormous).
Natural orderings are not to be confused with properties for which we have
learned or social conventions about their ordering. Social conventions are
powerful, and you should be aware of them, but you cannot depend on them to
be interpreted in the same way as naturally-ordered properties—which are not
social and not learned, and the interpretation of which is not optional.
Here’s a tricky one: Color (hue) is not naturally ordered in our brains. Brightness
(lightness or luminance, sometimes called tint) and intensity (saturation) are, but
color itself is not. We have strong social conventions about color, and there is an
ordering by wavelength in the physical world, but color does not have a non-
negotiable natural ordering built into the brain. You can’t depend on everyone to
agree that yellow follows purple in the way that you can depend on them to agree
that four follows three.
The misuse of color to imply order is rampant; don’t fall into this common trap.
In contexts where you’re tempted to use “ordered color” (elevation, heat maps,
etc.), consider varying brightness along one, or perhaps two, axes. For example,
elevation can be represented by increasing the darkness of browns, rather than
cycling through the rainbow (see Figure 4-1[7] and Figure 4-2[8]).
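Varying brightness within a single hue can be sketched as follows. The specific hue, saturation, and lightness range are assumptions picked to give a plausible range of browns, not values from the text, and the function name is hypothetical.

```python
import colorsys


def elevation_color(elevation, lo, hi, hue=30 / 360, saturation=0.6):
    """Map an elevation to a brown that gets darker as elevation rises."""
    t = (elevation - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)        # clamp out-of-range elevations
    lightness = 0.85 - 0.65 * t      # light (0.85) down to dark (0.20)
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )
```

Because only lightness changes, the ordering of the colors follows a naturally ordered property rather than a learned convention about hue.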
Figure 4-2. In this example the colors diverge from one point, clearly
indicating low, medium, and high elevations.
NOTE
For help in choosing appropriate color palettes, a great tool
is ColorBrewer2.0, at http://colorbrewer2.org.
DISTINCT VALUES
The second main factor to consider when choosing a visual property is how many
distinct values it has that your reader will be able to perceive, differentiate, and
possibly remember. For example, there are a lot of colors in the world, but we
can’t tell them apart if they’re too similar. We can more easily differentiate a
large number of shapes, a huge number of positions, and an infinite number of
numbers. When choosing a visual property, select one that has a number of useful
differentiable values and an ordering similar to that of your data (see Figure 4-3).
Figure 4-3. Use this table of common visual properties to help you select an
appropriate encoding for your data type.
Figure 4-4 shows another way to think about visual properties, depending on
what kind of data you need to encode. As you can see, many visual properties
may be used to encode multiple data types. Position and placement, as well as
text, can be used to encode any type of data—which is why every visualization
you design needs to begin with careful consideration of how you’ll use them (see
Chapter 5).
Figure 4-4. Visual properties grouped by the types of data they can be used
to encode.
REDUNDANT ENCODING
If you have the luxury of leftover, unused visual properties after you’ve encoded
the main dimensions of your data, consider using them to redundantly encode
some existing, already-encoded data dimensions. The advantage of redundant
encoding is that using more channels to get the same information into your brain
can make acquisition of that information faster, easier, and more accurate.[9]
For example, if you’ve got lines differentiated by ending (arrows, dots, etc.),
consider also changing the line style (dotted, dashed, etc.) or color. If you’ve got
values encoded by placement, consider redundantly encoding the value with
brightness, or grouping regions with color, as in Figure 4-5[10].
To be totally accurate, in Figure 4-5, adding color more strongly defined the
groupings that weren’t strongly defined before, but those groups are a subset of
the information already provided by position. For that reason, in this case color
adds slightly more informational value beyond mere redundancy.
In writing, we often advise each other to stay away from clichés; don’t use a pat
phrase, but try to find new ways to say things instead. The reason is that we want
the reader to think about what we’re saying, and clichés tend to make readers turn
their brains off. In visualization, however, that kind of brainlessness can be a help
instead of a hindrance—since it makes comprehension more efficient—so
conventions can be our friends.
NOTE
Purposely turning visual convention on its head may cause
the reader’s brain to “throw an exception,” if you will, and this
technique can be used strategically; but please, use it
sparingly.
The choice comes down to a basic cost-benefit analysis. What is the expense to
you and your reader of creating and understanding a new encoding format, versus
the value delivered by that format? If you’ve got a truly superior solution (as
evaluated by your reader, and not just your ego), then by all means, use it. But if
your job can be done (or done well enough) with a default format, save everyone
the effort and use a standard solution.
READERS’ CONTEXT
In Chapter 2, we discussed how important it is to recognize that you are creating
a visualization for someone other than yourself, and that the reader may show
up with a mindset or way of viewing the world different from yours.
First, it’s important to point out that your audience will likely be composed of
more than one reader. And as these people are all individuals, they may be as
different from each other as they are from you, and will likely have very different
backgrounds and levels of interest in your work. It may be impossible to take the
preconceptions of all these readers into consideration at once. So choose the most
important group, think of them as your core group, and design with them in
mind. Where it is possible to appeal to more of your potential audience without
sacrificing precision or efficiency, do so. But, going forward, let us be clear that
when we say reader, what we really mean is a representative reader from within
your core audience.
Okay, now that we’ve cleared that up, let’s get specific about some facets of the
reader’s mindset that you need to take into account.
When selecting the actual terms you’ll use to label axes, tag visual elements, or
title the piece (which creates the mental framework within which to view it),
consider your reader’s vocabulary and familiarity with relevant jargon.
Is the reader from within your industry or outside of it? What about other
readers outside of the core audience group?
Is it worth using an industry term for the sake of precision (knowing that the
reader may have to look it up), or would a lay term work just as well?
Will the reader be able to decipher any unknown terms from context, or will
a vocabulary gap obscure the meaning of all or part of the information
presented?
These are the kinds of questions you should ask yourself. Each and every single
word in your visualization needs to serve a specific purpose. For each one, ask
yourself: why use this word in this place? Determine whether there is another
word that would serve the purpose any better (or whether you can get away
without one at all), and if so, make the change.
Related to this, consider any spelling preferences a reader might have. Especially
within the English language, there may be more than one way to spell a word
depending on which country one is in. Don’t make the reader’s brain do extra
work having to parse “superfluous” or “missing” letters.
Colors
Another reader context to take into account is color choice. There is quite a bit of
science about how our brains perceive and process color that is somewhat
universal, as we saw earlier in this chapter. But it’s worth mentioning in the
context of reader preconceptions the significant cultural associations that color
can carry.
Depending on the culture in question, some colors may be lucky, some unlucky;
some may carry positive or negative connotations; some may be associated with
life events like weddings, funerals, or newborn children.
Some colors don’t mean much on their own, but take on meaning when paired or
grouped with other colors: in the United States, red and royal blue refer to Republicans
and Democrats; pink and light blue often refer to girls and boys; red, yellow, and
green refer to traffic signals. The colors red, white, and green may signal Christmas in
Canada, but patriotism in Italy. The colors red, white, and blue are patriotic in
multiple places: they will make both an American and a Frenchman think of
home.
Colors may also take on special significance when paired with certain shapes. A
red octagon means stop in many places (see Figure 4-6[11]), but not all.
Figure 4-6. This stop sign from Montreal is labeled in French, but no
English speaker is likely to be confused about its meaning.
Color blindness
Of course, we know that there are many variations in the way different people
perceive color. This is commonly called color blindness but is more properly
referred to as color vision deficiency or dyschromatopsia. A disorder of color
vision may present in one of several specific ways.
Although prevalence estimates vary among experts and for different ethnic and
national groups, about 7 percent of American men experience some kind of color
perception disorder (women are much more rarely affected: about 0.4 percent in
America).[12] Red-green deficiency is the most common by far, but yellow-blue
deficiency also occurs. And there are lots of people who have trouble
distinguishing between close colors like blue and purple.
NOTE
A great resource for help in choosing color palettes friendly
to those with color blindness is the Color Laboratory at
http://colorlab.wickline.org/colorblind/colorlab/. There you can select
color swatches into a group (or enter custom RGB values)
and simulate how they are perceived with eight types of
dyschromatopsia. Note: the simulation assumes that you
yourself have typical color vision.
Directional orientation
It will also affect what the reader perceives as “earlier” and “later” in a timeline,
where the edge that is read from will be “earlier” and time will be assumed to
progress in the same direction as your reader typically reads text.
This may also pertain to geographic maps: many of us are used to seeing the
globe split somewhere along the Pacific, with north oriented upward. This suits
North Americans just fine, since—scanning from left to right and starting from
the top of the page—we encounter our own country almost immediately. The
convention came about thanks to European cartographers, who designed maps
over hundreds of years with their own continent as the center of the world.
Occasionally, other map makers have chosen to orient the world map differently,
often for the same purpose of displaying their homeland with prominence (such
as Stuart McArthur’s “South-Up Map,” which puts his native Australia toward
the center-top) or simply for the purpose of correcting the distortion effect that
causes Europe to look bigger than it really is (such as R. Buckminster Fuller’s
“Dymaxion Map”).
Things in the world are full of inherent properties. These are physical properties
that are not (usually) subject to interpretation or culture, but exist as properties
you can point to or measure. Some things are larger than others, have specific
colors, well-known locations, and other identifying characteristics. If your
encodings conflict with or don’t reflect these properties, if they are not
compatible, you’re once again asking your reader to spend extra time decoding
and wondering why things are “wrong;” why they don’t look like they’re
expected to (for example, see the boats and airplanes in Figure 4-7).
Figure 4-7. The visual placement of boats above airplanes is jarring, since
they don’t appear that way in the physical world.
Notice how the colors they’ve chosen map to the browser icons, as shown in
Figure 4-9.
Figure 4-9. The representative colors differ greatly from the colors in the
browser icons. Other choices would better reflect the icons’ colors.
The encodings they’ve chosen aren’t very compatible with the reality of the
browsers’ icons and branding. IE, with a blue and yellow icon, is shown in
shades of purple. Firefox, with a blue and orange icon, is shown in blue—which
is fine, but curious, given the other browser icons that also contain blue and
might be better contenders for the blue encoding. Safari, with a blue icon, is
encoded with yellow. Chrome—which has red, blue, green, and yellow, but no
orange in its icon—is orange. Opera, with its red icon and corresponding red
label, has the only encoding that makes sense. An improved set of encodings that
more closely match the reality of the browser icons is shown in the last column of
Figure 4-9.
To use colors as an example of some of these learned conventions, red and green
have strong connotations for bad and good, or stop and go. (See the Color section
in Chapter 6 for more on common color associations.) Beyond color, consider
cultural conventions about spatial representations, such as what left and right
mean politically, or the significance of above and below. Also consider cultural
conventions about the meaning of square versus round, and bright versus dark.
Practically speaking, this pattern and pattern-violation recognition has two major
implications for design. The first is that readers will notice patterns and assume
they are intentional, whether you planned for the patterns to exist or not. The
second is that when they perceive patterns, readers will also expect pattern
violations to be meaningful.
So how should you avoid the potential trap of implying meaning where none is
intended? It all comes down to three simple rules.
These sound simple, and yet violations of these rules are everywhere. You can
probably think of a few already, and will probably start to notice more examples
in your daily life. Maintaining consistency and intention when encoding will
greatly enhance the accessibility and efficiency of your visualization, and, as with
any good habit, will make your life easier in the long run.
Selecting Structure
Just as we don’t write PhD dissertations in sonnet form, or thank-you notes like
legal briefs complete with footnote citations, it’s important that the structure of
your visualization be appropriate to your data.
Figure 4-10. This rendition of the classic table makes good use of color and
line.
Perhaps because it is so elegant and iconic, the Periodic Table is also one of the
most frequently imitated visualizations out there. Designers and satirists are
constantly repurposing its familiar rows and columns to showcase collections of
everything from typefaces to video game controllers, and, ironically,
visualization methods. This phenomenon is a particular peeve to your authors
precisely because it violates the important principle of selecting an appropriate
structure. With the possible (yet questionable) exception of Andrew Plotkin’s
Periodic Table of Desserts,[15] copycat designers are using a periodic structure
to display data that is not periodic. They are just so many derivative attempts at
cleverness.
WARNING
If you’re using a particular structure just to be cute or clever,
you’re doing it wrong.
If you are tempted to use a periodic table format for your non-periodic data,
consider instead a two-axis scatter plot or table, where the axes are well matched
to the important aspects of your data. This will lead you to a more accurate, and
less derivative, final product.[16]
NOTE
For another chemistry-oriented example of a specific
structure with an entirely different purpose, check out the
Table of Nuclides:
http://en.wikipedia.org/wiki/Table_of_nuclides
Beyond that, we must refer you to other tomes (we suggest the books by Yau and
Kosslyn listed in Appendix A to begin with, and Bertin for more dedicated
readers) to help you select just the right structure for your particular
circumstance; as you can see from Figure 4-11, there are too many to address
each one directly within the scope of this short book. But here are some general
principles and common pitfalls to guide your selection process.
A good example of this is in comparing two graphs. Beware of what scales you
use on your axes so that the reader can fairly interpret the graph data. If one
graph has a scale of 0 to 10 and the other has a scale of 0 to 5 (Figure 4-11), the
slopes displayed on the graphs will be very different for the same data. Using
unequal scales for data you are attempting to compare makes comparison much
more difficult.
Figure 4-11. The same data appears flatter (top) or steeper (bottom)
depending on the scales chosen. If we were attempting to compare these
data sets, the unequal axes would introduce distortion that made
comparison more difficult.
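To make the scale effect concrete, here is a minimal sketch of the arithmetic behind it. The function name and the data values are our own illustrative assumptions, not taken from the figure: the sketch simply computes what fraction of the plot's height a line covers when the same two data points are drawn against a 0 to 10 axis and a 0 to 5 axis.

```python
def rise_fraction(y_start, y_end, axis_min, axis_max):
    """Fraction of the plot's height spanned by a line from y_start to y_end."""
    return (y_end - y_start) / (axis_max - axis_min)


# Hypothetical data: a value that climbs from 1 to 4 across the chart.
y_start, y_end = 1.0, 4.0

flat = rise_fraction(y_start, y_end, 0, 10)   # drawn on a 0-to-10 axis
steep = rise_fraction(y_start, y_end, 0, 5)   # drawn on a 0-to-5 axis

print(f"0 to 10 axis: line covers {flat:.0%} of the plot height")
print(f"0 to 5 axis:  line covers {steep:.0%} of the plot height")
```

With these made-up numbers the line looks twice as steep on the shorter axis, even though the data has not changed. If two charts are meant to be compared, giving them the same axis range removes this distortion.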
If I show you a section of the ring in the middle that represents a huge
percentage, it still looks objectively shorter than a section of the outer ring that
may represent a much smaller percentage. Also, having all of these lines wrapped
in a circle makes it difficult to compare their lengths anyway. The only way you
can really grasp the information represented in this graph is to read the
percentage numbers in the labels. In this case, we may as well just have a table of
numbers—it would be faster to read and easier to make comparisons with.
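The geometry behind this complaint can be sketched in a few lines. We assume, for illustration only, that each ring represents 100% as a full circle, so the drawn length of a segment is its share of the ring's circumference; the radii and percentages below are invented, not read from the figure.

```python
import math


def arc_length(radius, fraction):
    """Drawn length of a ring segment covering `fraction` of a full circle."""
    return 2 * math.pi * radius * fraction


# Hypothetical rings: a small inner ring and a large outer ring.
inner = arc_length(radius=1.0, fraction=0.60)  # 60% on the inner ring
outer = arc_length(radius=5.0, fraction=0.20)  # 20% on the outer ring

print(f"inner ring, 60%: {inner:.2f} units long")
print(f"outer ring, 20%: {outer:.2f} units long")
```

Under these assumptions the 20% segment is drawn noticeably longer than the 60% segment, because length grows with radius as well as with the value being encoded. A reader comparing lengths is therefore comparing the wrong thing.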
Similarly, the ringed pie graph format known as Nightingale’s Rose (for its
creator, Florence—see Figure 4-12), is almost completely useless. Comparing the
areas of the sliced pie wedges is nearly impossible to do accurately. Line graphs
or stacked bar graphs would have served much better.
Figure 4-13. A radial layout distorts the data and renders this disk usage
map totally ineffective for all but the coarsest comparisons.
Another common pitfall is the use of a geographic map for any and all data that
includes a location dimension. Sometimes the use of a map will actually distort
your message—such as when the surface area of each region fails to correspond
to your population data (see the section on physical reality in Chapter 5). If your
data is tied to population but your display is based on regional size, the
proportionally larger surface areas of some regions may inflate the appearance of
trends in those regions. Consider using a table or bar graph instead.
NOTE
If you wish to show regional trends, remember that you don’t
have to position states or countries alphabetically; it’s okay to
group them by region or along some other appropriate axis.
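One rough way to check whether a map would mislead is to compare each region's share of the map's surface with its share of the population. The sketch below does this for two made-up regions; the names, areas, and populations are purely hypothetical, and the "inflation" ratio is our own informal measure rather than an established statistic.

```python
def shares(values):
    """Convert a dict of raw values into each entry's share of the total."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical regions: A is large and sparsely populated, B is small and dense.
area = {"Region A": 900.0, "Region B": 100.0}
population = {"Region A": 100.0, "Region B": 900.0}

area_share = shares(area)
pop_share = shares(population)

for name in area:
    # Ratio above 1: the region takes up more of the map than of the population.
    inflation = area_share[name] / pop_share[name]
    print(f"{name}: {area_share[name]:.0%} of the map, "
          f"{pop_share[name]:.0%} of the people, ratio {inflation:.2f}")
```

When the ratios sit far from 1, as they do here, a bar graph or table of the population-based values is likely to carry the message more faithfully than the map.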
Figure 4-14. This rendition of the healthcare plan clearly revels in and aims
to exaggerate the system’s complexity.
It’s fairly obvious that political motivations dominated the design choices for this
visualization; it clearly falls into the category of persuasive visualization (rather
than informative). The chart itself doesn’t leave the reader with any actual
information other than, “Wow, this system is complicated.” When we consider
the title of the press release in which this was unveiled—“America’s New Health
Care System Revealed”—we know those responsible to be disingenuous.
Palmer explained his motivation in an open letter to Rep. John Boehner (R-OH)
on Flickr (http://www.flickr.com/photos/robertpalmer/3743826461/):
There is no doubt that national healthcare is a complex matter, and this is evident
in both designs. But Palmer’s rendition clearly aims to pare down that complexity
to its essential nature, for the purpose of making things easier to understand,
rather than purposefully clouding what is happening under the abstracted layer.
This is the hallmark of effective editing.
Sometimes a designer will make the visualization more complicated than it needs
to be, not because he is trying to make the data look bad, but for precisely the
opposite reason: he wants the data to look as good as possible. This is an equally
bad mistake.
Your data is important and meaningful all on its own; you don’t have to make it
special by trying to get fancy. Every dot, line and word should serve a
communicative purpose: if it is extraneous or outside the scope of the
visualization’s goals, it must go. Edit ruthlessly. Don’t decorate your data.
[6]
Or shouldn’t try to: that way madness lies.
[7]
European Soil Bureau. Copyright © 1995–2011,
European Union. Used with stated authorization to
reproduce, with acknowledgment.
http://eusoils.jrc.ec.europa.eu/
[8]
Center for International Earth Science
Information Network (CIESIN) (2007). Copyright ©
2007, The Trustees of Columbia University in the
City of New York. Columbia University. Population,
Landscape, and Climate Estimates (PLACE). Used
under the Creative Commons Attribution License.
http://sedac.ciesin.columbia.edu/place/
[9]
Ware, Information Visualization: Perception for
Design (Morgan Kaufmann), p. 179.
[10]
Tableau Software Public Gallery. Copyright ©
2003–2011 Tableau Software.
http://www.tableausoftware.com/learn/gallery/company-performance
[11]
Christian Caron (2011). Copyright © 2011,
Christian Caron.
[12]
Montgomery, Geoffrey, for Howard Hughes
Medical Institute. Seeing, Hearing, and Smelling the
World. Chevy Chase, MD: 1995.
[13]
Your authors take particular interest in examining
information design in the world, take every
opportunity to do so, and hope that everyone else
will start to do the same.
[14]
Michael Dayah (1997). Copyright © 1997 Michael
Dayah. http://www.ptable.com
[15]
http://eblong.com/zarf/periodic/index.html
[16]
Astute readers will note that the periodic table is
also a two-axis layout with carefully chosen axes that
reflect, and facilitate access to, the relevant
properties of the data.
[17]
We care so much about this issue that we
dedicate a section in Chapter 5 to good and bad
uses of circular layouts.
[18]
Robert Palmer (2010). Copyright © 2010, Robert
Palmer. http://rp-network.com/
[19]
http://www.flickr.com/photos/robertpalmer/3743826461/