Professional Documents
Culture Documents
course book
Version 1.3 (21 November 2014)
Free University of Bolzano Bozen – Paolo Coletti
Introduction
This book contains the infographics course’s lessons held at the Free University of Bolzano Bozen.
This book is in continuous development, please take a look at its version number, which marks important
changes.
Disclaimers
This book is designed for novice data analyzers. It contains oversimplifications of reality and many technical
details are purposely omitted.
Table of Contents
INTRODUCTION ....................................................... 1 2.5. SCALE VERSUS SCALE VARIABLES ........................ 10
TABLE OF CONTENTS ................................................ 1 3. SPECIFIC CHARTS ........................................... 12
1. BASIC NOTIONS ............................................... 2 3.1. DATA MAP ................................................... 12
1.1. FUNDAMENTAL SCOPES .................................... 2 3.2. TIME SERIES .................................................. 14
1.2. VARIABLES ..................................................... 3 4. MISLEADING FACTORS ................................... 18
1.3. SCALES .......................................................... 3 4.1. DISTRACTIBLE ELEMENTS ................................. 18
2. TRADITIONAL GRAPHS ..................................... 5 4.2. ELEMENTS IN THE WRONG PLACE ...................... 18
2.1. ONE NOMINAL VARIABLE .................................. 5 4.3. WRONG PROPORTIONS BETWEEN DATA AND SIZES 19
2.2. ONE SCALE VARIABLE ....................................... 6 4.4. DIFFERENT UNITS OF MEASURE ......................... 20
2.3. NOMINAL VERSUS NOMINAL VARIABLES ............... 8 4.5. NOT STARTING FROM ZERO .............................. 21
2.4. NOMINAL VERSUS SCALE VARIABLES .................... 9 4.6. TWO AND THREE DIMENSIONAL GROWTH ........... 21
Paolo Coletti Infographics course book
1. Basic notions
1.1. Fundamental scopes
Every time we need to represent anything, being a picturesque view through a painting, numbers through a
table or, as it will be the case in this book, numbers through a picture, we must never forget what is the
scope of our work. To help us in this task, the following questions can give good hints. They can be used
every time we are faced with a representation to point out the bad aspects and to find out possible
corrections and improvements.
What is represented?
Representing something without having a prior knowledge of what is going to be represented is clearly
pointless, but there are numerous examples of representations built up only to impress the viewer, without
any intention of really display something.
The topic which is displayed must be clear in the designer’s mind and also the point which wants to be
underlined must be clear. All the image must be dedicated to these two tasks without adding worthless
extras and without depriving these two points of any of their aspects for other secondary scopes.
Is something misleading?
The most widespread problem in data visualization are misleading graphs, which deliberately or by
ignorance display information in a wrong way to attract the viewer’s attention to the scope of the
representation in an unfair way. The typical example of error are wrong proportions, displaying elements of
the picture not in relative scale to each other. This may seems a childish trick, easy to discover; however,
when it is done using multidimensional pictures it needs a careful eye to be spotted, a can be seen in
section 4.
Is something useless?
Too many graphical representations incorporate elements which do not have anything to do with the main
scope, with the only task to embellish the graphics. Very often these elements distract the attention, when
not moving the viewer’s focus completely away. In some cases they make parts of the representation
incomprehensible, as can be the case when fancy fonts are used to write labels and numbers.
Is something missing?
Very often a graphical representation needs also text and numbers. Even though most of it can be stored in
the caption, it is necessary that axes contain the units of measure and their scale, that colors contain a
proper legend and that, in case useful information is omitted, it be explicitly cited and explained.
A scaleless or unitless graph is useless even when all elements to compare are in scale, any difference
which seems huge can instead be irrelevant. A representation with missing relevant details leads
immediately the suspect that the designer wanted to hide something.
Last but not least a time indication is fundamental. Data will change in time while the representation stays
the same (unless it is fully interactive) and can be looked at even many years after. With a missing time
indication the representation is worthless, it can be referred to right now as well as to 3 millennia ago.
Page 2 of 22 Version 1.3 (21/11/2014)
Infographics course book Paolo Coletti
1.2. Variables
We follow Stevens theory1 and we distinguish the data that we have in three big categories, which are
treated in different ways both by statistics as well by graphical representations.
Scale variables, also called numeric variables, are data expressed in a numerical format, be it continuous or
discrete, on which mathematical operations make sense. They are usually physical measures, such as time,
age, weight, distance in Kilometers, length, number of children, GDP.
Nominal variables, also called categorical variables, are data which represent categories. No mathematical
operation makes sense on them, even though sometimes numbers are used to identify the categories.
Their only scope is dividing the population into categories. For example, country of residence, sex, degree
course. Even though scale variable could divide the population into categories, the fact that they have
many possible values, even infinite values when the variable is continuous, using them to divide the
population into categories would result in a huge number of categories, often with one or zero cases in
each category.
The border between nominal and scale variables is however not so straight. And in the midway we find
ordinal variables, which define only an order on the thing they are representing. For example, the arrival
position in a running competition is clearly ordinal, as it defines the order but no mathematical calculation
can be done (we cannot say that the sixth one is 3 times the second one). A typical example is the 5‐
elements Likert scale used in questionnaires, a set of predetermined answers going from very negative,
indicated with 1, to very positive, indicated with 5. It is clear that answering 4 is always more positive than
answering 2, but we cannot say that it is twice as positive.
Ordinal variables can always be used as nominal variables, as they have a limited number of values and
they divide the population into categories. Sometimes, when they have several values and when the
numbers user to represent the order can somehow also induce a mathematical meaning, they are used as
scale variables.
1.3. Scales
Unless extravagant solutions are absolutely necessary, there are two scales which are typical in
representations of numerical data.
The linear scale is the standard choice, every interval represents a distance which is proportional to its
length, with this proportion fixed for the entire graph. It is often a good idea to include also the zero level in
this scale, in such a way that the viewer is not mislead into thinking that apparently huge differences are
really so big, as in section 4.5. When the zero level is so far away and it is not useful for the problem (for
example, because the problem itself concentrates on the differences and not on the absolute values), it is
necessary to render very evident the fact that the scale does not start from 0, using graphical tools or text.
The alternative scale is the logarithmic scale, where each interval represents a distance which is
exponentially proportional to its length. In the logarithmic scale the logarithm of the values are displayed,
but on the axis still the original value is indicated, not its logarithm. Given the features of logarithm, no
value can be zero nor negative, as its logarithm does not exist. Traditionally logarithmic scale in base 10 is
used, as it is more human comprehensible, but it is equivalent to use any other base such as 2 or . It is vital
to indicate on the axis at least three values, to give the viewer the correct impression that numbers are
growing exponentially. In logarithmic scale the usual starting value is 1, as 0 is not representable.
1
Stevens, S. S. (1946); On the Theory of Scales of Measurement; Science, vol. 103, n. 2684, pages 677–680.
Version 1.3 (21/11/2014) Page 3 of 22
Paolo Coletti Innfographics course bookk
Page 4 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course boo
ok Paolo Coletti
P i
2. Traditiona
al graphs
How one orr several variiables can be
e graphically visualized depends stron
ngly on the vvariable type
e.
2.1. One
e nomina
al variablle
To represennt one nomin nal variables there are tw
wo basic disttinct choices and a bunchh of variation
ns which aree
technically identical.
The first choice is the column plot, sometimes iimproperly ccalled histogram. It repreesents on th he horizontall
axis the variable’s categgories and on
n the vertica l axis the frequency of ca
ases in each category.
Millions of p
patent applicattions in Japan, US, Europe
between 19988 and 2007, sou urce WIPO. Number of pparticipants by race that attennd the workshop on Social Media
gcenterode.woordpress.com.
in 2011, source http://lsuag
The second
d choice is the
t pie chart. It depictss a circle divvided into sllices, each oone proportional to thee
number of cases in thhat categoryy. Comparingg it to the column plot we observve that here e the visuall
information
n on the abssolute numbber of cases is lost, while strong emmphasis is puut on the pe
ercentage off
cases.
The third chhoice is the radar graph, which is a good choice e when the n number of ccategories is at least fivee
and when tthey have a ccircular meaning, such aas months, w week’s days, hours, angu lar directionns such as N,,
NE, E, SE, ettc. Since it iss considered to be a veryy fancy graph
h, it is often a
abused and applied also to variabless
which are nnot circular at all, especiaally in marketting dimensions.
Mon
500 1,4%
400
1,2%
Sun 300 Tue
1,0%
200
100 0,8%
0 0,6%
Sat Wed
d 0,4%
0,2%
0,0%
Fri Thu Apr
A May Jun Jull Aug Sep
Number of sh
hops open per w week day (invented Liine plot of perccentage of peop
ple who use an App more than n 100 times a
data) . month with h respect to total Apps’ users, source Flurry A
Analytics
Possible varriations of th
he previous ttwo basic typpe are:
• the bar p
plot, a versio
on of the colu
umn plot witth horizontal bars;
• the line plot, which consists in a column plo t where a lin
ne connects the top of eeach column. It must nott
be confuused with a mathematical graph, as on the horizzontal axis there is not aa numerical variable butt
Version 1.3 (21/11/2014
4) Page 5 of 22
2
Paolo Coletti Infographics course book
simply categories. Special care must therefore be taken when using it and It is best suited when an
ordinal variable appears on the horizontal axis, especially if it is a time related variable;
• the area plot, which is identical to the line plot but has the area below the line colored. This is many
times a misleading element, as it gives the impression that values between two categories really exist
(they do not, on the horizontal axis we have only categories not a continuous variable);
• all the three dimensional versions of these graphs are also commonly used. However, unless there is a
strong necessity for an extra dimension, this is only extra graphics which moves the focus away from the
main point and confuses the viewer.
30
20
10
2
0
$1,000
$2,000 $3,000 $4,000 $5,000 $1,000 $2,000
$3,000 $4,000 $5,000
Histogram with 5 intervals for 100 cases (invented data) Histogram with 50 intervals for 100 cases (same invented data)
Page 6 of 22 Version 1.3 (21/11/2014)
Infographics course boo
ok Paolo Coletti
P i
12
0
$1,000 $2,000 $
$3,000
$4,000 $5,000
Histogram
m with the first iinterval 3 timess wider (same invented data).
Histogram witth 15 intervals ffor 100 data (sa
ame invented
There are 8 cases in the first box, 7 in thhe second one e
even though itss
datta)
are
ea looks much ssmaller.
B
Boxplot of dailyy log returns of 3M company qquoted at NASD
DAQ in 2012, source www.porttfolioprobe.com
m.
Version 1.3 (21/11/2014
4) Page 7 of 22
2
Paolo Coletti Innfographics course bookk
Country GDP p
per capita (USD) in 2011
GDP pe
er capita (USD) in 2011
United Statees 14.991.300.0
000.000.000
1,6E+16
China 7.203.784.0
000.000.000
1,4E+16
Japan 5.870.357.0
000.000.000 1,2E+16
1,0E+16
Germany 3.604.061.0
000.000.000 8,0E+15
France 2.775.518.0
000.000.000 6,0E+15
4,0E+15
Brazil 2.476.651.0
000.000.000 2,0E+15
0,0E+00
United Kingd
dom 2.429.184.0
000.000.000 Unite
ed China Japa
an Germany Frannce Brazil United Italy
State
es King
gdom
Italy 2.195.937.0
000.000.000
Month
Month
Avverage monthlyy Average month
A hly Jan
preccipitation (cm) in precipitation (cm
p m) Dec
7
Feb
6
Bozen in Bozen 5
4
Average montthly
Jan 1,33 Jul 6,27 Nov
3
Mar precipitation
2 (cm) in Bozeen
Feb 1,06 Ago 4,9 1
Oct 0 Apr
Mar 1,54 Sep 3,79
Apr 3,37 Oct 4,6 Sep May
May 4,2 Nov 3,51
Ago Jun
Jun 6,04 Dec 1,53 Jul
2.3. Nom
minal verrsus nom
minal variiables
The represeentations of two nomina al variables aare variants of the one nnominal variaable graphs.. Usually thee
one with thhe more cateegories is displayed as beefore, while the categories of the othher one are represented d
side by sidee, line by linee using different colors ((in case line or radar gra
aphs turns o ut to be apppropriate) orr
also one abbove the otheer, when it m makes sensee to sum the categories a and thus givee also an info
ormation onn
the sum forr each catego ory of the first variable.
In this casee it makes seense to use a 3D graph,, where the second nom
minal variabl e is represented on thee
third dimen nsion. Howevver, still the iinformation on the exactt proportions is lost.
25
20
15
10
0
Enterntainm
ment Games Lifestyle News Sociall networking
iP
Phone Android
d
Average seession frequenccy per month off smartphone uusage by Early‐stage e
entrepreneuriall activity in the UK by age grou
up,
operating systeem, source Flurrry Analytics source “GEM G Global adult poopulation surveyy (APS) 2002‐20011”
Page 8 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course boo
ok Paolo Coletti
P i
20
15
10
5
0
Androiid
iPhone
European, USS and Japanese patent costs, so
ource Mejer, M M., and Van Average sessio
on frequency peer month of sm martphone usagge by
Pottelsbeerghe (2009) “Beyond the proh hibitive cost of patent opeerating system, source Flurry A
Analytics
proteection in Europe
e”
Sometimees a stacked areea graph can be really meaninggful. Smartphon by operating sysstem, source www.gartner.com
ne shipments b m
2.4. Nom
minal verrsus scale
e variablles
Also the no
ominal versu
us scale reprresentations are variantss of the scale variable rrepresentatio
on, with thee
nominal varriable’s categgories depictted side by sside or one o over the other. For histoograms a spe ecial warningg
is necessaryy: when the number of ccases in eachh category is strongly diffferent, rescaaling the verttical axes forr
visibility purposes can b
be misleadingg, while not doing it can squeeze a distribution too the horizon ntal axis.
Version 1.3 (21/11/2014
4) Page 9 of 22
2
Paolo Coletti Innfographics course bookk
Boxplots of d
daily log return
ns at NASDAQ in
n 2012 for diffeerent Histo
ograms, with ve
ertical axes resccaled, of heights by sex of CSE
commpanies, source www.portfolio oprobe.com students in 2
2009, source htttp://www.wusstl.edu.
Scatterplot o
of number of meedicines used d
daily per 1000 innhabitants and average cost o
of daily medicinne dose. Cases a
are regions,
source “Relazzione sanitaria –– Provincia autonoma di Bolza
ano – 2005”
Page 10 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course boo
ok Paolo Coletti
P i
Scatterp
plot with lines o
of photosynthessis in a biologicaal Scatterplot with regresssion line , source
N 978‐953‐307‐2283‐8.
experiment, ssource chapter 3 of book ISBN http://markwadsworth.blogsspot.it, data fro om OECD.
Two‐input on
ne‐output fuzzyy controller data
a, source Fully Logic Circcle’s size repressent country’s ppopulation (bigg red circle is
Toolbox exaamples for Matlab. Chin
na, big light bluee one is India, bbig yellow one iis USA, oldest
peeople are from Japan, poorestt country is Con ngo, richest
country is Luxembourg), sourcce www.gapminder.org.
Version 1.3 (21/11/2014
4) Page 11 of 22
2
Paolo Coletti Innfographics course bookk
3. Specific ch
harts
Several fieldds make usee of particular variables for which data
d have we
ell‐known prroprieties, which
w can bee
used to dissplay particu
ular graphs which
w turns out to be more
m approppriate and m
more undersstandable byy
expert eyess then standaard ones.
Source A
Annals of Intern
nal Medicine, vo
ol. 133, n. 2, paages 161‐162. Black
k lines are choleera cases 19 Au
ug to 30 Sept
1854 as drawn byy John Snow, so ource ISBN
97811130088281.
Page 12 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course book Paolo Coletti
Exports by sea of French wines in 1864, source C.J. Minard 1865. 1999 volume of minutes of voice telecommunication.
Only the principal route pairs are shown to avoid
cluttering the map. Circular symbols on capital cities
represent the country's total annual outgoing traffic,
source http://mappa.mundi.net and
www.telegeography.com
Version 1.3 (21/11/2014) Page 13 of 22
Paolo Coletti Infographics course book
Page 14 of 22 Version 1.3 (21/11/2014)
Infographics course boo
ok Paolo Coletti
P i
Telecom Italiia daily open, close, low, high prices, source Y
Yahoo!. Horizontal axis skips SSaturdays and
Sundays w
without leaving any blank spacee, other market closed days have a blank spaace instead.
A similar representationn is typical fo
or weather, aas is shares tthe high and
d low of finanncial data, as well as thee
presence off other indicaators such ass precipitatioon and humidity.
3.2.2. Gan
ntt chart
A very alterrnative way to display a starting andd ending tim
me of an eve ent which mooves ahead (in space orr
moves on w with its planned activitie
es) is using oone dimensioon as time aand the otheer one as mo oving space,,
connecting the startingg and endingg points withh a line. The steepness o
of the line inndicates how
w fast is thatt
part proceeeding.
When dealing with projject’s planning, this grapph uses bars and takes the name of Gantt chart,, in honor off
Henry Gantt.
Version 1.3 (21/11/2014
4) Page 15 of 22
2
Paolo Coletti Innfographics course bookk
Gantt chart for Windowss support, sourcce www.xpisobsolete.com
Timetable of ttrains from Parris to Lyon operrating in 1880, rred line is TGV train in 1981, sources E.J. Marrey 1885 and E.R. Tufte 2001.
Page 16 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course book Paolo Coletti
Figurative map of the men losses of the French army in the Russian campaign 1812‐1813, source C.J. Minard 1869.
A new chart of history, source Lectures on History and General Policy by J. Priestley 1788.
Version 1.3 (21/11/2014) Page 17 of 22
Paolo Coletti Infographics course book
4. Misleading factors
4.1. Distractible elements
Any element which is not directly related to the focus of the representation should be avoided, especially
when it is only an embellishment of the chart. The typical example is inserting pictures which do not help
comprehensibility inside bar plots.
As can be seen in these two pictures, the elements added in the column bars do not help the
comprehensibility at all. They do not represent, as they could be, neither the years nor the three analyzed
variables in the second graph, where it could be much better displaying a picture related to the variable
itself. Their only task is to distract the viewer in the first one with the help of a three‐dimensional object
(see section 4.6) and in the second one from the non‐zero start of vertical axis (see section 4.5).
Source Beautiful Evidence by E. Tufte. Source Day Mines 1974 Annual Report.
In this picture instead a fancy left to right perspective effect has been introduced. Its effect on a better
comprehensibility of the data is none, while it makes comparing the bars a very difficult task and gives the
impression of a much larger increase.
Black columns represent New York State aid to localities in billions of USD,
total column represent total budget, source New York Times 1 Feb 1976.
Page 18 of 22 Version 1.3 (21/11/2014)
Infographics course book Paolo Coletti
expenses to others. However, apart from the impact on the viewer, there is no need to use two
dimensions, one dimension is enough as the variable to be displayed is only one. The only advantage in
using two dimensions is the possibility to easily slip insert some rectangles inside other ones. However, this
automatically induce into the viewer the idea that a rectangle is part of the outside one, as it is correctly
depicted for Online Advertising as part of Worldwide Advertising Spend. However, the viewer is expected
to see the same for Iraq War, but the three rectangles are not in the total one, giving the impression of a
larger expense.
Upper part of billion dollar gram by D. McCandless, source www.informationisbeautiful.net
Version 1.3 (21/11/2014) Page 19 of 22
Paolo Coletti Innfographics course bookk
Upper parrt of billion dollar‐o‐gram by D
D. McCandless, source www.in
nformationisbeeautiful.net
4.4. Diffferent un
nits of me
easure
Data with a different un nit of measurre may neve r be put one e aside the otther, unless ttheir units o of measure iss
very clear o
or unless theey are prope
erly rescaledd. Apart from
m evident units of meassure’s difference, in thee
examples b below the last category contains a sshorter sample’s unit, which
w is eviddent only in the second
d
picture. In bboth cases a rescaling wo ould have beeen very appropriate.
Source Scien
nce Indicators 1
1974, National SScience Foundaation. Source New York Timees 8 August 197
78.
Page 20 of 222 V
Version 1.3 (2
21/11/2014))
Infographics course boo
ok Paolo Coletti
P i
4.6. Two
o and thrree dimensional ggrowth
Whenever aa single scalee variable is representedd using a two o dimensional image speecial care mu ust be takenn
in the propo ortions. The viewer is facced up with two differen nt scales: the
e vertical sizee and the are ea size. Onlyy
ontal base is kept constant, it is possiible to respe
if the horizo ect both the pproportions among data and verticall
sizes and areas. If also the horizon
ntal base of the graphiccal element is rescaled, then the area becomess
obviously p proportional to the square of the da ta, as can be e seen in the man in thee first picturre of section n
4.1, and in the two picttures below. In the firstt one, the fact that the e element is aalso a three dimensionall
barrel introduces an eveen further m misleading efffect, as the p perceived volume is the ccube of the linear size.
Source Tim
me 9 April 1979
9. Source W
Washington Posst 25 April 1978
8.
Version 1.3 (21/11/2014
4) Page 21 of 22
2
Paolo Coletti Innfographics course bookk
Source New York TTimes, 19 Decem
mber 1978 Sou
urce www.80200vision.com
Page 22 of 222 V
Version 1.3 (2
21/11/2014))