
‘Data Analytics & Visualisation’

Minor ‘Data Science’


Hogeschool Rotterdam, CMI
Week 1

Skills of a Data Scientist


Let us introduce
ourselves …
Programme

Statistics · Visualization · Project · Machine Learning · Privacy Engineering

On Demand: Programming in Python, R and Shiny
Practical Information
• Data Visualisation on Thursday 9:00 - 11:30

• Project lessons Thursday 12:00 - 16:00.

• Other days for homework, project group working meetings.

• We use Microsoft Teams to share learning materials and communicate about the course.

• Assessment will be done with a written Exam and a Personal Visualisation Assignment. The final grade is the average of the two partial grades.

Big Data
Data Explosion
Insight, Decision, Action
Why Visualization?

We live in the age of Big Data.


Human beings process visual information
60,000 times faster than textual information!
(https://rhdeepexploration.wordpress.com/2011/12/05/visuals-60000-times-faster/)


Useful in two phases…

1. In the exploration phase, for yourself: trying to understand and possibly find hidden patterns in the data set.

2. In the presentation phase, for your audience: trying to communicate insights and trigger actions.
Static Infographics
Tangible Visualisations
Data Animations

source: https://www.youtube.com/watch?v=4gIhRkCcD4U
Interactive Visualisations

source: http://getdolphins.com/blog/interactive-data-visualizations-new-york-times/
In-Class Assignment 1.1

Look for interesting visualizations of the coronavirus pandemic that is now sweeping over the world.
• What data was used for the Viz?
• What story is told by the Viz?

Truth or Beauty?
Terminology?
• Absolute values

• Relative values (ratios, percentages, per capita numbers)

• Cumulative values

• Logarithmic scale

• Aggregate

• Filter

• Summarize

• …

R for data collection, filtering, cleaning, wrangling, slicing, dicing, munching, crunching, modelling, …, graphing, plotting and drawing.

Shiny for user interaction.
Assignment 1.2
Follow the tutorial on R Data Structures and Graphics.
Make notes of things you don’t understand.

https://www.w3schools.com/R/r_vectors.asp
Base Data Types
Numbers / Discrete Numbers: Number of children, Floor in a building, …

Numbers / Continuous Numbers: Temperature, Time, Length, …

Text / Western Script (UTF-8): “Rotterdam”; Other Scripts: Chinese, Arabic, …

Logical: TRUE, FALSE

Categories: Man || Woman, Pass || Fail || Inconclusive…

Real World Data Types


Container Data Structures
• Vector: one dimension, elements have the same data type

• List: one dimension, elements may have different data types

• Matrix: two dimensions, elements have the same data type

• Dictionary: key-value pairs, values may have different data types

• Table or Data.Frame: each column is a vector!
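
For example, a minimal R sketch of these containers (the example values are mine):

v <- c(2, 4, 6)                       # vector: one dimension, same type
l <- list(1, "Rotterdam", TRUE)       # list: one dimension, mixed types
m <- matrix(1:6, nrow = 2, ncol = 3)  # matrix: two dimensions, same type
d <- list(name = "Jan", city = "Utrecht")  # a named list acts as a dictionary
df <- data.frame(city = c("Amsterdam", "Berlin"), rank = c(1, 2))
str(df)  # each column of the data frame is a vector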

Interactive R Demo
Example Data Sets
• R contains a number of example Data Sets

• Display available Data Sets in R: > data()

• Once you have chosen one, you can find its data structure with: > str(data.set)

• And its description with: > help(data.set)

Exploring Large Data Sets

• To see the first few rows use: > head(data.set)

• To see the last rows use: > tail(data.set)

• To determine the size use: > length() for a vector, > nrow() and > ncol() for a data frame

• Parts of the data can be selected with square brackets data.set[…], e.g. > data.set[3, 4], > data.set[1:5,] or > data.set["column name"]

• To get the contents of a single column (one attribute) use: > data.frame$column.name
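
A short walk-through of these commands on the built-in iris data set (output omitted):

> str(iris)               # structure: 150 obs. of 5 variables
> head(iris)              # first six rows
> tail(iris)              # last rows
> nrow(iris); ncol(iris)  # dimensions of a data frame
> iris[3, 4]              # one cell: row 3, column 4
> iris[1:5, ]             # rows 1-5, all columns
> iris["Species"]         # one column by name (still a data frame)
> iris$Sepal.Length       # one column as a vector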
Basic Plotting
• Graph or Scatter diagram: > plot(x,y)

• Histogram: > hist(x)

• Barplot: > barplot(x)

• Many more possibilities can be found in the package ggplot2: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf. We will come back to this package later. For now, we use plotting from the base package.
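
For instance, a quick sketch with data sets that ship with R:

> plot(women$height, women$weight)  # scatter diagram of two variables
> hist(islands)                     # histogram of a numeric vector
> barplot(precip[1:10])             # barplot of the first ten observations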

Line Graph
Scatter Diagram
Histogram versus Bar Graph
Assignment 1.3
Install RStudio on your laptop. Explore some of the data sets that are packaged with R.
• What container data structure, what data types?
• Which plots can be used to explore the data?

Install RStudio on your laptop

https://www.rstudio.com/products/rstudio/download/
Homework Assignment: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable
names!)

• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.

• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.

• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.

• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5

• Make one more plot that shows something interesting about the differences
between the species of Irises.

Iris
If something is unclear or you need additional help, please contact me!

Email: wypaz@hr.nl

Or send me a msg on Teams!


‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 2

Any Questions, New Insights?


https://rstudio-education.github.io/hopr/
Part 1, Chapters 1-3

Homework Last Week: Iris


Homework Last Week: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable
names!)

• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.

• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.

• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.

• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5

• Make one more plot that shows something interesting about the differences
between the species of Irises.

Recap …
base data types, vectors,
data.frames, subsetting,
visualizing

Base Data Types


Numbers / Discrete Numbers (whole numbers, counts): Number of children, Floor in a building, …

Numbers / Continuous Numbers (fractional numbers, e.g. weight): Temperature, Time, Length, …

Text / Western Script (UTF-8): “Rotterdam”; Other Scripts: Chinese, Arabic, …

Logical (a.k.a. Boolean): TRUE, FALSE

Categories (in R: Factors with Levels): Man || Woman, Pass || Fail || Inconclusive…

Container Data Types

Homogeneous vs. Heterogeneous:

0D (a single value): 25
1D, homogeneous: Vector, c(2, 4, 6) or c("Amsterdam", "Berlin")
1D, heterogeneous: List, list(1, "Rotterdam", TRUE)
2D, homogeneous: Matrix
2D, heterogeneous: Data.Frame
multi-D, homogeneous: Array
Vectors
• Construct your own with: c(1, 2, 3)
• A vector can be named: my.vector <- c("A", "B", "C")
• Also the elements of a vector can be named (e.g. built-in data set islands). Use names() to manipulate them.

> str(islands)
Named num [1:48] 11506 5500 16988 2968 16 ...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> head(islands)
Africa Antarctica Asia Australia Axel Heiberg Baffin
11506 5500 16988 2968 16 184
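
A small sketch of manipulating those names (the vector is mine):

> my.vector <- c(11, 7, 3)
> names(my.vector) <- c("First", "Second", "Third")  # attach names
> my.vector["Second"]   # select an element by its name
> names(my.vector)      # read the names back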

Data Frames (1 of 2)
• Construct your own with: data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• A data frame can be named: my.df <- data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• Also the rows and columns of a data frame can be named (e.g. built-in data set women). Use rownames() or colnames() to manipulate them.

Data Frames (2 of 2)
> str(women)
'data.frame': 15 obs. of 2 variables:
 $ height: num 58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

Each row is an Observation; each column is a Variable.

Filtering / Subsetting with […] and $

Vector, one dimensional: index with one number, e.g. my.vector[3], or one name, my.vector["First"], or
• A vector with selected elements, e.g. c(2, 4, 5)
• A vector with names
• A logical statement, that is TRUE or FALSE for each element

Data Frame, two dimensional: index with two numbers separated by a comma, e.g. my.df[3,1], or two names, my.df["First", "B"], or
• Two vectors with selected elements
• Two vectors with names
• Two logical statements, that are TRUE or FALSE for each element
• If you leave the space in front of or behind the "," empty, everything is selected

For convenience it is possible to select a complete column (which is an ordinary vector) of a Data Frame with "$".
E.g. weights <- women$weight
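
A few subsetting examples on the built-in women data set, illustrating the rules above:

> women[3, 1]                 # one cell: row 3, column 1
> women[1:5, ]                # rows 1-5; empty space behind "," selects all columns
> women[, "weight"]           # one column by name
> women[women$height > 65, ]  # logical statement, evaluated per row
> weights <- women$weight     # complete column as a vector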

Predefined R Functions

> help()
> str()
> length()
> head()
> nrow()
> ncol()
> summary()
> table()
> plot()
> barplot()
> boxplot()

Things to know about a function: its input (actual parameters), its output (return value), default parameter values, and side effects (e.g. generating a plot).
Explore real-world Data Sets
Discover?

Import

Clean

Transform

Visualize
Import Data Sets from files on the Internet
Data Structures in Memory
(Vector, Matrix, Data.Frame,
List, …) are different from Data
Structures stored in a File.
Data File Formats
TXT, CSV, JSON, GeoJSON, XML, HTML, JPEG, PNG, MP3, AVI, …

Lots of File Formats


TXT
CSV: Comma Separated
Values
HTML, XML
JSON
JavaScript Object Notation

• Lists: ["Amsterdam", "Buenos Aires", "Chicago"]

• Dictionaries: {"Name": "Jan Kroon", "City": "Utrecht", "Children": ["Lente", "Muara"]}

• Lists of Dictionaries, Dictionaries containing Lists.

R Code Hints
• > read.csv() for reading "comma separated value" files (".csv").
• > read.csv2() variant used in countries that use a comma "," as decimal point and a semicolon ";" as field separator.
• > read.delim() for reading "tab-separated value" files (".txt"). By default, a point (".") is used as decimal point.
• > read.delim2() for reading "tab-separated value" files (".txt"). By default, a comma (",") is used as decimal point.
• > install.packages("rjson")
> library("rjson")

example:

json_file <- "http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"

json_data <- fromJSON(paste(readLines(json_file), collapse=""))
Visualisation to explore data sets

What would you like to explore?

Source: Dr. Andrew Abela “Chart Chooser”
2D Math / Stats plots
Line Plot
• Purpose: Explore the development of variable
values over time. Is the variable almost constant?
Does it increase or decrease? How fast does it
change: linear growth? explosive growth?
exponential growth?

• Example:
> time <- seq(from=0, to=10, by=0.1)
> growth <- time*time
> plot(x=time, y=growth, type="l")
visual comparison
over time



Trend Line

Ice Cream Sales


Seasonal / Cyclic Pattern
Barplot
• Purpose: Get an impression of differences between
variables.

• Example: built-in data set precip

> barplot(precip, horiz = TRUE)

You get a stacked barplot if the height argument is a matrix (e.g. barplot(as.matrix(df))).

visual comparison
between variables


Boxplot
• Purpose: Get an impression of the distribution of a
variable in a data set: Center, Quartiles, Outliers…?

• Example: built-in data set islands

> boxplot(islands)

visual inspection
of distribution

Histogram
• Purpose: Get an impression of the distribution of a variable in a data set: Symmetrical or Skewed? Uniform distribution? Normal distribution? Or otherwise?

• Example: built-in data set islands

> hist(islands, breaks=10)

visual inspection
of distribution

Scatter plot (Dutch: strooidiagram)

• Purpose: Compare two variables of the same person or object (Dutch: proefpersoon or proefobject). Is there a relationship?

• Example: built-in data set women

> plot(x=women$height, y=women$weight)

visual inspection
of relationship

3D plot
> persp(volcano)
Heat Map
> image(volcano)
Contour Graph (isobars, contour lines; Dutch: isobaren, hoogtelijnen)
> contour(volcano)
Multiple Data Sets in a single plot

• plot() creates a new plot

• points(), lines(), text(), legend(), … add data to the existing plot.

• boxplot() with multiple arguments plots multiple box-plots side by side.

visual comparison
between variables
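
A minimal sketch of combining several of these calls in one figure (colors and labels are my own choices):

> plot(women$height, women$weight, col = "blue",
       main = "Height vs weight", xlab = "Height (in)", ylab = "Weight (lb)")
> lines(lowess(women$height, women$weight), col = "red")  # add a smooth trend
> legend("topleft", legend = c("observations", "trend"),
         col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))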

Geospatial Plots
London Cholera Map
Dr. John Snow,1854

Coordinates on Earth
Longitude, Latitude
Different Projections …
… result in different maps, with different properties:

https://learn.canvas.net/courses/464/pages/unit-3-dot-7-map-projection-properties

Dates and Times


Time Zones

https://www.timeanddate.com/time/map/
Date and Time formats
Quick Demo
> df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
> as.Date(df$Date, "%m/%d/%Y %H:%M:%S")

> # The functions below come from the lubridate package
> library(lubridate)
> mytime <- ymd_hms("2015-08-14-05-30-00", tz="America/Halifax")
> mytime

> # Leap year?
> leap_year(mytime)

> # Time differences?
> date1 <- ymd_hms("2017-06-20-03-45-23")
> date2 <- ymd_hms("2017-10-07-21-02-19")
> difftime(date2, date1)

Full tutorial: https://rpubs.com/mr148/303800


Advanced package
• tidyverse: The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

> install.packages("tidyverse")
> library("tidyverse")
• It contains a lot of packages that are useful in data science; for this course: ggplot2.

ggplot2
• Midwest dataset: a build in dataset

• Try on your own: 



> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))

ggplot2
• Midwest dataset: a build in dataset

• Try on your own: 



> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))
+ geom_point()

You need to specify what kind of graph you want to draw!


ggplot2
> g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() + geom_smooth(method="lm")
# set se=FALSE to turn off confidence bands

# Delete the points outside the limits

> g + xlim(c(0, 0.1)) + ylim(c(0, 1000000))

In Class Assignment 2.1

1. Use the (built-in) data set Seatbelts to visualize the impact of the introduction of the law to use seatbelts. (First turn Seatbelts into data.frame format with data.frame(Seatbelts).)

2. Use the (built-in) data set LakeHuron to explore whether there is a seasonal pattern in the water levels. What is the trend?

3. Use the (built-in) dataset state.x77.
a) Make sure the object is a data frame; if not, change it to a data frame.
b) Find out how many states have an income of less than 4300.
c) Find out which is the state with the highest income.
d) What are possible causes of high murder rates?

‘Data Analytics & Visualisation’


Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 3

Seatbelts Data Set

(The plots exploring this data set were shown on the slides.)
Homework week 2
Exercise 2:

With the dataset swiss, create a data frame of only the rows 1-7 and only the variables Examination, Education and Infant.Mortality.
b) Create a row that will be the total sum of the columns; name it Total.

> tm <- swiss[1:7, c('Examination', 'Education', 'Infant.Mortality')]
> tm["Total", ] <- colSums(tm)
c) Create a new variable, swissbe, that is the proportion of Examination (Examination / Total):
> tm$swissbe <- tm$Examination / tm$Examination[length(tm$Examination)]

Homework week 2
Exercise 3
For the dataset state.x77

a. Remove column Frost

> sta <- data.frame(state.x77)  # make sure we work with a data frame
> keep <- c("Population", "Income", "Illiteracy", "Life.Exp", "Murder", "HS.Grad", "Area")
> sta <- sta[keep]

b. Add a variable to the data frame which should categorize the level of illiteracy: [0,1) is low, [1,2) is some, [2, inf) is high.
> sta$illlvl <- ifelse(sta$Illiteracy < 1, 'low',
    ifelse(sta$Illiteracy < 2, 'some', 'high'))
Different data structures
Data Cleaning
How is it done in R?
Data transformation
> install.packages('dplyr')
This package will allow you to manipulate the data easily. (read_delim() below comes from the readr package, also part of the tidyverse.)

An example:

There is a dataset in Teams which we will use to work on:

> df <- read_delim("heartatk4R.txt")

What message do you get on the screen? How can we fix it?
Data transformation (continued)
A fix: specify the delimiter and the column types explicitly:

df <- read_delim("heartatk4R.txt", "\t",
  col_types = cols(AGE = col_integer(),
    DIAGNOSIS = col_character(), DIED = col_character(),
    DRG = col_character(), LOS = col_integer()))
Data transformation
The dplyr verbs (the code examples themselves were shown on the slides):

• %>% : the pipe operator; data is sent to the next step

• arrange(): sort in ascending order; desc(AGE) for descending order

• mutate(): adds new variables and preserves existing ones

• filter(): filter the dataset according to the conditions given

Can you explain what this code does?

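The original code was shown as an image; here is a sketch of the kind of pipeline these slides illustrate, using the heartatk4R columns from above (the exact chain on the slide may differ):

library(dplyr)

df %>%
  filter(AGE >= 40) %>%            # keep only rows matching a condition
  mutate(LONG_STAY = LOS > 7) %>%  # add a new variable, keep existing ones
  arrange(desc(AGE)) %>%           # sort in descending order of age
  head()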

Missing values
df <- df %>% drop_na()
But this might lead to problems such as:

- data bias

- data loss
Possible solutions:

- add more data: impute values (mean, median), linear regression, and many more …


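A sketch of both routes, assuming a data frame df with a numeric column LOS as in the earlier example:

library(dplyr)
library(tidyr)

df_dropped <- df %>% drop_na()   # remove all rows with missing values

df_imputed <- df %>%             # or: impute the column mean instead
  mutate(LOS = ifelse(is.na(LOS), mean(LOS, na.rm = TRUE), LOS))
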
Outliers
By now you should know how to identify whether your data has any outliers.

How to deal with outliers?

1. Remove rows with outliers from your dataset

2. Consider outliers & inliers separately

3. Remove & replace via imputation


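One common way to identify them in R is the boxplot rule (values beyond 1.5 * IQR from the quartiles); a small sketch on the built-in islands data:

out <- boxplot.stats(islands)$out    # values flagged as outliers
islands[islands %in% out]            # inspect them
clean <- islands[!islands %in% out]  # option 1: remove them
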
Remove duplicates
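The code on the slides was an image; a minimal sketch of the usual approach (the toy data frame is mine):

df2 <- data.frame(name = c("An", "Bo", "An"), age = c(21, 35, 21))
duplicated(df2)           # TRUE for rows that repeat an earlier row
df2[!duplicated(df2), ]   # keep the first occurrence of each row
# or, with dplyr: distinct(df2)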
Modify Data Elements: gsub()
> # A Dutch-formatted number, e.g. dutch.number <- "3,14" (example value is mine)
> dutch.number
> # Cast number to a character string
> d.n <- as.character(dutch.number)
> # Substitution
> result <- gsub(",", ".", d.n)
> # Cast character string to num
> international.number <- as.double(result)
Sorting a data.frame by
column
> head(mtcars)
> # Sort a column (a vector)
> sort(mtcars$mpg)
> # order() returns the sorting indices
> order(mtcars$mpg)
> # Sort the whole data.frame
> mtcars[order(mtcars$mpg),]
> mtcars[order(-mtcars$mpg),]
> mtcars[order(mtcars$mpg, -mtcars$cyl), ]
Merging two data.frames
> head(area, 3)
Continent Country Land.Area.2013
1 Europe Netherlands 16164
2 Europe Belgium 11787
3 Europe France 210026
> head(inhab, 3)
Continent Country Inhabitants.2016
1 Europe Netherlands 16987330
2 Europe Belgium 11358379
3 Europe France 64720690
> merge(area, inhab)
Continent Country Land.Area.2013 Inhabitants.2016
1 Asia China 3700000 7466964280
2 Asia India 1240000 1324171354
3 Europe Belgium 11787 11358379
Aggregating data
> aggregate(sales$Cars.Sold, list(sales=sales$Year), sum)
sales x
1 2001 209
2 2002 209
3 2003 209
> aggregate(sales$Cars.Sold, list(sales=sales$Month), sum)
sales x
1 April 54
2 August 27
3 December 48
4 February 21
5 January 36
6 July 51
7 June 99
8 March 45
9 May 75
10 November 54
11 October 69
12 September 48
Subsetting
> df[ , 2]
> df[2, ]
> df[2, 2]
> df[df$var1 == "Male", ]
> subset(df, var1 != "Female")
Add trend lines

> abline(a=0, b=1, col="blue")

a denotes the intercept

b denotes the slope

y = a + b*x
Homework / In class

https://www.kaggle.com/code/rtatman/data-cleaning-challenge-cleaning-numeric-columns/notebook
‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 4

Explore real-world Data Sets


Discover?

Import

Clean

Transform

Visualize
Practical Problems …

You can’t find data. → CBS, Kaggle, https://datahub.io/collections, Google Data Set Search, …

Data is polluted, in the wrong format. → clean!, gsub(), use the lubridate library for Dates and Times, use options of read.csv(): stringsAsFactors = F

Too much data. → filter(), subset(), grep()

Data is distributed over multiple Data Sets. → merge()

Data is too detailed, need summaries, totals per category. → aggregate()
Data Transformations:
filtering, sorting, wrangling, slicing, dicing, munching, crunching, merging, aggregating, …

Matching of arguments
R functions arguments can be matched positionally or by
name. So the following calls to sd are all equivalent
> mydata <- rnorm(100)
> sd(mydata)
> sd(x = mydata)
> sd(x = mydata, na.rm = FALSE)
> sd(na.rm = FALSE, x = mydata)
> sd(na.rm = FALSE, mydata)

Even though it’s legal, messing around with the order of the arguments too much is discouraged, since it confuses fellow developers.

Add trend lines

> abline(a=0, b=1, col="blue")

a denotes the intercept

b denotes the slope

y = a + b*x
Visualisation to
present data sets
Predefined R Functions

> help()
> str()
> head()
> nrow()
> ncol()
> summary()
> plot()
> install.packages()
> library()
> merge()
> aggregate()

Things to know about a function: its input (actual parameters), its output (return value), default values, and side effects.
User Defined R Functions
Functions can be created using the function() keyword and are stored as R objects just like anything else. In particular, they are R objects of class “function”.
f <- function(<formal parameters>) {
return(variable)
}
Functions in R are “first class objects”, which means that they can be treated much like any other R object.
Importantly,
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside of another function.
• The return value of a function is the last expression in the function body to be evaluated.
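
A small sketch of a user-defined function (the example is mine; it computes a body mass index, which we will meet again later):

bmi <- function(weight, height) {
  # weight in kg, height in m; the last expression is the return value
  return(weight / height^2)
}

bmi(80, 1.85)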

Shiny for User Interaction


User Interface
Shiny generates HTML
Data-driven Outputs
Recap UI
Server Function
See what this code does
ui <- fluidPage(
selectInput("dataset", label = "Dataset", choices = ls("package:datasets")),
verbatimTextOutput("summary"),
tableOutput("table")
)

▪ fluidPage() is a layout function that sets up the basic visual structure of the page.
▪ selectInput() is an input control that lets the user interact with the app by providing a value. In this case, it’s a select box with the label “Dataset” and lets you choose one of the built-in datasets that come with R.
▪ verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put rendered output: verbatimTextOutput() displays code and tableOutput() displays tables.

See what this code does


server <- function(input, output, session) {
output$summary <- renderPrint({
dataset <- get(input$dataset, "package:datasets")
summary(dataset)
})

output$table <- renderTable({


dataset <- get(input$dataset, "package:datasets")
dataset
})
}
shinyApp(ui, server)

See what happens if you remove the red or the green box!

Reactive programming
Reactive programming is another programming paradigm: it is programming with asynchronous data streams.

You are able to create data streams of anything, not just from click and hover events. Streams are cheap and ubiquitous; anything can be a stream: variables, user inputs, properties, caches, data structures, etc.

The key idea of reactive programming is to specify a graph of dependencies so that when an input changes, all related outputs are automatically updated.
The input argument is a list-like object that contains all the input data sent
from the browser, named according to the input ID. For example, if your UI
contains a numeric input control with an input ID of count, like so:
ui <- fluidPage(
numericInput("count", label = "Number of values",
value = 100)
)

then you can access the value of that input with input$count. It will
initially contain the value 100, and it will be automatically updated as the
user changes the value in the browser.

Unlike a typical list, input objects are read-only. If you attempt to modify an input inside the server function, you’ll get an error:
server <- function(input, output, session) {
input$count <- 10
}

shinyApp(ui, server)
#> Error: Can't modify read-only reactive value
'count'

This error occurs because input reflects what’s happening in the browser, and the browser is Shiny’s “single source of truth”. If you could modify the value in R, you could introduce inconsistencies, where the input slider said one thing in the browser, and input$count said something different in R.

One more important thing about input: it’s selective about who is allowed to read it. To read from an input, you must be in a reactive context created by a function like renderText() or reactive().
Exercise
Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
Exercise
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:
Homework

Read this:

https://mastering-shiny.org/basic-app.html

https://mastering-shiny.org/basic-case-study.html
Page Lay-out with Panels
ui1 <- fluidPage(
  titlePanel("EduCode 'Functions of two variables'"),
  sidebarLayout(
    sidebarPanel(
      selectInput(inputId = 'chosen.function', label = 'Function description: ',
        choices = c('f(x,y) = x + y', 'f(x,y) = x * y',
          'f(x,y) = x^2 + y^2', 'f(x,y) = 100*sin(x + y)/sqrt(x^2 + y^2)')),
      sliderInput(inputId = 'angle', label = '3D view angle: ',
        min=0, max=360, value=90)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("3D Plot", plotOutput("Three.D.plot")),
        tabPanel("Contour Graph", plotOutput("Contour.graph")),
        # … (the slide cuts off here; closing brackets omitted)

more to explore: https://shiny.rstudio.com/articles/layout-guide.html


‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI Week 5

Exercise
Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
Exercise A
Create an app that greets the user by name. You don’t know all the functions you need to do this yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy and paste them into the right place in a Shiny app.

ui <- fluidPage(
  textInput("name", "What's your name?"),
  textOutput("greeting")
)

server <- function(input, output, session) {
  output$greeting <- renderText({
    paste0("Hello ", input$name)
  })
}
Exercise
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:

Exercise A
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:
Data storytelling
Data storytelling is the concept of building a
compelling narrative based on complex data
and analytics that help tell your story and
influence and inform a particular audience.
WHY?
• Adding value to your data and insights.

• Interpreting complex information and highlighting essential key points for the audience.

• Providing a human touch to your data.

• Offering value to your audience and industry.



The star of
the show: DATA

• Think about your theory. What do you want to prove or disprove? What do you think the data will tell you?
• Collect data. Collate the data you’ll need to develop your story.
• Define the purpose of your story. Using the data you gathered, you should be able to write what the goal of your story is in a single sentence.
• Think about what you want to say. Outline everything from the intro to the conclusion.
• Ask questions. Were you right or wrong in your hypothesis? How do these answers shape the narrative of your data story?
• Create a goal for your audience. What actions would you like them to take after reading your story?

And this is where Data Visualisation comes in
• Reveal patterns, trends, and findings from an unbiased viewpoint.
• Provide context, interpret results, and articulate insights.
• Streamline data so your audience can process information.
• Improve audience engagement.

Build your narrative


As you tell your story, you need to use your data as
supporting pillars to your insights. Help your audience
understand your point of view by distilling complex
information into informative insights. Your narrative and
context are what will drive the linear nature of your data
storytelling.

Use visuals to enlighten
Visuals can help educate the audience on your theory. When
you connect the visual assets (charts, graphs, etc.) to your
narrative, you engage the audience with otherwise hidden
insights that provide the fundamental data to support your
theory. Instead of presenting a single data insight to support
your theory, it helps to show multiple pieces of data, both
granular and high level, so that the audience can truly
appreciate your viewpoint.

Show data to support
Humans are not naturally attracted to analytics, especially
analytics that lack contextualization using augmented
analytics. Your narrative offers enlightenment, supported by
tangible data. Context and critique are integral to the full
interpretation of your narrative. Using business analytic
tools to provide key insights and understanding to your
narrative can help provide the much-needed context
throughout your data story.

A Good Plot contains …
1. Title (What is plotted?)

2. Axis titles including Units

3. Numbers on all axes

4. Legend labeling of all lines or dots if more than one

5. Legible (Colours visible on screen, on projection and in print)

6. Source (Where do underlying Data Sets come from?)




Avoid Data
Decoration!
Stephen Few’s pitfalls
1. Exceeding the boundaries of a single screen
2. Supplying inadequate context for the data
3. Displaying excessive detail or precision
4. Expressing measures indirectly
5. Choosing inappropriate media of display
6. Introducing meaningless variety
7. Using poorly designed display media
8. Encoding quantitative data inaccurately
9. Arranging the data poorly
10. Ineffectively highlighting what’s important
11. Cluttering the screen with useless decoration
12. Misusing or overusing color
13. Designing an unappealing visual display

Levels of Understanding
1. Describe data sets (Descriptive Statistics /
Summary Statistics, Visualization)

2. Understand, explain some relations between variables (Inferential Statistics, Detect patterns)

3. Predict new, unseen, future values!

4. “What If …” analysis: predict the effect of possible actions.

New Terminology
• Feature: What we called Variable

• Label: The variable that we are interested in

• Regression: Predict future numerical outcomes based on historical data

• Classification: Predict future categorical outcomes based on historical (labelled) data

• Clustering: Group (unlabelled) Observations in clusters

You can reduce the dimensionality (number of variables) by selecting the most significant variables (variables that have the most influence on the outcome).

You can see the relation between two numerical variables in a scatter diagram. You can calculate this relation with Pearson’s correlation coefficient.
Pearson's Correlation Coefficient
is a measure of linear correlation between two sets of data.
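
As a sketch: the coefficient is the covariance of the two variables divided by the product of their standard deviations, which R computes with cor():

cor(women$height, women$weight)          # Pearson's r (the default method)
cov(women$height, women$weight) /
  (sd(women$height) * sd(women$weight))  # the same value, by the formula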
Correlation
• Strong / Weak?

• Direction?

• Linear / Non-linear?

• Pearson's correlation coefficient: a number between -1 and +1.

Nice Viz: https://rpsychologist.com/d3/correlation/

How to discover correlations?

• Make some plots! Of course …

• Typically make scatter plots of each pair of variables

• Can you see a relationship? Weak or Strong?

Demo: Wine Quality

• R makes this easy: pairs()
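
For example, a quick sketch on the built-in iris data (the wine data from the demo does not ship with R):

pairs(iris[, 1:4], col = iris$Species)  # scatter plot of every pair of variables
round(cor(iris[, 1:4]), 2)              # the matching correlation matrix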
Correlation does NOT
imply Causation
Types of causation
If A and B are correlated then there are different possibilities for
causation:

• A causes B

• B causes A

• C causes A and B (‘lurking factor’)

• A causes C which causes B (or vice versa, indirect causation)

• A causes B and B causes A (cyclic or bi-directional)

• There is no connection between A and B at all (coincidence)

Correlation

Source: xkcd.com/552
How to discover correlations?
• Make some plots! Of course …

• Typically make scatter plots of each pair of variables

• Can you see a relationship? Weak or Strong?

• Describe the relationship using functions (not only linear: y = a*x + b, but also quadratic, exponential)

R demo: USJudgeRatings

• R makes this easy: pairs()

Plots
In general, plot the independent values on the (horizontal) x-axis and the dependent values on the (vertical) y-axis.

y = f(x)

Multivariate dependencies:

z = f(x, y)

Some families of functions

• Linear, one variable: f(x) = a*x + b

• Linear, multiple variables: f(x,y) = a*x + b*y + c

• Polynomial (quadratic, cubic, ..): f(x) = a*x^2 + b*x + c, f(x) = a*x^3 + b*x^2 + c*x + d

• Exponential: f(x) = 10^x

• Logarithmic: f(x) = log(x)

• Gaussian: f(x) = a*exp(-((x-b)^2) / (2*c^2))

Why we need to be quantitative

Later on we are going to try to use some variables to predict others; this requires fitting a sensible function to the available data.

These problems come in three main categories:

1. You have a theoretical model for how the variables should be related
2. You have no theoretical model and have to guess something from the data
3. A combination of the two, due to e.g. some unexpected noise

Guessing functions
Often there is not a single right answer.

Which function is good enough?

• Needs to describe the major features of the data
• Should be minimal, as simple as will work
• May well not be unique; you can try fitting multiple functional forms and see which works best.

What features count …

Things to look for and check match:

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes

Be creative with scale on y-axis


Interpolation and
Extrapolation
Horse Manure Crisis (1894)
Interpolation and Prediction
• Interpolation is estimating a value between the two nearest known data points.

• Extrapolation (or Prediction if the Data Set is a time series) is estimating a value outside the range of the Data Set using all data points.

• The problem with Extrapolation / Prediction is that there will always be a trend break (Dutch: trendbreuk) somewhere in the future, but it is unknown when.

Fitting a function
Linear Regression
• We want to get a function that describes our data well, but we know that there are some uncertainties that cause some scatter in the data points.

• Linear function of one variable:
> fit1D <- lm(y~x) or glm(y~x)

• Linear function of two variables:
> fit2D <- lm(y~x+z) or glm(y~x+z)

• Get some statistics on how good the fit is:
> summary(fit1D)
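
A worked sketch on the built-in women data set (the variable choice is mine):

fit1D <- lm(weight ~ height, data = women)  # weight = a + b * height
summary(fit1D)                              # coefficients, residuals, R-squared
plot(women$height, women$weight)
abline(fit1D, col = "blue", lwd = 2)        # draw the fitted line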

R: Linear Fit, one variable

# linear fit (one variable)

fit <- glm(y~x)

co <- coef(fit)

abline(fit, col="blue", lwd=2)


R: Linear Fit, more variables

# linear fit (two variables)

fit <- glm(y~x+z)

co <- coef(fit)

persp(fit, …)  # note: persp() needs x, y, z grids; build a grid of predictions first
Non-Linear Regression
• We want to get a function that describes our data
well, but we know that there are some uncertainties
that cause some scatter in the data points.

• Non-Linear function: nls(y ~ f(x), data = …, start = list(p0 = …, p1 = …, …))

• (Calculating sensible starting parameters will make your life easier.)

R: Polynomial Fit
# polynomial fit

f <- function(x,a,b,c){a*x^2 + b*x + c}

fit <- nls(y~f(x,a,b,c), data = …, start = c(a=1, b=1, c=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2], c=co[3]), add=TRUE, col="pink", lwd=2)
R: Exponential Fit
# exponential fit

f <- function(x,a,b){a*exp(b*x)}

fit <- nls(y~f(x,a,b), data = …, start = c(a=1, b=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2]), add=TRUE, col="green", lwd=2)
R: Logarithmic Fit
# logarithmic fit

f <- function(x,a,b){a*log(x) + b}

fit <- nls(y~f(x,a,b), data = …, start = c(a=1, b=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2]), add=TRUE, col="orange", lwd=2)
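
Putting the pieces together, a self-contained sketch with simulated data (all values are illustrative):

set.seed(42)
x <- seq(0, 3, by = 0.1)
y <- 2 * exp(1.2 * x) + rnorm(length(x), sd = 2)  # exponential growth plus noise
d <- data.frame(x, y)

f <- function(x, a, b) { a * exp(b * x) }
fit <- nls(y ~ f(x, a, b), data = d, start = c(a = 1, b = 1))
co <- coef(fit)

plot(d$x, d$y)
curve(f(x, a = co[1], b = co[2]), add = TRUE, col = "green", lwd = 2)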
Under Fitting and Over Fitting
• If you fit a straight line through data points with a non-linear functional relationship, then you will not be able to describe the behavior of the data well. This is called Under Fitting.

• If you define a suitably complex function, you can get it to pass through all your data points (like with splines). However, this does not mean that the features in your function really exist! They are probably caused by statistical noise. This is called Over Fitting.

Check for Under / Over Fitting

• Look at the data points and the fit and use your brains!

• Ask yourself: does the fit describe the data well?

• Ask yourself: is the function you have used the simplest one that could describe the data?

• Test the fit on a subset of the data (training set)

What is the best fit?

A. Visual inspection: choose the most simple function that looks good.

B. Separate the available data in ‘training data’ (> 90%) used to fit a model, and ‘test data’ (the rest) to test the model. Use the least squares method to calculate the distance between predictions (values calculated with the fitted function) and observations (measured values) of the test data.
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John von Neumann
Residual deviance

• A residual is the difference between the observed value o_i and the predicted/expected value e_i.

• Residual deviance is the sum of the absolute (i.e. made positive) values of the residuals. The higher the residual deviance, the worse the fit.


Homework:
R² (R squared) vs Pearson's correlation r

• Briefly describe the difference between the two concepts above.

Homework: three simple exercises
‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI

Data Analytics Process

Research Question? → Find Data Sets → Import → Clean (Tidy) → Transform → Visualize / Model (increasing insight) → Communicate (Target Audience?)

What do we do with other types of data?

1. Image datasets
2. Natural Language

Photos and Movies: Image Processing

How do machines store images?
• R, G, B (and Alpha) channels
• 1. Grayscale pixel values
• 2. Mean pixel value of channels
• 3. Extract edge features
Image processing tasks:
• Segment an image into useful regions

• Perform measurements on certain areas

• Determine what object(s) are in the scene

• Calculate the precise location(s) of objects

• Visually inspect a manufactured object

• Construct a 3D model of the imaged object

•  Find “interesting” events in a video


Magick package

https://cran.r-project.org/web/packages/magick/vignettes/intro.html
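A minimal sketch of the package in use (the image URL is the demo picture from the vignette linked above; treat it as an assumption):

library(magick)

img <- image_read("https://jeroen.github.io/images/frink.png")
image_info(img)                   # width, height, format
small <- image_scale(img, "150")  # scale to 150 pixels wide
edges <- image_edge(small)        # extract edge features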
NLP
Natural Language processing
NLP: many problems
Headlines:

§ Enraged Cow Injures Farmer with Ax

§ Teacher Strikes Idle Kids

§ Hospitals Are Sued by 7 Foot Doctors

§ Ban on Nude Dancing on Governor’s Desk

§ Iraqi Head Seeks Arms

§ Stolen Painting Found by Tree

§ Kids Make Nutritious Snacks

§ Local HS Dropouts Cut in Half

NLP: even more ideas to solve it
Term Frequency - Inverse Document Frequency (TF-IDF)

Source: https://www.quora.com/What-is-a-tf-idf-vector
TF-IDF
• Term Frequency (TF) is the ratio of the number of times a word occurred in a document to the total number of words in the document.

• Inverse Document Frequency (IDF) is the logarithm of (the total number of documents divided by the number of documents containing the word).
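
In formula form (a standard formulation; the notation is mine):

tf(t, d) = (number of times term t occurs in document d) / (number of words in d)
idf(t) = log( N / (number of documents containing t) ), with N the total number of documents
tfidf(t, d) = tf(t, d) * idf(t)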

The new Feature TF-IDF is:

• Highest when a word occurs many times within a small number of documents (thus lending high discriminating power to these documents)
• Lower when the term occurs in many documents (thus offering a less pronounced relevance signal)
• Lowest when the term occurs in virtually all documents
Grammar is the way in which words are
put together to form proper sentences
What do they all have in common?

All of them use feature extraction.


The Curse of Dimensionality
Often a Data Set has so many features (variables) that a Data Analyst does not know where to begin: he/she suffers from The Curse of Dimensionality.

Up to now we looked at Feature Selection to reduce the number of variables (dimensionality reduction): Which features contribute the most to the studied effect?

It is sometimes better not to select existing features (variables), but to construct new features from the existing features. We call this Feature Extraction.

Regression

“How does the dependent variable change when the independent variable(s) change?”

y = b0 + b1*x + e, where:
• b0 and b1 are known as the regression beta coefficients or parameters:
  ◦ b0 is the intercept of the regression line; that is, the predicted value when x = 0.
  ◦ b1 is the slope of the regression line.
• e is the error term (also known as the residual error), the part of y that cannot be explained by the regression model.
Regression in R
The residuals are the difference between the actual values and the predicted values.

So how do we want to interpret this?

The median should be around 0, as we want our predictions to be symmetrical on both sides.
Q - Q Plot
A Q–Q plot (quantile-quantile plot) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.

> # Assuming a fitted model, e.g. model <- lm(dist ~ speed, data = cars)
> qqnorm(resid(model), pch = 1, frame = FALSE)
> qqline(resid(model), col = "steelblue", lwd = 2)


Regression in R
Residual standard error: The residual standard error is a measure of how well the model fits the data.

R-squared: It tells us what percentage of the variation within our dependent variable the independent variable is explaining. In other words, it’s another method to determine how well our model is fitting the data.
Regression in R
With linear regression we are building a linear model of

y = b0 + b1*x
y = 0.16557*x + 8.28931

The standard error is telling us how much uncertainty there is in our coefficient. It is often used to create confidence intervals.
Reminder: how did we do it so far
Overweight?

Extract a new feature: BodyMassIndex <- Weight / Height^2

Add a column with the Feature

Use this new feature for predictions


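A sketch of that feature extraction in R (the toy data frame is mine):

library(dplyr)

people <- data.frame(Weight = c(70, 95, 58),        # kg
                     Height = c(1.80, 1.75, 1.62))  # m

people <- people %>%
  mutate(BodyMassIndex = Weight / Height^2)  # add the extracted feature
people
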
All together

With what we have learned so far we can distinguish 7 main properties when it comes to data visualization.
The basis: first three of seven elements
• Data: the actual variables to be plotted

• Aesthetics: visual characteristics that represent data, e.g. position, size, color, shape, transparency

• Geometries: the shapes we use to represent our data

Source: http://www.science-craft.com/category/data-visualisation/
Three more, advanced elements

• Facets: rows and columns of sub-plots

• Statistics: summaries and mathematical models

• Coordinates: the plotting space we are using

Finally, add the design element.

• Theme: non-data (meta-data or eye candy)
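
A sketch showing all seven elements at once on the midwest data used earlier (the aesthetic and theme choices are mine):

library(ggplot2)

ggplot(midwest, aes(x = area, y = poptotal, color = state)) +  # data + aesthetics
  geom_point() +                            # geometry
  geom_smooth(method = "lm", se = FALSE) +  # statistics: a fitted line
  facet_wrap(~ state) +                     # facets: one sub-plot per state
  coord_cartesian(ylim = c(0, 1e6)) +       # coordinates: zoom the y-axis
  theme_minimal()                           # theme: non-data styling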
What to do / good advice for your data
Indicate measurement
errors
Distinguish between
measurements (current, history)
and predictions (future), e.g. by
using solid and dotted lines.
Show, emphasize trend
lines and patterns.
Look at Data Sets from
every angle!
Fitting a function (a very
simple mathematical
model)
Check for Under / Over Fitting
• Look at the data points and the fit and use your brains!

• Ask yourself: does the fit describe the data well?

• Ask yourself: is the function you have used the simplest one that could describe the data?

• Tune the fit on a subset of the data (training set) and test it on the remaining (labelled) data (testing set).

What features count …

Things to look for and check match:

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes

Some Examples 1

Source: https://www.theguardian.com/news/datablog/2011/mar/08/international-womens-day-pay-gap#_
Some Examples 2
Some Examples 3
(Annotations on the slide: Data still visible · Scales selected · Geometry: colored red dots)

Some Examples 4
(Annotation on the slide: Statistical element added)
Florence Nightingale

https://www.sciencenews.org/pictures/mathtrek/112608/nightingale.swf
Bad Practice (a series of bad-practice examples was shown on the slides)
Typical Exam
Questions
OLD EXAM: WE DID NOT COVER EVERYTHING! NO ANSWERS WILL BE PROVIDED FOR THAT EXAM; THIS IS JUST ANOTHER EXAMPLE.
Theory Exam

• See sample questions


Exam Example Questions
Shotgun questions (a short answer is sufficient)
1. Discuss the four V’s that are often used to describe what Big Data is.
2. What is the difference between a bar graph and a histogram?
3. Mention at least two file formats and discuss the differences between them.
4. Suppose you have a data set with many variables. Which R function can be used to quickly investigate the relations between all the variables?



Case Study
2. Visualization Design
Since 2014 there have been earthquakes in Groningen, a region in the north of Holland where natural gas is pumped up. Initially NAM, the responsible company, denied responsibility for the earthquakes and the collateral damage to houses. In 2014 the Dutch government decided to put a cap on the quantity of gas that could be pumped up per year. This cap was lowered in the subsequent years, when the earthquakes did not stop.
• 2014: Decision to limit the extraction of natural gas to 42.5 billion cubic meters, with 80% around Loppersum (where the heaviest quakes occurred).

• January 2015: Decision to lower the cap to 39.4 billion cubic meters.

• June 2015: Decision to lower the cap to 30 billion cubic meters.

The NAM decided not to reduce gas production at every pump location, but only at a select number of pump locations (marked with an orange colour on the map below).








Translation Dutch – English
Meer gaswinning – More gas extraction
Minder gaswinning – Less gas extraction
Groninger gasveld – Natural gas field of Groningen
Actieve productie lokaties – Active production locations
Productie lokaties met verminderde gaswinning – Production locations with reduced gas extraction
Gasleidingen – Gas pipelines





1. Suppose you have earthquake data (date, time,
latitude, longitude, magnitude and depth) and
information about NAM pumps (latitude, longitude,
quantity of gas pumped up each month). Design and
draw a visualization that shows whether there is a
relationship between the reduction of pumped up
natural gas and the earthquakes.
Good Plot

• What are the properties of a good plot? Can you describe 3 of the properties? And give examples of bad practices?
