
‘Data Analytics & Visualisation’

Minor ‘Data Science’


Hogeschool Rotterdam, CMI
Week 1

Skills of a Data Scientist


Let us introduce
ourselves …
Programme

Statistics · Visualization · Project · Machine Learning · Privacy Engineering

On Demand: Programming in Python, R and Shiny
Practical Information
• Data Visualisation on Thursday 9:00 - 11:30

• Project lessons Thursday 12:00 - 16:00.

• Other days for homework, project group working meetings.

• We use Microsoft Teams to share learning materials and communicate about the course.

• Assessment will be done with a written Exam and a Personal Visualisation Assignment. The final grade is the average of the two partial grades.

Big Data
Data Explosion
Insight, Decision, Action
Why Visualization?

We live in the age of Big Data.


Human beings process visual information
60,000 times faster than textual information!
(https://rhdeepexploration.wordpress.com/2011/12/05/visuals-60000-times-faster/)


Useful in two phases…

1. In the exploration phase, for yourself: trying to understand and possibly find hidden patterns in the data set.

2. In the presentation phase, for your audience: trying to communicate insights and trigger actions.
Static Infographics
Tangible Visualisations
Data Animations

source: https://www.youtube.com/watch?v=4gIhRkCcD4U
Interactive Visualisations

source: http://getdolphins.com/blog/interactive-data-visualizations-new-york-times/
In-Class Assignment 1.1

Look for interesting visualizations of the coronavirus pandemic that is now sweeping over the world.
• What data was used for the Viz?
• What story is told by the Viz?

Truth or Beauty?
Terminology?
• Absolute values

• Relative values (ratios, percentages, per capita numbers)

• Cumulative values

• Logarithmic scale

• Aggregate

• Filter

• Summarize

• …

R for data collection, filtering, cleaning, wrangling, slicing, dicing, munching, crunching, modelling, …, graphing, plotting and drawing.

Shiny for user interaction.
Assignment 1.2
Follow the tutorial on R Data Structures and Graphics.
Make notes of things you don’t understand.

https://www.w3schools.com/R/r_vectors.asp
Base Data Types
Numbers / Discrete Numbers: Number of children, Floor in a building, …

Numbers / Continuous Numbers: Temperature, Time, Length, …

Text / Western Script (UTF-8): “Rotterdam”; Other Scripts: Chinese, Arabic, …

Logical: TRUE, FALSE

Categories: Man || Woman, Pass || Fail || Inconclusive…

Real World Data Types


Container Data Structures
• Vector: one dimension, elements have the same data type

• List: one dimension, elements may have different data types

• Matrix: two dimensions, elements have the same data type

• Dictionary: key-value pairs, values may have different data types

• Table or Data.Frame: each column is a vector!
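
For example, a minimal R sketch of these containers (the example values are mine):

v <- c(2, 4, 6)                       # vector: one dimension, same type
l <- list(1, "Rotterdam", TRUE)       # list: one dimension, mixed types
m <- matrix(1:6, nrow = 2, ncol = 3)  # matrix: two dimensions, same type
d <- list(name = "Jan", city = "Utrecht")  # a named list acts as a dictionary
df <- data.frame(city = c("Amsterdam", "Berlin"), rank = c(1, 2))
str(df)  # each column of the data frame is a vector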

Interactive R Demo
Example Data Sets
• R contains a number of example Data Sets

• Display available Data Sets in R: > data()

• Once you have chosen one, you can find its data structure with: > str(data.set)

• And its description with: > help(data.set)

Exploring Large Data Sets

• To see the first few rows use: > head(data.set)

• To see the last rows use: > tail(data.set)

• To determine the size use: > length() for a vector, > nrow() and > ncol() for a data frame

• Parts of the data can be selected with square brackets data.set[…], e.g. > data.set[3, 4], > data.set[1:5,] or > data.set["column name"]

• To get the contents of a single column (one attribute) use: > data.frame$column.name
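
A short walk-through of these commands on the built-in iris data set (output omitted):

> str(iris)               # structure: 150 obs. of 5 variables
> head(iris)              # first six rows
> tail(iris)              # last rows
> nrow(iris); ncol(iris)  # dimensions of a data frame
> iris[3, 4]              # one cell: row 3, column 4
> iris[1:5, ]             # rows 1-5, all columns
> iris["Species"]         # one column by name (still a data frame)
> iris$Sepal.Length       # one column as a vector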
Basic Plotting
• Graph or Scatter diagram: > plot(x,y)

• Histogram: > hist(x)

• Barplot: > barplot(x)

• Many more possibilities can be found in the package ggplot2: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf. We will come back to this package later. For now, we use plotting from the base package.
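
For instance, a quick sketch with data sets that ship with R:

> plot(women$height, women$weight)  # scatter diagram of two variables
> hist(islands)                     # histogram of a numeric vector
> barplot(precip[1:10])             # barplot of the first ten observations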

Line Graph
Scatter Diagram
Histogram versus Bar Graph
Assignment 1.3
Install RStudio on your laptop. Explore some of the data sets that are packaged with R.
• What container data structure, what data types?
• Which plots can be used to explore the data?

Install RStudio on your laptop

https://www.rstudio.com/products/rstudio/download/
Homework Assignment: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable
names!)

• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.

• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.

• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.

• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5

• Make one more plot that shows something interesting about the differences
between the species of Irises.

Iris
If something is unclear or you need additional help, please contact me!

Email: wypaz@hr.nl

Or send me a msg on Teams!


‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 2

Any Questions, New Insights?


https://rstudio-education.github.io/hopr/
Part 1, Chapters 1-3

Homework Last Week: Iris


Homework Last Week: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable
names!)

• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.

• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.

• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.

• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5

• Make one more plot that shows something interesting about the differences
between the species of Irises.

Recap …
base data types, vectors,
data.frames, subsetting,
visualizing

Base Data Types


Numbers / Discrete Numbers (whole numbers, counts): Number of children, Floor in a building, …

Numbers / Continuous Numbers (fractional numbers, e.g. weight): Temperature, Time, Length, …

Text / Western Script (UTF-8): “Rotterdam”; Other Scripts: Chinese, Arabic, …

Logical (a.k.a. Boolean): TRUE, FALSE

Categories (in R: Factors with Levels): Man || Woman, Pass || Fail || Inconclusive…

Container Data Types

Homogeneous vs. Heterogeneous:

0D (a single value): 25
1D, homogeneous: Vector, c(2, 4, 6) or c("Amsterdam", "Berlin")
1D, heterogeneous: List, list(1, "Rotterdam", TRUE)
2D, homogeneous: Matrix
2D, heterogeneous: Data.Frame
multi-D, homogeneous: Array
Vectors
• Construct your own with: c(1, 2, 3)
• A vector can be named: my.vector <- c("A", "B", "C")
• Also the elements of a vector can be named (e.g. built-in data set islands). Use names() to manipulate them.

> str(islands)
Named num [1:48] 11506 5500 16988 2968 16 ...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> head(islands)
Africa Antarctica Asia Australia Axel Heiberg Baffin
11506 5500 16988 2968 16 184
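
A small sketch of manipulating those names (the vector is mine):

> my.vector <- c(11, 7, 3)
> names(my.vector) <- c("First", "Second", "Third")  # attach names
> my.vector["Second"]   # select an element by its name
> names(my.vector)      # read the names back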

Data Frames (1 of 2)
• Construct your own with: data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• A data frame can be named: my.df <- data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• Also the rows and columns of a data frame can be named (e.g. built-in data set women). Use rownames() or colnames() to manipulate them.

Data Frames (2 of 2)
> str(women)
'data.frame': 15 obs. of 2 variables:
 $ height: num 58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

Each row is an Observation; each column is a Variable.

Filtering / Subsetting with […] and $

Vector, one dimensional: index with one number, e.g. my.vector[3], or one name, my.vector["First"], or
• A vector with selected elements, e.g. c(2, 4, 5)
• A vector with names
• A logical statement, that is TRUE or FALSE for each element

Data Frame, two dimensional: index with two numbers separated by a comma, e.g. my.df[3,1], or two names, my.df["First", "B"], or
• Two vectors with selected elements
• Two vectors with names
• Two logical statements, that are TRUE or FALSE for each element
• If you leave the space in front of or behind the "," empty, everything is selected

For convenience it is possible to select a complete column (which is an ordinary vector) of a Data Frame with "$".
E.g. weights <- women$weight
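
A few subsetting examples on the built-in women data set, illustrating the rules above:

> women[3, 1]                 # one cell: row 3, column 1
> women[1:5, ]                # rows 1-5; empty space behind "," selects all columns
> women[, "weight"]           # one column by name
> women[women$height > 65, ]  # logical statement, evaluated per row
> weights <- women$weight     # complete column as a vector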

Predefined R Functions

> help()
> str()
> length()
> head()
> nrow()
> ncol()
> summary()
> table()
> plot()
> barplot()
> boxplot()

Things to know about a function: its input (actual parameters), its output (return value), default parameter values, and side effects (e.g. generating a plot).
Explore real-world Data Sets
Discover?

Import

Clean

Transform

Visualize
Import Data Sets from files on the Internet
Data Structures in Memory
(Vector, Matrix, Data.Frame,
List, …) are different from Data
Structures stored in a File.
Data File Formats
TXT, CSV, JSON, GeoJSON, XML, HTML, JPEG, PNG, MP3, AVI, …

Lots of File Formats


TXT
CSV: Comma Separated
Values
HTML, XML
JSON
JavaScript Object Notation

• Lists: ["Amsterdam", "Buenos Aires", "Chicago"]

• Dictionaries: {"Name": "Jan Kroon", "City": "Utrecht", "Children": ["Lente", "Muara"]}

• Lists of Dictionaries, Dictionaries containing Lists.

R Code Hints
• > read.csv() for reading "comma separated value" files (".csv").
• > read.csv2() variant used in countries that use a comma "," as decimal point and a semicolon ";" as field separator.
• > read.delim() for reading "tab-separated value" files (".txt"). By default, a point (".") is used as decimal point.
• > read.delim2() for reading "tab-separated value" files (".txt"). By default, a comma (",") is used as decimal point.
• > install.packages("rjson")
> library("rjson")

example:

json_file <- "http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"

json_data <- fromJSON(paste(readLines(json_file), collapse=""))
Visualisation to explore data sets

What would you like to explore?

Source: Dr. Andrew Abela “Chart Chooser”
2D Math / Stats plots
Line Plot
• Purpose: Explore the development of variable
values over time. Is the variable almost constant?
Does it increase or decrease? How fast does it
change: linear growth? explosive growth?
exponential growth?

• Example:
> time <- seq(from=0, to=10, by=0.1)
> growth <- time*time
> plot(x=time, y=growth, type="l")
visual comparison
over time



Trend Line

Ice Cream Sales


Seasonal / Cyclic Pattern
Barplot
• Purpose: Get an impression of differences between
variables.

• Example: built-in data set precip

> barplot(precip, horiz = TRUE)

You get a stacked barplot if the height argument is a matrix (e.g. barplot(as.matrix(df))).

visual comparison
between variables


Boxplot
• Purpose: Get an impression of the distribution of a
variable in a data set: Center, Quartiles, Outliers…?

• Example: built-in data set islands

> boxplot(islands)

visual inspection
of distribution

Histogram
• Purpose: Get an impression of the distribution of a variable in a data set: Symmetrical or Skewed? Uniform distribution? Normal distribution? Or otherwise?

• Example: built-in data set islands

> hist(islands, breaks=10)

visual inspection
of distribution

Scatter plot (Dutch: strooidiagram)

• Purpose: Compare two variables of the same person or object (Dutch: proefpersoon or proefobject). Is there a relationship?

• Example: built-in data set women

> plot(x=women$height, y=women$weight)

visual inspection
of relationship

3D plot
> persp(volcano)
Heat Map
> image(volcano)
Contour Graph (isobars, contour lines; Dutch: isobaren, hoogtelijnen)
> contour(volcano)
Multiple Data Sets in a single plot

• plot() creates a new plot

• points(), lines(), text(), legend(), … add data to the existing plot.

• boxplot() with multiple arguments plots multiple box-plots side by side.

visual comparison
between variables
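
A minimal sketch of combining several of these calls in one figure (colors and labels are my own choices):

> plot(women$height, women$weight, col = "blue",
       main = "Height vs weight", xlab = "Height (in)", ylab = "Weight (lb)")
> lines(lowess(women$height, women$weight), col = "red")  # add a smooth trend
> legend("topleft", legend = c("observations", "trend"),
         col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))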

Geospatial Plots
London Cholera Map
Dr. John Snow,1854

Coordinates on Earth
Longitude, Latitude
Different Projections …
… result in different maps, with different properties:

https://learn.canvas.net/courses/464/pages/unit-3-dot-7-map-projection-properties

Dates and Times


Time Zones

https://www.timeanddate.com/time/map/
Date and Time formats
Quick Demo
> df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
> as.Date(df$Date, "%m/%d/%Y %H:%M:%S")

> # The functions below come from the lubridate package
> library(lubridate)
> mytime <- ymd_hms("2015-08-14-05-30-00", tz="America/Halifax")
> mytime

> # Leap year?
> leap_year(mytime)

> # Time differences?
> date1 <- ymd_hms("2017-06-20-03-45-23")
> date2 <- ymd_hms("2017-10-07-21-02-19")
> difftime(date2, date1)

Full tutorial: https://rpubs.com/mr148/303800


Advanced package
• tidyverse: The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

> install.packages("tidyverse")
> library("tidyverse")
• It contains a lot of packages that are useful in data science; for this course: ggplot2.

ggplot2
• Midwest dataset: a build in dataset

• Try on your own: 



> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))

ggplot2
• Midwest dataset: a build in dataset

• Try on your own: 



> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))
+ geom_point()

You need to specify what kind of graph you want to draw!


ggplot2
> g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() + geom_smooth(method="lm")
# set se=FALSE to turn off confidence bands

# Delete the points outside the limits

> g + xlim(c(0, 0.1)) + ylim(c(0, 1000000))

In Class Assignment 2.1

1. Use the (built-in) data set Seatbelts to visualize the impact of the introduction of the law to use seatbelts. (First turn Seatbelts into data.frame format with data.frame(Seatbelts).)

2. Use the (built-in) data set LakeHuron to explore whether there is a seasonal pattern in the water levels. What is the trend?

3. Use the (built-in) dataset state.x77.
a) Make sure the object is a data frame; if not, change it to a data frame.
b) Find out how many states have an income of less than 4300.
c) Find out which is the state with the highest income.
d) What are possible causes of high murder rates?

‘Data Analytics & Visualisation’


Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 3

Seatbelts Data Set

(The plots exploring this data set were shown on the slides.)
Homework week 2
Exercise 2:

With the dataset swiss, create a data frame of only the rows 1-7 and only the variables Examination, Education and Infant.Mortality.
b) Create a row that will be the total sum of the columns; name it Total.

> tm <- swiss[1:7, c('Examination', 'Education', 'Infant.Mortality')]
> tm["Total", ] <- colSums(tm)
c) Create a new variable, swissbe, that is the proportion of Examination (Examination / Total):
> tm$swissbe <- tm$Examination / tm$Examination[length(tm$Examination)]

Homework week 2
Exercise 3
For the dataset state.x77

a. Remove column Frost

> sta <- data.frame(state.x77)  # make sure we work with a data frame
> keep <- c("Population", "Income", "Illiteracy", "Life.Exp", "Murder", "HS.Grad", "Area")
> sta <- sta[keep]

b. Add a variable to the data frame which should categorize the level of illiteracy: [0,1) is low, [1,2) is some, [2, inf) is high.
> sta$illlvl <- ifelse(sta$Illiteracy < 1, 'low',
    ifelse(sta$Illiteracy < 2, 'some', 'high'))
Different data structures
Data Cleaning
How is it done in R?
Data transformation
> install.packages('dplyr')
This package will allow you to manipulate the data easily. (read_delim() below comes from the readr package, also part of the tidyverse.)

An example:

There is a dataset in Teams which we will use to work on:

> df <- read_delim("heartatk4R.txt")

What message do you get on the screen? How can we fix it?
Data transformation (continued)
A fix: specify the delimiter and the column types explicitly:

df <- read_delim("heartatk4R.txt", "\t",
  col_types = cols(AGE = col_integer(),
    DIAGNOSIS = col_character(), DIED = col_character(),
    DRG = col_character(), LOS = col_integer()))
Data transformation
The dplyr verbs (the code examples themselves were shown on the slides):

• %>% : the pipe operator; data is sent to the next step

• arrange(): sort in ascending order; desc(AGE) for descending order

• mutate(): adds new variables and preserves existing ones

• filter(): filter the dataset according to the conditions given

Can you explain what this code does?

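The original code was shown as an image; here is a sketch of the kind of pipeline these slides illustrate, using the heartatk4R columns from above (the exact chain on the slide may differ):

library(dplyr)

df %>%
  filter(AGE >= 40) %>%            # keep only rows matching a condition
  mutate(LONG_STAY = LOS > 7) %>%  # add a new variable, keep existing ones
  arrange(desc(AGE)) %>%           # sort in descending order of age
  head()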

Missing values
df <- df %>% drop_na()
But this might lead to problems such as:

- data bias

- data loss
Possible solutions:

- add more data: impute values (mean, median), linear regression, and many more …


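A sketch of both routes, assuming a data frame df with a numeric column LOS as in the earlier example:

library(dplyr)
library(tidyr)

df_dropped <- df %>% drop_na()   # remove all rows with missing values

df_imputed <- df %>%             # or: impute the column mean instead
  mutate(LOS = ifelse(is.na(LOS), mean(LOS, na.rm = TRUE), LOS))
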
Outliers
By now you should know how to identify whether your data has any outliers.

How to deal with outliers?

1. Remove rows with outliers from your dataset

2. Consider outliers & inliers separately

3. Remove & replace via imputation


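One common way to identify them in R is the boxplot rule (values beyond 1.5 * IQR from the quartiles); a small sketch on the built-in islands data:

out <- boxplot.stats(islands)$out    # values flagged as outliers
islands[islands %in% out]            # inspect them
clean <- islands[!islands %in% out]  # option 1: remove them
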
Remove duplicates
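The code on the slides was an image; a minimal sketch of the usual approach (the toy data frame is mine):

df2 <- data.frame(name = c("An", "Bo", "An"), age = c(21, 35, 21))
duplicated(df2)           # TRUE for rows that repeat an earlier row
df2[!duplicated(df2), ]   # keep the first occurrence of each row
# or, with dplyr: distinct(df2)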
Modify Data Elements: gsub()
> # A Dutch-formatted number, e.g. dutch.number <- "3,14" (example value is mine)
> dutch.number
> # Cast number to a character string
> d.n <- as.character(dutch.number)
> # Substitution
> result <- gsub(",", ".", d.n)
> # Cast character string to num
> international.number <- as.double(result)
Sorting a data.frame by
column
> head(mtcars)
> # Sort a column (a vector)
> sort(mtcars$mpg)
> # order() returns the sorting indices
> order(mtcars$mpg)
> # Sort the whole data.frame
> mtcars[order(mtcars$mpg),]
> mtcars[order(-mtcars$mpg),]
> mtcars[order(mtcars$mpg, -mtcars$cyl), ]
Merging two data.frames
> head(area, 3)
Continent Country Land.Area.2013
1 Europe Netherlands 16164
2 Europe Belgium 11787
3 Europe France 210026
> head(inhab, 3)
Continent Country Inhabitants.2016
1 Europe Netherlands 16987330
2 Europe Belgium 11358379
3 Europe France 64720690
> merge(area, inhab)
Continent Country Land.Area.2013 Inhabitants.2016
1 Asia China 3700000 7466964280
2 Asia India 1240000 1324171354
3 Europe Belgium 11787 11358379
Aggregating data
> aggregate(sales$Cars.Sold, list(sales=sales$Year), sum)
sales x
1 2001 209
2 2002 209
3 2003 209
> aggregate(sales$Cars.Sold, list(sales=sales$Month), sum)
sales x
1 April 54
2 August 27
3 December 48
4 February 21
5 January 36
6 July 51
7 June 99
8 March 45
9 May 75
10 November 54
11 October 69
12 September 48
Subsetting
> df[ , 2]
> df[2, ]
> df[2, 2]
> df[df$var1 == "Male", ]
> subset(df, var1 != "Female")
Add trend lines

> abline(a=0, b=1, col="blue")

a denotes the intercept

b denotes the slope

y = a + b*x
Homework / In class

https://www.kaggle.com/code/rtatman/data-cleaning-challenge-cleaning-numeric-columns/notebook
‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 4

Explore real-world Data Sets


Discover?

Import

Clean

Transform

Visualize
Practical Problems …

You can’t find data. → CBS, Kaggle, https://datahub.io/collections, Google Data Set Search, …

Data is polluted, in the wrong format. → clean!, gsub(), use the lubridate library for Dates and Times, use options of read.csv(): stringsAsFactors = F

Too much data. → filter(), subset(), grep()

Data is distributed over multiple Data Sets. → merge()

Data is too detailed, need summaries, totals per category. → aggregate()
Data Transformations:
filtering, sorting, wrangling, slicing, dicing, munching, crunching, merging, aggregating, …

Matching of arguments
R functions arguments can be matched positionally or by
name. So the following calls to sd are all equivalent
> mydata <- rnorm(100)
> sd(mydata)
> sd(x = mydata)
> sd(x = mydata, na.rm = FALSE)
> sd(na.rm = FALSE, x = mydata)
> sd(na.rm = FALSE, mydata)

Even though it’s legal, messing around with the order of the arguments too much is discouraged, since it confuses fellow developers.

Add trend lines

> abline(a=0, b=1, col="blue")

a denotes the intercept

b denotes the slope

y = a + b*x
Visualisation to
present data sets
Predefined R Functions

> help()
> str()
> head()
> nrow()
> ncol()
> summary()
> plot()
> install.packages()
> library()
> merge()
> aggregate()

Things to know about a function: its input (actual parameters), its output (return value), default values, and side effects.
User Defined R Functions
Functions can be created using the function() keyword and are stored as R objects just like anything else. In particular, they are R objects of class “function”.
f <- function(<formal parameters>) {
return(variable)
}
Functions in R are “first class objects”, which means that they can be treated much like any other R object.
Importantly,
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside of another function.
• The return value of a function is the last expression in the function body to be evaluated.
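
A small sketch of a user-defined function (the example is mine; it computes a body mass index, which we will meet again later):

bmi <- function(weight, height) {
  # weight in kg, height in m; the last expression is the return value
  return(weight / height^2)
}

bmi(80, 1.85)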

Shiny for User Interaction


User Interface
Shiny generates HTML
Data-driven Outputs
Recap UI
Server Function
See what this code does
ui <- fluidPage(
selectInput("dataset", label = "Dataset", choices = ls("package:datasets")),
verbatimTextOutput("summary"),
tableOutput("table")
)

▪ fluidPage() is a layout function that sets up the basic visual structure of the page.
▪ selectInput() is an input control that lets the user interact with the app by providing a value. In this case, it’s a select box with the label “Dataset” and lets you choose one of the built-in datasets that come with R.
▪ verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put rendered output: verbatimTextOutput() displays code and tableOutput() displays tables.

See what this code does


server <- function(input, output, session) {
output$summary <- renderPrint({
dataset <- get(input$dataset, "package:datasets")
summary(dataset)
})

output$table <- renderTable({


dataset <- get(input$dataset, "package:datasets")
dataset
})
}
shinyApp(ui, server)

See what happens if you remove the red or the green box!

Reactive programming
Reactive programming is another programming paradigm: it is programming with asynchronous data streams.

You are able to create data streams of anything, not just from click and hover events. Streams are cheap and ubiquitous; anything can be a stream: variables, user inputs, properties, caches, data structures, etc.

The key idea of reactive programming is to specify a graph of dependencies so that when an input changes, all related outputs are automatically updated.
The input argument is a list-like object that contains all the input data sent
from the browser, named according to the input ID. For example, if your UI
contains a numeric input control with an input ID of count, like so:
ui <- fluidPage(
numericInput("count", label = "Number of values",
value = 100)
)

then you can access the value of that input with input$count. It will
initially contain the value 100, and it will be automatically updated as the
user changes the value in the browser.

Unlike a typical list, input objects are read-only. If you attempt to modify an input inside the server function, you’ll get an error:
server <- function(input, output, session) {
input$count <- 10
}

shinyApp(ui, server)
#> Error: Can't modify read-only reactive value
'count'

This error occurs because input reflects what’s happening in the browser, and the browser is Shiny’s “single source of truth”. If you could modify the value in R, you could introduce inconsistencies, where the input slider said one thing in the browser, and input$count said something different in R.

One more important thing about input: it’s selective about who is allowed to read it. To read from an input, you must be in a reactive context created by a function like renderText() or reactive().
Exercise
Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
Exercise
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:
Homework

Read this:

https://mastering-shiny.org/basic-app.html

https://mastering-shiny.org/basic-case-study.html
Page Lay-out with Panels
ui1 <- fluidPage(
  titlePanel("EduCode 'Functions of two variables'"),
  sidebarLayout(
    sidebarPanel(
      selectInput(inputId = 'chosen.function', label = 'Function description: ',
        choices = c('f(x,y) = x + y', 'f(x,y) = x * y',
          'f(x,y) = x^2 + y^2', 'f(x,y) = 100*sin(x + y)/sqrt(x^2 + y^2)')),
      sliderInput(inputId = 'angle', label = '3D view angle: ',
        min=0, max=360, value=90)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("3D Plot", plotOutput("Three.D.plot")),
        tabPanel("Contour Graph", plotOutput("Contour.graph")),
        # … (the slide cuts off here; closing brackets omitted)

more to explore: https://shiny.rstudio.com/articles/layout-guide.html


‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI Week 5

Exercise
Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
Exercise A
Create an app that greets the user by name. You don’t know all the functions you need to do this yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy and paste them into the right place in a Shiny app.

ui <- fluidPage(
  textInput("name", "What's your name?"),
  textOutput("greeting")
)

server <- function(input, output, session) {
  output$greeting <- renderText({
    paste0("Hello ", input$name)
  })
}
Exercise
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:

Exercise A
Suppose your friend wants to design an app that allows the user to set a number (x) between 1 and 50, and displays the result of multiplying this number by 5. This is their first attempt:
Data storytelling
Data storytelling is the concept of building a
compelling narrative based on complex data
and analytics that help tell your story and
influence and inform a particular audience.
WHY?
• Adding value to your data and insights.

• Interpreting complex information and highlighting essential key points for the audience.

• Providing a human touch to your data.

• Offering value to your audience and industry.



The star of
the show: DATA

• Think about your theory. What do you want to prove or disprove? What do you think the data will tell you?
• Collect data. Collate the data you’ll need to develop your story.
• Define the purpose of your story. Using the data you gathered, you should be able to write what the goal of your story is in a single sentence.
• Think about what you want to say. Outline everything from the intro to the conclusion.
• Ask questions. Were you right or wrong in your hypothesis? How do these answers shape the narrative of your data story?
• Create a goal for your audience. What actions would you like them to take after reading your story?

And this is where Data Visualisation comes in
• Reveal patterns, trends, and findings from an unbiased viewpoint.
• Provide context, interpret results, and articulate insights.
• Streamline data so your audience can process information.
• Improve audience engagement.

Build your narrative


As you tell your story, you need to use your data as
supporting pillars to your insights. Help your audience
understand your point of view by distilling complex
information into informative insights. Your narrative and
context are what will drive the linear nature of your data
storytelling.

Use visuals to enlighten
Visuals can help educate the audience on your theory. When
you connect the visual assets (charts, graphs, etc.) to your
narrative, you engage the audience with otherwise hidden
insights that provide the fundamental data to support your
theory. Instead of presenting a single data insight to support
your theory, it helps to show multiple pieces of data, both
granular and high level, so that the audience can truly
appreciate your viewpoint.

Show data to support
Humans are not naturally attracted to analytics, especially
analytics that lack contextualization using augmented
analytics. Your narrative offers enlightenment, supported by
tangible data. Context and critique are integral to the full
interpretation of your narrative. Using business analytic
tools to provide key insights and understanding to your
narrative can help provide the much-needed context
throughout your data story.

A Good Plot contains …
1. Title (What is plotted?)

2. Axis titles including Units

3. Numbers on all axes

4. Legend labeling of all lines or dots if more than one

5. Legible (Colours visible on screen, on projection and in print)

6. Source (Where do underlying Data Sets come from?)




Avoid Data
Decoration!
Stephen Few’s pitfalls
1. Exceeding the boundaries of a single screen
2. Supplying inadequate context for the data
3. Displaying excessive detail or precision
4. Expressing measures indirectly
5. Choosing inappropriate media of display
6. Introducing meaningless variety
7. Using poorly designed display media
8. Encoding quantitative data inaccurately
9. Arranging the data poorly
10. Ineffectively highlighting what’s important
11. Cluttering the screen with useless decoration
12. Misusing or overusing color
13. Designing an unappealing visual display

Levels of Understanding
1. Describe data sets (Descriptive Statistics /
Summary Statistics, Visualization)

2. Understand, explain some relations between variables (Inferential Statistics, Detect patterns)

3. Predict new, unseen, future values!

4. “What If …” analysis: predict the effect of possible actions.

New Terminology
• Feature: What we called Variable

• Label: The variable that we are interested in

• Regression: Predict future numerical outcomes based on historical data

• Classification: Predict future categorical outcomes based on historical (labelled) data

• Clustering: Group (unlabelled) Observations in clusters

You can reduce the dimensionality (number of variables) by selecting the most significant variables (variables that have the most influence on the outcome).

You can see the relation between two numerical variables in a scatter diagram. You can calculate this relation with Pearson’s correlation coefficient.
Pearson's Correlation Coefficient
is a measure of linear correlation between two sets of data.
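
As a sketch: the coefficient is the covariance of the two variables divided by the product of their standard deviations, which R computes with cor():

cor(women$height, women$weight)          # Pearson's r (the default method)
cov(women$height, women$weight) /
  (sd(women$height) * sd(women$weight))  # the same value, by the formula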
Correlation
• Strong / Weak?

• Direction?

• Linear / Non-linear?

• Pearson's correlation coefficient: a number between -1 and +1.

Nice Viz: https://rpsychologist.com/d3/correlation/

How to discover correlations?

• Make some plots! Of course …

• Typically make scatter plots of each pair of variables

• Can you see a relationship? Weak or Strong?

Demo: Wine Quality

• R makes this easy: pairs()
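
For example, a quick sketch on the built-in iris data (the wine data from the demo does not ship with R):

pairs(iris[, 1:4], col = iris$Species)  # scatter plot of every pair of variables
round(cor(iris[, 1:4]), 2)              # the matching correlation matrix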
Correlation does NOT
imply Causation
Types of causation
If A and B are correlated then there are different possibilities for
causation:

• A causes B

• B causes A

• C causes A and B (‘lurking factor’)

• A causes C which causes B (or vice versa, indirect causation)

• A causes B and B causes A (cyclic or bi-directional)

• There is no connection between A and B at all (coincidence)

Correlation

Source: xkcd.com/552
How to discover correlations?
• Make some plots! Of course …

• Typically make scatter plots of each pair of variables

• Can you see a relationship? Weak or Strong?

• Describe the relationship using functions (not only linear: y = a*x + b, but also quadratic, exponential)

R demo: USJudgeRatings

• R makes this easy: pairs()

Plots
In general, plot the independent values on the (horizontal) x-axis and the dependent values on the (vertical) y-axis.

y = f(x)

Multivariate dependencies:

z = f(x, y)

Some families of functions

• Linear, one variable: f(x) = a*x + b

• Linear, multiple variables: f(x,y) = a*x + b*y + c

• Polynomial (quadratic, cubic, ..): f(x) = a*x^2 + b*x + c, f(x) = a*x^3 + b*x^2 + c*x + d

• Exponential: f(x) = 10^x

• Logarithmic: f(x) = log(x)

• Gaussian: f(x) = a*exp(-((x-b)^2) / (2*c^2))

Why we need to be quantitative

Later on we are going to try to use some variables to predict others; this requires fitting a sensible function to the available data.

These problems come in three main categories:

1. You have a theoretical model for how the variables should be related
2. You have no theoretical model and have to guess something from the data
3. A combination of the two, due to e.g. some unexpected noise

Guessing functions
Often there is not a single right answer.

Which function is good enough?

• Needs to describe the major features of the data
• Should be minimal, as simple as will work
• May well not be unique; you can try fitting multiple functional forms and see which works best.

What features count …

Things to look for and check match:

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes

Be creative with scale on y-axis


Interpolation and
Extrapolation
Horse Manure Crisis (1894)
Interpolation and Prediction
• Interpolation is estimating a value between the two nearest known data points.

• Extrapolation (or Prediction if the Data Set is a time series) is estimating a value outside the range of the Data Set using all data points.

• The problem with Extrapolation / Prediction is that there will always be a trend break (Dutch: trendbreuk) somewhere in the future, but it is unknown when.

Fitting a function
Linear Regression
• We want to get a function that describes our data well, but we know that there are some uncertainties that cause some scatter in the data points.

• Linear function of one variable:
> fit1D <- lm(y~x) or glm(y~x)

• Linear function of two variables:
> fit2D <- lm(y~x+z) or glm(y~x+z)

• Get some statistics on how good the fit is:
> summary(fit1D)
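
A worked sketch on the built-in women data set (the variable choice is mine):

fit1D <- lm(weight ~ height, data = women)  # weight = a + b * height
summary(fit1D)                              # coefficients, residuals, R-squared
plot(women$height, women$weight)
abline(fit1D, col = "blue", lwd = 2)        # draw the fitted line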

R: Linear Fit, one variable

# linear fit (one variable)

fit <- glm(y~x)

co <- coef(fit)

abline(fit, col="blue", lwd=2)


R: Linear Fit, more variables

# linear fit (two variables)

fit <- glm(y~x+z)

co <- coef(fit)

persp(fit, …)  # note: persp() needs x, y, z grids; build a grid of predictions first
Non-Linear Regression
• We want to get a function that describes our data
well, but we know that there are some uncertainties
that cause some scatter in the data points.

• Non-Linear function: nls(y ~ f(x), data = …, start = list(p0 = …, p1 = …, …))

• (Calculating sensible starting parameters will make your life easier.)

R: Polynomial Fit
# polynomial fit

f <- function(x,a,b,c){a*x^2 + b*x + c}

fit <- nls(y~f(x,a,b,c), data = …, start = c(a=1, b=1, c=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2], c=co[3]), add=TRUE, col="pink", lwd=2)
R: Exponential Fit
# exponential fit

f <- function(x,a,b){a*exp(b*x)}

fit <- nls(y~f(x,a,b), data = …, start = c(a=1, b=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2]), add=TRUE, col="green", lwd=2)
R: Logarithmic Fit
# logarithmic fit

f <- function(x,a,b){a*log(x) + b}

fit <- nls(y~f(x,a,b), data = …, start = c(a=1, b=1))

co <- coef(fit)

curve(f(x, a=co[1], b=co[2]), add=TRUE, col="orange", lwd=2)
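
Putting the pieces together, a self-contained sketch with simulated data (all values are illustrative):

set.seed(42)
x <- seq(0, 3, by = 0.1)
y <- 2 * exp(1.2 * x) + rnorm(length(x), sd = 2)  # exponential growth plus noise
d <- data.frame(x, y)

f <- function(x, a, b) { a * exp(b * x) }
fit <- nls(y ~ f(x, a, b), data = d, start = c(a = 1, b = 1))
co <- coef(fit)

plot(d$x, d$y)
curve(f(x, a = co[1], b = co[2]), add = TRUE, col = "green", lwd = 2)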
Under Fitting and Over Fitting
• If you fit a straight line through data points with a non-linear functional relationship, then you will not be able to describe the behavior of the data well. This is called Under Fitting.

• If you define a suitably complex function, you can get it to pass through all your data points (like with splines). However, this does not mean that the features in your function really exist! They are probably caused by statistical noise. This is called Over Fitting.

Check for Under / Over Fitting

• Look at the data points and the fit and use your brains!

• Ask yourself: does the fit describe the data well?

• Ask yourself: is the function you have used the simplest one that could describe the data?

• Test the fit on a subset of the data (training set)

What is the best fit?

A. Visual inspection: choose the most simple function that looks good.

B. Separate the available data in ‘training data’ (> 90%) used to fit a model, and ‘test data’ (the rest) to test the model. Use the least squares method to calculate the distance between predictions (values calculated with the fitted function) and observations (measured values) of the test data.
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John von Neumann
Residual deviance

• A residual is the difference between the observed value o_i and the predicted/expected value e_i.

• Residual deviance is the sum of the absolute (i.e. made positive) values of the residuals. The higher the residual deviance, the worse the fit.


Homework:
R² (R squared) vs Pearson's correlation r

• Briefly describe the difference between the two concepts above.

Homework: three simple exercises
‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI

Data Analytics Process

Research Question? → Find Data Sets → Import → Clean (Tidy) → Transform → Visualize / Model (increasing insight) → Communicate (Target Audience?)

What do we do with other types of data?

1. Image datasets
2. Natural Language

Photos and Movies: Image Processing

How do machines store images?
• R, G, B (and Alpha) channels
• 1. Grayscale pixel values
• 2. Mean pixel value of channels
• 3. Extract edge features
Image processing tasks:
• Segment an image into useful regions

• Perform measurements on certain areas

• Determine what object(s) are in the scene

• Calculate the precise location(s) of objects

• Visually inspect a manufactured object

• Construct a 3D model of the imaged object

•  Find “interesting” events in a video


Magick package

https://cran.r-project.org/web/packages/magick/vignettes/intro.html
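A minimal sketch of the package in use (the image URL is the demo picture from the vignette linked above; treat it as an assumption):

library(magick)

img <- image_read("https://jeroen.github.io/images/frink.png")
image_info(img)                   # width, height, format
small <- image_scale(img, "150")  # scale to 150 pixels wide
edges <- image_edge(small)        # extract edge features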
NLP
Natural Language processing
NLP: many problems
Headlines:

§ Enraged Cow Injures Farmer with Ax

§ Teacher Strikes Idle Kids

§ Hospitals Are Sued by 7 Foot Doctors

§ Ban on Nude Dancing on Governor’s Desk

§ Iraqi Head Seeks Arms

§ Stolen Painting Found by Tree

§ Kids Make Nutritious Snacks

§ Local HS Dropouts Cut in Half

NLP: even more ideas to solve it
Term Frequency - Inverse Document Frequency (TF-IDF)

Source: https://www.quora.com/What-is-a-tf-idf-vector
TF-IDF
• Term Frequency (TF) is the ratio of the number of times a word occurred in a document to the total number of words in the document.

• Inverse Document Frequency (IDF) is the logarithm of (the total number of documents divided by the number of documents containing the word).
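
In formula form (a standard formulation; the notation is mine):

tf(t, d) = (number of times term t occurs in document d) / (number of words in d)
idf(t) = log( N / (number of documents containing t) ), with N the total number of documents
tfidf(t, d) = tf(t, d) * idf(t)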

The new Feature TF-IDF is:

• Highest when a word occurs many times within a small number of documents (thus lending high discriminating power to these documents)
• Lower when the term occurs in many documents (thus offering a less pronounced relevance signal)
• Lowest when the term occurs in virtually all documents
Grammar is the way in which words are
put together to form proper sentences
What do they all have in common?

All of them use feature extraction.


The Curse of Dimensionality
Often a Data Set has so many features (variables) that a Data Analyst does not know where to begin: he/she suffers from The Curse of Dimensionality.

Up to now we looked at Feature Selection to reduce the number of variables (dimensionality reduction): Which features contribute the most to the studied effect?

It is sometimes better not to select existing features (variables), but to construct new features from the existing features. We call this Feature Extraction.

Regression

“How does the dependent variable change when the independent variable(s) change?”

y = b0 + b1*x + e, where:
• b0 and b1 are known as the regression beta coefficients or parameters:
  ◦ b0 is the intercept of the regression line; that is, the predicted value when x = 0.
  ◦ b1 is the slope of the regression line.
• e is the error term (also known as the residual error), the part of y that cannot be explained by the regression model.
Regression in R
The residuals are the difference between the actual values and the predicted values.

So how do we want to interpret this?

The median should be around 0, as we want our predictions to be symmetrical on both sides.
Q - Q Plot
A Q–Q plot (quantile-quantile plot) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.

> # Assuming a fitted model, e.g. model <- lm(dist ~ speed, data = cars)
> qqnorm(resid(model), pch = 1, frame = FALSE)
> qqline(resid(model), col = "steelblue", lwd = 2)


Regression in R
Residual standard error: The residual standard error is a measure of how well the model fits the data.

R-squared: It tells us what percentage of the variation within our dependent variable the independent variable is explaining. In other words, it’s another method to determine how well our model is fitting the data.
Regression in R
With linear regression we are building a linear model of

y = b0 + b1*x
y = 0.16557*x + 8.28931

The standard error is telling us how much uncertainty there is in our coefficient. It is often used to create confidence intervals.
Reminder: how did we do it so far
Overweight?

Extract a new feature: BodyMassIndex <- Weight / Height^2

Add a column with the Feature

Use this new feature for predictions


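A sketch of that feature extraction in R (the toy data frame is mine):

library(dplyr)

people <- data.frame(Weight = c(70, 95, 58),        # kg
                     Height = c(1.80, 1.75, 1.62))  # m

people <- people %>%
  mutate(BodyMassIndex = Weight / Height^2)  # add the extracted feature
people
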
All together

With what we have learned so far we can distinguish 7 main properties when it comes to data visualization.
The basis: first three of seven elements
• Data: the actual variables to be plotted

• Aesthetics: visual characteristics that represent data, e.g. position, size, color, shape, transparency

• Geometries: the shapes we use to represent our data

Source: http://www.science-craft.com/category/data-visualisation/
Three more, advanced elements

• Facets: rows and columns of sub-plots

• Statistics: summaries and mathematical models

• Coordinates: the plotting space we are using

Finally, add the design element.

• Theme: non-data (meta-data or eye candy)
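
A sketch showing all seven elements at once on the midwest data used earlier (the aesthetic and theme choices are mine):

library(ggplot2)

ggplot(midwest, aes(x = area, y = poptotal, color = state)) +  # data + aesthetics
  geom_point() +                            # geometry
  geom_smooth(method = "lm", se = FALSE) +  # statistics: a fitted line
  facet_wrap(~ state) +                     # facets: one sub-plot per state
  coord_cartesian(ylim = c(0, 1e6)) +       # coordinates: zoom the y-axis
  theme_minimal()                           # theme: non-data styling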
What to do / good advice for your data
Indicate measurement
errors
Distinguish between
measurements (current, history)
and predictions (future), e.g. by
using solid and dotted lines.
Show, emphasize trend
lines and patterns.
Look at Data Sets from
every angle!
Fitting a function (a very
simple mathematical
model)
Check for Under / Over Fitting
• Look at the data points and the fit and use your brains!

• Ask yourself: does the fit describe the data well?

• Ask yourself: is the function you have used the simplest one that could describe the data?

• Tune the fit on a subset of the data (training set) and test it on the remaining (labelled) data (testing set).

What features count …

Things to look for and check match:

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes

Some Examples 1

Source: https://www.theguardian.com/news/datablog/2011/mar/08/international-womens-day-pay-gap#_
Some Examples 2
Some Examples 3
(Annotations on the slide: Data still visible · Scales selected · Geometry: colored red dots)

Some Examples 4
(Annotation on the slide: Statistical element added)
Florence Nightingale

https://www.sciencenews.org/pictures/mathtrek/112608/nightingale.swf
Bad Practice (a series of bad-practice examples was shown on the slides)
Typical Exam
Questions
OLD EXAM: WE DID NOT COVER EVERYTHING! NO ANSWERS WILL BE PROVIDED FOR THAT EXAM; THIS IS JUST ANOTHER EXAMPLE.
Theory Exam

• See sample questions


Exam Example Questions
Shotgun questions (a short answer is sufficient)
1. Discuss the four V’s that are often used to describe what Big Data is.
2. What is the difference between a bar graph and a histogram?
3. Mention at least two file formats and discuss the differences between them.
4. Suppose you have a data set with many variables. Which R function can be used to quickly investigate the relations between all the variables?



Case Study
2. Visualization Design
Since 2014 there have been earthquakes in Groningen, a region in the north of Holland where natural gas is pumped up. Initially NAM, the responsible company, denied responsibility for the earthquakes and the collateral damage to houses. In 2014 the Dutch government decided to put a cap on the quantity of gas that could be pumped up per year. This cap was lowered in the subsequent years, when the earthquakes did not stop.
• 2014: Decision to limit the extraction of natural gas to 42.5 billion cubic meters, with 80% around Loppersum (where the heaviest quakes occurred).

• January 2015: Decision to lower the cap to 39.4 billion cubic meters.

• June 2015: Decision to lower the cap to 30 billion cubic meters.

The NAM decided not to reduce gas production at every pump location, but only at a select number of pump locations (marked with an orange colour on the map below).








Translation Dutch – English
Meer gaswinning – More gas extraction
Minder gaswinning – Less gas extraction
Groninger gasveld – Natural gas field of Groningen
Actieve productie lokaties – Active production locations
Productie lokaties met verminderde gaswinning – Production locations with reduced gas extraction
Gasleidingen – Gas pipelines





1. Suppose you have earthquake data (date, time,
latitude, longitude, magnitude and depth) and
information about NAM pumps (latitude, longitude,
quantity of gas pumped up each month). Design and
draw a visualization that shows whether there is a
relationship between the reduction of pumped up
natural gas and the earthquakes.
Good Plot

• What are the properties of a good plot? Can you describe 3 of the properties? And give examples of bad practices?
