
Applied Statistics – Data Presentation, Tables and Charts

Tables and Visualizations


Overview

Tables and visualizations

Categorical data
+ 1 variable: tables, column chart, pie chart, bar chart / bar plot, Pareto chart
+ 2 variables: contingency tables, side-by-side chart

Numerical data
+ Ordered arrays, stem-and-leaf display
+ Frequency distribution: histogram, percentage polygon
+ Distribution function: cumulative percentage polygon
Furtwangen University 2
Column Chart


z <- c(1,1,2,2,2,3,3,3,3,3,3,3,4,4,4,4,5,5,5,6,6)
z1 <- table(z)
plot(z1, xlim=c(0,6), ylim=c(0,10), xlab = "z-Values",
ylab = "Frequencies", main = "Column Chart")

z1
z
1 2 3 4 5 6
2 3 7 4 3 2

Task:
Create a Column Chart for the following data:
"a","a","b","b","b","c","c","c","c",
"c","c","c","d","d","d","d","e","e","e","f","f"


This should be the solution
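
A minimal sketch of one possible solution, assuming the same table() and plot() approach as in the numeric example. The variable names letters_vec and letters_tab and the axis labels are our own choices, not taken from the slide.

```r
# Data from the task: categorical values "a" to "f"
letters_vec <- c("a","a","b","b","b","c","c","c","c",
                 "c","c","c","d","d","d","d","e","e","e","f","f")

# table() counts how often each category occurs
letters_tab <- table(letters_vec)

# plot() on a one-dimensional table draws one vertical line per category
plot(letters_tab, ylim = c(0, 10), xlab = "Categories",
     ylab = "Frequencies", main = "Column Chart")
```

The frequencies match those of the numeric example, because the letters a to f simply replace the values 1 to 6.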

Task:
Create the figure shown on the slide for the data set Cars93
The data are in the package "MASS"

Display the data with View()

Display the variable types with str()

First create a table() of the car types in Cars93

Then use the plot() function

This should be the solution

install.packages("MASS")
library(MASS) # for Cars93
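
The remaining steps of the task could look roughly like this. It is a sketch, not the original solution: the name type_tab, the axis labels and the title are assumptions.

```r
library(MASS)  # provides the Cars93 data set

str(Cars93$Type)                 # Type is a factor with six levels
type_tab <- table(Cars93$Type)   # absolute frequency of each car type

# plot() on the table draws one vertical line per car type
plot(type_tab, ylim = c(0, 25), xlab = "Car Type",
     ylab = "Frequencies", main = "Column Chart of Car Types")
```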

Pie Chart

For categorical variables, the table() function makes it easy to get the frequencies:
table <- table(Cars93$Type)

> str(table)
 'table' int [1:6(1d)] 16 11 22 21 14 9
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:6] "Compact" "Large" "Midsize" "Small" ...

➢ In a vector or matrix, all data need to be of the same type.
➢ In general, a data set contains different data types.
➢ The data structure for different data types is a data frame,
which is created with the data.frame() function:

frame <- data.frame(table)
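
The difference between a vector and a data frame can be illustrated with a small sketch. The names v and df are hypothetical; the two values are the first two car types and counts from the table output.

```r
# A vector holds one type only: mixing a number and text coerces both to character
v <- c(16, "Compact")
class(v)          # "character"

# A data frame keeps one type per column
df <- data.frame(type = c("Compact", "Large"), freq = c(16L, 11L))
str(df)
```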


install.packages("MASS")
library("MASS")

table <- table(Cars93$Type)


frame <- data.frame(table)
#add a new column
frame$rel <- (round(frame$Freq/sum(frame$Freq), digits = 2)*100)
lbls <- paste(frame$Var1, frame$rel, "%")
pie(table, labels = lbls, radius = 1, main = "Distribution of Car Types")

Task:
Create the same figure in different shades of grey instead of colours,
using the grey.colors() function

This should be the solution
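
One way to approach this, as a sketch: grey.colors() from base R returns a vector of grey shades that can be passed to the col argument of pie(). The names tab, rel and greys are our own.

```r
library(MASS)  # Cars93

tab  <- table(Cars93$Type)
rel  <- round(100 * tab / sum(tab))      # percentage per car type
lbls <- paste(names(tab), rel, "%")
greys <- grey.colors(length(tab))        # one shade of grey per slice

pie(tab, labels = lbls, col = greys, radius = 1,
    main = "Distribution of Car Types")
```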


Task
Create the figure on the right, based on the following
statements:

slices <- c(10, 12, 4, 16, 8)


lbls <- c("US", "UK", "Australia", "Germany", "France")


This should be the solution
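
A possible sketch based on the two statements from the task. The chart title is an assumption, since the original figure is not available here.

```r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")

# One slice per country, labelled with the country name
pie(slices, labels = lbls, main = "Pie Chart of Countries")
```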

3D Pie Chart

install.packages("plotrix")
library(plotrix)
library(MASS) # for Cars93

table <- table(Cars93$Type)
frame <- data.frame(table)
# relative frequencies in percent, so that the "%" in the label is correct
frame$rel <- round(frame$Freq/sum(frame$Freq), digits = 2)*100
lbls <- paste(frame$Var1, frame$rel, "%")
pie3D(table, labels = lbls, radius = 2, explode = 0.2,
      height = 0.1, main = "Distribution of car types")

Bar Chart

library("MASS")

table(Cars93$Type)
barplot(table(Cars93$Type), ylim = c(0, 30),
xlab = "Car Type",
ylab = "Frequencies of the car types",
axis.lty = "solid",
space = 0.1,
main = "Frequencies of Car Types")

ggplot2 is an R package for producing graphics

library("MASS")
library("ggplot2")

Task:
Try to create the same figure with ggplot2,
+ with different colours, according to Type
+ x-axis = only car types, no additional text
+ y-axis = "Absolute frequencies of the car types"
+ header = "Frequencies of the Car Types"
+ header in the center of the figure

Use:
ggplot(Cars93, aes(Type))+
geom_bar(fill = "grey80", colour = 'black')


This should be the solution
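
A sketch of how the requirements of the task could be met, starting from the ggplot() call given above. Mapping fill to Type inside aes() is our assumption of what "different colours, according to Type" means.

```r
library(MASS)     # Cars93
library(ggplot2)

p <- ggplot(Cars93, aes(Type, fill = Type)) +       # fill = Type: one colour per car type
  geom_bar(colour = "black") +
  labs(x = "", y = "Absolute frequencies of the car types") +
  ggtitle("Frequencies of the Car Types") +
  theme(plot.title = element_text(hjust = 0.5))     # hjust = 0.5 centres the header
p
```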

Bar Chart, Relative Frequencies
library("MASS")
library("plotly")

table <- table(Cars93$Type)


frame <- data.frame(table)
frame$rel <- round(frame$Freq/sum(frame$Freq), digits = 2)
n <- nrow(table) # number of types
co <- grey.colors(n, 0.2, 0.9)
ggplot(frame, aes(x=Var1, y=rel))+
geom_bar(stat="identity", fill = co, col = "black")+
theme_bw()+
labs(x = "", y = "Relative frequencies")+
ggtitle("Ratios of Car Types")+
theme(plot.title = element_text(hjust = 0.5))

Task:
Try to create the same figure with ggplot,
+ with different colours according to palette = "viridis"

This should be the solution
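
One possible sketch uses scale_fill_viridis_d(), the discrete viridis scale that ships with ggplot2. Whether the original solution used this function or another viridis interface is not known; the fill mapping to Var1 is also an assumption.

```r
library(MASS)     # Cars93
library(ggplot2)

tab <- table(Cars93$Type)
frame <- data.frame(tab)                       # columns Var1 and Freq
frame$rel <- round(frame$Freq / sum(frame$Freq), digits = 2)

p <- ggplot(frame, aes(x = Var1, y = rel, fill = Var1)) +
  geom_bar(stat = "identity", col = "black") +
  scale_fill_viridis_d() +                     # discrete viridis palette
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "", y = "Relative frequencies") +
  ggtitle("Ratios of Car Types") +
  theme(plot.title = element_text(hjust = 0.5))
p
```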

Bar Chart, Absolute Frequencies

library("MASS")
library("plotly")

ggplot(Cars93, aes(Manufacturer))+
geom_bar(fill = "grey70", colour = 'black')+
theme_bw()+
theme(legend.position = "none")+
theme(axis.text.x = element_text(angle = -45,
vjust = 0.5))+
labs(x = "", y = "Absolute frequencies of
manufacturers")+
ggtitle("Absolute Frequencies of Manufacturers")+
theme(plot.title = element_text(hjust = 0.5))

Task:
Try to create the same figure with ggplot,
+ with different colours, given by manufacturer

This should be the solution

Bar Chart, stacking

A bad example!

library("MASS")
library("plotly")

ggplot(Cars93, aes(Manufacturer, fill = Type))+
geom_bar(position = "dodge")+
theme_bw()+
theme(axis.text.x = element_text(angle = -45,
vjust = 0.5))+
labs(x = "", y = "Absolute frequencies")+
ggtitle("Frequencies of Car Types per Manufacturer")+
theme(plot.title = element_text(hjust = 0.5))


library("MASS")
library("plotly")

# fill = stacking criteria


ggplot(Cars93, aes(Manufacturer, fill = Type))+
geom_bar()+
theme_bw()+
theme(axis.text.x = element_text(angle = -45,
vjust = 0.5))+
labs(x = "", y = "Absolute frequencies")+
ggtitle("Frequencies of Car Types per Manufacturer")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")

Task:
Try to create the same figure with ggplot,
+ with different grey colours, given by manufacturer

This should be the solution
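
A sketch under the assumption that the grey shades are applied to the stacking variable Type, which is the fill aesthetic of the chart. scale_fill_grey() is the ggplot2 scale for this; the start and end values are our own choice.

```r
library(MASS)     # Cars93
library(ggplot2)

p <- ggplot(Cars93, aes(Manufacturer, fill = Type)) +
  geom_bar(colour = "black") +
  scale_fill_grey(start = 0.2, end = 0.9) +    # greys instead of the default colours
  theme_bw() +
  theme(axis.text.x = element_text(angle = -45, vjust = 0.5)) +
  labs(x = "", y = "Absolute frequencies") +
  ggtitle("Frequencies of Car Types per Manufacturer") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "bottom")
p
```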

Bar Chart, Horizontal Display

library("MASS")
library("plotly")

ggplot(Cars93, aes(Manufacturer, fill = Type))+


geom_bar(fill = "grey70", colour = 'black')+
coord_flip()+
theme_bw()+
theme(legend.position = "none")+
labs(x = "", y = "Absolute frequencies of manufacturers")+
ggtitle("Absolute Frequencies of Manufacturers")+
theme(plot.title = element_text(hjust = 0.5))

Task
Create the same figure
in different colours, given by manufacturer

This should be the solution


library("MASS")
library("plotly")

ggplot(Cars93, aes(Manufacturer, fill = Type))+


geom_bar()+
coord_flip()+
theme_bw()+
theme(axis.text.x = element_text(angle = -45, vjust = 0.5))+
labs(x = "", y = "Absolute frequencies")+
ggtitle("Frequencies of Car Types per Manufacturer")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")

Task
Create the same figure
in different grey colours, legend on the right

This should be the solution

Pareto Chart

➢ In a Pareto chart, the frequencies for each category are plotted as vertical bars in descending order and are combined
with the cumulative percentage line on the same chart

➢ Pareto charts get their name from the Pareto principle: in many data sets, a few categories of a categorical variable
represent the majority of the data, while all other categories represent a relatively small amount of data

[Figure: Pareto chart of the car types, annotated with: absolute frequency in descending order, cumulative absolute frequency, percentage distribution, cumulative percentage distribution]

46% of all cars are Small or Midsize

100% - 78% = 22% of all cars are Large or Vans


library(MASS) #Cars93
library(qcc) # Pareto chart

tab = table(Cars93$Type)
pareto.chart(tab, xlab = "Car types",
ylab = "Absolute frequency of car types",
col = c("red", "blue"),
cumperc = seq(0, 100, by = 5), # ranges percentages right
ylab2 = "Cumulative relative frequency of car types",
main = "Pareto Chart for Car Types") # title of the chart

Task
The file "Interruptions" contains the number of network
interruptions per day in a company for 130 days

Take the file from Felix and save it on your laptop

Read the file into R with
intrupt <- read.csv("<enter your path>/Interruptions.csv",
                    sep = ";", header = TRUE)

Create a Pareto chart for the interruptions

First create a table

The result must be the figure shown on the slide

The colour of the bars must be red, blue, red, blue, …

This should be the solution

Task
Create the same Pareto chart in grey colours

Solution
library(qcc) # Pareto chart
interruptions <- read.csv("D:/HFU Arbeitskreise/Leuchtturm/Buch/Data/Interruptions.csv", sep = ";", header = TRUE)
tab = table(interruptions$Interruptions)
pareto.chart(tab, xlab = "Number of interruptions",
ylab = "Absolute frequency of interruptions",
col = c("grey50"),
cumperc = seq(0, 100, by = 5),
ylab2 = "Cumulative relative frequency of interruptions",
main = "Network Interruptions - Pareto Chart")

Task
Create a Pareto chart for the interruptions
Use ggplot()

The result must be the figure shown on the slide, in grey colours

Task
Create a Pareto chart for the car types
Use ggplot()

The result must be the figure shown on the slide

The same task in grey; here is the solution


install.packages("qcc")
install.packages("ggQC")
library(qcc) # Pareto chart, stat_pareto
library(ggplot2)
library(ggQC)
library("MASS")

tab = table(Cars93$Type)
tab1 <- as.data.frame((tab))
ggplot(tab1, aes(x = Var1, y = Freq)) +
labs(x = "Car Types", y = "Absolute frequency of car types")+
ggtitle("Pareto Chart for Car Types")+
theme(plot.title = element_text(hjust = 0.5))+
stat_pareto(point.color = "grey50",
point.size = 3,
line.color = "black",
bars.fill = c("grey50", "grey90")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5))
Stem-and-Leaf Diagram

The prices have 1 decimal place (see View(Cars93$Price))

With the following R code we round the values to the nearest whole number and sort them in ascending order:

round(sort(Cars93$Price), 0) # decreasing = FALSE is the default

To display the rounded values we can use the cat() function; its fill argument sets the maximum number of characters
per output line:

cat(round(sort(Cars93$Price)), fill = 50)

library("MASS")
library(readxl)

stem(Cars93$Price)

The results are rounded to integers (compare with View(Cars93$Price))

[Figure: stem-and-leaf output for Cars93$Price, with the leaves for the values 7, 8, 9 and 10 marked]

➢ The number to the left of the vertical line is the stem, the numbers to the right are the leaves.
➢ The decimal point information tells you how to scale: multiply each stem by 10 (or 100, or 10000,…),
then add each leaf to the stem
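
How stems and leaves are printed can be tried out on a small sample. The values below are hypothetical and serve only as an illustration; stem() and its scale argument are part of base R.

```r
# Hypothetical small sample of prices (in $1,000), used only for illustration
x <- c(7.4, 8.0, 8.4, 9.2, 10.0, 11.3, 12.1, 15.9, 19.8, 22.7)

# stem() prints the scaling line ("The decimal point is ...") followed by
# one row per stem; each digit to the right of | stands for one observation
stem(x)

# scale controls the number of rows: larger values split the stems further
stem(x, scale = 2)
```

Comparing the two outputs with the sorted values shows which digits were kept as leaves and which were rounded away.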
Example

Take the files "Fastfood" and "Screwlength" from Felix and create the solutions shown below.

Use View() to display the amounts and widths.

Click on the column with the data to sort them, so you can check your results

stem(Fastfood$Amount)

The decimal point is at the |

4 | 9
5 | 589
6 | 3558
7 | 149
8 | 33
9 | 56

stem(Screwlength$Width)

The decimal point is 2 digit(s) to the left of the |

830 | 27
832 | 3
834 | 381
836 | 3
838 | 2356
840 | 3559001234459
842 | 000279969
844 | 4778
846 | 002569
848 | 114988

Data Frames

Pareto Chart - Repetition

Solution
library(qcc) # Pareto chart
intrupt <- read.csv("D:/HFU Arbeitskreise/Leuchtturm/Buch/Data/Interruptions.csv",
                    sep = ";", header = TRUE)
tab = table(intrupt$Interruptions) # the data frame read above is called intrupt
pareto.chart(tab, xlab = "Number of interruptions",
ylab = "Absolute frequency of interruptions",
col = c("red", "blue"), # colors of the chart
cumperc = seq(0, 100, by = 5), # ranges on the right
ylab2 = "Cumulative relative frequency of interruptions",
main = "Network Interruptions - Pareto Chart")

intrupt <- read.csv("D:/HFU Arbeitskreise/Leuchtturm/Buch/Data/Interruptions.csv", sep = ";", header = TRUE)

View(intrupt) # original data
str(intrupt)

'data.frame': 130 obs. of 2 variables:
 $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Interruptions: int 0 5 4 4 1 0 0 1 1 0 ...

How can we create a data frame that shows the number of days with the same
number of interruptions?

This is the frequency distribution of the interruptions

tab.intrupt <- table(intrupt$Interruptions)
class(tab.intrupt)
[1] "table"
View(tab.intrupt)

To make further operations more convenient, we convert the table to
a data frame with data.frame()

frame.intrupt <- data.frame(tab.intrupt)
class(frame.intrupt)
[1] "data.frame"
View(frame.intrupt)

The display is the same, but the class is different.

We rename the columns: column Var1 becomes "number of disruptions"
and Freq becomes "days ni", the number of days.

names(frame.intrupt)[1] = "number of disruptions"
names(frame.intrupt)[2] = "days ni"
View(frame.intrupt)
Data Frames

$x_i$   The $i$-th observation

$n$   Total number of observations

$n_i$   Absolute frequency of $x_i$ (how often $x_i$ appears)

$n = \sum_{i=1}^{k} n_i$   In the case of $k$ different observations $x_i$

$N_i = \sum_{j=1}^{i} n_j$   The cumulated number of observations up to $x_i$

$f_i = \frac{n_i}{n}$   Relative frequency of $x_i$

$F_i = \sum_{j=1}^{i} f_j$   Cumulated relative frequency up to $x_i$
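
The four quantities can be computed in a few lines. The sketch below reuses the vector z from the column chart example; the object names ni, Ni, fi, Fi and freq are our own.

```r
# Values from the column chart example
z <- c(1,1,2,2,2,3,3,3,3,3,3,3,4,4,4,4,5,5,5,6,6)

ni <- table(z)        # absolute frequencies n_i
n  <- sum(ni)         # total number of observations n
Ni <- cumsum(ni)      # cumulated absolute frequencies N_i
fi <- ni / n          # relative frequencies f_i
Fi <- cumsum(fi)      # cumulated relative frequencies F_i

# Collect everything in one data frame, rounded for display
freq <- data.frame(xi = names(ni), ni = as.vector(ni), Ni = as.vector(Ni),
                   fi = round(as.vector(fi), 2), Fi = round(as.vector(Fi), 2))
freq
```

The last value of Ni equals n and the last value of Fi equals 1, which is a quick check that nothing was lost.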

frame.intrupt$Ni <- cumsum(frame.intrupt$"days ni")
frame.intrupt$fi <- round(frame.intrupt$"days ni" /
                          sum(frame.intrupt$"days ni"), digits = 2)
frame.intrupt$Fi <- round(cumsum(frame.intrupt$fi), digits = 2)
View(frame.intrupt)

This corresponds to the result of the Pareto chart

Interpretations
1. On 32 days we had 1 interruption (n2)
2. On 107 days there were 2 or fewer interruptions (N3)
3. The share of days with 1 interruption is 0.25 or
25% (f2)
4. The share of days with 2 or 3 interruptions is
0.15 + 0.09 = 0.24 or 24% (f3 + f4)
5. The share of days with 2 or fewer interruptions is
0.83 or 83% (F3)
6. The share of days with 3 or more interruptions is
0.17 or 17% (1 - share of days with 2 or fewer
interruptions, 1 - F3 = 1 - 0.83)

And now you understand the Pareto chart

Pareto Chart

Task
On a previous slide we had the Pareto chart produced by the code below.
Create a table to prove these results
and add 2 more columns, called "fi %" and "Fi %", to get the percentages shown on the right axis of the chart.

library(MASS) #Cars93
library(qcc) # Pareto chart

tab = table(Cars93$Type)
pareto.chart(tab, xlab = "Car types",
             ylab = "Absolute frequency of car types",
             col = c("red", "blue"),
             cumperc = seq(0, 100, by = 5), # ranges percentages right
             ylab2 = "Cumulative relative frequency of car types",
             main = "Pareto Chart for Car Types") # title of the chart
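
A possible sketch of such a table, built with base R only. The object name pareto and the rounding to one decimal are assumptions; the column names "fi %" and "Fi %" follow the task.

```r
library(MASS)  # Cars93

# Sort the frequencies in descending order, as in the Pareto chart
tab <- sort(table(Cars93$Type), decreasing = TRUE)

pareto <- data.frame(tab)                     # columns Var1 and Freq
pareto$Ni <- cumsum(pareto$Freq)              # cumulative absolute frequency
pareto[["fi %"]] <- round(100 * pareto$Freq / sum(pareto$Freq), 1)
pareto[["Fi %"]] <- round(100 * pareto$Ni / sum(pareto$Freq), 1)
pareto
```

The second row of "Fi %" reproduces the statement that about 46% of all cars are Midsize or Small, and the fourth row gives the roughly 78% after which only Large and Van remain.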

Histogram

➢ A vertical bar chart of the data in a frequency distribution is called a histogram

➢ In a histogram there are no gaps between adjacent bars

➢ The class boundaries (or class midpoints) are shown on the horizontal axis

➢ The vertical axis is either frequency, relative frequency, or percentage

➢ The height of the bars represents the frequency, relative frequency, or percentage

R creates intervals (classes) with lower and upper boundaries

[Figure: "Histogram: Age Of Students"; y-axis: Frequency; x-axis: classes 5, 15, 25, 35, 45, 55, More]

hist(Cars93$Price, col="grey", xlab = "Price * $1.000", xlim=c(0,70),
     main="Prices of 93 Models of 1993 Cars")

Absolute frequencies

hist(Cars93$Price, col="red", xlab = "Price * $1.000", ylab = "Density",
     xlim = c(0,70), ylim = c(0,0.07), main="Prices of 93 Models of 1993 Cars",
     prob = TRUE)

Result:
+ prob = TRUE: the y-axis shows the density (relative frequency divided by class width) instead of the absolute frequency
+ the areas of the bars add up to 1
+ Density distribution of the car prices
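
The relation between counts, relative frequencies and densities can be checked directly on the object returned by hist(). This is a sketch; the names h, width, rel and dens are our own.

```r
library(MASS)  # Cars93

h <- hist(Cars93$Price, plot = FALSE)   # compute the classes without drawing

width <- diff(h$breaks)                 # class widths
rel   <- h$counts / sum(h$counts)       # relative frequencies
dens  <- rel / width                    # what prob = TRUE puts on the y-axis

all.equal(dens, h$density)              # TRUE
sum(h$density * width)                  # 1: the bar areas add up to 1
```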

With the lines() function we can add a line to the histogram

hist(Cars93$Price, col="blue", xlab = "Price * $1.000", xlim=c(0,70),
     main="Prices of 93 Models of 1993 Cars", prob = TRUE)
lines(density(Cars93$Price)) # smooth density curve, similar in shape to the percentage polygon

Task
Take the "state.x77" dataset from the package "datasets"
Create a histogram for the variable "Income"
The solution should be the figure shown on the slide
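
A minimal sketch for this task. state.x77 is a matrix, not a data frame, so the column is selected by name with square brackets; the colour, axis label and title are assumptions.

```r
# state.x77 is a matrix, so the $ operator does not work here
income <- state.x77[, "Income"]

h <- hist(income, col = "grey", xlab = "Income",
          main = "Histogram of Income (state.x77)")
```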

Cumulative Frequency Distribution

➢ The cumulative frequency of an interval (class) is the sum of its own frequency plus all frequencies in the preceding classes

➢ It is the graph of $N_i$

➢ We can assign the result of hist() to a variable

prices <- hist(Cars93$Price, col = "grey", xlab = "Price * $1.000", xlim = c(0,70), main = "Prices of 93 Models of 1993 Cars")

➢ We display the content of the variable

prices

> prices
$breaks
 [1] 5 10 15 20 25 30 35 40 45 50 55 60 65        # interval width = 5

$counts
 [1] 12 21 29 10 9 5 4 1 1 0 0 1        # absolute frequencies $n_i$

$density        # densities: relative frequencies $f_i$ divided by the interval width
 [1] 0.025806452 0.045161290 0.062365591 0.021505376 0.019354839 0.010752688 0.008602151
 [8] 0.002150538 0.002150538 0.000000000 0.000000000 0.002150538

$mids
 [1] 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5        # midpoints of the classes

$xname
[1] "Cars93$Price"        # analyzed variable

$equidist
[1] TRUE        # equidistant: all intervals have the same width

attr(,"class")
[1] "histogram"

➢ We have access to this information; for our cumulative frequency distribution we need the counts

➢ First we display the counts

prices$counts
[1] 12 21 29 10 9 5 4 1 1 0 0 1

➢ We replace prices$counts with its cumulative sums, using the cumsum() function

prices$counts <- cumsum(prices$counts)

➢ And call the plot() function

plot(prices, col = "yellow", xlab = "Price * $1.000", xlim = c(0,70),
     ylab = "Cumulative frequencies",
     main = "Cumulative Frequency Distribution of Car Prices")
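
What cumsum() does to the counts can be seen without drawing anything. The sketch uses the class counts printed in prices$counts; the names counts and Ni are our own.

```r
# Class counts as printed in prices$counts
counts <- c(12, 21, 29, 10, 9, 5, 4, 1, 1, 0, 0, 1)

# Running total: each class plus all preceding classes
Ni <- cumsum(counts)
Ni
# [1] 12 33 62 72 81 86 90 91 92 92 92 93
```

The last value is 93, the total number of car models, so the highest bar of the cumulative histogram always reaches n.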
Empirical Cumulative Distribution Function (ECDF)

➢ The empirical cumulative distribution function is more detailed: it does not group the values into intervals
➢ For a given value x it shows the proportion of values that are less than or equal to x
➢ In R, the ecdf() function is available for this

plot(ecdf(Cars93$Price), col = "blue", xlab="Price* $1.000", xlim = c(0,70),
     ylab = "Cumulative frequencies",
     main = "Empirical Cumulative Distribution Function (ECDF) of Car Prices")
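
Besides being plotted, the result of ecdf() can be evaluated like any other function. A small sketch with a hypothetical sample (the values of x are made up for illustration):

```r
# Hypothetical small sample, used only for illustration
x <- c(1, 2, 2, 3, 5)

Fn <- ecdf(x)   # ecdf() returns a function
Fn(2)           # 0.6: three of the five values are <= 2
Fn(4)           # 0.8: four of the five values are <= 4
```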

geom_step() is a geom that draws a step function,
for example to visualize an
empirical cumulative distribution
function

library(MASS) # for Cars93
library(ggplot2)
quants <- quantile(Cars93$Price)
ggplot(Cars93, aes(x = Price))+
geom_step(stat = "ecdf", color = "red")+
labs(x = "Price * $1.000", y = "Cumulative frequencies")+
geom_vline(xintercept = quants, linetype = "dashed")+ # quantiles as dashed vertical lines
scale_x_continuous(breaks = quants, labels = quants)+
ggtitle("Empirical Cumulative Distribution Function (ECDF) of Car Prices")+
theme_bw()
Other Examples

Scatterplots

➢ Use the data set "mpg" from the ggplot2 package

# displ = engine displacement, in litres
# hwy = highway miles per gallon
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point() # layer

ggplot(mpg, aes(x = displ, y = hwy, colour = class))+
geom_point()

The plot is 3-dimensional:
➢ x-axis = displ
➢ y-axis = hwy
➢ colour = class (each class is identified by a colour)

[Figure: scatterplot with x-axis "Engine size" (low to high) and y-axis "Fuel economy" (low to high)]

Cars with high fuel economy for their large
engine sizes are 2-seaters

Facetting
ggplot(mpg, aes(x = displ, y = hwy, colour = class))+
geom_point()+
facet_wrap(~class)

Boxplots

ggplot(mpg, aes(x = displ, y = hwy, colour = class))+


geom_boxplot()
