
Contents

Misleading ways to present data

Advantages of Numerical Summaries

Summaries of Spread

Reporting Center and Spread

Formulas

Base R Codes

GGplot Codes

Misleading ways to present data


Real estate agents may quote the average (mean) sale price of a suburb. A few houses sold at very
high prices (outliers) can pull the mean up even though the median house price is much lower,
so a less desirable suburb is presented as better than it is.
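
A minimal sketch of this effect, using made-up sale prices (hypothetical values, not real suburb data). One very expensive sale drags the mean well above the median:

```r
# Hypothetical sale prices in thousands of dollars (illustrative values only)
prices <- c(650, 700, 720, 750, 780, 800, 4500)  # the last sale is an outlier

mean_price   <- mean(prices)    # pulled upward by the single expensive sale
median_price <- median(prices)  # middle value, not affected by the outlier

c(mean = mean_price, median = median_price)
```

Here the median is 750 while the mean is above 1200, so quoting the mean makes the suburb look more expensive than a typical sale suggests.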

Quoting the price per square foot, instead of per square meter (the predominant unit in
Australia), makes the price look lower.
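
The arithmetic behind this can be sketched as follows. The price per square meter is a made-up figure; the conversion factor (1 square meter is about 10.7639 square feet) is the only fixed quantity:

```r
# 1 square meter is approximately 10.7639 square feet
sqft_per_sqm <- 10.7639

price_per_sqm  <- 10000                         # hypothetical price per square meter
price_per_sqft <- price_per_sqm / sqft_per_sqm  # the same price expressed per square foot

round(price_per_sqft)
```

The property costs the same either way, but the per-square-foot figure is more than ten times smaller.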

Advantages of Numerical Summaries

A numerical summary reduces all the data to one simple number. This loses a lot of information,
but it allows easy communication and comparison.

Major features that can be summarised numerically:

● Maximum
● Minimum
● Center (mean, median)
● Spread (standard deviation, range, IQR)

Summaries of Spread

IQR
Range
Variance
Standard deviation

Reporting center and spread

Correct pairs to present: (mean, SD) OR (median, IQR)


INCORRECT pairs to present: (mean, IQR) OR (median, SD)

Note: (mean, SD) pair is not robust; (median, IQR) IS robust
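
A small sketch of what "robust" means here, using hypothetical data. Replacing the largest value with an outlier changes the mean and SD but leaves the median and IQR untouched (`summaries` is an illustrative helper, not a built-in function):

```r
x         <- c(2, 4, 4, 5, 6, 7, 8)   # hypothetical data
x_outlier <- c(2, 4, 4, 5, 6, 7, 80)  # same data, largest value replaced by an outlier

# Illustrative helper: both (center, spread) pairs in one named vector
summaries <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

before <- summaries(x)
after  <- summaries(x_outlier)
rbind(before, after)
```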


Characteristics of Summaries

Summary | Robust? | Center or spread? | Compares 2 or more data sets? | Effect of shift (adding/subtracting n from every data value) | Effect of scaling (multiplying/dividing every data value by n)
IQR | Yes | Spread | | |
Range | | Spread | | |
Median | Yes | Center | | |
Mean | No | Center | | Shifts | Scales
Variance | | Spread | | |
Standard deviation | No | Spread | | No change | Scales: SD * n (for n > 0)

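
The shift and scaling entries for the mean and standard deviation can be checked with a short sketch. The vector and the constants 5 and 3 are arbitrary choices for illustration:

```r
A <- c(1:20)

shifted <- A + 5   # add 5 to every value
scaled  <- A * 3   # multiply every value by 3

# Shifting moves the mean by 5 but leaves the SD unchanged
c(mean_change = mean(shifted) - mean(A), sd_change = sd(shifted) - sd(A))

# Scaling multiplies both the mean and the SD by 3
c(mean_ratio = mean(scaled) / mean(A), sd_ratio = sd(scaled) / sd(A))
```
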
Formulas

Function | Formula

Coefficient of Variation (CV): combines the mean and standard deviation into one summary | CV = SD/mean
Uses: analytical chemistry, to express the precision and repeatability of an assay; engineering
and physics, for quality assurance studies; economics, for determining the volatility of a security.

Lower threshold | Q1 - 1.5 * IQR

Upper threshold | Q3 + 1.5 * IQR

Standard units: how many standard deviations a data point is above or below the mean | (data point - mean)/SD OR gap/SD OR data point = mean + SD * standard units

Population standard deviation | SDpop = RMS of (gaps from the mean)

Area under the normal curve to the left of a value | pnorm(value, mean = enter, sd = enter)
Note: the defaults are mean = 0 and sd = 1

Area between -2.5 and 2.5 SDs | pnorm(2.5) - pnorm(-2.5)

dnorm(

rnorm(

To hide messages and warnings in the knitted document (errors are not affected):
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
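
The formulas in this section can be tried out on a small vector. This is a sketch on hypothetical measurements. It uses `sd()`, which is the sample SD, so substitute the population SD if that is the convention being followed:

```r
x <- c(12, 15, 11, 14, 13, 18, 40)  # hypothetical measurements

# Coefficient of variation: SD relative to the mean
cv <- sd(x) / mean(x)

# Outlier thresholds from the quartiles
q     <- quantile(x)        # 0%, 25%, 50%, 75%, 100%
iqr   <- q[4] - q[2]        # Q3 - Q1
lower <- q[2] - 1.5 * iqr   # lower threshold
upper <- q[4] + 1.5 * iqr   # upper threshold
outliers <- x[x < lower | x > upper]

# Standard units, and converting back to the original values
z    <- (x - mean(x)) / sd(x)
back <- mean(x) + sd(x) * z

# Area under the standard normal curve between -2.5 and 2.5
area <- pnorm(2.5) - pnorm(-2.5)
```

With these values only 40 falls outside the thresholds, and `area` is a little under 0.99.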

Base R Codes

Function | Code
Histogram with one variable | hist(iris$Sepal.Length)

Boxplot with one variable | boxplot(iris$Petal.Length)

Average/mean | mean(datasetname$variable)
Eg. mean(Newtown2017$Sold)

Importing data into RStudio | Import Dataset → (choose data) → Import

Show dataset (displays data size, variable names, variable classifications) | str(datasetname)

Tidyverse | library(tidyverse)

First rows of the dataset | head(datasetname, rows)
Eg. head(Newtown2017, 2)

Filter to find a further subset of the data | mean(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"])
Note: this gives the mean price of houses with 4 bedrooms (large houses)

Median | median(datasetname$variable)

Median focusing on a subset | median(datasetname$variable[datasetname$Type == "value" & datasetname$variable2 == "number"])
Eg. median(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"])

Mean and median together | c(mean(datasetname$variable), median(datasetname$variable))

Gaps | gaps = datasetname$variable - mean(datasetname$variable)

Maximum gap | max(gaps)

Standard deviation for sample | sd(datasetname$variable)

Standard deviation for population | sample SD * sqrt((n-1)/n): sd(datasetname$variable) * sqrt((n-1)/n), where n = length(datasetname$variable)
OR, with the rafalib package: popsd(datasetname$variable)


Make barplot

Quantile | quantile(datasetname$variable)

IQR | quantile(datasetname$variable)[4] - quantile(datasetname$variable)[2]

Moving data up | Eg. A = c(1:20)
B = A + 5
NOTE: mean(B) = mean(A) + 5 and sd(B) = sd(A)

Boxplot values | summary(datasetname$variable)

Ordering | sort(datasetname$variable)

Population standard deviation | library(rafalib)
popsd(datasetname$variable)
NOTE: without the rafalib package, sd(datasetname$variable) gives the sample SD
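
A runnable sketch of the subsetting and IQR rows above. The real Newtown2017 data is not included in these notes, so a small made-up data frame stands in for it. The column names follow the examples above, but the values, and the assumption that Bedrooms is stored as text, are hypothetical:

```r
# Hypothetical stand-in for the Newtown2017 data (made-up values; Sold in thousands)
Newtown2017 <- data.frame(
  Sold     = c(1200, 1500, 900, 1750, 1600, 800),
  Type     = c("House", "House", "Unit", "House", "House", "Unit"),
  Bedrooms = c("3", "4", "2", "4", "4", "1")
)

# Logical index: TRUE for rows that are houses with 4 bedrooms
big_houses <- Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"

big_mean   <- mean(Newtown2017$Sold[big_houses])
big_median <- median(Newtown2017$Sold[big_houses])

# IQR from the quartiles, compared with the built-in IQR() function
q           <- quantile(Newtown2017$Sold)
iqr_by_hand <- unname(q[4] - q[2])
iqr_builtin <- IQR(Newtown2017$Sold)
```

Storing the logical index in its own variable keeps the condition readable and lets the same subset be reused for both the mean and the median.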

GGplot Codes

NOTE: no $ in ggplot; refer to variables by name inside aes()
Function | Code
Histogram: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_histogram()

Histogram: 1 variable + coloured | ggplot(iris, aes(Petal.Length)) + geom_histogram(aes(fill = Species))

Boxplot: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_boxplot()

Calculating popsd without the function
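
One way to do this, sketched from the definition SDpop = RMS of the gaps from the mean. The vector is example data and the variable names are illustrative:

```r
x <- c(1:20)                          # example data

gaps          <- x - mean(x)          # gap of each value from the mean
popsd_by_hand <- sqrt(mean(gaps^2))   # root of the mean of the squared gaps

# Cross-check: sample SD rescaled by sqrt((n - 1) / n)
n             <- length(x)
popsd_from_sd <- sd(x) * sqrt((n - 1) / n)

c(popsd_by_hand, popsd_from_sd)
```

Both approaches should give the same number, up to floating-point rounding.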


Quiz

Question | Answer

What feature of data can be easily communicated by a single numerical summary? | The maximum of quantitative data

The mean is the unique point at which the data is ____. | Balanced

For measuring the spread of data, what is wrong with calculating the mean of the “gaps”, where “gap” = data - mean? | It will always be 0

Can the standard deviation be negative? | No. It involves the RMS, which cannot be negative.

In R, does the sd() command work out the population SD, or does the rafalib package need to be installed? | No, sd() gives the sample SD. The rafalib package's popsd() command gives the population SD.
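
The two answers about gaps can be checked with a quick sketch on hypothetical data:

```r
x    <- c(3, 7, 8, 12, 20)   # hypothetical data with mean 10
gaps <- x - mean(x)          # -7, -3, -2, 2, 10

mean_of_gaps <- mean(gaps)          # positive and negative gaps cancel, giving 0
rms_of_gaps  <- sqrt(mean(gaps^2))  # squaring removes the signs, so this cannot be negative

c(mean_of_gaps, rms_of_gaps)
```
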
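# Read the project data and inspect it
Project1 <- read.csv("Downloads/Project1Data.csv")
View(Project1)
str(Project1)
library(tidyverse)

# Bar chart of breakfast frequency, coloured by employment status
# (geom_bar() counts rows per category, so no bins argument is needed)
ggplot(Project1, aes(Breakfast, fill = Employment)) +
  geom_bar(stat = "count") +
  labs(x = "Number of days per week breakfast is consumed", y = "Frequency",
       title = "Breakfast Habits vs Employment Status")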

Potential contents:

Advantages and disadvantages of numerical and graphical summaries

Skills:
- Find the mean, median, IQR, range etc. from a boxplot, histogram and dataset, both by hand
and in RStudio (base R and ggplot)
