
Data Analytics in R

[1] Introduction to Data Analytics

Rafael Greminger

The Basics

RStudio

Getting help

Various ways to get help when you are stuck:

– Built-in help files: ?mean opens the help file for the function mean

– The internet is your friend.

– Our textbook https://rgreminger.github.io/files/msin0010book/

– DataCamp exercises

– Me
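
As a small illustration of the built-in help, the calls below should all work in a plain R session. Only base R functions are used; mean is just the example function from above.

```r
# Open the help page for mean(); these two lines are equivalent
?mean
help("mean")

# Show the arguments of a function without opening the full help page
args(mean)

# Run the examples listed at the bottom of the help page
example(mean)
```

The help page usually ends with runnable examples, which is often the fastest way to see how a function is meant to be called.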

Packages

Packages are code provided by other users.

They extend R’s functionality.

They need to be installed once.

They need to be loaded in every session!

# Install the tidyverse package (only run once)
install.packages('tidyverse')
# Load the tidyverse package (run every session)
library(tidyverse)
# Note: anything on a line after # is a comment and will not run
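
If you are unsure whether a package has already been installed, a quick check along these lines can help. This is only a sketch: the variable name pkg is made up for the example, and nothing is installed by this code.

```r
# Name of the package to check (tidyverse is the package used in this course)
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") once")
}
```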

Types of objects

There are many different types of objects in R:

– Lists

– Vectors

– Matrices

– Data frames

– Tibbles

– ...
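
To make these types concrete, here is a minimal sketch that creates one object of each base R type and checks what it is. The object names and values are made up for illustration. Tibbles are left out because they require the tidyverse; as_tibble() converts a data frame into one.

```r
v <- c(1, 2, 3)                              # vector: elements of one type
l <- list(name = "Jon", age = 23)            # list: elements of mixed types
m <- matrix(1:6, nrow = 2)                   # matrix: 2 rows, 3 columns
d <- data.frame(x = 1:2, y = c("a", "b"))    # data frame: columns of equal length

# class() reports the type of each object
class(v)
class(l)
class(m)
class(d)
```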

Creating / assigning objects

a <- 1 # assign value 1 to object a
print(a) # print object a
[1] 1
b <- c(1,2) # assign a vector [1,2] to object b
print(b)
[1] 1 2


Name <- c("Jon", "Bill")
Age <- c(23, 41)
df <- data.frame(Name, Age)
head(df)
  Name Age
1  Jon  23
2 Bill  41

Keep track of your objects!
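
One possible way to keep track of objects from the console, using only base R functions. The objects a and df mirror the ones created earlier. In RStudio, the Environment pane shows the same information.

```r
a <- 1
df <- data.frame(Name = c("Jon", "Bill"), Age = c(23, 41))

ls()        # names of all objects in the current environment
class(df)   # what kind of object df is
str(df)     # compact overview: columns, types, first values

rm(a)                           # remove object a
exists("a", inherits = FALSE)   # FALSE: a no longer exists in this environment
```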

Functions

We have already used several functions:

– library()

– c()

– print()

Functions are the workhorses of R.

– They take input, do something, and return an output.

Packages provide more functions.
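
A short sketch of "input, do something, output": first with the built-in mean(), then with a small user-defined function. The function add_markup and its markup argument are hypothetical and exist only for this example.

```r
prices <- c(3.49, 3.79, 3.69)   # input: a numeric vector
avg <- mean(prices)             # mean() takes the vector and returns one number
avg

# A user-defined function: adds a fixed markup to a price
add_markup <- function(price, markup = 0.5) {
  price + markup                # the last expression is the return value
}

add_markup(3.49)                # uses the default markup of 0.5
add_markup(3.49, markup = 1)    # overrides the default
```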

Different ways are valid

There are usually multiple ways of doing the same thing in R.

There is no single correct way of doing things.

– Multiple functions/packages can achieve the same thing.

I show ways that work for the purpose of this course.

– You can always do it differently if you want to.
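
As an illustration, the sketch below computes the same mean in two ways and extracts the same column in three ways. All of it is base R, and the small data frame reuses the Jon/Bill example.

```r
x <- c(23, 41)

# Two ways to compute the mean
m1 <- mean(x)
m2 <- sum(x) / length(x)
all.equal(m1, m2)

# Three ways to extract the Age column from a data frame
df <- data.frame(Name = c("Jon", "Bill"), Age = c(23, 41))
age1 <- df$Age
age2 <- df[["Age"]]
age3 <- df[, "Age"]
identical(age1, age2)
identical(age1, age3)
```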

Working with data sets

Different formats

Data sets come in different storage formats:

– .csv: comma-separated values

– .xls: Excel spreadsheet

– .txt: text file

– Provided by an R package

– ...

Loading data in R

R can read in many different formats.

We use two:

1. provided by an R package

2. .csv spreadsheets

Loading data from a package

library(ISLR) # load the package containing the data, here ISLR
data(OJ) # load the OJ dataset, it will be stored as `OJ`
head(OJ,2)
  Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
1       CH            237       1    1.75    1.99      0    0.0         0
2       CH            239       1    1.75    1.99      0    0.3         0
  SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
1         0     0.5        1.99        1.75      0.24     No  0.000000
2         1     0.6        1.69        1.75     -0.06     No  0.150754
  PctDiscCH ListPriceDiff STORE
1         0          0.24     1
2         0          0.24     1
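
The same pattern works for any dataset that ships with a package. As a sketch that runs without installing anything, the example below uses mtcars from the datasets package that comes with R. It is a stand-in here, not a dataset used in this course.

```r
data(mtcars)        # load the dataset, it will be stored as `mtcars`
head(mtcars, 2)     # first two rows
dim(mtcars)         # number of rows and columns
names(mtcars)       # column names
str(mtcars)         # type of each column

# ?mtcars opens the documentation describing the variables
```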

Loading data from csv file

# Load tidyverse package, which provides read_csv() function
library(tidyverse)
# Load beer dataset
df_beer <- read_csv("beer.csv")
head(df_beer,2)
# A tibble: 2 x 11
  store        upc  week  move price sale  profit brand  packsize itemsize
  <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     <dbl>    <dbl>
1    86 1820000016    91    23  3.49 <NA>    19.0 BUDWE~        6       12
2    86 1820000784    91     9  3.79 <NA>    28.2 O'DOU~        6       12
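
If you do not have beer.csv at hand, the round trip below shows the idea on a tiny file. It is a sketch under assumptions: it uses base R's write.csv() and read.csv() so that it runs without the tidyverse (read_csv() is called with a file path in the same way), and the temporary file and three-column data frame are made up for the example.

```r
# Create a small data frame and write it to a temporary .csv file
small <- data.frame(store = c(86, 86), week = c(91, 92), price = c(3.49, 3.49))
tmp <- tempfile(fileext = ".csv")
write.csv(small, tmp, row.names = FALSE)

# Read the file back in and inspect it
df_small <- read.csv(tmp)
head(df_small, 2)
str(df_small)
```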

Errors, errors, errors ...

Errors

Computers are precise and powerful, but stupid...

They do exactly what you tell them to do.

If you are not precise, they won’t understand.

– “Computer says no”.

If you tell them to do something they cannot do, they will fail.

– “Computer says no”.

A common error: file locations

R needs to know where to look for the file beer.csv.

– read_csv("beer.csv") will look in the working directory.


# Check working directory, which is where R will look for files
getwd()
[1] "C:/Users/Rafael/Dropbox (UCL)/Teaching/Data Analytics MSIN0010/week 1/
# Change working directory
setwd("new_directory")
# Or specify different path
df_beer <- read_csv("PATH/beer.csv") # will look in PATH
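
Before reading a file, it can help to check that R actually sees it. The sketch below uses base R only. The folder name "data" is hypothetical; replace it with wherever your file is stored.

```r
getwd()         # where R is currently looking
list.files()    # files in the working directory

# Build a path from its parts; file.path() inserts the separators
path <- file.path("data", "beer.csv")
path

# Check whether the file exists before trying to read it
if (file.exists(path)) {
  message("Found ", path)
} else {
  message("Cannot find ", path, " relative to ", getwd())
}
```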

Interpreting errors

If you do not specify the correct path, R will throw an error.


df_beer <- read_csv("wrong_path/beer.csv")
Error: 'wrong_path/beer.csv' does not exist in current working directory ('

Read the error messages!

– They are usually quite helpful.

If not, copy and paste the error message into Google.

– Most likely someone else has had the same problem before.
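
If you want your script to keep running after such an error, base R's tryCatch() lets you catch it and read the message yourself. This is an optional sketch. It uses base R's read.csv() instead of read_csv() so that it runs without the tidyverse, and R may additionally print a warning about the missing file.

```r
result <- tryCatch(
  read.csv("wrong_path/beer.csv"),          # this path does not exist
  error = function(e) {
    # conditionMessage() extracts the text of the error message
    message("Could not read the file: ", conditionMessage(e))
    NULL                                    # return NULL instead of stopping
  }
)

is.null(result)   # TRUE: nothing was loaded
```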

Inspecting datasets

Raw data: Excel

Accessing rows and columns

# first three elements of column price
df_beer$price[1:3]
[1] 3.49 3.79 3.69
# first two elements of column 3
df_beer[1:2,3]
# A tibble: 2 x 1
   week
  <dbl>
1    91
2    91

NOTE: First index is always row, second index is always column.


df_beer[ROW_INDEX,COLUMN_INDEX]
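
A few more indexing patterns, sketched on a small base R data frame so that the example runs on its own. The third row (Ann, 35) is made up for illustration.

```r
df <- data.frame(Name = c("Jon", "Bill", "Ann"), Age = c(23, 41, 35))

df$Age[1:2]          # first two elements of column Age
df[2, ]              # row 2, all columns (column index left empty)
df[, "Age"]          # all rows, column Age (row index left empty)
df[1:2, 2]           # rows 1 to 2 of column 2
df[df$Age > 30, ]    # all rows where Age is above 30
```

Leaving an index empty means "all rows" or "all columns".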

Summary statistics

# Mean of beer prices
mean(df_beer$price)
[1] 4.282289
# Median of beer prices
median(df_beer$price)
[1] 3.99
# Variance of beer prices
var(df_beer$price)
[1] 1.057594
# Correlation of price and profit
cor(df_beer$price,df_beer$profit)
[1] 0.05824536

Statistic            R function
Mean                 mean()
Median               median()
Variance             var()
Standard deviation   sd()
Correlation          cor()
Minimum              min()
Maximum              max()
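
A minimal sketch of these functions on a short vector. The vector x holds the first three beer prices shown earlier; y is made up for the example. The last lines show what happens with missing values (NA): by default the result is NA, unless na.rm = TRUE is set.

```r
x <- c(3.49, 3.79, 3.69)   # first three beer prices
y <- c(23, 9, 15)          # hypothetical second variable

mean(x)
median(x)
var(x)
sd(x)                      # equals sqrt(var(x))
cor(x, y)
min(x)
max(x)

# Missing values propagate unless they are removed explicitly
x_na <- c(x, NA)
mean(x_na)                 # NA
mean(x_na, na.rm = TRUE)   # same as mean(x)
```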

Data visualization

Why visualize data?

The goal is not to produce pretty pictures.

The goal is to clearly communicate insights.

The tidyverse data visualization pipeline

tidyverse: an opinionated collection of R packages.

– Data wrangling

– Data visualization

There are many different graphs that can be used to visualize data.

– https://www.r-graph-gallery.com/ provides a fairly exhaustive list.

We focus on a few main ones:

– Histogram, boxplot, scatter plot, line plot

Histogram

[Figure: histogram titled "Distribution of Weekly Prices"; x-axis: price, y-axis: count]

df_beer %>%
  ggplot(aes(x=price)) +
  geom_histogram() +
  labs(title="Distribution of Weekly Prices")

df_beer %>% passes the data frame df_beer to the next function

1. ggplot() initializes the plot

2. aes() specifies the aesthetics to be plotted, here price on the x-axis

3. geom_histogram() specifies the type of plot

4. labs() specifies labels for the plot (here, title)


ggplot(data=df_beer, aes(x=price)) +
  geom_histogram() +
  labs(title="Distribution of Weekly Prices")

Produces the same plot.

The difference is how we specify which data frame to use.
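
The idea of piping is independent of plotting. The sketch below shows it on a plain vector. Note the assumption: it uses base R's native pipe |> (available from R 4.1), which behaves like %>% in this simple case, so the example runs without loading the tidyverse.

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right; x is passed to sqrt(), the result to sum()
piped <- x |> sqrt() |> sum()

identical(nested, piped)   # TRUE: both compute the same thing
```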

Boxplot

[Figure: boxplot titled "Distribution of Weekly Prices"; y-axis: price]

df_beer %>%
  ggplot(aes(y=price)) +
  geom_boxplot() +
  labs(title="Distribution of Weekly Prices")

Now we’re using geom_boxplot() instead of geom_histogram().

And we’re mapping price to the y-axis instead of the x-axis.

Scatter plot

[Figure: scatter plot titled "Price vs. Demand"; x-axis: price, y-axis: move]

df_beer %>%
  ggplot(aes(x=price, y=move)) +
  geom_point() +
  labs(title="Price vs. Demand")

Now we are using geom_point().

And we’re specifying both x- and y-axis in aes().

Line plot

[Figure: line plot titled "Budweiser price over time in store 86"; x-axis: week, y-axis: price]

df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86) %>%
  ggplot(aes(x=week, y=price)) +
  geom_line() +
  labs(title="Budweiser price over time in store 86")

Now we are using geom_line().

We also use filter() to select only one brand and one store.

Data Wrangling

Tidyverse data wrangling pipeline

Many useful functions for data wrangling are provided by the tidyverse package.

Here, I will only discuss two of them that we will be using.

Filtering data

df_subset <- df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86)
head(df_subset,2)
# A tibble: 2 x 11
  store        upc  week  move price sale  profit brand  packsize itemsize
  <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     <dbl>    <dbl>
1    86 1820000016    91    23  3.49 <NA>    19.0 BUDWE~        6       12
2    86 1820000016    92    46  3.49 <NA>    19.0 BUDWE~        6       12

We already saw the filter() function in the line plot.

It allows us to select only a subset of the data.

& is the AND operator: a row is kept only if both conditions are true.

This again uses the pipe operator.

– filter(df_beer, brand == "BUDWEISER BEER" & store == 86) would produce the same result.
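
To see what the conditions do, here is a sketch on a tiny made-up data frame. It uses base R subsetting instead of filter() so that it runs without the tidyverse. The logic of & (AND) and | (OR) is the same inside filter().

```r
# Hypothetical three-row data set
df <- data.frame(
  brand = c("BUDWEISER BEER", "BUDWEISER BEER", "OTHER BEER"),
  store = c(86, 5, 86),
  price = c(3.49, 3.59, 3.79)
)

# AND: keep rows where both conditions are true
both <- df[df$brand == "BUDWEISER BEER" & df$store == 86, ]

# OR: keep rows where at least one condition is true
either <- df[df$brand == "BUDWEISER BEER" | df$store == 86, ]

# subset() is another base R way to write the AND filter
same <- subset(df, brand == "BUDWEISER BEER" & store == 86)

nrow(both)     # 1 row
nrow(either)   # 3 rows
```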

Grouped data

We often have data for multiple groups.

– Panel data: data for multiple individuals over time

We can use group_by() to group data by a variable and then summarize each group with another function.

average_price_in_week <- df_beer %>%
  group_by(week) %>%
  summarize(mean(price))
head(average_price_in_week,2)
# A tibble: 2 x 2
   week `mean(price)`
  <dbl>         <dbl>
1    91          3.98
2    92          4.03
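
The same "split into groups, then summarize" idea can be sketched in base R with aggregate(). This is an alternative to group_by() and summarize(), not the approach used in this course, and the four-row data frame is made up for the example.

```r
# Hypothetical data: two prices per week
df <- data.frame(
  week  = c(91, 91, 92, 92),
  price = c(3.49, 4.47, 3.49, 4.57)
)

# Mean of price within each week
avg <- aggregate(price ~ week, data = df, FUN = mean)
names(avg)[2] <- "avg_price"   # give the summary column a readable name
avg

# With dplyr, summarize(avg_price = mean(price)) names the column directly,
# which avoids the backticks needed for a column called `mean(price)`
```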

We can also use the grouped summary to create a plot.

df_beer %>%
  group_by(week) %>%
  summarize(mean(price)) %>%
  ggplot(aes(x=week, y=`mean(price)`)) +
  geom_line() +
  labs(title="Average price over time (all brands+stores)")

[Figure: line plot titled "Average beer price over time (all brands+stores)"; x-axis: week, y-axis: mean(price)]

A few tips to remember


Keep track of variables and what they are

Know your working directory

Use the help function

Understand the different data structures you’re using

Do not just copy and run code, understand it!

Learn about the packages you’re using (e.g. tidyverse, ggplot2, ...)
