Rafael Greminger
1
The Basics
2
RStudio
3
Getting help
– DataCamp exercises
– Me
4
Packages
5
Types of objects
– Lists
– Vectors
– Matrices
– Data frames
– Tibbles
– ...
6
Creating / assigning objects
[1] 1
b <- c(1,2) # assign a vector [1,2] to object b
print(b)
[1] 1 2
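A minimal sketch of what can be done with an object once it has been assigned, using only base R. The name total is illustrative and does not come from the slides.

```r
# Assign a numeric vector to the object b, as on the slide
b <- c(1, 2)

length(b)        # number of elements in b
b * 2            # arithmetic is applied element by element
b[2]             # square brackets select elements by position

# The result of a function call can itself be stored in a new object
total <- sum(b)
print(total)
```

Typing the name of an object on its own line in the console prints it as well, so print() is mostly needed inside scripts and functions.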
7
Creating / assigning objects
  Name Age
1  Jon  23
2 Bill  41
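The code that produced this output is not preserved here. One plausible way to build such a data frame in base R is sketched below; the object name df_people is an assumption.

```r
# Build a data frame from two equally long vectors, one per column
# (df_people is an illustrative name, not taken from the slides)
df_people <- data.frame(
  Name = c("Jon", "Bill"),
  Age  = c(23, 41)
)

print(df_people)
```

Each argument to data.frame() becomes a column, and the argument name becomes the column name.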
8
Keep track of your objects!
9
Functions
– library()
– c()
– print()
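As a rough illustration of how functions are called, the sketch below passes arguments by position and by name. The functions round() and mean() are base R; the object names are illustrative.

```r
# A function takes inputs (arguments) in parentheses and returns a result
values <- c(1, 2, NA)          # c() combines values into a vector

# Arguments can be given by position or by name
rounded <- round(3.14159, digits = 2)
print(rounded)                 # 3.14

# Named arguments change how a function behaves:
# na.rm = TRUE tells mean() to ignore missing values
avg <- mean(values, na.rm = TRUE)
print(avg)                     # 1.5
```

Running ?mean in the console opens the help page that lists a function's arguments.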
10
Different ways are valid
11
Working with data sets
12
Different formats
– Provided by an R package
– ...
13
Loading data in R
We use two:
1. provided by an R package
2. .csv spreadsheets
14
Loading data from a package
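The slide does not say which package the course data come from, so the sketch below uses the mtcars data set from the datasets package that ships with R, purely as a stand-in.

```r
# Load a data set that is bundled with a package.
# mtcars and the datasets package are stand-ins, not the course data.
data("mtcars", package = "datasets")

# The data set is now available as an ordinary data frame
head(mtcars, 2)
```

For a package that is not attached by default, library(packagename) would normally come first.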
15
Loading data from csv file
# A tibble: 2 x 11
  store        upc  week  move price sale  profit brand  packsize itemsize
  <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     <dbl>    <dbl>
1    86 1820000016    91    23  3.49 <NA>    19.0 BUDWE~        6       12
2    86 1820000784    91     9  3.79 <NA>    28.2 O'DOU~        6       12
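The call that produced this tibble is not shown. A minimal sketch of reading a .csv file with readr::read_csv() follows; it writes a tiny made-up file to a temporary location so that it runs anywhere. The file contents and the name df_example are assumptions, not the course data.

```r
library(readr)

# Write a tiny illustrative CSV to a temporary file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("store,week,price",
             "1,10,3.50",
             "1,11,3.75"),
           csv_path)

# read_csv() guesses the column types and returns a tibble
df_example <- read_csv(csv_path, show_col_types = FALSE)
print(df_example)
```

With a real file, csv_path would simply be the path to that file on disk.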
16
Errors, errors, errors ...
17
Errors
If you tell computers to do something they cannot do, they will fail.
18
A common error: file locations
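One way to diagnose a file-location error is to check where R is looking before trying to read the file. The sketch below uses base R only; the path data/beer.csv is hypothetical.

```r
# Where does R currently look for files?
getwd()

# Build a relative path (hypothetical file name and folder)
path <- file.path("data", "beer.csv")

# Check that the file exists before trying to read it
if (file.exists(path)) {
  df <- read.csv(path)
} else {
  message("File not found: ", normalizePath(path, mustWork = FALSE))
}
```

If the message appears, either the working directory or the path is wrong; comparing the printed path with the real location of the file usually shows which.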
19
Interpreting errors
– Most likely someone else has had the same problem before.
20
Inspecting datasets
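A few base R functions that are commonly used for a first look at a data set are sketched below. The built-in iris data frame is a stand-in for the course data.

```r
# iris ships with R and is used here only as a stand-in
dim(iris)        # number of rows and columns
names(iris)      # column names
head(iris, 3)    # first three rows
str(iris)        # type and a preview of every column
```

In RStudio, View(iris) opens the same data in a spreadsheet-like viewer.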
21
Raw data: Excel
22
Accessing rows and columns
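A minimal sketch of base R indexing, using a small made-up data frame (the name df is illustrative):

```r
df <- data.frame(Name = c("Jon", "Bill"), Age = c(23, 41))

df$Age            # one column by name, returned as a vector
df[["Name"]]      # the same idea with double brackets
df[1, ]           # first row, all columns
df[, "Age"]       # all rows, one column
df[2, "Name"]     # a single cell: row 2 of column Name
```

Inside single square brackets the part before the comma selects rows and the part after it selects columns; leaving one side empty means "all".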
23
Summary statistics
24
Summary statistics
Statistic            R function
Mean                 mean()
Median               median()
Variance             var()
Standard deviation   sd()
Correlation          cor()
Minimum              min()
Maximum              max()
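The functions in the table can be tried on any numeric vector. The sketch below uses two short made-up vectors named after the price and move columns; the values are invented for illustration.

```r
# Made-up values, not taken from the course data
price <- c(3.49, 3.79, 3.99, 4.29)
move  <- c(23, 9, 12, 7)

mean(price)        # average
median(price)      # middle value
var(price)         # variance
sd(price)          # standard deviation, the square root of the variance
cor(price, move)   # correlation between two variables
min(price)         # smallest value
max(price)         # largest value
```

For a column of a data frame the same calls apply, for example mean(df_beer$price).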
25
Data visualization
26
Why visualize data?
27
The tidyverse data visualization pipeline
– Data wrangling
– Data visualization
There are many different graphs that can be used to visualize data.
28
Histogram
[Figure: histogram; x-axis: price, y-axis: count]
29
Histogram
df_beer %>%
  ggplot(aes(x = price)) +
  geom_histogram() +
  labs(title = "Distribution of Weekly Prices")
df_beer %>% passes the data frame df_beer to the next function.
30
Histogram
ggplot(data = df_beer, aes(x = price)) +
  geom_histogram() +
  labs(title = "Distribution of Weekly Prices")
31
Boxplot
[Figure: boxplot; y-axis: price]
32
Boxplot
df_beer %>%
  ggplot(aes(y = price)) +
  geom_boxplot() +
  labs(title = "Distribution of Weekly Prices")
33
Scatter plot
[Figure: scatter plot; x-axis: price, y-axis: move]
34
Scatter plot
df_beer %>%
  ggplot(aes(x = price, y = move)) +
  geom_point() +
  labs(title = "Price vs. Demand")
35
Line plot
[Figure: line plot; y-axis: price]
36
Line plot
df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86) %>%
  ggplot(aes(x = week, y = price)) +
  geom_line() +
  labs(title = "Budweiser price over time in store 86")
We also use filter() to select only one brand and one store.
37
Data Wrangling
38
Tidyverse data wrangling pipeline
Many useful functions for data wrangling are provided by the tidyverse, a collection of R packages that includes dplyr.
39
Filtering data
40
Filtering data
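A minimal sketch of dplyr's filter() on a small made-up tibble. The column names mirror df_beer, but the values and the name df_toy are invented for illustration.

```r
library(dplyr)

# Made-up data with the same column names as df_beer
df_toy <- tibble(
  brand = c("A", "A", "B"),
  store = c(86, 90, 86),
  price = c(3.49, 3.59, 3.79)
)

# Keep only the rows for which the condition is TRUE
only_a <- df_toy %>% filter(brand == "A")

# Several conditions separated by commas must all hold (logical AND)
a_in_86 <- df_toy %>% filter(brand == "A", store == 86)

# Use | when it is enough for either condition to hold (logical OR)
pricey_or_90 <- df_toy %>% filter(price > 3.5 | store == 90)

print(a_in_86)
```

filter() never changes df_toy itself; the filtered rows exist only in the new objects.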
41
Grouped data
We can use group_by() to group data by a variable and then summarize each group with another function.
average_price_in_week <- df_beer %>%
  group_by(week) %>%
  summarize(mean(price))
head(average_price_in_week, 2)
# A tibble: 2 x 2
   week `mean(price)`
  <dbl>         <dbl>
1    91          3.98
2    92          4.03
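The column above is called `mean(price)` because the summary was not given a name. As a variation, a name can be supplied inside summarize(), which avoids the backticks later on. This is a sketch on made-up data; df_toy, mean_price, and n_obs are illustrative names.

```r
library(dplyr)

# Made-up weekly prices, two observations per week
df_toy <- tibble(
  week  = c(91, 91, 92, 92),
  price = c(3.49, 3.79, 3.99, 4.09)
)

avg_by_week <- df_toy %>%
  group_by(week) %>%
  summarize(
    mean_price = mean(price),  # named summary column
    n_obs      = n()           # number of rows in each group
  )

print(avg_by_week)
```

The result has one row per week, and the summary can then be referred to as mean_price in later steps.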
42
Grouped data
We can also use the grouped summary as the input to a plot.
df_beer %>%
  group_by(week) %>%
  summarize(mean(price)) %>%
  ggplot(aes(x = week, y = `mean(price)`)) +
  geom_line() +
  labs(title = "Average price over time (all brands+stores)")
43
Grouped data
[Figure: line plot; y-axis: mean(price)]
44
A few tips to remember
45
Tips to remember
46