
Intro to R & Tidy - STAT 5000

Bird

7/14/2021

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --


## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Welcome to Week 1: Introduction to R and Tidyverse!
Before we learn to program in R, we need to get set up.
This document is written in R Markdown, an R package that you will typically use from inside RStudio, an editor that runs R. It’s a lot in the beginning, but play with it
for a few hours and it should be easy peasy.
First, download R. https://www.r-project.org/
Second, download RStudio. https://www.rstudio.com/
Third, spend some time with R Markdown.
# install.packages("rmarkdown")
# library(rmarkdown)

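install.packages() downloads and installs a package by name, and library() loads an installed package into the current session. Remove the leading # to run the lines above.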


Once installed, R Markdown will take some time to figure out.
File > New File > R Markdown... will create a neat little starter file.
The first time you try to Knit your file into an even neater little PDF, you will probably hit an error.
Read the error message and debug it (hint: you need a TeX distribution installed to produce PDFs).

Let’s get to some coding though.
This will generally follow along with the textbook so you don’t need to have it yourself.
Once we get to more advanced things, the textbook won’t be able to keep up anyway so you still won’t need
it.
#options(width=70)
ruv <- runif(n=20,min=0,max=1)
round(ruv,4)

## [1] 0.7777 0.5198 0.0846 0.2466 0.5978 0.5376 0.0485 0.7557 0.1694 0.4653
## [11] 0.7779 0.3463 0.9613 0.3301 0.4665 0.2034 0.6380 0.8020 0.5239 0.1976
Here, we assign to ruv a vector of 20 random draws from a uniform distribution, using the runif() function.
Each draw lies between the minimum of 0 and the maximum of 1, and round(ruv,4) prints the values to four decimal places.
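Because runif() draws random numbers, the values you get will differ from the ones printed above. A minimal sketch of how to make the draws repeatable with set.seed(); the seed value 5000 and the names first_draw and second_draw are arbitrary choices for illustration, not part of the original notes:

```r
# Fix the seed so the same "random" numbers are drawn every time.
set.seed(5000)   # 5000 is an arbitrary choice
first_draw <- runif(n = 20, min = 0, max = 1)

set.seed(5000)   # resetting the seed replays the same sequence
second_draw <- runif(n = 20, min = 0, max = 1)

identical(first_draw, second_draw)   # TRUE
range(first_draw)                    # both ends lie within [0, 1]
```

Putting set.seed() at the top of a homework file means the knitted PDF shows the same numbers on every run.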

R as a simple calculator can be useful from time to time:
(8 * 3) + 12/40 - (7^3) + sqrt(9)

## [1] -315.7
In R Markdown, I only entered my code; the output is generated when the file is knit. The echo=TRUE option keeps the original code in the PDF.
For homework, you can code directly in R Markdown using code chunks. Keeping echo=TRUE will let me see
your code before your output prints, as it normally would in R or RStudio.

Text formatting
*italic* or _italic_
**bold** __bold__
superscript^2^ and subscript~2~

End a line with a single backslash to force a line break.

Headings

# 1st Level Header

## 2nd Level Header

### 3rd Level Header

Lists
• Bulleted list item 1
• Item 2
– Item 2a
– Item 2b
1. Numbered list item 1
2. Item 2. The numbers are incremented automatically in the output.

Links and images


<http://example.com>
[linked phrase](http://example.com)

Tables

First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell

Vectors
x <- 3 # <- is the assignment operator: it stores the value 3 under the name x.
y <- c(2,3,4) # c() combines (concatenates) values into a vector.
z <- c(10,20,30,40)

x+y

## [1] 5 6 7
x+z

## [1] 13 23 33 43
y+z

## Warning in y + z: longer object length is not a multiple of shorter object
## length
## [1] 12 23 34 42
x+c(5,5,5,5,5)

## [1] 8 8 8 8 8
z+c(1,2,3,4,5)

## Warning in z + c(1, 2, 3, 4, 5): longer object length is not a multiple of
## shorter object length
## [1] 11 22 33 44 15
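The results and warnings above come from R's recycling rule: when two vectors differ in length, the shorter one is repeated until it matches the longer one, and R warns if the longer length is not a multiple of the shorter. A small sketch of what happens behind the scenes; the name y_recycled is introduced here only for illustration:

```r
x <- 3
y <- c(2, 3, 4)
z <- c(10, 20, 30, 40)

# A length-1 vector is repeated for every element of the longer vector.
x + z                                  # 13 23 33 43

# rep_len() spells out the recycling that y + z performs implicitly:
# y is stretched to 2 3 4 2 so that it matches the length of z.
y_recycled <- rep_len(y, length.out = length(z))
y_recycled                             # 2 3 4 2
z + y_recycled                         # 12 23 34 42, with no warning
```

The warning is worth taking seriously: a length mismatch like y + z is more often a bug than an intended calculation.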
LogicalVector <- (x < y)
LogicalVector

## [1] FALSE FALSE TRUE


typeof(LogicalVector)

## [1] "logical"

Booleans
x <- c(FALSE, TRUE, FALSE)
y <- c(FALSE, TRUE, TRUE)

x & y #are BOTH values true?

## [1] FALSE TRUE FALSE


x | y #are either of them true?

## [1] FALSE TRUE TRUE


x == y #do they equal each other? THIS IS USED A LOT IN PROGRAMMING

## [1] TRUE TRUE FALSE


x != y #do they not equal each other? ALSO USED A LOT

## [1] FALSE FALSE TRUE


print(z)
zDouble <- z
zInteger <- as.integer(zDouble)
typeof(zInteger)

## [1] "integer"
You can also convert a variable to another type (as long as the conversion makes sense) with the following functions:
as.logical()
as.numeric()
as.double()
as.complex() #any physicists here?
as.character()
as.list()

This is especially useful when you want to convert an entire variable in a data set at once,
for example when the data arrive in an unclean state, such as numbers stored as text.
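As an illustration of the conversion functions listed above, here is a short sketch using a made-up vector (raw is hypothetical data, not from the course) in which numbers were read in as text:

```r
# Numbers stored as text, with one entry that is not a number at all.
raw <- c("10", "20", "thirty", "40")
typeof(raw)                         # "character"

# as.numeric() converts what it can; "thirty" becomes NA with a warning,
# which suppressWarnings() silences here.
clean <- suppressWarnings(as.numeric(raw))
clean                               # 10 20 NA 40
typeof(clean)                       # "double"

# Conversions that make sense go through quietly.
as.logical(c("TRUE", "FALSE"))      # TRUE FALSE
as.integer(3.9)                     # 3 (truncates toward zero)
as.character(1:3)                   # "1" "2" "3"
```

An NA after a conversion is a sign that some entries did not fit the target type and need a closer look.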

Other Basic Functionality
print(1)

## [1] 1
print("You need to make sure YOU knOW wh4T YOU ARE PRintinG")

## [1] "You need to make sure YOU knOW wh4T YOU ARE PRintinG"
print() shows a string exactly as it was typed, so check your capitalization and spelling.
You can also print a function to see which parameters it takes and other info:
print(exp)

## function (x) .Primitive("exp")


print(log)

## function (x, base = exp(1)) .Primitive("log")


Some functions to know:
print() – prints objects
log() – computes logarithms (natural log by default)
exp() – computes the exponential function
sqrt() – takes the square root
abs() – returns the absolute value
sin() – returns the sine
cos() – returns the cosine
tan() – returns the tangent
asin() – returns the arc-sine
factorial() – returns the factorial
sign() – returns the sign (-1, 0, or 1)
round() – rounds the input to the desired number of decimal places
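A quick tour of the functions in the list, using inputs chosen here so that each result is easy to check by hand (the trigonometric functions work in radians):

```r
log(100, base = 10)     # 2
log(exp(1))             # 1: the default base is e
exp(0)                  # 1
sqrt(9)                 # 3
abs(-7.5)               # 7.5
sin(0)                  # 0
cos(0)                  # 1
asin(1)                 # pi/2, about 1.5708
factorial(5)            # 120
sign(c(-2, 0, 2))       # -1 0 1
round(3.14159, 2)       # 3.14
```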

If you’re interested in intensive mathematical computation, other programming languages may be better suited to it.

Random tid-bits
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal
!= not equal
& and
| or

General Help and Searching for Help

?log opens the help page for the log function, for example.
??logit searches the help pages of all installed packages for the term logit, for example.
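The two shortcuts have spelled-out equivalents, and args() gives a quick look at a function's parameters without leaving the console. A brief sketch; log_params is a name introduced here for illustration:

```r
# ?log is shorthand for help("log"); ??logit is shorthand for
# help.search("logit"). Both open a viewer, so they are left commented out.
# help("log")
# help.search("logit")

# args() prints a function's parameters and defaults right in the console.
args(log)
log_params <- names(formals(args(log)))
log_params            # "x" "base"

# example(log) runs the examples from the bottom of the help page.
```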

Matrices (and Arrays)
Later on, we will move beyond base R structures to things like Tibbles.
For now, the basics:
A matrix in R is a 2-D array.
M <- matrix(data=1:24,nrow=4, byrow=TRUE)

This says: fill a matrix with the numbers 1 through 24 in 4 rows (R infers the 6 columns from 24 / 4), filling row by row instead of the default column by column.
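To see what byrow changes, the sketch below builds the same 24 numbers both ways and compares the first row of each. The names by_row and by_col are introduced here for illustration:

```r
by_row <- matrix(data = 1:24, nrow = 4, byrow = TRUE)
by_col <- matrix(data = 1:24, nrow = 4, byrow = FALSE)   # the default

dim(by_row)      # 4 6
by_row[1, ]      # 1 2 3 4 5 6
by_col[1, ]      # 1 5 9 13 17 21

# Transposing the column-filled 6 x 4 matrix gives the row-filled 4 x 6 one.
identical(by_row, t(matrix(1:24, nrow = 6)))   # TRUE
```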
M2 <- matrix(data=c(1,2,3,4,5,6),nrow=3, byrow=FALSE)

is.array(M2)

## [1] TRUE
is.matrix(M2)

## [1] TRUE
Now let’s practice building a matrix from scratch.
Whatever layout your data arrive in, you can choose the fill order (by row or by column) that matches it.
City1 has temperatures on three days of 80,70,75
City2 has temperatures on three days of 55,56,45
City3 has temperatures on three days of 20,22,31

temp.data <- matrix(c(80,70,75,55,56,45,20,22,31), nrow=3, ncol=3, byrow=TRUE,
                    dimnames = list(c("City1","City2","City3"), c("Day1","Day2","Day3")))
temp.data

##       Day1 Day2 Day3
## City1   80   70   75
## City2   55   56   45
## City3   20   22   31
dim(temp.data)

## [1] 3 3
temp.data[2,] #2nd row, all columns

## Day1 Day2 Day3
##   55   56   45
temp.data[2,3] #2nd row, 3rd column

## [1] 45
temp.data[1, ,drop=FALSE] #drop=FALSE will keep this a matrix instead of defaulting into a vector.

##       Day1 Day2 Day3
## City1   80   70   75
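Because temp.data has row and column names, the same indexing also works by name. The sketch below rebuilds the matrix so that it runs on its own, then shows name-based indexing, the effect of drop=FALSE, and a per-city average with rowMeans(), which is an extra step not covered in the notes:

```r
temp.data <- matrix(c(80, 70, 75, 55, 56, 45, 20, 22, 31),
                    nrow = 3, ncol = 3, byrow = TRUE,
                    dimnames = list(c("City1", "City2", "City3"),
                                    c("Day1", "Day2", "Day3")))

# Row and column names can stand in for positions.
temp.data["City2", "Day3"]              # 45, same as temp.data[2, 3]

# A single row drops to a plain vector unless drop = FALSE is given.
is.matrix(temp.data[1, ])               # FALSE
is.matrix(temp.data[1, , drop = FALSE]) # TRUE

# Average temperature per city (one value per row).
rowMeans(temp.data)                     # City1 75, City2 52, City3 about 24.33
```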

Data Frames
NumVec <- c(1,2,3,4)
CharVec <- c("a","b","c","d")
LogVec <- c(TRUE,TRUE,TRUE,FALSE) # no quotes: a quoted "TRUE" would be stored as text
df <- data.frame(NumVec,CharVec,LogVec)
df

##   NumVec CharVec LogVec
## 1      1       a   TRUE
## 2      2       b   TRUE
## 3      3       c   TRUE
## 4      4       d  FALSE
dfTibble <- as_tibble(df)
dfTibble

## # A tibble: 4 x 3
##   NumVec CharVec LogVec
##    <dbl> <chr>   <lgl>
## 1      1 a       TRUE
## 2      2 b       TRUE
## 3      3 c       TRUE
## 4      4 d       FALSE
In this course, we will usually skip base R structures in favor of Tidyverse options like Tibbles.
R also ships with built-in data sets, such as iris, that you can use directly:
as_tibble(iris)

## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
#or store it
iris_df <- as_tibble(iris)

Tidyverse %>% Pipes
You can read the rest of the chapter if you’d like, but most of your data analysis will be done in Tidyverse.
Let’s check out the basics that will get you on your way.
iris_df <- as_tibble(iris)

Let’s see some example code:
iris_df %>%
  group_by(Species) %>%
  summarize(m = mean(Sepal.Length)) %>%
  ungroup()

## # A tibble: 3 x 2
## Species m
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
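First, %>% is called a pipe. It comes from the magrittr package, which the Tidyverse loads for you, and it passes the result on its left to the function on its right, so the steps read in an intuitive
order.
In the above code, group_by() splits the rows into one group per level of a variable, here Species.
Then, we %>% into summarize(), a function that computes convenient statistics per group, in this case the mean of a different
variable, Sepal.Length.
Last, ungroup() removes any grouping that remains on the result, so later steps operate on all rows at once. It does not bring back the original Tibble: the result is still the 3-row summary.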
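One way to see what the pipe does is to write the same calculation without it. In the sketch below, x %>% f(y) is rewritten as f(x, y) at every step. The names piped and nested are introduced for illustration, and only dplyr is loaded (instead of the full tidyverse) to keep the example light:

```r
library(dplyr)

iris_df <- as_tibble(iris)

# Piped version: read top to bottom.
piped <- iris_df %>%
  group_by(Species) %>%
  summarize(m = mean(Sepal.Length)) %>%
  ungroup()

# Nested version: the same calls, read inside out.
nested <- ungroup(summarize(group_by(iris_df, Species), m = mean(Sepal.Length)))

identical(piped, nested)   # TRUE
```

Both versions produce the same Tibble; the piped one simply lists the steps in the order they happen.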

Mutate
iris_df <- as_tibble(iris)

iris_df_v2 <- iris_df %>%
  mutate(pl2 = Petal.Length^2,
         four_sl = Sepal.Length * 4)

iris_df_v2

## # A tibble: 150 x 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species pl2 four_sl
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 1.96 20.4
## 2 4.9 3 1.4 0.2 setosa 1.96 19.6
## 3 4.7 3.2 1.3 0.2 setosa 1.69 18.8
## 4 4.6 3.1 1.5 0.2 setosa 2.25 18.4
## 5 5 3.6 1.4 0.2 setosa 1.96 20
## 6 5.4 3.9 1.7 0.4 setosa 2.89 21.6
## 7 4.6 3.4 1.4 0.3 setosa 1.96 18.4
## 8 5 3.4 1.5 0.2 setosa 2.25 20
## 9 4.4 2.9 1.4 0.2 setosa 1.96 17.6
## 10 4.9 3.1 1.5 0.1 setosa 2.25 19.6
## # ... with 140 more rows
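As the output above shows, mutate() keeps every existing row and column and appends the new columns on the right. The sketch below checks this with dim() and adds one hypothetical column, pl2_plus_one, to show that a column created earlier in the same mutate() call can be used right away:

```r
library(dplyr)

iris_df <- as_tibble(iris)

iris_df_v2 <- iris_df %>%
  mutate(pl2 = Petal.Length^2,
         four_sl = Sepal.Length * 4,
         # hypothetical extra column: pl2 was defined one line earlier
         # in this same call and is already available here
         pl2_plus_one = pl2 + 1)

dim(iris_df)       # 150 5
dim(iris_df_v2)    # 150 8
iris_df_v2$pl2[1]  # 1.96, i.e. 1.4^2
```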

Summarize
iris_df <- as_tibble(iris)

summarize() collapses all rows (or all rows within each group) into a single row of summary statistics.

iris_df %>%
  summarize(avg.sl = mean(Sepal.Length))

## # A tibble: 1 x 1
## avg.sl
## <dbl>
## 1 5.84
iris_df %>%
  summarize(sd.sl = sd(Sepal.Length))

## # A tibble: 1 x 1
## sd.sl
## <dbl>
## 1 0.828
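The two calls above can be combined, since summarize() accepts several statistics at once. A sketch under that assumption; the row counter n() and the names overall and by_species are additions for illustration:

```r
library(dplyr)

iris_df <- as_tibble(iris)

# Several statistics can be computed in a single summarize() call.
overall <- iris_df %>%
  summarize(avg.sl = mean(Sepal.Length),
            sd.sl  = sd(Sepal.Length),
            n      = n())
overall   # one row: avg.sl 5.84, sd.sl 0.828, n 150

# With group_by() in front, the same call returns one row per group.
by_species <- iris_df %>%
  group_by(Species) %>%
  summarize(avg.sl = mean(Sepal.Length),
            sd.sl  = sd(Sepal.Length),
            n      = n())
by_species   # three rows, 50 flowers per species
```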

Filter
Choose specific ROWS
iris_df <- as_tibble(iris)

iris_df_setosa_only <- iris_df %>% filter(Species == "setosa")

iris_df_setosa_only

## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 40 more rows
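filter() also accepts several conditions at once, built from the comparison and logical operators introduced earlier. A short sketch; the cutoff of 5 and the names big_setosa and not_setosa are arbitrary choices for illustration:

```r
library(dplyr)

iris_df <- as_tibble(iris)

# Conditions separated by commas must all hold (same as joining them with &).
big_setosa <- iris_df %>%
  filter(Species == "setosa", Sepal.Length > 5)

# | keeps rows where at least one condition holds.
not_setosa <- iris_df %>%
  filter(Species == "versicolor" | Species == "virginica")

nrow(not_setosa)   # 100
```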

Select
Choose specific COLUMNS
iris_df <- as_tibble(iris)

iris_df %>% select(Sepal.Length, Sepal.Width) %>% glimpse()

## Rows: 150
## Columns: 2
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
iris_df %>% select(-Sepal.Length, -Sepal.Width) %>% glimpse()

## Rows: 150
## Columns: 3
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~
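Typing every column name gets tedious on wider data sets. select() also understands helper functions such as starts_with() and everything(); the sketch below reproduces the two selections above with a helper and adds a reordering example. The result names are introduced for illustration:

```r
library(dplyr)

iris_df <- as_tibble(iris)

# starts_with() picks columns by name prefix.
sepal_cols <- iris_df %>% select(starts_with("Sepal"))
names(sepal_cols)   # "Sepal.Length" "Sepal.Width"

# A minus sign drops the matched columns instead.
no_sepal <- iris_df %>% select(-starts_with("Sepal"))
names(no_sepal)     # "Petal.Length" "Petal.Width" "Species"

# select() can also reorder: Species first, then everything else.
species_first <- iris_df %>% select(Species, everything())
names(species_first)[1]   # "Species"
```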

Arrange
Use this to sort the rows of your data by the values of a variable (ascending by default)
iris_df <- as_tibble(iris)

iris_df %>% arrange(Sepal.Length) %>% glimpse()

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 4.3, 4.4, 4.4, 4.4, 4.5, 4.6, 4.6, 4.6, 4.6, 4.7, 4.7, 4.~
## $ Sepal.Width <dbl> 3.0, 2.9, 3.0, 3.2, 2.3, 3.1, 3.4, 3.6, 3.2, 3.2, 3.2, 3.~
## $ Petal.Length <dbl> 1.1, 1.4, 1.3, 1.3, 1.3, 1.5, 1.4, 1.0, 1.4, 1.3, 1.6, 1.~
## $ Petal.Width <dbl> 0.1, 0.2, 0.2, 0.2, 0.3, 0.2, 0.3, 0.2, 0.2, 0.2, 0.2, 0.~
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~
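To sort from largest to smallest, wrap the variable in desc(); listing more than one variable breaks ties in the order given. A minimal sketch, with result names introduced for illustration:

```r
library(dplyr)

iris_df <- as_tibble(iris)

# desc() reverses the default ascending order.
longest_first <- iris_df %>% arrange(desc(Sepal.Length))
longest_first$Sepal.Length[1]    # 7.9, the longest sepal in the data

# Additional variables break ties: sort by Species, then by Sepal.Length.
by_species_then_length <- iris_df %>% arrange(Species, Sepal.Length)
by_species_then_length$Sepal.Length[1]   # 4.3, the shortest setosa sepal
```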
