
MONASH

BUSINESS

Introduction to R

ETW2001 Foundations of Data Analysis and Modelling


Manjeevan Singh Seera



Outline

❑ Introduction to R
❑ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists

Introduction

You can’t be a master of data science from reading a book. The next few weeks
will give you a foundation in the important tools.

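In a typical data science project, the set of tools you need looks something like this:

[Figure: the data science workflow, from importing and tidying data through transforming, visualizing, and modelling it, to communicating the results.]

Source: https://r4ds.had.co.nz/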
Introduction

How do we get started?

First, you need to have data. You can use data stored in a file (typically CSV), a database, or a web application programming interface (API), and load it into a data frame. If you don’t have data loaded in R, then you can’t work on it.

Next, after importing data, it’s good to tidy it. Tidying means storing the data in a consistent form that matches the meaning of the dataset with the way it is stored. When your data is tidy, every column is a variable and every row is an observation.

Tidy data is important because a consistent structure lets you focus on the questions you have about the data, rather than worrying about whether the data is in the correct format.

Introduction

Once your data is tidy, you need to transform it. The transform process includes:
• Narrowing in on observations of interest (like all people in one city, or all data
from the last year),
• Creating new variables that are functions of existing variables (like computing
speed from distance and time),
• Calculating a set of summary statistics (like counts or means).

Tidying + transforming = wrangling


Why "wrangling"? Because getting your data into a form that is natural to work with often feels like hard work.

There are two main engines of knowledge generation: visualization and modelling. They complement each other.

Visualization

Source: https://www.mydigitalfootprint.com/2014/05/my-digital-footprint-data-sorted.html
Visualization

Visualization is a fundamentally human activity. A good visualization:
• Will show you things that you did not expect,
• May raise new questions about the data,
• Might hint that you’re asking the wrong question,
• Might hint that you need to collect different data.

Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools. Remember that every model makes assumptions, and by its very nature a model cannot question its own assumptions. This means that a model cannot fundamentally surprise you.

Visualization

The last step involves communication, an absolutely critical part of any data
analysis project. It doesn’t matter how well your models and visualization have
led you to understand the data unless you can also communicate your results to
others.

All these tools have one thing in common: programming. Programming is a cross-
cutting tool that you use in every part of the project. You don’t need to be an
expert programmer to be a data scientist, but learning more about programming
pays off because becoming a better programmer allows you to automate common
tasks, and solve new problems with greater ease.

In this unit, we will be using R. You may not learn everything you need, but you
will surely learn all the fundamentals.

The tidyverse

After you have your data, you will need to install some R packages.

What is a package? It is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in the notes are part of the so-called tidyverse. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.

You can install the complete tidyverse with a single line of code:
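The line of code from the original slide did not survive, but assuming the standard CRAN package name it would most likely be the following. It needs an internet connection and only has to be run once per machine.

```r
# Download and install the tidyverse collection of packages from CRAN.
install.packages("tidyverse")
```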

The tidyverse

You will not be able to use the functions, objects, and help files in a package until
you load it with library(). Once you have installed a package, you can load it
with the library() function:
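A minimal sketch of the loading step, assuming the tidyverse has already been installed on your machine:

```r
# Attach the core tidyverse packages for the current R session.
# This has to be repeated every time you start a new session.
library(tidyverse)
```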

This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and
dplyr packages. These are considered to be the core of the tidyverse because
you’ll use them in almost every analysis.

Other packages

The tidyverse is one example of a package. There are many other packages, each aimed at solving problems in a particular domain. Throughout this unit, you will learn different packages to suit the problems that are posed.

For a start, we will use three data packages from outside the tidyverse:
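The package names were lost from the slide. As an assumption based on the R for Data Science book cited earlier, the three data packages are probably nycflights13, gapminder, and Lahman, which could be installed like this:

```r
# Assumed package names (not confirmed by the slide):
#   nycflights13 - airline flights
#   gapminder    - world development
#   Lahman       - baseball
install.packages(c("nycflights13", "gapminder", "Lahman"))
```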

These packages provide data on airline flights, world development, and baseball
that we’ll use to illustrate key data science ideas.

Introduction to R

Now that we have covered some of the basics, let’s see what R is.

R is a language and environment for statistical computing and graphics. It is similar to the S language developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical tools (linear and nonlinear modelling, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form.

How does it work?

R uses the # sign to add comments, so that you and others can understand what the R code is about. This is not a hashtag: anything that follows # on a line is a comment and will not be run as R code. Comments help you remember what you typed and help others understand your code.

For example, calculate 2 + 3 in the editor.
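A minimal sketch of what this looks like in practice. The comment wording is illustrative only.

```r
# Calculate 2 + 3 (this line is a comment, so R ignores it)
2 + 3
# The console prints: [1] 5
```

The first and last lines are comments and have no effect; only the middle line is evaluated.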


[Figure: annotated screenshot that labels the comments (no effect on the code), your input to R, and the output from R.]

Throughout the slides, text on a grey background is what you key into the console, while the resulting output is shown on a white background.

Arithmetic with R

In its most basic form, R can be used as a simple calculator. Consider the
following arithmetic operators:

Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^
Modulo %%

The ^ operator raises the number to its left to the power of the number to its
right: for example 3^2 is 9.
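As a quick illustration of each operator, here is a small sketch. The variable names and numbers are arbitrary choices, not taken from the slides.

```r
sum_result  <- 5 + 5        # addition: 10
diff_result <- 5 - 5        # subtraction: 0
prod_result <- 3 * 5        # multiplication: 15
quot_result <- (5 + 5) / 2  # division: 5
pow_result  <- 3 ^ 2        # exponentiation: 9
mod_result  <- 28 %% 6      # modulo (remainder of 28 divided by 6): 4
```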

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 42) or an object (e.g. a function
description) in R. You can then later use this variable's name to easily access the
value or the object that is stored within this variable.

Variable assignment

Suppose we need 5 apples and 6 oranges in our fruit basket. First assign the values 5 and 6 to variables for the apples and the oranges. To get the total number of fruit, add the two variables together and store the result in a new variable, my_fruit.
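A minimal sketch of these steps. The names my_apples and my_oranges are assumed here to match the my_fruit and my_oranges names used in the slides.

```r
# Assign the number of apples and oranges to variables
my_apples <- 5
my_oranges <- 6

# Add the two variables and store the total in a new variable
my_fruit <- my_apples + my_oranges

# Print the total
my_fruit  # 11
```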

Apples and oranges

Wait! You got an error? If you had assigned a text value such as "six" to the variable my_oranges and then tried to add the apples and the oranges, you would be trying to assign the sum of a numeric and a character variable to the variable my_fruit. This is not possible.
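The sketch below reproduces the problem safely. It assumes the variable name my_apples, and it wraps the addition in tryCatch() so that the error message is captured instead of stopping the script.

```r
my_apples <- 5
my_oranges <- "six"   # a character value, not a number

# Adding a numeric and a character value raises an error;
# tryCatch() captures the error message as text.
result <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)

result  # the error message reported by R
```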

Basic data types in R

R works with numerous data types. Some of the most basic types to get started
are:
• Decimal values like 4.5 are called numerics.
• Whole numbers like 4 are called integers. Integers are also numerics.
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
Note how the quotation marks in the editor indicate that "some text" is a string.

What's that data type?

Do you remember that when you added 5 + "six", you got an error due to a mismatch in data types?

You can avoid such embarrassing situations by checking the data type of a variable beforehand.

You can do this with the class() function.
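A short sketch of class() on each basic type, using made-up variable names. Note that R stores a plain 4 as a numeric; to get a true integer you append L, as in 4L.

```r
my_numeric   <- 42.5
my_integer   <- 4L
my_logical   <- TRUE
my_character <- "some text"

class(my_numeric)    # "numeric"
class(my_integer)    # "integer"
class(4)             # "numeric" (no L suffix)
class(my_logical)    # "logical"
class(my_character)  # "character"
```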

Outline

✓ Introduction to R
❑ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists

Create a vector

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses at the casino.

In R, you create a vector with the combine function c(). You place the vector
elements separated by a comma between the parentheses. For example:
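The example from the slide is missing, so here is a small sketch with made-up names and values, one vector per data type:

```r
numeric_vector   <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
logical_vector   <- c(TRUE, FALSE, TRUE)
```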

Once you have created these vectors in R, you can use them to do calculations.

Create a vector

In this example, we will be looking at John’s casino winnings and losses for the
past week. Assume that John only played poker and roulette, since there was a
delegation of mediums that occupied the craps tables.

To be able to use this data in R, you decide to create the variables poker_vector
and roulette_vector.

Source: https://indiesunlimited.com/2013/01/10/roulette-or-poker/
Create a vector

For poker_vector:
Monday won $140, Tuesday lost $50, Wednesday won $20, Thursday lost $120,
Friday won $240

For roulette_vector:
Monday lost $24, Tuesday lost $50, Wednesday won $100, Thursday lost $350,
Friday won $10
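One way to encode these results, treating winnings as positive numbers and losses as negative numbers:

```r
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday
roulette_vector <- c(-24, -50, 100, -350, 10)
```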

Naming a vector

As a data analyst, it is important to have a clear view of the data that you are using. Understanding what each element refers to is therefore essential.

In the previous exercise, we created a vector with John’s winnings over the week. Each vector element refers to a day of the week, but it is hard to tell which element belongs to which day. Wouldn’t it be nice to show this in the vector itself?

Naming a vector

Elements of a vector can be given a name using the names() function, as shown:
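The slide's code is missing; a minimal sketch using the poker results from earlier might look like this:

```r
poker_vector <- c(140, -50, 20, -120, 240)

# Attach a day name to each element
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Printing the vector now shows the day above each value
poker_vector
```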

Naming a vector

You’ve already typed the days in a vector. Is it too much work to retype them for both poker and roulette? Don’t worry!

Just like you did with your poker and roulette returns, you can also create a
variable that contains the days of the week. This way you can use and re-use it.
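A sketch of this idea, assuming the variable is called days_vector:

```r
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Store the days once, then reuse them for both vectors
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector)    <- days_vector
names(roulette_vector) <- days_vector
```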

Calculating total winnings

Now that you have the poker and roulette winnings nicely as named vectors, you
can start doing some data analytical magic.
• What was John’s overall profit or loss per day of the week?
• Did John lose money over the week in total?
• Is John winning or losing money on poker, and on roulette?

It is important to know that if you sum two vectors in R, it takes the element-wise
sum. For example, the following three statements are completely equivalent:
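The three statements did not survive on the slide. The sketch below shows three equivalent forms with arbitrary numbers, and then applies the same idea to John's results.

```r
# Three equivalent ways of writing the same vector
c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)

# Applied to John's results
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

total_daily <- poker_vector + roulette_vector  # 116 -100 120 -470 250
total_week  <- sum(total_daily)                # -84, a loss over the week
```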

Selection by comparison

By making use of comparison operators, we can approach the previous question in a more proactive way.

The (logical) comparison operators known to R are:
< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other

As seen in the previous chapter, stating 6 > 5 returns TRUE. You can use these
comparison operators also on vectors. For example:
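The slide's example is missing. A sketch with arbitrary numbers, followed by one way to use the result to select John's winning poker days:

```r
# Compare every element of a vector with 5
c(4, 5, 6) > 5   # FALSE FALSE TRUE

# Which days did John make money on poker?
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

selection_vector <- poker_vector > 0

# Use the logical vector inside square brackets to keep only the TRUE elements
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days
```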

Selection by comparison

This command tests for every element of the vector if the condition stated by the
comparison operator is TRUE or FALSE.

Outline

✓ Introduction to R
✓ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists

What's a matrix?

A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since we are working with rows and columns, a matrix is called 2-dimensional.

You can construct a matrix in R with the matrix() function, as follows.
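Based on the arguments described in these slides (1:9, byrow, and nrow), the missing call was probably along these lines. The name my_matrix is an assumption.

```r
# Build a 3-by-3 matrix from the numbers 1 to 9, filled row by row
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
my_matrix
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9
```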

What's a matrix?

In the matrix() function:

• The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, 1:9 is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
• The argument byrow indicates that the matrix is filled by rows. If we want the matrix to be filled by columns, we just set byrow = FALSE.
• The third argument, nrow, indicates that the matrix should have three rows.

Analyze matrices

In the next slides, we will analyze the box office numbers of the Star Wars
franchise.

Source: https://en.wikipedia.org/wiki/List_of_Star_Wars_characters

In the editor, three vectors are defined. Each one represents the box office numbers of one of the first three Star Wars movies. The first element of each vector is the US box office revenue, and the second element is the non-US box office revenue (source: Wikipedia).
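The vectors themselves are not shown in the surviving slides. The sketch below uses illustrative figures in millions of US dollars; treat the numbers and the vector names as assumptions rather than the slide's own data.

```r
# Box office revenue (US, non-US), illustrative values in millions of USD
new_hope       <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi    <- c(309.306, 165.8)

# Combine the three vectors into one matrix with one row per movie
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi),
                           nrow = 3, byrow = TRUE)
star_wars_matrix
```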

Naming a matrix

To help you remember what is stored in star_wars_matrix, you would like to add
the names of the movies for the rows. Not only does this help you to read the
data, but it is also useful to select certain elements from the matrix.

Similar to vectors, you can add names for the rows and the columns of a matrix:
rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector

Take a look at the codes on the next slide.
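The code from that slide is missing. A sketch under the same assumptions as before (illustrative revenue figures, made-up helper names region and titles):

```r
# Illustrative box office figures in millions of USD (US, non-US)
star_wars_matrix <- matrix(c(460.998, 314.4,
                             290.475, 247.900,
                             309.306, 165.8),
                           nrow = 3, byrow = TRUE)

# Vectors holding the names to attach
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

colnames(star_wars_matrix) <- region
rownames(star_wars_matrix) <- titles

# Names can now be used to select elements
star_wars_matrix["A New Hope", "US"]
```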

Calculating the worldwide box office

The single most important thing for a movie in order to become an instant legend
in Tinseltown is its worldwide box office figures. To calculate the total box office
revenue for the three Star Wars movies, you have to take the sum of the US
revenue column and the non-US revenue column.

In R, the function rowSums() makes it easy to calculate the totals for each row of a
matrix. This function creates a new vector:
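A sketch of rowSums() on the same illustrative figures (assumed values, in millions of USD). The name worldwide_vector is an assumption.

```r
star_wars_matrix <- matrix(c(460.998, 314.4,
                             290.475, 247.900,
                             309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope",
                                             "The Empire Strikes Back",
                                             "Return of the Jedi"),
                                           c("US", "non-US")))

# Worldwide revenue per movie = US + non-US, computed row by row
worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector
```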

Adding a column for the worldwide box office

In the previous exercise you calculated the vector that contained the worldwide box office receipts for each of the three Star Wars movies. However, this vector is not yet part of star_wars_matrix.

You can add a column or multiple columns to a matrix with the cbind() function,
which merges matrices and/or vectors together by column. For example:
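The slide's example is missing. A sketch using the illustrative figures from before (assumed values and names):

```r
star_wars_matrix <- matrix(c(460.998, 314.4,
                             290.475, 247.900,
                             309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(NULL, c("US", "non-US")))
worldwide_vector <- rowSums(star_wars_matrix)

# Bind the new vector to the matrix as an extra column
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)
all_wars_matrix
```

The new column takes its name from the variable that was bound, here worldwide_vector.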

Adding a row

As with cbind(), there is also rbind().
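As a sketch, suppose a second matrix holds figures for three more movies. Both the name star_wars_matrix2 and all the numbers are illustrative assumptions. rbind() then stacks the two matrices row-wise:

```r
# Illustrative figures in millions of USD (US, non-US)
star_wars_matrix <- matrix(c(460.998, 314.4,
                             290.475, 247.900,
                             309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(NULL, c("US", "non-US")))
star_wars_matrix2 <- matrix(c(474.5, 552.5,
                              310.7, 338.7,
                              380.3, 468.5),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(NULL, c("US", "non-US")))

# Stack the second matrix underneath the first
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
dim(all_wars_matrix)  # 6 rows, 2 columns
```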

The total box office revenue for the entire saga

Just like rowSums(), there is also colSums().

As constructed previously, type all_wars_matrix to have another look.

Let's now calculate the total box office revenue for the entire saga.
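A sketch of the column totals, again with illustrative figures in millions of USD. The six rows and the name total_revenue_vector are assumptions.

```r
all_wars_matrix <- matrix(c(460.998, 314.4,
                            290.475, 247.900,
                            309.306, 165.8,
                            474.5,   552.5,
                            310.7,   338.7,
                            380.3,   468.5),
                          nrow = 6, byrow = TRUE,
                          dimnames = list(NULL, c("US", "non-US")))

# Total revenue per region, summed over all movies
total_revenue_vector <- colSums(all_wars_matrix)
total_revenue_vector
```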

A little arithmetic with matrices

Similar to what you have learned with vectors, the standard operators like +, -, /,
*, etc. work in an element-wise way on matrices in R.

For example, 2 * my_matrix multiplies each element of my_matrix by two.

Suppose you are the data analyst for the film producer, and it is your job to find out how many visitors went to each movie in each geographical area. You already have the total revenue figures in all_wars_matrix.

Now let’s assume that the price of a ticket was $5. Simply dividing the box office
numbers by this ticket price gives you the number of visitors.
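A sketch of this calculation with illustrative revenue figures in millions of USD (assumed values), so the result is in millions of visitors:

```r
all_wars_matrix <- matrix(c(460.998, 314.4,
                            290.475, 247.900,
                            309.306, 165.8),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(NULL, c("US", "non-US")))

# Element-wise division: every revenue figure is divided by the ticket price
ticket_price <- 5
visitors <- all_wars_matrix / ticket_price
visitors
```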

Outline

✓ Introduction to R
✓ Vectors
✓ Matrices
❑ Factors
❑ Data frames
❑ Lists

What's a factor and why would you use it?

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is:
• A categorical variable can belong to a limited number of categories,
• A continuous variable can correspond to an infinite number of values.

It is important that R knows whether it is dealing with a continuous or a categorical variable, as it treats the two differently.

A good example of a categorical variable is sex. In most circumstances, you can limit it to “Male” or “Female”; however, you can have more categories (again depending on the situation).

What's a factor and why would you use it?

To create factors in R, you make use of the function factor(). The first thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, sex_vector contains the sex of 5 different individuals:
sex_vector <- c("Male","Female","Female","Male","Male")

It is clear that there are two categories, or in R-terms 'factor levels', at work here:
"Male" and "Female".

What's a factor and why would you use it?

The function factor() will encode the vector as a factor:
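A minimal sketch using the sex_vector defined earlier. The name factor_sex_vector is an assumption.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Encode the character vector as a factor
factor_sex_vector <- factor(sex_vector)
factor_sex_vector

# By default the levels are sorted alphabetically
levels(factor_sex_vector)  # "Female" "Male"
```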

Summarizing a factor

You would like to know how many "Male" responses you have in your study, and how many "Female" responses.

The summary() function gives you the answer to this question.
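Applied to the sex_vector from before (the factor name is an assumption), this could look like:

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# Count the observations in each category
sex_counts <- summary(factor_sex_vector)
sex_counts
# Female   Male
#      2      3
```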

Ordered factors

speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor.

To create an ordered factor, you have to add two additional arguments: ordered
and levels.
factor(some_vector,
       ordered = TRUE,
       levels = c("lev1", "lev2" ...))

Let’s see the example on the next slide.

Ordered factors

By setting the argument ordered to TRUE in the function factor(), you indicate
that the factor is ordered. With the argument levels you give the values of the
factor in the correct order.
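The contents of speed_vector are not shown in the surviving slides. The sketch below assumes it holds the categories "slow", "medium", and "fast".

```r
# Hypothetical contents for speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

# Convert to an ordered factor, listing the levels from lowest to highest
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))
factor_speed_vector

# Ordered factors support comparisons between elements
factor_speed_vector[2] < factor_speed_vector[5]  # slow < fast: TRUE
```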

Outline

✓ Introduction to R
✓ Vectors
✓ Matrices
✓ Factors
❑ Data frames
❑ Lists

What's a data frame?

In the previous examples, you’ve seen that all the elements in a matrix are of the same type. However, when doing a market research survey, you often have questions of different types, such as:
• 'Are you married?' or 'yes/no' questions (logical)
• 'How old are you?' (numeric)
• 'What is your opinion on this product?' or other 'open-ended' questions
(character)

The output, namely the respondents' answers to the questions formulated above,
is a data set of different data types. You will often find yourself working with
data sets that contain different data types instead of only one.

Quick, have a look at your data set

Working with large data sets is not uncommon in data analysis. When you work
with (extremely) large data sets and data frames, you need to develop a clear
understanding of its structure and main elements.

So how do you do this in R?
• head() enables you to show the first observations of a data frame,
• tail() prints out the last observations in your data set.

Both head() and tail() print a top line called the 'header', which contains the
names of the different variables in your data set.
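As an illustration, the sketch below uses mtcars, a data frame that ships with R. Choosing this data set is an assumption; the slides do not say which one was used here.

```r
# First six rows (the default) of the built-in mtcars data frame
first_rows <- head(mtcars)
first_rows

# Last six rows
last_rows <- tail(mtcars)
last_rows

# Ask for a different number of rows with the n argument
head(mtcars, n = 3)
```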

Have a look at the structure

Another method that is often used to get a rapid overview of your data is the
function str(), which shows you the structure of your data set. For a data frame
it tells you:
• The total number of observations (e.g. 32 car types),
• The total number of variables (e.g. 11 car features),
• A full list of the variable names (e.g. mpg, cyl … ),
• The data type of each variable (e.g. num),
• The first observations.

Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It lets you learn more about your data before doing any analysis.
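The figures quoted above (32 observations, 11 variables, mpg, cyl) match the built-in mtcars data frame, so it is assumed here:

```r
# Show the structure of the built-in mtcars data frame:
# number of observations and variables, plus each variable's
# name, type, and first few values
str(mtcars)

# The same counts are available individually
nrow(mtcars)  # 32
ncol(mtcars)  # 11
```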

Creating a data frame

In the next slide, we will construct our own data frame, based on the planets in the solar system (source: Wikipedia). The main features of a planet are:
• The type of planet (Terrestrial or Gas Giant).
• The planet's diameter relative to the diameter of the Earth.
• The planet's rotation across the sun relative to that of the Earth.
• If the planet has rings or not (TRUE or FALSE).

Source: https://en.wikipedia.org/wiki/Solar_System
Creating a data frame

You construct a data frame with the data.frame() function. As arguments, you
pass the vectors from before: they will become the different columns of your data
frame. As every column has the same length, the vectors you pass should also
have the same length. They can contain different types of data.
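The vectors are not shown in the surviving slides, so the sketch below supplies them. The numeric values are approximate and illustrative, and the name planets_df is an assumption.

```r
# One vector per feature, all of length 8 (values are illustrative)
name     <- c("Mercury", "Venus", "Earth", "Mars",
              "Jupiter", "Saturn", "Uranus", "Neptune")
type     <- c("Terrestrial planet", "Terrestrial planet",
              "Terrestrial planet", "Terrestrial planet",
              "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Each vector becomes one column of the data frame
planets_df <- data.frame(name, type, diameter, rotation, rings)
str(planets_df)
```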

Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the
help of square brackets []. By using a comma, you can indicate what to select
from the rows and the columns respectively. As an example:
• my_df[1,2] selects the value at the first row and second column in my_df,
• my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df,
• my_df[1, ] selects all elements of the first row.
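The three selections above can be tried on a small data frame. The contents of my_df below are made up for illustration.

```r
my_df <- data.frame(name     = c("Mercury", "Venus", "Earth", "Mars"),
                    diameter = c(0.382, 0.949, 1, 0.532),
                    rotation = c(58.64, -243.02, 1, 1.03),
                    rings    = c(FALSE, FALSE, FALSE, FALSE))

single_value <- my_df[1, 2]       # first row, second column: 0.382
block        <- my_df[1:3, 2:4]   # rows 1 to 3, columns 2 to 4
first_row    <- my_df[1, ]        # every column of the first row
```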

Only planets with rings

You will often want to select an entire column, namely one specific variable from
a data frame. If you want to select all elements of the variable diameter, for
example, both of these will do the trick:
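The two statements are missing from the slide. Assuming diameter is the third column of a data frame called planets_df (a shortened, illustrative version is built here), they would be:

```r
planets_df <- data.frame(name     = c("Earth", "Mars", "Jupiter", "Saturn"),
                         type     = c("Terrestrial planet", "Terrestrial planet",
                                      "Gas giant", "Gas giant"),
                         diameter = c(1, 0.532, 11.209, 9.449),
                         rings    = c(FALSE, FALSE, TRUE, TRUE))

# Select the column by position ...
by_position <- planets_df[, 3]

# ... or by name
by_name <- planets_df[, "diameter"]
```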

Only planets with rings

However, there is a short-cut. If your columns have names, you can use the $ sign:
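A sketch of the $ shortcut on an illustrative planets_df, followed by one way to keep only the planets with rings:

```r
planets_df <- data.frame(name     = c("Earth", "Mars", "Jupiter", "Saturn"),
                         type     = c("Terrestrial planet", "Terrestrial planet",
                                      "Gas giant", "Gas giant"),
                         diameter = c(1, 0.532, 11.209, 9.449),
                         rings    = c(FALSE, FALSE, TRUE, TRUE))

# Select a whole column by name with $
diameters <- planets_df$diameter

# The logical rings column can be used to select rows:
# keep only the planets for which rings is TRUE
planets_with_rings <- planets_df[planets_df$rings, ]
planets_with_rings
```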

Sorting

Making and creating rankings is one of mankind's favorite affairs. These rankings
can be useful, such as best universities in the world or top movies in 2020.

In data analysis you can sort your data according to a certain variable in the data
set. In R, this is done with the help of the function order(), which is a function
that gives you the ranked position of each element when it is applied on a
variable, such as a vector for example:
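The example vector can be recovered from the explanation in these slides (100 is the first element of a and 10 is the second); the third element, 1000, is an assumption.

```r
a <- c(100, 10, 1000)

# order() returns the positions of the elements from smallest to largest
order(a)  # 2 1 3
```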

Sorting

10, which is the second element in a, is the smallest element, so 2 comes first in
the output of order(a). 100, which is the first element in a is the second smallest
element, so 1 comes second in the output of order(a).

This means we can use the output of order(a) to reshuffle a:
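A sketch of the reshuffling step, assuming the same vector a as in the explanation (with 1000 as an assumed third element):

```r
a <- c(100, 10, 1000)

# Use the positions returned by order() as an index to sort a
sorted_a <- a[order(a)]
sorted_a  # 10 100 1000

# For descending order, set decreasing = TRUE
a[order(a, decreasing = TRUE)]  # 1000 100 10
```

The same pattern sorts a data frame by one of its columns, for example my_df[order(my_df$some_column), ], where my_df and some_column are placeholder names.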

Outline

✓ Introduction to R
✓ Vectors
✓ Matrices
✓ Factors
✓ Data frames
❑ Lists

Lists, why would you need them?

Vectors (1-dimensional array): can hold numeric, character or logical values. The
elements in a vector all have the same data type.

Matrices (2-dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.

Data frames (2-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

Creating a list

To construct a list you use the function list():

my_list <- list(comp1, comp2 ...)

The arguments to the list function are the list components. These components can be matrices, vectors, data frames, or other lists.
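A small sketch with made-up components, using the built-in mtcars data frame for illustration:

```r
# Three components of different kinds
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df     <- mtcars[1:10, ]

# Gather them under one name
my_list <- list(my_vector, my_matrix, my_df)

length(my_list)  # 3 components
my_list[[1]]     # double square brackets extract a single component
```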

Creating a named list

Just like on your to-do list, you want to avoid not knowing or remembering what
the components of your list stand for.

That is why you should give names to them:

my_list <- list(name1 = your_comp1,
                name2 = your_comp2)

Creating a list

This creates a list with components that are named name1, name2, and so on. If you want to name the components of your list after you've created it, you can use the names() function as you did with vectors.
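Both approaches are sketched below. The component names vec, mat, numbers, and grid are arbitrary.

```r
# Name the components while creating the list
my_list <- list(vec = 1:10, mat = matrix(1:9, ncol = 3))
names(my_list)  # "vec" "mat"

# Named components can be selected with $
my_list$vec

# Or (re)name the components afterwards with names()
names(my_list) <- c("numbers", "grid")
my_list$numbers
```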

THANK YOU
