BUSINESS
Introduction to R
❑ Introduction to R
❑ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists
MONASH
BUSINESS
Introduction
You can’t be a master of data science from reading a book. The next few weeks
will give you a foundation in the important tools.
In a typical data science project, the model of tools looks something like this:
[Figure: model of the tools used in a typical data science project. Source: https://r4ds.had.co.nz/]
First, you need to have data. You can either use data stored in a file (typically
csv), database, or web application programming interface (API), and load it into a
data frame. If you don’t have data loaded in R, then you can’t work on it.
Next, after importing your data, it is a good idea to tidy it. Tidying means storing
the data in a consistent form that matches the meaning of the dataset with the way
it is stored. When your data is tidy, each column is a variable and each row is an
observation.
Tidy data is important because a consistent structure lets you focus on the
questions you have about the data, instead of worrying about whether the data is
in the correct format.
Once your data is tidy, you need to transform it. The transform process includes:
• Narrowing in on observations of interest (like all people in one city, or all data
from the last year),
• Creating new variables that are functions of existing variables (like computing
speed from distance and time),
• Calculating a set of summary statistics (like counts or means).
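The three transform steps can be sketched in base R. This is a minimal illustration only: the trips data frame, its column names, and its values are all made up for the example.

```r
# Hypothetical trip data; names and values are invented for illustration.
trips <- data.frame(
  city     = c("Melbourne", "Sydney", "Melbourne", "Perth"),
  distance = c(120, 80, 60, 200),  # kilometres
  time     = c(2, 1, 1.5, 4)       # hours
)

# 1. Narrow in on observations of interest (all trips in one city).
melbourne <- trips[trips$city == "Melbourne", ]

# 2. Create a new variable that is a function of existing variables.
melbourne$speed <- melbourne$distance / melbourne$time

# 3. Calculate summary statistics (a count and a mean).
n_trips    <- nrow(melbourne)
mean_speed <- mean(melbourne$speed)
print(c(n_trips, mean_speed))
```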
There are two main tools for generating knowledge: visualization and modelling.
Each complements the other.
Visualization
[Figure. Source: https://www.mydigitalfootprint.com/2014/05/my-digital-footprint-data-sorted.html]
Modelling
Models are complementary tools to visualization. Once you have made your
questions sufficiently precise, you can use a model to answer them. Models can
be fundamentally mathematical or computational. Remember that every model
makes assumptions, and a model cannot question its own assumptions. This means
that a model cannot fundamentally surprise you.
Communication
The last step involves communication, an absolutely critical part of any data
analysis project. It doesn’t matter how well your models and visualization have
led you to understand the data unless you can also communicate your results to
others.
All these tools have one thing in common: programming. Programming is a cross-
cutting tool that you use in every part of the project. You don’t need to be an
expert programmer to be a data scientist, but learning more about programming
pays off because becoming a better programmer allows you to automate common
tasks, and solve new problems with greater ease.
In this unit, we will be using R. You may not learn everything you need, but you
will surely learn all the fundamentals.
The tidyverse
After you have your data, you will need to install some R packages.
You can install the complete tidyverse with a single line of code:
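The single line is most likely the standard install.packages() call sketched below. It needs an internet connection and only has to be run once per R installation.

```r
# Downloads and installs the tidyverse and its dependencies from CRAN.
install.packages("tidyverse")
```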
You will not be able to use the functions, objects, and help files in a package until
you load it with library(). Once you have installed a package, you can load it
with the library() function:
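A minimal sketch of the loading step, assuming the tidyverse is already installed. The startup message depends on the installed version, and recent releases attach a few more packages than the six listed here.

```r
# Attach the core tidyverse packages for the current session.
library(tidyverse)
```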
This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and
dplyr packages. These are considered to be the core of the tidyverse because
you’ll use them in almost every analysis.
Other packages
For a start, we will use three data packages from outside the tidyverse:
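The package names did not survive on this slide. In R for Data Science, the book cited earlier, the three data packages are nycflights13, gapminder, and Lahman; the sketch below assumes the same three are meant here.

```r
# Assumed package names, taken from R for Data Science rather than from this slide.
install.packages(c("nycflights13", "gapminder", "Lahman"))
```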
These packages provide data on airline flights, world development, and baseball
that we’ll use to illustrate key data science ideas.
Introduction to R
Now that we have covered some of the basics, let’s see what R is.
How it works
R uses the # sign to add comments, so that you and others can understand what
the R code is about. This is not a hashtag: anything you type after # on a line is
a comment and will not be run as R code. Comments help you remember what
you typed and help others understand your code.
Output from R
Throughout the slides, text with a grey background is what you type in the
console, while the resulting output has a white background.
Arithmetic with R
In its most basic form, R can be used as a simple calculator. Consider the
following arithmetic operators:
• Addition: +
• Subtraction: -
• Multiplication: *
• Division: /
• Exponentiation: ^
• Modulo: %%
The ^ operator raises the number to its left to the power of the number to its
right: for example 3^2 is 9.
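A short sketch of each operator in use, with comments marked by #. The numbers and variable names are arbitrary examples.

```r
# Everything after a # sign is a comment and is not run.
sum_result  <- 5 + 5        # addition
diff_result <- 5 - 5        # subtraction
prod_result <- 3 * 5        # multiplication
quot_result <- (5 + 5) / 2  # division
pow_result  <- 3^2          # exponentiation: 3 to the power of 2
mod_result  <- 28 %% 6      # modulo: remainder of 28 divided by 6

print(c(sum_result, diff_result, prod_result, quot_result, pow_result, mod_result))
```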
Variable assignment
A variable allows you to store a value (e.g. 42) or an object (e.g. a function
description) in R. You can then later use this variable's name to easily access the
value or the object that is stored within this variable.
Suppose we need 5 apples and 6 oranges in our fruit basket. First assign the values
5 and 6 to two variables, one for the apples and one for the oranges. To get the
total number of pieces of fruit, add the two variables together and store the result
in a new variable, my_fruit.
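A minimal sketch of the fruit basket. The names my_oranges and my_fruit appear in these slides; my_apples is an assumed name chosen by analogy.

```r
my_apples  <- 5   # assumed variable name
my_oranges <- 6

# Add the two variables and store the total in a new variable.
my_fruit <- my_apples + my_oranges
print(my_fruit)
```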
Apples and oranges
Wait! Did you get an error? If you assign a text value to the variable my_oranges
and then try to add the apples and the oranges, you are asking R to store the sum
of a numeric and a character value in my_fruit. This is not possible.
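The failure can be reproduced safely with tryCatch(), which captures the error message instead of stopping the script. The name my_apples and the text value "six" are assumptions for illustration.

```r
my_apples  <- 5       # numeric (assumed name)
my_oranges <- "six"   # character: a text value, not a number

# Adding a numeric and a character value raises an error; capture its message.
result <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
print(result)
```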
Basic data types in R
R works with numerous data types. Some of the most basic types to get started
are:
• Decimal values like 4.5 are called numerics.
• Whole numbers like 4 are called integers. Integers are also numerics. (R stores
a plain 4 as a numeric; write 4L to get an integer.)
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
Note how the quotation marks in the editor indicate that "some text" is a string.
What's that data type?
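One way to check a value's type is the base R class() function. The variable names and values below are illustrative.

```r
my_numeric   <- 42.5
my_integer   <- 4L           # the L suffix makes this an integer
my_character <- "some text"
my_logical   <- FALSE

# class() reports the data type of each variable.
print(class(my_numeric))
print(class(my_integer))
print(class(my_character))
print(class(my_logical))
```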
Outline
✓ Introduction to R
❑ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists
Create a vector
Vectors are one-dimensional arrays that can hold numeric data, character data, or
logical data. In other words, a vector is a simple tool to store data. For example,
you can store your daily gains and losses at the casino.
In R, you create a vector with the combine function c(). You place the vector
elements separated by a comma between the parentheses. For example:
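A minimal sketch with one vector of each type. The variable names and values are illustrative.

```r
numeric_vector   <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
logical_vector   <- c(TRUE, FALSE, TRUE)

print(numeric_vector)
print(character_vector)
print(logical_vector)
```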
Once you have created these vectors in R, you can use them to do calculations.
In this example, we will be looking at John’s casino winnings and losses for the
past week. Assume that John only played poker and roulette, since there was a
delegation of mediums that occupied the craps tables.
To be able to use this data in R, you decide to create the variables poker_vector
and roulette_vector.
Source: https://indiesunlimited.com/2013/01/10/roulette-or-poker/
For poker_vector:
Monday won $140, Tuesday lost $50, Wednesday won $20, Thursday lost $120,
Friday won $240
For roulette_vector:
Monday lost $24, Tuesday lost $50, Wednesday won $100, Thursday lost $350,
Friday won $10
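The figures above translate into two numeric vectors, with winnings as positive numbers and losses as negative numbers.

```r
# Monday to Friday, in dollars; losses are negative.
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

print(poker_vector)
print(roulette_vector)
```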
Naming a vector
As a data analyst, it is important to have a clear view of the data that you are
using. Understanding what each element refers to is therefore essential.
In the previous exercise, we created a vector with John’s winnings over the week.
Each vector element refers to a day of the week, but it is hard to tell which
element belongs to which day. Wouldn’t it be nice to show the day in the vector
itself?
Elements of a vector can be given a name using the names() function, as shown:
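A minimal sketch using John's poker results from the earlier slide:

```r
poker_vector <- c(140, -50, 20, -120, 240)

# Attach a day name to each element.
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
print(poker_vector)
```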
You’ve already typed the days of the week once. Is it too much work to retype
them for both poker and roulette? Don’t worry!
Just like you did with your poker and roulette returns, you can also create a
variable that contains the days of the week. This way you can use and re-use it.
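A sketch of that reuse; days_vector is an assumed name for the variable holding the days.

```r
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Type the days once, then reuse them for both vectors.
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector)    <- days_vector
names(roulette_vector) <- days_vector

print(roulette_vector)
```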
Calculating total winnings
Now that you have the poker and roulette winnings nicely as named vectors, you
can start doing some data analytical magic.
• What was John’s overall profit or loss per day of the week?
• Did John lose money over the week in total?
• Is John winning or losing money on poker or on roulette?
It is important to know that if you sum two vectors in R, it takes the element-wise
sum. For example, the following three statements are completely equivalent:
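A sketch of three equivalent statements, followed by the same idea applied to John's results. The small example vectors and the variable names are illustrative.

```r
# Three ways to arrive at the same vector.
sum_1 <- c(1, 2, 3) + c(4, 5, 6)
sum_2 <- c(1 + 4, 2 + 5, 3 + 6)
sum_3 <- c(5, 7, 9)

# Element-wise sum of John's results gives the profit or loss per day.
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
total_daily <- poker_vector + roulette_vector

# sum() adds up all elements of a vector.
total_poker    <- sum(poker_vector)
total_roulette <- sum(roulette_vector)
total_week     <- total_poker + total_roulette
print(total_daily)
print(total_week)
```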
Selection by comparison
As seen in the previous chapter, stating 6 > 5 returns TRUE. You can use these
comparison operators also on vectors. For example:
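A minimal sketch with an arbitrary three-element vector:

```r
# Compare every element of the vector with 5.
comparison <- c(4, 5, 6) > 5
print(comparison)
```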
This command tests, for every element of the vector, whether the condition stated
by the comparison operator is TRUE or FALSE.
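The resulting logical vector can be placed inside square brackets to select only the elements for which the condition is TRUE. A sketch with John's poker results; selection_vector and poker_winning_days are assumed names.

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# TRUE for the days on which John made money at poker.
selection_vector <- poker_vector > 0

# Keep only the elements where selection_vector is TRUE.
poker_winning_days <- poker_vector[selection_vector]
print(poker_winning_days)
```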
Outline
✓ Introduction to R
✓ Vectors
❑ Matrices
❑ Factors
❑ Data frames
❑ Lists
What's a matrix?
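In R, a matrix is a two-dimensional collection of elements of the same type, arranged in rows and columns. A minimal sketch using the base R matrix() function; the values 1 to 9 are arbitrary.

```r
# Fill a 3-by-3 matrix row by row with the numbers 1 to 9.
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
print(my_matrix)
```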
Analyze matrices
In the next slides, we will analyze the box office numbers of the Star Wars
franchise.
Source: https://en.wikipedia.org/wiki/List_of_Star_Wars_characters
Three vectors are defined, one for each of the first three Star Wars movies. In
each vector, the first element is the US box office revenue and the second
element is the non-US box office revenue (source: Wikipedia).
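The vectors themselves did not survive on this slide. The sketch below uses illustrative figures (assumed to be in millions of US dollars) and assumed vector names to show how three such vectors combine into star_wars_matrix.

```r
# Illustrative revenue figures (US, non-US); not taken from the slides.
new_hope       <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi    <- c(309.306, 165.8)

# One row per movie, one column per region.
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi),
                           nrow = 3, byrow = TRUE)
print(star_wars_matrix)
```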
Naming a matrix
To help you remember what is stored in star_wars_matrix, you would like to add
the names of the movies for the rows. Not only does this help you to read the
data, but it is also useful to select certain elements from the matrix.
Similar to vectors, you can add names for the rows and the columns of a matrix:
rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector
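Applied to the Star Wars data, this could look as follows. The revenue figures, the region labels, and the names region and titles are illustrative assumptions.

```r
# Illustrative revenue figures; not taken from the slides.
star_wars_matrix <- matrix(c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
                           nrow = 3, byrow = TRUE)

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

colnames(star_wars_matrix) <- region
rownames(star_wars_matrix) <- titles
print(star_wars_matrix)
```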
Calculating the worldwide box office
The single most important thing for a movie in order to become an instant legend
in Tinseltown is its worldwide box office figures. To calculate the total box office
revenue for the three Star Wars movies, you have to take the sum of the US
revenue column and the non-US revenue column.
In R, the function rowSums() makes it easy to calculate the totals for each row of a
matrix. This function creates a new vector:
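A sketch of rowSums() on the Star Wars matrix, again with illustrative figures; worldwide_vector is an assumed name.

```r
# Illustrative revenue figures; not taken from the slides.
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US"))
)

# One total per row, i.e. per movie.
worldwide_vector <- rowSums(star_wars_matrix)
print(worldwide_vector)
```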
Adding a column for the worldwide box office
In the previous exercise you calculated the vector that contains the worldwide
box office receipts for each of the three Star Wars movies. However, this vector is
not yet part of star_wars_matrix.
You can add a column or multiple columns to a matrix with the cbind() function,
which merges matrices and/or vectors together by column. For example:
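A sketch of cbind() with illustrative figures; worldwide_vector and star_wars_total_matrix are assumed names.

```r
# Illustrative revenue figures; not taken from the slides.
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US"))
)
worldwide_vector <- rowSums(star_wars_matrix)

# Bind the totals to the matrix as a third column.
star_wars_total_matrix <- cbind(star_wars_matrix, worldwide_vector)
print(star_wars_total_matrix)
```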
Adding a row
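Rows are added with rbind(), the row-wise counterpart of cbind(). The sketch below stacks a second, hypothetical matrix of later movies under the first; all figures and the name star_wars_matrix2 are illustrative assumptions.

```r
# Illustrative revenue figures (US, non-US); not taken from the slides.
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US"))
)
star_wars_matrix2 <- matrix(
  c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  c("US", "non-US"))
)

# Stack the two matrices on top of each other.
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
print(all_wars_matrix)
```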
The total box office revenue for the entire saga
Just like rowSums(), there is also colSums(), which calculates the total for each
column of a matrix.
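A sketch of colSums() on the three-movie matrix, with illustrative figures; total_revenue_vector is an assumed name.

```r
# Illustrative revenue figures; not taken from the slides.
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US"))
)

# One total per column, i.e. per region.
total_revenue_vector <- colSums(star_wars_matrix)
print(total_revenue_vector)
```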
A little arithmetic with matrices
Similar to what you have learned with vectors, the standard operators like +, -, /,
*, etc. work in an element-wise way on matrices in R.
Suppose you are the data analyst for the film producer, and it is your job to find
out how many visitors went to each movie in each geographical area. You
already have the total revenue figures in all_wars_matrix.
Now let’s assume that the price of a ticket was $5. Simply dividing the box office
numbers by this ticket price gives you the number of visitors.
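A sketch of the division, using an illustrative all_wars_matrix that holds only three movies; visitors is an assumed name.

```r
# Illustrative revenue figures; not taken from the slides.
all_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US"))
)

# Dividing a matrix by a number divides every element by that number.
ticket_price <- 5
visitors <- all_wars_matrix / ticket_price
print(visitors)
```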
Outline
✓ Introduction to R
✓ Vectors
✓ Matrices
❑ Factors
❑ Data frames
❑ Lists
What's a factor and why would you use it?
The term factor refers to a statistical data type used to store categorical variables.
The difference between a categorical variable and a continuous variable is:
• A categorical variable can belong to a limited number of categories,
• A continuous variable can correspond to an infinite number of values.
To create factors in R, you use the function factor(). The first thing you have to
do is create a vector that contains all the observations that belong to a limited
number of categories. For example, sex_vector contains the sex of 5 different
individuals:
sex_vector <- c("Male","Female","Female","Male","Male")
It is clear that there are two categories, or in R-terms 'factor levels', at work here:
"Male" and "Female".
What's a factor and why would you use it?
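A minimal sketch of turning sex_vector into a factor; factor_sex_vector is an assumed name.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# factor() encodes the vector as a factor; levels are sorted alphabetically by default.
factor_sex_vector <- factor(sex_vector)
print(factor_sex_vector)
print(levels(factor_sex_vector))
```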
Summarizing a factor
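The base R summary() function gives a quick overview of a variable. For a factor it counts the observations in each level, which is more informative than the summary of the plain character vector. A sketch with sex_vector; factor_sex_vector is an assumed name.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# Summary of the character vector: length, class, and mode only.
print(summary(sex_vector))

# Summary of the factor: number of observations per level.
level_counts <- summary(factor_sex_vector)
print(level_counts)
```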
Ordered factors
To create an ordered factor, you have to add two additional arguments: ordered
and levels.
factor(some_vector,
ordered = TRUE,
levels = c("lev1", "lev2" ...))
By setting the argument ordered to TRUE in the function factor(), you indicate
that the factor is ordered. With the argument levels you give the values of the
factor in the correct order.
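A sketch of an ordered factor. The speed ratings, the level names, and the variable names are hypothetical.

```r
# Hypothetical speed ratings of five analysts.
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))
print(factor_speed_vector)

# Ordered factors can be compared: is the first analyst faster than the second?
first_is_faster <- factor_speed_vector[1] > factor_speed_vector[2]
print(first_is_faster)
```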
Outline
✓ Introduction to R
✓ Vectors
✓ Matrices
✓ Factors
❑ Data frames
❑ Lists
What's a data frame?
In the previous examples, all the elements in a matrix were of the same type.
However, a market research survey usually contains questions whose answers
have different types, such as:
• 'Are you married?' or 'yes/no' questions (logical)
• 'How old are you?' (numeric)
• 'What is your opinion on this product?' or other 'open-ended' questions
(character)
The output, namely the respondents' answers to the questions formulated above,
is a data set of different data types. You will often find yourself working with
data sets that contain different data types instead of only one.
Quick, have a look at your data set
Working with large data sets is not uncommon in data analysis. When you work
with (extremely) large data sets and data frames, you need to develop a clear
understanding of their structure and main elements.
So how do you do this in R?
• head() enables you to show the first observations of a data frame,
• tail() prints out the last observations in your data set.
Both head() and tail() print a top line called the 'header', which contains the
names of the different variables in your data set.
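A sketch using mtcars, a data set that ships with R; it is assumed here as the example because it has 32 car types and 11 variables.

```r
# First six observations, with the header line of variable names on top.
first_rows <- head(mtcars)
print(first_rows)

# Last six observations.
last_rows <- tail(mtcars)
print(last_rows)
```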
Have a look at the structure
Another method that is often used to get a rapid overview of your data is the
function str(), which shows you the structure of your data set. For a data frame
it tells you:
• The total number of observations (e.g. 32 car types),
• The total number of variables (e.g. 11 car features),
• A full list of the variable names (e.g. mpg, cyl … ),
• The data type of each variable (e.g. num),
• The first observations.
Applying the str() function will often be the first thing that you do when
receiving a new data set or data frame. It lets you learn more about your data
before doing any analysis.
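A sketch using the built-in mtcars data set, which is assumed here because it matches the 32 car types and 11 features mentioned above.

```r
# Compactly display the structure: observations, variables, types, first values.
str(mtcars)

n_observations <- nrow(mtcars)
n_variables    <- ncol(mtcars)
```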
Creating a data frame
In the next slide, we will construct our own data frame, based on the planets in
the solar system (source: Wikipedia). The main features of a planet are:
• The type of planet (Terrestrial or Gas Giant).
• The planet's diameter relative to the diameter of the Earth.
• The planet's rotation across the sun relative to that of the Earth.
• If the planet has rings or not (TRUE or FALSE).
Source: https://en.wikipedia.org/wiki/Solar_System
You construct a data frame with the data.frame() function. As arguments, you
pass the vectors from before: they will become the different columns of your data
frame. As every column has the same length, the vectors you pass should also
have the same length. They can contain different types of data.
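A sketch of the construction. The vectors are not shown on the slides, so the vector names and the rounded values below are illustrative assumptions.

```r
# Illustrative, rounded values relative to Earth; not taken from the slides.
name     <- c("Mercury", "Venus", "Earth", "Mars",
              "Jupiter", "Saturn", "Uranus", "Neptune")
type     <- c("Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial",
              "Gas Giant", "Gas Giant", "Gas Giant", "Gas Giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Each vector becomes one column; all vectors have length 8.
planets_df <- data.frame(name, type, diameter, rotation, rings)
print(planets_df)
```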
Selection of data frame elements
Similar to vectors and matrices, you select elements from a data frame with the
help of square brackets []. By using a comma, you can indicate what to select
from the rows and the columns respectively. As an example:
• my_df[1,2] selects the value at the first row and second column in my_df,
• my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df,
• my_df[1, ] selects all elements of the first row.
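The three selections can be tried on a small data frame. The four-planet my_df below is a hypothetical example with illustrative values.

```r
# Hypothetical four-planet data frame with illustrative values.
my_df <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Saturn"),
  type     = c("Terrestrial", "Terrestrial", "Gas Giant", "Gas Giant"),
  diameter = c(0.382, 1, 11.209, 9.449),
  rings    = c(FALSE, FALSE, TRUE, TRUE)
)

single_value <- my_df[1, 2]      # first row, second column
sub_block    <- my_df[1:3, 2:4]  # rows 1 to 3, columns 2 to 4
first_row    <- my_df[1, ]       # all elements of the first row

print(single_value)
print(sub_block)
print(first_row)
```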
Only planets with rings
You will often want to select an entire column, namely one specific variable from
a data frame. If you want to select all elements of the variable diameter, for
example, both of these will do the trick:
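A sketch of the two equivalent selections, one by column position and one by column name. The small planets_df is a hypothetical example with illustrative values, in which diameter is the third column.

```r
# Hypothetical four-planet data frame with illustrative values.
planets_df <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Saturn"),
  type     = c("Terrestrial", "Terrestrial", "Gas Giant", "Gas Giant"),
  diameter = c(0.382, 1, 11.209, 9.449),
  rings    = c(FALSE, FALSE, TRUE, TRUE)
)

by_position <- planets_df[, 3]           # third column
by_name     <- planets_df[, "diameter"]  # column called diameter
print(by_name)
```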
However, there is a short-cut. If your columns have names, you can use the $ sign:
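A sketch of the $ short-cut, followed by using the rings column to keep only the planets with rings. The small planets_df and the name rings_vector are illustrative assumptions.

```r
# Hypothetical four-planet data frame with illustrative values.
planets_df <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Saturn"),
  type     = c("Terrestrial", "Terrestrial", "Gas Giant", "Gas Giant"),
  diameter = c(0.382, 1, 11.209, 9.449),
  rings    = c(FALSE, FALSE, TRUE, TRUE)
)

# Select a whole column by name with the $ sign.
diameters <- planets_df$diameter

# A logical column can be used to select rows: only planets with rings.
rings_vector <- planets_df$rings
planets_with_rings <- planets_df[rings_vector, ]
print(planets_with_rings)
```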
Sorting
Making rankings is one of mankind's favorite pastimes. Rankings can be useful,
such as the best universities in the world or the top movies of 2020.
In data analysis you can sort your data according to a certain variable in the data
set. In R, this is done with the help of the function order(). When applied to a
variable such as a vector, order() returns the positions of the elements in sorted
order, smallest first. For example:
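A sketch with a three-element vector a whose values are chosen to be consistent with the explanation that follows (10 is its second element and 100 its first); the third value is an assumption.

```r
a <- c(100, 10, 1000)

# Positions of the elements of a, from smallest to largest.
positions <- order(a)
print(positions)

# Using those positions as an index returns the sorted vector.
a_sorted <- a[positions]
print(a_sorted)
```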
10, which is the second element in a, is the smallest element, so 2 comes first in
the output of order(a). 100, which is the first element in a, is the second smallest
element, so 1 comes second in the output of order(a).
Outline
✓ Introduction to R
✓ Vectors
✓ Matrices
✓ Factors
✓ Data frames
❑ Lists
Lists, why would you need them?
Vectors (1-dimensional array): can hold numeric, character or logical values. The
elements in a vector all have the same data type.
Creating a list
The arguments to the list() function are the list components. These components
can be vectors, matrices, data frames, or other lists.
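A minimal sketch of a list with three components of different kinds. The component values and names are illustrative, and mtcars is a data set that ships with R.

```r
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df     <- mtcars[1:10, ]

# Each argument becomes one component of the list.
my_list <- list(my_vector, my_matrix, my_df)
str(my_list, max.level = 1)
```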
Creating a named list
Just like on your to-do list, you want to avoid not knowing or remembering what
the components of your list stand for.
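Components can be named when the list is created, by writing each argument as name = value. A sketch with placeholder names and illustrative values:

```r
# name1 and name2 are placeholder component names.
my_list <- list(name1 = 1:10,
                name2 = matrix(1:9, ncol = 3))
print(names(my_list))
```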
This creates a list with components that are named name1, name2, and so on. If
you want to name your lists after you've created them, you can use the names()
function as you did with vectors.
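A sketch of naming the components after the list has been created; the component names vec and mat are illustrative.

```r
my_list <- list(1:10, matrix(1:9, ncol = 3))

# Assign names to the existing components.
names(my_list) <- c("vec", "mat")
print(names(my_list))
```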
THANK YOU