
MDS5103/ MSBA5104 Segment 03

CONTROL FLOW AND USER-DEFINED FUNCTIONS IN R - TUTORIAL

Table of Contents
1. Logical Operators and Logical Expressions
1.1 Relational Operator
1.2 Logical Operators
1.2.1 Logical AND
1.2.2 Logical OR
1.2.3 Logical NOT
1.2.4 Long-Form Logical AND
1.2.5 Long-Form Logical OR
1.2.6 Value Matching Using %in%
1.2.7 Combining Operators
1.3 Special Functions
1.3.1 The ‘all()’ Function
1.3.2 The ‘any()’ Function
1.3.3 The ‘isTRUE()’ and ‘isFALSE()’ Functions
1.3.4 The ‘all.equal()’ Function
2. If Statement
2.1 Simple If
2.2 The If-Else Statement
2.3 Multiple If Statements
2.4 Vectorised If Statement
3. For Statement
4. While-Statement

©COPYRIGHT 2022 (Ver. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION 2/20

Introduction
Until now, code has been executed sequentially; that is, statements run in the order in which
they are written, with no conditional or repeated execution. Controlling the order in which
statements are executed, or whether they are executed at all, is called ‘Control Flow’.


1. Logical Operators and Logical Expressions


Logical expressions are at the heart of control flow. A logical expression is one that
evaluates to TRUE or FALSE. For example, the expression 5 > 6 evaluates to FALSE. Logical
expressions are built using operators, which can be of two kinds:
• Relational Operators: In the example 5 > 6, the greater-than symbol is a relational
operator.
• Logical Operators: Consider the question, “Does Kilimanjaro have an altitude higher than
Mount Everest ‘AND’ is the Indian Ocean deeper than the Pacific Ocean?” The answer
is again either TRUE or FALSE. The ‘AND’ that connects the two statements is
known as a logical operator.

1.1 Relational Operator


The simplest relational operator is the greater-than operator.

x=4
x>5

Output
[1] FALSE

In the code above, x > 5 is the logical expression and the greater-than symbol (>) is the
relational operator. Because 4 is not greater than 5, the output is the logical value FALSE.
Note that R writes logical values in capital letters. The less-than operator (<) is used in
the same way; x < 5 gives the output TRUE, again in capital letters.
To check the equality between two entities, the == operator is used. Note that this has two
‘equal to’ symbols.

x=4
x == 5

Output
[1] FALSE


In the above code, ‘=’ is the assignment operator and ‘==’ is the relational operator.
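
The remaining relational operators follow the same pattern. The lines below are a minimal
sketch that reuses x = 4; the particular comparisons are chosen only for illustration.

```r
x = 4

x > 5    # FALSE: 4 is not greater than 5
x < 5    # TRUE
x >= 4   # TRUE: greater than or equal to
x <= 3   # FALSE: less than or equal to
x == 4   # TRUE: equality test
x != 4   # FALSE: inequality test
```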

1.2 Logical Operators


Logical operators express AND, OR and NOT. In R, they are written with the symbols ‘&’, ‘|’ and ‘!’.

1.2.1 Logical AND


Consider two logical vectors, l1 and l2, as shown below.

# Logical expressions- 2: logical operators


l1 = c(TRUE, FALSE, FALSE, TRUE)
l2 = c(TRUE, TRUE, FALSE, FALSE)

# Logical AND
l1 & l2

Output
[1] TRUE FALSE FALSE FALSE

In the code above, l1 and l2 are vectors of logical values. Note that TRUE and FALSE are
written entirely in capital letters. The logical AND is denoted by the ampersand (&) symbol
and is placed between the two vectors, as in l1 & l2. It compares the vectors element by
element. The first elements of both vectors are TRUE, so the first element of the output is
TRUE. One of the second elements is FALSE, so the second element of the output is FALSE, and
so on. This is a vectorised operation: the operator is applied to each pair of corresponding
elements, and the output has one TRUE or FALSE value per pair.

1.2.2 Logical OR
The logical OR operator uses the ‘pipe symbol’ ( | ). This operator evaluates to TRUE if at
least one of its operands is TRUE. The example below first checks whether at least one of
the first elements of l1 and l2 is TRUE. Both are TRUE, so the first element of the output is
TRUE. It then proceeds to the next pair of elements. For the third pair, both elements are
FALSE, so the third element of the output is FALSE. This is again a vectorised operation,
meaning it applies to each pair of elements of the vectors l1 and l2.

l1 | l2

Output
[1] TRUE TRUE FALSE TRUE

1.2.3 Logical NOT


Logical NOT is represented by an exclamation mark.

!(l1)

Output
[1] FALSE TRUE TRUE FALSE

Logical NOT flips the truth value of each element. In the code above, ‘!(l1)’ turns TRUE into
FALSE and FALSE into TRUE for the elements of l1. The logical NOT operator is used often,
for example, to check that a name is not in a list.

1.2.4 Long-Form Logical AND


The long-form logical AND uses two ampersand symbols (&&) instead of one.

l1 && l2

Output
Warning in l1 && l2 : 'length(x) = 4 > 1' in coercion to 'logical(1)'
Warning in l1 && l2 : 'length(x) = 4 > 1' in coercion to 'logical(1)'
[1] TRUE

Note that there is output along with a warning. The warning indicates that the length of the
input supplied is greater than one. The long form is meant to be used with vectors of length
one only. In the output above, the long-form logical AND compares only the first elements of
l1 and l2 and ignores the rest, so changes to any other elements have no impact on the
output. If either of the first elements were FALSE, the output would be FALSE. Note that this
behaviour depends on the R version: the warning shown above is produced by R 4.2, versions
before R 4.2.0 give the same result without a warning, and from R 4.3.0 onwards, supplying a
vector of length greater than one to && is an error.
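
The intended use of the long form, with single values on both sides, can be sketched as
follows. The variable x and the message passed to stop() are illustrative assumptions, not
part of the example above.

```r
x = 4

# Both operands are single values, so the long form is appropriate.
(x > 0) && (x %% 2 == 0)         # TRUE

# Short-circuiting: the right-hand side is evaluated only if it is needed.
# Here the left-hand side is FALSE, so stop() is never called.
(x < 0) && stop("not reached")   # FALSE

# To reduce whole vectors to a single value, combine & with all() or any().
l1 = c(TRUE, FALSE, FALSE, TRUE)
l2 = c(TRUE, TRUE, FALSE, FALSE)
any(l1 & l2)                     # TRUE
```

Because && stops as soon as the result is known, it is the usual choice inside the condition
of an if statement.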

1.2.5 Long-Form Logical OR


Long-form logical OR is represented using ‘double pipe’ ( || ).

# Logical expressions-2: logical operators


l1 = c(FALSE, TRUE, FALSE, TRUE)
l2 = c(FALSE, TRUE, FALSE, FALSE)

l1 || l2

Output
Warning in l1 || l2 : 'length(x) = 4 > 1' in coercion to 'logical(1)'
Warning in l1 || l2 : 'length(x) = 4 > 1' in coercion to 'logical(1)'
[1] FALSE

As with the long-form AND operation, only the first elements are considered, which is why
there is a warning along with the output (and an error from R 4.3.0 onwards).
The short form is typically used when dealing with vectors, and the long form is used when
each operand is a single value, as in the condition of an if statement.

1.2.6 Value Matching Using %in%


Value matching is used to check whether a value, such as a string, is present in a vector.
This is done using a special operator, %in%.

# Logical expressions-3: value matching
names = c('Ajith', 'Priya', 'Gabriel')
'Ajith' %in% names

Output
[1] TRUE

In the above code, the vector ‘names’ has 'Ajith', 'Priya', and 'Gabriel' as its elements. The
%in% operator checks whether the string on its left matches any element of the vector on its
right. Since the string 'Ajith' matches one of the elements, the output is TRUE. If the
spelling or the case is changed, the output would be FALSE. The entire statement,
'Ajith' %in% names, is a logical expression. This is applicable not only to vectors of
strings, but also to other kinds of objects.

1.2.7 Combining Operators


The logical operators can be combined or used in combination with one another. For
example, the ‘Logical Not’ with ‘value matching’ is shown below.

!('Ajith' %in% names)

Output
[1] FALSE

This will check if ‘Ajith’ is not present in the vector called ‘names’.
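
Relational operators, logical operators and value matching can all appear in one expression.
The sketch below assumes a hypothetical vector of ages alongside the names used above; the
values are made up for illustration.

```r
participants = c('Ajith', 'Priya', 'Gabriel')   # names from the example above
ages = c(34, 28, 41)                            # hypothetical ages, for illustration only

# Element-wise: older than 30 AND not named 'Priya'
selected = (ages > 30) & !(participants %in% c('Priya'))
selected                 # TRUE FALSE TRUE
participants[selected]   # "Ajith" "Gabriel"
```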

1.3 Special Functions


R provides special functions that summarise a logical vector as a single TRUE or FALSE value.

1.3.1 The ‘all()’ Function


Very often, there is a need to check whether all the elements of a vector have the Boolean
value ‘TRUE’. These elements could have been generated as a result of another operation. For
example, asking “Is the age of every individual greater than 30?” gives a Boolean value, TRUE
or FALSE, for each individual. The function ‘all()’ can be applied to this vector to check
whether every value in it is TRUE. The code below demonstrates the ‘all()’ function on the
vector l1.

all(l1)

Output
[1] FALSE

Since not all the elements of l1 are TRUE, the output is FALSE.


1.3.2 The ‘any()’ Function


The ‘any()’ function checks if at least one of the elements of the vector has a Boolean value
of TRUE. It answers questions such as “Is any element of l1 TRUE?” or “Among the four
individuals, is anybody older than 30?”

any(l1)

Output
[1] TRUE

Since l1 contains at least one TRUE value, the output of any() is TRUE.
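
The age question mentioned above can be sketched as follows. The ages are hypothetical values
chosen for illustration.

```r
ages = c(25, 42, 31, 29)   # hypothetical ages of four individuals

older = ages > 30          # FALSE TRUE TRUE FALSE
all(older)                 # FALSE: not everyone is older than 30
any(older)                 # TRUE: at least one person is older than 30
sum(older)                 # 2: TRUE counts as 1 and FALSE as 0
```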

1.3.3 The ‘isTRUE()’ and ‘isFALSE()’ Functions


In real-life situations, vectors often contain missing values, represented as ‘NA’. In such
cases, any() and all() may not give a definite answer. In the code shown below, one of the
elements of the vector l1 is ‘NA’. Applying the ‘all()’ function to it results in the output NA.

l1 = c(TRUE, TRUE, NA, TRUE)
all(l1)

Output
[1] NA

Because the missing value could be either TRUE or FALSE, R cannot decide whether all the
elements are TRUE, so it returns 'NA'. In such situations, the functions ‘isTRUE()’ and
‘isFALSE()’ are useful. ‘isTRUE(x)’ returns TRUE only if x is a single logical value that is
TRUE and not ‘NA’; for anything else, including ‘NA’ and vectors with more than one element,
it returns FALSE and never ‘NA’. Similarly, ‘isFALSE(x)’ returns TRUE only if x is a single
FALSE value. They are therefore typically applied to the result of all() or any() rather than
to the vector itself. Here, all(l1) is NA, so isTRUE() returns FALSE.

isTRUE(all(l1))

Output
[1] FALSE
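
The different ways of handling a missing value can be compared side by side. This is a
minimal sketch on the same vector; na.rm and anyNA() are standard base R features that the
text above does not cover.

```r
l1 = c(TRUE, TRUE, NA, TRUE)

all(l1)                  # NA: the missing value could be TRUE or FALSE
any(l1)                  # TRUE: one TRUE is enough, whatever the NA is
all(l1, na.rm = TRUE)    # TRUE: the NA is dropped before checking
isTRUE(all(l1))          # FALSE: NA is not a definite TRUE
anyNA(l1)                # TRUE: the vector contains a missing value
```

Whether to drop missing values with na.rm = TRUE or to treat them as "not TRUE" with isTRUE()
depends on what a missing value means in the data at hand.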


1.3.4 The ‘all.equal()’ Function

Consider an example in which x is a vector containing the values from 1 to 4. Take the square
root of x, that is, raise x to the power 1/2, and then square the result. Mathematically, this
gives back x. Though the print function does not display any difference, internally there are
round-off errors when the square root is taken and then squared. The expression (x^(1/2))^2,
which is the square root of x raised to the power two, is therefore not always exactly equal
to x. Note the parentheses around 1/2: without them, x^1/2 means (x^1)/2.

x = c(1:4)
y = (x^(1/2))^2
x == y

Output
[1] TRUE FALSE FALSE TRUE

In the code above, the check x == y results in FALSE for the elements 2 and 3. Square root is
a floating-point operation. Because of round-off errors and the limited precision of the
computer, the square root, when squared, does not always give back the same value. The
function ‘all.equal()’ can be used in such cases.

all.equal(x, y)

Output
[1] TRUE

In the above code, x is technically not equal to y because of the finite precision arithmetic,
but the two are very close to each other. The function ‘all.equal()’ checks whether two
quantities are equal within a tolerance. The tolerance can be supplied as an additional
argument; by default, it is about 1.5e-8.
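
When the values differ, all.equal() does not return FALSE; it returns a message describing the
difference. The sketch below shows the usual way of using it inside a condition, together with
a manual tolerance check. The threshold 1e-8 is an illustrative assumption.

```r
x = c(1:4)
y = sqrt(x)^2                 # square root followed by squaring

x == y                        # exact comparison; can fail because of round-off
abs(x - y) < 1e-8             # manual tolerance check, element by element
isTRUE(all.equal(x, y))       # TRUE: equal within the default tolerance

# all.equal() returns a message, not FALSE, when the values differ,
# so wrap it in isTRUE() before using it in a condition.
all.equal(1, 1.1)             # a character string describing the difference
isTRUE(all.equal(1, 1.1))     # FALSE
```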


2. If Statement

2.1 Simple If
Conditional execution of statements is done using the ‘if’ statement. The basic syntax is
given below:

Syntax
if (logical expression) {
  # Statements to execute if the logical expression evaluates to TRUE.
}

We start with the if keyword, written in lowercase, followed by the logical expression
enclosed within parentheses. The body of the ‘if statement’ contains the set of statements to
execute if that logical condition is true. The body of the if statement starts and ends with
curly braces.

# if statement
x=4
if (x %% 2 == 0 ) { print ('even') }

Output
[1] "even"

The code above simply prints the string “even” as 4 modulo 2 is 0 and the logical expression
is TRUE. If the code is changed as shown below, no output is generated.

# if statement
x=4
if (x %% 2 != 0 ) { print ('odd') }

Output
(no output)

This is because the logical expression evaluates to FALSE, so the statements within the body
of the if statement are not executed. The statements can be written on the same line as the
condition or on the following lines. When the body contains more than one statement, the
statements must be enclosed within curly braces. It is recommended to always use braces and
to write the body on separate, indented lines so that the conditional statements are
distinguishable.

# if statement
x=4
if (x %% 2 != 0 ) {
  print ('odd')
}

2.2 The If-Else Statement


If-Else is a natural extension of the If statement. Along with the statements to be executed
when the expression evaluates to TRUE, an alternate set of statements can be provided to be
executed when the expression evaluates to FALSE.

Syntax
if (logical expression) {
  # Statements to execute if the logical expression evaluates to TRUE.
} else {
  # Statements to execute if the logical expression evaluates to FALSE.
}

A sample code to demonstrate the if-else statement is shown below:

# if-else statement
x=4
if (x %% 2 != 0 ) {
  print ('odd')
} else {
  print ('even')
}

Output
[1] "even"

In the above code, the logical condition evaluates to false. In this case, the statements within
the else part are executed. Hence, the output “even” is displayed.

2.3 Multiple If Statements


Several conditions can be checked one after another by chaining if statements with ‘else if’.
For example, in the code below, the first if statement checks if the number x is greater than
zero. If the condition evaluates to FALSE, the next condition, x less than zero, is checked.
If this also evaluates to FALSE, the statements within the final else are executed.

# if statement
x=0
if ( x > 0 ) {
  print ('positive')
} else if ( x < 0 ) {
  print ('negative')
} else {
  print ('x is zero')
}

Output
[1] "x is zero"

In the above output, the statement in the final ‘else’ is executed. Hence, ‘x is zero’ is
displayed.

In the codes above, simple use cases have been used to demonstrate the concepts.
However, for more complicated examples, the structure will remain the same.
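
The same chain of conditions can be wrapped in a user-defined function so that it can be
reused for different values. This is a minimal sketch; the function name classify_sign is a
hypothetical choice made for illustration.

```r
# classify_sign is a hypothetical helper name, not a built-in R function.
classify_sign = function(x) {
  if (x > 0) {
    return('positive')
  } else if (x < 0) {
    return('negative')
  } else {
    return('zero')
  }
}

classify_sign(5)    # "positive"
classify_sign(-2)   # "negative"
classify_sign(0)    # "zero"
```

Note that the condition of an if statement expects a single TRUE or FALSE, so this function is
meant to be called with one number at a time.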


2.4 Vectorised If Statement


The term "vectorised" refers to operating on all the elements of a vector-like object at once,
rather than one element at a time, which also speeds up computation. As an example, a vector
is created below.

x = c(1:10)
print(x)

Output
[1] 1 2 3 4 5 6 7 8 9 10

When x is printed, a vector whose elements are from 1 to 10 is obtained. To find out whether
each element is odd or even, we could go through the vector ‘x’ and check one element at a
time. However, there is no need to do this sequentially, because whether one element is even
has no relation to whether another element is even. In other words, the check can be applied
to all the elements at once in a vectorised fashion. The ‘ifelse()’ function can be used in
such cases.

# Vectorized if-statement
x = c(1:10)
print(x)
ifelse(x %% 2 == 0, 'even', 'odd')

Output
[1] 1 2 3 4 5 6 7 8 9 10
[1] "odd" "even" "odd" "even" "odd" "even"
[7] "odd" "even" "odd" "even"

The ifelse() function has three arguments:


• The first argument is the logical condition to be evaluated. For example, x %% 2 == 0.
• The second argument is the value to return where the condition evaluates to TRUE.
• The third argument is the value to return where the condition evaluates to FALSE.

In this case, the condition is applied to every element of ‘x’ in a single call, which is
called a vectorised operation. The ‘ifelse()’ function is used to check a condition on
multiple elements of a vector or vector-like object.
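
Calls to ifelse() can be nested to handle more than two cases, much like a chain of if and
else statements. The sketch below classifies each element of a vector by its sign; the values
in x are illustrative.

```r
x = c(-2, 0, 3, -7, 5)   # illustrative values

# The inner ifelse() handles the elements that are not positive.
signs = ifelse(x > 0, 'positive', ifelse(x < 0, 'negative', 'zero'))
signs
# "negative" "zero" "positive" "negative" "positive"
```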

3. For Statement
To repeat an operation a certain number of times, the for-statement can be used. For example,
it can be used to print a quantity ‘x’ ten times.

Syntax
for (value in sequence) {
  # Set of statements to be executed
}

The for-statement starts with the keyword ‘for’, followed by an expression in parentheses.
This determines the number of times a particular operation must be performed. For example,
‘for (val in x)’ means that the body is executed once for each value of the vector ‘x’. The
code below is a first attempt at using ‘for’ to calculate the squares of the elements of x
from 1 to 10. Please note that the keyword ‘in’ is not enclosed within the symbol ‘%’.

# For-statement
x = c(1:10)
for (val in x){
  y= x^2
  print(y)
}

Output
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100
[1] 1 4 9 16 25 36 49 64 81 100


In the above output, squares of ‘x’ are printed multiple times. This is because x^2 operates
on the entire vector. The loop variable ‘val’ takes values from 1 to 10, and for each value, the
square of the entire vector x is calculated and displayed.
The code to display squares of individual elements is given below. In the first version, the
loop variable ‘i’ takes the values 1, 2, 3 and so on, up to the length of the vector x. We
access each element as x[i] and store its square in y[i]; the vector y is created before the
loop so that its elements can be assigned. In the second version, the loop variable ‘val’
directly takes on the values of the elements of the vector x, whose squares are then printed.

# For-statement
x = c(1:10)
# version-1
y = numeric(length(x))
for (i in 1:length(x)){
  y[i]= x[i]^2
  print(y[i])
}

# version-2
for (val in x){
  print(val^2)
}

Output (the same for each version)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100
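
For a simple element-wise calculation such as squaring, the loop can be compared with the
vectorised expression x^2. This is a minimal sketch; seq_along(x) is a base R function that
generates the indices 1, 2, and so on, up to the length of x.

```r
x = c(1:10)

# Loop version: allocate the result vector first, then fill it in.
y = numeric(length(x))
for (i in seq_along(x)) {
  y[i] = x[i]^2
}

# Vectorised version: one expression, no explicit loop.
z = x^2

identical(y, z)   # TRUE
```

Both versions give the same result; the vectorised form is shorter, while the loop form is
useful when each step involves more than a single expression.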


As shown in version-2 above, a loop variable can traverse over a vector of names as well.
An example is given below. The vector contains three strings. We can display the vector
using a print statement.

names = c('Ajith', 'Priya', 'Gabriel')
print(names)

Output
[1] "Ajith" "Priya" "Gabriel"

However, to access the names individually, we use the ‘for’ statement.

# For-statement
participants = c('Ajith','Priya','Gabriel')
for (name in participants) {
  print(name)
}

Output
[1] "Ajith"
[1] "Priya"
[1] "Gabriel"

In this case, the loop variable ‘name’ takes the values ‘Ajith’, ‘Priya’, and ‘Gabriel’ in
successive iterations of the loop.

4. While-Statement
The ‘for loop’ is used to execute a statement or a sequence of statements a specific number
of times. ‘While-statements’ are used when a sequence of statements must be executed
repeatedly as long as some condition is satisfied.
For example: as long as the age is greater than 30, keep repeating this operation.


Syntax
while (logical expression) {
  # Set of statements to be executed
}

The code below demonstrates the use of while. In this case, the function ‘runif()’ is used.
The call runif(1) generates one random number from the uniform distribution between 0 and 1.
We generate a new random number in each iteration and repeat the loop as long as the newly
generated value is greater than 0.3. Note that x must be initialised to a number greater than
0.3 so that the loop body is executed the first time.

# While-statement
x=1
while (x > 0.3) {
  x = runif(1)
  print(x)
}

Output
[1] 0.8326397
[1] 0.8946775
[1] 0.6325222
[1] 0.8007339
[1] 0.3405287
[1] 0.4995432
[1] 0.1090243

The output shown above will vary across executions, as ‘runif()’ generates a new random number
each time. In each iteration, x is assigned a new value and displayed. The condition is then
checked and, if it is TRUE, the loop body is executed again. In the above output, once x takes
the value 0.1090243, which is less than 0.3, control is transferred out of the loop. A
while-statement is like a for statement, except that the number of repetitions is determined
by a logical expression rather than fixed in advance.
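
Since the number of repetitions is not known in advance, it is often useful to count the
iterations and to guard against a loop that never ends. The sketch below assumes an arbitrary
seed and a hypothetical safety limit named max_draws; neither is part of the example above.

```r
set.seed(42)      # arbitrary seed so that the run can be repeated
x = 1             # start above 0.3 so that the loop body runs at least once
count = 0         # number of random numbers drawn so far
max_draws = 1000  # hypothetical safety limit on the number of iterations

while (x > 0.3 && count < max_draws) {
  x = runif(1)
  count = count + 1
}

print(count)      # number of draws needed; depends on the seed
print(x)          # the final value, at most 0.3 unless the limit was reached
```

The condition uses the long-form && because both sides are single values.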

MDS5103/ MSBA5104 Segment 03

SIMULATION USING R IN-BUILT FUNCTIONS - TUTORIAL

Table of Contents
1. Functions for Simulating Random Experiments
1.1 Random Experiment
1.2 Simulating a Simple Random Experiment
1.3 Calculating the Probability of an Event
2. Functions for Simulating Random Variables in R
2.1 Binomial Random Variables
2.1.1 Simulating Binomial Random Variables
2.1.2 Visualising a Binomial Random Variable
2.2 Simulating Poisson Variable
2.3 Continuous Random Variable
2.3.1 Simulating a Continuous Random Variable
2.3.2 Visualisation


Introduction
In real-world situations, the occurrence of an event cannot entirely be predicted. The outputs
can be random. R provides methods and functions to simulate such random experiments
and to generate random numbers. This topic covers the various options available in R to
perform such operations and visualise them.


1. Functions for Simulating Random Experiments

1.1 Random Experiment


A random experiment is one in which the outcome is random. For example, if we roll a fair die
whose six sides are numbered from 1 to 6, there is no certainty about what the outcome will
be. It could be any number from 1 to 6. We can simulate random experiments such as these
using R. In this case, simulation is the programmatic recreation of rolling the die once,
twice, or 1000 times. We can simulate more complicated random experiments, but the basic
building block remains the same.

1.2 Simulating a Simple Random Experiment


The code below demonstrates the simulation of the rolling of a fair die.

# A fun exercise to simulate the rolling of a fair die

# Sample space for rolling a fair die
s = c(1:6)

# Corresponding probabilities
p = (1/6) * replicate(6, 1)

# The sampling process
sample ( s, 1, replace = TRUE, prob = p)

The sample space, that is, the set of all possible outcomes in one run of the experiment, is
assigned to the variable ‘s’. In this case, the possible outcomes of rolling the fair die
once are the whole numbers from 1 to 6. Hence, a vector with values from 1 to 6 is created
and assigned to ‘s’. The likelihood or the probability of 1 appearing after rolling a fair
die is 1/6. This is the case for each of the numbers on the die. To specify the
probabilities for each number, we could type 1/6 six times. However, we can use the replicate
function as shown in the code above. The replicate function has multiple purposes. The
simplest one is to repeat a number or a string multiple times in a vector. In this case, it
creates a vector of six 1s. It is then multiplied by 1/6 to obtain the probability of each
number on the die. To simulate the experiment, we use the sample() function.


Syntax
sample(x, size, replace = FALSE, prob = NULL)

In the above syntax, ‘x’ refers to the vector from which values are sampled; ‘size’ refers to
the number of values to select; and ‘replace’ is a Boolean that indicates whether an item
selected previously is put back for the next selection. This value is relevant only if the
size is greater than one. ‘prob’ provides the probability of picking each of the items in the
sample space. The probabilities are provided as a vector; if ‘prob’ is not specified, all
items are equally likely.
In the code, the sample space and the probabilities are provided to the sample() function. To
begin with, only 1 output is required. Hence, the second parameter is set to 1.
The output of the code is shown below:

Output
[1] 4

The sample() function may return any number between 1 and 6. The output shown above is a
single value, 4, since the sample size is specified as 1. The result may be different for
each execution. To simulate the rolling of a pair of dice, we can specify the size (second
parameter of the sample() function) as 2 as shown below. Note that ‘replace’ must be set to
TRUE since the two dice are independent of each other and can show the same number.

# A fun exercise to simulate the rolling of a fair die


# Sampling space for rolling a pair of fair dice
s = c(1:6)

# Corresponding probabilities
p = (1/6) * replicate(6, 1)

# The sampling process


sample ( s, 2, replace = TRUE, prob = p)

Output
[1] 4 2


In this case, the output is a pair of outcomes each between 1 and 6.
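
Because the output changes on every run, it can be hard to reproduce a particular result. A
common remedy, sketched below, is to fix the state of the random number generator with
set.seed() before sampling. The seed value 123 is an arbitrary assumption.

```r
s = c(1:6)

set.seed(123)                          # arbitrary seed, chosen for illustration
first  = sample(s, 2, replace = TRUE)  # equal probabilities are the default
set.seed(123)                          # resetting the seed repeats the same draws
second = sample(s, 2, replace = TRUE)

identical(first, second)               # TRUE
```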

To repeat this experiment (rolling a die or rolling a pair of dice) several times, we can use
the replicate function.

Syntax
replicate(n, expression)

Here ‘n’ is the number of repetitions and ‘expression’ is the expression to be evaluated each
time.
The code below shows the sampling done with a pair of dice executed 10 times.

# A fun exercise to simulate the rolling of a fair die


# Sampling space for rolling a pair of fair dice
s = c(1:6)

# Corresponding probabilities
p = (1/6) * replicate(6, 1)

# The sampling process


nsimulations = 10
replicate (nsimulations, sample ( s, 2, replace = TRUE, prob = p))

Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 6 4 5 4 2 3 5 1 1
[2,] 3 2 6 1 6 5 3 2 1 2

The output is a matrix or a 2D array. The outcomes corresponding to each run of the random
experiment appear column-wise. The first simulation returns 4 and 3, the second 6 and 2,
and so on. Since ‘nsimulations’ is set to 10, there are 10 pairs of entries.

In the code below, the number of simulations is increased to 100 and the result is stored in
a variable named ‘simulated_data’. The structure of the same is displayed below.


# A fun exercise to simulate the rolling of a fair die


# Sampling space for rolling a pair of fair dice
s = c(1:6)

# Corresponding probabilities
p = (1/6) * replicate(6, 1)

# The sampling process


nsimulations = 100
simulated_data = replicate (nsimulations, sample ( s, 2, replace = TRUE, prob = p))
str(simulated_data)

Output
int [1:2, 1:100] 6 5 6 3 4 2 6 2 6 5 ...

The resulting structure, as shown above, is a matrix with 100 columns and 2 rows.
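
Once the results are stored in a matrix, individual simulations and simple summaries can be
extracted from it. The sketch below is one way to do this; the seed is an arbitrary
assumption, and colSums() and table() are base R functions not used elsewhere in this
tutorial.

```r
set.seed(1)   # arbitrary seed, for repeatability
s = c(1:6)
simulated_data = replicate(100, sample(s, 2, replace = TRUE))

dim(simulated_data)              # 2 100: two rows, one column per simulation
simulated_data[, 1]              # the two rolls of the first simulation
sums = colSums(simulated_data)   # sum of the two rolls in each simulation
table(sums)                      # how often each sum occurred
```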

1.3 Calculating the Probability of an Event


Simulation can be used to estimate the probability of an event. For example, it can answer
questions like, “How likely is it for the sum of the rolls to be at least seven when we roll
a pair of dice?”

As shown in the code below, we set the number of simulations to 10 (written as 1e1) and
replicate the simulations. The result, a 2D array, is stored in the variable named
simulatedData. Next, the frequency of the event, that is, the number of times the event
occurs, must be calculated. To do this, we create a user-defined function named checkEvent.
The syntax for a user-defined function is given below.

Syntax
function_name = function(arguments) {
# Statements
}

The checkEvent function receives a 1D array corresponding to the outcomes of a single
simulation. It checks if the sum of the numbers is greater than or equal to 7; if so, it
returns 1, and otherwise it returns 0. To invoke this function for every simulation in
simulatedData, the apply() function is used.


Syntax
apply(x, margin, function)

apply() invokes the function specified as the third argument on every row or every column of
the data x. The second parameter determines the direction: 1 applies the function row-wise
and 2 applies it column-wise. In this case, the function must be applied column-wise since
each column contains the output of a single simulation. Hence, the apply() function is
applied on simulatedData with the second parameter as 2 and the function as
‘checkEvent’.
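
The difference between the row-wise and column-wise application can be seen on a small matrix.
The matrix m below is a made-up example used only for illustration.

```r
m = matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

apply(m, 1, sum)            # row-wise: 9 12
apply(m, 2, sum)            # column-wise: 3 7 11
```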

nsimulations = 1e1
simulatedData = replicate (nsimulations, sample ( s, 2, replace = TRUE, prob = p))

# Function to check if the sum of the rolls is at least 7
checkEvent= function(data) {
  if( sum(data) >= 7) {
    return(1)
  } else {
    return(0)
  }
}
print(simulatedData)
# For each simulation: 1 if the sum of the rolls is at least 7, 0 otherwise
apply(simulatedData, 2 , checkEvent)

Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 4 4 2 1 5 3 6 1 4
[2,] 6 3 6 1 2 1 3 1 1 6
[1] 1 1 1 0 0 0 0 1 0 1

The output displays the generated numbers and the result. The result is a series of 1s and 0s,
obtained by applying the ‘checkEvent’ function to each column of the simulated data. For
example, the first outcomes are 1 and 6 (values of the 1st column). Their sum is 7, which
meets the condition of being at least 7; hence, the result is 1. The fourth outcomes are 2
and 1. Their sum is less than 7; hence, the output is 0, and so on.

To find the probability, we need to count the simulations in which the event occurred (sum
greater than or equal to 7) and divide the count by the total number of simulations.
Equivalently, we can apply the mean() function to the result, since it is a vector of 1s and
0s.

In the code shown below, the number of simulations has been increased to 100,000 (1e5)
and the mean() function has been applied to the result of the simulations. Typically, a
large number of simulations is needed to get a reliable estimate of the probability.

nsimulations = 1e5
simulatedData = replicate (nsimulations, sample ( s, 2, replace = TRUE, prob = p))

# Function to check if the sum of the rolls is at least 7


checkEvent= function(data) {
if( sum(data) >= 7) {
return(1)
} else {
return(0)
}
}

# Probability that the sum of the rolls is at least 7


mean( apply(simulatedData, 2 , checkEvent))

Output
[1] 0.58498

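The output will vary slightly between runs but will be close to 0.58, indicating that the
probability of getting a sum of two die rolls greater than or equal to 7 is about 0.58. The
exact value is 21/36, or approximately 0.583, since 21 of the 36 equally likely pairs of
rolls have a sum of at least 7.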
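
The same estimate can be obtained without a user-defined function, and it can be compared with
the exact probability found by listing all 36 pairs of rolls. This is a minimal sketch under
stated assumptions: the seed is arbitrary, and colSums() and expand.grid() are base R
functions that are not used elsewhere in this tutorial.

```r
set.seed(2022)   # arbitrary seed, for repeatability
s = c(1:6)
nsimulations = 1e5
simulatedData = replicate(nsimulations, sample(s, 2, replace = TRUE))

# colSums() adds the two rolls of every simulation at once; the comparison
# gives TRUE/FALSE, and mean() treats TRUE as 1 and FALSE as 0.
estimate = mean(colSums(simulatedData) >= 7)

# Exact value by enumerating all 36 equally likely pairs of rolls.
pairs = expand.grid(first = s, second = s)
exact = mean(pairs$first + pairs$second >= 7)   # 21/36

print(estimate)
print(exact)
```

The two printed values should agree to about two decimal places; the agreement improves as
the number of simulations grows.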

The code given below is another example of finding the probability of an event. In this case,
the probability of getting an even number on the first die of a roll needs to be found. As
with the previous case, the data received by the function checkEvent1 is a 1D array
containing the result of a single experiment. We check the 1st element and return 1 if it is
even and 0 otherwise. We apply the function checkEvent1 column-wise to simulatedData,
calculated earlier, and calculate the mean().

# Function to check if the first roll is even
checkEvent1 = function(data) {
  if (data[1] %% 2 == 0 ) {
    return (1)
  } else {
    return (0)
  }
}

# Probability that the first roll is even
mean ( apply(simulatedData, 2 , checkEvent1) )

Output
[1] 0.4977

The output, as expected, is around 0.5, indicating that there is a 50% chance of
getting an even number on the first die.

2. Functions for Simulating Random Variables in R


R also has built-in functions to simulate well-known standard random variables.

2.1 Binomial Random Variables


One of the most widely used discrete random variables is the binomial random variable. As
the name indicates, each trial has one of two outcomes: a success or a failure. For example,
consider a box of 10 balls, of which four are white and six are black. Suppose that picking a
black ball is considered a success. The probability of picking a black ball in a single draw is
then 6/10 = 0.6, because six of the 10 balls are black. So, there is a particular probability
associated with success. The binomial random variable captures the randomness in the
number of successes when we draw a fixed number of times with replacement from a box
that contains objects of two types (success and failure) and the probability of picking a
success on each draw is specified, like the value 0.6 in this case.
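The description above can be turned directly into code by drawing from the box with sample() and counting the black balls. This is only an illustrative sketch; the names box, oneExperiment, and successes are not part of the tutorial.

```r
set.seed(123)   # for reproducibility of this sketch only

# The box from the example: six black balls (success) and four white balls
box <- c(rep("black", 6), rep("white", 4))

# One experiment: draw ndraws balls with replacement and count the successes
oneExperiment <- function(ndraws) {
  draws <- sample(box, ndraws, replace = TRUE)
  sum(draws == "black")
}

# Repeat the experiment many times; each value is one binomial observation
successes <- replicate(1e4, oneExperiment(10))
mean(successes)   # close to 10 * 0.6 = 6
```

Each element of successes is a count between 0 and 10, which is the quantity a binomial random variable models.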


2.1.1 Simulating Binomial Random Variables


To generate random numbers from a binomial distribution, we use the rbinom() function.
Syntax
rbinom(n, size, prob)

In this, ‘n’ is the number of random values to generate (that is, the number of times the
experiment is repeated), ‘size’ is the number of trials (draws) in a single experiment, and
‘prob’ is the probability of success on each trial.
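Because the first two arguments are easy to mix up, a small sketch that names them explicitly may help. The variable names below are illustrative only.

```r
set.seed(2022)   # for reproducibility of this sketch only

# 3 experiments, each with 100 trials: three values, each between 0 and 100
fewLargeExperiments <- rbinom(n = 3, size = 100, prob = 0.6)

# 100 experiments, each with 3 trials: one hundred values, each between 0 and 3
manySmallExperiments <- rbinom(n = 100, size = 3, prob = 0.6)

length(fewLargeExperiments)    # 3
length(manySmallExperiments)   # 100
range(manySmallExperiments)    # stays within 0 and 3
```

The length of the result is always ‘n’, while ‘size’ sets the largest value any single result can take.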

The code below demonstrates the generation of random variables using rbinom(). Each
experiment draws 10 balls with replacement, picking a black ball counts as a success
(probability 0.6), and the experiment is repeated 10 times.

# Simulating a discrete random variable (n=10, p=0.6)
n = 10              # number of draws in one experiment (passed as the 'size' argument)
p = 0.6             # probability of success (black ball)
nsimulations = 10   # number of times the experiment is repeated
simulatedData = rbinom(nsimulations, n, p)
print(simulatedData)

Output
[1] 8 7 4 7 5 8 6 4 6 9

The output is a set of 10 random values, each between 0 and 10, indicating how many black
balls were drawn in each repetition of the experiment. There are 8, 7, and 4 black balls in
the first 3 repetitions respectively.

The experiment can be repeated many more times (e.g., 100000), with the result stored in the
variable simulatedData. The frequency of occurrence of each number can be calculated using
the table() function as shown below. This can then be converted into a dataframe using the
as.data.frame() function.


# Simulating a discrete random variable (n=10, p=0.6)


n = 10
p = 0.6
nsimulations = 1e05
simulatedData = rbinom(nsimulations, n, p)
df = as.data.frame(table(simulatedData))
colnames(df) = c("Value","Frequency")
head(df)

Output

Value Frequency
0 13
1 161
2 1005
3 4258
4 11268

The output displays how many times each number of black balls was drawn. That is, zero
black balls were drawn in 13 of the 100000 repetitions of the experiment, one black ball was
drawn 161 times, two black balls were drawn 1005 times, and so on.
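The simulated frequencies can be compared with the exact binomial probabilities, which R provides through the dbinom() function. The sketch below is an illustration under the same settings (n = 10, p = 0.6); the names simulatedFreq, exactProb, and comparison are assumptions made for this example.

```r
set.seed(1)   # for reproducibility of this sketch only
n <- 10
p <- 0.6
nsimulations <- 1e5
simulatedData <- rbinom(nsimulations, n, p)

# Relative frequency of every possible value 0..n (factor() keeps values that never occur)
simulatedFreq <- as.numeric(table(factor(simulatedData, levels = 0:n))) / nsimulations

# Exact probabilities from the binomial probability mass function
exactProb <- dbinom(0:n, size = n, prob = p)

comparison <- data.frame(Value = 0:n,
                         Simulated = simulatedFreq,
                         Exact = round(exactProb, 5))
print(comparison)
```

For example, the exact probability of drawing zero black balls is 0.4^10, about 0.0001, which corresponds to roughly 10 occurrences in 100000 repetitions and is in line with the count of 13 shown above.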

2.1.2 Visualising a Binomial Random Variable


We can plot the frequencies as a bar chart using the ggplot2 package, as shown below. We
first initialise a plot object with ggplot(). The bar chart is created using the function
geom_col(), with the columns ‘Value’ and ‘Frequency’ mapped to the x and y axes. The bars
are filled with the colour ‘steelblue’ and their width is set to 0.7. The title is set to
“Simulating a binomial random variable”, the labels for the x and y axes are defined, and the
minimal theme is applied.

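library(ggplot2)

# Named binomPlot rather than p so that the probability p = 0.6 defined earlier is not overwritten
binomPlot = ggplot(data = df) +
  geom_col(aes(x = Value, y = Frequency), width = 0.7, fill = 'steelblue') +
  ggtitle("Simulating a binomial random variable") +
  labs(x = "Values", y = "Frequency") +
  theme_minimal()
binomPlot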


Output

The result is a bar chart that is almost symmetric around the value 6, which is the expected
number of successes (n × p = 10 × 0.6).

2.2 Simulating Poisson Variable


Another random variable that can be simulated is the Poisson random variable, which counts
how many times an event occurs in a fixed interval. For example, we can simulate the random
number of customers that would arrive at a shop in the next five minutes. Based on past
experience, say 10 customers show up on average between 10:00 and 10:05. We can simulate
this and then estimate how many customers are likely to turn up in the next five-minute
interval. The parameter ‘lambda’ denotes how many customers show up on average over a
unit time interval; a unit time interval in this case is 5 minutes. The rpois() function
generates such values: the call below simulates 100 five-minute intervals with lambda = 10.

rpois(100, lambda=10)


Output
[1] 7 5 9 4 10 6 9 13 10 11 11 9 12 9 15 7 14
[18] 4 8 9 9 8 13 14 8 12 14 15 13 12 10 10 10 6
[35] 12 6 9 14 8 11 9 4 12 9 10 6 12 11 12 10 16
[52] 15 14 11 6 6 9 11 7 8 5 6 13 7 10 11 13 10
[69] 14 12 9 10 4 6 10 12 9 11 14 11 10 15 8 10 6
[86] 13 10 12 7 9 11 12 11 8 11 13 9 8 13 6

The output shows how many customers showed up in each of the 100 simulated five-minute
slots. Most of the numbers are around the value 10. This has very interesting practical
applications. For example, we can also simulate the number of people arriving at a bus stop
in the next minute, or the number of photons that strike a pixel on a sensor, and so on.
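As with the dice example, the simulated values can be used to estimate probabilities. The sketch below estimates the chance that 15 or more customers arrive in one interval and compares it with the exact value from ppois(); the event chosen and the variable names are illustrative assumptions, not part of the tutorial.

```r
set.seed(7)   # for reproducibility of this sketch only
lambda <- 10
arrivals <- rpois(1e5, lambda = lambda)   # 100000 simulated five-minute intervals

# For a Poisson random variable, both the mean and the variance equal lambda
mean(arrivals)
var(arrivals)

# Estimated probability that 15 or more customers arrive in one interval
simProb <- mean(arrivals >= 15)

# Exact probability: P(X >= 15) = P(X > 14)
exactProb <- ppois(14, lambda = lambda, lower.tail = FALSE)

print(c(simProb, exactProb))
```

The two probabilities should agree to about two decimal places with this many simulations.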

2.3 Continuous Random Variable


The two random variables simulated above count the number of successes or the number of
customers, which are discrete quantities. Unlike those, a continuous random variable can
take any value within a range. For example, pick students at random and measure their
heights. Each height can be anywhere between, say, 140 cm and 220 cm, so the outcome
can take infinitely many values.

2.3.1 Simulating a Continuous Random Variable


We can generate a set of continuous random variables using R. Consider a set of 100,000
students and their heights. On average, the students are 170 cm tall, and they have a
standard deviation of 8 cm. Assume that the data follows a normal distribution; that is, there
is a symmetric distribution of people with heights around the average height. This type of
data can be simulated using the rnorm() function in R.

Syntax
rnorm(n, mean, sd)


In this syntax, ‘n’ represents the count of numbers to be generated, ‘mean’ the mean of the
distribution from which they are drawn, and ‘sd’ its standard deviation.
Sample code to generate 10 continuous random values with a mean of 170 and a standard
deviation of 8 is given below.

# Simulating a continuous random variable


mu = 170
sigma = 8
nsimulations = 10
rnorm(nsimulations, mean = mu, sd = sigma)

Output
149.4088 159.3580 181.0153 166.8341 165.4002
171.8139 163.5023 172.2372 170.2813 171.7389

The output is a set of 10 numbers distributed around 170.

In a typical experiment, the number of simulated values is very large, for example 100,000.
The data generated can be converted into a dataframe and stored for further processing. Note
that, in this case, the table() function is not needed as we are not calculating frequencies.
The column name can be changed to reflect the data contained in the column.
The code to convert the 100,000 normally distributed random values into a dataframe is
shown below. The column name has been changed to ‘Height’.

# Simulating a continuous random variable
mu = 170
sigma = 8
nsimulations = 1e5
simulatedData = rnorm(nsimulations, mean = mu, sd = sigma)
df = as.data.frame(simulatedData)
colnames(df) = 'Height'
head(df)


Output
    Height
2 181.4812
3 175.4368
4 166.8967
5 169.0739

The data thus generated can be used to estimate the probability of events. For example, the
data can be used to estimate the probability that a random person’s height is between 170
and 171 cm. The code below evaluates, for each of the 100000 simulated values, whether it is
greater than or equal to 170 and less than or equal to 171, and displays the first 10 results.

# Simulating a continuous random variable
mu = 170
sigma = 8
nsimulations = 1e5
simulatedData = rnorm(nsimulations, mean = mu, sd = sigma)
df = as.data.frame(simulatedData)
colnames(df) = 'Height'
inRange = (df['Height'] >= 170) & (df['Height'] <= 171)
head(inRange, 10)   # show only the first 10 of the 100000 results

Output
Height
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] TRUE
[5,] FALSE
[6,] FALSE
[7,] FALSE
[8,] FALSE
[9,] FALSE
[10,] FALSE


To estimate the probability, we must calculate the ratio of the number of successes to the
total count. Alternatively, we can use the mean() function to do the same, because ‘TRUE’ is
treated as 1 and ‘FALSE’ as 0 in the calculation of the mean().

# Simulating a continuous random variable


mu=170
sigma = 8
nsimulations = 1e5
simulatedData = rnorm(nsimulations, mean = mu, sd=sigma )
df = as.data.frame(simulatedData)
colnames(df) = 'Height'
mean((df['Height'] >= 170) & (df['Height'] <=171))

Output
[1] 0.05

The output implies that for normally distributed data with a mean of 170 and a standard
deviation of 8, the probability of a value being between 170 and 171 is about 0.05.
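The simulated estimate can be checked against the exact probability, which R computes from the normal cumulative distribution function pnorm(). The following is a minimal sketch; simProb and exactProb are illustrative names.

```r
set.seed(11)   # for reproducibility of this sketch only
mu <- 170
sigma <- 8
heights <- rnorm(1e5, mean = mu, sd = sigma)

# Simulated estimate: fraction of values between 170 and 171
simProb <- mean(heights >= 170 & heights <= 171)

# Exact value: P(X <= 171) - P(X <= 170)
exactProb <- pnorm(171, mean = mu, sd = sigma) - pnorm(170, mean = mu, sd = sigma)

print(c(simProb, exactProb))   # exactProb is about 0.0497
```

The exact value of about 0.0497 is consistent with the simulated output of 0.05.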

2.3.2 Visualisation
The random continuous values can be visualised using histograms. The sample code is
shown below. We first initialise the plot object and add a geom_histogram() layer to it. The
x-axis is set to the ‘Height’ column of the new dataframe and the y-axis is initially set to
the internal variable ‘..count..’. Since this is a histogram, we need to specify the edges of
the bars (bins). These are generated with the seq() function, which produces a sequence of
numbers from (mean-4*standard deviation) to (mean+4*standard deviation) in steps of 2, and
are passed to the argument ‘breaks’ of the geom_histogram() function. Other attributes like
colour, fill, alpha, and labels can be specified as shown below.

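delta = 2   # width of each bin
# Note: in ggplot2 3.4.0 and later, after_stat(count) is the preferred spelling of ..count..
p1 = ggplot(df) +
  geom_histogram(aes(x = Height, y = ..count..),
                 breaks = seq(mu - 4*sigma, mu + 4*sigma, by = delta),
                 color = 'black', fill = 'steelblue', alpha = 0.4) +
  labs(x = 'Height', y = 'Count')
p1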


Output

The height of each bar indicates the count of the simulated values that fall within the
corresponding range on the x-axis. For example, around 10,000 simulated values fall
between 170 and 172.

The code below displays the relative frequency. In this, the y-axis is ‘..count../sum(..count..)’.

delta = 2   # width of each bin
p1 = ggplot(df) +
  geom_histogram(aes(x = Height, y = ..count../sum(..count..)),
                 breaks = seq(mu - 4*sigma, mu + 4*sigma, by = delta),
                 color = 'black', fill = 'steelblue', alpha = 0.4) +
  labs(x = 'Height', y = 'Relative frequency')
p1


Output

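In the above output, the height of each bar is the fraction of the plotted values that fall
in that range, so every height lies between 0 and 1 and the heights add up to 1.

We can also use the ‘..density..’ variable on the y-axis. This divides the relative frequency
by the width of each bar, so that the total area of the bars is 1. In this case, the width is 2,
so each relative frequency is divided by 2.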

delta = 2   # width of each bin
p1 = ggplot(df) +
  geom_histogram(aes(x = Height, y = ..density..),
                 breaks = seq(mu - 4*sigma, mu + 4*sigma, by = delta),
                 color = 'black', fill = 'steelblue', alpha = 0.4) +
  labs(x = 'Height', y = 'Density')
p1


Output

In the above output, since ‘..density..’ is used, the height of each bar is the relative
frequency divided by the bin width of 2, so the total area of the bars is 1.
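The relationship between count, relative frequency, and density can be verified numerically. The sketch below uses the base R hist() function with plot = FALSE instead of ggplot, purely so that the bin values can be inspected; it is an illustration with assumed variable names, not the method used in the tutorial's plots.

```r
set.seed(3)   # for reproducibility of this sketch only
mu <- 170
sigma <- 8
delta <- 2
heights <- rnorm(1e5, mean = mu, sd = sigma)
breaks <- seq(mu - 4 * sigma, mu + 4 * sigma, by = delta)

# hist() stops with an error if a value lies outside the breaks, so keep in-range values only
inRange <- heights[heights >= min(breaks) & heights <= max(breaks)]
h <- hist(inRange, breaks = breaks, plot = FALSE)

relFreq <- h$counts / sum(h$counts)   # relative frequency: bar heights add up to 1
density <- relFreq / delta            # density: bar areas add up to 1

sum(relFreq)           # 1
sum(density * delta)   # 1

# The density heights approximate the normal curve at the bin midpoints
maxGap <- max(abs(density - dnorm(h$mids, mean = mu, sd = sigma)))
print(maxGap)
```

Dividing by the bin width is what makes the histogram comparable with the normal density curve, whatever value of delta is chosen.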
