
MST-015

Introduction to R Software
Indira Gandhi National Open University
School of Sciences

BLOCK 1
Fundamentals of R Language
BLOCK 2
Functions, Conditional Statements, Loops and Descriptive Statistics with R
INTRODUCTION TO R SOFTWARE
BLOCK 1: Fundamentals of R Language
Unit 1: Introduction to R

Unit 2: Nitty-Gritty of R

Unit 3: Membership Testing, Coercion and Lists in R

Unit 4: Data Frames, Reading and Writing in R


Unit 5: Graphical Representation of Data with R

BLOCK 2: Functions, Conditional Statements, Loops and Descriptive Statistics with R
Unit 6: Functions in R

Unit 7: Control-Flow Constructs of R

Unit 8: Apply Family in R

Unit 9: Descriptive Statistics and Correlation with R


Block 1
FUNDAMENTALS OF R LANGUAGE
UNIT 1
Introduction to R

UNIT 2
Nitty-Gritty of R

UNIT 3
Membership Testing, Coercion and Lists in R

UNIT 4
Data Frames, Reading and Writing in R

UNIT 5
Graphical Representation of Data with R
Curriculum and Course Design Committee
Prof. Sujatha Varma
Former Director, SOS, IGNOU, New Delhi

Prof. Diwakar Shukla
Department of Mathematics and Statistics, Dr. H. S. Gaur Central University, Sagar (MP)

Prof. Gulshan Lal Taneja
Department of Mathematics, M. D. University, Rohtak (HR)

Prof. Gurprit Grover
Department of Statistics, University of Delhi, New Delhi

Prof. H. P. Singh
Department of Statistics, Vikram University, Ujjain (MP)

Prof. Rahul Roy
Mathematics and Statistics Unit, Indian Statistical Institute, New Delhi

Prof. Rajender Prasad
Division of Design of Experiments, IASRI, Pusa, New Delhi

Prof. Rakesh Srivastava
Department of Statistics, M. S. University of Baroda, Vadodara (GUJ)

Prof. Sanjeev Kumar
Department of Statistics, Banaras Hindu University, Varanasi (UP)

Prof. Shalabh
Department of Mathematics and Statistics, Indian Institute of Technology, Kanpur (UP)

Prof. V. K. Singh (Retd.)
Department of Statistics, Banaras Hindu University, Varanasi (UP)

Prof. Manish Trivedi, SOS, IGNOU
Dr. Taruna Kumari, SOS, IGNOU
Dr. Neha Garg, SOS, IGNOU
Dr. Rajesh, SOS, IGNOU
Dr. Prabhat Kumar Sangal, SOS, IGNOU
Dr. Gajraj Singh, SOS, IGNOU

Course Preparation Team


Course Editor
Prof. Anoop Chaturvedi (Units 1-9)
Retired from Department of Statistics, University of Allahabad, Prayagraj, Uttar Pradesh

Course Writer
Dr. Taruna Kumari (Units 1-9)
School of Sciences, Indira Gandhi National Open University, New Delhi, Delhi

Formatted and CRC Prepared by Dr. Taruna Kumari and Ms Preeti, SOS, IGNOU
Course Coordinator: Dr. Taruna Kumari
Programme Coordinators: Dr. Neha Garg and Dr. Prabhat Kumar Sangal

Print Production
Mr. Rajiv Girdhar Mr. Hemant Parida
Assistant Registrar Section Officer
MPDD, IGNOU, New Delhi MPDD, IGNOU, New Delhi

Acknowledgement: From the depth of my heart I render my gratitude to my family, especially my father Mr. Puran
Chand, my mother Mrs. Raj Rani, my husband Mr. Anupam Pathak and my son Prithu, for providing me the necessary
comfort to overcome the ups and downs during the development of this material. Also, I extend my thanks to my
former graduate and postgraduate students for their feedback and questions, which enabled me to give
detailed explanations.
April, 2023
© Indira Gandhi National Open University, 2023
ISBN-978-81-266-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means,
without permission in writing from the Indira Gandhi National Open University
Further information on the Indira Gandhi National Open University may be obtained from the University’s Office at
Maidan Garhi, New Delhi-110068 or visit University’s website http://www.ignou.ac.in
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by the Director, School
of Sciences.
INTRODUCTION TO R SOFTWARE
R is a high-level language whose popularity is increasing day by day. It can also
be referred to as an environment for statistical analysis of data and for graphics.
You may be astonished to know that the R language has been around since 1993.
R is a dialect of the S language.¹ The S language was developed at Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. The evolution of the S
language is described in the four books by John Chambers and coauthors.² For his work
on S, the Association for Computing Machinery (ACM) presented John Chambers with its
Software System Award, whose citation mentions that this language has “forever altered how people
analyze, visualize and manipulate data”. R was written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland in New Zealand.
There are several reasons for the popularity of R. Some of them are stated here:
R is an interpreted language, and it is free.
It is an outstanding software, which is easy to use as well.
It works on Windows, Unix, Mac and Linux.
A number of packages are available for statistical data analysis.
It comes with several data sets.
Good-quality support and back-up on functions and packages is available (via web pages, R documents and books).
It is widely accepted by many researchers, industrialists and professors for data analysis.
The main reason for the impressive growth in the popularity of the R language nowadays is the
emergence of data science as a career, because data is everywhere and experts are needed
to sort and analyze that data. So, together with the knowledge of computing, knowledge of
statistical methods and machine learning is also required.
This course is mainly written for learners who are beginners in the R computing software.
Throughout the development of this course, emphasis is given to the packages which
come with the base distribution (i.e., precompiled binary distributions of the base system)
during installation. It is essential for learners to understand the basics of R before
switching to more complicated problems, such as those discussed in the lab courses, i.e.,
MSTL-011: Statistical Computing Using R-I, MSTL-012: Statistical Computing Using R-II,
MSTL-013: Statistical Computing Using R-III and MSTL-015: Statistical Computing Using R-V. The
content of this course is organized into 9 self-explanatory units. The first five units are part of
Block 1 (Fundamentals of R Language) and the next four units are part of Block 2
(Functions, Conditional Statements, Loops and Descriptive Statistics with R). These units
can be summarized as follows:
Unit 1 (Introduction to R): It comprises the installation procedure, methods of seeking help
and details on the basic terminology of R.
Unit 2 (Nitty-Gritty of R): The second unit discusses R objects such as different types
of vectors, matrices, factors and arrays. It also throws light on missing values and on arithmetic and
logical operations.
Unit 3 (Membership Testing, Coercion and Lists in R): As is clear from the name, this unit
discusses membership testing and coercion of R objects. Additionally, list objects are also
discussed in this unit.
Unit 4 (Data Frames, Reading and Writing in R): This unit gives extensive details on data
frame objects, methods of reading and writing from/to a file and formatting commands.
¹ Refer to the “An Introduction to R” manual by the R Core Team.
² Refer to the “R Language Definition” manual by the R Core Team.
Unit 5 (Graphical Representation of Data with R): Different types of graphical functions that
are used to create plots such as the scatterplot, boxplot, histogram, barplot, stripchart, stem and
leaf plot, pie chart, pairs plot, coplot, cloud plot etc are discussed in this unit.
Unit 6 (Functions in R): The method of creating your own function is discussed in this unit by
taking some suitable examples.
Unit 7 (Control-Flow Constructs of R): Control-flow constructs such as conditional
statements, different types of loops and the method of putting additional control on the loops
using the next and break statements are discussed in this unit with examples.
Unit 8 (Apply Family in R): This unit comprises details on the usage and importance of the
apply family of functions.
Unit 9 (Descriptive Statistics and Correlation with R): Unit 9 comprises details on
measures of central tendency and dispersion together with examples of correlation
computations with R.
To develop this course, we have used the Windows operating system, and the R commands
written in this course are run on R version 4.1.1. In a Windows system, we interact with R
through the R console. Furthermore, the written commands can be easily saved. More details
on this are given in Unit 1 of this course.
In this course, the written code, associated outputs and names of functions, R objects,
packages and operators are written in the ‘Lucida Console’ font and theory is written in the ‘Arial’
font. Additionally, the R commands are written in bold and the associated outputs are
not bold. Note that the lines starting with ‘ # ’ written before the R commands are
comments, which are not executed and are written to give a clear understanding of the code. Furthermore,
while studying this course, do all the illustrations on a computer, preferably by writing the
commands in R script files (in the integrated editor) available in the R Graphical User Interface
(RGui). Then do all the SAQs and TQs without using a computer.
It is important to note that if you use any R function in your research/publications for data
analysis, then you should cite the package it belongs to in your written work. For example, to
cite the base package, first get the citation details using the citation() function
and then use the obtained reference for citation purposes as follows:
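
A minimal sketch of this step is given here. It assumes that R is started with its default packages; when called without arguments, citation() returns the reference for R itself, which is the reference used for the base package. The object name r_ref is only illustrative.

```r
#Getting the citation details for R (base package)
r_ref <- citation()

#Printing the reference to be used in the written work
print(r_ref)
```

The printed text (and the BibTeX entry shown with it) depends on the installed version of R, so the year and version in your output may differ from those in the reference list of this course.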

In case the citation details are not accessible (or available) via the citation() function at the
prompt, then learners may visit the CRAN (Comprehensive R Archive Network) page to get
the details of the contributors (such as authors’ names, year and title) for citation purposes.
Lastly, in this introduction page I would like to express my deepest gratitude and thanks to
the R core team, Bill Venables, David M. Smith, John Chambers, Robert Gentleman, Ross
Ihaka, Martin Maechler and other contributors for providing access to enormous R resources
and for their substantial contribution to the R language, which has greatly benefited the world.
MST-015 (Introduction to R Software) is a 2-credit self-explanatory course, which is
developed for self-study. Still, if you want to refer to additional books or references on
the discussed topics, you may refer to the following books and references.
Suggested Further Reading
1. Braun, W. J. & Murdoch, D. J. (2007). A First Course in Statistical Programming with R.
Cambridge.
2. Crawley, M. J. (2012). The R Book. John Wiley & Sons.
3. Albert, J. & Rizzo, M. (2012). R by Example. Springer.
4. Teetor, P. (2011). R Cookbook. O’Reilly.
5. Lafaye de Micheaux, P., Drouilhet, R., & Liquet, B. (2013). The R Software:
Fundamentals of Programming and Statistical Analysis. Springer.
6. Zuur, A., Ieno, E. N., & Meesters, E. (2009). A Beginner's Guide to R. Springer Science
& Business Media.
7. Heumann, C., Schomaker, M. & Shalabh (2016). Introduction to Statistics and Data
Analysis: With Exercises, Solutions and Applications in R. Springer International
Publishing Switzerland.
8. Dalgaard, P. (2002). Introductory Statistics with R. New York: Springer-Verlag.
References
The packages used for the development of this course material can be referred from the
following references:
1. R Core Team (2021). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
2. Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition.
Springer, New York. ISBN 0-387-95457-0.
3. Mirai Solutions GmbH (2023). XLConnect: Excel Connector for R. R package version
1.0.7. https://CRAN.R-project.org/package=XLConnect
4. Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R. Springer, New
York. ISBN 978-0-387-75968-5
5. Lukasz Komsta and Frederick Novomestky (2022). moments: Moments, Cumulants,
Skewness, Kurtosis and Related Tests. R package version 0.14.1.
https://CRAN.R-project.org/package=moments
Expected Learning Outcomes
After completing this course, you should be able to:
install R, seek help on functions and data sets, create R scripts and learn some basic
aspects of R;
create R objects, know the different data types and use the membership
testing and coercion functions;
read and write from/to a file;
do graphical representation of data with R;
do looping, create control statements and functions in R; and
compute descriptive statistics and correlation with R.

Feedback Link: https://forms.gle/SZZ23dxBEDJJdEGt9

Course Preparation Team


BLOCK 1 FUNDAMENTALS OF R LANGUAGE
Block 1 of the MST-015 (Introduction to R Software) course provides a brief
self-explanatory introduction to the R language, types of data, objects of the R language,
membership testing and coercion functions and graphical representation of data. This
block comprises Units 1 to 5. The details of each unit are as follows:
Unit 1: This unit is written to make you aware of downloading, installing and running
the R software. Some important aspects such as case sensitivity of the language, getting help,
available sources like R manuals, incomplete commands etc are also discussed in this unit.
Unit 2: The second unit comprises a detailed discussion of vector, matrix, factor and
array objects. In addition to this, logical and relational operations and the handling of missing
values, i.e., NA and NaN values, are also discussed in this unit.
Unit 3: In the third unit, a brief introduction to the list object and associated details are
given. The membership testing and coercion functions and their outputs are also explained by taking
suitable examples. Additionally, some of the attributes of R objects are
also discussed in this unit.
Unit 4: Reading and writing data from/to files plays a very important role in computing, and
when we talk about reading data, we can’t miss one of the most useful and convenient objects
of the R language, which is the data frame. So, in this unit we have focused on the reading and
writing of data together with an extensive discussion of data frames and some suitable functions.
Additionally, formatting commands along with date and time objects are also discussed in
this unit.
Unit 5: This last unit of Block 1 consists of extensive detail on graphical representations of
statistical data, which help us to present the data in a meaningful way. It consists of a
detailed discussion of the creation of different types of plots and graphs in R, the usage of
different graphical arguments of the graphical functions, ways of changing the appearance of
a created plot or graph and the method of saving a created plot.
This material has been developed for self-study. We hope you will enjoy studying this block.
Expected Learning Outcomes
After completing this block, you should be able to:
install R, create R scripts, seek help on built-in functions and data sets;
learn some important aspects of the R language, such as case sensitivity, objects, class,
incomplete commands, assignment operators etc.;
create different objects of R, such as vectors, matrices, arrays, data frames and lists.
Also, you will learn the difference between different data types, such as
integer, numeric, character etc.;
learn and use the membership testing and coercion functions;
read data from different types of files, such as .txt, .csv, .xlsx and also be able to write
data to them;
perform different types of mathematical, relational and logical operations; and
create different types of plots/graphs in R. Also, you will be able to present them in a more
attractive manner by changing their default look.

Block Preparation Team


UNIT 1
INTRODUCTION TO R

Structure
1.1 Introduction
    Expected Learning Outcomes
1.2 Downloading and Installing R
1.3 Running and Quitting R
1.4 Case Sensitivity of the Language and Help on R
1.5 Some Other Important Aspects of R
    Assignment Operators
    Writing Comments
    Commands Separators
    Incomplete Commands
    Recalling Previous Commands
    Listing and Removing R Objects
    Listing, Installing and Removing R Packages
    The typeof() and class() Functions
    R Manuals
    Commands History
1.6 Summary
1.7 Terminal Questions
1.8 Solutions/Answers

1.1 INTRODUCTION
This unit provides an introduction to the main features of the R language. We
do not assume any familiarity with computer programming on the part of the
learner while learning from this unit. The present unit lays the groundwork for the
other units. It explains the procedures of downloading, installing and running
R. It also explains some important basic concepts related to the R language,
such as objects, classes of objects and case sensitivity of the language.
Most importantly, it explains how to find help on R constants, reserved words,
data sets and functions, which puts you on the path to finding answers to your
queries.
Many books on the R programming language assume that you are familiar with R
fundamentals, such as syntax, functions, operators, data types and so on. The
speciality of the MST-015 (Introduction to R Software) course is that it does not
require prior knowledge of any computing software. The R programming
language is discussed here from scratch.
Note that R is a free (open source) interpreted language. It is specially
designed for handling statistical computations and for graphical representation
of data. It also provides interfaces to other languages and debugging facilities.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
Nowadays R is used daily by an enormous number of people to perform data analysis. R
has now become a tough competitor to almost all the commercial statistical
software packages.

We recommend that you study the course introduction pages to become
aware of the development of the R language and of its contributors, whose work has
tremendously benefited the world.

Expected Learning Outcomes


After completing this unit, you should be able to:
download, install and run the R programming software;
find help on the R functions and constants;
understand R objects and classes;
understand case sensitivity of the language;
differentiate between complete and incomplete R commands;
list and remove the R objects used in your R session; and
access R manuals.

1.2 DOWNLOADING AND INSTALLING R
The most convenient way to download R on your system is to obtain the base
distribution from the R Website, which is as follows:
https://www.r-project.org/
When you go to the above link (assuming you have access to the internet),
it will take you to the following page:


From the screenshot you can observe that several important pieces of information
related to Download, R Project, R Foundation, Help with R, FAQs
(Frequently asked questions), R Manuals and others are available on this
Website.
To download R, click on CRAN (Comprehensive R Archive Network) (under
Download); you will then be directed to a list of CRAN mirror sites
organized by country. You need to select a site near you.

In some books, you may find instructions to download R directly from the following
link, which takes you to the download page with the CRAN mirror site selected as
Austria.
https://cran.r-project.org/
After selecting a CRAN mirror, you will be directed to the following
downloading page:

On the CRAN page you will find some precompiled binary distributions of the
base system and contributed packages for Linux, macOS and Windows
operating systems. Choose one of the suitable options from the available
options under “Download and Install R” to download R. Here, we are
explaining the method of downloading R for Windows, as all the commands
written in this course are executed (or evaluated) on Windows.

Then click on base under subdirectories to download R. The base R
distribution consists of many classical and modern statistical techniques. The
statistical techniques which are not supplied as part of base R (may be
referred to as the base R distribution) may be obtained with the help of the several
packages available. A number of packages are supplied with base R, known
as “standard” and “recommended” packages. That is why in this course main
importance is given to these packages, and in the MSTL-011 (Statistical
Computing Using R-I) course we tried our best to solve the problems using
these R base distribution packages.
So, when you click on base under subdirectories, you will be directed to the
following page. Then you can click on “Download R-4.3.1 for Windows” to
download R.


This page gets updated from time to time, and you will always find the latest version
of R to download on your system (at the time of writing, R-4.3.1 is the latest version of R
available to download).
After downloading R (see the location where downloads are saved), run the
setup program on your PC by double-clicking on the downloaded application .exe
file (see the screenshot of the downloaded application with opened properties
shown below). Then follow the instructions and wait for it to be installed
successfully (click on Finish to complete the installation process).
Note: Alternatively, you can get the set-up file from your friends or known persons
and run it on your PC to install R.

In case you are installing R for macOS, click on “Download R for
macOS” under “Download and Install R”. Then click on the .pkg file for the
latest version of R, download it and install it by double-clicking on the .pkg file.
Otherwise, if you are installing R for Linux, click on “Download R
for Linux” under “Download and Install R”. The major Linux distributions like
Debian, Redhat, Ubuntu etc have packages for installing R. You just need to
use the system’s package manager to download and install the package.
Note: To download R, you can also type “Download R” on Google and get all
the important links, which help you to download R on your system.

SAQ 1
What is the basic difference between an interpreted language and a compiled
language? Also, give an example of each one of them.

1.3 RUNNING AND QUITTING R


To run the R software, you just need to double-click on the R icon on your
desktop (during installation you were asked whether to create an R icon on the
desktop or not), while in Unix versions you type R at the operating system
prompt. Both will bring up the R console with its prompt. In a Windows system the
prompt is ‘ > ’; it may be different in other operating systems.
Note that any syntactically error-free command written in front of the prompt ‘ > ’ in the
R console gets evaluated when we press the ENTER key. The R Graphical User
Interface (RGui) also supports a text editor, which can be accessed as follows:
Click on the menu bar → click on File → click on New script.
When you click on New script, an Untitled - R Editor window will be opened, which
is known as an R script, in which you can write the R commands that you want to
evaluate. After writing the commands you can save them (the cursor should be on
the R script while saving the file) by clicking on File and thereafter Save as or
Save (save it as you save any other file, like Word, Excel etc).
The created file will be saved with the .R extension. It can be accessed later as
follows:
Click on File → click on Open script → go to the location (where it is saved)
and select the required file to open.
Furthermore, a number of commands written on the R script editor can be
evaluated by firstly selecting them and then pressing Ctrl+R, which means
pressing control key with R. Or otherwise, if you want to run only a single
command then you can put the cursor at that R command (which you want to
evaluate), then press Ctrl+R. The R script editor is mainly useful when you
want to save retyping, and these files are easily manageable.
Note: The icon on the desktop will be visible with its version. If you have not opted
for the creation of an icon on the desktop, then you can go to Programs and then to
R and thereafter find the R icon and double-click on it to run R. Furthermore, it is
always better to visit the CRAN page to get the latest version of R.
Note that in a Windows system users interact with R through the R console. When
we double-click on the R icon, the following page will appear:

The information written on the R Console page is extremely important, as it
consists of information about its licence, contributors, getting citation details,
demonstrations, ways of seeking help and quitting R. The license() or
licence() function is used to get details on the R distribution licence.
The details of the contributors can be accessed using the contributors()
function (we recommend accessing the contributors page to see the complete
information). For illustration purposes, we now run the contributors()
function on the R console as follows:

The paragraph also suggests the method to get the citation details of any
used package. We now try to get the citation details for the base package using the
citation() function by supplying base in double quotes (as a character
string) in the following manner:
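
The command described above can be sketched as follows. The object name base_ref is an assumption made for illustration, and toBibtex() is a function from the utils package (loaded by default) that is used here only to show the same reference as a BibTeX entry.

```r
#Getting citation details for the base package
base_ref <- citation("base")
print(base_ref)

#Showing the same reference as a BibTeX entry
toBibtex(base_ref)
```

The exact wording of the output depends on the installed version of R.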

Other functions which are shown in the paragraph written on the R console page
are demo(), q(), help() and help.start(). The demo() function is a
user-friendly interface for running some demonstration R scripts; thus, as the
name suggests, it is for demonstration purposes. For more clarification, you can
run the following command:
#Getting demonstration on graphics package
> demo("graphics")

Quitting R:
We can quit R by writing the command q() at the prompt. As you press Enter,
you will be asked if you want to save the current workspace or not (you can
respond yes, no or cancel). If you want to resume your current work later from
the point at which you are leaving it, then you can select yes, otherwise no. You can
also cancel the quitting request by selecting the cancel option.
Note: An alternative way to interact with R is using RStudio, which can be
downloaded from the following link:
https://www.rstudio.com
RStudio can be downloaded and installed for all the operating systems for
which the R software is downloadable. Like the R software, it also supports a script
editor where we can write complex programs. But for this course, we
recommend the use of the R software.

SAQ 2
Write a command to get the citation details on the lattice package.

1.4 CASE SENSITIVITY OF THE LANGUAGE AND HELP ON R
In this section, we shall discuss the case sensitivity of the R language and the
methods of seeking help on any constant, reserved word, data set or
function.
Case Sensitivity:
R is a case-sensitive language. By case sensitivity, we mean the ability to
distinguish between the lower-case and upper-case versions of letters. Due to case
sensitivity, A and a, T and t are different letters. By the same argument, the
variable names GOOD and GoOD, Blessed, blessed and BLESSED are
different. For illustration purposes, we now assign the value 10 to a variable A
and print its value using upper-case and lower-case letters as follows:
#Assigning a value 10 to A
> A <- 10

#Printing A
> print(A)
[1] 10

#Printing a
> print(a)
Error in print(a) : object 'a' not found

Hence, it is verified that upper- and lower-case letters are different in R, i.e., R
is a case sensitive language. Consider another example in which we assign a
character string "OM" to a variable named name and print it by combining
upper and lower characters of the variable name as follows:
#Assigning a character string
> name <- "OM"

#Printing name
> print(name)
[1] "OM"

#Printing nAME and NaMe
> print(nAME)
Error in print(nAME) : object 'nAME' not found

> print(NaMe)
Error in print(NaMe) : object 'NaMe' not found

Hence, name, nAME and NaMe are not the same due to the case sensitivity of the
language.
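
The same point can be checked without triggering an error. The sketch below uses the exists() function, which is not discussed in this unit and is introduced here only for illustration; with inherits = FALSE it reports whether an object with exactly the given name has been created in the current workspace.

```r
#Assigning a character string to name (all lower case)
name <- "OM"

#Checking which spellings exist as objects in the workspace
exists("name", inherits = FALSE)
#[1] TRUE
exists("nAME", inherits = FALSE)
#[1] FALSE
exists("NaMe", inherits = FALSE)
#[1] FALSE
```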
Help on R:
Recall that the paragraph written on the R console mentions
“help() for on-line help, or help.start() for an HTML browser interface to
help”. Actually, R has a built-in help facility, which can be easily accessed
using the help() function or by using the ‘ ? ’ symbol. For illustration
purposes, suppose that we are interested in finding help on a function named
prod(); then it can be achieved either by using the help(prod) command
or by writing ?prod as follows:
#Seeking help
> help(prod)
> ?prod #An alternative

Note: To get help using ‘ ? ’, write the name of the function without parentheses
‘ ( ) ’ after ‘ ? ’.
When the help() command is executed, the R Documentation page
consisting of the details of the function and its arguments, together with
examples and other necessary details, will pop up as follows:

Hence, from the help page we learn that the prod() function is available in the
base package and that it is used to compute the product of all the elements
present in its arguments.
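
To see what this description means in practice, a small sketch is given here; the numbers are arbitrary and chosen only for illustration.

```r
#Product of the elements of a vector: 1*2*3*4*5
prod(1:5)
#[1] 120

#Product over several arguments: 2*3*4
prod(c(2, 3), 4)
#[1] 24
```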
Next, we seek help on reserved words (may be referred to as keywords), R
constants and data sets using the help() function as follows:
When you want to seek help on a data set, say the USArrests data set, then it
can be done by writing the following command:

#Seeking help on a data set
> help(USArrests)

If you want to seek help on R constants, say pi, LETTERS, letters,
month.abb and others, then it can be done by supplying the name of the
R constant as an argument to the help() function as follows:
#Seeking help on an R constant
> help(LETTERS)

In case you want to seek help on any R operators or control-flow
constructs such as +, -, [, [[, if, for and others, then it can be done by
writing these operators, symbols and reserved words in double quotes as
follows:
#Seeking help on an operator
> help("[[") #In double quotes

Note: Even if you are not connected to the internet, you can still access
R Documentation pages via help.
Next, we discuss the use of the help.search() function. This function is
particularly useful when we do not know the exact name of the function and
only recall a part of the name of the function, data set or keyword. This
function only accepts a character string as its argument. As an alternative to
this, we can also use a more convenient way of finding help, which is using
‘ ?? ’ in front of the part of the name of the function. For illustration purposes,
suppose we want to seek help on the rowMeans() function, but we can only
recall a part of it, rowMea, so we proceed to seek help in the following
manner:
#Seeking help
> ??rowMea

#An alternative approach
> help.search("rowMea") #In double quotes

When the help.search() command is executed, the search result
consisting of the details of the packages which contain functions having the
substring "rowMea" will be displayed. The same can be verified from the
following screenshot.
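
Since the search result opens in a separate window, a quick check at the console may also be helpful. The sketch below uses apropos(), a function from the utils package that is not part of this unit and is shown only as an assumption-labelled illustration; it lists the names of the objects on the search path that contain the given substring. The object name matches is illustrative.

```r
#Listing objects on the search path whose names contain "rowMea"
matches <- apropos("rowMea")
print(matches)
```

The list should include rowMeans, after which help(rowMeans) can be used to open its documentation page.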


SAQ 3
Write a command to get help on the if reserved word (used in conditional
statements).

1.5 SOME OTHER IMPORTANT ASPECTS OF R


In this section we shall discuss variables and objects, assignment operators,
incomplete commands, packages in R, R manuals and other important
aspects of R.
In any computer programming language, variables provide the way to access
the data stored in memory. Data stored in the memory cannot be directly
accessed; we need data structures to store and access the data. R provides a
number of data structures, referred to as objects, to assign (or save) and
access the stored data. R supports a number of objects, namely vectors,
matrices, arrays, lists, data frames, functions, expressions etc. In addition to
this, the data type of an R object (for vectors, matrices and arrays) can be
"numeric", "integer", "character", "logical", "complex" or
"factor". The vector, matrix and array objects consist of data of a
single type, say either numeric or character or any other type. But in lists and
data frames, different types of data can be saved under one name. The
function and expression objects (Session 2) are used in the MSTL-011 lab course.
The details of each of the objects are discussed in the coming units.
Note: R also supports a special type of object called NULL, which is used to
indicate that the object is absent. The NULL object has no type and is different
from a vector or list of zero length.
Furthermore, any name assigned to R objects or specific variables in R should
consist of only A–Z (capital letters), a–z (small letters), 0–9 (digits), ‘ . ’ (a
dot) and ‘ _ ’ (underscore). The name of a variable should always start with
a letter or a dot; it cannot start with a digit or an underscore. If it starts
with a dot, then the second character should not be a digit. There is
practically no limitation on the length of R names. The file names in R can be
any valid R name. Next, we discuss other important aspects one-by-one.
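
The naming rules stated above can be checked from within R. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) that uses the base function make.names(), which converts a string into a syntactically valid name. A string is taken to be a valid name when make.names() leaves it unchanged. The helper is_valid_name() and the candidate names are hypothetical and are used only for illustration.

```r
# Hypothetical helper: a string is a valid R name if make.names() leaves it unchanged
is_valid_name <- function(name) {
  name == make.names(name)
}

# Candidate names chosen only for illustration
candidates <- c("marks1", "my.data", "my_data", ".hidden",
                "1marks", "_marks", ".2x")

validity <- is_valid_name(candidates)
names(validity) <- candidates

# The first four names are valid; the last three start with a digit,
# an underscore, or a dot followed by a digit, and are therefore invalid
print(validity)
```

Names that fail this check can still be used if they are wrapped in backticks, but it is simpler to avoid them.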

1.5.1 Assignment Operator


Assignment operators are used to assign a value to a variable name. The
following three different assignment operators are available in R:
<- Left assignment operator
= Leftwards assignment operator
-> Right assignment operator
You can use any of these assignment operators to assign values or data
to any R objects and variables. For the illustration purpose, we now assign the
values 5, 10 and 15 to the x, y and z variables using all the three operators
and print them using the print() function as follows:
#Assigning the values, 5, 10 and 15 to x, y and z
> x <- 5
> y = 10
> 15 -> z

#Printing the assigned values
> print(x)
[1] 5
> print(y)
[1] 10
> print(z)
[1] 15

Note: (i) The print() function is used to print an R object and it is discussed
in detail in Unit 4 of the MST-015 course.
(ii) If an expression is evaluated in R, say x+y (which is 15), its value will be
lost unless it is assigned to some variable. So, if you want to reuse any value
further, it is better to assign it to some variable.
(iii) The two assignment operators, ‘ <- ’ and ‘ = ’, can be used
interchangeably for assignment at the command prompt. As the ‘ <- ’ operator
is quite convenient and preferred by many books, from this point onwards we
use this operator for variable assignment purpose in this course.
(iv) The ‘ <- ’ assignment operator consists of two characters, ‘ < ’ (less than)
and ‘ - ’ (minus), occurring strictly side-by-side. It should be remembered that
there should not be any blank space in-between the two characters.
(v) In some reference books you may find ‘ = ’ as the assignment operator. But
do not confuse the ‘ = ’ and ‘ == ’ operators. The first one is the assignment
operator and the second one is the relational operator.
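
To make points (ii) and (v) of the note concrete, the following short sketch (written as a script, so the ‘ > ’ prompt is omitted) first evaluates an expression without saving it, then saves it, and finally contrasts ‘ = ’ with ‘ == ’. The variable names total and is_equal are chosen only for this illustration.

```r
x <- 5
y <- 10

# The value of x + y is computed and shown, but it is not stored anywhere
x + y

# Assign the result to a variable so that it can be reused
total <- x + y

# '=' assigns a value, whereas '==' compares two values
z = 15
is_equal <- (total == z)

print(total)     # 15
print(is_equal)  # TRUE
```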

1.5.2 Writing Comments

Comments in any programming language play the following two very
important roles.
1. They help the user in explaining the R code to other people and make
the R code more readable.
2. There may be situations in which the user would like to prevent the
execution of certain parts of the code or executable statements (generally
while testing the code). Comments play a very important role in such
situations as well.
Comments in R start with ‘ # ’ and can be put anywhere in the programme.
When R code gets executed, R ignores everything from ‘ # ’ to the end of that
line (hence prevents its execution). For the illustration purpose, we now
create a variable named pincode to save the pin code of the head office of
IGNOU and use a comment next to the assignment to specify the location to
which it belongs as follows:
#Assigning pin code
> pincode <- 110068 #IGNOU headquarter pin code

#Printing pin code
> print(pincode)
[1] 110068

It is important to note that [1] is written at the beginning of the output. It
gives the position (index) of the first value printed on that line of the
output. It is generally useful when we have vectors of several elements, which
you will observe in coming units and lab sessions.
Furthermore, the statement written after ‘ # ’ is not executed. Similarly, ‘ # ’
can be used before any R command as follows:
#Preventing execution of an assignment statement using ‘ # ’
> #x <- 1
> x
Error: object 'x' not found

We get an error message as the assignment command is not evaluated and
therefore x is not found (provided that no object named x already exists in
the workspace).
Note: Throughout the MST-015 and MSTL-011 courses, the lines starting with
‘ # ’ written before the R commands are unexecuted comments, written to
give a clear understanding of the code part.

1.5.3 Commands Separator


Two or more commands in an R programme can be separated in the
following two ways:
1. Using the Enter key.
2. Using semi-colon ‘ ; ’
In the previous subsection, we have used the enter key to separate the R
commands. Now we show how semi-colon can be used as command separator.
See the following example.
#Assigning and printing the variables
> x <- 5; y = 10; 15 ->z
> print(x); print(y); print(z)
[1] 5
[1] 10
[1] 15

1.5.4 Incomplete Commands


If the R command written by the user is syntactically incomplete at the end of a
line, then R will by default give the ‘ + ’ prompt instead of the ‘ > ’ prompt on
the second and subsequent lines. This continues until the command is
syntactically completed by you. The ‘ + ’ prompt can be cancelled by pressing
the Esc key on the keyboard.
Consider the following example, in which the written statement is not
complete. The right parenthesis is missing. Until the parenthesis is added to
complete the statement, we will keep getting the ‘ + ’ prompt.
#Illustrating the incomplete command
> (4+6-1
+ )
[1] 9

1.5.5 Listing and Removing R objects


The R objects, which are currently available in the R workspace can be listed
using ls() function. For example, after assigning the variables x, y and z
now we use this function to get the list of used objects as follows:

There may be situations when we would like to remove some specific or may
be all objects used in R workspace. This can be achieved using the rm()
function. Note that, all the objects from the work space can be removed using
the following command:
#To remove all the objects available to use in the workspace.
> rm(list = ls())

To remove specific objects, say objects named x and y, we supply their
names as arguments to the rm() function as follows:
#Removing x and y
> rm(x, y)

Next, we list the remaining objects (leftovers).


#Listing all objects in the workspace
> ls()
[1] "z"

Hence, using the rm() function, the x and y objects are now removed
successfully and only the z object is left.
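
The listing and removal steps described in this subsection can be tried without disturbing your own workspace. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) that works inside a separate environment; the environment name demo_env is hypothetical, and the envir argument of ls() and rm() is used only to keep the example self-contained.

```r
# A separate environment is used so that your own workspace is not touched
demo_env <- new.env()
assign("x", 5, envir = demo_env)
assign("y", 10, envir = demo_env)
assign("z", 15, envir = demo_env)

# List the objects, remove x and y by name, then list again
before <- ls(envir = demo_env)
rm(list = c("x", "y"), envir = demo_env)
after <- ls(envir = demo_env)

print(before)  # "x" "y" "z"
print(after)   # "z"
```

On the R console, the same steps are simply ls(), rm(x, y) and ls() again, as shown above.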

1.5.6 Recalling Previous Commands

We can recall the R commands using the up, down, right and left arrow keys
as follows:
1. The vertical arrow keys (↑ and ↓) can be used to scroll backward and
forward through the command history to locate a particular command.
2. After locating the command, the horizontal right and left arrows (→ and
←) can be used to move the cursor within the command for editing
purpose.
It should be noted that a command can be edited either by deleting characters
with the DEL key or by adding other characters.

1.5.7 Listing, Installing and Removing R Packages


A number of packages are supplied as part of the base R distribution. Several
of them (for example, base, utils and stats) are loaded automatically when R
starts, so you do not need to load them (or call them) before using any
function or data set available in these packages. Generally, all R functions
and data sets come as part of R packages. The functions and data sets of any
other package will be available to use only after that package is loaded.
Sometimes packages have dependencies, which means that the installation of
a specific package needs the installation of other packages first. No specific
commands need to be given to download dependencies. They will be
downloaded automatically when the package installation command is executed.
Next, we discuss the method of installing a new package in your R software. If
you are connected to the internet, then the package installation task can be
completed by using the install.packages() command. In the parentheses
‘ ( ) ’ of this function, we should write the name of the package which we want
to install as a character string (in double quotes).
For the illustration purpose, we explain the method of installing the MASS
package. To do so, we should write "MASS" as a function argument to the
install.packages() function as follows:
#Installing a package named MASS
> install.packages("MASS")

Alternatively, we can also use the R menu bar to install a package. To do so,
we use the following path:
Go to menu bar → click on Packages → click on Install package(s) → double-
click to select a CRAN mirror for use in this session (a place close to your place)
→ double-click on the package which you want to install.
The packages installed in your R software can be viewed using the
installed.packages() function. We can also see the available packages
from the menu bar of the RGui as follows:
Go to menu bar → click on Packages → click on Load package → select the
package which is to be loaded from the list.

Alternatively, an already installed package can be loaded using the
library() or require() function. Now we illustrate the method of loading
a package named MASS (which is already installed) using these two functions:
#Loading the MASS package
> library(MASS)
Or,
#Loading the MASS package
> require(MASS)

Note: (i) Any data set or function available in a specific package can also be
accessed using the double colon ‘ :: ’ operator. For example, the ships data
set available in the MASS package can be accessed as MASS::ships.
(ii) The packages currently loaded in your session can be listed using the
search() function.
Moreover, we can remove an installed package from the library (where packages
are stored) using the remove.packages() function, which is available in the
utils package.
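
The functions discussed in this subsection can be combined into a small check that runs before a package is used. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted); it relies on the base function requireNamespace(), which is not covered in this unit, and the variable names are chosen only for illustration. The sketch does not install anything by itself, since installation needs an internet connection.

```r
# Check whether MASS is installed (this does not attach the package)
mass_installed <- requireNamespace("MASS", quietly = TRUE)

if (mass_installed) {
  # Access one data set with '::' without attaching the whole package
  ships_rows <- nrow(MASS::ships)
  print(ships_rows)
} else {
  # Only a hint is printed here; run the command yourself when online
  message('MASS is not installed; run install.packages("MASS") first')
}

# Attached packages appear as "package:<name>" in the output of search()
attached <- search()
print("package:base" %in% attached)
```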

1.5.8 The typeof() and class() Functions


The type of an R object can be seen using the typeof() function. It determines
the storage type or R internal type of an R object. The class() function is
used to get the class of an R object. It is especially useful in the object-oriented
style of programming. It is sometimes necessary to check the class of an
encountered object, just to check whether it can be used as an argument to a
particular function or not. Many functions accept a particular type of objects
only. These two functions come as part of the base package.
To have more clarity on the typeof() and class() functions, let us see the
class and type of the different types of objects discussed in the upcoming
units to have a bird's-eye view:
Object             Class                Type or Storage mode
numeric vector     "numeric"            "double"
integer vector     "integer"            "integer"
character vector   "character"          "character"
logical vector     "logical"            "logical"
matrix             "matrix" "array"     Varies according to the type of the
                                        matrix elements. It can be "integer",
                                        "double", "logical" or "character"
list               "list"               "list"
data frame         "data.frame"         "list"
function           "function"           "closure"
expression         "expression"         "expression"

Note: The mode() function gives information about the mode of an object in
the sense of Becker et al. (1988). It is compatible with other implementations
of the S language. The storage.mode() function (same as typeof()) is
more useful as compared to the mode() function, as it returns the storage mode
of its argument in the sense of Becker et al. (1988).
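
The entries of the table above can be reproduced in R. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) that applies class(), typeof() and mode() to one small example of each kind of object. The object names and values are made up for illustration, and sapply() and data.frame() are used here only to arrange the results in one table; both are discussed in later units.

```r
# One small, made-up example of each kind of object
sample_objects <- list(
  numeric_vector   = c(0.5, 1.5),
  integer_vector   = c(1L, 2L),
  character_vector = c("a", "b"),
  logical_vector   = c(TRUE, FALSE),
  integer_matrix   = matrix(1:4, nrow = 2),
  a_list           = list(1, "a"),
  data_frame       = data.frame(id = 1:2),
  a_function       = function(v) v + 1,
  an_expression    = expression(a + b)
)

# Collect class, storage type and mode of every object in one table
type_summary <- data.frame(
  class  = sapply(sample_objects, function(obj) paste(class(obj), collapse = ", ")),
  typeof = sapply(sample_objects, typeof),
  mode   = sapply(sample_objects, mode)
)
print(type_summary)
```

Note that mode() reports "numeric" for both double and integer vectors, whereas typeof() distinguishes between them.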

1.5.9 R Manuals
There are several manuals available on R language written by R core team,
which can be accessed from the menu bar of the R software as follows:
Go to menu bar → click on Help → click on Manuals (in PDF)

1 Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), The New S Language.
Chapman & Hall, New York. This book is often called the “Blue Book”.

So, the first manual is “An Introduction to R”, which will give you an
introduction to the R language, its objects, data types, function and other
important information. Each manual consists of some important aspects of the
R language, which can be accessed according to the requirement of the
learner.
The Manuals on R can also be accessed from the CRAN page using the
following link:
https://cran.r-project.org/manuals.html

Note: In addition to manuals, the menu bar and CRAN page can also be
used to read the “FAQ on R” and “FAQ on Windows” (the latter is listed
because this course material was prepared on the Windows operating system).
Here, FAQ stands for Frequently Asked Questions. Note that R has three
collections of answers to FAQs, which can be accessed using the following link:
https://cran.r-project.org/faqs.html


1.5.10 Commands History


The history() function available in the utils package can be used to get the
commands history. By default, it shows only the last 25 commands from the
command history, as the default value of its max.show argument is 25. If you
wish to see all the previous commands, then you can set this argument to Inf
(reserved word for infinity) as follows:
#Accessing history
> history(max.show = Inf)

Additionally, there are functions such as savehistory() and
loadhistory() available in the same package, which can be used to save
and load the commands history. For the illustration purpose, a file named
filename with the extension .Rhistory can be saved and loaded as follows:
#To save the commands history
> savehistory(file = "filename.Rhistory")

#To load the commands history
> loadhistory(file = "filename.Rhistory")

SAQ 4
Let us suppose that when you run the ls() function command, you get the
following objects in your working environment. Write a command to remove
the data and Name objects:
> ls()
[1] "A" "data" "Name" "x" "xy"

Note: Learners are advised to visit and read the CRAN page carefully to get R
history and other important details.

1.6 SUMMARY

The main points discussed in this unit are as follows:
We have discussed the method of downloading, installing, running and
quitting R.
Methods of taking help on R reserved words, functions, data sets and R
constants are discussed.
Case sensitivity of the language and the way of accessing the contributions
of the R core team are discussed in this unit.
Other important aspects such as assignment operators, way of writing a
comment, editing a written command, R packages etc, are also
discussed in this unit.
Points to remember when working on R console:
The enter key is used to run or evaluate a typed command (after prompt
‘ > ’) on R console.
Semi-colon (‘ ; ’) is the command separator.

Commands are grouped together using braces ( { } ).


The Up-arrow key ( ↑ ) is used to recall previously used commands and
the down-arrow (↓) key is used to get back.
The Escape key (Esc) is used to cancel a command. It is generally used
as a saviour from incomplete commands.
Getting ‘ + ’ sign after pressing the enter key indicates that the command
or a statement is not complete.
Comments in R start with the hash mark (‘ # ’).
The following operators serve as assignment operators:
‘ <- ’, ‘ -> ’ and ‘ = ’
For the development of this course we have used ‘ <- ’ as the
assignment operator.

1.7 TERMINAL QUESTIONS


1. State whether the following statements are TRUE or FALSE:
(i) Once a package is installed in your R Software, it cannot be
uninstalled.
(ii) The class() and typeof() functions are used for the same purpose.
(iii) “ == ” is an assignment operator.
(iv) The Escape key is used to cancel an incomplete command.
2. Fill in the blanks:
(i) The typeof() function determines the ….. of any object.
(ii) An installed package can be removed using the ………function.
(iii) Comments in R start with ….. .
3. How can we access the R manuals?
4. Write the purpose for which the citation() and contributors()
functions are used.
5. How can we load an installed package in R?

1.8 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The main difference between interpreted languages and compiled
languages is that an interpreted language converts the commands (source
code) into machine code line by line. This means that a single command
can be run in an interpreted language, whereas in a compiled language you
need to write the entire program first, and then the entire source code (as
a program) is compiled and run as a whole (by source code we mean a set
of commands written in any programming language).
The C programming language is an example of a compiled language and
R is an example of an interpreted language.
2. The citation details on the lattice package can be obtained using the
following command:
citation("lattice")
3. Help on the reserved word if can be obtained using the following
command:
help("if")
4. We can use the rm() function to remove the objects named data and
Name as follows:
rm(data, Name)
Terminal Questions (TQs)
1. (i) FALSE
(ii) FALSE
(iii) FALSE
(iv) TRUE
2. (i) R internal type or storage mode
(ii) remove.packages()
(iii) #
3. Refer Subsection 1.5.9.
4. Refer Section 1.3.
5. To load a package named pack, we use the library() and
require() functions as follows:
library(pack)
require(pack) #Alternatively

UNIT 2
NITTY-GRITTY OF R

Structure

2.1 Introduction
    Expected Learning Outcomes
2.2 Vectors
    Numeric and Integer Vectors
    Logical Vectors
    Character Vectors
    Elements Extraction from a Vector
    Appending Elements to a Vector
2.3 Arithmetic Operations with Scalars and Data Vectors
2.4 Matrices
    Matrix Addition, Subtraction and Multiplication
    Extraction of Subvectors and Submatrices
    Matrix Functions
2.5 Arrays
    Extraction of Subsections of an Array
2.6 Factors
2.7 Missing Values
2.8 Relational and Logical Operators
2.9 Summary
2.10 Terminal Questions
2.11 Solutions/Answers

2.1 INTRODUCTION

In Unit 1 of MST-015 (Introduction to R Software) course, you have learnt the
installation procedure of R Software, taking help on built-in data sets,
constants, reserved words and functions using help(), ‘ ? ’ and ‘ ?? ’; and
some important fundamental aspects of R. In this unit, we shall make you
familiar with the nitty-gritty of R, such as, how to create vectors, matrices,
arrays and factors in R. Additionally, we shall discuss the vector operations,
matrix operations, logical operators and relational operators. Further, we shall
discuss the extraction of elements of vectors. In addition to this, we shall
discuss the extraction of sub-vectors and sub-matrices from matrices and
arrays. Furthermore, we shall illustrate the method of handling the missing
values in R. There are two types of missing values in R. The first one is NA, the
values which are not available. The second type of missing value is NaN, the
values which are not numbers.
Before studying this unit of Block 1, we expect that you have studied Unit 1 of
MST-015 thoroughly.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes


After completing this unit, you should be able to:
create vectors and perform vector operations;
create matrices and perform matrix operations;
create arrays and factors;
handle missing values;
extract elements, drop elements and append elements of a vector;
extract sub-vectors and sub-matrices from matrices and arrays; and
learn the usage of logical and relational operators in R.

2.2 VECTORS
In this section, we first define a vector and then discuss some of the commonly
used R functions on vectors. So, let us define a vector: ‘It is the basic type of
data structure/object in R, which is a sequence of elements of the same
class’.
In R there are six types of vectors, namely, numeric, integer, character, logical,
complex and raw. Next, the question arises, ‘How is it created in R?’, and the
answer is that vectors in R can be created by several methods. We now discuss
the most commonly used methods to create different types of vectors.

2.2.1 Numeric and Integer Vectors

One of the simplest methods of creating a vector is using the concatenation
function, i.e., c(). This function creates a vector by concatenating the
elements or vector objects together.
For the illustration purpose, let us create a vector with elements 0.4, 0.6, -0.8
and 22.7 using the c() function as follows:
#Creating a numeric vector
> c(0.4, 0.6, -0.8, 22.7)
[1] 0.4 0.6 -0.8 22.7
Note: All the elements of this vector are of the same type, are given to one
decimal place and are separated by commas. That is why, in the output, all the
elements are printed to one decimal place. Also, this vector is of numeric
type as all of its elements are of real/decimal type, and it is called a numeric
vector. Additionally, note that numeric vectors are stored as double precision
real numbers.
Furthermore, note that the c() function can also be used to concatenate two
or more vectors or elements. For the illustration purpose, we next create a
vector with elements c(0.4,0.6), c(-0.8,22.7) and 12.3 using the c()
function as follows:
#Concatenating two vectors with a numeric element
> c(c(0.4, 0.6), c(-0.8, 22.7), 12.3)
[1] 0.4 0.6 -0.8 22.7 12.3

In both the created vectors the elements were of the same precision. Let us
now create a numeric vector using the c() function, whose elements are of
different precision, with elements 0.13, 0.3102, -0.110002 and 13.1.
#Creating a numeric vector with different precision elements
> c(0.13, 0.3102, -0.110002, 13.1)
[1] 0.130000 0.310200 -0.110002 13.100000

From this output it is clear that, by default, all the elements of a numeric
vector are printed with the same number of decimal places as the highest
precision element.
If you are interested in saving the recently created vector with the name x,
then it can be done by assigning the vector to x using the assignment operator
‘ <- ’, which is already discussed in Unit 1 of MST-015, as follows:
#Assigning a vector to x
> x <- c(0.13, 0.3102, -0.110002, 13.1)

After assigning the vector, we now check whether the vector is successfully
assigned to x or not, by printing x. For printing, either we can use the
print() function or simply can write the name of the created vector as
follows:
#Explicit printing
> print(x)
[1] 0.130000 0.310200 -0.110002 13.100000

#Auto printing
> x
[1] 0.130000 0.310200 -0.110002 13.100000

Note that, when the print() function is used to print any R object, the process
of printing is called explicit printing, and if we only write the name of an object
for printing, this process is called auto printing. For more details on printing
functions, you can refer to Unit 4 of MST-015. Next, we check the class()
and typeof() of a numeric vector as follows:
#Checking the class and type of a numeric vector
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"

Thus, the class of a numeric vector is "numeric" and type is "double".


Before we start discussing some of the commonly used functions on vectors, it
is important for you to know what function arguments are. So, we first discuss
the function arguments and then discuss the commonly used vector functions.
Function arguments:
The function arguments are the variables or information used to compute a
function and are written in the parentheses after the name of the function. So,
the print(x) function command has only one function argument, which is x,
written in parentheses.
Next, we discuss two commonly used vector functions, namely, round() and
length(). The round() function is used to round its x argument to the
number of decimal places specified by the digits argument. Let us first see
the internal structure of this function using the str() function available in the
utils package as follows:

#Internal structure
> str(round)
function (x, digits = 0)

Thus, by using the round() function, an object named x is rounded to
the number of decimal places specified by digits, whose default value is 0.
Note: The str() function is used to compactly display the internal structure
of an R object. It can be considered as an alternative to the summary() function
(which will be discussed later).
Let us now round off the earlier created vector x to 2 and 1 decimal places as
follows:
#Rounding x till 2 decimal places
> round(x, 2)
[1] 0.13 0.31 -0.11 13.10

#Rounding x till 1 decimal place
> round(x, 1)
[1] 0.1 0.3 -0.1 13.1
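
The digits argument can also be left out or given a negative value. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) of these two cases; the value 1234.567 and the variable names are chosen only for illustration.

```r
x <- c(0.13, 0.3102, -0.110002, 13.1)

# digits is not supplied, so its default value 0 is used
rounded_default <- round(x)

# A negative value of digits rounds to the left of the decimal point
rounded_tens <- round(1234.567, digits = -1)   # nearest multiple of 10

print(rounded_default)
print(rounded_tens)   # 1230
```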

The second function length() is used to get the number of elements in a
vector x and its internal structure is as follows:
#Internal structure
> str(length)
function (x)

Let us get the length of the earlier created vector x using the length()
function as follows:
#Getting the length of the vector x
> length(x)
[1] 4

As x vector has 4 elements, therefore, we get the length of x as 4.


Note: Using the length() function, we can also set the length of a vector.
For the illustration purpose, consider the following examples in which we first
create two arbitrary vectors named y and z of lengths 4 and 3, respectively.
Then set their lengths using the length() function as follows:
#Setting the smaller length
> y <- c(2, 4, 6, 8)
> length(y) <- 3; y
[1] 2 4 6

#Setting the larger length
> z <- c(1, 3, 5)
> length(z) <- 5; z
[1] 1 3 5 NA NA

From these outputs, it is clear that the length() function also counts NA
(missing) values. Also, the length can be set as smaller or larger than the
original size of the vector. More details on the length() function can be seen
in Unit 3 of MST-015. Next, we discuss integer vectors.
An integer vector in R can be created in several ways. The most popular way
is by appending L at the end of each element of the vector. Consider the
following example for illustration purpose, in which we check the class and
type of a vector whose elements are written by appending L.
#Checking the class of a vector
> class(c(1L, 2L))
[1] "integer"

#Checking the type of a vector
> typeof(c(1L, 2L))
[1] "integer"

From this output it is clear that, c(1L, 2L) is an integer vector. Let us next
see, what happens, if we do not append L at the end of each element of it.
#Checking the class of a vector
> class(c(1, 2))
[1] "numeric"
> class(c(1, 2L))
[1] "numeric"

#Checking the type of a vector
> typeof(c(1, 2))
[1] "double"
> typeof(c(1, 2L))
[1] "double"

Hence, if we do not append L at the end of each element, then the created
vector will be of numeric type.
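
A numeric (double) vector can also be converted to an integer vector after it has been created. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) that uses the base functions as.integer(), is.integer() and is.numeric(); coercion functions of this kind are taken up in more detail in Unit 3. The variable names are chosen only for illustration.

```r
v_double  <- c(1, 2)                 # stored as double
v_integer <- c(1L, 2L)               # stored as integer
v_coerced <- as.integer(v_double)    # explicit conversion to integer

print(is.integer(v_double))              # FALSE
print(is.integer(v_integer))             # TRUE
print(identical(v_coerced, v_integer))   # TRUE

# Both kinds of vector are numeric in the sense of is.numeric()
print(is.numeric(v_double) && is.numeric(v_integer))   # TRUE
```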
An integer vector can also be created using the colon ‘ : ’ operator. So, if you
want to generate a vector using the colon operator, then you need to specify
the first and the last values of the sequence or vector which you intend
to create.
#Generating a sequence from 0 to 10
> 0:10
[1] 0 1 2 3 4 5 6 7 8 9 10

#Generating a sequence from -7 to 13
> -7:13
[1] -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13

Note that, if the first value is smaller than the last value of the sequence, then
the generated sequence will be an increasing sequence and if the first value is
larger than the last value, then the generated sequence will be a decreasing
sequence. Moreover, both increasing or decreasing sequence will be
generated in steps of 1 and the starting and ending values will be separated
by a colon. Consider a few more examples for understanding purpose.
#Generating a decreasing sequence from 6 to -6
> 6:-6
[1] 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6

#Generating an increasing sequence from 0.6 to 8.6
> 0.6:8.6
[1] 0.6 1.6 2.6 3.6 4.6 5.6 6.6 7.6 8.6

Note that whether the sequence is increasing or decreasing, the values are
increased or decreased by one. Thus, whenever a vector is created using the
colon ‘ : ’ operator, the step of the generated sequence will always be 1. If
you want the step of the generated sequence to be other than 1, then you
should use some other method to create the vector. One of the commonly
used methods is the seq() function.
#The seq() function
seq(from, #starting value of the sequence
to, #last value of the sequence
by, #steps or increment/decrement
length, #desired length of the sequence
along, #a vector whose length is to be used
...) #other arguments

It should be noted here that, if the starting value assigned to the from
argument is smaller (larger) than the last value assigned to the to argument,
then a positive (negative) value should be assigned to the by argument of the
seq() function. For example:
#Generating a sequence from 1 to 2 with an increment of 0.1
> seq(from=1, to=2, by=0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
#Generating a sequence from 5 to 1 with a decrement of 2
> seq(from=5, to=1, by=-2)
[1] 5 3 1
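
If the sign of the by argument does not agree with the direction from the from value to the to value, seq() stops with an error. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted) of this behaviour; tryCatch() is used here only to capture the error message instead of stopping the script, and the variable names are chosen for illustration.

```r
# Sign of 'by' agrees with the direction of the sequence
increasing <- seq(from = 1, to = 2, by = 0.25)
decreasing <- seq(from = 5, to = 1, by = -2)

# Sign of 'by' disagrees with the direction: the error is captured as text
wrong_sign <- tryCatch(
  seq(from = 1, to = 2, by = -0.1),
  error = function(e) conditionMessage(e)
)

print(increasing)
print(decreasing)
print(wrong_sign)   # the error message reported by seq()
```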

Next, we illustrate the use of the length and along arguments of the seq()
function, which in some situations may be helpful for us. (Here length and
along are short forms of the full argument names length.out and along.with,
which R accepts through partial matching of argument names.) For example,
there may be situations in which we want to generate a sequence of length
equal to the length of an existing vector. In such cases, the length and along
arguments can be used efficiently. For the illustration purpose, let us generate
a sequence of length 5 starting from 2 and with an increment of 0.2 using the
seq() function as follows:
#Generating a sequence using length argument
> seq(from=2, by=0.2, length=5)
[1] 2.0 2.2 2.4 2.6 2.8

Next, we illustrate the use of the along argument of the seq() function.
Consider a vector x with elements 10, 22, 14, 40, 98 and 11. Clearly, the
length of x is 6. In case, if we want to generate a sequence of same length,
i.e., 6, starting from 1 with an increment of 0.1, then we can use the along
argument of the seq() function as follows:
#Generating a sequence using the along argument
> x <- c(10, 22, 14, 40, 98, 11)
> seq(from=1, by=0.1, along=x)
[1] 1.0 1.1 1.2 1.3 1.4 1.5
From this output it is clear that the created vector is of length 6, which is same
as the length of x.
Note: If the length or along argument is specified in seq() function, then
we do not need to assign the to argument of the function.
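
The two sequences above can also be generated with the full argument names length.out and along.with, of which length and along are abbreviations. The following is a minimal sketch (written as a script, so the ‘ > ’ prompt is omitted); it additionally shows the base helper functions seq_len() and seq_along(), which are not discussed in this unit, and the variable names are chosen only for illustration.

```r
x <- c(10, 22, 14, 40, 98, 11)

# Full argument names instead of the abbreviations 'length' and 'along'
by_length <- seq(from = 2, by = 0.2, length.out = 5)
by_along  <- seq(from = 1, by = 0.1, along.with = x)

# Integer index sequences: 1 to n, and 1 to length(x)
first_four <- seq_len(4)
idx <- seq_along(x)

print(by_length)
print(by_along)
print(first_four)   # 1 2 3 4
print(idx)          # 1 2 3 4 5 6
```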
Furthermore, we often want to create special type of vectors, whose elements
are repeats of some specific number(s) or character(s). In such situations, the
rep() function available in the base package can be used efficiently. The
main arguments of interest of the rep() function are as follows:
#The rep() function
rep(x, #a vector which is to be replicated
times, #number of times elements are to be repeated
each, #the repetition of each element of x
length) #length of the output vector

In the rep() function, the x argument is a vector whose elements are to be
replicated. The times argument is an integer vector, used to specify the
number of times each element of x is to be repeated. The each argument of
the function is an integer, which is used to specify a fixed number of
repetitions of each element of x. Lastly, the length argument is used to
specify the desired length of the output vector. For the illustration purpose,
consider the following:
Let us generate a vector of 4’s, in which 4 is to be repeated 10 times,
using the rep() function as follows:
#Replicating x ten number of times
> rep(x=4, times=10) # rep(x=4, length=10)
[1] 4 4 4 4 4 4 4 4 4 4

Consider another example in which each element of x is repeated the
number of times specified by the corresponding element of the times argument
of the rep() function. For example:
#Replicating x using times argument
> rep(x=1:3, times=c(1,2,3))
[1] 1 2 2 3 3 3

Next, we create a vector in such a way that each element of the vector x is
repeated 5 times. This can be achieved by using the each
argument of the rep() function as follows:
#Replicating x using each argument
> rep(x=1:3, each=5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
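The arguments of rep() can also be combined. The sketch below is only an illustration with values of our own choosing: it uses each together with times, and the full argument name length.out (which the shorter name length in the listing above is matched to).

```r
# each is applied first, then the whole result is repeated 'times' times
r1 <- rep(x = 1:3, each = 2, times = 2)

# length.out recycles x until the requested output length is reached
r2 <- rep(x = 1:3, length.out = 7)

print(r1)  # 1 1 2 2 3 3 1 1 2 2 3 3
print(r2)  # 1 2 3 1 2 3 1
```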
Next, we discuss the method of creating a zero vector of a specific length, and
of numeric and integer type using the vector() function. To create it, we
mainly use two arguments of this function, namely, mode and length. To
create a numeric zero vector of length eleven, we assign the mode argument
as “numeric” and the length argument as 11 in the vector() function as
follows:
#Creating a numeric vector of zeros
> vector(mode = "numeric", length = 11)
[1] 0 0 0 0 0 0 0 0 0 0 0

Similarly, we can create an integer zero vector of length five by assigning the
mode argument as "integer" and the length argument as 5 in the
following manner:
#Creating an integer vector of zeros
> vector(mode = "integer", length = 5)
[1] 0 0 0 0 0

Note: The mode can also be "logical" or "character". For more
details, see the help page of this function.
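To see what the other modes produce, the following sketch may help. The lengths are arbitrary choices of ours, and numeric() is a base R shortcut that we assume here to be interchangeable with vector(mode = "numeric", ...).

```r
# Shortcut for vector(mode = "numeric", length = 3)
n0 <- numeric(3)

# A logical vector is initialised with FALSE
l0 <- vector(mode = "logical", length = 3)

# A character vector is initialised with empty strings
c0 <- vector(mode = "character", length = 2)

print(n0)
print(l0)
print(c0)
```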

2.2.2 Logical Vectors


A vector all of whose elements are either TRUE or FALSE is known as a
logical vector. The elements of a logical vector can also be written as T and F
instead of TRUE and FALSE, but this practice is not recommended because T and F
are not reserved words and can easily be overwritten by users. We now give
some examples of logical vectors.
A logical vector with elements TRUE, TRUE, FALSE, FALSE and TRUE can
be created using the c() function as follows:
#Creating a logical vector
> c(TRUE, TRUE, FALSE, FALSE, TRUE)
[1] TRUE TRUE FALSE FALSE TRUE

#Alternative method
> c(T, T, F, F, T)
[1] TRUE TRUE FALSE FALSE TRUE
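The risk of using T and F can be seen in a short sketch. Here we deliberately overwrite T (something that should not be done in practice) to show how the meaning of c(T, TRUE) changes, and then remove our variable again.

```r
# T is an ordinary variable name, so it can be reassigned (not recommended)
T <- FALSE
v1 <- c(T, TRUE)   # now FALSE TRUE, although it reads like TRUE TRUE

# Remove our variable; T refers to the built-in value TRUE again
rm(T)
v2 <- c(T, TRUE)   # TRUE TRUE

print(v1)
print(v2)
```

The reserved words TRUE and FALSE cannot be reassigned in this way, which is why they are the safer choice.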
Next, we create a logical vector with elements TRUE, TRUE, FALSE, FALSE
and FALSE using the c() and rep() functions as follows:
#Creating a logical vector
> c(rep(TRUE, 2), rep(FALSE,3))
[1] TRUE TRUE FALSE FALSE FALSE
Lastly, we check the class and type of a logical vector c(TRUE,FALSE) as
follows:
#Checking the class and type
> class(c(TRUE, FALSE))
[1] "logical"
> typeof(c(TRUE, FALSE))
[1] "logical"

2.2.3 Character Vectors


A character vector in R can be created by concatenating characters or
character strings. Note that a character string is a
sequence of characters delimited by either double quotes ( " " ) or single
quotes ( ' ' ). Also, each element of a character vector is printed within
double quotes by default.
For illustration, we create a character vector of week days using
the c() function as follows:
#Creating a character vector using double quotes
> c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday")
[1] "Sunday" "Monday" "Tuesday" "Wednesday"
[5] "Thursday" "Friday" "Saturday"

The same vector can also be created using single quotes as follows:
#Creating a character vector using single quotes
> c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday')
[1] "Sunday" "Monday" "Tuesday" "Wednesday"
[5] "Thursday" "Friday" "Saturday"

Note that, by default, the output is printed in double quotes. Consider another
example, in which we create a character vector with elements ‘A a 1’, ‘B b 2’,
‘C c 3’ and ‘D d 4’ as follows:
#Creating a character vector
> c('A a 1','B b 2','C c 3', 'D d 4')
[1] "A a 1" "B b 2" "C c 3" "D d 4"

In the next example, we use the c() and rep() functions to create a character
vector with elements AB+, AB+, O+, O+ and O+.
#Creating a character vector
> c(rep('AB+',2), rep('O+',3))
[1] "AB+" "AB+" "O+" "O+" "O+"

Lastly, we check the class and type of a character vector as follows:
#Checking the class and type
> class(c(rep('AB+',2), rep('O+',3)))
[1] "character"

> typeof(c(rep('AB+',2), rep('O+',3)))


[1] "character"

Hence, the class and type of a character vector is "character".


Note: Consider the following:
(i) There are R constants such as LETTERS, letters, month.abb and
month.name consisting of character elements. Interested learners can
view them by simply typing the name of the constant on the R console.
(ii) A complex vector is a vector whose elements are of complex type, and a
raw vector is a vector used to represent a raw sequence of
bytes. These two types of vectors are not discussed here.
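As a quick illustration of the constants mentioned in point (i), the sketch below picks a few elements from each of them. The positions chosen are arbitrary and only meant for demonstration.

```r
# First five lower-case letters
first5 <- letters[1:5]

# Last three upper-case letters
last3 <- LETTERS[24:26]

# Abbreviated names of the first and last months
m_abb <- month.abb[c(1, 12)]

# Full name of the second month
m_name <- month.name[2]

print(first5)  # "a" "b" "c" "d" "e"
print(last3)   # "X" "Y" "Z"
print(m_abb)   # "Jan" "Dec"
print(m_name)  # "February"
```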

2.2.4 Elements Extraction from a Vector


We may encounter situations where we need to extract the element(s) at
particular position(s) of a vector (of any type). This can be done
by writing the position number (also known as the index) of the element(s) to
be extracted in brackets ‘ […] ’ after the name of the vector. For
illustration, let us recall the earlier created vector x.

#Creating a vector x
> x <- c(0.13, 0.3102, -0.110002, 13.1)

Then the positions of the elements of x can be viewed as follows:


Position/Index   1      2        3           4
Element          x[1]   x[2]     x[3]        x[4]
Value            0.13   0.3102   -0.110002   13.1

Note: In R, the position numbers of the elements start from 1. Index 0
does not refer to any element: for example, x[0] returns an empty vector,
numeric(0).
We next illustrate the method of extracting a single element, say the 3rd
element, from x and the way of extracting two elements together, say the 2nd
and 4th elements, from x in a single command (this process is called indexing)
as follows:
#Extracting 3rd positioned element from x
> x[3]
[1] -0.110002

#Extracting 2nd and 4th positioned elements together from x


> x[c(2,4)]
[1] 0.3102 13.1000
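It is also worth knowing what R returns when an index lies outside the positions 1 to length(x). The following sketch, using the same vector x, shows the result for index 0 and for a position beyond the end of the vector; the position 9 is an arbitrary choice.

```r
x <- c(0.13, 0.3102, -0.110002, 13.1)

# Index 0 selects nothing: the result is a numeric vector of length zero
e0 <- x[0]

# A position beyond the length of x gives NA (not available)
e9 <- x[9]

print(e0)
print(e9)
print(length(x))  # x itself is unchanged: 4
```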

Next, we extract elements from the constant LETTERS, consisting of the 26 upper
case letters of the Roman alphabet. Let us first view the constant LETTERS as
follows:
#Viewing the constant LETTERS
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Next, we discuss the method of extracting the 1st element, then the 1st and 5th
elements together, and thereafter the 2nd to 4th elements together as follows:
#Extracting 1st element of LETTERS
> LETTERS[1]
[1] "A"
#Extracting 1st and 5th elements of LETTERS
> LETTERS[c(1,5)]
[1] "A" "E"
#Extracting 2nd to 4th elements of LETTERS
> LETTERS[2:4]
[1] "B" "C" "D"

Most importantly, note that a negative sign with position number in the
brackets after the name of the vector is used to drop particular positioned
element(s) as follows:
#Dropping 2nd element of x
> x <- c(0.13, 0.3102, -0.110002, 13.1)
> x[-2]
[1] 0.130000 -0.110002 13.100000
In the next subsection, we shall discuss the method of appending element(s)
to an already created vector.

2.2.5 Appending Elements to a Vector


Recall that the x vector created earlier is of length 4. Now, we illustrate the
method of appending a 7th element to x. In that case, the 5th and 6th elements
are filled with NA (not available), as no values are assigned to them.
#Recalling x
> x <- c(0.13, 0.3102, -0.110002, 13.1)

#Appending 7th element
> x[7] <- 1
> print(x)
[1] 0.130000 0.310200 -0.110002 13.100000 NA
[6] NA 1.000000

Note that the elements in a vector can also be appended using the append()
function available in base package. For the illustration purpose, we append
values 1 and 2 after the 4th element as follows:
#Appending values 1 and 2 after the 4th element
> x <- c(0.13, 0.3102, -0.110002, 13.1)
> append(x, values=c(1,2), after = 4)
[1] 0.130000 0.310200 -0.110002 13.100000 1.000000
[6] 2.000000
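One point that is easy to miss is that append() returns a new vector and leaves x itself unchanged unless the result is assigned. The sketch below illustrates this; the names y and x2 and the inserted values are our own choices for the illustration.

```r
x <- c(0.13, 0.3102, -0.110002, 13.1)

# Insert 1 and 2 after the 2nd element; the result is stored in y
y <- append(x, values = c(1, 2), after = 2)

print(length(x))  # x is unchanged: 4
print(y)

# Appending at the end can also be done with c()
x2 <- c(x, 5)
print(length(x2)) # 5
```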

SAQ 1
Write the output of the following statements:
x <- c(0.2, c(0.1, -1.21), c(0.2, 1.3, 1))
(i) print(x[c(2,5)])
(ii) print(x[-5])
(iii) class(x)
(iv) append(x, values=2, after=5)
(v) seq(from=1, to=2, along=x)
(vi) x[c(-2, -5)]
(vii) x[1:5]

2.3 ARITHMETIC OPERATIONS WITH SCALARS AND DATA VECTORS
In R programming, a scalar is simply a vector which consists of exactly one
element. In this section, we first illustrate arithmetic operations between
scalars, thereafter illustrate arithmetic operations between vectors.
The most commonly used arithmetic operators in R, which are used to perform
arithmetic operations with scalars and vectors are as follows:
Operator   Operation
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Remainder
%/%        Integer division

To perform arithmetic operations such as addition, subtraction, multiplication
and division between two scalars, we assign two scalars a and b as 4 and 2 in
the following manner:
#Assigning two scalars
> a <- 4; b <- 2

After assigning the scalars, we next perform different arithmetic operations as
follows:
#Performing scalar’s addition using +
> a+b
[1] 6

#Performing scalar’s subtraction using -
> a-b
[1] 2

#Performing scalar’s multiplication using *
> a*b
[1] 8

#Performing scalar’s division using /
> a/b
[1] 2

Next, to illustrate the exponent, remainder and integer division operators, we
compute a^3 using the exponent operator, the remainder of a divided by 2 using
the %% operator and the quotient obtained from integer division of a by 3 using
the %/% operator as follows:
#Computing a^3 using ^
> a^3
[1] 64
#Computing remainder of a divided by 2 using %%
> a%%2
[1] 0
#Performing integer division of a by 3 using %/%
> a%/%3
[1] 1
Note: The normal division of a by 3 gives 1.333333, which is different from
1 (obtained using integer division). Hence, in integer division the fractional
part is discarded (for negative results, R rounds down towards minus infinity).
#Performing division
> a/3
[1] 1.333333
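The %/% and %% operators are linked: the quotient multiplied by the divisor, plus the remainder, gives back the original number. The sketch below checks this for a = 4 and b = 3, and also shows the behaviour for a negative number; the value -7 is our own choice for the illustration.

```r
a <- 4
b <- 3

q <- a %/% b   # quotient of integer division: 1
r <- a %% b    # remainder: 1

# The quotient and the remainder reconstruct the original number
check <- (q * b + r == a)
print(check)   # TRUE

# For negative numbers the quotient is rounded down, not towards zero
neg_q <- (-7) %/% 2
neg_r <- (-7) %% 2
print(neg_q)   # -4
print(neg_r)   # 1
```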

Next, we perform arithmetic operations on vectors. For illustration,
we first create two vectors c(2,4,5,6) and c(6,1,4,7), and
assign them to x and y respectively. Thereafter, we perform arithmetic
operations such as +, -, * and / between them as follows:
#Creating two vectors
> x <- c(2, 4, 5, 6)
> y <- c(6, 1, 4, 7)

Note: In R, arithmetic operations between two or more vectors are performed
element by element. For example, if we want to perform vector addition +
between two vectors x and y, then the ith element of the x vector will be added
to the ith element of the y vector as follows:

Element →             1st         2nd         …   ith         …   nth
x vector              x[1]        x[2]        …   x[i]        …   x[n]
Arithmetic operator   +           +           …   +           …   +
y vector              y[1]        y[2]        …   y[i]        …   y[n]
Output                x[1]+y[1]   x[2]+y[2]   …   x[i]+y[i]   …   x[n]+y[n]

Similarly, vector subtraction, multiplication and division can be performed.
Also, these operations can be performed between any number of vectors.
Next, we perform addition, subtraction, multiplication, division, remainder after
division and integer division between the x and y vectors as follows:
#Performing addition of two vectors
> x+y
[1] 8 5 9 13

#Performing subtraction of two vectors
> x-y
[1] -4 3 1 -1

#Performing multiplication of two vectors
> x*y
[1] 12 4 20 42

#Performing division of two vectors
> x/y
[1] 0.3333333 4.0000000 1.2500000 0.8571429

#Computing remainder of elements of x divided by corresponding
#elements of y
> x%%y
[1] 2 0 1 6

#Performing element wise integer division
> x%/%y
[1] 0 4 1 0

Next, we illustrate the exponent operator. Note that whenever the exponent
operator is used on a vector, each element of the vector is raised to the given
power. For example, let us compute x^2, where x is a vector, using the ‘ ^ ’
operator as follows:
#Obtaining the positive power of the elements of a vector x
> x <- c(2, 4, 5, 6)
> x^2
[1] 4 16 25 36

Next, we compute y^(-1) using the ‘ ^ ’ operator as follows:
#Computing the negative power of the elements of a vector
> y <- c(6, 1, 4, 7)
> y^(-1)
[1] 0.1666667 1.0000000 0.2500000 0.1428571

In addition to all these, we next perform arithmetic operations between scalars
and vectors, and observe the computed outputs. Let us first multiply a scalar 2
with a vector x, whose elements are -1, 0.4, 5.3, -2 and 8. In this case,
each element of the vector is multiplied by the scalar as follows:
#Performing multiplication of a scalar with a vector
> x <- c(-1, 0.4, 5.3, -2, 8)
> 2*x
[1] -2.0 0.8 10.6 -4.0 16.0

Note that, whenever arithmetic operators such as +, -, /, %% and %/% are
used to perform arithmetic operations between scalars and vectors (which is
not advisable), the scalar first replicates itself until it matches the
length of the vector with which the operation is to be performed. Thereafter,
the arithmetic operation is performed.
For the illustration purpose, let us perform arithmetic operations, +, - and /
between a scalar 2 and a recently created vector x. Then in this case 2 will
replicate itself 5 times as the length of the x vector is 5, thereafter the
arithmetic operation will be performed as follows:
#Adding a scalar to a vector
> 2+x
[1] 1.0 2.4 7.3 0.0 10.0

#Subtracting each element of a vector from a scalar
> 2-x
[1] 3.0 1.6 -3.3 4.0 -6.0

#Dividing each element of a vector by a scalar
> x/2
[1] -0.50 0.20 2.65 -1.00 4.00

Note: In general, if the vectors in an arithmetic expression (consisting of
arithmetic operators) are not all of the same length, then the shorter
vectors are replicated (recycled) until they match the length of the
longest vector present in that expression. This process is known as the
recycling rule of R.
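The recycling rule can be seen more clearly with two vectors of different lengths. In the sketch below (the vectors u and v are our own examples), the shorter vector of length 2 is recycled three times to match the vector of length 6. When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is suppressed here only to keep the output clean.

```r
u <- c(1, 2, 3, 4, 5, 6)
v <- c(10, 20)

# v is recycled as 10, 20, 10, 20, 10, 20
s <- u + v
print(s)   # 11 22 13 24 15 26

# Length 5 is not a multiple of 2: R recycles partially and warns
w <- suppressWarnings(c(1, 2, 3, 4, 5) + c(1, 2))
print(w)   # 2 4 4 6 6
```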
Moreover, it should be noted that in addition to all these common arithmetic
operators, mathematical functions such as min(), max(), factorial(),
log(), exp(), sin(), cos(), tan(), etc., can also be used on vectors. All
these functions are available in the base package. We now illustrate some of
these mathematical functions with the help of a vector z with arbitrary
elements 7, 8, 9, 4, 5 and 6.
#Creating a vector
> z <- c(7:9, c(4, 5, 6))

We first compute the minimum and maximum of vector z using the min() and
max() functions respectively as follows:
#Computing the minimum
> min(z)
[1] 4

#Computing the maximum


> max(z)
[1] 9

Next, we compute the natural logarithm, exponential and factorial of each
element of z using the log(), exp() and factorial() functions, respectively,
as follows:
#Computing the natural log of each element of z
> log(z) #use log(z, 10) for logarithm to base 10
[1] 1.945910 2.079442 2.197225 1.386294 1.609438 1.791759

#Computing e^z for each element of z
> exp(z)
[1] 1096.63316 2980.95799 8103.08393 54.59815 148.41316 403.42879

#Computing the factorial of each element of z


> factorial(z)
[1] 5040 40320 362880 24 120 720

To illustrate trigonometric functions, let us consider the sin() function, as
other trigonometric functions can be handled along the same lines. Let us first
create a vector y with elements π/6, π/4, π/3 and π/2.

#Creating vector y
> y <- c(pi/6, pi/4, pi/3, pi/2)

Note: The built-in constant pi available in R can be used in place of π. By
default, R prints the constant pi to seven significant digits as 3.141593, which
can be verified as follows:
#Built-in constant pi
> pi
[1] 3.141593

Then using pi the values of sin(y) can be obtained as follows:
#Computing sin(π/6), sin(π/4), sin(π/3) and sin(π/2)
> sin(y)
[1] 0.5000000 0.7071068 0.8660254 1.0000000
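The other trigonometric functions work in the same way. As a hedged illustration, the sketch below applies cos() to the same vector y. Because pi is stored with finite precision, cos(pi/2) is not exactly zero but a very small number, so we compare it with zero using a small tolerance (the value 1e-12 is an arbitrary choice of ours).

```r
y <- c(pi/6, pi/4, pi/3, pi/2)

# Cosine of each element of y
cos_y <- cos(y)
print(cos_y)

# cos(pi/2) is zero only up to floating-point rounding error
near_zero <- abs(cos_y[4]) < 1e-12
print(near_zero)  # TRUE
```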

SAQ 2
Write the output of the following statements:
(i) 3+2*c(1,2)
(ii) min(c(0.2, -0.2, 0.0))
(iii) tan(c(pi/6, pi/4))
(iv) c(1,4,8,3)%/%c(2,5,2,2)

2.4 MATRICES
A matrix is a two-dimensional rectangular layout of a collection of data
elements of the same class. Matrices in R can be created using several
methods. The most commonly used method of creating a matrix is using the
matrix() function available in the base package. Note that, whenever a
matrix is created using the matrix() function, the elements of the matrix are,
by default, filled column-wise. Also, the dimension of
the matrix is defined by supplying appropriate values to the nrow
and ncol arguments of the matrix() function. These arguments are used to
specify the number of rows and columns of the matrix. The main arguments of
interest of the matrix() function are as follows:
#The matrix() function
matrix(data, #data vector of matrix elements
nrow, #number of rows of created matrix
ncol, #number of columns of created matrix
byrow, #controls the filling of data elements in matrix
dimnames) #gives names to rows and columns

The data argument of the matrix() function is used to supply the data vector,
the nrow and ncol arguments are used to specify the number of rows and
columns of the created matrix, and the byrow argument is a logical argument of
the function. If byrow=TRUE, then the elements of the data will be filled
row-wise in the created matrix; otherwise, they will be filled column-wise.
Lastly, the dimnames argument is used to give names to the rows and columns of
the matrix. Note that dimnames is assigned a list of two components consisting
of the names of the rows and columns.
Note: List objects are discussed in Unit 3 of the MST-015 course.
Next, we illustrate the method of creating a matrix of dimension 2x3 with
elements -1, 3, -2, 5, 4 and 2 using the matrix() function. To do so, we
assign the data argument of the function as the vector consisting of these
elements, the nrow argument as 2 and the ncol argument as 3, as follows:
#Creating a matrix in which elements are filled column-wise
> matrix(data=c(-1, 3, -2, 5, 4, 2), nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]   -1   -2    4
[2,]    3    5    2

Note: You should note the following:
(i) The same matrix can also be created without mentioning the argument
names.
#Creating a matrix without mentioning the argument names
> matrix(c(-1, 3, -2, 5, 4, 2), 2, 3)
     [,1] [,2] [,3]
[1,]   -1   -2    4
[2,]    3    5    2

(ii) The class and type of a matrix object can be seen as follows:
#Checking the class and type
> class(matrix(c(-1, 3, -2, 5, 4, 2), 2, 3))
[1] "matrix" "array"

> typeof(matrix(c(-1, 3, -2, 5, 4, 2), 2, 3))


[1] "double"

The class of a matrix object comes out to be "matrix" as well as
"array" because matrices are arrays of two dimensions, and the type of
the matrix is "double".
After the note, let us come back to the creation of a matrix. You should
note that, in the created matrix, the data elements are first filled in the first
column, then in the second column and lastly in the third column. If we want
the first row to be filled first with the data elements and then the second row
(known as row orientation), then we must assign the byrow
argument of the matrix() function as TRUE in the following manner:
#Creating a matrix in which elements are filled row-wise
> matrix(c(-1, 3, -2, 5, 4, 2), nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]   -1    3   -2
[2,]    5    4    2

Note that, whenever both the function arguments nrow and ncol are
specified, the product of both of them should be equal to the length of the
data, or otherwise you may get a warning message. Consider the following
example, in which the length of the data is 8, which is larger than the product
of the dimensions 2x3, i.e., 6 for illustration purpose only.
#Creating a matrix
> matrix(c(-1, 3, -2, 5, 4, 2, 4, 5), nrow=2, ncol=3,
byrow=TRUE)
     [,1] [,2] [,3]
[1,]   -1    3   -2
[2,]    5    4    2
Warning message:
In matrix(c(-1, 3, -2, 5, 4, 2, 4, 5), nrow = 2, ncol = 3, byrow = TRUE) :
  data length [8] is not a sub-multiple or multiple of the number of columns [3]

So, from this output we can see that, whenever the length of the data is more
than the product of nrow and ncol in the matrix() function, in that case the
extra data elements will be discarded with a warning message.

Next, we illustrate, what happens if the length of the data is less than the
product of nrow and ncol in the matrix() function. In that case the data will
start to replicate itself until it matches the product of nrow and ncol with a
warning message. For example:

#Creating a matrix
> matrix(c(-1, 3, -2, 5, 4), nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]   -1    3   -2
[2,]    5    4   -1
Warning message:
In matrix(c(-1, 3, -2, 5, 4), nrow = 2, ncol = 3, byrow = TRUE) :
  data length [5] is not a sub-multiple or multiple of the number of rows [2]

In addition to all these, for creating a matrix of a desired shape using the
matrix() function, it is necessary to specify at least one of the dimension
arguments, either ncol or nrow, keeping in mind that the specified value
must be a divisor of the length of the data argument of the matrix, as the
second dimension will be inferred from the length of the data. For
illustration, let us create a matrix with elements 2.5, 1.3, -2.1, 0.0,
0.33, -0.1, 0.8, -9.8 and 2.2, and assign only the ncol argument of the
matrix() function as 3, as follows:

#Creating a matrix by assigning only one dimension argument
> matrix(c(2.5,1.3,-2.1,0.0,0.33,-0.1,0.8,-9.8,2.2), ncol=3)
     [,1]  [,2] [,3]
[1,]  2.5  0.00  0.8
[2,]  1.3  0.33 -9.8
[3,] -2.1 -0.10  2.2

In this illustration, we have created a matrix with 9 data elements, and we
have assigned only the number of columns. We can see that the number of
rows is inferred as 9/3, which is equal to 3. That is why the created matrix
has 3 rows. Next, we assign names to the rows as R1, R2, R3 and
to the columns as C1, C2, C3, using the dimnames argument of the function
with the help of lists (the detailed discussion on lists can be seen in Unit 3 of
MST-015).
#Assigning names to rows and columns of a matrix
> matrix(c(2.5, 1.3, -2.1, 0.0, 0.33, -0.1, 0.8, -9.8, 2.2),
ncol=3, dimnames=list(c("R1","R2","R3"), c("C1","C2","C3")))
     C1    C2   C3
R1  2.5  0.00  0.8
R2  1.3  0.33 -9.8
R3 -2.1 -0.10  2.2
Note: The dimnames argument should be assigned a list consisting of two
character vectors as components.
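Names can also be attached after a matrix has been created. The sketch below is one possible alternative to the dimnames argument: it uses the base functions rownames() and colnames() on a matrix that we call M (a name chosen only for this illustration), and then uses the names to pick out a single element.

```r
# Same data as in the text, stored in a matrix named M
M <- matrix(c(2.5, 1.3, -2.1, 0.0, 0.33, -0.1, 0.8, -9.8, 2.2), ncol = 3)

# Assign the row and column names after creation
rownames(M) <- c("R1", "R2", "R3")
colnames(M) <- c("C1", "C2", "C3")

# Names can be used in place of position numbers
print(M["R2", "C3"])  # -9.8

# dimnames() returns the list of row and column names
print(dimnames(M))
```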

2.4.1 Matrix Addition, Subtraction and Multiplication


Matrix addition, subtraction and multiplication in R are facilitated by the
following operators:
Operator   Operation
+          Matrix addition
-          Matrix subtraction
%*%        Matrix multiplication

Before adding, subtracting or multiplying any two or more matrices, it is our
responsibility to check whether the dimensions of the matrices are suitable for
the aforementioned matrix operations or not. In order to illustrate the addition,
subtraction and multiplication of matrices in R programming, consider the
following three arbitrary matrices A, B and C of dimensions 3x3, 3x4 and 3x3,
respectively.
$$
A=\begin{pmatrix} 1 & 7 & 13\\ 3 & 9 & 15\\ 5 & 11 & 17 \end{pmatrix},\quad
B=\begin{pmatrix} 1.0 & 5.5 & 10.0 & 14.5\\ 2.5 & 7.0 & 11.5 & 16.0\\ 4.0 & 8.5 & 13.0 & 17.5 \end{pmatrix}
\quad\text{and}\quad
C=\begin{pmatrix} 2 & 4 & 2\\ 3 & 5 & -1\\ 1 & 1 & 5 \end{pmatrix}
$$

#Assigning matrix A
> A <- matrix(seq(from=1, to=17, by=2), nrow=3); A
     [,1] [,2] [,3]
[1,]    1    7   13
[2,]    3    9   15
[3,]    5   11   17

#Assigning matrix B
> B <- matrix(seq(from=1, to=18, by=1.5), nrow=3, ncol=4); B
     [,1] [,2] [,3] [,4]
[1,]  1.0  5.5 10.0 14.5
[2,]  2.5  7.0 11.5 16.0
[3,]  4.0  8.5 13.0 17.5

#Assigning matrix C
> C <- matrix(c(2, 3, 1, 4, 5, 1, 2, -1, 5), nrow=3); C
     [,1] [,2] [,3]
[1,]    2    4    2
[2,]    3    5   -1
[3,]    1    1    5

Observe that the two matrices A and C are conformable for matrix addition and
subtraction, as both are of dimension 3x3. So, let us perform addition and
subtraction of matrices A and C using the ‘ + ’ and ‘ - ’ operators as follows:
#Performing matrix addition
> A+C
     [,1] [,2] [,3]
[1,]    3   11   15
[2,]    6   14   14
[3,]    6   12   22

#Performing matrix subtraction
> A-C
     [,1] [,2] [,3]
[1,]   -1    3   11
[2,]    0    4   16
[3,]    4   10   12

Next, we perform matrix multiplication using the ‘ %*% ’ operator. Since the
number of columns of C equals the number of rows of B, the matrix C is
conformable for matrix multiplication with matrix B. Therefore, we
next compute their product as follows:
#Performing matrix multiplication
> C%*%B
     [,1] [,2] [,3] [,4]
[1,] 20.0   56 92.0  128
[2,] 11.5   43 74.5  106
[3,] 23.5   55 86.5  118

Recall that, whenever a scalar is multiplied with a matrix, all the elements of
that matrix are multiplied by the scalar. So, if k is a scalar whose value is 3
and A is the earlier created matrix, then we can compute their product using
the ‘ * ’ operator as follows:
#Assigning scalar
> k <- 3

#Performing multiplication of a scalar with a matrix
> k*A
     [,1] [,2] [,3]
[1,]    3   21   39
[2,]    9   27   45
[3,]   15   33   51

Note: These operations can be performed on any number of matrices of
suitable orders.
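A common source of confusion is the difference between ‘ * ’ and ‘ %*% ’ when both operands are matrices. The sketch below, using the matrices A, B and C defined above, contrasts the two and also shows one possible way of checking the dimensions before multiplying. The object names elementwise, product and res, and the text returned on failure, are our own choices.

```r
A <- matrix(seq(from = 1, to = 17, by = 2), nrow = 3)
B <- matrix(seq(from = 1, to = 18, by = 1.5), nrow = 3, ncol = 4)
C <- matrix(c(2, 3, 1, 4, 5, 1, 2, -1, 5), nrow = 3)

# '*' multiplies corresponding elements; '%*%' is the matrix product
elementwise <- A * C
product <- A %*% C
print(elementwise[1, 1])  # 1*2 = 2
print(product[1, 1])      # 1*2 + 7*3 + 13*1 = 36

# C %*% B is defined because ncol(C) equals nrow(B)
print(ncol(C) == nrow(B)) # TRUE

# B %*% C is not defined (4 columns against 3 rows), so R raises an error
res <- tryCatch(B %*% C, error = function(e) "non-conformable")
print(res)
```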

2.4.2 Extraction of Subvectors and Submatrices


In R, the extraction of subvectors and submatrices is done by using simple
commands. To illustrate the method, let us first create a matrix L of order 4x5
with the following lower-case letters of the alphabet:
‘a’, ‘b’, ‘c’,…, ‘s’, ‘t’
Recall that these lower-case letters can be easily extracted from the built-in
constant letters. So, we use it to create a matrix named L as follows:
#Creating a matrix L of first 20 letters of alphabet
> L <- matrix(letters[1:20], nrow=4, ncol=5); L
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "e"  "i"  "m"  "q"
[2,] "b"  "f"  "j"  "n"  "r"
[3,] "c"  "g"  "k"  "o"  "s"
[4,] "d"  "h"  "l"  "p"  "t"

Note that the built-in constant letters consists of the 26 lower-case letters of
the Roman alphabet. We have used only the first 20 for illustration.
Before we illustrate the method of extraction, it is important for you to
understand the terms ‘row indices’ and ‘column indices’. The first place in
brackets ‘ […] ’ after the name of the matrix is known as the place for row
indices (also referred to as margin 1) and the second place, which is separated
by a comma from the row indices, is known as the place for column indices (also
referred to as margin 2). The row and column indices are used to specify
particular row(s) or column(s) or both of a considered matrix.

Now, we show the method of extraction of the 2nd row of the matrix L. To extract
the 2nd row from L, we write 2 at the row indices place and leave the column
indices place empty in brackets as follows:

#Extracting the 2nd row of the matrix L


> L[2,]
[1] "b" "f" "j" "n" "r"

Similarly, we can also extract the 3rd column of the matrix L by leaving the row
indices place empty and writing 3 at the column indices place in brackets as
follows:

#Extracting the 3rd column of the matrix L
> L[,3]
[1] "i" "j" "k" "l"

If you are interested in extracting the 4th element of the 3rd row of
matrix L, then it can be extracted by writing 3 at the row indices place and 4 at
the column indices place in brackets as follows:
#Extracting the 4th element of the 3rd row of L
> L[3,4]
[1] "o"

Most importantly, note that leaving any indices place (row or column) empty
leads to selection of the full range of that subscript. Moreover, in general, the
extraction of a subvector and a particular positioned element from a given
matrix can be done by writing the following after the name of the matrix:

[i, ]    For extracting the ith row vector
[ ,j]    For extracting the jth column vector
[i,j]    For extracting the (i, j)th element
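Note that a single extracted row or column is returned as an ordinary vector, not as a matrix. If the matrix structure is to be kept, the additional drop argument of the bracket operator can be set to FALSE. The following is a minimal sketch using the matrix L from the text; the object names r2_vec and r2_mat are our own.

```r
L <- matrix(letters[1:20], nrow = 4, ncol = 5)

# By default a single row is returned as a plain character vector
r2_vec <- L[2, ]
print(dim(r2_vec))   # NULL, since a vector has no dimension

# With drop = FALSE the result stays a matrix with one row
r2_mat <- L[2, , drop = FALSE]
print(dim(r2_mat))   # 1 5
```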

Till now in matrices, we have only discussed the method of extraction of a
scalar and a particular row vector or column vector from a given matrix.
Next, we discuss the method of extraction of a submatrix from a given matrix.
For illustration, we show the method of extraction of the
matrix of 4 elements shown in a rectangular box (the boxed entries) from the
matrix L.
$$
L=\begin{pmatrix}
\text{"a"} & \text{"e"} & \boxed{\text{"i"}} & \boxed{\text{"m"}} & \text{"q"}\\
\text{"b"} & \text{"f"} & \boxed{\text{"j"}} & \boxed{\text{"n"}} & \text{"r"}\\
\text{"c"} & \text{"g"} & \text{"k"} & \text{"o"} & \text{"s"}\\
\text{"d"} & \text{"h"} & \text{"l"} & \text{"p"} & \text{"t"}
\end{pmatrix}
$$

Note that the matrix shown in the rectangular box appears in the
1st and 2nd rows of L, and its columns appear in the 3rd
and 4th columns of L. This means that we can easily extract the matrix shown in
the rectangular box by writing the row indices as c(1,2) and the column
indices as c(3,4) in the following manner:
#Extracting the submatrix shown in the rectangular box
> L[c(1,2), c(3,4)]
     [,1] [,2]
[1,] "i"  "m"
[2,] "j"  "n"

An alternative method of obtaining the submatrix is to drop those
rows and columns from the original matrix in which the submatrix elements
do not appear. This exercise is left for the learners.
Next, we illustrate how an element or a subvector can be replaced by another
element or subvector. To illustrate it, let us replace the element present at the
2nd column of the 2nd row, i.e., "f", by the letter "R" in the matrix L. To do
this, we must know how to access that element, which is L[2,2]; then it can
be overwritten with the help of the assignment operator as follows:
#Overwriting an element of matrix L
> L[2,2] <- "R"

#Verifying whether replacement is successfully performed or not
> print(L)
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "e"  "i"  "m"  "q"
[2,] "b"  "R"  "j"  "n"  "r"
[3,] "c"  "g"  "k"  "o"  "s"
[4,] "d"  "h"  "l"  "p"  "t"
From this output, it is clear that the replacement of a matrix element is
successfully performed, i.e., the 2nd element of the 2nd row, "f", is
replaced by "R". Note that, due to this replacement, the L matrix is now
changed. So, to illustrate the replacement of any column or row of a matrix, we
first assign L again and then do the replacement.
We next illustrate the method of replacement of an entire column of a matrix.
For example, we shall replace the 2nd column of the original matrix L, i.e.,
("e", "f", "g", "h")^T, by ("R", "M", "N", "O")^T. This can be done as
follows:
#Recalling the matrix L
> L <- matrix(letters[1:20], nrow=4, ncol=5)

#Replacing the 2nd column of L
> L[,2] <- c("R", "M", "N", "O")

#Verifying whether replacement is successfully performed or not
> print(L)
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "R"  "i"  "m"  "q"
[2,] "b"  "M"  "j"  "n"  "r"
[3,] "c"  "N"  "k"  "o"  "s"
[4,] "d"  "O"  "l"  "p"  "t"

From this print statement, it is clear that the replacement of the 2nd column of
the matrix L is successfully performed.

2.4.3 Matrix Functions
In this subsection, we shall discuss some important matrix functions and
illustrate the execution of each one of them one-by-one by giving some
suitable examples. A list of the most popular matrix functions (with their
objectives) is as follows:

Matrix Function   Objective
t()               Obtain the transpose of a matrix.
nrow()            Obtain the number of rows of a matrix.
ncol()            Obtain the number of columns of a matrix.
dim()             Obtain the dimension of a matrix.
rowSums()         Obtain the vector of row sums.
rowMeans()        Obtain the vector of row means.
colSums()         Obtain the vector of column sums.
colMeans()        Obtain the vector of column means.
rbind()           Combine vectors/matrices vertically.
cbind()           Combine vectors/matrices horizontally.
det()             Compute the determinant of a matrix.
solve()           Obtain the inverse of a matrix.
diag()            Serves multiple purposes depending on the argument supplied
                  to this function. The argument can be a scalar, a vector or
                  a matrix.

To illustrate the execution of the dim(), nrow(), ncol(), t(), rowSums(),
colSums(), rowMeans(), colMeans(), det() and solve() functions, we
create a 3x3 non-singular square matrix A with elements 3, 4, 2, 1, -4, 2, 1, -3
and 4 as follows:
#Creating a matrix A
> A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), ncol=3); A
[,1] [,2] [,3]
[1,] 3 1 1
[2,] 4 -4 -3
[3,] 2 2 4

After creating the matrix A, we next illustrate the execution of the aforementioned
matrix functions one-by-one as follows:
#Getting the dimension of A
> dim(A)
[1] 3 3
#Getting the number of rows of A
> nrow(A)
[1] 3
#Getting the number of columns of A
> ncol(A)
[1] 3
#Obtaining the transpose of A
> t(A)
[,1] [,2] [,3]
[1,] 3 4 2
[2,] 1 -4 2
[3,] 1 -3 4

Now, we illustrate the use of the rowSums() and rowMeans() functions. It
can be seen that the sum of the elements of the first row of the matrix A is 3+1+1=5.
Also, the sums of the elements of the second and third rows of the matrix A are
4-4-3=-3 and 2+2+4=8, respectively. Hence, the vector of row sums is (5, -3,
8). As each row has 3 elements, the means of the elements of the
first, second and third rows are obtained as 5/3, -3/3 and 8/3, respectively.
Thus, the vector of row means is (1.666667, -1.000000, 2.666667). All these
calculations can be done in R using the rowSums() and rowMeans()
functions as follows:
#Obtaining a vector of row sums
> rowSums(A)
[1] 5 -3 8

#Obtaining a vector of row means
> rowMeans(A)
[1] 1.666667 -1.000000 2.666667

Similarly, the sums of the elements of the first, second and third columns of the
matrix A are 3+4+2, 1-4+2 and 1-3+4, i.e., 9, -1 and 2, respectively. As each
column has 3 elements, the column means are 9/3, -1/3 and 2/3, respectively.
Thus, the vectors of column sums and column means are (9, -1, 2) and
(3, -0.3333333, 0.6666667), respectively. The same can be obtained by using
the colSums() and colMeans() functions as follows:
#Obtaining a vector of column sums
> colSums(A)
[1] 9 -1 2

#Obtaining a vector of column means
> colMeans(A)
[1] 3.0000000 -0.3333333 0.6666667

Next, we discuss the execution of the det(), solve() and diag()
functions. For the illustration purpose, let us compute the inverse of the earlier
created matrix A with elements 3, 4, 2, 1, -4, 2, 1, -3 and 4 in R. Recall that,
before obtaining the inverse of any matrix A, it is our responsibility to first
check whether the matrix is singular or non-singular (as we can only obtain the
inverse if the matrix is non-singular) by computing the determinant of the
matrix as follows:
#Computing the determinant of a matrix A
> det(A)
[1] -36

Note: If the determinant of a matrix is zero, then the matrix is called singular
(which means its inverse does not exist), and if the determinant of a matrix is
non-zero, then the matrix is called non-singular (which means its inverse exists).
As the computed value of the determinant is non-zero, the matrix A
is non-singular and hence, its inverse exists. Next, we compute the inverse of
A using the solve() function as follows:
#Computing the inverse of a matrix A
> solve(A)
[,1] [,2] [,3]
[1,] 0.2777778 0.05555556 -0.02777778
[2,] 0.6111111 -0.27777778 -0.36111111
[3,] -0.4444444 0.11111111 0.44444444

You can verify whether the computed inverse is correct by checking that the
product of A and its inverse is the 3x3 identity matrix.
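The check can also be carried out in R itself. The sketch below assumes that the intended result is the standard one, namely that the product of A and its inverse is the identity matrix. Because the inverse is computed in floating-point arithmetic, the product is compared with the identity matrix up to a small tolerance using all.equal() rather than exact equality:

```r
# Matrix A as created earlier in this section
A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), ncol = 3)

# Inverse of A
A_inv <- solve(A)

# Product of A and its inverse; should be the 3x3 identity matrix
# up to floating-point rounding
P <- A %*% A_inv
print(round(P, 10))

# all.equal() compares with a small numerical tolerance
check <- isTRUE(all.equal(P, diag(3)))
print(check)
```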

Next, we illustrate the use of the diag() function. This function can take 3
types of arguments, namely, a scalar, a vector and a matrix. Whenever a scalar k
is supplied as the argument to the diag() function, it creates a kxk
identity matrix. Further, whenever a vector is supplied as its argument, it
creates a diagonal matrix with the elements of the vector on the diagonal. Moreover,
whenever a matrix is supplied as its argument, it returns a vector consisting of
the elements of the principal diagonal. For the illustration purpose, consider a
scalar k equal to 4, a vector x with elements 1 to 4 and the earlier created matrix A. To
support the given statement, we supply each of them as the function
argument one-by-one in the diag() function as follows:
#Assigning a scalar k
> k <- 4

#Supplying a scalar to diag() function


> diag(k)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1

#Assigning a vector
> x <- c(1, 2, 3, 4) # x <- 1:4

#Supplying a vector to diag() function


> diag(x)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 2 0 0
[3,] 0 0 3 0
[4,] 0 0 0 4

#Supplying a matrix A to diag() function


> diag(A)
[1] 3 -4 4

Lastly, in this section, we illustrate the method of execution of the rbind()
and cbind() functions. Note that these two functions take two or more
matrices/vectors as function arguments. So, for illustration purpose, we need
one more matrix, say, B of some suitable order (such that it can be combined
row wise and column wise with the matrix A). Let us create a matrix B of
order 3x3 consisting of 1's as follows:
#Creating a matrix B
> B <- matrix(rep(1,9), ncol=3); B
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1

Note that any number of matrices or vectors can be combined vertically (row
wise) or horizontally (column wise) using the rbind() and cbind()
functions, respectively. Another important point is that, whether matrices or
vectors are combined, the obtained output will always be a
matrix object.
#Combining two matrices A and B row wise
> rbind(A,B)
[,1] [,2] [,3]
[1,] 3 1 1
[2,] 4 -4 -3
[3,] 2 2 4
[4,] 1 1 1
[5,] 1 1 1
[6,] 1 1 1

These two matrices can also be combined column wise using the cbind()
function as follows:
#Combining two matrices A and B column wise
> cbind(A,B)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 1 1 1 1
[2,] 4 -4 -3 1 1 1
[3,] 2 2 4 1 1 1
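The statement that combining vectors also yields a matrix can be checked with a short sketch. The vectors u and v below are hypothetical and introduced only for this illustration:

```r
# Two arbitrary vectors of the same length
u <- c(1, 2, 3)
v <- c(4, 5, 6)

# Row-wise combination: each vector becomes a row (2 x 3 matrix)
R <- rbind(u, v)
print(R)

# Column-wise combination: each vector becomes a column (3 x 2 matrix)
C <- cbind(u, v)
print(C)

# Both results are matrix objects
print(is.matrix(R))
print(dim(C))
```

Note that the vector names u and v are carried over as row names (for rbind()) or column names (for cbind()) of the result.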

SAQ 3
(a) Write the output of the following code:
A <- matrix(1:4,nrow=2); A
B <- matrix(5:8, nrow=2, ncol=2); B
C <- matrix(rep(1,4),ncol=2); C
A-C+B%*%C
(b) Define a matrix in R and create a matrix with the following elements.
1 0 3
0 1 2
2 4 0

2.5 ARRAYS
From the previous sections, you can observe that a vector is a one-dimensional
arrangement of data elements and a matrix is a two-dimensional arrangement
of data elements, i.e., data presented in rows and columns. Arrays in
R provide a more general way of presenting data in one, two or
more than two dimensions. In fact, arrays with one and two dimensions are
analogous to a vector and a matrix, respectively. Arrays in R can be created using
the array() function available in the base package.
#The array() function
array(data, #data vector of elements
dim, #to specify dimension
...) #other arguments

For the illustration purpose, create a vector of elements from 1 to 18 and


assign it to x as follows:
#Assigning a vector x
> x <- 1:18

Then a one-dimensional array can be created by assigning the x vector to the
data argument and the length of the x vector to the dim argument (used to
specify the dimension) as follows:
#Creating an array of one dimension
> array(data = x, dim = length(x))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

To create a matrix using the array() function, we assign the vector x to the
data argument and the number of rows and columns of the matrix to the dim
argument, such that the product of the number of rows and columns equals the
number of elements in the data argument, as follows:
#Creating an array of two-dimension
> array(data=x, dim=c(3,6))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 4 7 10 13 16
[2,] 2 5 8 11 14 17
[3,] 3 6 9 12 15 18

Note: In the dim argument, the first place specifies the number of
rows and the second place specifies the number of columns, written using
the c() function.
Next, for creating a three-dimensional array using the x vector, we again
assign x to the data argument and supply the dim argument of the array()
function with 3 indices. The product of these 3 indices should be equal to the
number of elements in the data, as follows:
#Creating an array of three-dimension
> array(data=x, dim=c(3,2,3))
, , 1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
, , 3
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18
It is important to note that the created array has 18 elements, which equals
the product 3x2x3 of its dimensions. The indices written in the dim
argument show that we are arranging the elements in 3 matrices, each of
order 3x2.
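The relation between the dimensions and the number of elements can be checked directly in R. The following is a small sketch using the same vector x; the object names arr3 and m are chosen here only for illustration:

```r
# Data vector with 18 elements
x <- 1:18

# Three-dimensional array: 3 matrices, each of order 3x2
arr3 <- array(data = x, dim = c(3, 2, 3))

# Dimensions and total number of elements
print(dim(arr3))
print(length(arr3))

# The product of the dimensions equals the number of elements
print(prod(dim(arr3)) == length(arr3))

# A two-dimensional array is treated as a matrix by is.matrix()
m <- array(data = x, dim = c(3, 6))
print(is.matrix(m))
```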

2.5.1 Extraction of Subsections of an Array


Individual elements of an array can be extracted along the same lines as in the
case of matrices. In arrays, leaving any index position empty
selects the full range of that subscript. For the illustration
purpose, let us create an arbitrary three-dimensional array and assign it to
Arr as follows:
#Creating an array of three-dimension
> Arr <- array(data=seq(1, 8.5, 0.5), dim=c(4, 2, 2))
> print(Arr)

, , 1
[,1] [,2]
[1,] 1.0 3.0
[2,] 1.5 3.5
[3,] 2.0 4.0
[4,] 2.5 4.5

, , 2
[,1] [,2]
[1,] 5.0 7.0
[2,] 5.5 7.5
[3,] 6.0 8.0
[4,] 6.5 8.5

Extraction of elements in an array corresponding to different indices can be
learnt from the following examples:
#Extracting the first matrix of order 4x2 from Arr
> Arr[,,1]
[,1] [,2]
[1,] 1.0 3.0
[2,] 1.5 3.5
[3,] 2.0 4.0
[4,] 2.5 4.5

#Extracting the first row of each matrix of an array


> Arr[1,,]
[,1] [,2]
[1,] 1 5
[2,] 3 7

#Extracting the first row of first matrix of an array


> Arr[1,,1]
[1] 1 3
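A single element can be extracted by supplying all three indices. The sketch below recreates Arr and extracts one element and one column across both matrices; the names e and s are hypothetical and used only for this illustration:

```r
# Recreate the array used above
Arr <- array(data = seq(1, 8.5, 0.5), dim = c(4, 2, 2))

# Element in the 2nd row and 1st column of the 2nd matrix
e <- Arr[2, 1, 2]
print(e)

# Second column of every matrix; the result is a 4x2 matrix whose
# columns correspond to the first and second matrix of the array
s <- Arr[, 2, ]
print(s)
```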

SAQ 4
Create an array of two dimensions with the following elements:
2 0 1
1 1 2
3 0 1
4 1 0
1 2 1
1 0 1

After creating it save it under the name B. Also, extract the row shown in the
rectangle.

2.6 FACTORS
A factor in R is a special type of object that specifies a discrete
classification (grouping) of the elements of a vector of the same length. It
provides an easy way of handling categorical (nominal) data. The possible
values it can take are given by its levels. For example, statistical data
often contain categorical variables, which indicate the
subdivision of the data under consideration on the basis of social class, cancer
stage, etc. Factors in R are created using the factor() function. As an
illustration, we create a factor object consisting
of the data on social status of 7 individuals using the factor() function as
follows:
#Creating a factor
> factor(c("Medium", "Low", "Medium", "High", "High", "Low",
"Medium"))
[1] Medium Low Medium High High Low Medium
Levels: High Low Medium

Note that, whenever a factor is created without specifying its levels, the
levels are displayed in alphabetical order (the Levels: High, Low, Medium are
in alphabetical order). The levels argument of the function factor() can be
used to set the order of the levels of a factor as follows:
#Setting the order of levels of a factor
> factor(c("Medium", "Low", "Medium", "High", "High", "Low",
"Medium"), levels=c("Low", "Medium", "High"))
[1] Medium Low Medium High High Low Medium
Levels: Low Medium High

You can clearly observe the difference between the presentation of the
Levels. In the first example, the levels are printed in alphabetical order, but in
the second example, the levels are printed in the order assigned by us.
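The effect of the levels argument can also be seen from the levels themselves and from the integer codes stored internally for each element. The following is a minimal sketch; the object names status, f_default and f_ordered are introduced here only for illustration:

```r
# Social status data of 7 individuals
status <- c("Medium", "Low", "Medium", "High", "High", "Low", "Medium")

# Factor with default (alphabetical) levels
f_default <- factor(status)

# Factor with levels in the order assigned by us
f_ordered <- factor(status, levels = c("Low", "Medium", "High"))

# Compare the levels of the two factors
print(levels(f_default))
print(levels(f_ordered))

# Each element is stored as the position of its level
print(as.integer(f_ordered))

# Frequency of each level, reported in the order of the levels
print(table(f_ordered))
```

Setting the order of the levels matters for summaries such as table(), which report the categories in the order of the levels.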
Let us next discuss another method of creating a factor in R. A factor in R can
also be created using the gl() function available in the base package. The
three main arguments of interest of the gl() function are n, k and labels.
The n argument is an integer used to assign the number of levels, the k
argument is used to assign the number of replications of each level and the
labels argument is used to set the labels that are to be given to the
levels. To understand it more clearly, let us create a factor with levels Low,
Medium and High, such that each level is replicated 2 times, using the gl()
function as follows:
#Generating a factor of length 6 with 3 levels
> gl(n=3, k=2, labels=c("Low", "Medium", "High"))
[1] Low Low Medium Medium High High
Levels: Low Medium High

Hence, it can be seen that in the created factor, there are 3 levels, each
level is replicated 2 times and the labels are "Low", "Medium" and
"High".

SAQ 5
Generate a factor of length 10 with 2 levels YES and NO. Each level should be
replicated 5 times.

2.7 MISSING VALUES


When any components or data values of an R object are not known, the
unknown values are called missing values. Missing values are also called
'Not Available' values. Obtaining the data corresponding to each sample
unit of a variable is not always possible. Therefore, some of the values are
sometimes marked as missing. In R, the place of a missing value is
reserved by assigning it the special value NA, where NA
stands for "Not Available".
Further, in general, the result of an operation performed on data containing NA
(missing values) is also NA, as the specification of the operation becomes
incomplete. All types of vectors, such as numeric, integer, character and logical
vectors, can have NA values. For the illustration purpose, consider the following
different types of vectors with missing values:

#An integer vector with NA's
> c(1:4, NA, NA, 7:10)
[1] 1 2 3 4 NA NA 7 8 9 10

#A character vector with NA's
> c(NA, "GIRL", NA, "BOY")
[1] NA "GIRL" NA "BOY"

#A logical vector with NA's
> c(TRUE, NA, FALSE, FALSE, NA, NA)
[1] TRUE NA FALSE FALSE NA NA

The testing function is.na(), available in the base package, can be used to
check whether an R object contains NA values or not. It also facilitates
the extraction of non-missing values from the object under consideration. The
output of is.na() always consists of TRUE and FALSE elements. This function
returns TRUE for those elements which are missing and FALSE
for those elements which are available. See for example:

#Performing testing for missing values using is.na() function
> is.na(c(NA, "GIRL", NA, "BOY"))
[1] TRUE FALSE TRUE FALSE

As mentioned earlier, an operation performed on data containing NA
generally results in NA. In order to avoid obtaining NA as the result or
output, we can use the na.rm argument of the considered function, where it is
available (or the na.omit() function). Both ensure the removal of the missing
values before the computation proceeds.
For the illustration purpose, we again consider the earlier created integer
vector c(1:4, NA, NA, 7:10) with missing values. Now, we try to
compute the sum of the elements of this vector using the sum() function as
follows:

#Computing the sum of the elements of a vector
> sum(c(1:4, NA, NA, 7:10))
[1] NA

Observe that the sum of the elements of the vector is coming out as NA, as the
vector contains NA values. So, to compute the sum of the non-missing values of
the vector, we can use the na.rm argument of the sum() function as follows:
#Computing the sum by using the na.rm argument
> sum(c(1:4, NA, NA, 7:10), na.rm=TRUE)
[1] 44

It can be verified that the sum of the non-missing values is 44
(1+2+3+4+7+8+9+10=44). Thus, the na.rm argument works well in the presence of
missing values.
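The na.omit() function mentioned above gives the same result by first dropping the missing values. The following is a small sketch on the same vector; the object names v, m, clean, s and n_missing are chosen only for illustration:

```r
# Integer vector with two missing values
v <- c(1:4, NA, NA, 7:10)

# The mean is NA when missing values are present
print(mean(v))

# Mean of the non-missing values using the na.rm argument
m <- mean(v, na.rm = TRUE)
print(m)

# Dropping the missing values first with na.omit(), then summing
clean <- na.omit(v)
s <- sum(clean)
print(s)

# Counting the missing values: TRUE is counted as 1 by sum()
n_missing <- sum(is.na(v))
print(n_missing)
```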
Next, we discuss NaN values together with NA values. Note that NA is the first
kind of missing value and NaN is the second kind of missing value, which is
produced by numerical computation. Here, NaN stands for 'Not a Number'. For
example, 0/0, Inf/Inf and Inf-Inf all produce NaN
values, since the result cannot be defined sensibly.
We can use the testing function is.nan() to check whether a given R
object contains NaN values or not. It should be noted that a NaN value is
also treated as NA, but the converse is not true. This means that when the
is.na() testing function is used on NaN values, you will get the result TRUE,
but when the is.nan() testing function is used on NA values, you will get the
result FALSE. See the following examples for more clarification:

#Testing for NA values
> is.na(c(10, -3, 0, NA, 5, NaN))
[1] FALSE FALSE FALSE TRUE FALSE TRUE

#Testing for NaN values
> is.nan(c(10, -3, 0, NA, 5, NaN))
[1] FALSE FALSE FALSE FALSE FALSE TRUE
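The computations mentioned above can be tried directly. The sketch below collects them in a hypothetical vector z, together with 1/0 (which gives Inf and is neither NA nor NaN) and an NA, and applies both testing functions:

```r
# Three computations producing NaN, one producing Inf, and one NA
z <- c(0/0, Inf/Inf, Inf - Inf, 1/0, NA)
print(z)

# is.nan() is TRUE only for the NaN values
nan_test <- is.nan(z)
print(nan_test)

# is.na() is TRUE for both NaN and NA values, but not for Inf
na_test <- is.na(z)
print(na_test)
```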

SAQ 6
Write the output of the following:
(i) is.na(c(NA,NaN))
(ii) is.nan(c(NA,NaN))

2.8 RELATIONAL AND LOGICAL OPERATORS


The following relational operators are available for use in R:

Operator   Meaning
==         Equals to
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
!=         Not equals to

In order to illustrate each of the relational operators, let us create two
scalars a and b, and two vectors x and y as follows:
#Assigning scalars a and b
> a <- 10.34; b <- 20.45

#Assigning arbitrary vectors x and y


> x <- c(2, 4, -4, 12)
> y <- c(1, 10, 0, 5)

Next, we first do the comparison between the two scalars a and b using the
relational operators as follows:
#Checking for inequality
> a!=b #10.34 != 20.45
[1] TRUE

#Checking whether a is less than b or not


> a<b #10.34 < 20.45
[1] TRUE

#Checking whether a is greater than or equal to b or not
> a>=b #10.34 >= 20.45
[1] FALSE

Along similar lines, the other relational operators can also be used.
Next, we illustrate the method of checking the relation between the elements
of two vectors using the relational operator '>' (to check whether each element
of x is greater than the correspondingly positioned element of y or not) as
follows:
#Checking greater relation using a relational operator
> x>y
[1] TRUE FALSE FALSE TRUE

Similarly, the other relational operators can be used between x and y.
Also, whenever a relational operator is applied between two vectors, the
obtained result consists of TRUE and FALSE values, which are
computed by element-wise comparison of the vectors. A vector
can also be compared with a scalar. In that case, each element of the
vector is compared with the scalar and the obtained result is a vector
of TRUE and FALSE, i.e., a logical vector (the scalar is replicated until its
length becomes equal to the length of the vector). For the illustration purpose,
consider the following example:
#Comparing each element of a vector with 20
> c(31.45, 40.23, -14.230, 20) <= 20
[1] FALSE FALSE TRUE TRUE
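The replication of the scalar can be made explicit with the rep() function. The following sketch shows that comparing with the scalar 20 gives the same logical vector as comparing with a vector of four 20's; the names v, r1 and r2 are introduced only for illustration:

```r
# Vector to be compared with the scalar 20
v <- c(31.45, 40.23, -14.230, 20)

# Comparison with a scalar (the scalar is replicated automatically)
r1 <- v <= 20

# Equivalent comparison with the replication written out
r2 <- v <= rep(20, length(v))

# Both comparisons give the same logical vector
print(identical(r1, r2))

# Number of elements satisfying the condition (TRUE is counted as 1)
print(sum(r1))
```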

Next, we discuss the following logical operators, which are available for use in
R programming:

Operator   Meaning
!          Logical NOT
|          Element-wise logical OR
||         Logical OR
&          Element-wise logical AND
&&         Logical AND

Note that these logical operators are applied to expressions which
evaluate to TRUE or FALSE. Also, when a number is used as a logical value, a
non-zero number is treated as TRUE and zero is treated as FALSE. For the
illustration purpose, let us first assign a
scalar c as -10, then by using different expressions we shall show the
execution of these logical operators as follows:
#Assigning c
> c <- -10

#Logical NOT
> !(c<1) #As c<1 is True and !TRUE is FALSE
[1] FALSE

#Logical OR
> (c<1) || (c>2) #As c<1 is TRUE and c>2 is FALSE
[1] TRUE #TRUE || FALSE

#Logical AND
> (c<1) && (c>2) #TRUE && FALSE
[1] FALSE

Let us consider two arbitrary logical vectors c(TRUE, FALSE, TRUE,
FALSE) and c(TRUE, TRUE, FALSE, FALSE) to illustrate the execution of the
OR '|' and AND '&' operators on vectors as follows:

#Element wise OR
> c(TRUE, FALSE, TRUE, FALSE) | c(TRUE, TRUE, FALSE, FALSE)
[1] TRUE TRUE TRUE FALSE

#Element wise AND
> c(TRUE, FALSE, TRUE, FALSE) & c(TRUE, TRUE, FALSE, FALSE)
[1] TRUE FALSE FALSE FALSE

In general, the logical operators give the following results:

Elements         Operators
ex1     ex2      !(ex1)   !(ex2)   ex1 || ex2   ex1 && ex2
TRUE    TRUE     FALSE    FALSE    TRUE         TRUE
FALSE   TRUE     TRUE     FALSE    TRUE         FALSE
TRUE    FALSE    FALSE    TRUE     TRUE         FALSE
FALSE   FALSE    TRUE     TRUE     FALSE        FALSE

Here, ex1 and ex2 are expressions which result in TRUE or FALSE.
Now, we illustrate the procedure of extracting elements with the help of
logical vectors. To do so, consider the following arbitrary vector y of 5
elements:
#Creating a vector y
> y <- c(1, 4, 2, 6, 3)

As already discussed, when a relational operator is applied between two
elements/vectors, the obtained result will always be TRUE or FALSE. Note
that a logical vector consisting of TRUE or FALSE can be used to extract
specific elements (satisfying some property or result). For the illustration
purpose, we now extract all those elements of y which are greater than 2. So,
we first use the relational operator and get the logical vector, then use it to
extract those elements which are greater than 2, i.e., those corresponding to
TRUE elements, as follows:
#Obtaining a logical vector using relational operator
> y>2
[1] FALSE TRUE FALSE TRUE TRUE
#Extracting elements of y which are greater than 2
> y[y>2]
[1] 4 6 3

Similarly, other relational operators can also be used to extract a vector of
specific elements from a given vector. This approach of creating a logical
vector can be easily extended to matrices. For the illustration purpose,
consider the following arbitrary matrix A, from which we extract all those
elements which are not equal to 2 as follows:
#Creating an arbitrary matrix A
> A <- matrix(1:16, nrow=4); A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16

Next, we obtain a logical matrix of TRUE and FALSE using the relational
operator as follows:
#Getting a logical matrix
> A!=2
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE TRUE
[2,] FALSE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE

This matrix of TRUE and FALSE created with the help of the relational operator can
further be used for extracting specific elements of the matrix A. For example,
the elements which are not equal to 2 can be extracted easily as follows:
#Extracting elements which are not equal to 2
> A[A!=2]
[1] 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16
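Relational and element-wise logical operators can be combined to extract elements satisfying more than one condition. The following is a minimal sketch using the vector y and the matrix A of this section; the conditions and the names sel, pos and big_even are chosen arbitrarily for illustration, and which() is a base R function returning the positions of the TRUE elements:

```r
# Vector and matrix used in this section
y <- c(1, 4, 2, 6, 3)
A <- matrix(1:16, nrow = 4)

# Elements of y greater than 2 AND less than 6
sel <- y[y > 2 & y < 6]
print(sel)

# Positions (not values) of the elements of y greater than 2
pos <- which(y > 2)
print(pos)

# Elements of A that are greater than 10 AND even
big_even <- A[A > 10 & A %% 2 == 0]
print(big_even)
```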

SAQ 7
Write the output of the following statements:
(i) c(TRUE, FALSE) & c(FALSE, TRUE)
(ii) c(TRUE, FALSE) | c(FALSE, TRUE)
(iii) !c(TRUE, FALSE)
(iv) c(1, 0, 1, 0) > c(0, 2,-1, 2)
(v) x <- c(seq(1,10,2),4,2:10); x[x%%2==0]

2.9 SUMMARY
The main points discussed in this unit are as follows:
Methods of creating different types of vectors and associated vector
  operations are discussed.
Method of creating matrices and associated matrix operations is
  discussed.
Method of creating an array in R is discussed.
Methods of extraction of elements/subparts from vectors, matrices and
  arrays have been discussed in this unit.
Handling of missing values is discussed.
Different types of arithmetic operators, mathematical functions, relational
  and logical operators are discussed.
Method of creating a factor object is also discussed.
Finally, extraction of elements using relational operators is discussed in
  this unit.

2.10 TERMINAL QUESTIONS


1. Write the output of the following code:
round(c(0.234, -1.4532), 2)
2. Define a vector in R. Describe with example any three methods to create
a vector in R.
3. Match the following:
(i) c(1, 2, 3) (a) Integer vector
(ii) c("FALSE", "TRUE", "FALSE") (b) Numeric vector
(iii) c(1L, 3L) (c) Character vector
4. Write 5 arithmetic operators available in R, with their purpose of use.
5. Write the output of the following two statements:
(i) matrix(c(3, 1, 0, 1, 2, 0), nrow=3, ncol=2)
(ii) matrix(c(3, 1, 0,1, 2, 0), nrow=3, ncol=2,
byrow=TRUE)
6. Fill in the blanks:
(i) Matrix addition is performed using the ……….. operator.
(ii) Matrix multiplication is performed using the ……….. operator.
(iii) The number of columns in the matrix() function can be fixed
using the ………….function argument.
(iv) The second element of the fourth row of a matrix A can be extracted
using ………..
7. Consider the following two matrices A and B
A <- matrix(c(1, 0, 0, 1), ncol=2)
B <- matrix(c(1, 1, 1, 1), ncol=2)
Write the output for the following statements:
(i) A*B (ii) A+B (iii) A%*%B (iv) A-B
8. Write the output of the following matrix functions, where A and B are
defined in the previous problem:
(i) t(A)
(ii) dim(A)
(iii) rowSums(A)
(iv) det(A)
(v) diag(A)
(vi) diag(k*A)%*%diag(k), where k=2
(vii) rbind(A,B)
(viii) cbind(A,B)
9. Write any two differences between NA and NaN.

2.11 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The outputs of the given statements are as follows:

(i) The print(x[c(2,5)]) statement will give the following output:

0.1 1.3

(ii) The print(x[-5]) statement will give the following output:

0.20 0.10 -1.21 0.20 1.00

(iii) "numeric"

(iv) The append(x, values=2, after=5) statement gives the


following output:

0.20 0.10 -1.21 0.20 1.30 2.00 1.00

(v) 1.0 1.2 1.4 1.6 1.8 2.0


(vi) 0.20 -1.21 0.20 1.00
(vii) 0.20 0.10 -1.21 0.20 1.30

2. (i) 5 7

(ii) -0.2

(iii) 0.5773503 1.0000000

(iv) 0 0 4 1

3. (a) The code A-C+B%*%C will give the following output:

[,1] [,2]
[1,] 12 14
[2,] 15 17
(b) See section 2.4 for definition and the given matrix can be created as
follows: matrix(c(1, 0, 2, 0, 1, 4, 3, 2, 0), ncol=3)
Or
matrix(c(1, 0, 3, 0, 1, 2, 2, 4,0), 3, 3, byrow=TRUE)
4. B <- array(data=c(2, 1, 3, 4, 1, 1, 0, 1, 0, 1, 2, 0,
1, 2, 1, 0, 1, 1), dim=c(6,3))

The row shown in the rectangular box can be extracted using the following
code: B[3,]

5. gl(n=2, k=5, labels=c("YES", "NO"))

6. (i) TRUE TRUE

(ii) FALSE TRUE


7. (i) FALSE FALSE
(ii) TRUE TRUE
(iii) FALSE TRUE
(iv) TRUE FALSE TRUE FALSE
(v) 4 2 4 6 8 10

Terminal Questions (TQs)


1. 0.23 -1.45
2. See section 2.2
3. (i)-(b), (ii)-(c), (iii)-(a)
4. See section 2.3
5. Given code will generate following two matrices:
(i) [,1] [,2]
[1,] 3 1
[2,] 1 2
[3,] 0 0
(ii) [,1] [,2]
[1,] 3 1
[2,] 0 1
[3,] 2 0
6. (i) + (ii) %*% (iii) ncol (iv) A[4,2]
7. (i) A*B
[,1] [,2]
[1,] 1 0
[2,] 0 1
(ii) A+B
[,1] [,2]
[1,] 2 1
[2,] 1 2

(iii) A%*%B
[,1] [,2]
[1,] 1 1
[2,] 1 1
(iv) A-B
[,1] [,2]
[1,] 0 -1
[2,] -1 0

8. (i) t(A)
[,1] [,2]
[1,] 1 0
[2,] 0 1

(ii) dim(A)
2 2
(iii) rowSums(A)
1 1
(iv) det(A)
1
(v) diag(A)
1 1
(vi) diag(k*A)%*%diag(k)
[,1] [,2]
[1,] 2 2
(vii) rbind(A,B)
[,1] [,2]
[1,] 1 0
[2,] 0 1
[3,] 1 1
[4,] 1 1
(viii) cbind(A,B)
[,1] [,2] [,3] [,4]
[1,] 1 0 1 1
[2,] 0 1 1 1
9. See sec 2.7

UNIT 3
MEMBERSHIP TESTING, COERCION AND LISTS IN R

Structure

3.1 Introduction
    Expected Learning Outcomes
3.2 Membership Testing Functions
3.3 Coercion Functions
3.4 Lists
    Creation of a List
    Lists Subsetting
    Merging Lists Together
3.5 Attributes of Objects
    The names() Function
    The dimnames() Function
    dimensions
    The class() Function
    The length() Function
3.6 Summary
3.7 Terminal Questions
3.8 Solutions/Answers

3.1 INTRODUCTION
In the previous two units (Units 1 and 2) of the MST-015 (Introduction to R Software)
course, you have learnt some important aspects of R programming, such as the
methods of creating R objects, namely, vectors, matrices, factors and arrays.
Additionally, you have studied different types of operators, such as arithmetic
operators, relational operators and logical operators, and learnt the method of
using them on scalars and vectors. Moreover, with the help of the previous unit,
you became familiar with two types of missing values, namely, NA and NaN.

The main objective of the present unit is to make you familiar with a number of
functions used for testing of membership of different R objects. Here, we shall
also discuss a few functions used for coercion of classes of different R objects.
Moreover, we shall discuss the method of creating a list and methods of
extraction of list components (or elements) and merging them. Lastly, different
attributes of R objects are explained in this unit.

Before studying this unit, we expect that you have studied Units 1 and 2 of
MST-015 thoroughly.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes


After completing this unit, you should be able to:
test for the membership of an R object;
coerce the membership of an R object to another;
create a List;
merge two or more lists;
extract components from a list or list elements; and
obtain the attributes of an R object.

3.2 MEMBERSHIP TESTING FUNCTIONS


To test the membership of different objects, a number of testing functions are
available in R. We are listing here some of the most useful testing functions.
Also, in this section, we shall discuss the method of using each one of them
one-by-one.

Testing Function   Objective
is.numeric()       tests if an object is of numeric type.
is.integer()       tests if an object is of integer type.
is.character()     tests if an object is of character type.
is.factor()        tests if an object is of factor type.
is.logical()       tests if an object is of logical type.
is.vector()        tests if an object is a vector.
is.matrix()        tests if an object is a matrix.
is.array()         tests if an object is an array.
is.list()          tests if an object is a list.
is.data.frame()    tests if an object is a data frame.
is.ts()            tests if an object is a time series.

These testing functions enable the users of R to test the membership of an
object. They also allow us to choose different functions or
operations according to the membership of an R object. Also, note that the
output of each of these functions is either TRUE or FALSE. If the obtained output
is TRUE, this means that the supplied argument has the membership for
which it is tested.
There may be situations when we encounter a vector whose membership is
not clearly visible from its elements. In such situations, testing functions help
us to test for the suspected membership of an R object. Recall that in the previous
unit, you have learnt to create different types of vectors, such as numeric,
integer, character and others. Using these testing functions, you can easily
check the membership of a vector. For the illustration purpose, let us create a vector
named x with elements -0.44, 0.02, -0.08, 22.7789, 5.67, 8.09 and -2.2 as
follows:
#Creating a numeric vector
> x <- c(-0.44, 0.02, -0.08, 22.7789, 5.67, 8.09, -2.2)

In the previous unit, you have learnt that, a vector consisting of numeric values
will be of numeric type. The same can be verified using the testing function
is.numeric() as follows:
#Testing for numeric type
> is.numeric(x)
[1] TRUE
Since, the obtained output is TRUE, it confirms that the created vector x is of
numeric type. Let us next observe, what output other testing functions will
give, if we supply x as their argument as follows:
#Testing for integer type
> is.integer(x)
[1] FALSE

#Testing for character type


> is.character(x)
[1] FALSE

#Testing for factor type


> is.factor(x)
[1] FALSE

#Testing for logical type


> is.logical(x)
[1] FALSE

Note that, as the obtained output corresponding to the is.numeric()
function is TRUE and corresponding to the other functions is FALSE,
the testing functions confirm the membership of the vector x to numeric type.
We next test whether the created object x is a vector object or some other
object of R. To do so, we can perform testing for vector, matrix, array, list, data
frame and time series objects. Note that the data frame object will be
discussed in Unit 4 of MST-015 and the list object will be discussed later in this
unit. Just for understanding purpose, for now you can note that lists and data
frames are R objects. Now we perform the testing for the aforementioned objects
as follows:
#Testing for vector object
> is.vector(x)
[1] TRUE

#Testing for matrix object


> is.matrix(x)
[1] FALSE

#Testing for array object


> is.array(x)
[1] FALSE

#Testing for list object


> is.list(x)
[1] FALSE
#Testing for data frame object
> is.data.frame(x)
[1] FALSE
#Testing for time series object
> is.ts(x)
[1] FALSE

From the obtained outputs, you can observe that the output obtained from the
is.vector() testing function is TRUE and the outputs obtained from other
testing functions are FALSE. Thus, testing confirms that x is a vector object.
Note: Having performed testing on a vector object, we next perform testing on
other objects, such as matrix, array, factor, data frame, list and time series,
one by one. We supply each of them as an argument to the testing functions
and observe the outputs.
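The separate type checks shown above can also be collected into a single call. The following is a minimal sketch, not part of the original illustration: the helper name type_tests is our own (hypothetical), and it relies on the sapply() function, which belongs to the apply family covered in Unit 8.

```r
# Hypothetical helper (the name type_tests is our own, not from the unit):
# applies several testing functions to one object and collects the results.
type_tests <- function(obj) {
  tests <- list(numeric   = is.numeric,
                integer   = is.integer,
                character = is.character,
                factor    = is.factor,
                logical   = is.logical)
  # sapply() calls each testing function on obj and
  # returns a named logical vector
  sapply(tests, function(f) f(obj))
}

x <- c(-0.44, 0.02, -0.08, 22.7789, 5.67, 8.09, -2.2)
res_x <- type_tests(x)
print(res_x)
```

Only the entry named numeric should be TRUE for x; the same helper can be applied to any of the objects created in this section.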
To perform membership testing on a matrix object, we create a matrix named
A of order 3x3, with elements 2, 3, 4, 1, 2, 1, 7, 8 and -1 (arranged column
wise) as follows:
#Creating a matrix
> A <- matrix(c(2, 3, 4, 1, 2, 1, 7, 8, -1), ncol=3); A
     [,1] [,2] [,3]
[1,]    2    1    7
[2,]    3    2    8
[3,]    4    1   -1

Next, we perform testing for the membership of a matrix A, consisting of
numeric elements. For this, we supply matrix A as an argument to the testing
functions as follows:
#Testing for numeric type
> is.numeric(A)
[1] TRUE

#Testing for integer type
> is.integer(A)
[1] FALSE

#Testing for character type
> is.character(A)
[1] FALSE

#Testing for factor type
> is.factor(A)
[1] FALSE

#Testing for logical type
> is.logical(A)
[1] FALSE

Note that the output obtained from the is.numeric() testing function is
TRUE and the outputs obtained from other testing functions are FALSE. So,
these testing functions confirm that the matrix A is of numeric type. Next, we
perform testing for different objects as follows:
#Testing for vector object
> is.vector(A)
[1] FALSE

#Testing for matrix object
> is.matrix(A)
[1] TRUE

#Testing for array object
> is.array(A)
[1] TRUE

#Testing for list object
> is.list(A)
[1] FALSE

#Testing for data frame object
> is.data.frame(A)
[1] FALSE

#Testing for time series object
> is.ts(A)
[1] FALSE

From the obtained outputs, we observe that the testing functions
is.matrix(A) and is.array(A) give the output TRUE and the other
testing functions give the output FALSE. The reason behind this is that
matrices are two-dimensional arrays, due to which the testing function
is.array(A) also gives the output TRUE. Thus, from the obtained outputs it is
confirmed that the object named A is a matrix or an array object.
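To see why is.array() also returns TRUE for a matrix, it may help to compare the dimensions of A with those of a three-dimensional array. The sketch below assumes a hypothetical array named arr3, created only for this comparison.

```r
# The matrix A from the text: it has exactly two dimensions
A <- matrix(c(2, 3, 4, 1, 2, 1, 7, 8, -1), ncol = 3)
dim_A <- dim(A)
print(dim_A)

# Hypothetical three-dimensional array, used only for comparison
arr3 <- array(1:24, dim = c(2, 3, 4))
print(is.array(arr3))    # an array with three dimensions
print(is.matrix(arr3))   # not a matrix, since it has more than two dimensions
```

Every matrix is therefore an array, but an array is a matrix only when it has exactly two dimensions.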
Next, we create an arbitrary logical matrix consisting of TRUE and FALSE, and
assign it to B as follows:
#Creating a matrix B
> B <- matrix(c(rep(TRUE,8), rep(FALSE,8)), ncol=4); B
     [,1] [,2]  [,3]  [,4]
[1,] TRUE TRUE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE
[3,] TRUE TRUE FALSE FALSE
[4,] TRUE TRUE FALSE FALSE

So, the appearance of matrix B suggests that it is a logical matrix. We now


confirm it, by using the following testing functions:
#Testing for numeric type
> is.numeric(B)
[1] FALSE

#Testing for integer type
> is.integer(B)
[1] FALSE

#Testing for character type
> is.character(B)
[1] FALSE

#Testing for factor type
> is.factor(B)
[1] FALSE

#Testing for logical type
> is.logical(B)
[1] TRUE

From the obtained outputs, it is observed that, the output of the testing
function is.logical() is TRUE and the output of other testing functions is
FALSE. Hence, the testing confirms that the matrix B is of logical type. Next,
we test for different objects as follows:
#Testing for vector object
> is.vector(B)
[1] FALSE

#Testing for matrix object


> is.matrix(B)
[1] TRUE

#Testing for array object


> is.array(B)
[1] TRUE

#Testing for list object


> is.list(B)
[1] FALSE

#Testing for data frame object


> is.data.frame(B)
[1] FALSE

#Testing for time series object


> is.ts(B)
[1] FALSE

From the obtained outputs, it is confirmed that, the object B is a matrix or array
object.
Next, we perform testing on factor and character objects. To do so, we first
create a factor object named fac using the gl() function discussed in Unit 2
of MST-015. We also create a character vector named Blessed with
elements "BEST" and "WISHES" as follows:
#Creating a factor
> fac <- gl(3, 2); fac
[1] 1 1 2 2 3 3
Levels: 1 2 3

#Creating a character vector
> Blessed <- c("BEST", "WISHES"); Blessed
[1] "BEST"   "WISHES"

Next, we perform testing for factor and character vector, by supplying the fac
and Blessed objects as arguments to the following testing functions:
#Testing for factor
> is.factor(fac)
[1] TRUE

#Testing for character


> is.character(Blessed)
[1] TRUE

On the similar lines testing for the memberships with other objects can be
performed.
Next, we perform testing on an array object and observe the obtained outputs.
To do so, we use a built-in data set, datasets::HairEyeColor. Note that
the HairEyeColor data is available in the datasets package, that is why
we have written datasets with ‘ :: ’ and the name of the data set. This data
consists of the distribution of hair and eye color and sex in 592 statistics
students. For more detail on the data set, you can see the associated R
documentation page, by taking help on this data set as follows:
#Seeking help of HairEyeColor data
> ?HairEyeColor
starting httpd help server ... done

Recall that a built-in data set of R can be used in the current working session
once the associated package has been loaded, either by using the library() or
the require() function. The datasets package is attached by default when R
starts, so the loading step is shown here only for completeness. Let us first
view the HairEyeColor data as follows:
#Loading the datasets package
> require(datasets)
#Viewing the HairEyeColor data
> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

Next, we check its membership with R objects. To do so, we supply the name
of the data set to each of the testing functions as follows:
#Testing for vector object
> is.vector(HairEyeColor)
[1] FALSE
#Testing for matrix object
> is.matrix(HairEyeColor)
[1] FALSE
#Testing for array object
> is.array(HairEyeColor)
[1] TRUE
#Testing for list object
> is.list(HairEyeColor)
[1] FALSE
#Testing for data frame object
> is.data.frame(HairEyeColor)
[1] FALSE

#Testing for time series object
> is.ts(HairEyeColor)
[1] FALSE

Hence, from the obtained outputs it is confirmed that the HairEyeColor data
set is an array object.
Lastly, we supply a list object and a data frame object as function arguments
to the testing functions. The method of creating a list will be discussed later in
this unit, but for now, you must know that a list is created using the list()
function. Let us create a list named ylist with two components (or elements)
class and School using the list() function as follows:
#Creating a list
> ylist <- list(class=c(1, 2, 3), School=c("X", "Y", "Z")); ylist
$class
[1] 1 2 3

$School
[1] "X" "Y" "Z"

Now we perform testing for its membership with R objects as follows:


#Testing for vector object
> is.vector(ylist)
[1] TRUE

#Testing for matrix object
> is.matrix(ylist)
[1] FALSE

#Testing for array object


> is.array(ylist)
[1] FALSE

#Testing for list object


> is.list(ylist)
[1] TRUE

#Testing for data frame object


> is.data.frame(ylist)
[1] FALSE

#Testing for time series object


> is.ts(ylist)
[1] FALSE

The obtained outputs may seem misleading, as both the is.vector() and
is.list() testing functions give the output TRUE, although we know that the
ylist object is a list. The testing function is.vector() gives TRUE because,
by default, its mode argument is specified as "any", which means that
is.vector() returns TRUE for any of the atomic modes as well as for list and
expression, provided the object has no attributes other than names (see
as.vector() as well). In this general sense, a list is also a vector in R. To
tackle this situation, we can additionally test whether the object is an atomic
vector, using the is.atomic() function, as follows:
#Testing for vector by default
> is.vector(ylist, mode="any")
[1] TRUE
#Testing for atomic vector specifically
> is.atomic(ylist)
[1] FALSE

Next, we view the internal structure of the ylist object using the str()
function as follows:
#Viewing the internal structure of ylist object
> str(ylist)
List of 2
 $ class : num [1:3] 1 2 3
 $ School: chr [1:3] "X" "Y" "Z"
The internal structure confirms that ylist is a list having two components.
The first component class is of numeric type and the second component
School is of character type.
Lastly, we perform the testing by supplying a data frame object to the
membership testing functions. For the illustration purpose, we consider the
built-in dataset sleep available in the datasets package. But before that,
we take help on this data set and see its internal structure using str()
function as follows:
#Seeking help on sleep data
> ?sleep
starting httpd help server ... done

#Viewing the internal structure of the sleep data
> str(sleep)
'data.frame': 20 obs. of 3 variables:
 $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
Hence from the help and internal structure of the data it is confirmed that, the
sleep dataset is a data frame object. Let us view first few rows of it as
follows:
#Viewing the sleep data
> head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6

Next, we check its membership with R objects using the testing function as
follows:
#Testing for vector object
> is.vector(sleep)
[1] FALSE
#Testing for matrix object
> is.matrix(sleep)
[1] FALSE
#Testing for array object
> is.array(sleep)
[1] FALSE
#Testing for list object
> is.list(sleep)
[1] TRUE
#Testing for data frame object
> is.data.frame(sleep)
[1] TRUE
#Testing for time series object
> is.ts(sleep)
[1] FALSE

Observe that the outputs obtained from the testing functions
is.data.frame() and is.list() are TRUE and the outputs obtained from
other testing functions are FALSE. Also note that the TRUE output from the
is.data.frame() function was expected. But the output of the is.list()
function is also TRUE, because a data frame is a special type
of list in which every element of the list has the same length. Moreover,
data frames are stored in the memory as lists but wrapped into a data frame
object.
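The statement that a data frame is stored as a list whose elements all have the same length can be checked directly. A minimal sketch, using the built-in sleep data; the object names storage and col_lengths are our own, and sapply() is discussed in Unit 8.

```r
# sleep is a built-in data frame from the datasets package
storage <- typeof(sleep)               # underlying storage type
print(storage)

# Length of each list element, i.e., of each column
col_lengths <- sapply(sleep, length)
print(col_lengths)

# All columns have the same length, which equals the number of rows
print(all(col_lengths == nrow(sleep)))
```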
Next, we test for a time series object using the is.ts() testing function. For
the sake of convenience, consider the BJsales data available in the
datasets package. If you take help on this dataset, you will find that it is a
time series data. The same fact can be verified using the is.ts() function. We
perform the membership testing as follows:

#Testing for time series object
> is.ts(BJsales)
[1] TRUE

Similarly, testing for the membership with other objects can be performed and
it can be verified that the data is of numeric type.
Note: When the membership testing functions for numeric, integer, character,
factor and logical are used on list or data frame objects, the output will be
FALSE, because a list or a data frame is a container whose components (in the
case of a list) or columns (in the case of a data frame) may belong to different
classes.
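If the type of the individual columns of a data frame is of interest, the testing functions can be applied to each column in turn. A minimal sketch, assuming the sapply() function of Unit 8 and using object names of our own choosing:

```r
# Testing the whole data frame: FALSE, as stated in the note above
whole_is_numeric <- is.numeric(sleep)
print(whole_is_numeric)

# Testing each column separately
col_is_numeric <- sapply(sleep, is.numeric)
col_is_factor  <- sapply(sleep, is.factor)
print(col_is_numeric)
print(col_is_factor)
```

For the sleep data, only the column extra is numeric, while group and ID are factors, in agreement with the str() output shown earlier.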

SAQ 1
Consider the factor object fac and a character vector Blessed created in
Section 3.2. Perform membership testing using all the discussed testing
functions and verify the fact that factors are not vectors.

3.3 COERCION FUNCTIONS
We may encounter a situation in which we would like to combine elements or
vectors of different classes under the same name. In such a situation, implicit
type conversion takes place. By implicit coercion, we mean that no specific
command is given by us to change the class or membership of an object.
Whenever implicit coercion takes place, it coerces a vector, a matrix or an
array in accordance with the highest precision of their elements. The coercion
rule can be viewed from the following figure.

[Figure: coercion hierarchy, from lowest to highest precision: Logical, Integer,
Numeric, Character]

For the illustration purpose, let us create a vector by mixing a numeric element
with a character element. Implicit coercion then takes place and the output will
be a character vector, due to the higher precision of character than numeric.
#Mixing numeric and character elements
> c(1.7, "a")
[1] "1.7" "a"

Next, we create a vector by mixing a logical element with a numeric element.
Implicit coercion then takes place and the output will be a numeric vector, due
to the higher precision of numeric than logical.
#Mixing logical and numeric elements
> c(TRUE, 2)
[1] 1 2

Next, we create a vector by mixing a character element “a” with a logical


element “TRUE”. In this case the output will be a character vector due to
higher precision of character than logical.

#Mixing character and logical elements


> c("a", TRUE)
[1] "a" "TRUE"

For more clarification, let us create two vectors of different types, say a vector
n of numeric type and another vector s of character type with arbitrary
elements as follows:
#Creating a numeric vector
> n <- c(2, 3, 5)

#Creating a character vector
> s <- c("a", "b", "c")

Next, we combine them using the c() function to create a single vector as
follows:
#Concatenating two vectors of different types
> c(n, s)
[1] "2" "3" "5" "a" "b" "c"

Next, we bind n and s, column wise and row wise as follows:


#Binding n and s column wise
> cbind(n, s)
n s
[1,] "2" "a"
[2,] "3" "b"
[3,] "5" "c"

#Binding n and s row wise


> rbind(n, s)
[,1] [,2] [,3]
n "2" "3" "5"
s "a" "b" "c"

From the obtained outputs, you can observe that implicit type conversion
takes place while binding the vectors row-wise and column-wise. Whether we
bind them row-wise or column-wise, the obtained output will be a character
matrix of a suitable order, due to the higher precision of character than
numeric.
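The coercion rule can be checked for each step of the hierarchy by inspecting the class of the combined vector. A minimal sketch; the object name mixes and its element names are our own, and the suffix L (as in 2L) is R's notation for an integer constant.

```r
# Class of the result when elements of different types are combined
mixes <- c(
  logical_integer   = class(c(TRUE, 2L)),           # logical with integer
  integer_numeric   = class(c(2L, 3.5)),            # integer with numeric
  numeric_character = class(c(3.5, "a")),           # numeric with character
  all_four          = class(c(TRUE, 2L, 3.5, "a"))  # all four types together
)
print(mixes)
```

In each case the result takes the class that stands higher in the hierarchy, and a single character element is enough to turn the whole vector into a character vector.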
Next, we illustrate the explicit type of coercion. Note that, explicit coercion is
not done by the software. We give a coercion function command to change the
class or membership of an object to another. We are listing here some of the
most useful coercion functions with their objectives:
Coercion Function     Objective
as.numeric()          coerce an object to numeric.
as.integer()          coerce an object to integer.
as.character()        coerce an object to character.
as.factor()           coerce an object to factor.
as.logical()          coerce an object to logical.
as.vector()           coerce an object to vector.
as.matrix()           coerce an object to matrix.
as.array()            coerce an object to array.
as.list()             coerce an object to list.
as.data.frame()       coerce an object to data frame.
as.ts()               coerce an object to time series.

To illustrate how these coercion functions work, we shall take different types
of objects and coerce them into another type or class of object. Let us first
create an integer vector named x with elements 0 to 5 as follows:
#Creating an integer vector
> x <- 0:5

After creating it, we now coerce it to a character vector using the coercion
function as.character() as follows:
#Coercing an integer vector to a character vector
> as.character(x)
[1] "0" "1" "2" "3" "4" "5"

Note that the as.character() function (used on x) returns a character vector,
but x itself is not overwritten, as a coercion command does not change the
class of x. The vector x is still of integer type, as x is not overwritten here.
This can be verified as follows:
n her
#Testing x for integer type
> is.integer(x)
[1] TRUE
#Testing x for character type
> is.character(x)
[1] FALSE

Additionally, if we want to implement the changes in the original x vector, then


it should be overwritten using the assignment operator, while using the
coercion function as follows:
#Assigning a vector x
> x <- 0:5
#Reassigning x while coercing
> x <- as.character(x)
Next, we again test for its membership as follows:
#Testing for integer type
> is.integer(x)
[1] FALSE
#Testing for character type
> is.character(x)
[1] TRUE

Hence, due to overwriting, the x vector has now become a character vector,
i.e., a vector of character type.
Next, we coerce the numeric and logical vectors to integer vectors. To do so,
we take two arbitrary vectors, one with decimal points (numeric vector) and the
other with TRUE and FALSE (logical vector) as follows:
#Coercing a numeric vector to an integer vector
> as.integer(c(0.006, 7.2, -1.45, 2.01))
[1]  0  7 -1  2

#Coercing a logical vector to an integer vector
> as.integer(c(TRUE, FALSE, TRUE, TRUE))
[1] 1 0 1 1

Next, we coerce a numeric vector of 0 and 1 to a logical vector using the


as.logical() function as follows:
#Coercing a numeric vector to a logical vector
> as.logical(c(0, 1, 0, 1, 1, 1))
[1] FALSE TRUE FALSE TRUE TRUE TRUE
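The as.logical() function is not restricted to the values 0 and 1. As a minimal sketch (the object names from_num and from_chr are our own): any non-zero number is coerced to TRUE, and character strings such as "TRUE", "T" and "false" are also recognised, whereas other strings give NA.

```r
# Numeric input: zero becomes FALSE, every non-zero value becomes TRUE
from_num <- as.logical(c(0, 1, 2, -3.5))
print(from_num)

# Character input: recognised spellings are converted, others give NA
from_chr <- as.logical(c("TRUE", "T", "false", "yes"))
print(from_chr)
```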

Next, instead of taking vector objects, we shall take a matrix object for the
illustration purpose. Consider the following arbitrary matrix named A of order
2x4.
#Creating a matrix A
> A <- matrix(1:8, nrow=2, ncol=4); A
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

We next coerce this matrix object A to a data frame object using the
as.data.frame() function. We shall also overwrite matrix A while coercing
as follows:
#Coercing matrix A to a data frame object and overwriting A
> A <- as.data.frame(A); A
  V1 V2 V3 V4
1  1  3  5  7
2  2  4  6  8

Note that, we have used the as.data.frame() function on a matrix object to


coerce it to a data frame object. As the names of the rows and columns were
not specified therefore default names of the columns are printed as V1, V2, V3
and V4 and default names of the rows are printed as 1 and 2, respectively.
This conversion can be verified using the testing function is.data.frame()
as follows:
#Checking whether A is coerced to data frame object or not
> is.data.frame(A)
[1] TRUE
We shall next discuss the as.ts() function. This function is used to coerce
an object to time series. For the illustration purpose, consider a vector x
consisting of elements 7, 8, 10, 4 and 5. We can coerce this vector to a time
series as follows:
#Coercing x to time series
> x <- c(7, 8, 10, 4, 5)
> as.ts(x)
Time Series:
Start = 1
End = 5
Frequency = 1
[1]  7  8 10  4  5
Next, we present some illustrations, where coercion does not make sense.
Recall that, we have already discussed that character has the highest
precision. Generally, if we try to coerce a higher precision class into a lower
precision class, we will get a warning message. For the illustration purpose, let
us create a character vector with elements ‘a’, ‘b’ and ‘c’ and assign it to x as
follows:
#Creating a character vector
> x <- c("a", "b", "c")
Next, we try to coerce this character vector to logical and numeric vectors
using respective coercion functions as follows:
#Coercing a character vector to a logical vector
> as.logical(x)
[1] NA NA NA

#Coercing a character vector to a numeric vector


> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion

From the above outputs, you can observe that we get NAs as the output
(together with a warning message in the case of as.numeric()), as this
coercion is not possible. It also means that the values after coercion are not
available.
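Coercion from character to numeric does succeed for those elements that look like numbers; only the remaining elements become NA. A minimal sketch, in which suppressWarnings() is used merely to keep the output free of the warning message, and the object names y and z are our own:

```r
# None of the elements can be read as a number: all become NA
x <- c("a", "b", "c")
y <- suppressWarnings(as.numeric(x))
print(is.na(y))

# A mixed character vector: numeric-looking strings are converted,
# the remaining element becomes NA
z <- suppressWarnings(as.numeric(c("1.5", "20", "abc")))
print(z)
```

The is.na() function can thus be used after coercion to locate the elements that could not be converted.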
Note: The is.numeric() function tests the mode or membership, not the
class, but as.numeric() function coerces to the class.

Similarly, other explicit coercion functions can be explored by you. In the next
section, we discuss list object of R programming.

SAQ 2
Write the output of the following code:
as.data.frame(matrix(1:9, nrow=3, dimnames=
list(c("x1","x2","x3"),c("y1","y2","y3"))))

3.4 LISTS
List is an R object. It consists of an ordered collection of objects, which are
known as its components. In some situations, lists are very useful specifically,
when we are required to combine a collection of different types of objects
under the same name (so in lists various components which may be referred
as its elements need not be of the same type). As earlier discussed, this
facility is not available in the case of vectors, matrices and arrays. A list could
consist of the following objects as its components:
Numeric vectors/elements
Logical vectors/elements
Character vectors/elements
Matrices
Data frames
Lists
Functions, to name a few.
In this section, we first define a list then discuss the method of creating a list
and the method of extraction of its components and specific element.
Additionally, we discuss the procedure of merging two or more lists. Let us
discuss each one of them one-by-one.
3.4.1 Creation of a List
A list object in R is created using the list() function. You should note that
the elements of a list (which can be referred to as components of a list as well)
are always numbered. With the help of these numbers, list components
as well as particular element(s) of a list component can be referred to. We first
illustrate the method of creating a list of 4 components and name it Std.
These four components of Std consist of the details of two students, namely,
Deepika and Advait. Its first component consists of the names of the students,
the second component consists of the semester in which they are studying,
i.e., VI, the third component consists of their roll numbers, 50 and 03. Lastly,
the fourth component displays the average marks of the students, i.e., 80 and
89, respectively.
#Creating a list
> Std <- list(c("Deepika", "Advait"), "VI", c(50, 03), c(80, 89)); Std

[[1]]
[1] "Deepika" "Advait"

[[2]]
[1] "VI"

[[3]]
[1] 50  3

[[4]]
[1] 80 89
After creating Std, we next verify whether the created object is a list object or
any other object by using the testing function is.list() as follows:
#Testing for list
> is.list(Std)
[1] TRUE
Hence, the output confirms that the created object Std is a list object.

3.4.2 Lists Subsetting


In this subsection, we discuss the method of lists subsetting. Before that, it is
important to know the difference between ‘ [[…]] ’ and ‘ […] ’ operators.
Recall that we have already used the ‘ […] ’ operator, while extracting
element(s), vector(s) and submatrices in Unit 2 of MST-015. Note that the
‘ [[…]] ’ operator is used to select/extract a single component of a list,
whereas ‘ […] ’ is a general subscripting operator which, when applied to a
list, returns a sub-list containing the selected component(s).
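The difference between the two operators can be seen by applying both to the same list. A minimal sketch, recreating the list Std of Section 3.4.1; the object names single, sub and two are our own.

```r
Std <- list(c("Deepika", "Advait"), "VI", c(50, 3), c(80, 89))

single <- Std[[1]]     # the component itself: a character vector
sub    <- Std[1]       # a list of length 1 containing that component
two    <- Std[c(1, 4)] # a sub-list holding the 1st and 4th components

print(class(single))
print(class(sub))
print(length(two))
```

Here single is a character vector, whereas sub and two are still lists. This is why ‘ [[…]] ’ is used when the contents of one component are needed.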
Next, we illustrate two different methods of extraction of the components of a
list. In the first method we use ‘ [[…]] ’ operator and in the second method
we use the name of the list components together with ‘ $ ’ operator. The
second approach is generally used when the list components are named. Let
us first illustrate the first method. To do so, we consider the already created list
object Std and extract each one of its four components one-by-one using
‘ [[…]] ’ operator as follows:
#Extracting the 1st component
> Std[[1]]
[1] "Deepika" "Advait"
#Extracting the 2nd component
> Std[[2]]
[1] "VI"
#Extracting the 3rd component
> Std[[3]]
[1] 50 3
#Extracting the 4th component
> Std[[4]]
[1] 80 89

Next, we illustrate the method of extracting the jth element of the kth
component, i.e., Std[[k]][j]. To work on that, we extract the 2nd element of
the 1st component and 1st element of the 3rd component from Std as follows:
#Extracting the 2nd element of the 1st component
> Std[[1]][2]
[1] "Advait"
#Extracting the 1st element of the 3rd component
> Std[[3]][1]
[1] 50

Next, we illustrate the method of extraction of list components using the name
of the list components and ‘ $ ’ operator. To do so, we first give names to the
list components of Std to make it more self-describing (as presently the list
components are not self-describing) as follows:
#Naming the list components
> Std <- list(Name=c("Deepika", "Advait"), Semester="VI",
Rollno=c(50,03), Marks=c(80,89)); Std

$Name
[1] "Deepika" "Advait"

$Semester
[1] "VI"

$Rollno
[1] 50 3

$Marks
[1] 80 89

From the obtained output, it is verified that each component of the list is
properly named. After naming the components, we next extract each
component of the created list Std one-by-one using components names and
‘ $ ’ operator as follows:
#Extracting the 1st component
> Std$Name
[1] "Deepika" "Advait"
#Extracting the 2nd component
> Std$Semester
[1] "VI"
#Extracting the 3rd component
> Std$Rollno
[1] 50 3
#Extracting the 4th component
> Std$Marks
[1] 80 89

Moreover, there is one more method of extraction of list components. A list


component can also be extracted using the name of the components in the
‘ [[…]] ’ operator as follows:
#Extracting all the four components of Std one-by-one
> Std[["Name"]]
[1] "Deepika" "Advait"

> Std[["Semester"]]
[1] "VI"

> Std[["Rollno"]]
[1] 50  3

> Std[["Marks"]]
[1] 80 89
Note: It is important to note that, in the last two methods, we have only
discussed the procedure of extracting list components. The element(s) of the
extracted list components can be easily referred by appending ‘ […] ’ after ‘ $ ’
operator or ‘ [[…]] ’ operator as discussed earlier.
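The note above can be illustrated with the named list Std. A minimal sketch; the object names second_marks and first_name are our own.

```r
Std <- list(Name = c("Deepika", "Advait"), Semester = "VI",
            Rollno = c(50, 3), Marks = c(80, 89))

second_marks <- Std$Marks[2]       # 2nd element of the Marks component
first_name   <- Std[["Name"]][1]   # 1st element of the Name component

print(second_marks)
print(first_name)
```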

3.4.3 Merging Lists Together


Two or more lists in R can be combined using the concatenation function c()
in R. For the illustration purpose, we create two lists named, list1 and
list2; and merge them using the c() function as follows:
#Creating two lists
> list1 <- list(1:3)
> list2 <- list(letters[1:3])
#Merging two lists together
> c(list1, list2)
[[1]]
[1] 1 2 3

[[2]]
[1] "a" "b" "c"
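It is worth contrasting c() with list() here, since the two behave differently when their arguments are lists. A minimal sketch; the object names merged and nested are our own.

```r
list1 <- list(1:3)
list2 <- list(letters[1:3])

merged <- c(list1, list2)     # one flat list with two components
nested <- list(list1, list2)  # a list whose two components are themselves lists

print(length(merged))
print(is.list(merged[[1]]))   # the 1st component is an integer vector
print(is.list(nested[[1]]))   # the 1st component is again a list
```

Both results have two components, but only c() merges the lists. The list() function wraps each list as a separate component instead.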

Consider another example in which we create a list with different objects (as
components) like vector, matrix, list and data frame and name it Lst as
follows:
#Creating a list with different objects
> data <- sleep
> Lst<-list(c(1986,2022), c("T", "K"), matrix(rep(1,4),ncol=2),
list("A", "P"), data)
> Lst
[[1]]
[1] 1986 2022

[[2]]
[1] "T" "K"

[[3]]
[,1] [,2]
[1,] 1 1
[2,] 1 1

[[4]]
[[4]][[1]]
[1] "A"

[[4]][[2]]
[1] "P"
[[5]]
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3

4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10

Hence, a list with name Lst is created, whose 1st component is a numeric
vector, the 2nd component is a character vector, the 3rd component is a
numeric matrix of order 2x2, the 4th component is a list and the last component
is the built-in data frame sleep available in the datasets package.

SAQ 3
Consider the list named Lst created in Section 3.4. Extract its 2nd component
using all the three methods discussed in this section.

3.5 ATTRIBUTES OF OBJECTS
In this section, we shall discuss the following attributes of R objects:
names()
dimnames()
dimensions
class()
length()
We shall discuss each of these attributes one-by-one with the help of suitable
examples. Let us first discuss the names() function.
3.5.1 The names() Function
Names of the R objects can be set using the names() function available in
the base package. Setting names is very useful for writing self-describing
and readable code. When this function is used alone on an R object, it
returns the names of the R object. Note that the names() function accepts
different objects as argument, such as vector, matrix, list and data frame.
Moreover, when the names() function is used with the assignment operator
‘ <- ’ and a character vector of up to the same length as the object, it sets
the names of the R object.
Vector argument supplied to names() function:
Consider the first example in which we assign the names to a vector object
pin consisting of the pin codes of different places.
#Creating a vector of pin codes
> pin <- c(110092, 110032, 201301, 122001, 302001); pin
[1] 110092 110032 201301 122001 302001

After creating a vector named pin, we now illustrate the method of naming
corresponding pin codes using the names() function and the assignment
operator ‘<-’ as follows:
#Naming the pin codes
> names(pin) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon", "Jaipur"); pin
Anand Vihar    Shahdara       Noida     Gurgaon      Jaipur
     110092      110032      201301      122001      302001

The obtained output confirms that the names to the pin codes are successfully
assigned and now elements are more self-describing.
After setting the names, we next illustrate the method of getting the names of
an R object using the names() function. To do so, we simply supply the pin
vector as argument to the names() function as follows:
#Getting the names
> names(pin)
[1] "Anand Vihar" "Shahdara" "Noida" "Gurgaon"
"Jaipur"

Next, we discuss the method of removing the names of the elements of a


vector object. It can be easily done by assigning NULL as names in the names
setting command as follows:
#Removing the names of vector elements
> names(pin) <- NULL
#Verifying the removal of names
> pin
[1] 110092 110032 201301 122001 302001
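Names are useful not only for display: as long as they are set, the elements of a vector can also be extracted by name. A minimal sketch, recreating the named pin vector; the object names noida_pin and two_pins are our own.

```r
pin <- c(110092, 110032, 201301, 122001, 302001)
names(pin) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon", "Jaipur")

noida_pin <- pin["Noida"]                   # a single element by name
two_pins  <- pin[c("Shahdara", "Jaipur")]   # several elements by name

print(noida_pin)
print(two_pins)
```

Once the names are removed by assigning NULL, this kind of extraction is no longer possible and only positional indexing remains.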

Matrix argument supplied to names() function:
We now create the following arbitrary matrix of pin codes and assign it to P.
P = [ 110092 (Anand Vihar)   201301 (Noida)
      110032 (Shahdara)      122001 (Gurgaon) ]
#Creating a matrix of pin codes
> P <- matrix(c(110092, 110032, 201301, 122001), ncol=2); P
       [,1]   [,2]
[1,] 110092 201301
[2,] 110032 122001

Next, we assign names to each of its elements using the names() function.
#Assigning names to matrix elements
> names(P) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon"); P
       [,1]   [,2]
[1,] 110092 201301
[2,] 110032 122001
attr(,"names")
[1] "Anand Vihar" "Shahdara"    "Noida"       "Gurgaon"

Hence, we have successfully named each of the matrix elements.


Note: Using the names() function we can also get the names of the matrix
elements as we did in case of vectors.
A list argument supplied to names() function:
We have already discussed one method of naming the components of a list.
The names of the list components can also be set using the names()
function. To illustrate that we create a list named Lst2 of pocket money (Rs.
22200, Rs. 23000, Rs. 15010, Rs. 10000) of four students Pooja, Barkha,
Shrawanti and Shivam, respectively as follows:
#Creating a list of pocket money
> Lst2 <- list(22200, 23000, 15010, 10000); Lst2

[[1]]
[1] 22200

[[2]]
[1] 23000

[[3]]
[1] 15010

[[4]]
[1] 10000

After creating the list, we next set the names of the components of the list
Lst2 using names() function as follows:
#Setting the names of the list components
> names(Lst2) <- c("Pooja", "Barkha", "Shrawanti", "Shivam"); Lst2

$Pooja
[1] 22200

$Barkha
[1] 23000

$Shrawanti
[1] 15010

$Shivam
[1] 10000
A data frame argument supplied to names() function:
Next, we supply a data frame object as an argument to the names() function
and illustrate the method of getting the names of a data frame object. Consider
the built-in data set sleep discussed in the beginning of this unit. Using the
membership testing function, we have already shown that the sleep data is a
data frame object. Let us now supply the sleep data as argument to the
names() function to get the names of the columns of the data frame as
follows:
#Getting the names of the columns of the sleep data
> names(sleep)
[1] "extra" "group" "ID"

The names of the sleep data can be overwritten using the names() function
on the same lines as discussed earlier.
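
As a minimal sketch of such overwriting, we may work on a copy of the sleep data so that the built-in data set itself stays untouched. The object name sleep2 and the new column names below are hypothetical and chosen only for this illustration.

```r
# Working on a copy (sleep2 is a hypothetical name)
sleep2 <- sleep

# Overwriting the column names (the new names are arbitrary)
names(sleep2) <- c("increase", "drug", "patient")
print(names(sleep2))

# The original data frame keeps its own names
print(names(sleep))
```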

3.5.2 The dimnames() Function


Recall that in Unit 2 of MST-015, you have learnt to set the row and
column names using the dimnames argument of the matrix() function. The
dimnames() function available in the base package is used to retrieve or set
the dimension names of R objects, such as matrix, array or data frame. Note
that the dimension names must be in the form of a list. To understand the
method of using this function, consider the following example in which we first
create a matrix named Mat with the following elements:
1 4 7
2 5 8
3 6 9
Then we set the names of the rows and columns of the created matrix Mat as
(R1, R2, R3) and (C1, C2, C3), respectively, using the dimnames() function
as follows:
#Creating a matrix
> Mat <- matrix(1:9, 3, 3); Mat
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

#Setting rows and columns names


> dimnames(Mat) <- list(c("R1","R2","R3"), c("C1","C2","C3")); Mat
C1 C2 C3
R1 1 4 7
R2 2 5 8
R3 3 6 9

There may be situations in which we only want to set the row names or the
column names of a created matrix. In such situations, we use the
rownames() and colnames() functions, respectively, and supply the created
matrix as argument to these functions. For the illustration purpose, let us
create another arbitrary matrix MatE as follows:
#Creating a matrix
> MatE <- matrix(seq(2, 12, 2), 3, 2); MatE
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12

After creating it we next set the names of the rows using the rownames()
function as follows:
#Naming the rows only
> rownames(MatE) <- c("R1","R2","R3"); MatE
[,1] [,2]
R1 2 8
R2 4 10
R3 6 12

Clearly, the names to the rows of MatE are successfully assigned. Next, we
set the names of the columns using the colnames() function as follows:
#Naming the columns of MatE
> colnames(MatE) <- c("C1", "C2"); MatE
C1 C2
R1 2 8
R2 4 10
R3 6 12

Note that both the row and the column names appear in the obtained output.
The row names were already set by the previously used rownames() function
command, and the colnames() function command only adds the column names
without disturbing the existing row names.
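
The same partial naming can also be sketched with dimnames() itself, by placing NULL at the position of the dimension that should stay unnamed. This is an illustration on a fresh copy of MatE and not a part of the session above.

```r
# Re-creating the matrix without any dimension names
MatE <- matrix(seq(2, 12, 2), 3, 2)

# NULL in the first position leaves the rows unnamed
dimnames(MatE) <- list(NULL, c("C1", "C2"))
print(rownames(MatE))
print(MatE[, "C1"])

# Adding the row names afterwards keeps the column names
rownames(MatE) <- c("R1", "R2", "R3")
print(dimnames(MatE))
```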
Next, we illustrate the method of setting names of the rows and columns of a
data frame. To do so, we consider the first three rows of the sleep data.
Assign the extracted data to MD and then set names of the rows and columns
using the dimnames() function as follows:
#Extracting and assigning the first 3 rows of the sleep data
> MD <- sleep[1:3,]; MD
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3

Now we illustrate the method of setting the names of the rows and columns as
("R1", "R2", "R3") and ("C1", "C2", "C3") respectively of a data frame
object MD as follows:

#Overwriting existing names of the rows and columns
> dimnames(MD) <- list(c("R1","R2","R3"), c("C1","C2","C3")); MD
C1 C2 C3
R1 0.7 1 1
R2 -1.6 1 2
R3 -0.2 1 3

The dimnames() function can also be used to get names of the rows and
columns of a data frame. In that case the obtained output will be a list object
whose first component consists of the names of the rows and whose second
component consists of the names of the columns. For the illustration
purpose let us get names of the rows and columns of MD data frame as
follows:
#Getting the names of the rows and columns of MD data frame
> dimnames(MD)
[[1]]
[1] "R1" "R2" "R3"

[[2]]
[1] "C1" "C2" "C3"

Note: The names of the rows of a data frame can be extracted using the
row.names() function and the names of the columns of a data frame can be
extracted using the names() function.

3.5.3 Dimensions
The dimension of R objects like matrices and data frames can be obtained
using the dim() function. This function is already discussed in Unit 2 of
MST-015. In Unit 2 we supplied a matrix object as an argument to the dim()
function. Recall that this function returns the number of rows and columns of a
matrix. Similarly, we can supply a data frame object as its argument. For the
illustration purpose, let us supply a data frame object MD as its argument as
follows:
#Getting the dimensions of a data frame
> dim(MD) #As MD <- sleep[1:3,]
[1] 3 3

Note that, we are getting the output as 3 3. It means that the MD data frame
consists of 3 rows and 3 columns.
Note: Two separate functions, nrow() and ncol() can also be used on a
data frame (on the similar lines as on a matrix) to explicitly get the number of
rows and number of columns of a data frame object.
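
A minimal sketch comparing dim() with nrow() and ncol() is given below. It re-creates MD from the sleep data, and the last line is added only to show that a plain vector has no dim attribute.

```r
# First three rows of the built-in sleep data
MD <- sleep[1:3, ]

# dim() returns both counts together
print(dim(MD))

# nrow() and ncol() return them separately
print(nrow(MD))
print(ncol(MD))

# A plain vector has no dimensions, so dim() returns NULL
print(dim(c(1, 2, 3)))
```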

3.5.4 The class() Function


The class() function in R is used to obtain the class of an R object. When
the class() function is used on a vector object, it gives the type of its
elements, for example "numeric", "logical" or "character", while a factor
object has the class "factor". Other possible values of the output of class()
function are "matrix", "array", "list" and "data.frame".
The class of an object plays a very important role, as it allows for an
object-oriented style of programming in R. For example, if an object has class
"matrix", it will be printed in a certain way, and the graphic functions may
likewise plot it in a certain way.
#Class of a numeric vector
> class(c(-0.44, 0.02, -0.08))
[1] "numeric"
#Class of a logical vector
> class(c(TRUE, FALSE))
[1] "logical"
#Class of a character vector
> class(c("a", "b", "c"))
[1] "character"
#Class of a matrix object
> class(matrix(c(2, 1, 4, 3), nrow=2))
[1] "matrix" "array"
#Class of a data frame object
> class(sleep)
[1] "data.frame"
#Class of a list object
> class(list(Name="Deepika", Rollno=01, Marks=99,
Tag="Excellent Student"))
[1] "list"

Note: The effect of the class (if necessary) can be removed temporarily by
using the unclass() function.

3.5.5 The length() Function
The length() function available in the base package is used to get or set
the length of vectors and factors. A list object can also be supplied as its
argument. Other objects on which its execution is defined can also be supplied
as argument to this function.
Let us now supply different objects one-by-one as argument to this function
and observe the obtained output.
#Supplying a vector argument
> length(c(-3, -2, -1, 0, 1, 2, 3, NA))
[1] 8
#Supplying a character vector as argument
> length(c("Aa", "Bb", "Cc", "Dd"))
[1] 4
#Supplying a factor argument
> length(factor(c(1, 1, 1, 2, 3, 2, 3, 1)))
[1] 8
#Supplying a list
> length(list(22200, 23000, 15010, 10000))
[1] 4

Note that when a vector or a factor object is supplied to the length()
function, it returns the number of elements. But when a list object is supplied
as an argument to the length() function, it returns the number of list
components. The length() function can also be used to set, increase or
decrease the length of an already defined object. For the illustration purpose,
let us increase the length of a four-element character vector.

#Creating an arbitrary character vector


> y <- c("Aa", "Bb", "Cc", "Dd"); y
[1] "Aa" "Bb" "Cc" "Dd"

#Increasing the length of y


> length(y) <- 7
> print(y)
[1] "Aa" "Bb" "Cc" "Dd" NA NA NA

Hence, the length of y has been increased from 4 to 7. Next, we decrease the
length of the y vector to 2 using the length() function as follows:

#Decreasing the length


> length(y) <- 2
> print(y)
[1] "Aa" "Bb"

Note: When the set length (n) is more than the actual length (r) of the
vector, (n-r) elements are appended at the end, all of which are NA values.
Further, when the set length (n) is less than the actual length (r) of the
vector, the last (r-n) elements are truncated.
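
The padding and truncation described in the note can be sketched on a numeric vector and on a list as follows. The objects x and L are hypothetical and used only for this illustration.

```r
# A hypothetical numeric vector of length r = 3
x <- c(10, 20, 30)

# Setting n = 5 appends (n - r) = 2 NA values
length(x) <- 5
print(x)

# Setting n = 2 truncates the vector to its first two elements
length(x) <- 2
print(x)

# For a list, the appended components are NULL instead of NA
L <- list("a", "b")
length(L) <- 3
print(L)
```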

Furthermore, an R object may or may not have attributes. They can be
accessed using the attributes() function. This function returns NULL if an
R object does not possess any attribute. For the illustration purpose we check
the attributes of the data set sleep, which is a data frame object, as follows:

#Checking attributes of a data frame object


> attributes(sleep)
$names
[1] "extra" "group" "ID"

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
[17] 17 18 19 20
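
For comparison, the following minimal sketch shows the attributes of a plain vector, of a matrix and of a named vector. The objects v and M are hypothetical and used only for this illustration.

```r
# A plain vector possesses no attributes
v <- c(1.5, 2.5, 3.5)
print(attributes(v))

# A matrix carries the dim attribute
M <- matrix(1:6, nrow = 2)
print(attributes(M))

# A single attribute can be read with attr()
print(attr(M, "dim"))

# Naming the elements of v adds the names attribute
names(v) <- c("a", "b", "c")
print(attributes(v))
```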

SAQ 4
Consider the following students' data:
Name                Score
Deepika Sangwan     98
Shivam              97
Anupam Pandey       85
Anadi Vishist       90
Brijesh             85
Siddharth Tondon    82
Harshvardhan        98
Harshit             96
Shivani             85
Monalisa            97

Write R statements to do the following tasks:


(i) Create a list of the data and set the names of its components
using names() function.
(ii) Remove the names of the components.

3.6 SUMMARY
The main points discussed in this unit are as follows:
We have discussed several membership testing functions available in R.
Implicit and explicit types of coercion have been discussed.
Different explicit type coercion functions available in R have been discussed.
The procedure of creating a list has been discussed.
Different methods of extracting list components and elements have been discussed.
Different types of attributes of R objects have been discussed with examples.

3.7 TERMINAL QUESTIONS


1. State whether the following statements are TRUE or FALSE:
(i) A data frame is a special type of list in which every element of the list
has the same length.
(ii) is.factor() is used for testing integer type.
(iii) is.logical() can be used for testing logical type.
(iv) Matrices are two-dimensional arrays.
(v) The names() function can be used on list and a data frame.

(vi) The effect of the class (if necessary) can be removed temporarily by
using the function unclass().
(vii) The length() function can’t be used to increase the length of an
already defined vector.
2. Fill in the blanks:
(i) The internal structure of an R object can be viewed using ……
function.
(ii) Testing for a data frame object is done using ………function.
(iii) The is.matrix() function has/have ……….function argument(s).
(iv) The is.list() function is used to test for ……..
(v) Mixing of character elements and integer elements in a vector
results……………..
(vi) Mixing of character elements and logical elements in a matrix
results……………..
(vii) The output of as.integer(c(1.1, 0.1, -3.4)) is ………
3. Create a vector named w with elements 1.1, 0.1, -3.4, 0.7, 1.8 and 2.2.
Test for its membership with the numeric type of vector and coerce it to
an integer vector.
4. What will be the output of the following R command:
matrix(c(1, FALSE, 0, TRUE),ncol=2)
5. Write R command to get the row names and column names of the
sleep{datasets} data in a single line command.
6. Consider the following data set:
Tree    S. No.    Age    Circumference
1       1         108    43
1       2         494    98
1       3         654    106
2       4         108    42
2       5         494    98
2       6         654    113
3       7         108    52
3       8         494    88
3       9         654    102

Write R code to perform the following tasks:


(i) Create a list named TREE_PP by using the above data set. The list
should have three components tree number, tree age (a numeric
vector giving the age of the tree) and its circumferences (a numeric
vector of trunk circumferences in mm). These 3 components should
be written under the names, tree_no, tree_age, tree_cir.
Further, the age component should be defined using a replication
function.
(ii) Extract its tree_age and tree_cir components.
(iii) Extract the tree circumference entries corresponding to tree age 654
from TREE_PP.

7. Consider the warpbreaks data given as a built-in data frame in R. The
first few lines of the data frame are shown here with the help of a screen
shot. Using this data, write the output of the following code:
names(warpbreaks)

8. Write the output of the following statements:


(i) class(c(FALSE, TRUE, FALSE))
(ii) length(c(rep(1,4), rep(2,9)))

3.8 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The output will be as follows:

is.numeric(fac)       FALSE    is.numeric(Blessed)       FALSE
is.integer(fac)       FALSE    is.integer(Blessed)       FALSE
is.character(fac)     FALSE    is.character(Blessed)     TRUE
is.factor(fac)        TRUE     is.factor(Blessed)        FALSE
is.logical(fac)       FALSE    is.logical(Blessed)       FALSE
is.vector(fac)        FALSE    is.vector(Blessed)        TRUE
is.matrix(fac)        FALSE    is.matrix(Blessed)        FALSE
is.array(fac)         FALSE    is.array(Blessed)         FALSE
is.list(fac)          FALSE    is.list(Blessed)          FALSE
is.data.frame(fac)    FALSE    is.data.frame(Blessed)    FALSE
is.ts(fac)            FALSE    is.ts(Blessed)            FALSE

Since the is.vector(fac) function command gives the result FALSE,
it is verified that the factor objects are not vectors.
2. The output of the given code is as follows:
y1 y2 y3
x1 1 4 7
x2 2 5 8
x3 3 6 9
3. We first give names ComI, ComII, ComIII, ComIV and ComV to the
components of the given list Lst as follows:
Lst<-list(ComI=c(1986,2022), ComII=c("T", "K"),
ComIII=matrix(rep(1,4),ncol=2), ComIV=list("A", "P"),
ComV=sleep)
Then we can extract its 2nd component using these 3 different ways:
Lst[[2]] or Lst[["ComII"]] or Lst$ComII
4. We can create a list named StdLst and give names to its components
as follows:
StdLst<-list(c("Deepika Sangwan", "Shivam", "Anupam Pandey",
"Anadi Vishist", "Brijesh", "Siddharth Tondon", "Harshvardhan",
"Harshit", "Shivani", "Monalisa"),
c(98, 97, 85, 90, 85, 82, 98, 96, 85, 97)); StdLst
Then we can give names to its components as follows:
names(StdLst)<-c("Name", "Score"); StdLst
Finally, we can remove the assigned names using the following
command:
names(StdLst)<-NULL

Terminal Questions (TQs)


1. (i) TRUE (ii) FALSE (iii) TRUE (iv) TRUE
(v) TRUE (vi) TRUE (vii) FALSE
2. (i) str() (ii) is.data.frame() (iii) one
(iv) a list object (v) Character vector (vi) Character vector
(vii) 1, 0, -3
3. Required answer is as follows:
w<-c(1.1, 0.1, -3.4, 0.7, 1.8, 2.2)
is.numeric(w)
as.integer(w)
4. Output
[,1] [,2]
[1,] 1 0
[2,] 0 1
5. The names of the rows and columns of the sleep data can be obtained
in a single command using the dimnames(sleep) function.
6. (i) We first create a list using the following command:
TREE_PP <- list(tree_no = rep(1:3, each=3),
tree_age = rep(c(108, 494, 654), 3),
tree_cir = c(43, 98, 106, 42, 98, 113, 52, 88, 102))
(ii) Then tree_age and tree_cir can be extracted as follows:
TREE_PP[[2]]; TREE_PP[[3]]
(iii) Here the which() function can be used to find the indices at which the
condition is TRUE, and the obtained result is then used for the extraction:
which(TREE_PP$tree_age==654)
The corresponding output will be
[1] 3 6 9
Then we use these element numbers to extract the entries
corresponding to tree age 654 as follows:
TREE_PP[[3]][which(TREE_PP$tree_age==654)]
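As an alternative sketch, the logical condition can be used directly as the index, without calling which(). The list is re-created here with named components so that the block runs on its own, and the object name cir_654 is hypothetical.

```r
# Re-creating the list with named components
TREE_PP <- list(tree_no  = rep(1:3, each = 3),
                tree_age = rep(c(108, 494, 654), 3),
                tree_cir = c(43, 98, 106, 42, 98, 113, 52, 88, 102))

# The logical vector selects the 3rd, 6th and 9th circumference entries
cir_654 <- TREE_PP$tree_cir[TREE_PP$tree_age == 654]
print(cir_654)
```

Both approaches give the same entries here because the age component has no missing values.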
7. "breaks" "wool" "tension"
8. (i) "logical" (ii) 13
UNIT 4
DATA FRAMES, READING AND
WRITING IN R

Structure
4.1 Introduction
    Expected Learning Outcomes
4.2 Data Frames
    Creation of a Data Frame
    Data Frame Subsetting
    The attach() and detach() Functions
    Ordering, Sorting and Ranking Functions
    The split() Function
4.3 Formatting Commands
    The print() Function
    The paste() and paste0() Functions
    The cat() Function
4.4 Data Reading from a File
4.5 Writing Data to a File
4.6 Dates and Times
4.7 Summary
4.8 Terminal Questions
4.9 Solutions/Answers

4.1 INTRODUCTION
In Unit 2 of MST-015 (Introduction to R Software) course, you have learnt the
method of creating and using R objects, namely, vectors, matrices, arrays and
factors. In the same unit, you also got familiar with the arithmetic operators,
relational operators and logical operators. Thereafter, in Unit 3 you learnt the
method of creating a list object and extraction of its components or elements
of the components. In Unit 3 you have also studied a number of membership
testing and coercion functions.
In the beginning of this unit, we shall discuss the method of creating a data
frame, subsetting of a data frame, ordering, sorting and ranking functions. We
shall also discuss some commonly used functions in which data frame is
supplied as an argument to the function. Later in this unit, we shall make you
familiar with some formatting command functions such as print(),
paste(), paste0() and cat() functions. For the data analysis purpose, it
is important to know the way of reading data from different file formats (such
as .txt, .csv, .delim and .xlsx) and write the data to a file of specific format. So,
we discuss different functions used for reading and writing from/to a file. We
shall also discuss the commonly used date and time functions, namely,
as.Date(), ISOdatetime(), as.POSIXct() and as.POSIXlt() in this unit.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes


After completing this unit, you should be able to:
create a data frame;
extract a row, column or specific element from a data frame;
learn the use of attach() and detach() functions;
perform several operations on data frames;
read from a file;
write to a file;
use date and time functions to generate time sequences; and
obtain the output in a specific format.

4.2 DATA FRAMES


Matrices, lists and data frames facilitate the user to include any number of
columns or variables under a single name. The data frame can be defined as
an R object with rows and columns. In fact, a data frame is a special type of
list in which every component of the list has same length. The data frames are
stored in the memory as lists but wrapped into data frame objects.
The functioning of the data frames is quite similar to that of matrices. That
is why they are quite easy to handle and manipulate. In any data frame, the
rows consist of the different units or samples or observations and the columns
consist of the variables of the data used for analysis purpose. Whenever data
is read into R, it is generally read as a data frame, so it is almost impossible
for a user performing data analysis not to encounter a data frame. Therefore,
it is important for the users to know how to write or read data and manipulate
it in R.
When a matrix is created or handled in R, we encounter one restriction that
the matrix elements should be of the same type or class. But this is not the
restriction with the data frames. The columns of a data frame can belong to
different classes. The columns of a data frame could be numeric, character,
factor, logical or could be calendar dates and so on.
Before analysing any data, it is very important to enter the data in correct
manner. The key to entering your data properly is to first select the main
variables and then place their values into the columns of the data frame.
The columns of a data frame are vectors and they must all be of the same
length. Consequently, all the rows of the data frame are also of the same size.
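The statements that a data frame is a special list and that its columns may belong to different classes can be checked with a small sketch. The data frame df and its contents are hypothetical and used only for this illustration.

```r
# A hypothetical data frame with three columns of different classes
df <- data.frame(id    = 1:3,
                 label = c("a", "b", "c"),
                 flag  = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# A data frame is stored as a list
print(is.list(df))

# Each column keeps its own class
print(sapply(df, class))

# Coercing to a matrix forces a single type on all the elements
print(as.matrix(df))
```

The coercion to a matrix converts every element to character type, which is exactly the restriction on matrices mentioned above.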
4.2.1 Creation of a Data Frame
A data frame in R is created using the data.frame() function available in
the base package. For the illustration purpose, we now create a data frame of
the admission data of the following six students to a specific programme of
IGNOU.

Name        Gender    Percentage    Age>30
Shreyash    Male      88.55         TRUE
Prithu      Male      80.13         FALSE
Yuvaan      Male      85.31         FALSE
Advika      Female    75.22         FALSE
Pawan       Male      65.04         TRUE
Pehu        Female    NA            FALSE

In the given data there are four columns. The first three columns consist of
the names, genders and percentages of marks of the students. The fourth column
is of logical type, indicating whether the age of the student is more than 30 or
not. The rows of the given data are the sample units, sometimes referred to as
observations. Note that the first column consisting of the names of the
students is of character type, the second column consisting of gender
information is a categorical variable of character type, the third column
consisting of the percentage of marks of the students is of numeric type and
the fourth column, i.e., age is of logical type as it consists of TRUE and FALSE.
Also, note that the given data consists of a missing value corresponding to the
percentage of marks of the student Pehu.
Next, we create a data frame of the given data using the data.frame()
function and name it Adm.data as follows:
#Creating and assigning a data frame
> Adm.data <- data.frame(
+ c("Shreyash","Prithu","Yuvaan","Advika","Pawan","Pehu"),
+ as.factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
+ c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
+ c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))

Now a data frame named Adm.data is created. Observe that while creating
this data frame the gender variable is coerced to a factor using the coercion
function as.factor(). Additionally, the columns of the data frame do not yet
have meaningful names, since none were supplied. In Unit 3 of MST-015 course,
you have learnt to set names of the columns of a data frame using the names()
function. Let us use the names() function here to set the names of the columns
of Adm.data as Name, Gender, Percentage and AgeG30 as follows:
#Setting the column names
> names(Adm.data) <- c("Name","Gender","Percentage","AgeG30")
#Printing the data frame
> print(Adm.data)
Name Gender Percentage AgeG30
1 Shreyash Male 88.55 TRUE
2 Prithu Male 80.13 FALSE
3 Yuvaan Male 85.31 FALSE
4 Advika Female 75.22 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE

Hence, the output confirms that the names to the columns of the data frame
are correctly assigned. After creating a data frame and assigning the names,
next we view the internal structure of the Adm.data data frame using the
str() function as follows:
#Internal structure of the data frame
> str(Adm.data)
'data.frame': 6 obs. of 4 variables:
$ Name : chr "Shreyash" "Prithu" "Yuvaan" "Advika" ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1
$ Percentage: num 88.5 80.1 85.3 75.2 65 ...
$ AgeG30 : logi TRUE FALSE FALSE FALSE TRUE FALSE

From the internal structure, it is clear that the obtained information consists of
the complete details on the 6 observations of 4 columns or variables (whose
names are written after the '$' operator). The output depicts that the Name
variable is a character variable (as it is specified by chr), the Gender variable
is a factor variable with two levels (as it is specified by Factor), the
Percentage variable is a numeric variable (as it is specified by num) and the
last variable AgeG30 is a logical variable (as it is specified by logi).
Note: For the sake of convenience the column names are set in the
Adm.data data frame, as these variable names facilitate column extraction
and enhance readability and reference.
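Alternatively, the column names may be supplied while creating the data frame itself, by writing each column in the name = value form. The following sketch is one possible way of doing so with the same admission data; it is not the approach used in the session above.

```r
# Creating the data frame with the column names given in the call itself
Adm.data <- data.frame(
  Name       = c("Shreyash", "Prithu", "Yuvaan", "Advika", "Pawan", "Pehu"),
  Gender     = as.factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  Percentage = c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
  AgeG30     = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keeps Name as a character column in older R versions too
)

# The names are available immediately
print(names(Adm.data))
str(Adm.data)
```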
Next, we discuss some commonly used functions which are used on data
frames:

Function       Objective
head()         gets the first n (with default value 6L) rows of a data frame.
tail()         gets the last n (with default value 6L) rows of a data frame.
nrow()         gets the number of rows of a data frame.
ncol()         gets the number of columns of a data frame.
names()        sets and gets the column names of a data frame.
row.names()    sets and gets the row names of a data frame.
dim()          gets the dimension of a data frame.
summary()      computes the summary of each of the variables of a data frame
               (when used on the data frame).
str()          gets the internal structure of the data frame (when data frame is
               supplied as an argument).
rowSums()      computes the vector of row sums of a data frame.
rowMeans()     computes the vector of row means of a data frame.
colSums()      computes the vector of column sums of a data frame.
colMeans()     computes the vector of column means of a data frame.

Note: The str(), head() and tail() functions are available in the utils
package and other mentioned functions are available in the base package.
Recall that in Unit 3, you have already learnt the use of the names() and
row.names() functions. So, we do not discuss these functions in detail
again. To illustrate the use of these functions, we again consider the created
Adm.data data frame and supply it as an argument to these functions as
follows:
#Getting the number of rows of a data frame
> nrow(Adm.data)
[1] 6

#Getting the number of columns of a data frame
> ncol(Adm.data)
[1] 4

#Getting the number of rows and columns (dimension) together
> dim(Adm.data)
[1] 6 4

#Getting the names of the columns of a data frame
> names(Adm.data)
[1] "Name" "Gender" "Percentage" "AgeG30"

#Getting the names of the rows of a data frame
> row.names(Adm.data)
[1] "1" "2" "3" "4" "5" "6"

The first argument of the head() and tail() functions is the name of the
data frame and the second argument n is the number of rows to be viewed.
For the illustration purpose, we now supply the Adm.data data frame as an
argument to the head() and tail() functions to get the default number of
rows and a specified number of rows as follows:
#Getting first six rows (by default)
> head(Adm.data)
Name Gender Percentage AgeG30
1 Shreyash Male 88.55 TRUE
2 Prithu Male 80.13 FALSE
3 Yuvaan Male 85.31 FALSE
4 Advika Female 75.22 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE

#Getting or viewing the first two rows
> head(Adm.data,2)
Name Gender Percentage AgeG30
1 Shreyash Male 88.55 TRUE
2 Prithu Male 80.13 FALSE

#Getting last six rows (by default)
> tail(Adm.data)
Name Gender Percentage AgeG30
1 Shreyash Male 88.55 TRUE
2 Prithu Male 80.13 FALSE
3 Yuvaan Male 85.31 FALSE
4 Advika Female 75.22 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE

#Getting last three rows
> tail(Adm.data,3)
Name Gender Percentage AgeG30
4 Advika Female 75.22 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE

Next, we discuss the summary() function. For numeric or integer type data,
this function gives a summary of 6 numbers, namely, minimum, 1st quartile,
median, mean, 3rd quartile and maximum. For character type data it gives the
length, class and mode. For logical type data it gives the mode and the counts
of TRUE and FALSE, and for factor type data it gives the count of each level.
Observe that the summary() function also gives the count of NA's. To verify
it, we now supply the Adm.data data frame to get the summary on each column
of Adm.data data frame as follows:
#Getting summary of each column of the data frame
> summary(Adm.data)
     Name              Gender    Percentage       AgeG30
 Length:6           Female:2   Min.   :65.04   Mode :logical
 Class :character   Male  :4   1st Qu.:75.22   FALSE:4
 Mode  :character              Median :80.13   TRUE :2
                               Mean   :78.85
                               3rd Qu.:85.31
                               Max.   :88.55
                               NA's   :1

Hence, the obtained output confirms the aforementioned statements, since
the Name variable is of character type, the Gender variable is of factor type,
the Percentage variable is of numeric type and the AgeG30 variable is of
logical type.
Let us next recall from the previous units that the str() function is used to
get the internal structure of the supplied argument. This function is already
discussed after the creation of a data frame Adm.data.
Note: The colSums(), rowSums(), colMeans() and rowMeans()
functions are used on the same lines on data frames as on matrices. Refer to
Unit 2 for more details on these functions. Importantly, note that these
functions also have the na.rm argument to handle missing values.
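The effect of the na.rm argument can be seen in the following minimal sketch. The data frame scores and its values are hypothetical and used only for this illustration.

```r
# A hypothetical numeric data frame with one missing value
scores <- data.frame(test1 = c(1, 2, NA), test2 = c(4, 5, 6))

# Without na.rm, a column containing NA gives NA
print(colMeans(scores))

# With na.rm = TRUE, the missing value is left out of the computation
print(colMeans(scores, na.rm = TRUE))
print(rowSums(scores, na.rm = TRUE))
```

Note that these functions require all the columns to be numeric, so for a data frame like Adm.data they would be applied only to its numeric columns.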
Built-in data sets
A number of built-in data sets are available in R, which come as part of the
base packages during installation of R software. The details on each data set
can be easily obtained by taking help on it. A list of data sets available to
use can be viewed using the data() function command. Running the data()
function displays the built-in data sets available in the loaded packages.

Note that when you write the data() command, you will be able to view all
the data sets whose libraries are already loaded to the working environment.
Otherwise, by default, you will view the data sets available in the datasets
package. Moreover, the data() function can also be used to load a specific
data set. To do so, either call the package first using the require() or
library() function, or write the data() function command with the package
name. For the illustration purpose we now view the data sets available in the
datasets and MASS¹ libraries together as follows:
SS
#Viewing the data sets available in the datasets and MASS
#libraries
> library("MASS")
> data()

By default, the data sets available in the datasets package appear. So if
you wish to include data sets of other packages as well, include those
packages first and then write the data() command. Or otherwise, we can
view the built-in data sets particularly available in a specific package, say
MASS, as follows:
#Particularly viewing the data sets of MASS package
> data(package = "MASS")
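
The listing returned by data() can also be stored and inspected like any other R object. The following is a minimal sketch, assuming only the datasets package that comes with R; the object name ds is hypothetical.

```r
# Storing the listing of the data sets of the datasets package
ds <- data(package = "datasets")

# The results component is a character matrix with one row per data set
print(colnames(ds$results))
print(head(ds$results[, "Item"]))

# Loading one specific data set by its name
data("USArrests", package = "datasets")
print(dim(USArrests))
```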

4.2.2 Data Frame Subsetting


In this subsection, we shall discuss the methods of extracting rows, columns
and element(s) of a data frame. While working with a data frame for analysis
purpose, we may encounter a situation in which we are interested in the
specific part or subpart of a data frame. So, to deal with such a situation we
now discuss the method of data frame subsetting by taking some suitable
illustrations. Consider the following general layout:
df.name[row.indices, col.indices]
In this layout, df.name is the name of the data frame, row.indices is a
vector consisting of the row numbers written at margin 1 and col.indices is
a vector consisting of the column numbers of the data frame written at margin
2, which are to be extracted. For the illustration purpose we now extract the
2nd, 5th and 6th rows of the Adm.data as follows:
¹ https://CRAN.R-project.org/package=MASS

#Extracting 2nd, 5th and 6th rows


> Adm.data[c(2,5,6), ]
Name Gender Percentage AgeG30
2 Prithu Male 80.13 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE

Note that a blank space left at the column indices place indicates that
all the columns need to be selected. Also, the rows will be extracted in the
same order in which they are written. For example:
#Extracting rows in different order
> Adm.data[c(6,2,5), ]
Name Gender Percentage AgeG30
6 Pehu Female NA FALSE
2 Prithu Male 80.13 FALSE
5 Pawan Male 65.04 TRUE

Some general techniques of subsetting


Again, it is important to observe that, for matrices as well as
data frames, a blank space at the column indices place indicates that all the
columns are to be selected. Similarly, a blank space at the row indices place
indicates that all the rows are to be selected.
Consider the following useful commands for extracting rows, columns and
elements of a data frame named df.name:

df.name[c(i,j,k), ]
Extracts the ith, jth and kth rows, while selecting all
the columns of a data frame.

df.name[ ,1:m]
Extracts the first m columns, while selecting all the rows of a
data frame.

df.name[ ,c(i,j)]
Extracts the ith and jth columns, while selecting all the rows
of a data frame.

Along similar lines, rows, columns and subparts of a data
frame can also be selected using logical conditions. For example:

df.name[(m>n),c(i,j)]
Extracts the ith and jth columns of those rows of a data frame
which satisfy the logical condition (m>n).


In addition, particular rows and columns can be
dropped by writing a negative sign in front of the row and column indices.
For example, the 4th column of a data frame can be dropped as follows:

df.name[ ,-4]
Drops the 4th column of a data frame, while selecting all the
rows of a data frame.
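
The general subsetting forms listed above can be tried on a small example. The following is a minimal sketch on a hypothetical data frame; the data frame, its column names and the result names are assumptions made only for illustration and are not part of the admission data.

```r
# A small hypothetical data frame used only for illustration
df.name <- data.frame(id    = 1:5,
                      score = c(62, 80, 45, 91, 73),
                      group = c("a", "b", "a", "b", "a"),
                      pass  = c(TRUE, TRUE, FALSE, TRUE, TRUE))

rows.135  <- df.name[c(1, 3, 5), ]                # 1st, 3rd and 5th rows, all columns
first.two <- df.name[ , 1:2]                      # first two columns, all rows
high      <- df.name[df.name$score > 70, c(1, 2)] # rows with score > 70, columns 1 and 2
no.fourth <- df.name[ , -4]                       # all rows, 4th column dropped

high
```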

Next, for a deeper understanding, we consider a built-in data frame named
USArrests available in the datasets package. You can seek help on
this data frame as follows:
#Seeking help on a data frame
> ?USArrests
starting httpd help server ... done

Also, we can have a look at the data frame by simply writing its name on
the R console. The first few rows of the data frame are as follows:
[Screenshot: first few rows of the USArrests data frame]

To illustrate subsetting, we now extract the first six rows of the 2nd and 4th
columns of the USArrests data frame by writing row and column indices of
the data frame as follows:
#Extracting subpart of a data frame
> USArrests[1:6, c(2,4)]
Assault Rape
Alabama 236 21.2
Alaska 263 44.5
Arizona 294 31.0
Arkansas 190 19.5
California 276 40.6
Colorado 204 38.7
Next, we extract the 2nd column of the USArrests data frame.
#Extracting 2nd column of the data frame
> USArrests[ ,2] #Or USArrests$Assault
[1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113
[15] 56 115 109 249 83 300 149 255 72 259 178 109 102 252
[29] 57 159 285 254 337 45 120 151 159 106 174 279 86 188
[43] 201 120 48 156 145 81 53 161

Note: A particular column can also be extracted by using the ‘ $ ’ operator
between the data frame name and the column/variable name.
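
Observe that extracting a single column in this way returns a plain vector rather than a data frame. If a one-column data frame is wanted instead, the drop argument of the ‘ [ ’ operator can be set to FALSE. A minimal sketch follows; the object names col.vec and col.df are our own.

```r
col.vec <- USArrests[ , 2]                # a vector of length 50
col.df  <- USArrests[ , 2, drop = FALSE]  # a data frame with one column

is.data.frame(col.vec)
is.data.frame(col.df)
identical(col.vec, USArrests$Assault)
```

Keeping the data frame structure is useful when the result is passed on to functions that expect row and column names.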
Let us next extract the 4th row of the USArrests data frame.
#Extracting 4th row of the data frame
> USArrests[4,]
Murder Assault UrbanPop Rape
Arkansas 8.8 190 50 19.5

Next, we extract the 2nd element of the 3rd row of the USArrests data frame
(the highlighted element in the screenshot).
[Screenshot: USArrests data frame with the element in row 3, column 2 highlighted]

The highlighted element from the USArrests data frame can be extracted
using the following statement.
#Extracting a particular element
> USArrests[3,2]
[1] 294

Lastly, we extract all those rows of the data frame for which either the
Assault variable is more than 250 or the Murder variable is more than 16,
and select only the first three columns, as follows:
#Extracting using logical condition
> USArrests[USArrests$Assault>250|USArrests$Murder>16, 1:3]
Murder Assault UrbanPop
Alaska 10.0 263 48
Arizona 8.1 294 80
California 9.0 276 91
Florida 15.4 335 80
Georgia 17.4 211 60
Maryland 11.3 300 67
Michigan 12.1 255 74
Mississippi 16.1 259 44
Nevada 12.2 252 81
New Mexico 11.4 285 70
New York 11.1 254 86
North Carolina 13.0 337 45
South Carolina 14.4 279 48
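
The same selection can also be written with the subset() function of the base package, which lets us refer to the columns by name inside the condition. This is offered only as an alternative sketch and is not the method used above; the object names by.brackets and by.subset are our own.

```r
# Selection using the bracket notation, as in the text
by.brackets <- USArrests[USArrests$Assault > 250 | USArrests$Murder > 16, 1:3]

# The same selection using subset(); 'select' chooses the columns
by.subset <- subset(USArrests, Assault > 250 | Murder > 16, select = 1:3)

# Both approaches give the same rows and columns
all.equal(by.brackets, by.subset)
```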

4.2.3 The attach() and detach() Functions


The attach() and detach() functions come as part of the base package.
The attach() function is used to attach a database to the R search path.
Once a database is attached, it is searched by R while
evaluating a variable, so that the objects in the database can be accessed by
simply mentioning their names. The database could be a data frame, a list,
a created data file or an environment.
For illustration, we now attach the columns of the USArrests
data frame using the attach() function. We can then access the columns using only
their names (otherwise we need to use the ‘ $ ’ operator) as follows:
#Attaching USArrests data frame
> attach(USArrests)

After attaching the data frame, we can access the columns of the
USArrests data frame by using the column names only as follows:
#Accessing the Murder and Assault variables
> Murder #Otherwise USArrests$Murder
[1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3
[12] 2.6 10.4 7.2 2.2 6.0 9.7 15.4 2.1 11.3 4.4 12.1
[23] 2.7 16.1 9.0 6.0 4.3 12.2 2.1 7.4 11.4 11.1 13.0
[34] 0.8 7.3 6.6 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2
[45] 2.2 8.5 4.0 5.7 2.6 6.8

> Assault #Otherwise USArrests$Assault
[1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113
[15] 56 115 109 249 83 300 149 255 72 259 178 109 102 252
[29] 57 159 285 254 337 45 120 151 159 106 174 279 86 188
[43] 201 120 48 156 145 81 53 161

An attached database can be detached using the detach() function available in
R. This function removes the database from the search path of available R
objects as follows:
#Detaching the USArrests data frame
> detach(USArrests)

It is the user's responsibility to detach an attached data frame once the
work is over. Once a data frame is detached, its columns can no longer be
accessed just by writing the column names. For illustration, after
detaching the data frame, we now try to access its columns by their names
and see what we get.
#Accessing columns after detaching the data frame
> Murder
Error: object 'Murder' not found
> Assault
Error: object 'Assault' not found

Thus, after detaching the data frame, we are not able to access the variables
by simply writing the column names. In the next subsection, we discuss the
ordering, sorting and ranking functions.
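
Before moving on, note that the with() function of the base package offers a way of using column names directly without attaching anything, so nothing needs to be detached afterwards. The following is a minimal sketch of this alternative; the object names mean.murder and high.assault are our own.

```r
# Evaluate expressions using the columns of USArrests directly,
# without attaching the data frame to the search path
mean.murder  <- with(USArrests, mean(Murder))
high.assault <- with(USArrests, sum(Assault > 250))

mean.murder
high.assault
```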

4.2.4 Ordering, Sorting and Ranking Functions


The rows of a data frame can be rearranged using the
order() function. The sort() and rank() functions are used to get a sorted
column and to get the ranks of the column elements, respectively. All these
functions are available in the base package. We shall discuss each of these
functions one-by-one.
The order() function returns the permutation which rearranges a column of a
data frame into ascending or descending order. Using this permutation as the
row indices sorts the whole data frame by that column.
Note: We are generally interested in sorting a data frame by rows.
For illustration, we consider the first 10 rows of the USArrests
data frame and assign them to data. We then sort this subpart of the USArrests
data frame, i.e., data, according to its Murder variable as follows:
#Assigning first 10 rows of the USArrests data frame to data
> data <- USArrests[1:10, ]

Next, we illustrate the colSums(), rowSums(), colMeans() and
rowMeans() functions. We use these functions on the data frame data and
compute the vectors of column sums, row sums, column means and row
means as follows:
#Computing a vector of column sums of data
> colSums(data)
Murder Assault UrbanPop Rape
99.0 2357.0 694.0 280.1
#Computing a vector of row sums of data
> rowSums(data)
Alabama Alaska Arizona Arkansas California
328.4 365.5 413.1 268.3 416.6
Colorado Connecticut Delaware Florida Georgia
328.6 201.4 331.7 462.3 314.2

#Computing a vector of column means of data
> colMeans(data)
Murder Assault UrbanPop Rape
9.90 235.70 69.40 28.01

#Computing a vector of row means of data
> rowMeans(data)
Alabama Alaska Arizona Arkansas California
82.100 91.375 103.275 67.075 104.150
Colorado Connecticut Delaware Florida Georgia
82.150 50.350 82.925 115.575 78.550

Next, we sort the data frame data according to its Murder variable. To do
so, we first compute the order of the Murder variable using the order()
function. We also append the computed order to data using the
‘ $ ’ operator, naming it OrderM, as follows:
#Computing the order of the Murder column and appending it to
#data
> data$OrderM <- order(data$Murder)
> print(data)
Murder Assault UrbanPop Rape OrderM
Alabama 13.2 236 58 21.2 7
Alaska 10.0 263 48 44.5 8
Arizona 8.1 294 80 31.0 6
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Colorado 7.9 204 78 38.7 5
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 1
Florida 15.4 335 80 31.9 9
Georgia 17.4 211 60 25.8 10

Note that one more column (the fifth column) named OrderM, consisting of the
order of the rows, is now appended to data. Also, by default this function gives
the output in ascending order (which means the rows will be arranged in
ascending order of the Murder variable). We now try to understand the obtained
output. The values shown under the OrderM variable indicate that the smallest
value of the Murder variable, i.e., 3.3, is present in the 7th row (as OrderM[1]=7) of
the data frame data and the largest value of the Murder variable, i.e., 17.4, is
present in the 10th row (as OrderM[10]=10) of the data frame. The other
elements of the OrderM column can be interpreted along similar lines.
Next, we sort the rows of the data frame data according to the Murder
variable, using the computed OrderM column at row indices place of the data
frame data as follows:
#Sorting data according to Murder variable

> data[data$OrderM, ] #Arranging rows according to OrderM
Murder Assault UrbanPop Rape OrderM
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 1
Colorado 7.9 204 78 38.7 5
Arizona 8.1 294 80 31.0 6
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Alaska 10.0 263 48 44.5 8
Alabama 13.2 236 58 21.2 7
Florida 15.4 335 80 31.9 9
Georgia 17.4 211 60 25.8 10

From the obtained output we observe that all the rows of the data are now
rearranged according to the Murder variable of data, due to the OrderM
variable. Hence the data frame data is sorted according to the Murder
variable. On the similar lines the data can be sorted according to any column
of the data frame.
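
The appended OrderM column is helpful for understanding the output, but it is not required for sorting. The following minimal sketch passes the result of order() directly as the row indices and also shows the decreasing argument of order(); the object names asc and desc are our own.

```r
data <- USArrests[1:10, ]

# Rows in ascending order of the Murder variable
asc <- data[order(data$Murder), ]

# Rows in descending order of the Murder variable
desc <- data[order(data$Murder, decreasing = TRUE), ]

rownames(asc)[1]    # state with the smallest Murder value
rownames(desc)[1]   # state with the largest Murder value
```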
Next, we discuss the rank() and sort() functions. The ranks of the
elements of any column of a data frame, or of a vector, can be obtained using
the rank() function, just by supplying the column as an argument to the
function. Additionally, a column of a data frame can be sorted using the
sort() function. Consider the data frame data again.
#Extracting data
> data <- USArrests[1:10,]

For comparison, we first append the order of the Assault
variable to data with the name AsO.
#Appending computed order of the Assault variable
> data$AsO <- order(data$Assault)

Next, we compute the ranks of the elements of the Assault column of
data and append the computed ranks to data with the name AsR. In
addition to this, we sort the Assault variable and append the sorted values to
data with the name AsS as follows:
#Appending computed ranks and sorted Assault variable to data
> data$AsR <- rank(data$Assault)
> data$AsS <- sort(data$Assault)
> print(data)
Murder Assault UrbanPop Rape AsO AsR AsS
Alabama 13.2 236 58 21.2 7 5 110
Alaska 10.0 263 48 44.5 4 7 190
Arizona 8.1 294 80 31.0 6 9 204
Arkansas 8.8 190 50 19.5 10 2 211
California 9.0 276 91 40.6 1 8 236
Colorado 7.9 204 78 38.7 8 3 238
Connecticut 3.3 110 77 11.1 2 1 263
Delaware 5.9 238 72 15.8 5 6 276
Florida 15.4 335 80 31.9 3 10 294
Georgia 17.4 211 60 25.8 9 4 335
Observe from the obtained output that the AsO, AsR and AsS columns
consist of the order, the ranks and the sorted values of the Assault variable, respectively.
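
The relationship between the three functions can be checked directly. The following is a minimal sketch, assuming (as is the case for these ten values) that there are no ties; the object names x and o are our own.

```r
x <- USArrests$Assault[1:10]   # the Assault values used above (no ties)
o <- order(x)

# Indexing a vector by its order gives the sorted vector
identical(x[o], sort(x))

# When there are no ties, the rank is the position of each element
# in the sorted vector, i.e. the order of the order
all(order(o) == rank(x))
```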

4.2.5 The split() Function


The split() function available in the base package is used to divide the
data assigned to x argument of the function in groups defined by the f
argument of the function. The general form of the split() function is as
follows:
#The split() function
split(x, #vector or data frame consisting of the data to be
         #divided into groups
      f) #Grouping variable

For the illustration purpose we consider the ships data set available in the
MASS package. We first seek help on the data frame and show some of its
rows as follows:
#Loading MASS package
> library(MASS)
#Seeking help
> ?ships

The details of the ships data set can be read from its R Documentation
page. Next, we have a look at the data frame:
[Screenshot: the ships data frame]

We now see the internal structure of the data as follows:


#Internal structure of the data
> str(ships)
'data.frame': 40 obs. of 5 variables:
$ type : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1
1 1 1 2 2 ...
$ year : int 60 60 65 65 70 70 75 75 60 60 ...
$ period : int 60 75 60 75 60 75 60 75 60 75 ...
$ service : int 127 63 1095 1095 1512 3353 0 2244 44882
17176 ...
$ incidents: int 0 0 3 4 6 18 0 11 39 29 ...

From the internal structure, it is clear that the type variable of the ships data
is of factor type and the remaining four variables are of integer type.
To illustrate the use of the split() function, we now split the ships data into
groups defined by the type variable (which is a factor variable) as follows:
#Grouping the ships data according to the type variable
> split(x=ships, f=ships$type)
$A
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
...
$B
type year period service incidents
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
...
$C
type year period service incidents
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
...

$D
type year period service incidents
25 D 60 60 251 0
26 D 60 75 105 0
27 D 65 60 288 0
...
$E
type year period service incidents
33 E 60 60 45 0
34 E 60 75 0 0
35 E 65 60 789 7
...

Note that, a specific group can also be extracted by explicitly using the logical
condition. For the illustration purpose, we now extract the rows of the ships
data frame corresponding to the type C as follows:
#Extraction of rows where type=C
> ships[ships$type=="C",]
type year period service incidents
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
20 C 65 75 676 1
21 C 70 60 783 6
22 C 70 75 1948 2
23 C 75 60 0 0
24 C 75 75 274 1

If required, the rows of the ships data corresponding to other values of
type can be extracted along the same lines.
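
Since split() returns a list with one component per group, group-wise summaries can be computed by applying a function to each component. The following is a minimal sketch on a small hypothetical data frame (named fleet here, with made-up values) so that it does not depend on the MASS package; sapply() is a base R function that is not discussed in this subsection.

```r
# Hypothetical data, used only to keep the sketch self-contained
fleet <- data.frame(type      = factor(c("A", "A", "B", "B", "C")),
                    incidents = c(0, 3, 39, 29, 1))

# Split the incidents column into groups defined by type
groups <- split(x = fleet$incidents, f = fleet$type)

# Total number of incidents within each group
totals <- sapply(groups, sum)
totals
```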

SAQ 1
Consider the admission data discussed in the Section 4.2 and create a data
frame consisting of the admission data. After creating the data frame do the
following tasks:
(i) Observe what the output will look like if we do not set the names of the
columns.
(ii) Set suitable row and column names of the data frame in a single
command.
(iii) Sort the data frame according to the percentage variable.

4.3 FORMATTING COMMANDS


In this section, we shall discuss the print(), paste(), paste0() and
cat() functions. These functions are extensively used for printing formatted
outputs or objects.

4.3.1 The print() Function


The print() function available in the base package prints its argument.
Recall that in Units 2 and 3, objects were mostly printed by simply writing
the name of the object. Printing an object by writing only its name is called
auto-printing. It is an interactive and time-saving approach.
Another method of printing an object is to use the print() function,
which we call explicit printing. Sometimes explicit printing is necessary; for
example, inside lengthy code, functions or R scripts, auto-printing
does not always work. In that case, explicit printing of the object using the
print() function is required.
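
One common situation in which auto-printing does not work is inside a loop. The following minimal sketch records what each loop prints by using capture.output() from the utils package; the object names silent and shown are our own.

```r
# Inside a loop, writing only the object name prints nothing
silent <- capture.output(for (i in 1:3) i)

# The print() function prints explicitly on every pass of the loop
shown <- capture.output(for (i in 1:3) print(i))

length(silent)   # number of lines printed without print()
shown            # lines printed with print()
```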

4.3.2 The paste() and paste0() Functions


The paste() and paste0() functions are used to concatenate vector
objects term-by-term after converting them into character vectors. The output
obtained using these functions will be of character type. These functions
facilitate printing.
Let us first see the internal structure of these two functions:
#Internal structure of the paste() function
> str(paste)
function (..., sep = " ", collapse = NULL, recycle0 = FALSE)

#Internal structure of the paste0 function


> str(paste0)
function (..., collapse = NULL, recycle0 = FALSE)

In these two functions, ... is the dot-dot-dot object of R, which is used to
absorb more than one argument (object). The sep argument of the paste()
function is used to assign a character string that separates the
terms, and the optional collapse argument of both functions is used to
assign a character string that separates the results. Lastly, the recycle0
argument (with default value FALSE) is used to specify whether zero-length character
arguments should lead to the zero-length result character(0).

From these internal structures, observe that the only difference between the
paste() and paste0() functions is of the sep argument. For the illustration
purpose, we now concatenate term-by-term first 10 upper-case letters of the
Roman alphabet with 1 to 10 numbers using both the functions as follows:

#Concatenating strings using paste() function


> paste(LETTERS[1:10], 1:10)
[1] "A 1" "B 2" "C 3" "D 4" "E 5" "F 6" "G 7" "H 8"
[9] "I 9" "J 10"

#Concatenating strings using paste0() function


> paste0(LETTERS[1:10], 1:10)
[1] "A1" "B2" "C3" "D4" "E5" "F6" "G7" "H8" "I9"
[10] "J10"

Note that, the difference between the two outputs is due to the sep argument
only. If we write the sep argument as sep="" (without an empty space) then
we get the same output as of paste0() function, see for example:
#Alternative approach to paste0() function
> paste(LETTERS[1:10], 1:10, sep="")
[1] "A1" "B2" "C3" "D4" "E5" "F6" "G7" "H8" "I9"
[10] "J10"

Next, we illustrate the use of the sep and collapse arguments, so that the
difference between the two can be clearly understood. To do so, we
concatenate the earlier two vectors term-by-term using the sep and
collapse arguments as follows:
#Concatenating two vectors term-by-term
> paste(LETTERS[1:10], 1:10, sep="$", collapse=", ")
[1] "A$1, B$2, C$3, D$4, E$5, F$6, G$7, H$8, I$9, J$10"

From the output, observe that the terms are separated using the collapse
argument and elements of the vectors are separated using the sep argument.
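
The difference between the two arguments also shows in the length of the result: with sep alone we get one string per term, whereas collapse joins all the terms into a single string. A minimal sketch follows; the object names v and s are our own.

```r
# sep only: a character vector with one element per term
v <- paste(c("a", "b", "c"), 1:3, sep = "-")

# sep and collapse: the terms are further joined into one string
s <- paste(c("a", "b", "c"), 1:3, sep = "-", collapse = "; ")

length(v)
length(s)
```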
Next, we illustrate the use of the recycle0 argument, whose default value is
FALSE.
#If recycle0 is FALSE
> paste("Use of recycle0", vector(mode="character",length=0),
recycle0=FALSE)
[1] "Use of recycle0 "

#If recycle0 is TRUE


> paste("Use of recycle0", vector(mode="character",length=0),
recycle0=TRUE)
character(0)

Note that these paste() commands consist of a zero-length
character argument and the recycle0 argument. If the recycle0 argument
is TRUE, then the output is the zero-length character(0); otherwise the
character string "Use of recycle0 " is printed.

4.3.3 The cat() Function


The cat() function concatenates allowable objects and prints the result. This
function is useful for producing output that is easily
readable and user friendly. It first converts all its
arguments to character vectors and then concatenates them into a single
character vector. The most useful arguments of this function are sep and
fill. The sep argument is used to assign a character vector or a string
which is appended after each element, and the fill argument is a
logical argument with default value FALSE. With fill=FALSE,
only the newlines created explicitly by writing ‘ \n ’ are
printed. The fill argument thus controls how the output is broken into
successive lines.
For the illustration purpose consider the following example:

Suppose you are interested in printing the following information:


The monthly salary of Pawan is 75000 Rs.
The monthly salary of Advait is 65000 Rs.
One of the possible ways to enter this data in R could be by creating the two
objects named Pawan and Advait; and assign their salaries to them as follows:
#Assigning data
> Pawan <- 75000; Advait <- 65000
Then the print command can be used to print these two objects as follows:
#Printing data
> print(Pawan)
[1] 75000
> print(Advait)
[1] 65000

But this output does not show explicitly what these numbers represent.
The paste() function can be used to print a more detailed output; the output
is printed within double quotes ( " " ) as follows:
#Printing given information using paste() function
> paste("Monthly salary of Pawan is", Pawan, "Rs.")
[1] "Monthly salary of Pawan is 75000 Rs."
> paste("Monthly salary of Advait is", Advait, "Rs.")
[1] "Monthly salary of Advait is 65000 Rs."

Next, we print the given information again using the cat() function to get
detailed output as follows:
#Printing given information using cat() function
> cat("Monthly salary of Pawan is", Pawan, "Rs.", "\n")
Monthly salary of Pawan is 75000 Rs.
> cat("Monthly salary of Advait is", Advait, "Rs.", "\n")
Monthly salary of Advait is 65000 Rs.

Note: The two character strings "Rs." and "\n" could be written together as
"Rs.\n". Also, if you compare the outputs of the paste() and cat()
functions, you will see that the output of
the paste() function comes within double quotes and the output of the cat()
function comes without double quotes.
Next, we observe what happens if we do not give the new line character ‘ \n ’
at the end and run two lines together.
#Output without ‘ \n ’ (as by default fill=FALSE)
> cat("Monthly salary of Pawan is", Pawan, "Rs.")
Monthly salary of Pawan is 75000 Rs.>
> cat("Monthly salary of Advait is", Advait, "Rs.")
Monthly salary of Advait is 65000 Rs.>

Due to the absence of the new line character ‘ \n ’, the outputs are coming in
continuation. This shows the importance of the new line character ‘ \n ’.

The occurrence of the new line character can be controlled using the fill
argument of the function. Recall that, by default, fill=FALSE. So, to add a new
line at the end of each statement, we can simply write fill=TRUE
as follows:
#Using fill argument of the cat() function
> cat("Monthly salary of Pawan is", Pawan, "Rs.", fill=TRUE)
Monthly salary of Pawan is 75000 Rs.
> cat("Monthly salary of Advait is", Advait, "Rs.", fill=TRUE)
Monthly salary of Advait is 65000 Rs.

It should be noted here that the statements to be printed as they are should be
written within double quotes, i.e., as character strings. Further, a new line
character ‘ \n ’ should be added wherever a new line is required. The arguments of the
cat() function should be separated by the comma symbol ( , ). A single
cat() command can also be used to print both the
statements together as follows:
#Printing both the statements in single command
> cat(" Monthly salary of Pawan is", Pawan, "\n",
"Monthly salary of Advait is", Advait, "\n")
Monthly salary of Pawan is 75000
Monthly salary of Advait is 65000

Note: A tab character ‘ \t ’ is used to give a horizontal tab space and a new
line character ‘ \n ’ is used for new line.
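
When there are many such statements, they can be generated from a single named vector instead of one object per person. The following is a minimal sketch of this idea; the vector name salary and the loop variable person are our own choices.

```r
# Salaries stored in one named numeric vector
salary <- c(Pawan = 75000, Advait = 65000)

# One formatted line per person; "\n" ends each line
for (person in names(salary)) {
  cat("The monthly salary of", person, "is", salary[[person]], "Rs.\n")
}
```

Adding another person then only requires adding one more element to the vector.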
Next, we illustrate the use of the sep argument of the cat() function to get
modified outputs as follows:
#Using separator as blank space
> cat("ABC", "abc", sep=" ", fill=TRUE)
ABC abc
#Using separator as comma
> cat("ABC", "abc", sep=",", fill=TRUE)
ABC,abc
#Using separator as new line character
> cat("ABC", "abc", sep="\n", fill=TRUE)
ABC
abc
#Using separator as tab character
> cat("ABC", "abc", sep="\t", fill=TRUE)
ABC abc

From the obtained outputs observe that the terms of the output are separated
by the character specified by the sep argument of the cat() function.
Moreover, note that the paste() function can be used as argument of the
cat() function for getting further formatted output as follows:
#Printing using cat() function only
> cat(letters[1:3], 1:3, sep=",", "\n")
a,b,c,1,2,3,

#Using paste() function as an argument to the cat() function


> cat(paste(letters[1:3],1:3, sep=":"), sep=",", "\n")
a:1,b:2,c:3,

SAQ 2
Write an R command to get the following output:
a##1$, b##3$, c##5$, d##9$

4.4 DATA READING FROM A FILE


R provides several options for reading a file in table format and creating a
data frame from it. In this section, we shall mainly discuss the
read.table(), read.csv() and read.delim() functions to read data
from a file. These functions are available in the utils package. The
read.table() function is generally used to read data frames consisting of
columns of different classes from a .txt file. The read.csv() function is
almost identical to the read.table() function and is mainly used to read
‘comma separated value’ (.csv) files. The read.delim() function is used for
reading delimited files, defaulting to the TAB character for the delimiter.
Before we start the discussion on these functions, it is important to
know about the getwd() and setwd() functions. These two functions are
available in the base package. The getwd() function is used to view the
current working directory and the setwd() function is used to set the working
directory.
#Getting the current working directory
> getwd()
[1] "C:/Users/Taruna Kumari/Documents"

We would like to save our work in a folder named “Introduction to R Software”.
So, we first create a folder on the desktop (a place of our choice) and then create a
.txt file named TKfile1.txt in that folder to illustrate reading from a .txt file
as follows:
[Screenshot: the folder containing TKfile1.txt]

Observe from the screenshot that the location of the .txt file shown in the
image is not same as of our current directory. So, we first set up the path of
the working directory using the setwd() function as follows:
#Setting the path of the working directory
> path="C:/Users/Taruna Kumari/Desktop/Introduction to R Software"
> setwd(path)

Note: An alternative approach is to specify the full path of the file
while reading it, which will be illustrated soon.
After setting the working directory, we verify whether it is
properly set or not using getwd() again as follows:
#Verify the working directory
> getwd()
[1] "C:/Users/Taruna Kumari/Desktop/Introduction to R Software"
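
It is often convenient to return to the previous working directory once the work is over. The following is a minimal sketch that relies on the fact that setwd() invisibly returns the directory that was current before the change; tempdir() is used here only as a stand-in for a folder of your choice, and the object name old.dir is our own.

```r
# Change the working directory and remember the previous one
old.dir <- setwd(tempdir())
getwd()

# Restore the previous working directory
setwd(old.dir)
getwd()
```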

Hence, it is verified that the working directory is properly set. Also note that
the .txt file named “TKfile1” consists of the following data.
[Screenshot: contents of TKfile1.txt]
After setting the working directory, we next read the .txt file using the
read.table() function by supplying the name of the file as a character
string with proper extension in the following manner:
#Reading the data from a .txt file
> read.table("TKfile1.txt")
V1 V2 V3 V4
1 x y w z
2 13.2 8.2 11 August
3 12.1 3.1 12 December
4 14.8 6.1 13 June
5 14.2 7.2 10 July

Note: If required, the data read can be given a name using an assignment operator.
From the obtained output it can be noted that by default row names (1 to 5) and
column names (V1, V2, V3 and V4) are shown in the output, but the column
names were x, y, w and z. So, to read the first line of the file as column names
(as header), we set the header argument (whose default value is FALSE)
of the read.table() function to TRUE in the following manner:
#Reading the data from a .txt file using header argument
> read.table("TKfile1.txt", header=TRUE)
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July
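
The effect of the header argument can be reproduced without creating a file by hand. The following is a minimal sketch that writes a small tab separated file to a temporary location and reads it back by its full path; the temporary file and the object names tmp, no.header and with.header are assumptions made only for illustration.

```r
# Create a small tab separated file in a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("x\ty\tw\tz",
             "13.2\t8.2\t11\tAugust",
             "12.1\t3.1\t12\tDecember"), tmp)

# Without header=TRUE the first line is read as data
no.header <- read.table(tmp)

# With header=TRUE the first line supplies the column names
with.header <- read.table(tmp, header = TRUE)
str(with.header)

unlink(tmp)   # remove the temporary file
```

Note that without the header the x column is not read as numeric, because its first entry is the text "x".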

Hence, the column names are now read and the default column names are
replaced with the original ones. Additionally, as the row.names argument of
this function is missing, the default numbering is used for the row names.
Note that the read.table(), read.csv() and read.delim() functions
have two more important arguments with different default values, namely, sep
and dec. The sep argument is used to specify how the elements of the data
are separated and the dec argument is used to specify the decimal point. See
the following help page for more clarification:
[Screenshot: help page of the read.table() function]

The difference between the data input functions can be clearly understood by
looking at the sep and dec arguments. As TKfile1.txt is a tab separated
file with ‘ . ’ as the decimal point, it can also be read using the read.delim()
function as follows:
#Reading .txt file using read.delim() function
> read.delim("TKfile1.txt", header=TRUE, dec=".", sep="\t")
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July
If the decimal points in the file are written as ‘ , ’, then we
have to assign the dec argument of both the functions as dec="," . For
illustration, we create another tab separated file named
TKfile2.txt, in which decimal points are written as ‘ , ’, in the following
manner:
[Screenshot: contents of TKfile2.txt]

The TKfile2.txt file can be read using the read.delim() function, by
assigning the header argument as TRUE, the dec argument as ‘ , ’ and the
sep argument as ‘ \t ’ in the following manner:
#Reading data from a file
> read.delim("TKfile2.txt", header=TRUE, dec=",", sep="\t")
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July

The same file can be read using the read.table() function as well but by
changing the default value of the dec argument as follows:
#Reading data from a file
> read.table("TKfile2.txt", header=TRUE, dec=",")
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July

Otherwise, the file will be read incorrectly and we will get the following
output (as by default dec=".").
> read.table("TKfile2.txt", header=TRUE)
x y w z
1 13,2 8,2 11 August
2 12,1 3,1 12 December
3 14,8 6,1 13 June
4 14,2 7,2 10 July
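
Although the two outputs look similar on screen, the columns are stored differently. The following is a minimal sketch that checks this on a temporary file written only for illustration; the object names tmp, wrong and right are our own.

```r
# A small tab separated file with ',' as the decimal point
tmp <- tempfile(fileext = ".txt")
writeLines(c("x\ty", "13,2\t8,2", "12,1\t3,1"), tmp)

wrong <- read.table(tmp, header = TRUE)             # default dec="."
right <- read.table(tmp, header = TRUE, dec = ",")  # correct decimal point

class(wrong$x)   # not numeric, so arithmetic on it fails
class(right$x)   # numeric
mean(right$x)

unlink(tmp)      # remove the temporary file
```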

For more clarification on the dec and sep arguments, we create one
more .txt file named TKfile3.txt in the following manner:
[Screenshot: contents of TKfile3.txt]
Observe that in the TKfile3.txt file the terms are separated using ‘ , ’
and the decimal point is ‘ . ’. So, here it is better to read this file using the
read.table() function by specifying the dec and sep arguments accordingly
as follows:

#Reading data from a file


> read.table("TKfile3.txt", header=TRUE, dec=".", sep=",")
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July

Next, we create a CSV (Comma Separated Values) file named TKfile4.csv
and read it using the read.csv() function as follows:

#Reading data from a .csv file


> read.csv("TKfile4.csv", header=TRUE)
x y w z
1 13.2 8.2 11 "August"
2 12.1 3.1 12 "December"
3 14.8 6.1 13 "June"
4 14.2 7.2 10 "July"

Recall that TKfile3.txt is a comma separated .txt file, so it can also be
read using the read.csv() function as follows:
#Reading data from comma separated file
> read.csv("TKfile3.txt", header=TRUE)
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July
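
The reason both functions can read the same file is that, according to the R documentation, read.csv() is a wrapper around read.table() with header=TRUE and sep="," as defaults. A minimal sketch of this relationship, assuming a small temporary comma separated file (a hypothetical stand-in for TKfile3.txt):

```r
# Hypothetical stand-in for TKfile3.txt: comma separated, decimal point
tmp <- tempfile(fileext = ".txt")
writeLines(c("x,y,w,z",
             "13.2,8.2,11,August",
             "12.1,3.1,12,December"), tmp)

# read.csv() with its defaults ...
a <- read.csv(tmp)

# ... and read.table() with the header and separator given explicitly
b <- read.table(tmp, header = TRUE, sep = ",")

# Compare the two data frames
all.equal(a, b)

unlink(tmp)  # remove the temporary file
```

The file extension plays no role here; only the separator and the decimal mark inside the file decide which arguments are needed.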

Lastly, we discuss the method of reading an Excel spreadsheet. An Excel
spreadsheet can be read in R using different packages. For illustration,
here we use the "XLConnect" package, which is used to read and
write Excel files, i.e., files with .xlsx and .xls extensions. We first need to
install this package as it does not come with the base packages.
#Installing XLConnect package
> install.packages("XLConnect")
...

This package can read, write and manipulate both Excel 97–2003 and Excel
2007/10 spreadsheets. The readWorksheetFromFile() function is used to
read an Excel file and the writeWorksheetToFile() function is used to
write to an Excel file. For illustration we have created an Excel file
named TKfile5.xlsx in the working directory.

The Excel file TKfile5.xlsx consists of two sheets, namely, sheet 1 and
sheet 2, with the following data:

We first load the package and set the working directory as the path from where
the file is to be read (if not set earlier), then we read both the sheets of
TKfile5.xlsx one-by-one using the readWorksheetFromFile() function and
assign them to df.one and df.two as follows:
#Loading the package
> library("XLConnect")
#Setting the current working directory
> setwd("C:/Users/Taruna Kumari/Desktop/Introduction to R Software")

#Reading data from sheet 1 of the excel file
> df.one <- readWorksheetFromFile("TKfile5.xlsx", sheet = 1,
header = TRUE); df.one
x y w z
1 1 5 9 R1
2 2 6 10 R2
3 3 7 11 R3
4 4 8 12 R4

#Reading data from sheet 2 of the excel file
> df.two <- readWorksheetFromFile("TKfile5.xlsx", sheet = 2,
header = TRUE); df.two
a b c e
1 12 8 4 r1
2 11 7 3 r2
3 10 6 2 r3
4 9 5 1 r4

From the above outputs observe that the sheet argument of the function is
assigned as 1 to read sheet number 1 and as 2 to read sheet
number 2. Moreover, the header argument is assigned as TRUE to read the
first line of each sheet as its header.
Alternatively, if you do not want to change the working directory and
want to read the file directly from the location where it is saved, you can
use the following commands to read both the sheets:
#Assigning the location of the file in path
> path <- "C:/Users/Taruna Kumari/Desktop/Introduction to R Software/TKfile5.xlsx"

#Reading sheet 1 of the file


> df.one <- readWorksheetFromFile(path, sheet = 1, header =
TRUE); df.one
x y w z
1 1 5 9 R1
2 2 6 10 R2
...

#Reading sheet 2 of the file


> df.two <- readWorksheetFromFile(path, sheet = 2, header =
TRUE); df.two
a b c e
1 12 8 4 r1
2 11 7 3 r2
...

Note: Specific rows and columns of a .xlsx file can be read using the
startRow, endRow, startCol and endCol arguments of the
readWorksheetFromFile() function. The usage of each of these
arguments is self-explanatory.

SAQ 3
Create a .txt file of the admission data discussed in Section 4.2 (Adm.data)
and write R command to read it.

4.5 WRITING DATA TO A FILE


The write.table() and write.csv() functions are used to write data to
.txt and .csv files. These two functions are also available in the utils
package. For illustration we now write the first 6 rows of the
trees data set available in the datasets package to .txt and .csv files.
Note: The write.table() and write.csv() functions also have
arguments such as sep, dec, row.names (with default value TRUE) and
col.names (with default value TRUE). These arguments can be used on the
same lines as discussed earlier. Note, however, that write.csv() fixes sep,
dec and col.names itself; attempts to change them are ignored with a warning.
#Writing first 6 rows of the trees data to .txt file
> write.table(trees[1:6,], "Trees1.txt")

#Writing first 6 rows of the trees data to .csv file
> write.csv(trees[1:6,], "Trees2.csv")

#Writing first 6 rows of the trees data to .txt file using sep
#and dec arguments
> write.table(trees[1:6,], "Trees3.txt", sep=",", dec=".")
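
One way to check what has been written is to read the file back. The following is a minimal sketch using temporary files (hypothetical names created with tempfile(), not the Trees files above); it also shows the effect of the row.names argument of write.csv():

```r
x <- trees[1:6, ]

# Default (row.names=TRUE): row names are written as an extra first column
f1 <- tempfile(fileext = ".csv")
write.csv(x, f1)
with_rn <- read.csv(f1)
names(with_rn)    # one more column than the original data

# row.names=FALSE writes only the data columns
f2 <- tempfile(fileext = ".csv")
write.csv(x, f2, row.names = FALSE)
back <- read.csv(f2)
names(back)

# The values survive the round trip
all.equal(back$Girth, x$Girth)

unlink(c(f1, f2))  # remove the temporary files
```

If the row names carry no information (as with the default 1, 2, 3, ...), writing with row.names=FALSE avoids the extra column when the file is read again.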

Remember to set the working directory before running these commands, as
the created files will be saved in the working directory only. Recall that we
have already set our working directory, so these files are available in our
working directory. We now open each of these created files. See the
following screenshot for more details:

Hence, the screenshot confirms that the first six rows of the trees data are
properly written in the files named Trees1.txt, Trees2.csv and Trees3.txt,
according to the written commands. Next, we shall use the
writeWorksheetToFile() function available in the "XLConnect"
package (see footnote 2) to write the data to an Excel file.
#Loading the package
> library("XLConnect")
#Setting the path of the file
> path <- "C:/Users/Taruna Kumari/Desktop/Introduction to R Software/Trees4.xlsx"
#Writing data to first sheet of .xlsx format file
> writeWorksheetToFile(path, data=trees[1:6,],
sheet="FirstSheet")

2 https://CRAN.R-project.org/package=XLConnect

The saved data can be viewed by opening the Excel file as follows:

The location of the file can be seen by viewing the properties of the file as
follows:

Note: The sheet argument should be assigned as a character string, as writing
sheet=1 is invalid in the writeWorksheetToFile() function.

SAQ 4
Write R statements to write the USArrests data set to .csv, .txt and .xlsx
files.

4.6 DATES AND TIMES
In R a number of functions are available to deal with date and time data. In this
section, we shall discuss the as.Date(), ISOdatetime(), as.POSIXlt()
and as.POSIXct() functions to deal with date and time data.
The as.Date() function available in the base package is used to convert a
character string representation of date (a calendar date) to an object of Date
class. But this function does not handle times. We first take help on this
function as follows:
#Seeking help
> ?as.Date
starting httpd help server ... done

For illustration we now convert character string(s) representing dates in
the default formats ("%Y-%m-%d" and "%Y/%m/%d") to a Date object using the
as.Date() function as follows:
#Converting character string representing dates to Date object
> as.Date(c("2023-08-13", "2023-10-31"))
[1] "2023-08-13" "2023-10-31"
> as.Date("2023/08/13")
[1] "2023-08-13"
> as.Date(c("2023/08/13", "2023/10/31"))
[1] "2023-08-13" "2023-10-31"

Observe that the arguments supplied to the as.Date() function are in the two
specific formats ("%Y-%m-%d", "%Y/%m/%d") only and the supplied argument
can be a character vector or a single character element. When the input data
are not in one of these standard formats, a format string can be composed
using the %Y, %y, %B, %b, %m and %d elements to read the input data.

The Code and Value for as.Date() function:
%Y for 4 digits year
%y for 2 digits year
%B for full month name
%b for abbreviated month name
%m for month in decimal
%d for day

The format string is then assigned to the format argument of the as.Date()
function to read nonstandard formats as follows:
#Reading nonstandard formats with the help of format argument.
> as.Date("13Aug2023", format="%d%b%Y")
[1] "2023-08-13"

> as.Date("August 13, 2023", format="%B%d,%Y")


[1] "2023-08-13"
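
The format codes can be combined freely. The following is a minimal sketch with made-up input strings that use only numeric codes (month names read with %B and %b depend on the language settings of the R session, so they are avoided here):

```r
# Day/month/year with a 4-digit year
d1 <- as.Date("13/08/2023", format = "%d/%m/%Y")

# The same date written with a 2-digit year
d2 <- as.Date("13/08/23", format = "%d/%m/%y")

# A format that does not match the input gives NA rather than an error
d3 <- as.Date("13/08/2023", format = "%Y-%m-%d")

# format() applies the same codes in the other direction (Date to text)
format(d1, "%d/%m/%Y")

d1 == d2      # both strings describe the same date
is.na(d3)     # the mismatched format could not be parsed
```

An NA in the result is therefore a useful signal that the format string and the input data do not agree.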

It would be interesting to view the internal structure of a Date object. See for
example:
#Checking internal structure of date object
> str(as.Date(c("2023/08/13", "2023/10/31")))
Date[1:2], format: "2023-08-13" "2023-10-31"

From this output it is clear that the object supplied to the str() function is a
Date object.
Next, we discuss the difftime() function available in the base package, which
computes the difference between two date or time objects in units such as
"auto", "secs", "mins", "hours", "days" and "weeks". For illustration
consider the following commands, in which we compute the difference between
two time objects in different units of time.
#Assigning two Date objects time1 and time2
> time1 <- as.Date("2023/08/13")
> time2 <- as.Date("2023/10/31")

#Computing difference in time in different units of time
> difftime(time2, time1)
Time difference of 79 days

> difftime(time2, time1, units="days")
Time difference of 79 days

> difftime(time2, time1, units="hours")
Time difference of 1896 hours

> difftime(time2, time1, units="weeks")
Time difference of 11.28571 weeks

> difftime(time2, time1, units="mins")
Time difference of 113760 mins

> difftime(time2, time1, units="secs")
Time difference of 6825600 secs
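
The result of difftime() is an object of class "difftime", not a plain number. When the number itself is needed for further calculation, it can be extracted with as.numeric(). A minimal sketch using the same two dates:

```r
time1 <- as.Date("2023/08/13")
time2 <- as.Date("2023/10/31")

# Subtracting two Date objects also gives a difftime object
d <- time2 - time1
class(d)

# as.numeric() drops the units label
as.numeric(d)

# The units argument converts to the requested unit before extracting
as.numeric(d, units = "hours")
as.numeric(d, units = "weeks")
```

Stating the units explicitly is the safer habit, since with "auto" the unit chosen depends on the size of the difference.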

There are some functions which extract the weekdays, months, quarters and
number of days since some origin from a Date (or POSIXt) object. The
weekdays() function is used to get weekdays, the months() function is
used to get months, the quarters() function is used to get quarters and the
julian() function is used to get the number of days since some origin of a
Date object. All these functions come as part of the base package. For
illustration see:
#Getting weekdays
> weekdays(as.Date(c("2023/08/13", "2023/10/31")))
[1] "Sunday" "Tuesday"

#Getting months
> months(as.Date(c("2023/08/13", "2023/10/31")))
[1] "August" "October"

#Getting quarters
> quarters(as.Date(c("2023/08/13", "2023/10/31")))
[1] "Q3" "Q4"

#Getting number of days
> julian(as.Date("2023/08/13"), origin=as.Date("2023/08/10"))
[1] 3
attr(,"origin")
[1] "2023-08-10"
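
When no origin is supplied, julian() counts the days from January 1, 1970, which is also the number that a Date object stores internally. A minimal sketch of this connection:

```r
d <- as.Date("2023/08/13")

# With no origin supplied, julian() counts days from 1970-01-01
j <- julian(d)

# The same count is what the Date object stores internally
n <- as.numeric(d)
as.numeric(j) == n

# Counting from a chosen origin is the same as subtracting two dates
k <- as.numeric(d - as.Date("2023/08/10"))
k
```

This is why arithmetic such as adding a number of days to a Date object works directly.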

Recall from Unit 2 of MST-015 (Introduction to R Software) that the seq()
function is used to generate sequences. The seq() function
together with the as.Date() function can be used to generate a date
sequence. To do so, we use its from, to and by arguments. The from and
to arguments are used to assign the starting and ending dates, and the by
argument is used to specify the time step of the sequence. For illustration
we now generate sequences from August 10, 2021 to August 13, 2023 in steps
of 1 year, 2 months, 3 weeks and 20 days as follows:
#Generating sequence by year
> seq(from=as.Date("2021/08/10"), to=as.Date("2023/08/13"),
by="year")
[1] "2021-08-10" "2022-08-10" "2023-08-10"

#Generating sequence by 2 months
> seq(from=as.Date("2021/08/10"), to=as.Date("2023/08/13"),
by="2 months")
[1] "2021-08-10" "2021-10-10" "2021-12-10" "2022-02-10"
[5] "2022-04-10" "2022-06-10" "2022-08-10" "2022-10-10"
[9] "2022-12-10" "2023-02-10" "2023-04-10" "2023-06-10"
[13] "2023-08-10"

#Generating sequence by 3 weeks
> seq(from=as.Date("2021/08/10"), to=as.Date("2023/08/13"),
by="3 weeks")
[1] "2021-08-10" "2021-08-31" "2021-09-21" "2021-10-12"
[5] "2021-11-02" "2021-11-23" "2021-12-14" "2022-01-04"
[9] "2022-01-25" "2022-02-15" "2022-03-08" "2022-03-29"
[13] "2022-04-19" "2022-05-10" "2022-05-31" "2022-06-21"
[17] "2022-07-12" "2022-08-02" "2022-08-23" "2022-09-13"
[21] "2022-10-04" "2022-10-25" "2022-11-15" "2022-12-06"
[25] "2022-12-27" "2023-01-17" "2023-02-07" "2023-02-28"
[29] "2023-03-21" "2023-04-11" "2023-05-02" "2023-05-23"
[33] "2023-06-13" "2023-07-04" "2023-07-25"

#Generating sequence by 20 days


> seq(from=as.Date("2021/08/10"), to=as.Date("2023/08/13"),
by="20 days")
[1] "2021-08-10" "2021-08-30" "2021-09-19" "2021-10-09"
[5] "2021-10-29" "2021-11-18" "2021-12-08" "2021-12-28"
[9] "2022-01-17" "2022-02-06" "2022-02-26" "2022-03-18"
[13] "2022-04-07" "2022-04-27" "2022-05-17" "2022-06-06"
[17] "2022-06-26" "2022-07-16" "2022-08-05" "2022-08-25"
[21] "2022-09-14" "2022-10-04" "2022-10-24" "2022-11-13"
[25] "2022-12-03" "2022-12-23" "2023-01-12" "2023-02-01"
[29] "2023-02-21" "2023-03-13" "2023-04-02" "2023-04-22"
[33] "2023-05-12" "2023-06-01" "2023-06-21" "2023-07-11"
[37] "2023-07-31"
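
Long date sequences are easier to check by summarising them than by reading every element. A minimal sketch for the 3 weeks sequence generated above:

```r
s <- seq(from = as.Date("2021/08/10"), to = as.Date("2023/08/13"),
         by = "3 weeks")

length(s)          # number of dates generated
unique(diff(s))    # gap between consecutive dates
max(s)             # last date, which never goes beyond 'to'
```

Note that the sequence stops at the last date not later than the to argument, so the ending date itself is included only when it falls exactly on a step.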

In addition to this, the length and along arguments of the seq() function
can also be used on the same lines as discussed in Unit 2 of the MST-015
course. For illustration we now generate a date sequence using
the length argument together with the from and to arguments of the seq()
function and assign it to x as follows:
#Generating date sequence using length argument
> x <- seq(from=as.Date("2021/08/10"),
to=as.Date("2023/08/13"), length=5); x
[1] "2021-08-10" "2022-02-09" "2022-08-11" "2023-02-10"
[5] "2023-08-13"

Next, we use x to generate a date sequence using the along argument
together with the from and to arguments of the seq() function as follows:
#Generating date sequence using along argument
> seq(from=as.Date("2015/08/19"), to=as.Date("2021/08/10"),
along=x)
[1] "2015-08-19" "2017-02-14" "2018-08-14" "2020-02-11"
[5] "2021-08-10"
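
In the commands above, length and along are shortened forms of the full argument names length.out and along.with, which R matches partially. A minimal sketch showing that the short and full names behave the same (the character vector passed to along.with is arbitrary; only its length is used):

```r
from <- as.Date("2021/08/10")
to   <- as.Date("2023/08/13")

# 'length' is matched partially to the full argument name 'length.out'
x1 <- seq(from = from, to = to, length = 5)
x2 <- seq(from = from, to = to, length.out = 5)
identical(x1, x2)

# 'along.with' takes the number of elements from another object
y <- seq(from = from, to = to, along.with = letters[1:5])
all.equal(x2, y)
```

Writing the full names makes scripts easier to read, although the shortened forms are accepted.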

Additionally, we discuss other functions, such as ISOdate() and
ISOdatetime() of the base package. From the R Documentation page, it can
be verified that these two functions are the same and differ only in the default
values of their arguments. So, we now use the ISOdatetime() function to get
a date and time object from a numeric representation. The as.Date() and
ISOdatetime() functions differ in the time component. So, for dates
without time we prefer to use the as.Date() function. Let us first take help on
this function as follows:
#Seeking help
> ?ISOdatetime
starting httpd help server ... done

The following R Documentation page will pop up when we take help on the
ISOdatetime() function.

The year, month, day, hour, min and sec arguments can be interpreted
literally. The tz argument is left empty to use the current time zone; otherwise
it can be set to, for example, "GMT", i.e., UTC (Coordinated Universal Time).
For illustration we now create an arbitrary date and time object using the
ISOdatetime() function as follows:
#Creating a time object
> ISOdatetime(year=2021,month=8,day=10,hour=11,min=10,sec=5,
tz="")
[1] "2021-08-10 11:10:05 IST"
Next, we see the internal structure of the date and time object using the str()
function as follows:
#Internal structure of the object created by ISOdatetime()
> str(ISOdatetime(year=2021,month=8,day=10,hour=11,min=10,
sec=5,tz=""))
POSIXct[1:1], format: "2021-08-10 11:10:05"
From this output it is clear that the ISOdatetime() function creates an
object of class POSIXct. Moreover, it can be used in the seq() function to
generate sequences on the same lines as the as.Date() function. Additionally,
the difference in time can also be computed using the difftime() function
in the allowed units as follows:
#Creating date and time objects x and y
> x <- ISOdatetime(year=2021, month=8, day=10, hour=11, min=10,
sec=5, tz="")
> y <- ISOdatetime(year=2023, month=8, day=13, hour=10, min=5,
sec=15, tz="")
#Generating date sequence with time
> seq(from=x, to=y, length=5)
[1] "2021-08-10 11:10:05 IST" "2022-02-09 16:53:52 IST"
[3] "2022-08-11 22:37:40 IST" "2023-02-11 04:21:27 IST"
[5] "2023-08-13 10:05:15 IST"
#Difference between 2 times objects
> difftime(y, x, units="auto")
Time difference of 732.955 days
The output can be interpreted on the same lines as earlier.
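
The outputs above show IST because they were produced on a system set to Indian Standard Time. The following is a minimal sketch, assuming tz="UTC" is supplied so that the result does not depend on the system settings; it also shows the default values that distinguish ISOdate() from ISOdatetime():

```r
# Fixing tz="UTC" makes the result independent of the local time zone
x <- ISOdatetime(2021, 8, 10, 11, 10, 5, tz = "UTC")
y <- ISOdatetime(2023, 8, 13, 10, 5, 15, tz = "UTC")
difftime(y, x, units = "days")

# ISOdate() has the defaults hour=12, min=0, sec=0 and tz="GMT"
z <- ISOdate(2021, 8, 10)
format(z, "%Y-%m-%d %H:%M:%S")
```

Supplying the time zone explicitly is worth considering whenever the same script is to be run on machines in different regions.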
Next, we discuss two other classes which are used to
represent date and time. These two classes are "POSIXlt" and "POSIXct".
The functions as.POSIXlt() and as.POSIXct() are used to convert
objects of other classes, especially "character" and "Date", to the "POSIXlt"
and "POSIXct" classes. These functions can also be used to manipulate
objects of these classes. The main difference between these two classes lies
in how the values are stored internally. The origin of time for the "POSIXct"
class is January 1, 1970, which means time data is stored as the number of
seconds since January 1, 1970, whereas the "POSIXlt" class stores the time data
as a list with a number of components, namely, "sec", "min", "hour",
"mday", "mon", "year", "wday", "yday" and "isdst". Consider the
following for understanding purpose:
#Testing for list
> is.list(as.POSIXct("1986-08-13 10:10:00"))
[1] FALSE
> is.list(as.POSIXlt("1986-08-13 10:10:00"))
[1] TRUE

Hence, verified. Next, we see the components of the list as follows:
#Checking for date and time list components
> as.POSIXlt("1986-08-13 10:10:00")$sec
[1] 0
> as.POSIXlt("1986-08-13 10:10:00")$min
[1] 10
> as.POSIXlt("1986-08-13 10:10:00")$hour
[1] 10
> as.POSIXlt("1986-08-13 10:10:00")$mday
[1] 13
> as.POSIXlt("1986-08-13 10:10:00")$mon
[1] 7
> as.POSIXlt("1986-08-13 10:10:00")$year
[1] 86
> as.POSIXlt("1986-08-13 10:10:00")$wday
[1] 3
> as.POSIXlt("1986-08-13 10:10:00")$yday
[1] 224

These two functions also accept character strings in the following formats, like
the as.Date() function.
Date format: "%Y-%m-%d" or "%Y/%m/%d"
Time format: "%H:%M:%S" or "%H:%M"

Other formats are ambiguous to these functions. If the input string is not in the
standard formats, then the format argument of these functions can be used
for conversion.

The Code and Value for POSIX* class functions:
%H for Decimal hours
%M for Decimal minutes
%S for Decimal second.
There are other codes as well. To see them you can take help on the
strptime() function.

Note: Unless a list time object is required "POSIXct" is the obvious choice.
Also, you can see the system time, by simply using the function Sys.time().
Now, we show some examples in which we convert character strings
consisting of dates and times using the as.POSIXct() and as.POSIXlt()
functions.
#Converting different times to POSIX* class
> as.POSIXct(c("2023/08/13","2023/10/31"))
[1] "2023-08-13 IST" "2023-10-31 IST"
> as.POSIXct(c("2021-08-10 11:10:05", "2023-08-13 10:05:15"))
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
> as.POSIXct(c("2021-08-10 11:10", "2023-08-13 10:05"))
[1] "2021-08-10 11:10:00 IST" "2023-08-13 10:05:00 IST"
> as.POSIXct(c("2021/08/10 11:10:05", "2023/08/13 10:05:15"))
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
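
The storage difference between the two classes described earlier can be seen directly. The following is a minimal sketch, assuming tz="UTC" is supplied so that the numbers do not depend on the local time zone; it also shows how some of the "POSIXlt" components are counted:

```r
# POSIXct: a single number, the seconds since 1970-01-01 00:00:00 UTC
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(ct)       # one day = 24 * 60 * 60 seconds

# POSIXlt: a list of components
lt <- as.POSIXlt("1986-08-13 10:10:00", tz = "UTC")
lt$year + 1900       # year is stored as years since 1900
lt$mon + 1           # mon is stored as 0-11, so add 1 for the calendar month
lt$wday              # day of the week, counted from 0 = Sunday
```

This explains why the components shown earlier for August 13, 1986 were 86 for the year and 7 for the month.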
Now, we see what happens if we convert a POSIX* class object to a Date class
object as follows:
> as.Date(as.POSIXct(c("2021/08/10 11:10:05",
"2023/08/13 10:05:15")))
[1] "2021-08-10" "2023-08-13"

Observe that due to the as.Date() function the time component is now
removed.
Next, we convert a nonstandard date and time character string vector
named Timedata to POSIX* class using the as.POSIXct() function as follows:
#Creating a vector consisting time in nonstandard formats
> Timedata <- c("10/August/2021:11:10:05",
"13/August/2023:10:05:15")
> as.POSIXct(Timedata)
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
Observe that we get an error message because the strings are not in
a standard format. So we can either use the strptime() function (explore
yourself) to parse the strings before conversion or directly use
the format argument of the as.POSIXct() function and define the format
as follows:
#Converting the character string to POSIX class using format
#argument
> as.POSIXct(Timedata, format="%d/%B/%Y:%H:%M:%S")
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
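
For readers who wish to explore the strptime() route, the following is a minimal sketch. It assumes made-up input strings with numeric months (so that the result does not depend on the language in which month names are written) and tz="UTC"; neither assumption comes from the Timedata example above:

```r
# Hypothetical input with numeric months instead of month names
txt <- c("10/08/2021:11:10:05", "13/08/2023:10:05:15")
fmt <- "%d/%m/%Y:%H:%M:%S"

# strptime() parses the strings and returns a POSIXlt object
lt <- strptime(txt, format = fmt, tz = "UTC")
class(lt)

# as.POSIXct() converts the parsed result to the compact POSIXct form
ct <- as.POSIXct(lt)
format(ct, "%Y-%m-%d %H:%M:%S")
```

Both routes use the same format codes; the difference is only in the class of the object that is returned first.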
Moreover, the class of the created time object can be verified using the str()
function and the time difference can be checked using the difftime() function
as earlier. To do so, we first assign the time object to TD, then check its
internal structure as follows:
#Assigning object to TD
> TD <- as.POSIXct(Timedata, format="%d/%B/%Y:%H:%M:%S"); TD
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
#Checking internal structure
> str(TD)
POSIXct[1:2], format: "2021-08-10 11:10:05" "2023-08-13 10:05:15"

Next, we compute the difference between the two time points of TD as follows:
#Computing time difference
> difftime(TD[2], TD[1])
Time difference of 732.955 days

Hence, observe that the class is POSIXct and the time difference is the same as
that computed using the ISOdatetime() function.
me(

SAQ 5
Write the output of the following:
(i) as.POSIXlt(c("2023/08/13","2023/10/31"))
(ii) as.POSIXlt(c("2021-08-10 11:10:05", "2023-08-13 10:05:15"))
(iii) as.POSIXlt(c("2021-08-10 11:10", "2023-08-13 10:05"))
(iv) as.POSIXlt(c("2021/08/10 11:10:05", "2023/08/13 10:05:15"))

4.7 SUMMARY
The main points discussed in this unit are as follows:
The creation of a data frame object is discussed together with data
frame subsetting.
The main functions used to manipulate a data frame object are discussed.
The functions used to get formatted outputs are discussed.
The method of reading data from different types of files is discussed.
The method of writing data to different types of files is discussed.
The main date and time functions are also discussed in this unit.

4.8 TERMINAL QUESTIONS


1. Write R code to extract the first 20 rows of the data frame named USArrests
and name the result ExtUS. Append three columns named UrbO, UrbR and
UrbS to ExtUS consisting of the order, rank and sorted values of its
UrbanPop variable.
2. Consider the Adm.data discussed earlier. Write R code to split the data
according to the Gender variable.
3. Write a command to extract the elements of the first row of the USArrests
data frame which are greater than 15.
4. Consider the USArrests data and observe the output of the following
functions, when the data frame is supplied as an argument to the
functions: head(), tail(), nrow(), ncol(), names(),
row.names(), dim(), summary(), str().
5. Write the difference between the following using suitable examples:
(i) getwd() and setwd()
(ii) read.table() and read.csv()
6. Write the output of the following commands:
(i) paste("You are blessed", 1:3, sep="$", collapse=", ")
(ii) cat("Your family", "is your greatest", sep=" ",
fill=TRUE, "strength.")
(iii) cat(LETTERS[1:5], 1:5, sep=",", "\n")
(iv) cat(paste(LETTERS[1:5],1:5, sep=":"),sep=",", "\n")

4.9 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. We first create a data frame named Adm.data using the following
code:
Adm.data <- data.frame( c("Shreyash","Prithu",
"Yuvaan","Advika","Pawan","Pehu"),as.factor(c("Male",
"Male", "Male", "Female", "Male", "Female")),c(88.55,
80.13, 85.31, 75.22, 65.04, NA),c(TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE))
(i) After creating Adm.data print it and observe the output.
(ii) We can set the row and column names in a single command using
the dimnames() function in the following manner:
dimnames(Adm.data) <- list(paste0("R",1:6),
c("Name","Gender","Percentage","AgeG30"))
Then print Adm.data again and observe the output.


(iii) The Adm.data can be sorted according to the Percentage data
using the following command:
Adm.data[order(Adm.data$Percentage),]
2. The given output can be obtained using the paste() function with the
following code:
paste(letters[1:4], paste0(c(1,3,5,9),"$"),
sep="##", collapse=", ")
Alternatively, we can use the cat() function to get the given output as
follows:
cat(paste(letters[1:4], paste0(c(1,3,5,9),"$"),
sep="##"), sep=", ", "\n")
3. We first create a .txt file named AdData.txt and after setting the path
we read it using the read.table() function using the following code:
read.table("AdData.txt", header=TRUE)
4. After setting the path to working directory, the USArrests data can be
written in given file formats using following commands:
In .txt file:
write.table(USArrests, "USAdata.txt")
In .csv file:
write.csv(USArrests, "USAdata.csv")
In .xlsx file:
library("XLConnect")
writeWorksheetToFile("USAdata.xlsx", data=USArrests,
sheet="FirstSheet")
5. (i) "2023-08-13 IST" "2023-10-31 IST"
(ii) "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
(iii) "2021-08-10 11:10:00 IST" "2023-08-13 10:05:00 IST"
(iv) "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"

Terminal Questions (TQs)


1. The first twenty rows of the USArrests data can be extracted and
assigned to ExtUS as follows:
ExtUS <- USArrests[1:20,]; ExtUS
After extracting the data, we next compute the order, rank and sorted
data corresponding to the UrbanPop variable as follows:
ExtUS$UrbO <- order(ExtUS$UrbanPop)
ExtUS$UrbR <- rank(ExtUS$UrbanPop)
ExtUS$UrbS <- sort(ExtUS$UrbanPop)
print(ExtUS)
2. The Adm.data can be divided into groups according to its Gender
variable using the following code:
split(x=Adm.data, f=Adm.data$Gender)
3. The elements of the first row of the USArrests data set greater than 15
can be extracted using the logical condition USArrests[1,]>15 in the
first row of the data, i.e., USArrests[1,] as follows:
USArrests[1,][USArrests[1,]>15]
4. See subsection 4.2.1.
5. See sections 4.4 and 4.5.
6. (i) "You are blessed$1, You are blessed$2, You are blessed$3"
(ii) Your family is your greatest strength.
(iii) A,B,C,D,E,1,2,3,4,5,
(iv) A:1,B:2,C:3,D:4,E:5,

UNIT 5

GRAPHICAL REPRESENTATION
OF DATA WITH R

Structure
5.1 Introduction
Expected Learning Outcomes
5.2 Line and Scatter Plots
Line Plot
Scatter Plot
Saving a Created Plot
5.3 Pairs Plot
5.4 Stem and Leaf Plot
5.5 Bar Plot
5.6 Histogram
5.7 The curve() Function
5.8 Box Plot
5.9 Pie Chart
5.10 Strip Chart
5.11 Cloud Plot
5.12 Conditional Plot
5.13 Summary
5.14 Terminal Questions
5.15 Solutions/Answers

5.1 INTRODUCTION
In the first four units of Block 1, you have learnt different types of objects of R
like vectors, matrices, arrays, lists and data frames. In addition to this, you
have learnt indexing, the method of subsetting, testing for membership/class
and coercion of classes of R objects.
The main objective of this unit is to make you familiar with the functions which
are most frequently used for the graphical representation of data. Graphical
representations of statistical data help us to present the data in a more
meaningful way, which is easily understandable and helps us to take decisions
and draw conclusions quickly. Often it is essential to present statistical data
graphically during the statistical analysis. Various graphical functions
are available in R to create plots suited to the type of data.
There are several advantages associated with the graphical representation of
data. Some of them are as follows:
Graphical representations are more acceptable in comparison to other
presentations of data.
Comparative analysis can be easily made on the basis of graphical
representation of the data.
Graphical representation of data helps us in instant decision making and
presents the data more attractively and logically.
It can be easily understood by a less literate audience, and so on.
Graphical representation of the data has a long-lasting memorizing effect.

*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes


After completing this unit, you should be able to:
create different types of plots and graphs in R;
learn the usage of different graphical arguments of the graphical
functions;
change the appearance of a created plot or graph; and
save the created plot or graph.

5.2 LINE AND SCATTER PLOTS
In this section, we shall discuss the method of creating line and scatter plots,
both using the plot() function. The main arguments of interest of the
plot() function are as follows:
#The plot() function
plot(x, #x-axis data
y, #y-axis data
type, #type of plot
lty, #line type
lwd, #line width
pch, #plotting character
cex, #size of plotting characters
col, #line or plotting character color
xlim, #x-axis range
ylim, #y-axis range
xlab, #x-axis label
ylab, #y-axis label
main, #overall title
... ) #other arguments

Let us first discuss the method of creating a line plot using the plot()
function.

5.2.1 Line Plot


A line plot connects the points by line segments from left to right
of the chart to display the changes in the values of a variable over
a number line, for example, time, temperature, days etc.

For illustration consider the following arbitrary data on the sales of
steel for the period 2011-2022.

Year   Sale of Steel            Year   Sale of Steel
       (in thousand tonnes)            (in thousand tonnes)
2011   7.9                      2017   8.6
2012   8.2                      2018   9
2013   9.5                      2019   4
2014   10.5                     2020   5
2015   8.1                      2021   8.5
2016   9.3                      2022   13

Now, we discuss the method of creating a line plot of the given sales data in
R. We first assign the year data to vector object Yr and steel sales data to a
vector object Sale as follows:
#Assigning sales data
> Yr <- 2011:2022
> Sale <- c(7.9, 8.2, 9.5, 10.5, 8.1, 9.3, 8.6, 9, 4, 5, 8.5,
13)

To create a line plot using the plot() function, we assign its x argument as
Yr (for x-axis), the y argument as Sale (for y-axis), the type argument as
"l" (to create a line plot), the xlab argument as ‘Year’, the ylab argument as
‘Sales of steel (in thousand tonnes)’ and the main argument as ‘Sales of steel
for the period 2011-2022’ as follows:
#Creating a line plot
> plot(x=Yr, y=Sale, type="l", xlab="Year",
ylab="Sales of steel (in thousand tonnes)",
main="Sales of steel for the period 2011-2022")

The created plot is shown in Fig. 5.1.

Fig. 5.1: Plot of the sales of steel data for the period 2011-2022

The type argument of the plot() function can take different values like "p",
"l", "b", "c", "o", "s", "h" and "n". Each of these types presents
the created plot differently.

Possible types for the type argument in the plot() function are as follows:
"p" for points,
"l" for lines,
"b" for both points and lines,
"c" for empty points joined by lines,
"o" for overplotted points and lines,
"s" and "S" for stair steps,
"h" for histogram-like vertical lines, and
"n" does not produce any points or lines.

You can observe that the line plot shown in Fig. 5.1 does not show the points
which are joined by the line segments. Additionally, the line color is black
(default) and the line width is 1 (default). We now present Fig. 5.1 differently,
by changing the type of the plot to "o" using the type argument, the width of
the line using the lwd argument and the color of the line using the col argument
of the plot() function as follows:
#Creating a plot
> plot(Yr, Sale, type="o", col="blue", lwd=2, xlab="Year",
ylab="Sales of steel (in thousand tonnes)",
main="Sales of steel for the period 2011-2022")

The created plot is shown in Fig. 5.2. Note that when the type of the plot is
chosen as "o" the plot() function creates a plot in which the points are
overplotted on the line. Also, the lwd argument is used to increase the thickness
of the line and the col argument is used here to draw the line in blue.
Note: A higher value of the lwd argument increases the width
(thickness) of the line and a smaller value decreases it relative to its
default value 1.

Fig. 5.2: Plot of the sales data with different line thickness, color and type
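
To compare several values of the type argument side by side, the same data can be drawn in a grid of panels. The following is a minimal sketch; it assumes the plots are sent to a temporary PDF file with the pdf() function (so that the code also runs where no screen is available), and the shortened axis label and panel titles are chosen only for this illustration:

```r
Yr   <- 2011:2022
Sale <- c(7.9, 8.2, 9.5, 10.5, 8.1, 9.3, 8.6, 9, 4, 5, 8.5, 13)

# Draw into a temporary PDF file
out <- tempfile(fileext = ".pdf")
pdf(out)
par(mfrow = c(2, 2))                    # 2 x 2 grid of panels
for (tp in c("p", "l", "b", "o")) {
  plot(Yr, Sale, type = tp, col = "blue", lwd = 2,
       xlab = "Year", ylab = "Sales", main = paste("type =", tp))
}
dev.off()                               # close the file
```

In an interactive session the pdf() and dev.off() lines can be omitted, in which case the four panels appear in the plotting window.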

Now, we discuss another important graphical parameter lty of the plot()
function, which is used to display different types of lines. For the illustration
purpose we take x from 0 to 10. Then use the plot() function to plot a
horizontal line with black color (col=1, default value), fix the x-axis range from
0 to 10 and y-axis range from 0 to 6 as follows:
#Creating a line plot
> x <- 0:10
> plot(x, rep(1,11), type="l", col=1, xlim=c(0,10),
+ ylim=c(0,6), lwd=2)

This plot() function command will only plot a single horizontal line, which
can be verified from the next screenshot.
Note: The ranges of x and y are chosen suitably for illustration purpose only.
Also, to plot a horizontal line, the same y-axis value is repeated 11 times
(i.e., equal to the length of x) using the rep() function.

To display different line types, we shall plot more horizontal lines in the
created plot using the lines() function. This function is mainly used to add
lines in the already created plot. We first take help on this function as follows:
#Seeking help
> ?lines
starting httpd help server ... done

Additionally, the lines() function also supports arguments such as lty,
col, lwd and type. These arguments are used on the same lines as
discussed in the plot() function. To add more lines to the created plot, we run a
for loop using the lines() function to draw 5 more horizontal lines after the
plot() function command. Moreover, we use the lty, col and lwd
arguments to differentiate between the lines.
#Displaying different types of lines
> for(i in 2:6){
+ lines(x,rep(i,11), lty=i, col=i, lwd=2) }
The created plot is shown in Fig. 5.3.

Integer value of the lty argument displays the following line types:
0 "blank"
1 "solid"
2 "dashed"
3 "dotted"
4 "dotdash"
5 "longdash"
6 "twodash"

Fig. 5.3: Plot of different line types
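As a small variation on the plot above, the following sketch draws the six visible line types by name and labels them with the legend() function of the graphics package. The object name lty_names and the position "top" are choices made for this illustration only.

```r
# Names of the six visible line types (they correspond to lty = 1 to 6)
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

x <- 0:10
# type = "n" sets up the axes without drawing any line
plot(x, rep(1, 11), type = "n", xlim = c(0, 10), ylim = c(0, 8),
     xlab = "x", ylab = "lty")

# Draw one horizontal line per line type
for (i in seq_along(lty_names)) {
  lines(x, rep(i, 11), lty = lty_names[i], col = i, lwd = 2)
}

# Label each line with the name of its line type
legend("top", legend = lty_names, lty = 1:6, col = 1:6, lwd = 2, ncol = 3)
```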


Note: The for loop is discussed in Unit 7 of MST-015 (Introduction to R
Software). To understand the execution of the code, refer to Unit 7.
The lines() function discussed here is available in the graphics package.
There is one more function abline() available in the same package. The
abline() function is used to add one or more straight lines, either vertical,
horizontal or with a given slope and intercept, to the current plot. Additionally, this
function also supports arguments such as lty, col and lwd. We have
discussed its different arguments while creating line plots in Session 1 of
MSTL-011 (Statistical Computing Using R-I). You are advised to refer to Session
1 to get more clarity on this function. The abline() function can also be used to
create the plot given in Fig. 5.3 using the following commands:
#Using abline() function for creating different lines
> x <- 0:10
> plot(x, rep(1,11), type="l", col=1, xlim=c(0,10),
+ ylim=c(0,6), lwd=2)
> abline(h=x, lty=x, col=x, lwd=2)
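The three ways of using the abline() function mentioned above (horizontal, vertical, and slope with intercept) can be sketched as follows. The plotting region and the values of intercept and slope are chosen only for illustration.

```r
# Empty plotting region with both axes running from 0 to 10
plot(0:10, 0:10, type = "n", xlab = "x", ylab = "y")

# Illustrative values (assumptions for this sketch)
intercept <- 0
slope <- 1

abline(h = 5, lty = "dashed", col = "blue")             # horizontal line y = 5
abline(v = 5, lty = "dotted", col = "blue")             # vertical line x = 5
abline(a = intercept, b = slope, col = "red", lwd = 2)  # line y = a + b*x
```

Here the a argument is the intercept and the b argument is the slope, so the red line is y = x and passes through the point where the other two lines cross.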

The line type argument lty can also be assigned as a character string. This
function supports a number of line types as character string such as "blank",
"solid", "dashed", "dotted", "dotdash", "longdash" or "twodash".
So instead of assigning lty as an integer number, it can also be assigned as
a character string. For the illustration purpose, let us create a line plot of the
sales data given in the beginning of this unit by assigning the lty argument
as character string "longdash", the color argument col as character string
i.e., "red" and the type argument as "b", to plot both line and points in the
following manner:
#Creating a line plot
> plot(Yr, Sale, type="b", lwd=2, lty="longdash", col="red")

The created plot is shown in Fig. 5.4.

Fig. 5.4: Plot of sales data using different line type

Note: You can have a look at the different available colors using the following R
command:
#Viewing available colors
> demo("colors")

Next, we illustrate other important graphical parameters such as pch and cex
of the plot() function. Note that the pch argument is used to choose the plotting
character and the cex argument controls the size of the plotting character.
Let us now display the first 25 plotting characters available in R by plotting them
diagonally. To do this, we first assign the x-axis and y-axis values as 1 to 25. To plot
each point in a different color we shall use the col argument and in a different
character we shall use the pch argument. Also, we assign the cex argument
as 2 (its default value is 1) to display the characters bigger than the default
size as follows:
#Assigning x and y
> x <- 1:25
> y <- x
#Plotting a diagonal line consisting of the first 25 plotting
#characters
> plot(x, y, pch=1:25, col=1:25, cex=2)

The created plot is shown in Fig. 5.5.


Note: The col argument controls the line or plotting character color. If col is
1 it displays the black color (alternatively the name of the color can be used,
which is written in double quotes, for example "black", "red", "green"
etc.) and pch=1 displays the point as a circle. Similarly, col=2 displays the red
color and pch=2 displays the point as a triangle. By default, for these
arguments the integer value 1 is used, which means color as black and
plotting character as a circle.

Fig. 5.5: Plot of first 25 plotting characters

Furthermore, the cex argument is used to enlarge the size of the plotting
characters. To show the importance of the cex argument, we shall vary it by
0.1 in the range 0.6 to 3 (for illustration purpose only) as follows:
#Plotting a diagonal line consisting of characters of different
#sizes
> plot(x, y, pch=1:25, col=1:25, cex=seq(0.6,3,0.1))

The created plot is shown in Fig. 5.6. From Fig. 5.6 you can observe that the
plotting characters appear in increasing sizes due to the cex argument
(the characters plotted first are smaller than the characters plotted at the end),
in different colors due to the col argument and as different characters due to
the pch argument.

Fig. 5.6: Plot of first 25 plotting characters in increasing size

5.2.2 Scatter Plot
A scatter plot is used to display bivariate data using characters or symbols,
generally dots. It is mainly used to show the relationship between two
quantitative variables for a set of data. So, using a scatter plot we can easily
check whether the variables appear to be correlated or not. To see details on
correlation you can refer to Unit 9 of MST-015.
In R a scatter plot is created using the plot() function. Most importantly, if
the type argument of the plot() function is not specified, then by default it
creates a scatter plot. For the illustration purpose we create a scatter plot
between the Murder and Assault variables of the USArrests data frame
(discussed in Unit 4 of MST-015) using the plot() function. Recall that these
two variables can be extracted from the USArrests data frame using the ‘ $ ’
as USArrests$Murder and USArrests$Assault or otherwise we can use
column numbers to refer them as USArrests[,1] and USArrests[,2].
Then a scatter plot can be created by assigning the first two arguments of the
plot() function as the Murder and Assault variables, the xlab argument as
"Murder", the ylab argument as "Assault" and the main argument as "Scatter
plot of Assault vs Murder" respectively, as follows:

#Creating a scatter plot between Murder and Assault variables


> plot(USArrests$Murder, USArrests$Assault, xlab="Murder",
+ ylab="Assault", main="Scatter plot of Assault vs Murder")

The created plot is shown in Fig. 5.7. The scatter plot depicts a positive
relationship between the Murder and Assault variables, which means that as the
number of murders increases, the number of assault cases also tends to increase.

Note: Different pch, col, and cex can also be used in the plot() function,
while creating a scatter plot. Also recall from Unit 4 of MST-015 that the
columns of a data frame can also be referred as variables of a data frame.
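The positive relationship seen in the scatter plot can also be checked numerically. The following sketch uses the cor() function (correlation is discussed in Unit 9 of MST-015) on the same two variables; the object name r is chosen here only for illustration.

```r
# Correlation coefficient between the two variables of the scatter plot
r <- cor(USArrests$Murder, USArrests$Assault)
r

# A positive value of r agrees with the upward pattern of the points
r > 0
```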

Fig. 5.7: Scatter plot between the Murder and Assault variables of the USArrests data

5.2.3 Saving a Created Plot
Till now, you have learnt to create a line plot and a scatter plot. After creating a
plot, you may be interested in saving it in a specific format. A number of ways
are available to save a created plot. In this subsection, we shall discuss the
mainly used methods of saving a created plot at chosen locations.
Recall that, in the beginning of this unit, we have given a sales data. For the
illustration purpose, we now discuss the methods of saving a created plot,
shown in Fig. 5.1, in a specific format. For this, we generally set the working
directory using the setwd() function discussed in Unit 4 of MST-015 course.
We next discuss 3 methods of saving a created plot.
Method 1:
To save a created plot, go to the menu bar of the R window and do the
following steps:
Step 1: After creating a plot, click on the graphic window.
Step 2: Click on ‘File’ and then on ‘Save as’ as follows:


Step 3: Save the plot in a required format at a proper location.


Method 2:
Step 1: After creating a plot, right click on the Graphic window.

Step 2: Click on ‘Save as metafile’ or ‘Copy as metafile’.


Step 3: If you have clicked on ‘Save as metafile’, then select a location, where
you want to save the plot. Or otherwise, if you have copied the plot, then open
the file in which it is to be pasted.
Method 3:
In this method we use the location of the working directory to save the created
plot in different formats as follows:
Step 1: Assign the data and set the working directory, i.e., where you want to
save the plot as follows:
#Assigning the data
> Yr <- 2011:2022
> Sale <- c(7.9, 8.2, 9.5, 10.5, 8.1, 9.3, 8.6, 9,4, 5, 8.5,
13)
#Setting the working directory
> setwd("E:/MSC in Applied Statistics MSCAS/Introduction to R Software")

Step 2: After Step 1, open the graphics device in one of these formats: BMP,
JPEG, PNG or TIFF, say PNG. To do so, we first write the format name, say,
png, then in parentheses we write the name of the file, with the format as its
extension, as a character string, say, "TarunaKumari.png" as follows:
# Opening the graphical device in png format
> png("TarunaKumari.png")

Then we create a plot and close the graphical device window as follows:
#Creating a plot
> plot(x=Yr, y=Sale, type="l", xlab="Year",
+ ylab="Sales of Steel (in thousand tonnes)",
+ main="Sales of steel for the period 2011-2022")
#Closing the graphical device
> dev.off()

After executing these commands, the plot will be saved in the .png format in
the working directory that was set using the setwd() function.
Note: On the similar lines, a created plot can also be easily saved in any
allowed format. Note that a plot can also be saved in PDF format on the same
lines. For more details you can seek help as follows:
#Seeking help
> ?png
starting httpd help server ... done
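As mentioned in the note, a plot can be saved in PDF format on the same lines. The following is a minimal sketch using the pdf() function of the grDevices package. The file name is hypothetical, and tempdir() is used here only so that the sketch does not depend on any particular working directory; in practice you may give a file name relative to the directory set with setwd().

```r
# Hypothetical file name inside R's temporary directory
out_file <- file.path(tempdir(), "murder_assault.pdf")

pdf(out_file, width = 7, height = 5)   # open the PDF device (size in inches)
plot(USArrests$Murder, USArrests$Assault,
     xlab = "Murder", ylab = "Assault")
dev.off()                              # close the device so that the file is written

# Check that the file has been created
file.exists(out_file)
```

As with png(), the plot is written to the file only after the device is closed with dev.off().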

SAQ 1
Write a command to create a plot which displays the first twenty plotting
characters in decreasing size diagonally.

5.3 PAIRS PLOT


The pairs plot is used to depict the relationship between all the columns of a
matrix or data frame. A pairs plot in R is created using the pairs() function
available in the graphics package and it produces a matrix of scatter plots.
For the illustration purpose, let us visualize the relationships between all the
variables of the USArrests data frame on the basis of the first 30 rows of the
data. To do so with the plot() function, we would need to give 4C2 = 6 plot()
function commands (as there are 4 variables in the data frame). Alternatively, we can
use the pairs() function to get the matrix of scatter plots depicting the
relationships between the variables as follows:

#Creating a matrix of scatter plots


> pairs(USArrests[1:30,]) #to select first 30 rows

Fig. 5.8: Matrix of scatter plots between the 4 variables of the USArrests data

Note that the produced matrix of scatter plots is a symmetric matrix. The upper
triangular part shows the same pairs of variables as the lower triangular part of
the matrix, with the axes interchanged. The first
row of this matrix depicts the relationships of the Murder variable with the Assault,
UrbanPop and Rape variables. Proceeding similarly, the second row of this
matrix depicts the relationships of the Assault variable with the Murder, UrbanPop
and Rape variables. Remaining rows of the matrix can be inferred on the
same lines.
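The count of distinct pairs of variables can be verified in R with the choose() function, and the pairs themselves can be listed with combn(). Since the two triangles of the matrix show the same pairs, the sketch below also draws only the lower triangle by assigning the upper.panel argument of pairs() as NULL. The object names are chosen only for illustration.

```r
n_vars <- ncol(USArrests)                 # number of variables in the data frame
n_pairs <- choose(n_vars, 2)              # number of distinct pairs of variables
n_pairs

var_pairs <- combn(names(USArrests), 2)   # each column holds one pair of names
var_pairs

# Matrix of scatter plots showing only the lower triangular part
pairs(USArrests[1:30, ], upper.panel = NULL)
```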

SAQ 2
Write R code to create a suitable plot to depict the relationship between the
first four variables of the following iris data set, whose first 10 rows are as
follows:

5.4 STEM AND LEAF PLOT


The stem and leaf plot is used to visualize the shape of a distribution and can
be viewed as an alternative to histograms. In R, the stem() function available
in the graphics package is used to create a stem and leaf plot of the values
in a variable. For the illustration purpose, we now create a stem and leaf plot
of the first seven values of the Murder variable of the USArrests data. For
the sake of convenience, we first extract the first seven values from the Murder
variable and assign them to x, then create a stem and leaf plot using the stem()
function as follows:
#Extracting and assigning first 7 values from Murder variable
> x <- USArrests$Murder[1:7]; x
[1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3

#Creating a stem and leaf plot


> stem(x)
The decimal point is at the |

2 | 3
4 |
6 | 9
8 | 180
10 | 0
12 | 2

You can also expand the scale of the stem and leaf plot to make it more
readable (if required) using the scale argument (with default value 1) of the
stem() function as follows:
#Creating a stem and leaf plot with scale=2
> stem(x, scale=2)
The decimal point is at the |

3 | 3
4 |
5 |
6 |
7 | 9
8 | 18
9 | 0
10 | 0
11 |
12 |
13 | 2

Note: In the stem() function, when scale=2, the stem and leaf plot is
expanded to roughly twice the default length. Additionally, the values appearing
on the left side are known as stems and the values appearing on the right side are
known as leaves.

SAQ 3
Write R command to create a stem and leaf plot of the UrbanPop variable of
the USArrests data with scale as 2.

In the next section, we shall discuss the method of creating a bar plot. If the
values describing the bars which make up the plot are not directly available,
then we use the table() function to get the heights of the bars, i.e., to get a
discrete frequency distribution
table. The table() function is available in the base package. For the
illustration purpose, let us create a frequency table of the following data using
the table() function.
10, 18, 10, 4, 34, 18, 4, 4
#Getting a frequency table
> table(c(10, 18, 10, 4, 34, 18, 4, 4))

 4 10 18 34
 3  2  2  1

Observe that the first row of the obtained output is showing different numbers
available in the data and second row is showing the frequency corresponding
to each number appearing in the first row.
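The two rows of this output can be extracted separately, which helps in seeing what the barplot() function of the next section receives. In the following sketch the object names obs, freq, values and counts are chosen only for illustration.

```r
obs <- c(10, 18, 10, 4, 34, 18, 4, 4)
freq <- table(obs)          # frequency table of the data

values <- names(freq)       # first row: distinct values (as character strings)
counts <- as.vector(freq)   # second row: corresponding frequencies

values
counts

# The frequency table can be supplied directly as the heights of the bars
barplot(freq, xlab = "Value", ylab = "Frequency")
```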

5.5 BAR PLOT


Bar plot or bar chart is a plot that represents a variable with rectangular bars
with length equal to the values of the variable according to time, category,
age or some other factors or classification. Also, a bar plot can be plotted
horizontally or vertically. A bar chart in R can be created using the
barplot() function available in the graphics package. The main
arguments of interest of the barplot() function are as follows:
#The barplot() function
barplot(height,  #values describing bars
horiz,           #horizontal or vertical bars
beside,          #positioning of bars
names.arg,       #naming bars
col,             #bar color
data,            #data frame whose columns will be used
legend.text,     #creating legend
args.legend,     #extra arguments to be supplied to legend()
xlim,            #x-axis range
ylim,            #y-axis range
xlab,            #x-axis label
ylab,            #y-axis label
main,            #overall title
...)             #other arguments

Next, we illustrate the method of creating a bar plot using the barplot()
function. To do so, we create a bar plot of the Temp variable of the
airquality data. Let us first view the first 5 rows of the data using the
head() function as follows:
#Viewing the first 5 rows of the data
> head(airquality, 5)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5

Note that, before creating a bar plot of the Temp variable, we must compute
the frequency table of the Temp variable using the table() function as
follows:
#Computing frequency table
> table(airquality$Temp) #or table(airquality[,4])

56 57 58 59 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
1 3 2 2 3 2 1 2 2 3 4 4 3 1 3 3 5 4 4 9 7

78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96 97
6 6 5 11 9 4 5 5 7 5 3 2 3 2 5 3 2 1 1

After computing the frequency table, to create a bar plot, we supply the first
argument of the barplot() function as the frequency table. Assign the main
argument to add a main title to the plot and the xlab and ylab arguments to
add labels to the x-axis and y-axis, respectively. Moreover, the color argument
col is used to fill color in the bars. In the earlier shown examples, the col
argument was assigned as positive integer values, but colors can also be
assigned as character string(s). So, for the illustration purpose, we now assign
the col argument as "lavender" as follows:
#Creating a bar plot
> barplot(table(airquality$Temp), col="lavender", xlab="Temp",
ylab="Frequency", main="Bar plot of the Temp data")
The obtained bar plot is shown in Fig. 5.9.

[Plot: "Bar plot of the Temp data"; x-axis: Temp (56 to 97), y-axis: Frequency (0 to 10)]

Fig. 5.9: Bar plot of the Temp variable of the airquality data

Furthermore, to illustrate the use of the data, beside, legend.text and
args.legend arguments of the barplot() function, we take another
problem, in which we are interested in creating a multiple bar plot. For the
illustration purpose, let us consider the built-in data set longley available in
the datasets package. If interested, you can take help on this data and view
what each column of this data is representing.

We now illustrate the method of creating a multiple bar diagram of 3 variables,
namely, GNP, Unemployed and Employed of the longley data in accordance
with the Year variable of the data. To create it, we supply the first argument of
the barplot() function as the 3 variables bound together with the help of the cbind()
function and classify them according to the Year variable with the help of the ‘ ~ ’
symbol. Also, we assign the data argument as longley, the beside
argument as TRUE, the legend.text argument as a character vector
consisting of the names of the legends and the args.legend argument as a
list consisting of the position, where legends are to be displayed. Moreover,
the col argument is used to fill different colors in the bars as follows:
#Creating a multiple bar plot
> barplot(cbind(GNP, Unemployed, Employed) ~ Year,
+ data = longley,
+ beside = TRUE,
+ col=1:3,
+ legend.text=c("GNP","Unemployed","Employed"),
+ args.legend = list(x = "topleft")
+ )

The obtained plot is shown in Fig. 5.10.


[Plot: legend GNP, Unemployed, Employed; x-axis: Year (1947 to 1962); y-axis: 0 to 500]

Fig. 5.10: Multiple bar plot of 3 variables of longley data

Note: A bar plot of variable(s) according to some factor or classification can be
created using the ‘ ~ ’ symbol.
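The same multiple bar plot can also be built without the ‘ ~ ’ symbol by supplying a matrix as the height argument, in which each row is a variable and each column is a year. This is only an alternative sketch; it may be helpful on older versions of R, since, to the best of our understanding, the formula form of barplot() was introduced in R 4.0.0. The object name m is chosen only for illustration.

```r
# Rows = variables, columns = years
m <- t(as.matrix(longley[, c("GNP", "Unemployed", "Employed")]))
colnames(m) <- longley$Year

# With beside = TRUE, each column of the matrix forms one group of bars
barplot(m, beside = TRUE, col = 1:3, xlab = "Year",
        legend.text = rownames(m),
        args.legend = list(x = "topleft"))
```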

Next, we create a subdivided horizontal bar plot of the same data. To create a
horizontal bar plot instead of a vertical bar plot, we use the horiz argument
of the function and assign it as TRUE and also assign the beside argument as
FALSE. Additionally, we keep the remaining arguments as earlier in the
following manner:
#Creating a horizontal subdivided bar plot
> barplot(cbind(GNP, Unemployed, Employed) ~ Year,
+ data = longley,
+ beside = FALSE,
+ horiz = TRUE,
+ legend.text = c("GNP","Unemployed","Employed"),
+ args.legend = list(x = "bottomright"),
+ col=c("red","blue","green")
+ )

The obtained plot is shown in Fig. 5.11.

Fig. 5.11: Subdivided bar plot of 3 variables of longley data

SAQ 4
Write R code to create a multiple bar plot of the following data of the number
of students admitted to the M.Sc and B.Sc in Applied Statistics programme in
different academic years.
Year         Admitted to M.Sc    Admitted to B.Sc
2016-2017    500                 1000
2017-2018    550                 600
2018-2019    650                 800
2019-2020    800                 900
2020-2021    720                 950
2021-2022    1000                1200

5.6 HISTOGRAM
A basic frequency histogram of ungrouped data x can be created in R
using the hist() function available in the graphics package. The main
arguments of interest of the hist() function are as follows:

#The hist() function
hist(x,        #data
col,           #color to be filled in bars
border,        #border color of the bars
labels,        #to display frequency labels on bars
axes,          #to display or hide x and y axes
plot,          #FALSE extracts the frequency distribution
prob,          #for density histogram
density,       #shading lines present in bars
angle,         #angle of the shading lines displayed in bars
breaks,        #number of cells or break points
main,          #main title
xlim, ylim,    #range of the x and y axes
xlab, ylab,    #x-axis and y-axis labels
...)           #other arguments

For the illustration purpose, we now create a histogram of the Temp variable
of the airquality data using the hist() function. To do so, we assign the
x argument of the function as the Temp data and give suitable x label and
overall title using the xlab and main arguments as follows:
#Creating a histogram
> hist(airquality$Temp, xlab = "Temp",
+ main = "Histogram of Temp variable")

The created histogram plot is shown in Fig. 5.12.

Fig. 5.12: Histogram of the Temp variable of the airquality data

From Fig. 5.12, observe that the default fill color of the rectangular bars is
grey. Suppose you would like to fill the bars with orange color and want the
borderlines to be blue. Then, it can be done by using the col and border
arguments of the hist() function. Additionally, we can also display the
frequencies by assigning the labels argument as TRUE. Moreover, by default the
axes argument is TRUE, which is used to display the axes. So, to hide the
axes, we assign it as FALSE. In addition to all these we assign the xlim,
xlab and main arguments, to specify the x-axis range, label and overall title of
the plot as follows:
#Creating a histogram
> hist(airquality$Temp,             #data
+ col = "orange",                   #color to be filled in bars
+ border = "blue",                  #border color of the bars
+ labels = TRUE,                    #displaying frequency
+ axes = FALSE,                     #to hide x and y axes
+ xlim = range(airquality$Temp),    #range of the Temp data
+ xlab = "Temp",                    #x-axis label
+ main = "Histogram of Temp data"   #main title
+ )

The created histogram plot is shown in Fig. 5.13.


Next, we discuss the method of extracting the frequency distribution details
from the hist() function. To do so, we assign the plot argument of the
function as FALSE as follows:
#Getting frequency distribution details
> hist(airquality$Temp, plot=FALSE)
$breaks
[1] 55 60 65 70 75 80 85 90 95 100

$counts
[1] 8 10 15 19 33 34 20 12 2
$density
[1] 0.010457516 0.013071895 0.019607843 0.024836601
[5] 0.043137255 0.044444444 0.026143791 0.015686275
[9] 0.002614379

$mids
[1] 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5

$xname
[1] "airquality$Temp"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

It can be verified that the internal structure of the obtained output is a list. So,
its breaks, counts and mids components can be extracted by appending the
‘ […] ’ operator (containing the list component numbers) as follows:
#Extracting frequency distribution
> hist(airquality$Temp,plot=FALSE)[c(1,2,4)]
$breaks
[1] 55 60 65 70 75 80 85 90 95 100

$counts
[1] 8 10 15 19 33 34 20 12 2

$mids
[1] 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5

Here, $breaks is showing the boundaries of the class intervals (each value is the
lower limit of one interval and the upper limit of the previous one), $counts is showing
the frequencies corresponding to each interval and $mids is showing the
middle points of the class intervals.
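These components can be combined into a frequency distribution table with one row per class interval. The following is a minimal sketch; the object names h and freq_table and the column names are chosen only for illustration.

```r
# Frequency distribution details without drawing the histogram
h <- hist(airquality$Temp, plot = FALSE)

freq_table <- data.frame(
  lower = head(h$breaks, -1),   # lower limit of each class interval
  upper = h$breaks[-1],         # upper limit of each class interval
  mid   = h$mids,               # middle point of each class interval
  count = h$counts              # frequency of each class interval
)
freq_table
```

Note that there is one more break point than there are class intervals, which is why the last break is dropped for the lower limits and the first break is dropped for the upper limits.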

Fig. 5.13: Histogram plot of the Temp variable in orange color without axes

Note: We have used the axes argument of the hist() function to hide the
axes.

Fig. 5.13 shows a frequency histogram. On the similar lines a probability
histogram can also be created by adding one more argument probability
or prob to the hist() function and assigning it as TRUE. Or alternatively, we
may assign the freq argument as FALSE. Then we get a density histogram
plot as follows:
#Creating a density histogram
> hist(airquality$Temp,
+ col = "lightpink",
+ border = "pink3",
+ main = "Histogram of Temp data",
+ xlim = range(airquality$Temp),
+ xlab = "Temp",
+ axes = TRUE,
+ labels = TRUE,
+ probability=TRUE) #or freq=FALSE

The obtained density histogram is shown in Fig. 5.14.



Fig. 5.14: Density histogram of the Temp variable
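In a density histogram, the height of each bar is the relative frequency of the class interval divided by its width, so the total area of the bars is 1. The following sketch verifies this for the Temp variable using the components returned by hist() with plot=FALSE; the object names are chosen only for illustration.

```r
h <- hist(airquality$Temp, plot = FALSE)

class_width <- diff(h$breaks)           # width of each class interval
rel_freq <- h$counts / sum(h$counts)    # relative frequency of each interval

# Density = relative frequency / class width
all.equal(h$density, rel_freq / class_width)

# Total area of the bars of the density histogram
sum(h$density * class_width)
```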

We have already discussed the method of filling color to the bars of a
histogram. Now we discuss the method of filling the bars with colored lines. To
do so, we use the density argument of the hist() function and adjust the
angle of the shading lines using the angle argument of the function as
follows:
#Filling the created histogram with diagonal lines
> hist(airquality$Temp,
+ col = "orange",
+ border = "blue",
+ main = "Histogram of Temp data",
+ xlim = range(airquality$Temp),
+ density = 2, #density or denseness of shading lines
+ angle = 45, #angle of the shading lines
+ xlab = "Temp",
+ axes = TRUE,
+ labels = TRUE)

The obtained histogram is shown in Fig. 5.15.

Fig. 5.15: Plot of histogram with shading lines

Note that, to fill the histogram with lines, we have assigned the density
argument of the function as 2. If we increase the value of the density argument,
then the lines will appear closer and denser.
Next, we illustrate the use of one of the most important arguments, breaks,
of the hist() function. It is used to create a histogram with a specific number
of breaks. For the illustration purpose, we now create the same histogram by
assigning 10 to the breaks argument as follows:
#Creating histogram using breaks argument
> hist(airquality$Temp,
+ col = "lightblue",
+ border = "black",
+ breaks = 10, #number of breaks
+ main = "Histogram of Temp data",
+ xlim = range(airquality$Temp),
+ xlab = "Temp",
+ axes = TRUE,
+ labels = TRUE)$breaks
[1] 55 60 65 70 75 80 85 90 95 100

The obtained histogram is shown in Fig. 5.16. Note that, when the breaks
argument is assigned as a single number, it specifies the number of cells for
the histogram. R treats this number as a suggestion only and chooses convenient
break points, which is why 9 cells are obtained here. Also, when the breaks
argument is assigned as a vector, it
gives the breakpoints between histogram cells.
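The difference between the two ways of assigning the breaks argument can be seen in the following sketch. The break points in my_breaks are chosen only for illustration, and plot=FALSE is used so that only the frequency distribution details are computed.

```r
# breaks as a single number: the number of cells is only suggested
h_suggest <- hist(airquality$Temp, breaks = 10, plot = FALSE)
length(h_suggest$counts)     # number of cells actually used

# breaks as a vector: the break points are used exactly as given
my_breaks <- seq(50, 100, by = 10)
h_exact <- hist(airquality$Temp, breaks = my_breaks, plot = FALSE)
h_exact$breaks
length(h_exact$counts)
```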

Fig. 5.16: Plot of histogram with specific number of breaks

Moreover, the breaks argument can also be used effectively to create a
histogram of unequal cells or class intervals as follows:
#Creating a histogram of unequal class intervals
> hist(airquality$Temp,
+ col = "lightyellow",
+ border = "black",
+ breaks=c(56,65,98),
+ main = "Histogram of Temp data",
+ xlim = range(airquality$Temp),
+ xlab = "Temp",
+ axes = TRUE,
+ freq = FALSE,
+ labels = TRUE)[c(1,2,3)]
$breaks
[1] 56 65 98

$counts
[1] 18 135

$density
[1] 0.01307190 0.02673797

From Fig. 5.17, observe that we get a density histogram plot as we have
assigned the freq argument as FALSE.
Note: Whenever you are using the breaks argument, it is always better to
see the minimum and maximum values of the data under consideration.

Fig. 5.17: Plot of histogram with unequal class intervals
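The check suggested in the note can be done with the range() function before the histogram is created. In the following sketch, my_breaks holds the break points used above and covers is a name chosen only for illustration; if the break points do not span the whole range of the data, the hist() function stops with an error.

```r
temp_range <- range(airquality$Temp)   # minimum and maximum of the data
temp_range

my_breaks <- c(56, 65, 98)

# TRUE when the break points span the whole range of the data
covers <- min(my_breaks) <= temp_range[1] && max(my_breaks) >= temp_range[2]
covers
```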

SAQ 5
Write R code to create a histogram of the Wind variable of airquality data
with unequal class intervals.

5.7 The curve() Function


The curve() function available in the graphics package is used to draw a
curve of a function over the given range [from, to]. This function is also used
to plot a given expression. The main arguments of interest of this function are
as follows:
#The curve() function
curve(expr,   #name of the function or an expression
from,         #lower limit of the curve
to,           #upper limit of the curve
add,          #add curve to the existing plot
...)          #other arguments

For the illustration purpose, we now create a curve of the following function:
f(x) = x^3 + x^2 over the range -5 to 5 using the curve() function. To do so, we
assign the expr argument of the function as f(x), the from argument as -5
and the to argument as 5 as follows:
#Creating a curve of the given function
> curve(expr=x^3+x^2, from=-5, to=5, col="blue", lwd=2,
+ ylab="f(x)=x^3+x^2")

You can observe that the curve() function also supports arguments such as
col, lwd, xlab, ylab and so on. These arguments can be easily used on the
same lines as discussed earlier. The plot of the given function is shown in Fig.
5.18.

Fig. 5.18: Plot of f(x) = x^3 + x^2
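Since the expr argument can also be the name of a function, the same curve can be drawn by first defining the function in R. In the following sketch the name f is chosen only for illustration (user-defined functions are discussed in Unit 6 of MST-015).

```r
# Defining the function f(x) = x^3 + x^2
f <- function(x) x^3 + x^2

# Values of the function at the end points and at 0
f(c(-5, 0, 5))

# Supplying the name of the function as the expr argument
curve(f, from = -5, to = 5, col = "blue", lwd = 2, ylab = "f(x)")
```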

We next draw two vertical lines in Fig. 5.18 using the abline() function and
also plot a character on the curve at (x, y)=(0, 0) using the points() function.
To do so, after the curve() function command we give the abline()
function command to draw vertical lines. To draw vertical lines at -2 and 2, we
assign its v argument as c(-2,2) (as the v argument is used to draw vertical
lines) and also use a different color, line type and line width for better
visualization. Thereafter, we plot a point using the points() function. The
points() function is available in the graphics package and is used to add
points to an already created plot as follows:
#Creating a curve with lines and a point
> curve(expr=x^3+x^2, from=-5, to=5, col="blue", lwd=2,
+ ylab="f(x)=x^3+x^2")
> abline(v=c(-2,2), col="red", lty=2, lwd=2)
> points(x=0, y=0, pch=18, cex=2)

The created curve is shown in Fig. 5.19.


Note: The points() function also supports the pch and cex arguments.

Fig. 5.19: Plot of given curve with lines and a point

Next, we illustrate the use of the add argument of the curve() function. The
add argument is used to add the created curve to an existing plot. We have
already discussed the method of creating a histogram. Now, we create a
histogram of 1000 random numbers from the standard normal distribution
using the hist() function and add a standard normal curve to it using the
curve() function as follows:
curve()
#Creating a histogram
> x <- rnorm(1000)
> hist(x, col="greenyellow", freq=FALSE)

#Adding a standard normal density curve
> curve(expr=dnorm(x), lwd=2, from=-3, to=3, add=TRUE,
+ col="red")

The obtained plot is shown in Fig. 5.20.

Fig. 5.20: Plot of histogram and density curve of standard normal random numbers
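
The role of freq=FALSE in the above histogram may not be obvious, so here is a minimal sketch of why it is needed before a density curve is added. The seed value and the object name h are assumptions made only for this illustration. The kernel density estimate drawn with lines(density(x)) is an extra comparison that is not part of the original example.

```r
set.seed(1)                # assumed seed, used only for reproducibility
x <- rnorm(1000)

# Histogram on the density scale, kept as an object for inspection
h <- hist(x, col = "greenyellow", freq = FALSE)

# With freq=FALSE the bar areas add up to 1, like the area under a density
sum(h$density * diff(h$breaks))

# Theoretical density and a kernel density estimate on the same plot
curve(expr = dnorm(x), from = -3, to = 3, lwd = 2, col = "red", add = TRUE)
lines(density(x), lwd = 2, lty = 2, col = "blue")
```

If freq=TRUE were used, the bars would show counts and the density curve would appear almost flat along the x-axis.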

SAQ 6
Write an R command to create a density curve of the normal distribution with
mean 2 and variance 16.

5.8 BOX PLOT


A box plot (box-and-whisker plot) represents the distribution of the data
visually through the five-number summary, that is, the minimum value, first
quartile, second quartile (median), third quartile and maximum value. A box
plot also displays the outliers and helps us to extract them. Box plots are
useful for comparing the distribution of data across variables by drawing them
side by side. Box plots in R are created using the boxplot() function
available in the graphics package.
For illustration, we now create a box plot of the weight variable of the
chickwts data available in the datasets package. We first seek help on the
data as follows:
#Seeking help on chickwts
> ?chickwts
starting httpd help server ... done

We next view the first 5 rows of the data:
#Viewing first 5 rows of the chickwts data
> head(chickwts, 5)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean

To create a box plot using the boxplot() function, we assign the first
argument of the function as the weight variable from the chickwts data. We
also give the label to the y-axis and main title for more clarity as follows:

#Creating a boxplot
> boxplot(x=chickwts$weight, ylab="Weights",
+ main = "Boxplot of Weights data")
The obtained box plot is shown in Fig. 5.21.

Fig. 5.21: Vertical box plot of the weight variable
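
The numbers behind Fig. 5.21 can also be inspected directly. The following is a minimal sketch using the fivenum() and boxplot.stats() functions; the object names w, five and bs are assumptions made for this illustration. Note that these functions work with hinges, which may differ slightly from the quartiles returned by quantile().

```r
w <- chickwts$weight

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(w)
five

# Statistics used for drawing the box and whiskers, together with any
# points lying beyond the whiskers (these are shown as outliers)
bs <- boxplot.stats(w)
bs$stats
bs$out
```

If bs$out is empty, no observation lies beyond the whiskers and the box plot shows no separate outlier points.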

Note that the box plot appears vertically (as the default value of the
horizontal argument of the boxplot() function is FALSE). We can also
present the box plot horizontally. To do so, we take the help of the horizontal
argument of the function. If we assign the horizontal argument as TRUE,
then the box plot will appear horizontally. For illustration, we now create
the same box plot horizontally using the horizontal argument as follows:
#Creating a horizontal boxplot
> boxplot(x=chickwts$weight, horizontal=TRUE, ylab="Weights",
main = "Boxplot of Weights data")

Fig. 5.22: Horizontal box plot of the weight variable

The obtained box plot is shown in Fig. 5.22.


Thus, when only a vector argument is supplied to the boxplot() function, a
single box plot is created. But if the given vector is further categorized
according to a factor variable, then side-by-side box plots can be created by
supplying the formula argument of the function as follows:
Formula: Variable ~ Factor variable
Note that the weight variable is further categorized according to the factor
variable feed. So, we now create side-by-side box plots by writing the formula
as weight~feed and supplying it as the first argument of the boxplot()
function. Observe that the weight variable is a numeric vector which is
divided into groups according to the grouping variable feed. Further, as these
two variables are from the chickwts data, we assign the data argument as
chickwts (otherwise we may write the formula as
chickwts$weight~chickwts$feed). Let us now create side-by-side box
plots of the weight variable according to feed as follows:
#Creating boxplots corresponding to different feeds
> boxplot(weight~feed, #data via formula
+ data=chickwts, #data frame
+ col=2:7, #colors to be filled
+ main = "Side-by-side boxplots of the weights for six different types of feeds")
The obtained box plots are shown in Fig. 5.23.

Fig. 5.23: Side-by-side boxplot of the weights for six different types of feeds
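
To read Fig. 5.23 more precisely, the group-wise statistics can be computed without drawing anything. The sketch below is only an illustration; the object names med and b are assumptions. It uses tapply() (discussed in Unit 8) and the plot=FALSE argument of boxplot().

```r
# Median weight for each feed type (the middle line of each box)
med <- tapply(chickwts$weight, chickwts$feed, median)
sort(med)

# The same grouping without drawing: one column of statistics per feed
b <- boxplot(weight ~ feed, data = chickwts, plot = FALSE)
b$names   # group names, in the order of the boxes
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```

The third row of b$stats holds the medians, so it should agree with the values obtained through tapply().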

SAQ 7
Write an R command to create box plots of all the variables of the USArrests
data frame using different colors. Also give a main title to the plot.

5.9 PIE CHART


A pie chart in R is created using the pie() function available in the
graphics package. The pie() function supports a number of arguments.
The main arguments of interest of this function are as follows:

#The pie() function
pie(x,         #data
    labels,    #names of slices
    clockwise, #placing of slices clockwise or counterclockwise
    radius,    #radius of the pie chart
    col,       #colors to be filled in slices
    main,      #main title
    ...)       #other arguments

The x argument of the function is used to assign the values which are
displayed as the areas of pie slices. The labels argument is used to give
labels to the slices, the clockwise argument is used for placing of slices
either counter clockwise (by default) or clockwise and the radius argument is
used to specify the radius of the pie chart.
Note: By default, the pie chart is drawn in the center of a square box whose
sides range from -1 to 1.
Now we illustrate the method of creating a pie chart of the given arbitrary
expenditure data of a company using the pie() function.
Category                  Expenditure (Rs. in Lakh)
Raw materials             1500
Taxes                     560
Other expenses            490
Depreciation              380
Dividends                 100
Manufacturing expenses    790

To create a pie chart of the given data, we create a vector named x of the
expenditure data, then supply it as an argument to the pie() function as
follows:
#Numeric vector data
> x <- c(1500,560,490,380,100,790)
#Creating a pie chart
> pie(x)

The obtained pie chart is shown in Fig. 5.24.


Fig. 5.24: Pie chart of the expenditure data

From Fig. 5.24, you can observe that default colors are filled in the slices
of the obtained pie chart. Also, the pie chart does not have labels showing
which slice belongs to which category; instead, default numbers corresponding
to the positions of the vector elements appear on the pie chart. Moreover, it
does not have a main title.
So, to make the pie chart more readable, after assigning data to x we set
names of the elements of x, so that they can be used for naming the slices
through the labels argument of the function. We also use the col and main
arguments for more visual clarity as follows:
#Assigning data
> x <- c(1500,560,490,380,100,790)

#Setting names to the vector elements for labeling the slices
> names(x) <- c("Raw materials", "Taxes", "Other expenses",
+ "Depreciation", "Dividends", "Manufacturing expenses")

#Computing the percentages
> percentage <- (x/sum(x))*100

#Creating a percentage pie chart
> pie(percentage, col=1:length(x), labels =
+ paste(names(x), round(percentage,2), "%"),
+ main="Distribution of Expenditure")

Note: The paste() function is discussed in Unit 4 of the MST-015 course.
The obtained detailed pie chart is shown in Fig. 5.25.

Fig. 5.25: Detailed pie chart of the expenditure data
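
The percentage computation used for Fig. 5.25 can be checked by hand. The following sketch repeats it and builds the labels with sprintf() instead of paste(); this is only an alternative shown for illustration, and the object name lab is an assumption. Since pie() works with the proportions of its first argument, supplying x or percentage leads to the same slices.

```r
x <- c(1500, 560, 490, 380, 100, 790)
names(x) <- c("Raw materials", "Taxes", "Other expenses",
              "Depreciation", "Dividends", "Manufacturing expenses")

# Total expenditure and the share of each category in percent
sum(x)
percentage <- x / sum(x) * 100
sum(percentage)          # the shares add up to 100

# Labels with a fixed number of decimals and no space before the % sign
lab <- sprintf("%s (%.2f%%)", names(x), percentage)
lab

pie(x, labels = lab, col = 1:length(x), main = "Distribution of Expenditure")
```

Here %.2f always prints two decimal places, whereas round() drops trailing zeros.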

Next, we illustrate the use of the radius and clockwise arguments of the
pie() function. As discussed earlier, the radius argument is used to specify
the radius of the pie chart, and clockwise is a logical argument with default
value FALSE. It is used for placing the slices counterclockwise or clockwise.
Now we create the same pie chart using these two arguments as follows:

#Creating a pie chart
> pie(percentage, clockwise=TRUE, radius=0.5, edges=100,
+ col=1:length(x), labels = paste(names(x), round(percentage,2),
+ "%"), main="Distribution of Expenditure")

The final detailed pie chart is shown in Fig. 5.26.

Fig. 5.26: Pie chart of the expenditure data

The default value of the radius argument is 0.8. As we have assigned the
radius argument as 0.5, which is smaller than 0.8, the pie chart shown in
Fig. 5.26 is smaller than the pie chart shown in Fig. 5.25. Also, it can be
seen that the slices are now placed clockwise, as we have assigned the
clockwise argument as TRUE.

SAQ 8
The following funds were disbursed during 2010 to 2017 by a leading financial
institution.

Year    Amount (Rs in crore)
2010    1434
2011    1503
2012    1908
2013    2232
2014    3031
2015    4368
2016    5725
2017    6012

Write an R command to create a pie chart of the above data. Also, add
colors, a main title, label names and percentages to the created pie chart.

5.10 STRIP CHART


A strip chart is a simple form of graphical representation of data that
consists of data points plotted as dots on a graph. This chart is specially
used to depict certain data trends or groupings. It is preferred for small data
sets. A strip chart in R is created using the stripchart() function available
in the graphics package. This function supports a number of arguments, but the
main arguments of interest are method, cex, pch and col.
For illustration, consider the following arbitrary vector x.
#Creating a vector x
> x <- rep(1:10,1:10);x
 [1]  1  2  2  3  3  3  4  4  4  4  5  5  5  5  5  6  6  6  6
[20]  6  6  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  9  9
[39]  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10 10 10

Now we create a strip chart of the data given in x using the stripchart()
function. To do so, we assign the first argument of the stripchart() function
as x and the method argument as "stack"; to enlarge the size of the plotted
symbol (a square in this case) we assign the cex argument as 4 in the
following manner:
#Creating a strip chart
> stripchart(x, method="stack", cex=4)

The obtained strip chart is shown in Fig. 5.27.

Fig. 5.27: Strip chart of the data using stack method

Note that, as the method argument was assigned as "stack", overplotting
does not occur here. But if we want overplotting to take place, then the
method should be chosen as "overplot", and a specific plotting character
can be chosen as follows:
#Creating an over plotted strip chart
> stripchart(x, method="overplot", col=2, pch="*", cex=5)

After using the method as "overplot" the obtained strip chart will appear
as shown in Fig. 5.28.

Fig. 5.28: Strip chart of the data using overplot method
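
The difference between Fig. 5.27 and Fig. 5.28 comes from the repeated values in x. The sketch below first tabulates the data, which gives the heights of the stacks in Fig. 5.27, and then tries the third available method, "jitter", which is not used in the examples above. The jitter amount 0.2 and the plotting character are arbitrary choices made for this illustration.

```r
x <- rep(1:10, 1:10)

# Frequency of each value; these counts are the stack heights in Fig. 5.27
table(x)

# With method="jitter" the repeated values are shifted by small random
# amounts, so that they do not hide each other
stripchart(x, method = "jitter", jitter = 0.2, pch = 1)
```

As the shifts are random, the jittered chart looks slightly different each time it is drawn.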

SAQ 9
Write an R command to create a strip chart of the given data by controlling
overplotting and using the rep() function.
10 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8 8 7 7
7 7 7 7 7 6 6 6 6 6 6 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1

5.11 CLOUD PLOT
The cloud plot is a three-dimensional scatter plot. It is created using the
cloud() function available in the lattice package. The formula methods do
most of the work here.
For illustration, we now create a cloud plot of three variables of the
Insurance data frame available in the MASS package. Let us view the first
few rows of the data as follows:
#Viewing first 6 rows of the Insurance data
> require(MASS)
> head(Insurance)
  District  Group   Age Holders Claims
1        1    <1l   <25     197     38
2        1    <1l 25-29     264     35
3        1    <1l 30-35     246     20
4        1    <1l   >35    1680    156
5        1 1-1.5l   <25     284     63
6        1 1-1.5l 25-29     536     84

A cloud plot of the District, Holders and Claims variables of the
Insurance data can be created by keeping the Claims variable on the z-axis
and the District and Holders variables on the x and y axes. To do so, we
supply the following formula as an argument of the function:
Claims~District*Holders
#Creating a cloud plot
> require(lattice) #Loading the package
> cloud(Claims~District*Holders, data=Insurance, pch=4,
+ col="blue")

Fig. 5.30: Cloud plot of the variables of Insurance data
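
Unlike the functions of the graphics package, cloud() returns an object that is drawn only when it is printed. This matters when the call is placed inside a function or a loop, where automatic printing does not happen. The following sketch stores the plot in an object named p (an assumed name) and adds axis labels and a title. The screen argument controls the viewing angles; the values shown are, to the best of our knowledge, the defaults of lattice and can be changed to rotate the plot.

```r
library(MASS)      # provides the Insurance data
library(lattice)   # provides the cloud() function

# Store the plot instead of drawing it immediately
p <- cloud(Claims ~ District * Holders, data = Insurance,
           pch = 4, col = "blue",
           xlab = "District", ylab = "Holders", zlab = "Claims",
           main = "Cloud plot of Insurance data",
           screen = list(z = 40, x = -60))   # viewing angles (assumed defaults)

class(p)   # a lattice ("trellis") object
print(p)   # the plot is drawn at this point
```

At the R prompt, typing the cloud() command directly prints the object automatically, which is why the earlier command produced the plot without print().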

SAQ 10
Write R code to create a cloud plot of the first 10 rows of the randu data
frame. Also, add the following x, y and z axes labels and the main title to the
created plot.
x-axis label: Uniform1
y-axis label: Uniform2
z-axis label: Uniform3
Main title: Cloud Plot

5.12 CONDITIONAL PLOT
Conditional plots in R are created using the coplot() function available in
the graphics package. Using the coplot() function, two variants of the
conditioning plots can be framed. Its first argument, formula, describes the
form of a conditional plot. The two possible ways of assigning the formula
are as follows:
Conditioning on one variable:
A formula of the form y~x|z is used to plot y versus x by conditioning on the z
variable.
Conditioning on two variables:
A formula of the form y~x|z*w is used to plot y versus x by conditioning on z
and w variables.
For the illustration purpose, we now create a conditional plot of the variables of
iris data. To do so, we plot the Sepal.Length variable against the
Petal.Length variable by conditioning on Species variable of the data as
follows:
#Creating a conditional plot
> coplot(Sepal.Length~Petal.Length|Species, data=iris)

Fig. 5.29: Conditional plot of the variables of iris data
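
In Fig. 5.29 the conditioning variable is a factor, so one panel is drawn for each level. When the conditioning variable is numeric, coplot() divides it into overlapping intervals. The sketch below shows these intervals with the co.intervals() function and then draws the corresponding plot. Conditioning on Petal.Width and the values number=4 and overlap=0.5 are assumptions chosen only for this illustration.

```r
# Intervals used when conditioning on a numeric variable:
# each row gives the lower and upper end of one interval
iv <- co.intervals(iris$Petal.Width, number = 4, overlap = 0.5)
iv

# Conditional plot with one panel for each of these intervals
coplot(Sepal.Length ~ Petal.Length | Petal.Width, data = iris,
       number = 4, overlap = 0.5)
```

An observation whose Petal.Width falls in two overlapping intervals appears in both of the corresponding panels.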

SAQ 11
Write R code to construct the conditional plot of the Sepal.Width against
Petal.Width by conditioning on Species for the iris data.

5.13 SUMMARY
The main points discussed in this unit are as follows:
To create line and scatter plots in R.
To create Bar plot, Histogram and Box plot in R.
To create Stem and Leaf plot, Strip chart and Pie chart in R.
To create a conditional plot and a three-dimensional plot in R.
Usage of different arguments of the discussed functions to make graphs
more attractive, readable and comparable.
Methods of saving a created plot.

5.14 TERMINAL QUESTIONS


1. Write R code to create scatter plots between the variables
Petal.Length, Petal.Width of the iris data.
2. Write step-by-step procedure to save a created plot in ‘.jpeg’ format with
name “Figure1” in R.

3. Consider the following data
37 NA 30 49 110 96 NA 23 21 7 21 NA 9 97 13 18 32 NA 50
7 NA 84 79 135 23 97 28 18 28 NA 45 18 22 13 32 12 NA 11
NA 45 28 65 16 115 NA 20 71 29 85 82
Write R code to perform the following tasks:
(i) Remove NA’s from this data and save it under the name Ozone.
(ii) Create a histogram by specifying the colors for filling the bars and for
the borders. Also, add main and x-axis titles.
(iii) Extract the frequency distribution corresponding to the histogram.
4. Write an R command to create a matrix of scatter plots of the USArrests
data after dropping the first ten rows of the data.
5. Which function is used to create a stem and leaf plot of the data? In
which package is it present?
6. Write the output of the following code:
table(rep(1:5,1:5))
7. Write R code to create a conditional plot of the Murder variable against the
Assault variable by conditioning on the UrbanPop and Rape
variables of the USArrests data.

5.15 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. plot(1:20,1:20, pch=1:20, col=1:20, cex=seq(3,0.5,-0.1))

2. pairs(iris)
3. stem(USArrests$UrbanPop, scale=2)
4. The R code is as follows:
Ayear <- c("2016-2017","2017-2018", "2018-2019",
"2019-2020","2020-2021","2021-2022")
MSc <- c(500, 550, 650, 800,720,1000)
BSc <- c(1000, 600, 800, 900,950,1200)
barplot(cbind(MSc, BSc) ~ Ayear,
beside=TRUE,
xlab="Academic Year",
legend.text=c("MSc","BSc"),
args.legend=list(x = "topleft"))
5. The histogram of the wind data can be created using the following code:
hist(airquality$Wind,
col = "lightpink",
border = "white",

breaks=c(1,10,15,21),
main = "Histogram of Wind data",
xlim = range(airquality$Wind),
xlab="Wind",
axes = TRUE,
freq=FALSE,
labels = TRUE)
6. The density curve of the normal distribution with mean 2 and variance 16
can be created using the curve() function in a single command as
follows:
curve(expr=dnorm(x, mean=2, sd=4), lwd=2, from=-30,
to=30)
7. The boxplots of all the variables of the USArrests data can be created
using the following code:
boxplot(USArrests, #data frame
col=2:5, #colours to be filled
main = "Side-by-side boxplots of the variables of USArrests data")
8. To create a pie chart of the given data, we first assign the data to a vector
named Amount, then assign the names to its elements. Thereafter, we
compute the percentages and create a pie chart using the pie() function
as follows:
Amount <- c(1434, 1503, 1908, 2232, 3031, 4368, 5725,
6012)
names(Amount) <- as.character(2010:2017)
percentage <- (Amount/sum(Amount))*100
pie(percentage, col=1:8, labels =
paste(names(Amount),round(percentage,2), "%"),
main="Pie chart of percentage of amount disbursed")
9. The strip chart of the given data can be created using the following code:
stripchart(rep(seq(10,1,-1),times=10:1),
method="stack")
10. The cloud plot can be created using the following command:
cloud(z~x*y, data=randu[1:10,], xlab="Uniform1",
ylab="Uniform2", zlab="Uniform3", pch=11, col="red",
main="Cloud Plot")
11. The conditional plot can be created with the coplot() function using the
following command:
coplot(Sepal.Width~Petal.Width|Species, data=iris)


Terminal Questions (TQs)


1. The scatter plot between the two variables can be created using the
plot() function as follows:
plot(iris$Petal.Length, iris$Petal.Width)
2. See subsection 5.2.3.
3. We first assign the data as follows:
x <- c(37, NA, 30, 49, 110, 96, NA, 23, 21, 7, 21, NA,
9, 97, 13, 18, 32, NA, 50, 7, NA, 84, 79, 135, 23, 97,
28, 18, 28, NA, 45, 18, 22, 13, 32, 12, NA, 11, NA,
45, 28, 65, 16, 115, NA, 20, 71, 29, 85, 82)
(i) Then the NA’s can be removed and the result saved under the name
Ozone as follows:
Ozone <- na.omit(x)
(ii) After removing the NA’s, we create the histogram of the data using the
following command:
hist(Ozone, #data
col = "red", #color to be filled in bars
border = "black", #border color of the bars
main = "Histogram of Ozone data", #main title
xlab = "Ozone", #x-axis label
axes = TRUE, #to display x and y axes
labels = TRUE) #to display frequencies
(iii) The frequency distribution corresponding to the histogram can be
extracted either in step (ii) or by giving a separate command as
follows:
hist(Ozone,plot=FALSE)[c(1,2,4)]
4. The matrix of scatter plots can be obtained using the pairs() function as
follows:
pairs(USArrests[-c(1:10),])
5. The stem() function is used, which is available in the graphics package.
6. The output will consist of the following table:
1 2 3 4 5
1 2 3 4 5
7. The conditional plot can be created using the following command:
coplot(Murder~Assault|UrbanPop*Rape, data=USArrests)

MST-015
Introduction to R Software
Indira Gandhi National Open University
School of Sciences

Block

2
FUNCTIONS, CONDITIONAL STATEMENTS, LOOPS AND
DESCRIPTIVE STATISTICS WITH R
UNIT 6
Functions in R 183
UNIT 7
Control-Flow Constructs of R 203
UNIT 8
Apply Family in R 231
UNIT 9
Descriptive Statistics and Correlation with R 255

Course Preparation Team


Course Editor
Prof. Anoop Chaturvedi (Units 1-9)
Retired from Department of Statistics,
University of Allahabad
Prayagraj, Uttar Pradesh

Course Writer
Dr. Taruna Kumari (Units 1-9)
School of Sciences, Indira Gandhi National Open
University, New Delhi, Delhi

Formatted and CRC Prepared by Dr. Taruna Kumari and Ms Preeti, SOS, IGNOU
Course Coordinator: Dr. Taruna Kumari
Programme Coordinators: Dr. Neha Garg and Dr. Prabhat Kumar Sangal

Print Production
Mr. Rajiv Girdhar, Assistant Registrar, MPDD, IGNOU, New Delhi
Mr. Hemant Parida, Section Officer, MPDD, IGNOU, New Delhi

Acknowledgement: From the depth of my heart I render my gratitude to my family, especially my father Mr. Puran
Chand, my mother Mrs. Raj Rani, my husband Mr. Anupam Pathak and my son Prithu for providing me the necessary
comfort to overcome the ups and downs during the development of this material. Also, I extend my thanks to my
former graduate and post graduate students for their feedback and questions, which enabled me to get into
detailed explanation.
April, 2023
© Indira Gandhi National Open University, 2023
ISBN-978-81-266-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means,
without permission in writing from the Indira Gandhi National Open University
Further information on the Indira Gandhi National Open University may be obtained from the University’s Office at
Maidan Garhi, New Delhi-110068 or visit University’s website http://www.ignou.ac.in
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by the Director, School
of Sciences.
INTRODUCTION TO R SOFTWARE
R is a high-level language whose popularity is increasing day by day. It can also
be referred to as an environment specially used for statistical analysis of data and for graphics.
You may be astonished to know that the R language has been around since 1993.
The R language is a dialect of the S language.¹ The S language was developed at Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. The evolution of the S
language is described by the four books of John Chambers and coauthors.² For John
Chambers’ efforts, the Association for Computing Machinery (ACM) awarded him its
Software System Award, which mentioned that this language “forever altered how people
analyze, visualize and manipulate data”. R was written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland in New Zealand.
There are several reasons for the popularity of R. We state some of them here:
R is an interpreted language, which is free.
It is an outstanding and magnificent software, which is easy to use as well.
It works on Windows, Unix, Mac and Linux.
A number of statistical packages are available for handling statistical data analysis.
It comes with several data sets.
Quality support and back-up are available (via web pages, R documents and books) on
functions and packages.
It is widely accepted by many researchers, industrialists and professors for data
analysis.
The main reason for the impressive growth in the popularity of the R language nowadays is the
emergence of data science as a career, because data is everywhere and experts are needed
to sort and analyze that data. So, together with the knowledge of computing, knowledge of
statistical methods and machine learning is also required.
This course is mainly written for learners who are beginners in the R computing software.
Throughout the development of this course, emphasis is given to the packages which
come with the base distribution (i.e., precompiled binary distributions of the base system)
during installation. It is essential for the learners to understand the basics of R before
switching to more complicated problems, such as those discussed in the lab courses, i.e.,
MSTL-011: Statistical Computing Using R-I, MSTL-012: Statistical Computing Using R-II,
MSTL-013: Statistical Computing Using R-III and MSTL-015: Statistical Computing Using R-V. The
content of this course is organized into 9 self-explanatory units. The first five units are part of
Block 1 (Fundamentals of R Language) and the next 4 units are part of Block 2
(Functions, Conditional Statements, Loops and Descriptive Statistics with R). These units
can be summarized as follows:
Unit 1 (Introduction to R): It comprises the installation procedure, methods of seeking help
and details on the basic terminologies of R.
Unit 2 (Nitty-Gritty of R): The second unit discusses R objects such as different types
of vectors, matrices, factors and arrays. It also throws light on missing values and on arithmetic and
logical operations.
Unit 3 (Membership Testing, Coercion and Lists in R): As is clear from the name, this unit
discusses membership testing and coercion of R objects. Additionally, list objects are also
discussed in this unit.
Unit 4 (Data Frames, Reading and Writing in R): This unit gives extensive details on data
frame objects, methods of reading and writing from/to a file and formatting commands.
1 Refer to the “An Introduction to R” manual by R Core Team.
2 Refer to the “R Language Definition” manual by R Core Team.
Unit 5 (Graphical Representation of Data with R): Different graphical functions that
are used to create plots such as Scatterplot, Boxplot, Histogram, Barplot, Stripchart, Stem and
Leaf plot, Pie chart, pairs plot, coplot, cloud plot, etc. are discussed in this unit.
Unit 6 (Functions in R): The method of creating your own function is discussed in this unit by
taking some suitable examples.
Unit 7 (Control-Flow Constructs of R): Control-flow constructs such as conditional
statements, different types of loops and the method of putting additional control on the loops
using the next and break statements are discussed in this unit with examples.
Unit 8 (Apply Family in R): This unit comprises details on the usage and importance of the
apply family functions.
Unit 9 (Descriptive Statistics and Correlation with R): Unit 9 comprises details on
measures of central tendency and dispersion together with examples on correlation
computations with R.
To develop this course, we have used the Windows operating system, and the R commands
written in this course are run on R version 4.1.1. In a Windows system, we interact with R
through the R console. Furthermore, the written commands can be easily saved. More details
on this are given in Unit 1 of this course.
In this course, the written codes, associated outputs and names of the functions, R objects,
packages and operators are written in ‘Lucida Console’ font type and theory is written in ‘Arial’
font type. Additionally, the R commands are written in bold and the associated outputs are
not bold. Note that the lines starting with ‘ # ’ written before the R commands are
comments, which are not executed and are written to give a clear understanding of the code. Furthermore,
while studying this course, do all the illustrations on a computer, preferably by writing the
commands in R script files (in an integrated editor) available on the R Graphical User Interface
(RGui). Then do all the SAQs and TQs without using computers.
It is important to note that, if you use any R function in your research/publications for data
analysis purpose, then you should cite that package in your written work. Say, for example, to
cite the used package base, firstly get the citation details using the citation() function
and then use the obtained reference for citation purpose.
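
The citation details mentioned above can be obtained as in the following minimal sketch. The object name ref is an assumption made for this illustration, and the lattice package is used only as an example of a package other than base; it is assumed to be installed. The exact text printed depends on the installed version of R.

```r
# Citation details of R itself (this is what is cited for the base package)
ref <- citation()
ref

# Citation details of another package, for example lattice
citation("lattice")

# The same reference in BibTeX form, convenient for reference managers
toBibtex(ref)
```

The printed output contains the authors, year, title and URL, which can be copied into the reference list of your written work.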

In case the citation details are not accessible (or available) via the citation() function at the
prompt, then learners may visit the CRAN (Comprehensive R Archive Network) page to get
the details of the contributors (such as authors’ names, year and title) for citation purpose.
Lastly, in this introduction page I would like to express my deepest gratitude and thanks to
the R core team, Bill Venables, David M. Smith, John Chambers, Robert Gentleman, Ross
Ihaka, Martin Maechler and other contributors for providing access to enormous R sources
and for their substantial contribution to the R language, which has extremely benefited the world.
The MST-015 (Introduction to R Software) is a 2 credit self-explanatory course, which is
developed for self-study. But still, if you want to refer to additional books or references on
the discussed topics, you may refer to the following books and references.
Suggested Further Reading
1. Braun, W. J. & Murdoch, D. J. (2007). A First Course in Statistical Programming with R.
Cambridge.
2. Crawley, M. J. (2012). The R book. John Wiley & Sons.
3. Albert, J. & Rizzo, M. (2012). R by Example. Springer
4. Teetor, P. (2011). R Cookbook. O’REILLY.
5. Lafaye de Micheaux, P., Drouilhet, R., & Liquet, B. (2013). The R software:
Fundamentals of programming and statistical analysis. Springer.
6. Zuur, A., Ieno, E. N., & Meesters, E. (2009). A Beginner's Guide to R. Springer Science
& Business Media.
7. Heumann, C., Schomaker, M. & Shalabh (2016). Introduction to statistics and data
analysis: With Exercises, Solutions and Applications in R. Springer International
Publishing Switzerland.
8. Dalgaard, P. (2002). Introductory Statistics with R. New York: Springer-Verlag.
References
The packages used for the development of this course material can be referred from the
following references:
1. R Core Team (2021). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
2. Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition.
Springer, New York. ISBN 0-387-95457-0.
3. Mirai Solutions GmbH (2023). XLConnect: Excel Connector for R. R package version
1.0.7. https://CRAN.R-project.org/package=XLConnect
4. Sarkar, Deepayan (2008). Lattice: Multivariate Data Visualization with R. Springer, New
York. ISBN 978-0-387-75968-5.
5. Lukasz Komsta and Frederick Novomestky (2022). moments: Moments, Cumulants,
Skewness, Kurtosis and Related Tests. R package version 0.14.1.
https://CRAN.R-project.org/package=moments
Expected Learning Outcomes
After completing this course, you should be able to:
install R, seek help on functions and data sets, create R scripts and learn some basic
aspects of R;
create R objects, know the different data types and use membership
testing and coercion functions;
read and write from/to a file;
do graphic representation of data with R;
do looping, create control statements and functions in R; and
compute descriptive statistics and correlation with R.

Feedback Link: https://forms.gle/SZZ23dxBEDJJdEGt9

Course Preparation Team




FUNCTIONS, CONDITIONAL STATEMENTS, LOOPS
AND DESCRIPTIVE STATISTICS WITH R
In the beginning of this block, we shall discuss the creation of functions in R. Then we
discuss the if statements and loops in R, such as for, while and repeat. To make
you familiar with the apply family functions, one more unit is added to this block, so that a
clear comparison between the loops and the apply family functions can be made. In the last
unit of this block, we shall make you familiar with the computation of descriptive
statistics and correlation with R, using functions and using formulae.

Block 2 consists of four units, namely, Units 6, 7, 8 and 9. We strongly recommend the
learners to study Block 1 of the MST-015 (Introduction to R Software) course before
studying Block 2.

Unit 6: After learning from the units of Block 1, you may feel that it is time to write your own
function. A function is mainly required when you want to run a piece of code (some statements
together) for different inputs. Functions save you from retyping the same
code again and again. Some suitable examples from the MSTL-011 (Statistical Computing
Using R-I) lab course are discussed in this unit.

Unit 7: This unit gives a detailed account of the control-flow constructs of R, such as the
different types of if statements, the for loop, the repeat loop and the while loop. To get
additional control over conditional statements and loops, the next and break statements are
also discussed in this unit with examples.

Unit 8: It is not always necessary to write a loop. You may come across situations in which you
can condense your entire loop code into a single command with the help of the apply family
of functions. So, Unit 8 throws light on the lapply(), sapply(),
apply(), tapply() and mapply() functions.

Unit 9: Unit 9 is the last unit of this MST-015 (Introduction to R Software) course. Its
objective is not to discuss the fundamentals or basics of R, but to help you start with
statistical computing in R using descriptive statistics. It consists of details on the computation of
measures of central tendency and dispersion along with correlation coefficients.

This material has been developed for self-study. We hope you will enjoy studying this block.

Expected Learning Outcomes


After completing this block, you should be able to:

create your own function to do a particular task;

use built-in functions and build on them by using their outputs in your own functions;

learn the difference between user-defined and built-in functions;

create loops in R;

create conditional statements in R;

use apply family functions as an alternative to R loops;

use control-flow constructs of R;


compute measures of central tendency using R;

compute measures of variability in R; and

compute the Pearson’s and Spearman’s correlation coefficients using built-in functions
and without using built-in functions.



UNIT 6
FUNCTIONS IN R
Structure

6.1 Introduction
    Expected Learning Outcomes
6.2 User-Defined Functions
6.3 Built-in Functions
6.4 Return Statement
6.5 Function Call
6.6 Actual and Formal Arguments
6.7 Argument Matching
6.8 Recursion
6.9 Environments and Scope
6.10 Summary
6.11 Terminal Questions
6.12 Solutions/Answers

6.1 INTRODUCTION
In any programming language, a function is a self-contained piece of code
(with or without a name) that carries out some specific, well-defined task.
If a name is not given to a function, then it is called an anonymous
function. A function contains some executable statements, which are
written to accomplish a particular task, for example, to display a
message, to compute some expression, to compute the coefficient of variation
and so forth.
In this unit, we shall discuss two categories of functions, namely built-in
functions (which come with packages) and user-defined functions (which
are created by users). Throughout the MST-015 (Introduction to R Software)
and MSTL-011 (Statistical Computing Using R-I) courses, we have used
built-in and user-defined functions. So, it is important for you to know the
difference between the two. The main difference between these two
categories is that built-in functions need not be
written by the user, as they come as part of some package and are already
written to accomplish a predefined task, whereas user-defined functions
are developed by the user to accomplish an intended task. Also, user-defined
functions can be modified according to the requirements of the
user. In any user-defined function, we can use any already defined
function (either built-in or user-defined) at the time of writing the code.
There are several advantages of using functions, some of which are as
follows:

Functions help the user avoid repetitive programming of
the same instructions or executable statements, which reduces typing
errors.

If any modification is required in the computation of the formula or
task for which the user-defined function is created, then a
change in the user-defined function alone will serve the purpose.

Functions allow the user to break down a large program into a
number of smaller self-contained components, such that each
component has some unique, identifiable purpose.

Use of functions enhances the logical clarity that results from the
decomposition of a user-defined program into several concise
functions.

The length of the code written by the user can be effectively reduced
by using functions appropriately.

This unit comprises the general definition or syntax of a function, a discussion
of the parts of a function, its advantages and the creation of functions.
Here, functions are explained by taking some suitable examples.

*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes
After completing this unit, you should be able to:
distinguish between the built-in and user-defined functions;
learn the advantages of the user-defined functions;
differentiate between actual and formal arguments;
learn the concept of argument matching; and
learn to create user-defined functions.

6.2 USER-DEFINED FUNCTIONS
In Block 1 of the MST-015 (Introduction to R Software) course, we have
discussed a number of objects of R programming, like vectors, matrices,
arrays, lists, data frames, expressions and null objects. In this unit, we will
cover one more kind of object in R programming, namely function objects. It is
surprising but true that functions are R objects, which also have a class and
a type like other R objects. A user-defined function can be defined
anywhere in the code. Before using any user-defined function, we need to
make sure that it has been defined in advance. Function objects, either built-in or
user-defined, have three components which are:
1. A formal list of arguments.
2. Body of the function.
3. An environment.
A user-defined function can be known or anonymous. When a function
has a name, we call it a known function; if it does not have a name, then we
call it anonymous. An anonymous function in R is created using the
following syntax.
#Syntax for writing an anonymous function
function (arglist) body

In this syntax, the first component function is a keyword, which
indicates to R that you are creating or defining a function. The arglist
written in the parentheses depicts a list of arguments (a created function
may or may not have function arguments) and the body of the function
consists of syntactically correct executable R statements in braces
‘ { } ’. We shall discuss each one of them in detail next. Note that an
anonymous function is generally created to supply it as an argument to
another function.
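For illustration purposes, the following minimal sketch shows an anonymous function being supplied as an argument to another function. The sapply() function used here is covered in Unit 8, and the object names squares and result are assumptions made only for this example.

```r
# An anonymous function passed directly to sapply(); it is never given a name
squares <- sapply(1:5, function(v) v^2)
print(squares)   # 1 4 9 16 25

# An anonymous function can also be called immediately by wrapping
# its definition in parentheses and supplying the arguments
result <- (function(a, b) a + b)(2, 3)
print(result)    # 5
```

In both cases the function exists only for the duration of the call, which is why no name is needed.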
Next, we discuss the syntax for the creation of a known function. Recall
that a function will be called a known function, if a name is assigned to it.
So, let us discuss a more detailed syntax of creating a function as follows:
A function named funName can be created in R using the following
syntax.
#Syntax of writing a known function
funName <- function (arglist)
{
Executable statement(s)
}
Starting from left to right, the general syntax consists of the name of the
function, which is funName (it can be any valid variable name). Then comes an
assignment operator ( <- ), which is used to assign the function to
funName. After that, the keyword function is written, which indicates the
beginning of the function definition. Thereafter, in parentheses ‘ ( ) ’, arglist is
written, which consists of a list of function arguments that will be used
in the function. Lastly, instead of writing the single word body (which means
body of the function), we have specifically shown the body of the function
in braces, which consists of some executable statements. For
illustration purposes, let us create a user-defined function named display
to display the message "Functions in R" as follows:
#Creating a function to display a message
> display <- function() "Functions in R"

Note that a name is given to the function in such a manner that it
describes the purpose of creating it. In addition to this, the empty parentheses
after the keyword function show that the function does not have any
arguments; thus arglist is empty. The function body consists of only
a single statement, which is why the braces are skipped (we can skip the
braces if the function body consists of only a single executable statement).
Next, we check the class and type of the user-defined function display
as follows:
#Checking class and type of user-defined function
> class(display)
[1] "function"
> typeof(display)
[1] "closure"
Hence, the class of a user-defined function is "function" and its type is
"closure". The word closure has its own importance, which will be
discussed along with the function environment.
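Since functions are R objects, they can be handled like any other object. The following sketch, which assumes a small function named greet (a hypothetical name chosen only for this example), shows a function being copied to another name, stored in a list and passed to another function.

```r
# Hypothetical function used only for this illustration
greet <- function() "Functions in R"

# A function can be assigned to another name
showMsg <- greet

# A function can be stored in a list together with other functions
funs <- list(first = greet, second = toupper)

# A function can be passed as an argument to another function
callIt <- function(f) f()

print(is.function(showMsg))        # TRUE
print(funs$second(funs$first()))   # "FUNCTIONS IN R"
print(callIt(greet))               # "Functions in R"
```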
Before proceeding further, it is important to discuss in more detail the
function name, the function keyword, arglist and the body of the
function.
Function name is a suitable valid name for the function (like the names we give
to R objects or variables). A name is generally given to a function when the
function is to be used more than once. The function name is framed in such a way
that it indicates the purpose of creating the function.

function is a keyword.

The arglist is a comma-separated list of function arguments. These
arguments are known as formal arguments. The arglist can consist
of any number of function arguments; each one is a symbol, optionally with
a default value, and the values supplied to them can be vectors, data frames
and other R objects. The dot-dot-dot ( ... ) object of R can also be passed as
an argument via arglist to the created function. The dot-dot-dot argument can take
any number of supplied arguments. It helps us to create a function
which can take an arbitrary number of arguments. If this argument is
not used carefully, it can absorb necessary arguments as well.

The body of the function consists of executable statements, which are
written in braces ‘ { } ’ and are evaluated sequentially. The body of the
function can also consist of a single statement, a constant etc. The function
body may also contain a return statement (optional).

Note that the body and the list of formal arguments of a function (user-
defined or built-in) can be extracted using the body() and formals()
functions (available in the base package) as follows:
#Extracting body of the display() function
> body(display)
[1] "Functions in R"
#Extracting formal arguments of display() function
> formals(display)
NULL

Hence, the output verifies that the body of the display function consists of
only one statement and that the display() function does not have any formal
arguments. The third basic component of a function, i.e., the
function environment, will be discussed at the end of this unit.
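The display() function has no arguments, so formals() returns NULL for it. As a further sketch, the following code applies the same two functions to a function that does have arguments. The function rectArea and its arguments len and wid are hypothetical names introduced only for this example.

```r
# Hypothetical function with one mandatory argument and one default argument
rectArea <- function(len, wid = 1) {
  len * wid
}

# Names of the formal arguments
argNames <- names(formals(rectArea))
print(argNames)                # "len" "wid"

# Default value stored for the wid argument
print(formals(rectArea)$wid)   # 1

# Body of the function (an unevaluated R expression)
print(body(rectArea))

# Because wid has a default value, only len needs to be supplied
print(rectArea(3))             # 3
```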

SAQ 1
Consider the following names for user-defined functions. State which
function names are appropriate and which are inappropriate, with
reasons:
(i) function
(ii) mean
(iii) sum of squares
(iv) Est$x
(v) Varx

6.3 BUILT-IN FUNCTIONS


Built-in functions come with packages. For the packages of the base distribution,
we do not need to install anything to use a function available
in them, but to use a function of a package which is not available
in your R library, the package should be installed first using the
install.packages() function. Thereafter, before using the
function, the package has to be loaded using the library() or
require() function.
The purposes of the built-in functions are already defined. Also, the output
of a built-in function comes in a specific format, which may be a vector, list, data
frame or any other format. The source of a built-in function can be
viewed by simply typing the name of the function. For illustration
purposes, let us view the source of the matrix() function as follows:
#Viewing source of the matrix() function
> matrix
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
{
if (is.object(data) || !is.atomic(data))
data <- as.vector(data)
.Internal(matrix(data, nrow, ncol, byrow, dimnames,
missing(nrow),
missing(ncol)))
}
<bytecode: 0x000000001501b560>
<environment: namespace:base>

Before using any built-in function, it is advisable to view its help page and
see the purpose of the function, its arguments, the arguments with default values, the
examples quoted at the end of the help page and other important details.
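Besides the help page, the argument list of a built-in function can be inspected from the console. The following is a minimal sketch using matrix(), whose source was shown above; the object names matArgs and m are assumptions made only for this example.

```r
# Names of the formal arguments of the built-in matrix() function
matArgs <- names(formals(matrix))
print(matArgs)   # "data" "nrow" "ncol" "byrow" "dimnames"

# args() displays the argument list together with the default values
print(args(matrix))

# The help page and its examples can be opened with (not run here):
# help(matrix)
# example(matrix)

# Calling matrix() with tagged arguments taken from its argument list
m <- matrix(1:6, nrow = 2, byrow = TRUE)
print(m)
```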

6.4 RETURN STATEMENT
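A user-defined function may contain a return statement, which is optional.
Although return() may appear at several places in a function body, only the
first one reached during evaluation is executed, and the function exits there.
The syntax for the return statement is as follows:
#Return statement
return(expression)
Note that return is itself a function in R, so the parentheses around
expression are required.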

If the function does not have a return statement, then by default the value of the
last expression evaluated in the function body is returned. For illustration
purposes, we now create a function which computes the following
expression.
$${}^{n}C_{x}\, p^{x} q^{n-x}, \quad \text{where } q = 1 - p \qquad \ldots(6.1)$$

The first question which may come to your mind would be “which should
be the function arguments?”. You can observe that to evaluate the given
expression, we should have the values of x, n and p. So, while creating a
user-defined function, we use them as function arguments: (6.1) can be
computed only when these values are supplied to the function.
We now create two functions named Ex1 and Ex2, one
with a return statement and another without a return statement, as follows:
#Creating a function with a return statement
> Ex1 <- function(x, n, p){
+ y <- choose(n,x)*p^x*(1-p)^(n-x)
+ return(y)
+ }

It can be seen from the user-defined function that Ex1 is the name of the
function and x, n and p are its arguments. The function body consists of
two statements, of which the first computes the given
expression and assigns it to y, and the second is the return
statement, which returns the value of y to the function call (discussed in the next
section). The same task could have been done without a return statement
as follows:
#Creating a function without a return statement
> Ex2 <- function(x, n, p){
+ choose(n,x)*p^x*(1-p)^(n-x)
+ }

The Ex2 function is the second user-defined function with arguments x, n
and p. The body of the function consists of only one statement and
there is no return statement. So by default, the function will return the
value of the last evaluated statement, choose(n,x)*p^x*(1-p)^(n-x).
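A return statement is most useful when a function has to stop early. The following sketch assumes a variant of Ex1, named Ex1Safe only for this example, which returns 0 immediately when x lies outside the range 0 to n and otherwise falls through to the last expression.

```r
# Hypothetical variant of Ex1 with an early return
Ex1Safe <- function(x, n, p) {
  if (x < 0 || x > n) {
    return(0)   # the function exits here; the line below is not evaluated
  }
  # Value of the last evaluated expression is returned by default
  choose(n, x) * p^x * (1 - p)^(n - x)
}

print(Ex1Safe(2, 10, 0.5))    # same value as Ex1(2, 10, 0.5)
print(Ex1Safe(11, 10, 0.5))   # 0
```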

Note: (i) Whenever we create a user-defined function, the question arises
which arguments should be used as function arguments. For that
purpose, it is always better to find out which arguments will be needed for
the computation of the required task. The function arguments are chosen
in such a way that the function can be evaluated for different values of
the arguments via function calls to get the required results.
(ii) The function calls corresponding to the Ex1 and Ex2 functions are
discussed in the next section.

6.5 FUNCTION CALL


Functions (either user-defined or built-in) in R are called using the name
of the function followed by a list of arguments (referred to as actual arguments)
separated by commas in parentheses. Recall that in the beginning of this
unit we created a function named display, which did not have
any function argument. So, we shall call it in the following manner:
#The display function
> display <- function() "Functions in R"
#Calling the display function
> display()
[1] "Functions in R"
Note: Whenever a function is created, it should be called in a proper
manner. By proper manner we mean that it is important to take care of
argument matching while calling the function. The details of argument
matching in functions are discussed in Section 6.7 of this unit. In the case of
built-in functions, the help page of the function should be consulted before
calling the function, and arguments should be supplied to the function
carefully.

Recall that in the last section, we created two functions, namely Ex1
and Ex2, for computing the expression given in (6.1). Both functions
have three arguments. Now, we create function calls
corresponding to these two functions as follows:
#Calling Ex1 function
> Ex1(2, 10, 0.5)
[1] 0.04394531
#Calling Ex2 function
> Ex2(2, 10, 0.5)
[1] 0.04394531

Observe that in the function calls we have not used tags (argument
names); the actual values 2, 10 and 0.5 are assigned to the formal
arguments x, n and p by positional matching of the arguments (refer to
Section 6.6 for formal and actual arguments of functions). Due to
positional matching, the first value 2 in the function call
is supplied to the first formal argument in the function definition, which is
x. Similarly, the other values are supplied according to the positions of
the arguments.
Note: The function call for Ex1 with tags will be like Ex1(x=2, n=10,
p=0.5). Also, Ex1 can be computed for different values of the x, n and p
arguments.
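As a sketch of evaluating Ex1 for different values of its arguments, the following code supplies a vector to x. It relies on the fact that choose() and the arithmetic operators of R work element-wise on vectors; the object name probs is an assumption made only for this example.

```r
# Ex1 as defined in Section 6.4
Ex1 <- function(x, n, p){
  y <- choose(n, x) * p^x * (1 - p)^(n - x)
  return(y)
}

# Evaluating expression (6.1) for x = 0, 1, ..., 10 in a single call
probs <- Ex1(x = 0:10, n = 10, p = 0.5)
print(round(probs, 4))

# The values of (6.1) over all possible x add up to one
print(sum(probs))
```

A single function definition therefore serves every value of x, n and p, without retyping the expression.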
Now, we update the user-defined function Ex2 by adding more statements
to it, to check whether any argument of the function is missing or not using
the missing() function (available in the base package). We give the name
Ex2MArg to the updated function as follows:
#Checking for missing arguments
> Ex2MArg <- function(x, n, p){
+ cat("Is the value of x missing?", missing(x), "\n")
+ cat("Is the value of n missing?", missing(n), "\n")
+ cat("Is the value of p missing?", missing(p), "\n")
+ #choose(n,x)*p^x*(1-p)^(n-x)
+ }

The Ex2MArg function consists of three statements, which test for
missing arguments. The missing() function returns TRUE if the value of its argument
is missing in the evaluation frame of the function and FALSE if the value is
available. Moreover, as we are not interested in evaluating the expression
(6.1) here, we have prefixed it with ‘ # ’ (so that it is treated as a
comment and does not get evaluated).
#Creating a function call
> Ex2MArg(n=10, p=0.5)
Is the value of x missing? TRUE
Is the value of n missing? FALSE
Is the value of p missing? FALSE

From the function call Ex2MArg(n=10, p=0.5) it is clear that, when exact
matching on tags (refer to Section 6.7 for argument matching) is conducted,
the value of x will be missing, while n and p will be available.
The output confirms this: the value of the x argument is missing, as TRUE is
returned by the missing() function, and the arguments n
and p are not missing, as FALSE is returned for these two
arguments.
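The missing() function is often used to give a fallback value to an argument that was not supplied. The following is a minimal sketch under that assumption; the function name Ex2Default and the fallback value 0.5 are chosen only for this example.

```r
# Hypothetical variant of Ex2: p falls back to 0.5 when it is not supplied
Ex2Default <- function(x, n, p) {
  if (missing(p)) {
    p <- 0.5
  }
  choose(n, x) * p^x * (1 - p)^(n - x)
}

# p is missing, so 0.5 is used
print(Ex2Default(2, 10))

# p is supplied, so the supplied value is used
print(Ex2Default(3, 5, 0.8))
```

The same effect can usually be obtained more simply by writing a default value in the arglist, for example function(x, n, p = 0.5).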
Note that, the earlier defined two functions Ex1 and Ex2 are evaluating
the probability mass function (pmf) of the binomial distribution, which can
be computed using the built-in function dbinom() as well (refer Session
3 of MSTL-011 (Statistical Computing Using R-I) course for more detail).
The same can be verified by passing its arguments as x, size and prob
as follows:
#Computing the pmf of binomial distribution
> dbinom(x=2, size=10, prob=0.5)
[1] 0.04394531

Ex1 and Ex2 are user-defined functions and dbinom() is a
built-in function available in the stats package. To call the dbinom()
function, we have used three of its arguments. Hence, using all three
approaches, we get the same result, 0.04394531.
Note: To make a function call for a built-in function, it is better to always
consult the help page of the function.

SAQ 2
Consider the following code:
x <- list(runif(5), runif(10), runif(15));x
y <- (x[[1]]-mean(x[[1]]))/sd(x[[1]]);y
z <- (x[[2]]-mean(x[[2]]))/sd(x[[2]]);z
w <- (x[[3]]-mean(x[[3]]))/sd(x[[3]]);w
Create a function to do the same task and call it accordingly.

6.6 ACTUAL AND FORMAL ARGUMENTS


While defining the syntax of a function, we have already discussed a little
about formal arguments. After studying the previous sections, you must now
be clear about how a function is created and called. You must have
noted that the function definition contains arglist, which is a comma-
separated list of arguments or variables whose values are to be supplied
at the time the function is called. So, arglist in the function definition syntax
consists of the comma-separated formal arguments of the function. For
illustration purposes, we again consider the user-defined function Ex2 as
follows:

#Function definition of Ex2
> Ex2 <- function(x, n, p){
+ choose(n,x)*p^x*(1-p)^(n-x)
+ }

The arglist of the Ex2 function definition consists of three arguments,
which are x, n and p. So, x, n and p are the formal arguments of the
function.
To use the created function Ex2, a function call needs to be made. As
discussed earlier, the function call consists of the name of the function
with a list of values of the arguments, which are to be supplied to the
function definition. Since the function call consists of the actual values
(data) which are supplied to the function definition, these
arguments are known as actual arguments or supplied arguments. For
illustration purposes, let us call the Ex2 function with x=3, n=5 and p=0.8 as
follows:
#Function call of Ex2
> Ex2(x=3, n=5, p=0.8)
[1] 0.2048

So, 3, 5 and 0.8 are the actual arguments. Also, the value 3 is supplied to
x, 5 to n and 0.8 to p. This process of calling a function by supplying
actual arguments is known as call-by-value.
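One visible consequence of call-by-value is that changing a formal argument inside a function does not change the object that was supplied in the call. The following sketch illustrates this; the function addOne and the objects a and b are hypothetical names used only for this example.

```r
# Hypothetical function that modifies its formal argument
addOne <- function(v) {
  v <- v + 1   # changes only the copy inside the function
  v
}

a <- 5
b <- addOne(a)

print(a)   # 5, the supplied object is unchanged
print(b)   # 6, the value returned by the function
```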

SAQ 3
Consider the following code and write the formal and actual arguments:
Line <- function(ch, n)
{
for(i in 1:n) cat(ch)
cat("\n ")
}
Line("*", 50)

6.7 ARGUMENT MATCHING


When a function is called, the formal arguments
are matched with the actual arguments. In R programming, the matching
of the formal arguments can be exact matching on tags (names), partial
matching on tags and positional matching of the arguments. We discuss
each one of them with examples one by one.
First, we discuss exact matching on tags and positional matching. By
exact matching, we mean that each tagged actual argument is supplied
to the formal argument with the same tag. So, corresponding to each tagged
actual argument, there should be exactly one formal argument with the same tag.
The same can be understood from the following examples.
In the first example, we create a function named ArgMat1 to compute the
product of any three numbers, say x, y and z. We supply their actual
values from the function call using tags and show how exact matching on
tags takes place.

#Creating a function
> ArgMat1 <- function(x, y, z){
+ x <- x+1; y <- y+1; z <- z+1
+ x*y*z
+ }

The ArgMat1 function has three arguments named (tagged) x, y
and z. In the function body, each of these arguments is incremented by
one and then their product is returned. Now, we create a function call of it
by using the exact tags for the actual arguments as follows:
#Function call of ArgMat1 with exact matching on tags
> ArgMat1(x=2, y=1, z=4) #tags on same positions
[1] 30
> ArgMat1(z=4, x=2, y=1) #tags on different positions
[1] 30

Hence, the output shows that the tags on the actual and formal arguments
are matched when the function is called and the function is evaluated.
Also, it can be observed that if tags are used, then the position of the
arguments in the function call does not matter.
Next, we call the ArgMat1 function without using tags. In this case,
positional matching takes place: the value appearing at the first place in
the function call is supplied to the first formal argument. Similarly,
the values appearing at the second
and third places in the function call are supplied to the second and third
formal arguments in the function definition.
#Function call of ArgMat1 with positional matching arguments
> ArgMat1(2, 1, 4)
[1] 30

Also, note that if a formal argument that has no default value and is used in the
function body is left unmatched, you will get an
error message. The same can be verified from the following function call:
#An argument left unmatched leads to an error
> ArgMat1(1, 4)
Error in ArgMat1(1, 4) : argument "z" is missing, with no
default

Next, we illustrate what happens if the same formal argument is matched by
several actual arguments in a function call as follows:
#Function call with two actual values with same tags
> ArgMat1(z=4, z=3, x=2, y=1)
Error in ArgMat1(z = 4, z = 3, x = 2, y = 1) :
formal argument "z" matched by multiple actual arguments

Since there are two supplied values for the z argument, we
get an error message. Hence, an error occurs whenever more than one actual
argument is supplied for the same formal argument.
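Tagged and untagged arguments can also be mixed in one call. R matches the tagged arguments first and then assigns the untagged values to the remaining formal arguments in order of position. The sketch below assumes a helper function showArgs (a hypothetical name) that simply reports which value each formal argument received.

```r
# Hypothetical helper that returns the values received by x, y and z
showArgs <- function(x, y, z) {
  c(x = x, y = y, z = z)
}

# z is matched by its tag; 2 and 1 are then assigned to x and y by position
r1 <- showArgs(z = 4, 2, 1)
print(r1)

# The position of the tagged argument in the call does not matter
r2 <- showArgs(2, z = 4, 1)
print(r2)
```

Both calls give x = 2, y = 1 and z = 4.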
In the next example, we increase the number of formal arguments of
ArgMat1 by adding one more argument to the function, w, whose default
value is the expression x+y+z. We name
the function ArgMat2 as follows:
#Creating a function
> ArgMat2 <- function(x, y, z, w=x+y+z){
+ x <- x+1; y <- y+1; z <- z+1
+ cat("Is the value of w missing?", missing(w), "\n")
+ x*y*z
+ }

So, the only difference between ArgMat1 and ArgMat2 is of the 4th
formal argument w whose value depends on x, y and z arguments. So, to
make a function call of ArgMat2, it is enough to pass x, y and z
arguments only as follows:
#Function call of ArgMat2
> ArgMat2(x=2, y=1, z=4)
Is the value of w missing? TRUE
[1] 30

Note that no value is supplied for the w argument in the function call, therefore
missing(w) returns TRUE. Moreover, w is not used in the function body, so its
default expression x+y+z is never evaluated. This is known as lazy evaluation
of function arguments: a function argument is not
evaluated unless its value is required. For more clarification, let
us modify ArgMat2 and create a new function named ArgMat3 as follows:
#Creating a function
> ArgMat3 <- function(x, y, z, w=x+y+z){
+ x <- x+1; y <- y+1; z <- z+1
+ x*y*z*w
+ }
It can be seen that the main difference between ArgMat2 and ArgMat3 is
in the last statement, whose value is returned. The last statement of ArgMat3
uses the value of w, so w gets evaluated at that point, after x, y and z have
been incremented (w = 3+2+5 = 10), and the value of the product x*y*z*w is
returned as follows:
#Function call
> ArgMat3(x=2, y=1, z=4)
[1] 300
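Lazy evaluation can be made visible with a default value that would raise an error if it were ever evaluated. The following is a minimal sketch; the functions lazyDemo and useW are hypothetical names introduced only for this example.

```r
# The default of w calls stop(), but w is never used in the body,
# so the default is never evaluated and no error occurs
lazyDemo <- function(x, w = stop("w was evaluated")) {
  x * 2
}
print(lazyDemo(3))   # 6

# Here the default of w is evaluated only when w is first used,
# that is, after x has already been incremented
useW <- function(x, w = x * 10) {
  x <- x + 1
  w
}
print(useW(2))       # 30, not 20
```

The second function shows why ArgMat3 returns 300: the default expression is evaluated with the values that x, y and z hold at the moment w is first needed.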

Having discussed exact matching on tags and positional matching, we now
illustrate partial matching on tags in R. For illustration purposes, we
shall create two user-defined functions, namely, StdData1 and
StdData2. The first function has three formal arguments with tags
English, Hindi and Mathematics. The second function
has two formal arguments with tags Management and
Mathematics. These two functions are simply created to print the marks
obtained by candidates in different subjects. Let us first consider
StdData1.
#Creating user-defined function StdData1
> StdData1 <- function(English, Hindi, Mathematics){
+ cat("\n English-", English, "\n Hindi-", Hindi,
+ "\n Mathematics-", Mathematics, "\n")
+ }

This function can simply be called using tags and without using tags, as
discussed earlier. Now, we call this function by using incomplete tags as
follows:
#Calling the StdData1 function with incomplete tags
> StdData1(Eng=100, Hin=50, Math=90)

English- 100
Hindi- 50
Mathematics- 90

> StdData1(Hin=50, Eng=100, Math=90)

English- 100
Hindi- 50
Mathematics- 90
Hence, the output verifies that incomplete tags can also be used to call a
function, but this practice should not be encouraged.
Next, to discuss partial matching further, we consider the StdData2 function and
call it by matching exactly one argument completely as follows:
#Creating user-defined function StdData2
> StdData2 <- function(Management, Mathematics){
+ cat("\n Management-", Management, "\n Mathematics-",
Mathematics,"\n")
+ }
#Calling the StdData2 function by partial matching on a tag
> StdData2(Management=50, Ma=100)

Management- 50
Mathematics- 100
Hence, the output verifies that partial matching on tags took place. It will
be interesting to check how far this partial matching on tags is allowed.
Consider the following function call:
#Calling the StdData2 function
> StdData2(Man=50, Ma=100)
Error in StdData2(Man = 50, Ma = 100) : formal argument
"Management" matched by multiple actual arguments
You can see that an error occurs here. The reason behind the error is
that the actual argument Man partially matches the formal argument Management,
and the actual argument Ma also partially matches Management (as well as
Mathematics); therefore, the error “formal argument "Management" matched by
multiple actual arguments” appears.
In the next illustration, we use a very interesting object of R as a
function argument, which is dot-dot-dot (...). Recall that the ...
argument allows us to take any number of arguments. To illustrate it, we
now create a function named Mn, which accepts any number of arguments
together with the n and m arguments as follows:

#Creating a function
> Mn <- function(..., n, m){
+ (sum(...)-n)/m
+ }

From this function definition, it is clear that the Mn function computes the
sum of its arguments other than n and m, then subtracts n
from the sum and thereafter divides the remaining value by m. Let us create
a function call for Mn by supplying 3 vector arguments in addition to n and
m as follows:
#Calling the Mn function
> x <- 1:10; y <- 11:20; z <- 21:30
> Mn(x, y, z, n=3, m=2)
[1] 231
#Verification of the obtained result
> Mn(1:30, n=3, m=2)
[1] 231

Hence, the output verifies that, when the Mn function was called, the formal
arguments were compared with the actual arguments. The n and m arguments
were matched by their exact tags in the function call, and the
remaining arguments were absorbed by the ... argument.
Most importantly, try to use the ... argument as the last argument in the
arglist of any function, otherwise it may absorb other arguments as
well; see, for example, the following calls:
#Calling the Mn function without using tags
> Mn(1:30, 3, 2)
Error in Mn(1:30, 3, 2) : argument "n" is missing, with no
default
> Mn(x, y, z, 3, 2)
Error in Mn(x, y, z, 3, 2) : argument "n" is missing, with
no default
You should note that we get an error message because the ... argument
has absorbed the other arguments as well, leaving n unmatched. This
problem can be tackled by modifying the formal argument list of Mn so
that ... comes last. We name this updated function MnUpd; note that its
body adds n to the sum instead of subtracting it:
#Creating a function
> MnUpd <- function(n, m, ...){
+ (sum(...)+n)/m
+ }
#Calling the function
> x <- 1:10; y <- 11:20; z <- 21:30
> MnUpd(3, 2, x, y, z)
[1] 234
#Verification of the obtained result
> MnUpd(3, 2, 1:30)
[1] 234

Hence, the output verifies that the n and m arguments were matched first
(by position) and then the remaining arguments were absorbed by the ...
argument.
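The absorbing behaviour of the ... argument can be sketched as follows. The function Mn2 is a hypothetical variant of Mn (it is not part of this unit); it gives n and m default values only so that the untagged call runs instead of stopping with an error. Under this assumption, the untagged values 3 and 2 are absorbed by ... and simply added to the sum.

```r
# Hypothetical variant of Mn with default values for n and m
Mn2 <- function(..., n = 0, m = 1){
  (sum(...) - n)/m
}

# n and m supplied with exact tags: (465 - 3)/2
r_tagged <- Mn2(1:30, n = 3, m = 2)

# No tags: 3 and 2 are absorbed by ..., so the result is (465 + 3 + 2 - 0)/1
r_untagged <- Mn2(1:30, 3, 2)

print(c(r_tagged, r_untagged))
```

Formal arguments written after ... are matched only when their tags are given in the function call, which is why placing ... last is the safer design.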

SAQ 4
Create a function which returns the minimum of each column of the data
frame argument.

6.8 RECURSION
In programming languages, recursion is a process in which a function calls
itself repeatedly until some specified condition is satisfied. Whenever you
solve a problem using recursion, two conditions must be satisfied. First,
the function should call itself again and again and second, the recursive
function should include a stopping criterion (generally an if statement).
The absence of a stopping criterion leads to infinite recursion. For
illustration, we now create a recursive function named nthderiv to
compute the nth order derivative of a simple function using the D()
function (refer to Session 2 of MSTL-011 for more detail).
#Creating a function to compute the nth order derivative
> nthderiv <- function(fx, ch, n){
+ y <- D(fx, ch)
+ n <- n-1
+ if(n>0){
+ nthderiv(y, ch, n)
+ } else {
+ return(y)}
+ }

The user-defined function nthderiv has three formal arguments, namely,
fx, ch and n. Each argument is used for a specific purpose. The fx
argument takes an expression object whose derivative is to be computed.
The second argument ch takes the character variable with respect to
which the derivative is to be computed. The n argument specifies the
number of times the derivative is to be computed. So, we control the
functioning of the nthderiv function using 3 function arguments. Now we
discuss its execution.
After receiving its arguments from the function call, the function is
evaluated sequentially. Firstly, the first derivative of the expression is
computed and assigned to y. As one derivative has now been computed, n
is decremented by 1. Then an if condition is used either to continue or
to exit the recursion: if, after decrementing, the value of n is greater
than zero, the nthderiv function calls itself again with y as the
expression; otherwise, the function returns the value of y. This process
continues as long as n is greater than zero.
#Creating an expression object
> eobj <- expression(x^4+5*x-3)

Note: An R object of type "expression" is created using the
expression() function available in the base package. Expression objects
are unevaluated R statements. An expression object is evaluated only
when it is passed to the eval() function (refer to Session 2 of MSTL-011
(Statistical Computing Using R-I) for more detail).
Next, we call (invoke) the created function by passing the three arguments
as follows:
The first argument as an expression object eobj.
The second argument as a character variable with respect to
which derivative is to be computed.
The last argument as 2, to compute 2nd order derivative.
#Invoking a function (creating a function call)
> nthderiv(eobj, "x", 2)
4 * (3 * x^2)
Hence, we get the required result, i.e., $\frac{d^2}{dx^2}\left(x^4+5x-3\right) = 12x^2$.
Note: Recursion in R is not used frequently, due to the availability of a
number of built-in functions.
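As a quick check of the result, the returned derivative can be evaluated at a chosen point with eval(). The following is a minimal sketch, assuming the nthderiv function defined above (repeated here so that the code is self-contained); the point x = 2 is an arbitrary choice made only for this illustration.

```r
# nthderiv as defined in this section
nthderiv <- function(fx, ch, n){
  y <- D(fx, ch)
  n <- n-1
  if(n>0){
    nthderiv(y, ch, n)
  } else {
    return(y)}
}

# Second order derivative of x^4+5*x-3 with respect to x
d2 <- nthderiv(expression(x^4+5*x-3), "x", 2)

# Evaluating the derivative at x = 2, i.e., 12 * 2^2
val <- eval(d2, list(x = 2))
print(val)
```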

SAQ 5
Create a recursive function which will call itself n number of times to print
the following message by Dr. APJ Abdul Kalam ji.
"Dream Transform into Thoughts and Thoughts Result in
Action"

6.9 ENVIRONMENTS AND SCOPE
Whenever we start R and create some objects, by default they are
created in the global environment. So, the user’s workspace is the global
environment. The current environment in R can be seen using the
environment() function available in the base package. For illustration,
we now create three arbitrary R objects, namely, x, y and z. Then we
list all of them using the ls() function and thereafter check in which
environment these objects are available using the environment() function
in the following manner:
#Removing all the objects
> rm(list=ls())
#Creating R objects
> x <- matrix(1:4, 2, 2)
> y <- list(x, as.logical(x))
> z <- as.data.frame(x)
#Listing all objects present in the current R environment
> ls()
[1] "x" "y" "z"
#Checking the current R environment
> environment()
<environment: R_GlobalEnv>

Hence, by default the objects are created in the global environment
(R_GlobalEnv).
Whenever a function is created, it records the environment in which it
was created (its enclosure), and the variables or objects bound in that
environment are available to the function. A function together with the
bindings of its environment is called a function closure. Additionally,
when a function is called, a new evaluation environment is created and
the formal arguments are matched with the supplied arguments. The body of
the function is always evaluated in the evaluation environment of the
function.
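The difference between the environment of a function and its evaluation environment can be sketched as follows. The function name f and the object names are hypothetical and are used only for this illustration; the sketch shows that every call gets a fresh evaluation environment whose parent is the environment in which the function was created.

```r
# A function that returns its own evaluation environment
f <- function() environment()

e1 <- f()   # evaluation environment of the first call
e2 <- f()   # evaluation environment of the second call

# Each call creates a new evaluation environment
same_env <- identical(e1, e2)

# The parent of an evaluation environment is the enclosure of f
parent_is_enclosure <- identical(parent.env(e1), environment(f))

print(c(same_env, parent_is_enclosure))
```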
Scope of variables:
Before we discuss scope, it is important to understand the difference
between global and local objects or variables. The R objects created in
the R workspace are global, whereas the objects created inside a function
definition are local. Because of scoping, we can use the same name for
local and global variables/objects. Note that whenever an object is
requested inside a function, it is first searched for in the evaluation
environment, then in the enclosure and so on until the global environment
is reached. Scope mainly comes into the picture when an unbound variable
(one whose value is not defined inside the function) is found. For
illustration, we now create a function named PRINT, which prints a given
character ch n times and ends the line with the character p. Before
that, let us check the current environment as follows:
#Checking for current environment
> environment()
<environment: R_GlobalEnv>

Now we define a character variable p in the global environment.
#Defining an object p in global environment
> p <- "$"

Next, we create the function named PRINT to print a line of the given
character ch n number of times. So, ch and n are its function arguments.
#Creating a function
> PRINT <- function(ch, n)
+ {
+ for(i in 1:n) cat(ch)
+ cat(p, "\n")
+ environment() #Evaluation environment
+ }
> PRINT("*", 50)
**************************************************$
<environment: 0x0000000015c5b700>

Note that in the created function there is only one unbound variable,
which is p. Additionally, to get the evaluation environment, the
environment() function is called inside the function. Since p is not
found in the evaluation environment of the function, its value is
searched for in the enclosures and the global environment. In the global
environment p is assigned as ‘ $ ’, which is why we get the output as
follows:
**************************************************$
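The search order for objects can be sketched with two hypothetical functions (use_outer and use_local are names introduced only for this illustration). The first function has p as an unbound variable, so its value is taken from the enclosing environment; the second creates its own local p, which hides the outer p without changing it.

```r
p <- "$"

# p is unbound here, so it is looked up outside the function
use_outer <- function() p

# A local p is created; the outer p is left untouched
use_local <- function(){
  p <- "#"
  p
}

a <- use_outer()
b <- use_local()
print(c(a, b, p))
```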

Note: The search path for any R object can be seen using the search()
function available in the base package as follows:
#Looking for the search path
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"

6.10 SUMMARY
The main points discussed in this unit are as follows:
We have discussed the general syntax of creating a user-defined
function;
The usage of function arguments and argument matching are discussed;
The method of creating a function call by taking care of argument
matching is discussed;
Function environments and scope are also discussed;
Several user-defined functions are illustrated; and
The differences between built-in and user-defined functions are
explained.

6.11 TERMINAL QUESTIONS
1. Define built-in functions. Give an example of a built-in function.
2. Write the names of at least two built-in functions which are
available in the base package.
3. Differentiate between user-defined and built-in functions.
4. Create a function to check whether the given number is an even
number or an odd number.
5. Create a function to obtain Fibonacci series of size n.
6. If A is a square matrix of order 3x3 and I is an identity matrix of
the same order, then create a function named Mult which computes
the following expression for any arbitrary matrix A:
$A^3 + 3I$
7. Write any two advantages of user-defined functions.

6.12 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. (i) function – Inappropriate, as keywords or reserved words
should not be used as a function name.
(ii) mean – Inappropriate, as mean is the name of a built-in
function.
(iii) sum of squares – Inappropriate and invalid, as a function
name should not contain white space.
(iv) Est$x – Inappropriate and invalid, as a function name should
not contain any special character except for the dot and underscore.
(v) Varx – Appropriate
2. From the given code, it is clear that x is a list with three
components. To evaluate y, z and w the same code is repeated 3
times but with different components of the list x. So, we create a
function named Appen as follows:
Appen <- function(z){
(z-mean(z))/sd(z)
}
Then we call Appen to get the values of y, z and w by passing each
of the list components one-by-one as follows:
y <- Appen(x[[1]]);y
z <- Appen(x[[2]]);z
w <- Appen(x[[3]]);w
3. The arguments in the function definition Ch and n are the formal
arguments and the arguments in the function call " * " and 50 are
the actual arguments.
4. A function named MinDf to compute the minimum of each column
of any supplied data frame can be created as follows:
MinDf <- function(Df){
n <- ncol(Df)
for(i in 1:n) cat("Minimum of", i, "column",
names(Df)[i], "is", min(Df[,i]),"\n")
}
After creating the function, the written code can be checked by
supplying any data frame as an argument to the function, say for
example.
MinDf(USArrests)
Note: The output will be as follows:
Minimum of 1 column Murder is 0.8
Minimum of 2 column Assault is 45
Minimum of 3 column UrbanPop is 32
Minimum of 4 column Rape is 7.3
5. To print the given message n number of times, we can create a
recursive function named QUOTES as follows:
QUOTES <- function(n){
print("Dream Transform into Thoughts and Thoughts Result in Action")
n <- n-1
if(n>0) QUOTES(n)}
After creating the function, we can call it to print the quote 5 times
as follows:
QUOTES(5)
Note: The same task could have been done using rep() function,
loops and other methods. But in the question the method was
clearly mentioned.

Terminal Questions (TQs)


1. Refer Section 6.3.
2. The print() and summary() functions are available in the base
package.
3. Refer Section 6.1.
4. A function named CheckEv can be created using remainder
operator and if-else statement to check whether a number is
even or not as follows:
#Function definition
CheckEv <- function(x){
if(x%%2==0)
print("Supplied value is an even number")
else
print("Supplied value is an odd number")
}
This function takes only one argument, which is the number x, and
checks using the remainder operator, whether the number is even or
odd. If the number is even the message "Supplied value is
an even number" is printed or otherwise if the number is odd the
message "Supplied value is an odd number" is printed.
Next, to check if the function is working properly or not a function
call can be invoked with one even value and another odd value as
follows:
#Function call
CheckEv(10)
CheckEv(7)
5. We now create a function named FibSe to obtain the Fibonacci
series of size n as follows:
FibSe <- function(n){
z <- c()
x <- 0; y <- 1
for(i in 1:(n-2)){
z[i] <- x+y
x <- y
y <- z[i]}
c(0,1,z)
}
The FibSe function takes only one function argument, which is n
(the size of the Fibonacci series, assumed to be at least 3). At the
beginning of this function an empty vector z is created to store the
computed Fibonacci numbers. As the first two Fibonacci numbers are 0
and 1, the remaining n-2 numbers are computed with the help of a for
loop. The value of the last statement of the function, which is the
computed Fibonacci series, is returned by default.
After creating the function, we can call it with any value of n using the
following function call:
FibSe(n=10) #say for n=10
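The sequence 1:(n-2) counts backwards when n is less than 3 (for example, 1:0 gives 1 0), so FibSe is meant for n of at least 3. A minimal sketch of a variant that also handles smaller sizes is given below; the name FibSe2 is hypothetical, and seq_len() is used because it returns an empty sequence when its argument is zero.

```r
# Hypothetical variant of FibSe that also works for n = 0, 1 and 2
FibSe2 <- function(n){
  fib <- numeric(n)            # vector of n zeros; first number is 0
  if(n >= 2) fib[2] <- 1       # second Fibonacci number
  for(i in seq_len(max(n-2, 0))){
    fib[i+2] <- fib[i] + fib[i+1]
  }
  fib
}

print(FibSe2(10))
print(FibSe2(2))
print(FibSe2(1))
```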
6. In this question, we are asked to create a function which takes a
matrix as an argument and computes the following matrix expression:
$A^3 + 3I$
where A is a square matrix of order 3x3 and I is an identity matrix
of the same order.
A function named Mult can be created to compute the given matrix
expression as follows:
Mult <- function(A, n){
I <- diag(n)
A%*%A%*%A+3*I
}
It can be observed that the Mult function has two function arguments,
A and n, because the computation of the given expression depends on the
matrix A and the identity matrix I (of order n). After creating the Mult
function, the expression $A^3 + 3I$ can be computed for any arbitrary
matrix A as follows:
#Arbitrary matrix B of order 3x3
B <- matrix(1:9, 3, 3)
#Function call
Mult(B,3)
7. Refer Section 6.1.

UNIT 7

CONTROL-FLOW CONSTRUCTS OF R

Structure
7.1 Introduction
    Expected Learning Outcomes
7.2 Versions of if Statements
    An if Statement
    An if-else Statement
    Nested if Statements
    The ifelse() Function
7.3 Loops
    The for Loop
    The Nested for Loop
    The while Loop
    The repeat Loop
    Nested Control-Flow Constructs
7.4 The next and break Statements
7.5 Summary
7.6 Terminal Questions
7.7 Solutions/Answers

7.1 INTRODUCTION
It is not always possible to write your entire code (to do a particular task) as a
single sequence of R statements. The problems encountered in real-life data
analysis are rarely that simple. Often we need to execute one set of
statements under a specific condition and a different set of statements under
other conditions. Additionally, we may encounter situations in which a program
or set of statements has to be executed or evaluated more than once. Such
situations can be handled smoothly and efficiently with the help of the
control-flow constructs of R. Moreover, as in any other language, the
statements written inside conditional statements and loops in R are executed
in the same order in which they appear.
The control-flow constructs determine the sequence in which the statements
are executed in a code portion. Generally, computations consist of
sequentially evaluating statements. Further, the statements are separated
either by a semicolon ( ; ) or by a new line. Syntactically complete and
correct statements in R are evaluated when we press the enter key, i.e., when
the new line is encountered at the end of the statement (on the R console).
For error-free execution of a code portion, every statement should be
syntactically complete and correct.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
In addition to this, note that in R more than one statement can be grouped
together using the braces ‘ { } ’ and the group of statements is referred to as
a block.
The if, for, while, repeat, break and next words used in conditional
statements and loops are reserved words. These are the basic control-flow
constructs of the R language.
Note: The if statements are also known as conditional execution statements
and the loops are also known as repeated execution statements.

Expected Learning Outcomes


After completing this unit, you should be able to:
write and use various types of conditional if statements;
write and use various types of loops, such as for, while and repeat;
use the control structures appropriately (according to the problem
requirements); and
learn the difference between the use of the break and next loop
control statements.

7.2 VERSIONS OF if STATEMENTS
In this section, we shall discuss different versions of if statements (also
known as conditional execution statements). In any conditional execution
statement, one out of two or more possible actions is carried out on the basis
of a test condition. In conditional execution, we first frame a test condition.
The test condition can be a value, variable, or an expression that yields TRUE
or FALSE or a numeric value. Note that if a test condition is a numeric value,
then a non-zero value means TRUE and a zero value means FALSE.
Moreover, a test condition may include different types of operators. Recall
that different types of operators have already been discussed in Unit 2 of the
MST-015 (Introduction to R Software) course. So, before reading this unit, you
must revise the operators and remember the precedence of the different
operators used in test conditions. By precedence we mean that higher
precedence operators are evaluated before lower precedence operators. For a
quick reference of operators see the following table:

Operator Category                            Operators      Precedence
Arithmetic multiply, divide and remainder    * / %%         Highest
Arithmetic addition and subtraction          + -               .
Relational operators                         < <= > >=         .
Equality operators                           == !=             .
Logical operators                            ! && & || |    Lowest
Whenever a test condition is framed, firstly the parentheses ‘ ( ) ’ are
solved, then the expressions involving arithmetic operators are solved,
thereafter relational operators and then logical operators are solved. It is
also important to note that operators of the same precedence in a test
expression (or test condition) are solved from left to right.
Recall that in Unit 2 of the MST-015 course, the equality operators were
discussed under the relational operators. We should always be careful while
dealing with the equality operator ‘ == ’, which looks similar to the
assignment operator ‘ = ’, but the two are entirely different. If we use the
wrong operator, we may or may not get an error message, but we will
certainly get erroneous results.
To learn the working of control-flow constructs, it is very important to first
learn about a block and then the framing of test conditions in R. A block in
R is created using braces ‘ { } ’, i.e., opening and closing curly brackets.
Note that the value of the last statement of a block is, by default, the
output of the block. See the following code, in which we write three
statements together in three different ways:
#Writing 3 statements using enter key at the end of each
#statement
> x <- 5
> y <- 7
> x+y
[1] 12 #Output

#Writing 3 statements together by separating them with
#semicolon
> x <- 5; y <- 7; x+y
[1] 12 #Output

#Writing 3 statements within a block
> {x <- 5
+ y <- 7
+ x+y #default return statement
+ }
[1] 12 #Output
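Because a block yields the value of its last statement, that value can also be assigned to an object. A minimal sketch (the object name total is hypothetical):

```r
# The value of the block is the value of its last statement, x+y
total <- {
  x <- 5
  y <- 7
  x+y
}
print(total)
```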

Next, we shall show some examples of test conditions. Recall that a test
condition should always yield one value, either TRUE, FALSE or any numeric
value, which means the length of the test condition is only one. To make you
understand the framing of the test conditions, we now frame some test
conditions and see their interpretation as TRUE or FALSE.
Before framing the test conditions using different types of operators, it is
important to have some assigned variables, say, i and f as numeric
variables; and j as a character variable as follows:
#Assigning variables
> i <- 7; f <- 5; j <- "z"

Then some of the possible test conditions which can be framed using these
assigned variables could be as follows:

Test Condition                     Interpretation   Numeric Value
(i >= 5) && (j == "z")             TRUE             1
(i >= 5) && (j == "x")             FALSE            0
(i >= 5) || (j == "z")             TRUE             1
(f < 15) || (i > 2)                TRUE             1
(j != "x") && ((i+f) < 12)         FALSE            0
(i < 2) && (j > 3) || (j == "z")   TRUE             1

Along similar lines, using different operators and assigned variables, we
can frame any number of test conditions.
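Test conditions of this kind can be checked directly in R. The following is a minimal sketch using the variables i, f and j assigned above (the names c1 to c4 are hypothetical); as.numeric() converts TRUE to 1 and FALSE to 0.

```r
i <- 7; f <- 5; j <- "z"

c1 <- (i >= 5) && (j == "z")              # TRUE and TRUE
c2 <- (f < 15) || (i > 2)                 # TRUE or TRUE
c3 <- (j != "x") && ((i+f) < 12)          # TRUE and FALSE, since i+f is 12
c4 <- (i < 2) && (j > 3) || (j == "z")    # && is solved before ||

# Numeric values of the four test conditions
num <- as.numeric(c(c1, c2, c3, c4))
print(num)
```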

7.2.1 An if Statement
An if statement is generally used when we get the answer to a testing
condition (or question) in terms of yes (TRUE) and no (FALSE). The general
conditional structure involves the if-else statement. But often we may
encounter problems in which the else part is not required. This means a
number of statements (or a single statement) are to be executed if the test
condition is TRUE and the same statements will be skipped if the test condition
is FALSE.
The if statement begins with the reserved word if and consists of a test
condition, which is an expression written as TestCond within the parentheses
‘ ( ) ’ and which must result in TRUE or FALSE or a numeric value. After the
parentheses, the R statements which are to be executed are written in braces
‘ { } ’.
Note: We can get help on the if statement by writing the following command:
#Seeking help
> ?'if'

The general syntax of writing an if statement without the else part is as follows:
#An if statement
if (TestCond) {
Executable statement(s)
}

Alternatively, it may be possible that instead of several statements, we just
want a single statement (or an expression) to be evaluated when the
TestCond yields TRUE. In that case the braces ‘ { } ’ can be skipped and the
general syntax can be written as a single line of code as follows:
#Single line if statement

if (TestCond) expression

Note that, in any if statement, the executable statement(s) will be executed
if and only if the TestCond is TRUE; they will be skipped if the TestCond is
FALSE.
Now we present two illustrations for a deeper understanding of an if statement.
In the first illustration, we would like to obtain the square root of a non-zero
number, but the given number can be positive or negative. Also, we know that
we cannot compute the (real) square root of a number if the number is negative.
After computing the square root, we would like to keep that value in order to
add it to the original value (whose square root is computed). So, in this case
an if statement can be used efficiently. To do so, we first assign the variable
x (whose square root is to be computed) as 10, and we frame the test condition
as x>0. Then we write some executable statements in the braces, which will
be evaluated if and only if the test condition is TRUE. Since the value of the
computed square root is to be added to x, we assign the computed value to y
and finally compute their sum using the statement x+y as follows:
#Computing square root of a number
> x <- 10
> if(x>0){
+ y <- sqrt(x)
+ cat("Square root of x is", y, "\n")
+ }
Square root of x is 3.162278

#Adding the computed square root to x
> print(x+y)
[1] 13.16228

Note that, since the assigned value was positive, the test condition x>0
yields TRUE. Due to this, the executable statements written in braces are
executed and the square root of x is computed as 3.162278. This computed
value is assigned to a variable y and is printed using the cat() function.
After the execution of the if statement, the sum of x and y is computed and
printed using x+y as 13.16228 (without any error message).
Note: The value of the y variable will be known only if the square root of x
is computed successfully; otherwise it will be unknown and an error message
will be produced.
Next, we assign x with a negative value, say -2 and run the same code again
as follows:
#Computing square root of a number
> x <- -2
> if(x>0){
+ y <- sqrt(x)
+ cat("Square root of x is ", y, "\n")
+ }

#Adding the computed square root to x
> print(x+y)
Error in print(x + y) : object 'y' not found

Observe from the output that, as the assigned value of x is negative, the
executable statements written in braces of the if statement are skipped and
y remains unknown (assuming that y does not already exist in the workspace
from an earlier run). Therefore, the statement print(x+y) gives an error
message.
The same if statement can be written more concisely (without the braces),
by not assigning the square root to y as follows (with x again assigned as 10):

#Computing the square root and sum concisely
> x <- 10
> if (x>0) cat("Square root of x is ", sqrt(x), "\n", "sum =", x+sqrt(x), "\n")
Square root of x is 3.162278
sum = 13.16228

In the next illustration, we shall show that a non-zero number depicts TRUE
and a zero number depicts FALSE. To do so, we first assign a variable x as
zero. Then overwrite it by incrementing it by 1 using an if statement and
finally print it as follows:
#When the test condition is numeric 0
> x <- 0 #assigning x
> if(x) x <- x+1 #incrementing x by 1
> print(x) #printing x
[1] 0

Note that, we get a 0 output because the test condition, which is a zero value,
yields FALSE, therefore the statement written after the if statement was
skipped. Next, we assign a non-zero value 3 to x and observe the output as
follows:
#When the test condition is a non-zero value

> x <- 3
> if(x) x <- x+1
> print(x)
[1] 4

Note that, as the test condition, which is a non-zero value, yields TRUE,
the if statement is executed and the value of x is overwritten by x+1.
Thus, the value of x is incremented by 1 and printed as 4.
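The interpretation of numeric test conditions can be previewed with the as.logical() function, which applies the same rule (zero means FALSE and any non-zero value means TRUE). A minimal sketch with arbitrarily chosen values (the object names are hypothetical):

```r
# Arbitrary numeric values: zero, a positive and a negative number
vals <- c(0, 3, -1.5)

# Zero is interpreted as FALSE; non-zero values as TRUE
flags <- as.logical(vals)
print(flags)
```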

7.2.2 An if-else Statement
atement
In the previous subsection, you must have observed that an if statement does
nothing (or just skips the executable statements written in the braces of the
if statement) if the TestCond is FALSE. But we may encounter situations in
which we want to execute some statements (as an alternative action) if the
TestCond is FALSE; in such situations we use an if-else statement.
The general syntax of an if-else statement is as follows:
#An if-else statement
if (TestCond) {
True block executable statement(s)
} else {
False block executable statement(s)
}

Note: There should not be a newline between the ‘ } ’ and else when the
statement is typed at the top level (for example, on the R console);
otherwise, we will get a syntax error. An if-else statement is the most
common version of the if statements, which specifies an alternative action in
the case when the TestCond is FALSE.
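The rule about the position of else can be checked without typing on the console by passing the code as text to the parse() function, which reports syntax errors. The following is a minimal sketch; the object names are hypothetical, and try() is used only to capture the error instead of stopping.

```r
# else starts a new line at the top level: a syntax error
bad <- "if (TRUE) {\n  1\n}\nelse {\n  2\n}"

# else follows the closing brace on the same line: valid
good <- "if (TRUE) {\n  1\n} else {\n  2\n}"

# Inside an outer block the newline before else is accepted
wrapped <- paste0("{\n", bad, "\n}")

bad_fails  <- inherits(try(parse(text = bad), silent = TRUE), "try-error")
good_ok    <- !inherits(try(parse(text = good), silent = TRUE), "try-error")
wrapped_ok <- !inherits(try(parse(text = wrapped), silent = TRUE), "try-error")

print(c(bad_fails, good_ok, wrapped_ok))
```

Keeping ‘ } else { ’ on one line works in every context, so it is the safer habit.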

Next, we explain the execution of an if-else statement. If in an if-else
statement the TestCond is TRUE, then the executable statements written in
the true block will be evaluated and the executable statements written in the
false block will be skipped. Otherwise, if the TestCond is FALSE, then the
executable statements written in the false block will be evaluated and the
executable statements written in the true block will be skipped.
For the illustration purpose consider the following example in which a credit
card company has to decide, whether a credit card should be given to a
candidate or not, based on the fulfilment of the following two conditions:
1. Candidate should have job experience of minimum 5 years.
2. Candidate should belong to the medium or high-income group.
If both the conditions are satisfied then only the credit card should be issued to
a candidate otherwise credit card should not be issued. In this case, the
TestCond can be framed as follows:
#Framing a test condition
(experience >= 5) && ((group == "Medium") || (group == "High"))

Here, the experience variable represents a candidate’s job experience
and the group variable represents a candidate’s income group.
Note: The AND condition is framed using the logical operator ‘ && ’ and the
OR condition is framed using the logical operator ‘ || ’. As ‘ && ’ has
higher precedence than ‘ || ’, the OR condition is enclosed in its own
parentheses so that the experience condition applies to both income groups.
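Because ‘ && ’ has higher precedence than ‘ || ’, the placement of parentheses can change the decision. A minimal sketch with a hypothetical candidate who has only 1 year of experience but belongs to the "High" income group (the object names are introduced only for this illustration):

```r
experience <- 1
group <- "High"

# Without inner parentheses: solved as (A && B) || C
without_parens <- (experience >= 5) && (group == "Medium") || (group == "High")

# With inner parentheses: solved as A && (B || C)
with_parens <- (experience >= 5) && ((group == "Medium") || (group == "High"))

print(c(without_parens, with_parens))
```

Only the second form rejects this candidate, which is what the two stated conditions require.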
To test the framed condition, we need some assigned values of the
experience and group variables. So, we first assign these two variables
arbitrarily as follows:
#Assigning candidate’s information
> experience <- 6
> group <- "High"

After assigning the data, we next use the if-else statement together with
the framed test condition to arrive at a decision as follows:
#Using if-else statement to arrive at a decision
> if((experience >= 5)&&((group == "Medium")||(group == "High"))){
+ print("Credit card application is approved") #true block
+ } else {
+ print("Credit card application is rejected") #false block
+ }
[1] "Credit card application is approved" #Decision

The obtained output shows the acceptance of the application of a candidate
who has more than 5 years of experience and belongs to the "High" income
group.
We again run the same if-else statement with another candidate’s data as
follows:
#Assigning candidate’s information
> experience <- 4
> group <- "Medium"
#Using if-else statement to arrive at a decision
> if((experience >= 5)&&((group == "Medium")||(group == "High"))){
+ print("Credit card application is approved")
+ }else{
+ print("Credit card application is rejected")
+ }
[1] "Credit card application is rejected" #Decision

The obtained output shows the decision as rejected, as the experience of the
candidate is less than 5 years and the credit card is given if and only if
both the conditions are satisfied together.
In the next illustration, we would like to compute the tax, which varies
according to the type of the item. If the considered item is an essential item,
then the person has to pay 5% of the price of the item as tax. Otherwise, if
the item is a luxury item, then the person has to pay 18% of the price of the
item as tax. Now, we frame an if-else statement to compute the tax as follows:
#Framing an if-else statement
if(item == "Essential") {
tax = 0.05 * price
} else {
tax = 0.18 * price }

#Or otherwise
if(item == "Luxury") {
tax = 0.18 * price
} else {
tax = 0.05 * price }

The same if-else statement can also be written more concisely as follows:
#Writing if-else statement more concisely
if(item=="Essential") tax = 0.05 * price else tax = 0.18 * price

#Or otherwise
if(item=="Luxury") tax = 0.18 * price else tax = 0.05 * price

Note: You can use any one out of the four written if-else statements to
compute the tax as all will give the same computed tax.
Next, we assign an item’s information arbitrarily and observe the output as
follows:
#Assigning information of a person
> item <- "Luxury"
> price <- 700000
#Computing the tax
> if (item == "Essential") {
+ tax = 0.05 * price
+ } else {
+ tax = 0.18 * price }
> print(tax)
[1] 126000 #Computed tax

Since the item is a luxury item, the calculated tax is 700000 x 0.18
= 126000, which is the same as the obtained result.
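If the tax has to be computed for several items, the same if-else statement can be wrapped in a user-defined function. The following is a minimal sketch; the function name ComputeTax and the second call are hypothetical, and any item other than "Essential" is treated as a luxury item, as in the statement above.

```r
# Hypothetical function wrapping the if-else statement
ComputeTax <- function(item, price){
  if(item == "Essential") {
    rate <- 0.05
  } else {
    rate <- 0.18
  }
  rate * price
}

tax_luxury <- ComputeTax("Luxury", 700000)
tax_essential <- ComputeTax("Essential", 1000)
print(c(tax_luxury, tax_essential))
```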

7.2.3 Nested if Statements


In R programming it is possible to nest or embed an if-else statement, i.e.,
to write an if-else statement within another if statement. Accordingly, this
nesting gives rise to different forms of nested if-else statements. Examples
of the most general nested if-else statements are shown in this subsection.
The nesting of the if statements can be performed in various ways. We shall
now show some commonly used nested if statements.
Syntax 1: The true block of an if-else statement is nested with an if-else
statement.
#Nesting a true block of an if-else statement with an if-else
#statement

if(TestCond1){
-
if(TestCond2){
Executable statement(s) 1
} else {
Executable statement(s) 2
}
-
} else {
Executable statement(s) 3
}

Note: Here ‘ - ’ is used for other R statements.


In syntax 1, the TestCond1 will be tested first. If the TestCond1 is TRUE,
then the TestCond2 will be tested. If the TestCond2 is also TRUE then only
executable statement(s) 1 will be evaluated or otherwise executable
statement(s) 2 will be evaluated and the false block of the outer if-else
statement will be skipped. Contrarily, if the TestCond1 is FALSE, then the
executable statement(s) 3 will be evaluated and true block of the outer if-
else statement will be skipped. The execution of this nested if-else
statement can be summarized in the following table:

TestCond1 TestCond2 Executable statement(s)


TRUE TRUE Executable statement(s) 1
TRUE FALSE Executable statement(s) 2
FALSE - Executable statement(s) 3

Next, for the deeper understanding of the nested if-else statement, we
solve the following problem by nesting the true block of an if-else
statement:

$$y = \begin{cases} x - 2, & x \ge 4 \\ x + 4, & 2 < x < 4 \\ x^2, & x \le 2 \end{cases} \qquad \ldots(\text{A})$$

One of the possible ways of solving given problem (A) is by writing a nested
if-else statement. But to run the if-else statement, we need an assigned
value of x in advance, say, x as 1.5. Then the problem can be solved using
the following code:
#Assigning x
> x <- 1.5
#Writing nested if-else statement
> if(x>2){
+ if(x<4)
+ y <- x+4
+ else
+ y <- x-2
+ } else
+ y <- x^2 #braces are skipped due to single statement
#Printing the output
> print(y)
[1] 2.25 #Output

In this example, x is less than 2, therefore the test condition x>2 is FALSE and
the false block will be evaluated. Accordingly, y is assigned the value
x^2=1.5*1.5=2.25 and the output is printed.
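To check all three branches of problem (A), the same nested if-else statement can be placed inside a function and evaluated at one value of x from each range. This is a sketch under our own assumptions: the function name solve_A and the test values 3 and 5 are hypothetical.

```r
# Hypothetical helper implementing problem (A) with a nested if-else
solve_A <- function(x) {
  if (x > 2) {
    if (x < 4) {
      y <- x + 4    # branch for 2 < x < 4
    } else {
      y <- x - 2    # branch for x >= 4
    }
  } else {
    y <- x^2        # branch for x <= 2
  }
  y
}

y_low  <- solve_A(1.5)   # uses x^2
y_mid  <- solve_A(3)     # uses x + 4
y_high <- solve_A(5)     # uses x - 2
print(c(y_low, y_mid, y_high))
```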
Syntax 2: The true block of an if-else statement is nested with an if
statement.
#Nesting a true block of an if-else statement with an if
#statement
if(TestCond1){
-
if(TestCond2){
Executable statement(s) 1 }
-
} else {
Executable statement(s) 2 }

In syntax 2, the TestCond1 will be tested first. If the TestCond1 is TRUE,
then the TestCond2 will be tested. If the TestCond2 is also TRUE then only
executable statement(s) 1 will be evaluated or otherwise these statements will
be skipped. Contrarily, if the TestCond1 is FALSE, then the executable
statement(s) 2 will be evaluated and true block of outer if-else statement
will be skipped.
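A small illustration of syntax 2 may look as follows. It is only a sketch: the variable names x and msg and the chosen conditions are our own assumptions.

```r
x <- 3

if (x > 0) {
  msg <- "x is positive"            # other R statement in the true block
  if (x %% 2 == 1) {                # nested if statement (no else part)
    msg <- paste(msg, "and odd")    # executable statement 1
  }
} else {
  msg <- "x is not positive"        # executable statement 2
}

print(msg)
```

Since the nested if statement has no else part, nothing extra happens in the true block when x is positive and even.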

Note: On the similar lines the false block of an if-else statement can be
nested with an if statement as well.
Syntax 3: The false block of an if-else statement is nested with an if-
else statement.
#Nesting a false block of an if-else statement with an if-else
#statement
if(TestCond1){
Executable statement(s) 1
} else {
-
if(TestCond2){
Executable statement(s) 2
} else {
Executable statement(s) 3 }
-
}

In syntax 3, the TestCond1 will be tested first. If the TestCond1 is TRUE
then executable statement(s) 1 will be evaluated and the false block will be
skipped. Contrarily, if the TestCond1 is FALSE, then the TestCond2 will be
tested. If the TestCond2 is TRUE then the executable statement(s) 2 will be
evaluated or otherwise executable statement(s) 3 will be evaluated and true
block of outer if-else statement will be skipped.
For the illustration purpose we solve (A) again by nesting the else part of
nested if-else statement as follows:
#Alternative way to solve (A)
> x <- 1.5
> if(x<=2){
+ y <- x^2
+ } else {
+ if(x<4)
+ y <- x+4
+ else
+ y <- x-2 }
> print(y)
[1] 2.25

Hence, the result verifies that we get the same result as earlier.
Syntax 4: Both true and false blocks of an if-else statement are nested with
if-else statements.
#Nesting the true and false blocks of an if-else statement with
#if-else statements

if(TestCond1){
-
if(TestCond2){
Executable statement(s) 1
} else {
Executable statement(s) 2 }
-
} else {
-
if(TestCond3){
Executable statement(s) 3
} else {
Executable statement(s) 4 }
-
}

In syntax 4, the TestCond1 will be tested first. If the TestCond1 is TRUE,
then the TestCond2 will be tested. If the TestCond2 is also TRUE then only
executable statement(s) 1 will be evaluated or otherwise executable
statement(s) 2 will be evaluated and the false block of outer if-else
statement will be skipped. Contrarily, if the TestCond1 is FALSE, then
TestCond3 will be tested. If the TestCond3 is TRUE then executable
statement(s) 3 will be evaluated or otherwise executable statement(s) 4 will be
evaluated. The true block of outer if-else statement will be skipped. The
execution of this nested if-else statement can be summarized in the
following table:

TestCond1 TestCond2 TestCond3 Executable statement(s)
TRUE TRUE - Executable statement(s) 1
TRUE FALSE - Executable statement(s) 2
FALSE - TRUE Executable statement(s) 3
FALSE - FALSE Executable statement(s) 4
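The four rows of this table can be traced with a short example of syntax 4. The following sketch uses hypothetical names (n, label) and conditions chosen by us only for illustration.

```r
n <- -7

if (n >= 0) {                        # TestCond1
  if (n == 0) {                      # TestCond2
    label <- "zero"                  # executable statement 1
  } else {
    label <- "positive"              # executable statement 2
  }
} else {
  if (n %% 2 == 0) {                 # TestCond3
    label <- "negative and even"     # executable statement 3
  } else {
    label <- "negative and odd"      # executable statement 4
  }
}

print(label)
```

Here TestCond1 is FALSE and TestCond3 is FALSE, so the last row of the table applies.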

Syntax 5: Multiple if-else statements (or if-else ladder). It is generally
used when the selection is to be done from given multiple conditions.
#Multiple if-else statements

if(TestCond1) { Executable Statement 1
} else if(TestCond2) { Executable Statement 2
} else if(TestCond3) { Executable Statement 3
} else if(TestCond4) { Executable Statement 4
} else Executable Statement 5

In this layout, we have used only four test conditions, just to make you
understand the execution process. The test conditions are tested in sequence.
For the first TestCondi, i = 1, 2, 3, 4, that is TRUE, executable statement i
will be evaluated and all the other executable statements will be skipped. If
none of the four test conditions is TRUE (i.e., all the test conditions are
FALSE) then the last executable statement 5 will be evaluated.

Consider the following problem for illustration purpose:

Marks Grade
[90, ∞) A
[80, 90) B
[70, 80) C
[60, 70) D
[0, 60) F

Now we create an if-else ladder to solve this problem. To do so, we first
assign a variable marks representing the marks of a student as 69.
#Assigning grade to a student
> marks <- 69
> if(marks>=90) {cat("Grade A","\n")
+ } else if(marks>=80 && marks<90) { cat("Grade B","\n")
+ } else if(marks>=70 && marks<80) { cat("Grade C","\n")
+ } else if(marks>=60 && marks<70) { cat("Grade D","\n")
+ } else cat("Grade F","\n")
Grade D

The obtained result verifies that the if-else ladder is working well. You
may run it for different values of the marks variable.
Note: You should note that:
(i) It is advisable to use braces if there is more than one executable
statement. Special care is also required when placing else: at the top level,
else must appear on the same line as the closing brace of the true block,
otherwise R treats the if statement as complete.
(ii) In all the previous syntaxes the test condition must result in one single
value, either TRUE or FALSE or a numeric value, and not a vector.
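Because the test conditions of an if-else ladder are tested in sequence, a condition is reached only when all the earlier conditions are FALSE. Hence the upper bounds such as marks<90 can be dropped without changing the result. The following is a sketch of this idea; the function name grade_of and the test marks are our own assumptions.

```r
# Hypothetical helper: if-else ladder without the redundant upper bounds
grade_of <- function(marks) {
  if (marks >= 90) {
    "A"
  } else if (marks >= 80) {   # reached only when marks < 90
    "B"
  } else if (marks >= 70) {   # reached only when marks < 80
    "C"
  } else if (marks >= 60) {   # reached only when marks < 70
    "D"
  } else {
    "F"
  }
}

g1 <- grade_of(95)
g2 <- grade_of(80)
g3 <- grade_of(69)
g4 <- grade_of(12)
cat("Grades:", g1, g2, g3, g4, "\n")
```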
LS

7.2.4 The ifelse() Function
Consider a situation in which the length of the result of a test condition is more
than one, which means, it returns a vector instead of a single value. For those
cases there is a vectorized version of the if-else statement, i.e., ifelse()
function available in the base package. The main arguments of interest of the
ifelse() function are as follows:
#The ifelse() function
ifelse(test, yes, no)

This function returns a vector of the same length as test, where test
represents the test condition. The yes and no are function arguments
representing the values or expressions to be returned according to the test
condition. For each element of test that is TRUE, the corresponding element of
yes is returned, otherwise the corresponding element of no is returned.
For the illustration purpose, we consider a vector x with elements 1, 2, 3, 4, 5,
-1, -2 and -3. The problem is to compute the square root of the elements of x.
The test condition x>0 will be tested for each element of x. If the test
condition is TRUE (the value of the element of x is greater than 0), then that
element of x will be returned or otherwise NA will be returned. Then the square
root of the resultant vector is computed.
#Assigning x
> x <- c(1:5,-1:-3)

#Computing square root
> sqrt(ifelse(x>0, x, NA))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 NA NA NA
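The ifelse() function is also convenient when the tax rule of Sec. 7.2.2 has to be applied to several items at once. The following sketch assumes two hypothetical vectors, item and price, with values invented only for illustration.

```r
# Hypothetical data: type and price of four items
item  <- c("Essential", "Luxury", "Luxury", "Essential")
price <- c(100, 2000, 500, 40)

# Element-wise decision: 5% tax for essential items, 18% otherwise
tax <- ifelse(item == "Essential", 0.05 * price, 0.18 * price)
print(tax)
```

An ordinary if-else statement could not be used here directly, because the test condition item == "Essential" is a vector of length four.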

SAQ 1
(i) Write a concise if statement to evaluate the following:
if (x == 999) {
z <- 2*x+sqrt(x)+x/2
cat("The computed value of z is ", z, "\n")
}
(ii) Write an if-else statement to compute the value of y, where
4x3 and increase x by
y 1,, if x 4
y 2
4(x 1) and decrease x by 1, if x 4.

7.3 LOOPS
In R programming the looping facility is provided by three types of loops,
namely, for, while and repeat. The for and while are entry-controlled
loops and repeat is an exit-controlled loop. Additionally, there are two flow
control statements, next and break, which facilitate additional control over
the evaluation of these loops.
Note: Loop functions belonging to the apply family provide implicit looping
and will be discussed in Unit 8 of MST-015 course. Note that, in R sometimes
looping may not be necessary, as many arithmetic functions are vectorized.
But looping may be necessary, for those functions which are not vectorized.
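The remark on vectorization can be seen in a small comparison. The sketch below squares the elements of a vector once with an explicit for loop (the for loop itself is discussed in Sec. 7.3.1) and once with the vectorized operator ^. The names x, sq_loop and sq_vec are our own, and seq_along() is a base R function that returns the indices 1, 2, ... of its argument.

```r
x <- c(2, 4, 6, 8)

# Explicit loop: square one element at a time
sq_loop <- numeric(length(x))
for (k in seq_along(x)) {
  sq_loop[k] <- x[k]^2
}

# Vectorized version: the operator ^ works on the whole vector at once
sq_vec <- x^2

print(sq_loop)
print(sq_vec)
```

Both approaches give the same vector, but the vectorized version is shorter.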

7.3.1 The for Loop


We shall first discuss the for loop and its general syntax. The for loop starts
with a line that specifies an assignment name, along with the object (a
vector or a list) we want to step through. The first loop line starts with the
reserved word for and is followed by a block of statements that we want to
repeat. The general syntax of the for loop is as follows:
#The for loop
for(name in object){
Executable statement(s)
}

In this syntax, an object can be either vector or a list. When a for loop runs,
it assigns all the items/elements present in the iterable object to name one-
by-one and executes the body of the for loop for each item.

Note that the name variable used for assignment in the for loop is usually a
new variable in the scope where the for statement is coded. The
value of the name variable can be changed inside the loop, but it will
automatically be set to the next item in the sequence when the control returns
to the first line of the loop again. It should be noted here that the name
variable remains in existence even after the loop is concluded. The last item of
the object will remain as the value of the name variable.
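The behaviour of the name variable can be checked with a short experiment. In the following sketch (the variable names seen and i are our own choice), the value of i is recorded at the start of each cycle and then deliberately changed inside the body.

```r
seen <- c()                # values of i at the start of each cycle

for (i in 1:3) {
  seen <- c(seen, i)       # record the value assigned by the for loop
  i <- i * 10              # change i inside the body
}

print(seen)                # the loop still stepped through 1, 2 and 3
print(i)                   # i still exists after the loop
```

The change made inside the body does not affect the items visited by the loop. Note, however, that here the value left in i after the loop is the one assigned last inside the body (30); it equals the last item of the object only when i is not changed in the body.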
Note: In R programming the for loop is used much less than in compiled
languages. Further, code that takes a ‘whole object’ as an argument
is likely to be faster and clearer in the R programming language. Apply family
functions are great substitutes for the loops.
Now, we present an illustration in which, we shall write a for loop to print
numbers 1, 2, 3, 4, 5 as follows:
#Writing for loop to print the numbers 1 to 5
> for(i in 1:5){
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

On comparing the written loop with the general format of the for loop, we get
that i is representing name and 1:5 is a vector representing object. There is
only one executable statement in the block of the for loop, which is print(i).
During the execution of the for loop the i variable will be assigned each
element of the object 1:5 sequentially, i.e., one-by-one. Then the body of the
loop will be evaluated for each element of the object.
Note that, the variable name i will remain in existence even after the complete
execution of the for loop and its value is the last used value (last element of
object) in the loop. It can be verified by printing i outside the loop (after the
loop ends) as follows:
#Checking the value of i after completing the for loop
> print(i)
[1] 5
Next, we present another illustration in which a string will be printed using a
for loop.
#Concatenating strings together using for loop
> for(x in c("Hello", "R", "Programming")){
+ cat(x, "\t")
+ if(x == "Programming") cat("\n")
+ }
Hello R Programming #Output

On comparing this loop with the general syntax, we get x as name and the
character vector c("Hello","R","Programming") as object. So, x will
be assigned the elements/items from the character vector one-by-one in
sequence. Then for each element printing will be performed using cat()
function (to concatenate the output). Also, the elements are separated due to
the tab character ‘ \t ’. Moreover, an if statement is nested in the for loop
to pass the control to the new line ‘ \n ’ at the encounter of the last element of
the object, i.e., "Programming". The step-by-step execution of the for loop
can be understood from the following table:

Step No. | Execution | Changes in output according to the steps involved in the loop
1 | Firstly, x is assigned the first element of the character vector, i.e., "Hello". | -
2 | Then the cat() function is executed to print "Hello" and a horizontal tab space is appended due to the tab character ‘ \t ’. | Hello
3 | Next, the if statement is tested. As x is not the same as "Programming", the test condition, i.e., x == "Programming" results FALSE. Accordingly, the cat("\n") statement is skipped. | -
4 | After executing the entire for loop for the first item of the character vector, the control is again passed to the first line of the for loop and x is assigned the second element of the character vector, i.e., "R". | -
5 | Next, the cat() function is executed for the second time to print "R" in continuation (due to the tab character) to "Hello", and a horizontal tab space is again appended at the end of "R" due to the tab character ‘ \t ’. | Hello R
6 | Then the if statement is tested for the second item. As x is not the same as "Programming", the test condition, i.e., x == "Programming" results FALSE. Accordingly, the cat("\n") statement is once more skipped. | -
7 | After executing the entire for loop for the first two items of the character vector, the control is again passed to the first line of the for loop and x is assigned the third (last) element of the character vector, i.e., "Programming". | -
8 | The cat() function is executed to print "Programming" in continuation to "Hello R " and a horizontal tab space is appended. | Hello R Programming
9 | Lastly, the if statement is tested again. As x is the same as "Programming", the test condition, i.e., x == "Programming" results TRUE. Due to this, the cat("\n") statement is executed, the control passes to the next line (or new line) and the for loop ends. | -

For more clarification, we next create another for loop to compute the sum of
the squares of all the elements of a vector consisting of the elements 2, 4, 6
and 8.
#Compute 2^2 + 4^2 + 6^2 + 8^2 using for loop
> sum <- 0
> for(i in c(2,4,6,8)){
+ sum <- sum + i^2
+ }
> cat("2^2+4^2+6^2+8^2 =", sum, "\n")
2^2+4^2+6^2+8^2 = 120 #Output

The sum variable with initial value 0 before the for loop is used to save the
sum of the square of the elements of the vector. In addition to this, the step-
by-step execution of the for loop can be understood from the following table:

Step No. | Execution | Changes in output according to the steps involved in loop
1 | Firstly, sum is assigned as 0. Then the evaluation of the for loop starts and the i variable is assigned as 2. | -
2 | Next, the statement sum <- sum+i^2 is evaluated with the initial value of sum as 0 and i as 2. Due to this statement sum is overwritten by the value 4 (computed from 0+2^2). | sum = 4
3 | As the loop consists of only a single statement, the control of the loop is again passed to the first line of the for loop and i is assigned as 4. | -
4 | After that, the statement sum <- sum+i^2 is evaluated again with the updated value of sum as 4 and i as 4. Due to this statement sum is overwritten by the value 20 (computed from 4+4^2). | sum = 20
5 | The control of the loop is again passed to the first line of the for loop and i is assigned the value 6. | -
6 | Thereafter, the statement sum <- sum+i^2 is evaluated for the 3rd time, with the updated value of sum as 20 and i as 6. Due to this statement sum is overwritten by the value 56 (computed from 20+6^2). | sum = 56
7 | Next, i is assigned the last element of the numeric vector, i.e., 8. | -
8 | Then the statement sum <- sum+i^2 is evaluated with the updated value of sum as 56 and i as 8. Due to this statement sum is overwritten by the value 120 (computed from 56+8^2) and the for loop ends. | sum = 120
9 | After the for loop ends, we run the printing statement and the final computed value of sum is printed due to the cat() function. | 2^2+4^2+6^2+8^2 = 120
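The result of the loop can be cross-checked with the vectorized operator ^ and the built-in sum() function. The following is a sketch only: the accumulator is named total here (our own choice) so that it is not confused with the built-in sum() function.

```r
x <- c(2, 4, 6, 8)

# Accumulate the squares with a for loop
total <- 0
for (i in x) {
  total <- total + i^2
}

# Vectorized cross-check: square all the elements, then add them
total_vec <- sum(x^2)

cat("loop =", total, " vectorized =", total_vec, "\n")
```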


7.3.2 The Nested for Loop


A single for loop is sufficient if we want to put control on only one subscript
or index of an object, which may be an array or a data frame. But, the nested
for loop is generally required when we want to put control on two subscripts or
indices of an object simultaneously.
Consider the following syntax of nested for loop.
#The nested for loop

for(name1 in object1){
-
for(name2 in object2){
Executable statement(s)
}
-
}

When a nested for loop runs, it assigns the first element present in the
iterable object1 to name1, then control will pass to the nested for loop. The
nested for loop runs and assigns all the elements present in the iterable
object2 to name2 one-by-one and executes the nested for loop for each
item. Then the control again passes to the outer for loop, and it assigns the
second element present in the iterable object1 to name1, then control will
once more pass to the nested for loop. Thereafter, for the second time the
nested for loop runs and assigns all the elements present in the iterable
object2 to name2 one-by-one and executes the nested for loop for each
item again. This process will continue until all the elements of the object1
are assigned to name1.
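The order in which the two loops run can be seen by filling a small matrix. The following sketch builds a 3 x 4 multiplication table; the matrix name tab and its size are assumptions made only for this illustration.

```r
# Empty 3 x 4 matrix to be filled entry by entry
tab <- matrix(0, nrow = 3, ncol = 4)

for (i in 1:3) {          # outer loop controls the row subscript
  for (j in 1:4) {        # inner loop controls the column subscript
    tab[i, j] <- i * j    # executed 3 x 4 = 12 times in total
  }
}

print(tab)
```

For each value of i, the inner loop runs completely over j = 1, 2, 3, 4 before i moves to its next value.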
For comparison purpose, we now again discuss the for loop. Also, recall that
the mean() function and USArrests data frame were discussed in the Unit 6
of MST-015 course. We know that the USArrests data frame consists of four
columns, which can be verified using the length() function as follows:
#Computing the number of columns of the USArrests data frame
> length(USArrests)
[1] 4

Next, we illustrate the method of computing the mean of each column of the
USArrests data frame using the for loop by controlling only column
subscript (column indices) of a data frame as follows:
#Computing the column means
> for(i in 1:length(USArrests)){
+ cat(names(USArrests)[i],"=",mean(USArrests[ ,i]),"\n")
+ }
Murder = 7.788
Assault = 170.76
UrbanPop = 65.54
Rape = 21.232

Now, we explain its execution. In this for loop the i variable will take values
from 1:4 one-by-one and print the computed mean of the ith column of the
data frame. Also, observe that the printing of the computed column means
(with name of the columns in front) is done using the cat() function, which is
already discussed in Unit 4 of MST-015 course.
The computed results can be verified using the colMeans() function
discussed in the Unit 4 of MST-015 course. The colMeans() function gives
us an alternative way to compute a vector of column means of a data frame in
a single command. Note that, the column means can also be computed using
the apply family functions discussed in the Unit 8 of MST-015 course.
#Alternative approach
> colMeans(USArrests)
Murder Assault UrbanPop Rape
7.788 170.760 65.540 21.232
Thus from the computed outputs we conclude that using both the approaches,
(for loop and colMeans() function) we get the same result. But, note that
here single for loop was used as the control was put on column subscript
only. But if we want to put control on column and row subscripts together, in
that case we use the nested for loop.
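Instead of only printing the column means, the single for loop can also store them in a vector, which makes the comparison with colMeans() direct. This is a sketch under our own assumptions: the names col_means and apply_means are hypothetical, seq_along() is a base R function that is safer than 1:length() when an object may be empty, and sapply() belongs to the apply family discussed in Unit 8.

```r
library(datasets)   # provides USArrests (attached by default in a standard R session)

# Store the column means in a named numeric vector
col_means <- numeric(length(USArrests))
names(col_means) <- names(USArrests)

for (i in seq_along(USArrests)) {
  col_means[i] <- mean(USArrests[[i]])   # mean of the i-th column
}
print(col_means)

# The same computation with implicit looping
apply_means <- sapply(USArrests, mean)
print(apply_means)
```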
For the illustration purpose we now extract the highlighted entries appearing in
the sixth and ninth rows of the Murder and Rape columns of the USArrests
data frame using nested for loop and compute the following sum of the
extracted elements.

$$\sum_{i \in (6,9)} \; \sum_{j \in (1,4)} \binom{i}{j}\, x_{ij},$$ where x is representing USArrests data frame.

Murder Assault UrbanPop Rape


Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8

Since the highlighted entries are appearing in the 6th and 9th rows of the 1st
and 4th columns therefore the required sum can be easily done using nested
for loops as follows:
#Controlling two subscripts of a data frame using nested for
#loop
> sum <- 0
> for(i in c(6,9)){
+ for(j in c(1,4)){
+ sum = sum + choose(i,j)*USArrests[i,j] }
+ }
> cat("sum=", sum, "\n")
sum= 4785.9
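The sum obtained from the nested for loops can be cross-checked without explicit loops. The following is a sketch only: it assumes the base R functions outer() and as.matrix(), which are not discussed in this unit, and the names rows, cols, total, weights and total_check are our own.

```r
library(datasets)   # provides USArrests (attached by default in a standard R session)

rows <- c(6, 9)     # row subscripts
cols <- c(1, 4)     # column subscripts (Murder and Rape)

# Nested for loops, as in the illustration
total <- 0
for (i in rows) {
  for (j in cols) {
    total <- total + choose(i, j) * USArrests[i, j]
  }
}

# Loop-free cross-check: matrix of choose(i, j) times the extracted entries
weights <- outer(rows, cols, choose)
total_check <- sum(weights * as.matrix(USArrests[rows, cols]))

cat("total =", total, " total_check =", total_check, "\n")
```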

7.3.3 The while Loop


The while loop is an entry-controlled loop. It is called an entry-controlled loop
because the test condition mentioned in the first line of the while loop is
tested while entering into this loop. In this loop there is always an iterable
variable involved, that is assigned outside the while loop (i.e., before while
loop starts) and its increment or decrement statement is written inside the
while loop. The general syntax of the while loop is as follows:
#The while loop

Assignment of initial value of the iterable variable


while (TestCond){
Executable Statement(s)
increment/decrement
}

Note that while is a reserved word and the executable statements written
inside the body of the while loop are executed repeatedly as long as the test
condition results TRUE. This process stops when TestCond is FALSE.
For the illustration purpose, we now write a while loop to print the value
assigned to a variable i and then decrement its value by 2 each time the body
of the loop is executed as follows:
#Initialization of the iterable variable
> i <- 7

#Printing and decrementing the value of i by 2


> while(i>2){
+ print(i); i <- i-2
+ }
[1] 7
[1] 5
[1] 3

Note: Here, the body of the while loop will be executed 3 times.
The step-by-step execution of the illustrated while loop can be understood
from the following table:
Step No. | Execution | Changes in output according to the steps involved in loop
1 | Firstly, the variable i is assigned as 7. | -
2 | Recall that the while loop is an entry-controlled loop. After the assignment of i, the first line of the while loop is executed and the test condition i>2 is tested at the entry. As the assigned value of i is greater than 2, the test condition is satisfied and results TRUE. | -
3 | Then the statements written inside the body of the while loop are executed in sequence one-by-one. So, firstly 7 is printed. Then due to the i <- i-2 statement, i is decremented by 2 and overwritten with the decremented value, so i becomes 5. | 7
4 | For the updated value of i as 5, the entry test condition is tested for the second time. As the updated value of i is greater than 2, the test condition is satisfied and results TRUE. | -
5 | Next, the print statement is executed again and 5 is printed. Thereafter, i is decremented by 2 again and now becomes 3. | 5
6 | For i = 3 the entry test condition is tested for the third time. As the updated value of i is 3, which is greater than 2, the test condition is satisfied and results TRUE. | -
7 | The print statement is executed and 3 is printed. Further, i is decremented by 2, and becomes 1. | 3
8 | For i as 1 the entry test condition is tested for the fourth time. As the updated value of i is 1, which is not greater than 2, the test condition results FALSE and the while loop will be exited. | -
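Since the test condition of the while loop is tested at the entry, the body may not be executed even once. The following sketch illustrates this with hypothetical variables (n_iter, j and ran are our own names): the first loop repeats the illustration above and counts the cycles, while the second loop starts with a value for which the test condition is FALSE.

```r
# First loop: the same as the illustration, with a cycle counter
i <- 7
n_iter <- 0
while (i > 2) {
  n_iter <- n_iter + 1   # count the executions of the body
  i <- i - 2
}
cat("body executed", n_iter, "times; final i =", i, "\n")

# Second loop: the test condition is FALSE at the very first entry
j <- 1
ran <- FALSE
while (j > 2) {
  ran <- TRUE            # never reached
  j <- j - 2
}
cat("second body executed:", ran, "\n")
```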

7.3.4 The repeat Loop
The repeat loop is the third type of loop available in R. It causes the
repeated evaluation of the body of the repeat loop until a break is explicitly
requested. This loop should always be carefully handled as there are quite
high chances of creation of an infinite loop due to the repeated execution of
the loop body. Generally, the body of the repeat loop consists of more than
two executable statements. The general syntax of the repeat loop is as
follows:
#The repeat loop

initialization of the iterable variable


repeat {
Executable Statement(s)
increment/decrement
break condition
}

In this general syntax repeat is a reserved word. If the break condition is
placed at the beginning of the repeat loop, then it works similar to the while
loop. Generally, this loop is used to perform an exit at the middle or end of the
loop. The break condition is generally given using an if statement. For the
illustration purpose consider the following repeat loop in which firstly the
iterable variable x is assigned as 10 and it is halved inside the repeat
loop. A break will be encountered when x becomes less than 1.
#Initialization of the variable
> x <- 10
#Decreasing the value of x using repeat loop
> repeat{
+ print(x)
+ x <- x/2
+ if (x<1) break
+ }
[1] 10
[1] 5
[1] 2.5
[1] 1.25

Note that, the control-flow constructs break and next will be discussed in the
next section of this unit. For now, you can understand that in the repeat loop
the break statement is used to exit from the repeat loop.

Step No. | Execution | Changes in output according to the steps involved in loop
1 | Firstly, x is assigned as 10. | -
2 | Then the first line of the repeat loop is executed. As there is no test condition at the entry of the loop, the executable statements written inside the body of the repeat loop are executed for the first time. Since the first statement is a print statement, the value of x is printed as 10. | 10
3 | Then due to the decrement statement x <- x/2, x is overwritten by x/2 and becomes 5. | -
4 | Next, the test condition of the if statement, i.e., x<1 is tested for the updated value of x as 5 and it results FALSE. Thus, the break statement is not executed and control will again pass to the repeat loop. | -
5 | The body of the repeat loop is executed for the second time, the value of x is printed as 5, and x is overwritten by x/2 and becomes 2.5. | 5
6 | Next, the test condition of the if statement is tested for the updated value of x as 2.5 and results FALSE. So, the break statement is not executed this time as well and control will again pass to the repeat loop. | -
7 | Thereafter, the body of the repeat loop is executed for the third time and the value of x is printed as 2.5. Also, x is overwritten by x/2 and becomes 1.25. | 2.5
8 | The test condition of the if statement is tested for the latest value of x as 1.25 and results FALSE. So, the break statement is again not executed and control will again pass to the repeat loop. | -
9 | Finally, the body of the repeat loop is executed for the fourth time and the value of x is printed as 1.25. Also, x is overwritten by x/2 and becomes 0.625. | 1.25
10 | Lastly, the test condition of the if statement is tested for x as 0.625 and it results TRUE as 0.625 is less than 1. So, break is executed and the repeat loop is exited. | -
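Because the break condition of this repeat loop is placed at the end of the body, the body is executed at least once, even when the condition already holds for the initial value. The following sketch contrasts this with a while loop; the starting value 0.5 and the names printed, y and printed_while are our own assumptions.

```r
# repeat loop with the break condition at the end of the body
x <- 0.5
printed <- c()
repeat {
  printed <- c(printed, x)   # body runs before any condition is tested
  x <- x / 2
  if (x < 1) break
}
print(printed)               # one value is recorded although 0.5 < 1

# while loop with the corresponding entry condition
y <- 0.5
printed_while <- c()
while (y >= 1) {
  printed_while <- c(printed_while, y)   # never reached for y = 0.5
  y <- y / 2
}
print(length(printed_while))
```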

7.3.5 Nested Control-Flow Constructs


The control-flow constructs, such as for, while, repeat and if statements
can be nested within each other. The following are the points which should be
kept in mind while writing nested structures and loops:
1. The inner and outer loops need not necessarily be generated by the same
type of control-flow constructs.
2. It is essential that one loop or if statement should be completely
embedded within the other, i.e., there should not be any overlap.
3. Each loop should be controlled by a different index or variable.
The nested structures may be as complex as necessary. A loop can be nested
within a loop or an if-else statement and vice-versa.
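As a sketch of the first point, a while loop can be nested inside a for loop. In the following hypothetical example (the names start, x, k and halvings are our own), the outer for loop steps through two starting values and the inner while loop counts how many times each value has to be halved before it becomes less than 1.

```r
halvings <- c()

for (start in c(10, 3)) {      # outer loop: for
  x <- start
  k <- 0
  while (x >= 1) {             # inner loop: while, controlled by x
    x <- x / 2
    k <- k + 1
  }
  halvings <- c(halvings, k)   # store the count for this starting value
}

print(halvings)
```

Note that the two loops are controlled by different variables (start for the outer loop and x for the inner loop) and that the while loop lies completely inside the body of the for loop.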
Consider the following example in which an if statement is nested inside the
nested for loops to perform conditional printing, which happens if i is equal
to 2 and the value of j is less than 4 at the same time as follows:
#An if statement inside the nested for loops
> for (i in 1:3){
+ for(j in 1:5){
+ if(i==2 && j<4) cat("(i,j)=","(",i,",",j,")","\n")
+ }
+ }
(i,j)= ( 2 , 1 )
(i,j)= ( 2 , 2 )
(i,j)= ( 2 , 3 )

Note that, in the given loop the i variable can take values 1, 2 and 3 and the j
variable can take values 1, 2, 3, 4, 5. The illustrated loop can be understood
from the following table:

Step No. / Execution according to the steps involved in loop / Changes in output

Step 1: The first line of the outer for loop is executed first and i is assigned the value 1.

Step 2: Then control is passed to the nested for loop and j is assigned the values from 1 to 5, one-by-one. For each pair of values (1,1), (1,2), (1,3), (1,4) and (1,5) the test condition of the if statement is tested and results as follows:
i j TRUE/FALSE
1 1 FALSE
1 2 FALSE
1 3 FALSE
1 4 FALSE
1 5 FALSE

Step 3: Since the test condition (i==2 && j<4) is not satisfied for any pair of values, the printing is skipped for these pairs of values. The control then passes to the outer for loop again.

Step 4: Next, i is assigned the value 2.

Step 5: Then again, the nested for loop is executed and j is assigned the values from 1 to 5, one-by-one, for the second time. For each pair of values (2,1), (2,2), (2,3), (2,4) and (2,5), the test condition of the if statement is tested and results as follows:
i j TRUE/FALSE
2 1 TRUE
2 2 TRUE
2 3 TRUE
2 4 FALSE
2 5 FALSE

Step 6: Since the test condition (i==2 && j<4) results in TRUE for the first three pairs of values and FALSE for the last two pairs, the printing is performed only for the pairs of values for which the test condition is TRUE.
Changes in output:
(i,j)= ( 2 , 1 )
(i,j)= ( 2 , 2 )
(i,j)= ( 2 , 3 )

Step 7: Next, i is assigned the value 3.

Step 8: The control is passed to the nested for loop and j is again assigned the values from 1 to 5. Thereafter, the test condition of the if statement is tested for each pair of values (3,1), (3,2), (3,3), (3,4) and (3,5) for the third time and gives the following results:
i j TRUE/FALSE
3 1 FALSE
3 2 FALSE
3 3 FALSE
3 4 FALSE
3 5 FALSE

Step 9: Since the test condition (i==2 && j<4) results in FALSE for all pairs of values, printing is not performed for these pairs of values and the loops are exited (as the loop has been executed for all values of i).
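The trace given in the table can also be reproduced in R by recording the value of the test condition for every pair (i, j). The following is only a minimal sketch: the object names i_vals, j_vals, test and steps are our own choices and are not part of the original example.

```r
# Record the value of the test condition (i==2 && j<4) for every pair (i, j)
i_vals <- integer(0)
j_vals <- integer(0)
test <- logical(0)
for (i in 1:3) {
  for (j in 1:5) {
    i_vals <- c(i_vals, i)
    j_vals <- c(j_vals, j)
    test <- c(test, i == 2 && j < 4)
  }
}
steps <- data.frame(i = i_vals, j = j_vals, test = test)
print(steps)                       # all 15 pairs with TRUE/FALSE
steps[steps$test, c("i", "j")]     # only the pairs for which printing occurs
```

The rows of steps for which test is TRUE are exactly the pairs printed by the nested loop.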

SAQ 2
Write a loop to compute the product of the squares of the following terms:
$\prod_{x} x^2$, x = 2, 4, 6 and 8. Also, write the step-by-step execution of the loop.


7.4 THE next AND break STATEMENTS


The importance of the break statement in a repeat loop has already been discussed and illustrated. Note that the break statement is the only statement with which a repeat loop can be exited; without it, the loop becomes infinite. Both break and next are reserved words. The break statement can also be used to terminate a for or while loop before its normal completion. In nested loops, the break and next statements act only on the innermost loop in which they appear.
The next statement is used inside a loop to discontinue the current cycle of the loop and skip to the next iteration of the loop.
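The difference between the two statements can be seen in a small sketch. The vector name visited and the chosen values 2 and 5 are assumptions made only for this illustration.

```r
# next skips one cycle of the loop; break leaves the loop altogether
visited <- integer(0)
for (i in 1:6) {
  if (i == 2) next     # skip the remaining statements for i = 2 only
  if (i == 5) break    # terminate the loop as soon as i becomes 5
  visited <- c(visited, i)
}
print(visited)
```

Here the value 2 is skipped because of next, and the values 5 and 6 are never reached because of break, so visited holds 1, 3 and 4.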
Let us consider an example in which the next statement is used inside a for loop to replace the entries of the first, third and fourth columns of a data frame consisting of the first six rows of the USArrests data frame. The updated entries are computed by dividing each entry by its respective column mean. For the purpose of illustration, we skip the computation for the second column of the data frame data using the next statement:
> data <- head(USArrests)
> for( i in 1:length(data)){
+ if(i==2) next else data[ ,i] <- data[,i]/mean(data[ ,i])
+ }
> print(round(data,4))
Murder Assault UrbanPop Rape
Alabama 1.3895 236 0.8593 0.6506
Alaska 1.0526 263 0.7111 1.3657
Arizona 0.8526 294 1.1852 0.9514
Arkansas 0.9263 190 0.7407 0.5985
California 0.9474 276 1.3481 1.2460
Colorado 0.8316 204 1.1556 1.1877

SAQ 3
State whether the following statements are TRUE or FALSE:
(i) for and if are the simple words in R.
(ii) break is an entry-controlled loop.
(iii) Nesting can be done in both the TRUE and FALSE blocks of the if-
statement.

7.5 SUMMARY
The main points discussed in this unit are as follows:
Firstly, in this unit we have discussed various types of if statements,
such as an if statement without else part, an if-else statement,
nested if statements, multiple if statement. Focus is given on the
clarity of the concept that ‘how an if statement is selected on the basis
of given problem’.
Different types of loops such as for, while, repeat and their nesting
are discussed in this unit. The difference between them and their usage
are discussed using the general syntax and suitable examples.
Lastly, the use of the break and next statements to impose additional control on loops is explained.

7.6 TERMINAL QUESTIONS


1. Create a test condition which can be used to check whether the value of the following expression is equal to 111 or not:
$\sqrt{x} + \frac{x}{y} + xy - e^{2x}$
2. Write the following if-else statement concisely:
if (x <= 4){
y <- 4 * x^3
} else {
y <- 4 * (x-1)^2
}
3. Write the output of the following code:
Gender <- "Female"; pay <- 10000
if (Gender == "Male") {
tax = 0.30 * pay
} else {
tax = 0.10 * pay }
print(tax)
4. Write the name of the different types of if statements available in R.
Also, write the general syntax of each one of them and explain how they
are executed.
5. Write R code to convert the following while loop to for loop.
x <- 0
while(x<9){
print(x); x <- x+3}
6. Write the names of the different types of loops available in R. Also, write the general syntax of each one of them and explain how they are executed.
7. Write the output of the following code:
for(x in list(c(1,2,3), c(4,5,6,7,8))){
print(length(x))}
8. Write step-by-step execution of the following while loop.
x <- 0
while(x<9){
print(x); x <- x+3
}
9. Write step-by-step execution of the following repeat loop.
data <- head(USArrests)
i=1
repeat{
data[ ,i] <- data[,i]/mean(data[,i])
if(i==4) break
i <- i+1
}

10. Find the odd one:
if, else, repeat, object, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN and NA.

7.7 Solutions/Answers
Self-Assessment Questions (SAQs)
1. (i) The given code can be written concisely as follows:
if (x == 999) cat("The computed value of the z is ",
2*x+sqrt(x)+x/2, "\n")
(ii) The value of y for a given value of x using the given conditions can
be computed using the if-else statement. We first need to assign the
value of x, then y can be computed using the following:
if (x <=4){
y <- 4 * x^3
x <- x+1
} else {
y <- 4 * (x-1)^2
x <- x-1
}
The computed result can be printed using the cat() function in the
following manner:
cat("x =", x, "\t", "y=", y, "\n")
2. The product of the squares of the vector elements 2, 4, 6 and 8 can be
computed using any one of the three loops. So, we consider for loop to
evaluate it. The product using the for loop can be computed using the
following code:
prod <- 1
for(k in c(2,4,6,8)){
prod <- prod * k^2
}
cat("Product =", prod, "\n")
The step-wise execution can be understood from the following table:
Step No. / Execution according to the steps involved in loop / Changes in the output

Step 1: Firstly, prod is assigned the value 1 and k is assigned the value 2.

Step 2: Then, the statement prod <- prod * k^2 is executed with the initial value of prod as 1 and k as 2. Due to this statement prod is overwritten by the value 4 (computed from 1 * 2^2).
Changes in the output: prod = 4

Step 3: As the loop consists of a single statement only, the control of the loop is again passed to the first line of the for loop and k is assigned the value 4.

Step 4: Then, the statement prod <- prod * k^2 is executed with the updated value of prod as 4 and k as 4. Due to this statement prod is overwritten by the value 64 (computed from 4 * 4^2).
Changes in the output: prod = 64

Step 5: The control of the loop is again passed to the first line of the for loop and k is assigned the value 6.

Step 6: Thereafter, the statement prod <- prod * k^2 is executed with the updated value of prod as 64 and k as 6. Due to this statement prod is overwritten by the value 2304 (computed from 64 * 6^2).
Changes in the output: prod = 2304

Step 7: Lastly, k is assigned the value 8.

Step 8: The statement prod <- prod * k^2 is executed for the last time with the updated value of prod as 2304 and k as 8, and prod is overwritten by the value 147456 (computed from 2304 * 8^2).
Changes in the output: prod = 147456

Step 9: The loop ends and the final output is printed due to cat("Product =", prod, "\n").
Changes in the output: Product = 147456

3. (i) FALSE (ii) FALSE (iii) TRUE

Terminal Questions (TQs)


1. The required test condition can be framed as follows:
sqrt(x)+x/y+x*y-exp(2*x) == 111
2. The given if-else statement can be written concisely as follows:
if (x <= 4) y <- 4 * x^3 else y <- 4 * (x-1)^2
3. The test condition of the given if statement, Gender == "Male", is FALSE for the given value of Gender ("Female"). Therefore, the statement written in the false block will be evaluated (tax = 0.10 * 10000) and the output will be printed as 1000.
4. Different types of if-statements are as follows:
if statement, if-else statement, nested if statement and if-else ladder (or multiple if statement). The details on each one of them are given in Section 7.2 of this unit.
5. The converted loop is as follows:
for(x in seq(0,6,3)){
print(x) }
6. Different types of loops are, for loop, while loop and repeat loop.
The general syntax and details on the execution of each one of them is
given in the material.
7. In the given for loop the object is a list, so the list components are assigned to x one-by-one, sequentially. Since the first component has three elements and the second component has five elements, the output will be as follows:
[1] 3
[1] 5
8. Refer Section 7.3.3 and write the step-by-step execution of the given
while loop accordingly.
9. Refer Section 7.3.4 and write the step-by-step execution of the given
repeat loop accordingly.
10. object is not a reserved word and the remaining are reserved words.
UNIT 8

APPLY FAMILY IN R

Structure
8.1 Introduction
Expected Learning Outcomes
8.2 The lapply() Function
8.3 The sapply() Function
8.4 The apply() Function
8.5 The tapply() Function
8.6 The mapply() Function
8.7 Summary
8.8 Terminal Questions
8.9 Solutions/Answers

8.1 INTRODUCTION
The apply family of functions in R is very powerful because it allows us to conduct a series of operations on data in a condensed form. These loop functions can sometimes be used as an alternative to the control-flow constructs for, while and repeat, discussed in Unit 7 of MST-015 (Introduction to R Software). A major advantage of these functions is that we do not have to write multiple R statements to do a particular task; a call to one of these functions is generally a single line of R code. These loop functions save the time of the user and offer a concise way to code.
The apply family functions come as a part of the base package. Note that whenever a predefined function is supplied to the FUN argument of an apply family function, only the name of the function is used. If the function requires other arguments, then they are supplied as additional arguments to the apply family function.
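As a small sketch of the last point, consider the sort() function, whose decreasing argument controls the direction of sorting. The list and the object name res below are assumptions made only for this illustration; the lapply() function itself is discussed in Section 8.2.

```r
# Only the name of the function (sort) is given to FUN; its extra
# argument (decreasing) is supplied as an additional argument of lapply()
res <- lapply(X = list(c(3, 1, 2), c(7, 9)), FUN = sort, decreasing = TRUE)
print(res)
```

Each component of the list is sorted in decreasing order, because decreasing = TRUE is passed on to sort() for every component.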
In this unit we shall discuss the lapply(), sapply(), apply(), tapply()
and mapply() functions of apply family.
Expected Learning Outcomes
After studying this unit, you should be able to:
learn and use lapply() function;
learn and use sapply() function;
learn and use apply() function;
learn and use tapply() function; and
learn and use mapply() function.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

8.2 The lapply() Function


The lapply() loop function comes as a part of the base package. This
function is mainly used on a list object. If the object (on which this function is to
be used) is not a list, then firstly it will be coerced to a list by the function itself
and then only it is used. Also, note that the output obtained from the
lapply() function will be in the form of a list.
The main two arguments of interest of lapply() function are, X and FUN.
The X argument is used to assign a list and the FUN argument is used to
assign a function which is to be applied on each component of the list X.
#The lapply() function
lapply (X, #vector or a list object
FUN, #function to be applied on each element of X
...) #Other arguments of FUN and lapply functions

Note that the lapply() function loops over a list by applying the function assigned to the FUN argument (which may be a package function or a user-defined function) to each component of the list, and finally returns a list.
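To see what "looping over a list" means, the following minimal sketch writes the same computation once with an explicit for loop and once with lapply(). The list Lst and the object name out are hypothetical and are used only for this comparison.

```r
# A small list with a vector component and a second vector component
Lst <- list(x = 1:5, y = c(10, -2, 5, 1))

# Explicit for loop: apply min() to each component and store the result
out <- vector("list", length(Lst))
names(out) <- names(Lst)
for (k in seq_along(Lst)) {
  out[[k]] <- min(Lst[[k]])
}

# The same task in a single line using lapply()
identical(out, lapply(X = Lst, FUN = min))
```

The comparison returns TRUE, i.e., the single lapply() call produces the same list as the four-line loop.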
For the illustration purpose, we now create a list object named Lst1, with two components. Its first component x is a vector object and the second component y is a matrix object. It is defined as follows:
#Creating a list
> Lst1 <- list(x=1:25, y=matrix(c(10,-2,5,1), nrow=2, ncol=2)); Lst1
$x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25

$y
     [,1] [,2]
[1,]   10    5
[2,]   -2    1

After creating a list, we next use the min() function to compute the minimum
of each component of Lst1. To do so, we assign the X argument of the
lapply() function as the list Lst1 and the FUN argument of the function as
the min() function in the following manner:
#Computing minimum of each component of a list
> lapply(X=Lst1, FUN=min)
$x
[1] 1

$y
[1] -2

Note: The min() function is assigned to the FUN argument without parentheses. So, only the name of the function is used to assign a function to the FUN argument.

Observe that, when the min() function is applied on the first component of
the list, i.e., x, it returns the minimum of the vector elements, which is 1 and
when it is applied on second component of the list, i.e., y, it returns the
minimum of the elements of the matrix y, which is -2.
Next, we assign other functions such as sum() and as.logical() to the
FUN argument and observe the obtained outputs as follows:
#Computing sum of the elements of each component of a list
> lapply(X=Lst1, FUN=sum)
$x
[1] 325

$y
[1] 14

You can easily verify that the sum of the elements of the vector x is 325 and
the sum of the elements of the matrix y is 14.
#Coercing each component of a list to logical object
> lapply(X=Lst1, FUN=as.logical)
$x
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[23] TRUE TRUE TRUE

$y
[1] TRUE TRUE TRUE TRUE

Note that, when the coercion function as.logical() is assigned to the FUN argument of the lapply() function, each component of the list Lst1 is coerced to a logical vector. All the elements are TRUE here because every non-zero number is coerced to TRUE (only zero is coerced to FALSE). Also note that the matrix y loses its dimensions and is returned as a vector.
We next present another illustration, in which we assign a user-defined function to the FUN argument of the lapply() function. To do so, let us first create a list named Lst2 with two data frame components as follows:
#Creating a list with two data frame components
> Lst2 <- list(data.frame(x1=c(1,-1,4), x2=c(2,0,5)),
data.frame(y1=c(-2,1,3), y2=c(4,9,6)));Lst2
[[1]]
x1 x2
1 1 2
2 -1 0
3 4 5

[[2]]
y1 y2
1 -2 4
2 1 9
3 3 6

Next, we create a function named Ext to extract 2nd and 3rd rows of a data
frame as follows:
#Creating a function to extract 2nd and 3rd rows of a data frame
> Ext <- function(x) x[c(2,3),] #Function definition

Note: Refer Unit 6 of MST-015 to see detailed discussion on functions.


After creating the function Ext, we assign it to the FUN argument of the
lapply() function, to extract the 2nd and 3rd rows of each data frame
component of Lst2 as follows:
#Extracting 2nd and 3rd rows of the components of Lst2
> lapply(X=Lst2, FUN=Ext)
[[1]]
x1 x2
2 -1 0
3 4 5

[[2]]
y1 y2
2 1 9
3 3 6

An alternative way of doing the above task without separately defining a user
defined function is as follows:
#Using user-defined function anonymously
lapply(X=Lst2, FUN=function(x) x[c(2,3),])

Note: By anonymously we mean that we do not name the function.
We get the same output if we define the function directly in the FUN argument of the lapply() function call. However, if the function is required in another section of the programme, then it must be defined outside the lapply() function by giving some suitable name to it.
The components of the obtained list object (output) can be accessed on the same lines as discussed in Unit 3 of the MST-015 course. For the illustration purpose, we now extract the first component of the output as follows:
#Extracting the first component of the output
> lapply(X=Lst2, FUN=function(x) x[c(2,3),])[[1]]
x1 x2
2 -1 0
3 4 5

Similarly, the other components or elements can be extracted. For more details refer to Unit 3 of MST-015.
In the next illustration, we assign a vector to the X argument and a function to the FUN argument of the lapply() function. The assigned function computes the square root of a number if it is greater than or equal to zero; otherwise, for a negative number, it returns the statement ‘Square root cannot be computed’:

#Computing the square root of the vector elements


> lapply(X=-1:2, FUN=function(x) if(x>=0) sqrt(x) else "Square root cannot be computed")
[[1]]
[1] "Square root cannot be computed"

[[2]]
[1] 0

[[3]]
[1] 1

[[4]]
[1] 1.414214

For the next illustration, we consider the built-in painters data available in
the MASS package. Let us first take help on the data as follows:
#Loading the package
> library(MASS)
#Seeking help
> ?painters
starting httpd help server ... done

The R Documentation page will pop up when we seek help on the painters data.

Note: The information on the painters data and about its columns can be
read from the R Documentation page.
Next, we shall display some of the rows of the painters data frame but
before that we now see the internal structure of the data frame as follows:
#Internal structure of the painters data frame
> str(painters)
'data.frame': 54 obs. of 5 variables:
$ Composition: int 10 15 8 12 0 15 8 15 4 17 ...
$ Drawing : int 8 16 13 16 15 16 17 16 12 18 ...
$ Colour : int 16 4 16 9 8 4 4 7 10 12 ...
$ Expression : int 3 14 7 8 0 14 8 6 4 18 ...
$ School : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

From the obtained output it is clear that its first four variables, namely,
Composition, Drawing, Colour and Expression are of integer type and
the last variable School is of factor type. Next, we display first few rows of the
data frame in the following screenshot:

Now we illustrate the method of computing the 25%, 50% and 75% quantiles of the first four columns of the painters data, using the quantile() function (the quantile() function is used to compute the quantiles of data).
To compute the quantiles of the first four columns of the painters data frame, we assign the X argument of the function as painters[,1:4] and the FUN argument as the quantile() function. Note that the quantile() function has a probs argument, which is used to assign the probabilities. So, to compute the quantiles, we supply this argument as an additional argument to the lapply() function as follows:
#Computing the quantiles of the first four columns of painters
#data frame using lapply() function
> lapply(X=painters[,1:4], FUN=quantile, probs=c(0.25, 0.50,
0.75))
$Composition
25% 50% 75%
8.25 12.50 15.00

$Drawing
25% 50% 75%
10.0 13.5 15.0

$Colour
25% 50% 75%
7.25 10.00 16.00

$Expression
25% 50% 75%
4.0 6.0 11.5

From the obtained output, note that the quantiles of each column of the painters data frame are computed in a single line command.

Using lapply() function together with the split() function:


The split() function is already discussed in the Unit 4 of MST-015 course.
Recall that this function is used to split a vector or data frame according to the
groups defined by its second argument, which is a factor variable. For the
illustration purpose we consider the built-in painters data frame again.
Next, we use the split() function to split the painters data frame
according to the School variable and assign the grouped data to
split.data as follows:
#Splitting the painters data according to School variable
> split.data <- split(painters, painters$School);split.data
$A
Composition Drawing Colour Expression School
Da Udine 10 8 16 3 A
Da Vinci 15 16 4 14 A
Del Piombo 8 13 16 7 A
...

$B
Composition Drawing Colour Expression School
F. Zucarro 10 13 8 8 B
Fr. Salviata 13 15 8 8 B
Parmigiano 10 15 6 6 B
...

$C
Composition Drawing Colour Expression School
Barocci 14 15 6 10 C
Cortona 16 14 12 6 C
Josepin 10 10 6 2 C
...

$D
Composition Drawing Colour Expression School
Bassano 6 8 17 0 D
Bellini 4 6 14 0 D
Giorgione 8 9 18 4 D
...

$E
Composition Drawing Colour Expression School
Albani 14 14 10 6 E
Caravaggio 6 6 16 0 E
Corregio 13 13 15 12 E
...

$F
Composition Drawing Colour Expression School
Durer 8 10 10 8 F
Holbein 9 10 16 13 F
Pourbus 4 15 6 6 F
...

$G
Composition Drawing Colour Expression School
Diepenbeck 11 10 14 6 G
J. Jordaens 10 8 16 6 G
Otho Venius 13 14 10 10 G
...

$H
Composition Drawing Colour Expression School
Bourdon 10 8 8 4 H
Le Brun 16 16 8 16 H
Le Suer 15 15 4 15 H
...

Next, we create a function which will be applied to each group (obtained by splitting the painters data frame). To do so, we define a function named Fun1 to compute the column means of the first four columns of a data frame as follows:
#Defining a function to compute the column means of first 4
#columns
> Fun1 <- function(x) colMeans(x[,1:4])

After defining the function Fun1, we now assign it to the FUN argument of the lapply() function and assign its X argument as split.data to compute the column means according to the factor variable School as follows:
cho
#Computing column means school wise
> lapply(X=split.data, FUN=Fun1)
$A
Composition Drawing Colour Expression
10.4 14.7 9.0 8.2
$B
Composition Drawing Colour Expression
12.166667 14.333333 7.333333 8.166667
$C
Composition Drawing Colour Expression
13.166667 13.500000 7.500000 7.166667
$D
Composition Drawing Colour Expression
9.1 9.9 16.1 3.2
$E
Composition Drawing Colour Expression
13.571429 12.857143 11.857143 8.142857
$F
Composition Drawing Colour Expression
7.25 10.25 9.50 7.75
$G
Composition Drawing Colour Expression
13.85714 10.42857 14.85714 10.00000

$H
Composition Drawing Colour Expression
14.0 14.0 6.5 12.5

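SAQ 1
Write R code to create a list with two matrix components A and B, where
$A = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 4 & 3 \\ 1 & 2 & 2 \\ 8 & 6 & 5 \end{pmatrix}$
Also, compute the transpose of both the matrices in a single line command.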

8.3 The sapply() Function


The sapply() function is a user-friendly version of the lapply() function. It works in the same way as the lapply() function; the only difference is that it simplifies the obtained output. Here, by simplifies we mean that instead of a list, the default output will be either a vector or a matrix object (whenever such a simplification is possible). The main arguments of interest of the sapply() function are as follows:
#The sapply() function
sapply (X, #vector or a list object
FUN, #function to be applied on each element of X
simplify, #It indicates whether to simplify the result
#or not
...) #Other arguments of FUN and sapply functions

Note that by default, the simplify argument is TRUE, which means that we will get the output in simplified form. If we explicitly mention simplify=FALSE, then the obtained output will be the same as that of the lapply() function (a list object); otherwise this function simplifies the output to a more easily handled object such as a vector or a matrix.
Note: The simplify argument of the function is used to indicate whether the output is to be simplified to a vector, matrix or higher dimensional array (if possible). If the output cannot be simplified then a list is returned.
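The three possible forms of the output can be seen on a small example. The list L and the object names v, m and l below are assumptions made only for this sketch.

```r
# A list whose two components have different lengths
L <- list(a = 1:4, b = c(2, 8))

v <- sapply(X = L, FUN = length)  # each result has length 1: a named vector
m <- sapply(X = L, FUN = range)   # each result has length 2: a 2 x 2 matrix
l <- sapply(X = L, FUN = rev)     # results differ in length: a list is returned

is.vector(v); is.matrix(m); is.list(l)
```

So the form of the simplified output depends on the lengths of the individual results: a common length of one gives a vector, a common length greater than one gives a matrix (one column per component), and unequal lengths leave the output as a list.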
To understand the execution of this function, consider the previously quoted
illustrations again. Firstly, for the painters data frame we again compute the
quantiles of the first four columns of the data frame using the sapply()
function and keep the simplify argument as TRUE to get simplified output as
follows:
#Computing the quantiles of the first 4 columns of the painters
#data using sapply() function
> sapply(X=painters[,1:4], FUN=quantile, probs=c(0.25, 0.50,
0.75), simplify=TRUE)
Composition Drawing Colour Expression
25% 8.25 10.0 7.25 4.0
50% 12.50 13.5 10.00 6.0
75% 15.00 15.0 16.00 11.5

You can compare the obtained output with previously obtained output (in
Section 8.2) and observe, why we use the term simplified output.
Also, it can be verified that the assignment simplify=FALSE will give the
same output as obtained by using the lapply() function in Section 8.2.
#An alternative to the lapply() function
> sapply(X=painters[,1:4], FUN=quantile, probs=c(0.25, 0.50,
0.75), simplify=FALSE)

In the next illustration we shall use the sapply() function (to get simplified output) together with the split() function to compute the column means of the first four columns of the painters data frame according to the groups defined by the factor variable School as follows:
#Loading the MASS package
> require(MASS)
#Splitting the painters data frame according to School variable
> split.data <- split(painters, painters$School)
#Creating a function to compute column means
> Fun1 <- function(x) colMeans(x[,1:4])

After splitting the data and creating a function named Fun1, we finally use the sapply() function to get the column means according to the groups defined by the variable School of the data as follows:
#Computing the column means using sapply() function
> sapply(X=split.data, FUN=Fun1, simplify=TRUE)
A B C D E F
Composition 10.4 12.166667 13.166667 9.1 13.571429 7.25
Drawing 14.7 14.333333 13.500000 9.9 12.857143 10.25
Colour 9.0 7.333333 7.500000 16.1 11.857143 9.50
Expression 8.2 8.166667 7.166667 3.2 8.142857 7.75
G H
Composition 13.85714 14.0
Drawing 10.42857 14.0
Colour 14.85714 6.5
Expression 10.00000 12.5

Compare the obtained output with the output obtained in previous section and
observe the difference.
Next, we use the testing function is.matrix() (discussed in the Unit 3 of
MST-015 course), to verify whether the obtained output is a matrix object or
not as follows:
#Testing for matrix object
> is.matrix(sapply(X=split.data, FUN=Fun1, simplify=TRUE))
[1] TRUE

Since the obtained output is a matrix object, we can use matrix functions on it. Let us compute the transpose of the obtained matrix using the t() function discussed in Unit 2 of the MST-015 course as follows:

#Computing the transpose of the obtained output


> t(sapply(X=split.data, FUN=Fun1, simplify=TRUE))
Composition Drawing Colour Expression
A 10.40000 14.70000 9.000000 8.200000
B 12.16667 14.33333 7.333333 8.166667
C 13.16667 13.50000 7.500000 7.166667
D 9.10000 9.90000 16.100000 3.200000
E 13.57143 12.85714 11.857143 8.142857
F 7.25000 10.25000 9.500000 7.750000
G 13.85714 10.42857 14.857143 10.000000
H 14.00000 14.00000 6.500000 12.500000

SAQ 2
Consider the admission data given in the Unit 4 of MST-015 course and write
code to compute the average percentage score of the students according to
the Gender variable of the data.

8.4 The apply() Function
The apply() function is mainly used on array or matrix objects (it also accepts a data frame). The output of the apply() function is either a vector, an array or a list object. This function is mainly used to apply a function assigned to the FUN argument on the margins (specified by the MARGIN argument) of the array or matrix object X. The mainly used arguments of the apply() function are as follows:
#The apply() function
apply (X, #matrix or array or data frame
MARGIN, #an integer vector specifying the margins
FUN, #function to be applied over the margins of X
simplify, #whether to simplify the result or not
...) #Other arguments of FUN and apply functions

Note: The MARGIN argument of the apply() function can take values 1, 2 or
c(1,2). The value 1 is used to indicate rows, value 2 is used to indicate
columns and c(1,2) indicates both rows and columns. The margin c(1,2)
is mainly used if the X argument is an array.
Next, we illustrate its execution. To do so, we take help of a matrix with
following elements.
2 4 1
0 3 2
8 1 2

Next, we use the apply() function to compute the minimum of each row and the maximum of each column of the matrix. To do so, we assign the X argument of the function as the matrix and the MARGIN argument as 1 (for rows) or 2 (for columns). Also, we assign the FUN argument as min (for the minimum) or max (for the maximum) as follows:

#Computing minimum of each row


> apply(X=matrix(c(2,4,1,0,3,2,8,1,2), ncol=3, byrow=TRUE),
MARGIN=1, FUN=min)
[1] 1 0 1

#Computing maximum of each column


> apply(X=matrix(c(2,4,1,0,3,2,8,1,2), ncol=3, byrow=TRUE),
MARGIN=2, FUN=max)
[1] 8 4 2

In the next illustration, we assign an array object to the X argument of the apply() function and illustrate the use of all the possible values of the MARGIN argument. To do so, we first create an arbitrary array named Arr as follows:
#Creating an array
> Arr <- array(data=1:18, dim=c(3,2,3));Arr
, , 1

[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12

, , 3
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18

Now, we compute the sum of each row and column of Arr as follows:
#Computing the sum of each row
> apply(Arr, 1, function(x) sum(x))
[1] 51 57 63

#Computing the sum of each column


> apply(Arr, 2, function(x) sum(x))
[1] 72 99

From the obtained output, observe that 51 is the sum of all the elements in the first rows of the 3 matrices of Arr (1+4+7+10+13+16). Similarly, the remaining output can be understood.

Next, we compute the index/position wise sum of the elements of the three
matrices of Arr. To do so, we assign the MARGIN argument as 1:2 in the
following manner:

#Computing the sum of same indexed elements of 3 matrices


> apply(Arr, 1:2, function(x) sum(x))
[,1] [,2]
[1,] 21 30
[2,] 24 33
[3,] 27 36

From the obtained output, observe that 21 is the sum of the elements present at the first row and first column in each of the 3 matrices of Arr (1+7+13). Similarly, the other terms of the output can be understood.
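The position-wise sum can be cross-checked by adding the three matrices of the array directly. The following is a minimal sketch in which the object names by_position and slice_sum are our own.

```r
# Re-creating the array used in the illustration
Arr <- array(data = 1:18, dim = c(3, 2, 3))

# Sum over the third dimension for every (row, column) position
by_position <- apply(Arr, 1:2, sum)

# The same result obtained by adding the three 3 x 2 matrices
slice_sum <- Arr[, , 1] + Arr[, , 2] + Arr[, , 3]

all.equal(by_position, slice_sum)
```

The comparison returns TRUE, which confirms that MARGIN = 1:2 keeps the row and column positions fixed and applies the function across the remaining (third) dimension.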

Moreover, recall that the apply() function also accepts a data frame. So, we next assign a data frame to the X argument of the apply() function. For the illustration purpose, we consider a subpart (consisting of the first 20 rows) of the built-in data set airquality available in the datasets package. But first we seek help on the data set as follows:
#Seeking help on the data set
> ?airquality
starting httpd help server ... done

You can see the details on the airquality data frame and about its
columns from the R documentation page. For the sake of convenience, before
using the apply() function on the subpart of the data set, we assign it to an
object named SubAir as follows:
#Extracting and assigning first 20 rows of airquality data
> SubAir <- airquality[1:20,];SubAir
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7

8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20

Next, we use the summary() function to compute the summary (with missing
values) of the variables of the SubAir data as follows:
#Computing the summary of the variables of the SubAir data
> summary(SubAir)
Ozone Solar.R Wind Temp
Min. : 6.00 Min. : 19.0 Min. : 6.90 Min. :56.00
1st Qu.:11.25 1st Qu.: 99.0 1st Qu.: 9.05 1st Qu.:61.75
Median :17.00 Median :194.0 Median :11.50 Median :66.00
Mean :19.22 Mean :197.1 Mean :11.64 Mean :65.15
3rd Qu.:26.75 3rd Qu.:299.0 3rd Qu.:13.35 3rd Qu.:68.25
Max. :41.00 Max. :334.0 Max. :20.10 Max. :74.00
x. :74.00
NA's :2 NA's :3
Month Day
Min. :5 Min. : 1.00
1st Qu.:5 1st Qu.: 5.75
Median :5 Median :10.50
Mean :5 Mean :10.50
3rd Qu.:5 3rd Qu.:15.25
Max. :5 Max. :20.00

From the summary() function output, it can be seen that the Ozone and Solar.R variables have 2 and 3 NA’s, respectively. Therefore, before using a function such as mean() or sum() on the columns or rows of the SubAir data, it is necessary to remove the NA’s (otherwise we will get NA as output).
The na.omit() function, available in the stats package, can be used to remove all the rows of the data frame SubAir which contain NA’s. After the removal of the NA’s, we assign the updated data to SubAirUp as follows:
#Removing NA’s in a single command
> SubAirUp <- na.omit(SubAir);SubAirUp
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20

On comparing the SubAirUp and SubAir data frames, you will find that
row numbers 5, 6, 10 and 11, which contain NA values, are now
removed. So, in total four rows have been removed by the na.omit()
function. The same can be verified using the nrow() function as follows:
#Difference between the number of rows of SubAir and SubAirUp
> nrow(SubAir)-nrow(SubAirUp)
[1] 4
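Before dropping rows, it can help to see where the missing values actually sit. The following is a minimal sketch, assuming only the base R functions is.na(), colSums() and complete.cases(); the object names na_per_column and rows_with_na are our own and are used only for illustration:

```r
# Recreate the first 20 rows of the built-in airquality data
SubAir <- airquality[1:20, ]

# Number of NA's in each column (is.na() returns a logical matrix)
na_per_column <- colSums(is.na(SubAir))
print(na_per_column)

# Row numbers that contain at least one NA
rows_with_na <- which(!complete.cases(SubAir))
print(rows_with_na)

# na.omit() drops exactly these rows
SubAirUp <- na.omit(SubAir)
print(nrow(SubAir) - nrow(SubAirUp) == length(rows_with_na))
```

The row numbers printed by which() can be compared directly with the rows that are absent from SubAirUp.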

Hence, the obtained output confirms that four rows have been removed by the
na.omit() function. After removing the missing values, we now compute the
row means and column sums from the SubAirUp data frame using the
apply() function. To compute the row means, we assign the X argument of
the apply() function as SubAirUp, the MARGIN argument as 1 (for rows)
and the FUN argument as mean in the following manner:
#Computing the row means after omitting rows consisting NA’s
> apply(X=SubAirUp, MARGIN=1, FUN=mean) #Or rowMeans(SubAirUp)
1 2 3 4 7 8
51.90000 40.16667 42.60000 68.91667 67.93333 33.96667
9 12 13 14 15 16
20.35000 61.28333 65.70000 64.31667 29.03333 74.08333
17 18 19 20
73.50000 30.40000 75.91667 25.28333

If you do not want to use the na.omit() function, then alternatively you can
ignore the NA elements of the SubAir data frame while the computations
are carried out, by using the na.rm argument of the mean() function (discussed in Unit
2 of MST-015). To do so, we supply na.rm as an additional
argument to the apply() function as follows:
#Computing the row means by removing NA’s elements
> apply(X=SubAir, MARGIN=1, FUN=mean, na.rm=TRUE)
1 2 3 4 5 6
51.90000 40.16667 42.60000 68.91667 20.07500 23.98000
7 8 9 10 11 12
67.93333 33.96667 20.35000 57.32000 20.78000 61.28333
13 14 15 16 17 18
65.70000 64.31667 29.03333 74.08333 73.50000 30.40000
19 20
75.91667 25.28333

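Thus, for the rows without missing values both the approaches give the same
row means. The second approach additionally returns the means of rows 5, 6,
10 and 11, computed from their non-missing values only.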


Next, to compute the column sums, we assign the X argument of the apply()
function as SubAirUp, the MARGIN argument as 2 (for columns) and the FUN
argument as sum in the following manner:
#Computing the column sums of SubAirUp
> apply(X=SubAirUp, MARGIN=2, FUN=sum) #Or colSums(SubAirUp)
Ozone Solar.R Wind Temp Month Day
311.0 3157.0 188.1 1038.0 80.0 178.0
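The comments in the commands above mention rowMeans() and colSums() as alternatives to apply(). As a small check, and only as a sketch (the object names below are our own), the results of both routes can be compared with all.equal():

```r
# Rebuild the NA-free data used above
SubAirUp <- na.omit(airquality[1:20, ])

# Row means: apply() versus the dedicated rowMeans() function
row_means_apply <- apply(X = SubAirUp, MARGIN = 1, FUN = mean)
row_means_direct <- rowMeans(SubAirUp)
print(all.equal(unname(row_means_apply), unname(row_means_direct)))

# Column sums: apply() versus the dedicated colSums() function
col_sums_apply <- apply(X = SubAirUp, MARGIN = 2, FUN = sum)
col_sums_direct <- colSums(SubAirUp)
print(all.equal(unname(col_sums_apply), unname(col_sums_direct)))
```

For plain sums and means the dedicated functions are shorter to write; apply() remains the general tool when FUN is any other function.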

Next, we assign the quantile() function to the FUN argument to compute
the quantiles of the columns of the SubAirUp data as follows:
#Computing the quantiles of the columns of SubAirUp
> apply(SubAirUp, 2, quantile)
Ozone Solar.R Wind Temp Month Day
0% 6.00 19.00 7.400 57.00 5 1.00
25% 11.75 93.75 9.575 61.75 5 6.25
50% 17.00 223.00 11.500 65.50 5 12.50
75% 24.75 301.00 12.750 68.00 5 16.25
100% 41.00 334.00 20.100 74.00 5 20.00
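Just as na.rm was passed on to mean() earlier, other arguments of FUN can be supplied after FUN in the apply() call. The following sketch, in which the choice of the 10% and 90% points is only an assumption for illustration, passes the probs argument on to quantile():

```r
# Rebuild the NA-free data used above
SubAirUp <- na.omit(airquality[1:20, ])

# probs is not an argument of apply(); it is handed over to quantile()
deciles <- apply(X = SubAirUp, MARGIN = 2, FUN = quantile, probs = c(0.1, 0.9))
print(deciles)
```

The result is a matrix with one row per requested probability and one column per variable.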

SAQ 3
Write R code to compute the row sums, column means, minimum and
maximum of the rows of the SubAirUp data frame (shown in Section 8.2).

8.5 The tapply() Function


This function works similarly to the sapply() function (when used with the
split() function). It applies the function assigned to its FUN argument to each
group of values given by a unique combination of the levels of the factors
assigned to the INDEX argument. The three main arguments of the tapply()
function are X, INDEX and FUN. The X argument is used to assign an R
object for which a split method exists (typically a vector), the second
argument INDEX is used to assign a factor (according to which grouping is
performed) and the third argument FUN is used to assign the function name or
function definition.
Note: The tapply() function is mainly used to apply the function assigned to the
FUN argument on the subsets (groups defined by a factor variable) of an
object for which a split method exists.


The main arguments of this function can be summarized as follows:


#The tapply() function
tapply (X, #an R object for which a split method exists
#(typically a vector, e.g., a column of a data frame)
INDEX, #a factor or a list of factors (if not then
#they are coerced to factors)
FUN, #function to be applied
..., #optional arguments passed on to FUN
simplify) #TRUE by default. It indicates whether to
#simplify the output or not

To explain the execution of the tapply() function, we first create a data
frame named DF1 with two columns. The first column x is of numeric type and
the second column y is of factor type (created using gl() function) as follows:
#Creating a data frame
> DF1 <- data.frame(x=c(1,3,2,3,1,5,6,2,1,4),
y=gl(2,5,labels=c("Treatment", "Control")));DF1
x y
1 1 Treatment
2 3 Treatment
3 2 Treatment
4 3 Treatment
5 1 Treatment
6 5 Control
7 6 Control
8 2 Control
9 1 Control
10 4 Control

Then the mean according to the subsets defined by the factor variable y of the
DF1 data frame can be obtained by assigning the X argument as x variable,
the INDEX argument as y variable and the FUN argument as mean() in the
tapply() function as follows:
#Computing the means according to the levels of a factor
> tapply(X=DF1$x, INDEX=DF1$y, FUN=mean) #Or tapply(DF1$x, DF1$y, mean)
Treatment Control
2.0 3.6
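The section opened by saying that tapply() works like sapply() used together with split(). The following sketch makes that comparison explicit on the same DF1 data; the object names groups, means_split and means_tapply are our own:

```r
# Same data frame as above
DF1 <- data.frame(x = c(1, 3, 2, 3, 1, 5, 6, 2, 1, 4),
                  y = gl(2, 5, labels = c("Treatment", "Control")))

# Route 1: split x by the levels of y, then apply mean() to each piece
groups <- split(DF1$x, DF1$y)
means_split <- sapply(groups, mean)

# Route 2: tapply() does the splitting and the applying in one call
means_tapply <- tapply(X = DF1$x, INDEX = DF1$y, FUN = mean)

print(means_split)
print(all.equal(as.vector(means_split), as.vector(means_tapply)))
```

The two routes differ only in the class of the returned object: sapply() returns a named vector here, whereas tapply() returns a one-dimensional array.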

For more clarification, we now present another illustration in which we shall get
the subsets of each column of a data frame one-by-one according to a factor
variable and apply the mean() function on each subset using the tapply()
function. Consider again the painters data frame discussed in the beginning of
this unit. We now use the tapply() function on the first four variables
of the painters data frame using a for loop as follows:
#Loading the package
> library(MASS)
#Computing the means according to the levels of a factor for
#first four columns of the data frame
> for(i in 1:4){
+ cat(names(painters)[i], "\n");
+ print(round(tapply(painters[,i], painters$School,
+ mean),2))}
Composition
A B C D E F G H
10.40 12.17 13.17 9.10 13.57 7.25 13.86 14.00
Drawing
A B C D E F G H
14.70 14.33 13.50 9.90 12.86 10.25 10.43 14.00
Colour
A B C D E F G H
9.00 7.33 7.50 16.10 11.86 9.50 14.86 6.50
Expression
A B C D E F G H
8.20 8.17 7.17 3.20 8.14 7.75 10.00 12.50

Observe that, while printing, we have rounded the output to 2 decimal places
(for the sake of convenience).
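As noted in the summary of arguments, INDEX may also be a list of factors, in which case FUN is applied to every combination of their levels. A minimal sketch, assuming the complete built-in warpbreaks data set of the datasets package (the object name cell_means is our own):

```r
# Mean number of breaks for every wool-tension combination
cell_means <- tapply(X = warpbreaks$breaks,
                     INDEX = list(wool = warpbreaks$wool,
                                  tension = warpbreaks$tension),
                     FUN = mean)
print(round(cell_means, 2))

# Cross-check one cell against a direct computation
direct_AL <- mean(warpbreaks$breaks[warpbreaks$wool == "A" &
                                    warpbreaks$tension == "L"])
print(all.equal(cell_means[["A", "L"]], direct_AL))
```

With two factors the output is a two-way table, with the levels of the first factor as rows and those of the second as columns.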

SAQ 4
Write R code to create a data frame named warp_breaks consisting of the
following subpart of the built-in data set warpbreaks{datasets} of R, where
the wool and tension columns are of factor type and the breaks variable is of
numeric type:
breaks wool tension
26 A L
30 A L
54 A L
25 A L
70 A L
52 B M
51 B M
26 B M
67 B M
18 B M
Write R code using the tapply() function to compute the wool-wise and
tension-wise minimum of the data.

8.6 The mapply() Function


The mapply() function is a multivariate version of the sapply() function,
except that its arguments are arranged in a different order from those of
the sapply() function. The main difference is that the FUN argument of the
mapply() function comes first, before all the other arguments of the function.
Note that the mapply() function runs the function assigned to the FUN argument
by taking values from the supporting arguments consecutively, i.e., the first
call uses the first element of each supporting argument, the second call uses
the second elements, and so on.
#The mapply() function
mapply (FUN, #function to be applied
..., #arguments of FUN to loop over (vectors or lists)
MoreArgs, #list of other arguments of FUN (kept fixed in every call)
SIMPLIFY) #TRUE by default. It indicates whether to
#simplify the result or not

Suppose we want to generate five sequences starting from 1 and ending at 10, but
with the jumps of 1, 2, 3, 4 and 5, respectively. It can be done by using
five seq() function commands as follows:
#Generating five sequences using seq() function commands
> seq(from=1, to=10, by=1) #Jump=1
[1] 1 2 3 4 5 6 7 8 9 10

> seq(from=1, to=10, by=2)


[1] 1 3 5 7 9

> seq(from=1, to=10, by=3)


[1] 1 4 7 10

> seq(from=1, to=10, by=4)


[1] 1 5 9

> seq(from=1, to=10, by=5) #Jump=5


[1] 1 6

The mapply() function is a very efficient loop function, which will
generate all these 5 sequences in a single command. To do so, we assign the
FUN argument as seq() function and its supporting arguments, namely,
from as 1, to as 10 and the by argument as 1 to 5 in the following manner:
#Generating five sequences using mapply() function
> mapply(seq, from=1, to=10, by=1:5, SIMPLIFY=FALSE)
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10

[[2]]
[1] 1 3 5 7 9

[[3]]
[1] 1 4 7 10

[[4]]
[1] 1 5 9

[[5]]
[1] 1 6

Observe that the values of the from and to arguments of the seq() function
are fixed as 1 and 10 (so they are recycled according to the length of
the by argument). The by argument is assigned as a vector 1:5, so its length
is 5.
Next, we present another illustration in which we generate 5 different
sequences with different starting and ending numbers, like starting from 1
and ending at 10, starting from 3 and ending at 30, starting from 6 and ending
at 60 and so on. Also, we use different jumps, 1 to 5, for generating these
sequences. We first use the seq() function and then the mapply() function as
follows:
#Generating sequences using seq() function
> seq(from=1, to=10, by=1)
[1] 1 2 3 4 5 6 7 8 9 10

> seq(from=3, to=30, by=2)
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29

> seq(from=6, to=60, by=3)
[1] 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60

> seq(from=9, to=90, by=4)


[1] 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81
[20] 85 89
> seq(from=12, to=120, by=5)
[1] 12 17 22 27 32 37 42 47 52 57 62 67 72 77
[15] 82 87 92 97 102 107 112 117

The same task can be done very efficiently using the mapply() function as
follows:
#Generating sequences using mapply() function
> mapply(seq, from=c(1,3,6,9,12), to=c(10,30,60,90,120),
by=1:5, SIMPLIFY=FALSE)
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10

[[2]]
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29

[[3]]
[1] 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60

[[4]]
[1] 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81
[20] 85 89

[[5]]
[1] 12 17 22 27 32 37 42 47 52 57 62 67 72 77
[15] 82 87 92 97 102 107 112 117
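Base R also provides the Map() function, which is a wrapper around mapply() with SIMPLIFY fixed as FALSE, so it always returns a list. The following sketch repeats the last illustration with Map(); the vector names starts and ends are our own:

```r
starts <- c(1, 3, 6, 9, 12)
ends <- c(10, 30, 60, 90, 120)

# mapply() with SIMPLIFY=FALSE, as used above
via_mapply <- mapply(seq, from = starts, to = ends, by = 1:5, SIMPLIFY = FALSE)

# Map() gives the same list without writing SIMPLIFY=FALSE
via_map <- Map(seq, from = starts, to = ends, by = 1:5)

print(all.equal(via_mapply, via_map))
print(lengths(via_map))   # number of elements in each generated sequence
```

Whether to write Map() or mapply(..., SIMPLIFY=FALSE) is a matter of taste; this unit keeps to mapply().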

In the next illustration we shall generate 3 different matrices of different orders,
say, 2x8, 4x4 and 8x2, using the mapply() function in a single command. To
do so, we create a user-defined function named fun using the matrix()
function (to create a matrix). The remaining data (matrix elements) are assigned to
the MoreArgs argument of the mapply() function as follows:
#Elements of the matrices
> x <- 1:16

#Function to generate matrices with specified rows and columns
> fun <- function(r,c,x){
+ matrix(x, nrow=r, ncol=c, byrow=TRUE)
+ }

#Generating 3 matrices using mapply function
> mapply(fun, c(2,4,8), c(8,4,2), MoreArgs=list(x=x),
SIMPLIFY=FALSE)
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 2 3 4 5 6 7 8
[2,] 9 10 11 12 13 14 15 16

[[2]]
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16

[[3]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
[7,] 13 14
[8,] 15 16

Hence, we get 3 matrices of orders 2x8, 4x4 and 8x2 using the mapply()
function. Also, in all the above mapply() function statements, we have
assigned SIMPLIFY as FALSE to get each output separately.
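To see what the SIMPLIFY argument does when it is left at its default value TRUE, consider the following sketch. The two small examples are our own and are chosen only to contrast results of equal and unequal lengths:

```r
# Every sequence has length 4, so the default SIMPLIFY=TRUE
# arranges the results as the columns of a matrix
same_length <- mapply(seq, from = 1:3, to = 4:6)
print(same_length)

# Here the sequences have different lengths, so the result
# cannot be simplified and stays a list even with SIMPLIFY=TRUE
different_length <- mapply(seq, from = 1, to = 10, by = 1:5)
print(is.list(different_length))
```

Setting SIMPLIFY=FALSE, as done throughout this section, makes the shape of the output predictable: it is always a list.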

SAQ 5
The runif(n,min,max) function is used to generate n random numbers in
the range min to max. Use mapply() function to generate 5, 10 and 15
uniform numbers in the range 0 to 1 in a single command.

8.7 SUMMARY
The main points discussed in this unit are as follows:
Different methods of using lapply() function are discussed.
Advantage of using sapply() function as compared to the lapply()
function is discussed and illustrated.
The apply() and tapply() functions with their general syntaxes and
methods of their usage are discussed and illustrated.
The advantage of using the mapply() function, which applies a function to
several arguments in parallel in a single command, is discussed
and illustrated.

8.8 TERMINAL QUESTIONS


1. In which package is the quantile() function available?
2. Write R code to compute the row means and column means of a data
frame named Ln using the apply() function.

3. Write R code to create a data frame named df of the following data.


x : 9, 5, 8, 6, 2
y: 4, 9, 7, 5, NA
z: 1, 1, 1, 2, 2
After creating the data frame, do the following tasks:
(i) Write R command to convert the z variable to a factor and then use it in
the split() function for splitting the data frame according to the z
variable.
(ii) Write R command using the sapply() function to compute the row
means of the first two columns of the data frame after splitting the data
frame df.
4. Write R code to create a list named Lst with following components:
x: 4.9, 6.1, 5.8, 6.5, 5.2
y: 22.4, 3.2, -11.7, 0.7, 1.0
z: 1.4, 0.9, 0.7, 0.5, 0.8
Use a suitable loop function to compute the square root of only those
elements (numbers) of the list components, which are greater than or equal
to zero.
5. Write R code to get the subsets of the first four columns of the iris data
frame one-by-one according to the factor variable Species and apply the
mean() function on each subset using a for loop.

8.9 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. We first create a list named Lst as follows:
Lst <- list(diag(c(2,3,5)), matrix(c(5,1,8,4,-2,6,3,2,5),
ncol=3))
Then we can compute the transpose of both the matrices using the
lapply() function with the following R statement:
lapply(X=Lst, FUN=t)
2. We first create a data frame using the following code:
Adm.data <- data.frame(Name= c("Shreyash","Prithu",
"Yuvaan","Advika","Pawan","Pehu"),
Gender= as.factor(c("Male", "Male", "Male", "Female",
"Male", "Female")),
Percentage= c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
AgeG30= c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))
Then to compute the average percentage score of the students
according to the Gender variable of the data, we split the data in the
following manner and assigned to split.data:
split.data <- split(Adm.data, Adm.data$Gender)
print(split.data)
Thereafter, we create a function named Fun1 to get the 3rd column,
which is showing the percentages, i.e., x[,3]. Then we use the
sapply() function to get average percentage score of the students
according to the Gender variable of the data as follows:
Fun1 <- function(x) mean(x[,3], na.rm=TRUE)
sapply(X=split.data, FUN=Fun1)
3. We can compute the vectors of row sums, column means, minimum and
maximum of the SubAirUp data frame using following commands:
For a vector of row sums:
apply(X=SubAirUp, MARGIN=1, FUN=sum)
For a vector of column means:
apply(X=SubAirUp, MARGIN=2, FUN=mean)
For a vector of minimum of rows:
apply(X=SubAirUp, MARGIN=1, FUN=min)
For a vector of maximum of rows:
apply(X=SubAirUp, MARGIN=1, FUN=max)
4. Firstly, the data frame named warp_breaks can be created as
follows:
warp_breaks <-
data.frame(breaks=c(26,30,54,25,70,52,51,26,67,18),
wool=factor(c(rep("A",5),rep("B",5))),
tension=factor(c(rep("L",5),rep("M",5))))
Here factor() is used so that the wool and tension columns are of factor type.
Then the group-wise minimum according to the wool variable (2nd column)
can be obtained using the following command:
tapply(warp_breaks[,1], warp_breaks[,2], min)
Then the group-wise minimum according to the tension variable (3rd
column) can be obtained using the following command:
tapply(warp_breaks[,1], warp_breaks[,3], min)
5. Uniform random numbers of sizes 5, 10 and 15 in the range 0 to 1 can
be generated in a single command using the mapply() function by
writing the following statement:

mapply(runif, n=seq(5,15,5), min=0, max=1,
SIMPLIFY=FALSE)

Terminal Questions (TQs)


1. The quantile() function is available in the stats package.
2. The row means of the data frame Ln can be computed using
apply(Ln, 1, mean)
The column means of the data frame Ln can be computed using
apply(Ln, 2, mean)
3. Both the parts of the question can be solved using the following code:
(i) The data frame named df can be created as follows:
df <- data.frame(x = c(9, 5, 8, 6, 2),
y = c(4, 9, 7, 5, NA),
z = c(1, 1, 1, 2, 2))
Then we are asked to coerce the z variable of the data frame to a
factor variable, which can be done by writing the following command:
df$z <- as.factor(df$z)
Thereafter, the data frame can be split according to the z variable and
the grouped data can be assigned to split_df as follows:
split_df <- split(df, df$z)
(ii) In the second part of the question we are asked to use the sapply()
function to compute the row means of the first two columns, which can be
done by writing the following command:
sapply(X=split_df, FUN=function(x) rowMeans(x[,c(1,2)],
na.rm=TRUE))
Here, na.rm=TRUE is used as the data frame has a missing value.
4. Firstly, a list named Lst can be created as follows:
Lst <- list(x = c(4.9, 6.1, 5.8, 6.5, 5.2),
y = c(22.4, 3.2, -11.7, 0.7, 1.0),
z = c(1.4, 0.9, 0.7, 0.5, 0.8))
Thereafter, the elements greater than or equal to zero can be extracted
by using the following condition inside the lapply() function:
x[x>=0] #for extracting greater than equal to 0 elements
Then the square root of those elements of each of the three components of Lst
can be computed in a single command using the lapply() function as
follows:
lapply(X=Lst, FUN=function(x) sqrt(x[x>=0]))
5. The given task can be accomplished by writing the following code.
for(i in 1:4){
cat(names(iris)[i], "\n");
print(tapply(iris[,i], iris$Species, mean))}

UNIT 9
DESCRIPTIVE STATISTICS AND
CORRELATION WITH R

Structure
9.1 Introduction
    Expected Learning Outcomes
9.2 Mean
    Arithmetic Mean
    Geometric Mean
    Harmonic Mean
9.3 Median
9.4 Mode
9.5 Variance and Standard Deviation
9.6 Range, Quartile Deviation and Mean Deviation
9.7 Skewness and Kurtosis
    Skewness
    Kurtosis
9.8 Correlation and Bivariate Plots
    Karl Pearson’s Correlation Coefficient
    Spearman’s Correlation Coefficient
9.9 Summary
9.10 Terminal Questions
9.11 Solutions/Answers

9.1 INTRODUCTION
In this unit, we shall discuss the methods of computing the mean, median,
mode, variance, standard deviation, range, quartiles, quartile deviation and
mean deviation. We shall also discuss skewness and kurtosis to infer the
shape and spread of the data. Thereafter, to find whether two variables are
correlated or not, correlations and bivariate plots are discussed. To do so, we
shall treat the previous eight units of the MST-015 (Introduction to R Software)
course as the basics and illustrate the computations of the aforementioned
statistical measures in this unit.
Moreover, in this unit the methods of computing the aforementioned statistical
measures are illustrated for the ungrouped and grouped data. Recall that the
grouped data can further be categorized as discrete frequency data and
continuous frequency data. So, illustrations based on at least one of them are
given in this unit. Most of the illustrated problems are on ungrouped and
continuous frequency data, as the problems on discrete frequency data can
be solved on the same lines.
For the computation of some of the statistical measures, built-in
functions are available in specific packages, whereas for some others a
ready-made function may not be available. But there is no need to worry, as we
can create a user-defined function to accomplish an intended task.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi

Expected Learning Outcomes


After studying this unit, you should be able to
compute mean, median and mode from the data;
compute variance and standard deviation from the data;
compute range, quartiles, mean and quartile deviations from the data;
check whether the distribution of the data is skewed or not;
check whether the distribution of the data is leptokurtic, mesokurtic or
platykurtic; and
obtain Karl Pearson’s and Spearman’s correlation coefficients.

9.2 MEAN
Means or averages are single values that describe the characteristics of the
entire data. Mean and median are the most useful measures of central
tendency as they show the tendency of some central value around which the data
cluster. They also facilitate comparisons.
In this section, we shall discuss the methods of computing the mean of
grouped and ungrouped data. Recall that to compute the mean of
continuous frequency data, we must first obtain the Xi’s (middle values of the
class intervals). In the case of discrete frequency data, we can
take the given Xi’s directly and then the mean can be
computed on the same lines as for continuous frequency data.

9.2.1 Arithmetic Mean


The arithmetic mean is the most popular and widely used measure of central
tendency, which represents the entire data by one value. We already know
that the arithmetic mean is computed by adding all the observations together
and then dividing this total by the number of observations. Mathematically, for
ungrouped data the arithmetic mean of the data $x_1, x_2, \ldots, x_n$ of size n is given
by the following expression:
$$\text{Arithmetic mean} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{9.1}$$
Now we discuss the method of computing the arithmetic mean of ungrouped
data. To do so, we compute the mean of the monthly income data (in
thousands of rupees) of 15 employees working in a company, which are as
follows:
55, 45, 31, 45, 33, 34, 44, 38, 41, 38, 53, 38, 32, 37, 55
So, we first assign the income data to a vector named Income as follows:
#Assigning the income data
> Income <- c(55, 45, 31, 45, 33, 34, 44, 38, 41, 38, 53, 38,
32, 37, 55)

Then we use the sum() function to get the sum of the observations and the
length() function to get n. Then we compute the mean of the data using (9.1)
as follows:
#Computing the arithmetic mean of the income data
> sum(Income)/length(Income)
[1] 41.26667

Hence, the average or mean income of the employees is 41.26667 thousand rupees.
As an alternative method, the arithmetic mean of the data can also be
computed using the mean() function. The generic function mean() for
computing the arithmetic mean is available in the base package. We first see
the internal structure of the mean() function using the str() function (or
otherwise we can take help on the function) as follows:
#Internal structure
> str(mean)
function (x, ...)

The main arguments of interest of the mean() function are x and na.rm,
where the x argument is an R object, mainly a vector, and na.rm is a logical
argument indicating whether the missing values, i.e., NA values in the data,
should be removed before the computation.
Then to compute the mean of the income data, we assign the x argument of
the mean() function as the Income object as follows:
#Computing the mean of the income data
> mean(x=Income)
[1] 41.26667

Hence, the obtained mean is the same as earlier.
In case of missing values, the na.rm argument (with default value FALSE) of
the mean() function comes to the rescue. To compute the arithmetic mean
of the available observations, we assign na.rm as TRUE. For more
clarification, consider the following example in which we shall compute the
arithmetic mean of the following observations:
1, 2, 3, NA, 5
#Computing the mean in presence of a missing value
> mean(x=c(1, 2, 3, NA, 5), na.rm=TRUE)
[1] 2.75

It can be verified that the mean is computed using the non-missing values only
by writing the following mean() function statement:
#Verification
> mean(x=c(1, 2, 3, 5))
[1] 2.75
Hence, it is verified.
Note: If we do not use the na.rm argument of the mean() function then we
get NA as output. See the following for clarification:
#Computing the mean of the data consisting of NA value
> mean(x=c(1, 2, 3, NA, 5))
[1] NA
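If the mean is computed from formula (9.1) with sum() and length() instead of mean(), missing values need a little extra care: the divisor must be the number of available observations, not the full length of the vector. A minimal sketch of this, with object names of our own choosing:

```r
x <- c(1, 2, 3, NA, 5)

# Number of non-missing observations (length(x) would count the NA as well)
n_available <- sum(!is.na(x))

# Formula (9.1) applied to the available observations only
manual_mean <- sum(x, na.rm = TRUE) / n_available

# Same computation through mean()
builtin_mean <- mean(x, na.rm = TRUE)

print(c(n_available = n_available, manual = manual_mean, builtin = builtin_mean))
```

Dividing by length(x) here would use 5 instead of 4 as the divisor and understate the mean.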

Next, we discuss the computation of the arithmetic mean for grouped
data. Recall that the mean of grouped data is computed using the following
formula:
$$\text{Arithmetic mean} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \tag{9.2}$$
where $N = \sum_{i=1}^{n} f_i$ and n is the number of class intervals.

For the illustration purpose, we consider the following data representing the
figures of incentives (in thousand) earned by 1400 employees of a company:

Class Intervals Frequency


20-40 500
40-60 300
60-80 280
80-100 120
100-120 100
120-140 80
140-160 20

To compute the average incentives using data frames, we first create a data
frame named data with three columns. The 1st column named ll represents
the lower limits of the intervals, the 2nd column named ul represents the upper
limits of the intervals and the freq column represents the given frequencies in the
following manner:
#Creating a data frame of the incentives data
> data <- data.frame(ll=seq(20, 140, 20), ul=seq(40, 160, 20),
freq=c(500, 300, 280, 120, 100, 80, 20))

Let us first attach the data frame data to use its column names without the data
frame name as follows:
#Attaching the data frame
> attach(data)

Then the arithmetic mean can be easily computed using the given formula. To do
so, we compute the middle values and thereafter use (9.2) to compute the
mean as follows:
#Computing the middle values after attaching the data frame
> xi <- (ll+ul)/2; xi
[1] 30 50 70 90 110 130 150

#Computing the mean of the grouped data
> sum(xi*freq)/sum(freq)
[1] 60.57143

Alternatively, the weighted.mean() function available in the stats package
can be used efficiently to compute the mean of the grouped data. We first see
its internal structure as follows:
#Internal Structure
> str(weighted.mean)
function (x, w, ...)

The x argument of the weighted.mean() function is used to assign the middle
values (Xi) and the w argument is used to assign the weights $f_i/N$.
Next, we compute the average incentives (arithmetic mean) using the
weighted.mean() function. To do so, we assign the x argument of the
weighted.mean() function as the middle values of the corresponding class
intervals and the w argument as the given frequencies divided by the total
frequency as follows:
#Computing the mean of the grouped data
> weighted.mean(x=(ll+ul)/2, w=freq/sum(freq))
[1] 60.57143

Hence, using both the approaches we get the same result.
After computing the mean, we detach the data frame as follows:
#Detaching the data frame
> detach(data)

Note: This problem could have been solved without using the data frames as
well. In that case, class intervals and frequencies will be assigned to vectors
first. After that, Xi’s can be computed. Thereafter, using Xi’s and frequencies
mean can be computed.
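The vector-only route described in the note can be sketched as follows. The vector names lower, upper and mid are our own; the sketch also relies on the fact that weighted.mean() rescales its weights internally, so the raw frequencies can be supplied without dividing them by their total:

```r
# Class limits and frequencies of the incentives data as plain vectors
lower <- seq(20, 140, by = 20)
upper <- seq(40, 160, by = 20)
freq <- c(500, 300, 280, 120, 100, 80, 20)

# Middle values of the class intervals
mid <- (lower + upper) / 2

# Grouped mean from formula (9.2)
grouped_mean <- sum(mid * freq) / sum(freq)
print(grouped_mean)

# weighted.mean() with the raw frequencies as weights
grouped_mean_wm <- weighted.mean(x = mid, w = freq)
print(grouped_mean_wm)
```

Since no data frame is attached in this version, there is nothing to detach afterwards.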

9.2.2 Geometric Mean


It is known that the geometric mean of a set of n observations (for ungrouped
data) is the nth root of their product. Mathematically,
$$\text{Geometric mean} = (x_1 x_2 \cdots x_n)^{1/n} \tag{9.3}$$

Next, we discuss the method of computing the geometric mean of the given
ungrouped data. To do so, we consider the following production data of fans
noted for 7 days and compute the geometric mean from it.

Day 1 2 3 4 5 6 7
No. of units produced 4000 2000 1500 3500 2000 1900 3000

To compute the geometric mean, we first assign the number of units produced
to a vector named Production. Then use the prod() and length()
functions to compute the geometric mean using the formula given in (9.3) as
follows:
#Assigning the production data
> Production <- c(4000, 2000, 1500, 3500, 2000, 1900, 3000)

#Computing the geometric mean from the production data
> (prod(Production))^(1/length(Production))
[1] 2414.789

Hence, 2414.789 is the geometric mean.
Note: The prod() function is used to compute the product of the vector
elements.
Recall that the geometric mean can also be computed with the help of
logarithms using the following formula:
$$\log_e(\text{Geometric mean}) = \frac{1}{n}\sum_{i=1}^{n} \log_e x_i,$$
or,
$$\text{Geometric mean} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \log_e x_i\right) \tag{9.4}$$
Note that the formula of the geometric mean appears analogous to the mean
formula given in (9.1); here instead of the $x_i$’s, we have the $\log_e x_i$’s. So, the
geometric mean given in (9.4) can be computed using the mean(), log()
and exp() functions as follows:
#Computing the geometric mean using logarithms
> exp(mean(log(Production)))
[1] 2414.789

You can observe that using both the approaches we get the same geometric
mean as 2414.789.
Next, we discuss the computation of the geometric mean for the grouped data.
We already know that the geometric mean for the grouped data is computed
using the following formula:
$$\text{GM} = (x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n})^{1/N}, \tag{9.5}$$
where $N = \sum_{i=1}^{n} f_i$ and n is the number of class intervals.

For the illustration purpose, we consider the following arbitrary data:


Class Intervals Frequency
1-3 4
3-5 3
5-7 2
7-9 1
9-11 1
11-13 8
13-15 2

To compute the geometric mean from this grouped data using data frames, we
first create a data frame named data with three columns consisting of lower
limits, upper limits and frequencies as follows:
#Creating a data frame of the data
> data <- data.frame(ll=seq(1, 13, 2), ul=seq(3, 15, 2),
freq=c(4, 3, 2, 1, 1, 8, 2))

Note that the first two columns, namely ll and ul, of the data frame
represent the lower and upper limits of the intervals. Also, the third column
freq represents the frequencies given in the data.
Additionally, since here we have not attached the data frame data, we
need to use the data frame name together with the column name to use any
column (or otherwise we can use indices of the data frame). Here, to compute
the geometric mean, we first compute the middle values and assign them to
xi. Then we extract the frequencies and assign them to fi. Moreover, we
compute the sum of the frequencies and assign it to N as follows:
#Computing the mid values
> xi <- (data$ll+data$ul)/2; xi
[1] 2 4 6 8 10 12 14

#Assigning frequencies
> fi <- data$freq;fi
[1] 4 3 2 1 1 8 2

#Computing total frequency N
> N <- sum(data$freq);N
[1] 21

After getting all these quantities, finally we compute the geometric mean using
the formula given in (9.5) as follows:
#Computing the geometric mean
> (prod(xi^fi))^(1/N)
[1] 6.735228

An alternative method of computing the geometric mean for the grouped data
by using logarithms is as follows:
$$\log_e(\text{Geometric mean}) = \frac{1}{N}\sum_{i=1}^{n} f_i \log_e x_i,$$
or,
$$\text{Geometric mean} = \exp\left(\frac{1}{N}\sum_{i=1}^{n} f_i \log_e x_i\right). \tag{9.6}$$

#Computing the geometric mean using (9.6)
> exp(sum(fi*log(xi))/N)
[1] 6.735228

Hence, both versions of the formula give the same result.
In addition, note that we can also use the weighted.mean() function to
compute the geometric mean. To do so, we assign the x argument of the
weighted.mean() function as the logarithms of the middle values of the
corresponding class intervals and the w argument as the frequencies
divided by the total frequency as follows:
#Computing the geometric mean using weighted.mean() function
> exp(weighted.mean(x=log(xi), w=fi/N))
[1] 6.735228
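
The logarithmic form (9.6) is also the numerically safer of the two versions: with large frequencies the product prod(xi^fi) can exceed the largest number R can store and become Inf. The following is a minimal sketch of this effect using made-up values; the object names xi_big, fi_big, N_big, gm_prod and gm_log are illustrative and are not part of this unit.

```r
# Hypothetical grouped data with large frequencies (illustrative values only)
xi_big <- c(10, 100)
fi_big <- c(200, 200)
N_big <- sum(fi_big)

# Product form: 10^200 * 100^200 = 10^600 is too large for a double
gm_prod <- (prod(xi_big^fi_big))^(1/N_big)

# Logarithmic form (9.6): the sum of logarithms stays small
gm_log <- exp(sum(fi_big * log(xi_big)) / N_big)

gm_prod   # Inf
gm_log    # equals 10^1.5
```

For data of the size used in this unit both versions agree, so the choice matters only when the frequencies or the values are large.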

9.2.3 Harmonic Mean


We already know that the harmonic mean of a number of observations (none
of which is zero) is the reciprocal of the arithmetic mean of the reciprocals
of the individual observations. Thus, the harmonic mean for ungrouped and
grouped data is given by the following formulae:
Harmonic mean $= \dfrac{n}{\sum_{i=1}^{n} \dfrac{1}{x_i}}$, for ungrouped data …(9.7)

and,

Harmonic mean $= \dfrac{N}{\sum_{i=1}^{n} \dfrac{f_i}{x_i}}$, for grouped data …(9.8)

Now, we discuss the method of computing the harmonic mean. For the
illustration purpose, we compute the harmonic mean of the following arbitrary
ungrouped data:
15, 25, 35, 70, 50
We first assign the data to a vector named x, then use the mean() function to
compute the harmonic mean using the formula given in (9.7) as follows:
#Assigning the data under the name x
> x <- c(15, 25, 35, 70, 50)

#Computing the harmonic mean for the ungrouped data
> 1/mean(1/x)
[1] 29.49438

Next, we consider arbitrary grouped data to illustrate the method of
computation of the harmonic mean. Consider the following data:

Class Intervals Frequency


0-10 8
10-20 15
20-30 20
30-40 4
40-50 3

To compute the harmonic mean from given data, we first create a data frame
named data1 on the same lines as earlier. Then we compute the middle
values and assign them to xi. Also, assign the frequencies to fi and total of
the frequencies to N as follows:

#Creating a data frame of the grouped data
> data1 <- data.frame(ll=seq(0, 40, 10), ul=seq(10, 50, 10),
+ freq=c(8, 15, 20, 4, 3))

#Computing the middle values
> xi <- (data1$ll+data1$ul)/2; xi
[1] 5 15 25 35 45

#Assigning the frequencies
> fi <- data1$freq; fi
[1] 8 15 20 4 3

#Computing total frequency
> N <- sum(data1$freq); N
[1] 50

Finally, we compute the harmonic mean using the formula given in (9.8) and
by using the weighted.mean() function as follows:
#Computing the harmonic mean using formula
> N/sum(fi/xi)
[1] 13.96277

#Computing the harmonic mean using weighted.mean() function


> 1/(weighted.mean(x=1/xi, w=fi/N))
[1] 13.96277
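
As a quick check on the three means discussed in this section, recall that for positive data the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean. The sketch below verifies this on the grouped data above. It passes the frequencies directly as weights, since weighted.mean() rescales the weights by their sum; the names am, gm and hm are illustrative.

```r
# Middle values and frequencies of the grouped data used above
xi <- c(5, 15, 25, 35, 45)
fi <- c(8, 15, 20, 4, 3)

am <- weighted.mean(xi, fi)             # arithmetic mean
gm <- exp(weighted.mean(log(xi), fi))   # geometric mean
hm <- 1 / weighted.mean(1 / xi, fi)     # harmonic mean

c(AM = am, GM = gm, HM = hm)
am >= gm && gm >= hm                    # TRUE
```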

SAQ 1
(i) Write R code to compute the geometric mean of the following data:

Observation Frequency

x1 f1

x2 f2

x3 f3

x4 f4

x5 f5

(ii) Write the output of the following statement:


mean(c(2,4,NA,6,1), na.rm=TRUE)

9.3 MEDIAN
The median is a positional average, which comes under the measures of central
tendency. It appears in the 'middle' of an ordered sequence of values, i.e.,
it divides the data in such a way that half of the observations are lower
than it and half are greater than it. So, after sorting the data, the median
of ungrouped data is given by the following formula:
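Median $= \begin{cases} \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ observation}, & \text{if } n \text{ is odd} \\[2ex] \dfrac{\left(\dfrac{n}{2}\right)^{\text{th}} + \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2}, & \text{if } n \text{ is even} \end{cases}$ …(9.9)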
So, depending on n (the number of observations), one of two expressions is
used to compute the median. To choose between them we can create a
conditional statement, i.e., an if-else statement.
For the illustration purpose, we use (9.9) to compute the median of the
following data of wages of 8 workers:
5580, 5600, 4600, 4607, 5034, 4666, 5612, 5123
To do so, we first assign the data to a vector named x and compute its length
using the length() function as follows:
#Assigning the data under the name x
> x<-c(5580, 5600, 4600, 4607, 5034, 4666, 5612, 5123)

#Assigning the length to n


> n<-length(x); n
[1] 8

Next, we sort x and assign the sorted data to sx as follows:
#Sorted x
> sx<-sort(x); sx
[1] 4600 4607 4666 5034 5123 5580 5600 5612

After sorting the data, we now create an if-else statement to compute the
median, depending on whether n is even or odd, using (9.9). To check
whether n is even or not, we can use the remainder operator and check the
condition n%%2==0. If this condition is TRUE then n is even, otherwise it is odd.
#Creating an if-else statement to compute median
> if(n%%2==0) {
+ cat("Median", (sx[n/2]+sx[(n/2)+1])/2,"\n")
+ } else {
+ cat("Median", sx[(n+1)/2],"\n")}
Median 5078.5
Hence, the computed median is 5078.5.
Alternatively, we can use the median() function to compute the median of
ungrouped data as follows:
#Computing the median of the data
> median(x)
[1] 5078.5

Hence you can observe that, while using the median() function, we just need
to supply the vector as an argument to the function and we do not need to sort
it. Moreover, this function works for an even as well as an odd number of
observations; see the following statements for more clarification.
#Computing median of odd number of terms
> median(1:7)
[1] 4
#Computing median of even number of terms
> median(1:8)
[1] 4.5
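
Like mean(), the median() function accepts the na.rm argument for data containing missing values. The following short sketch uses made-up wage values (the vector name w is illustrative) to show the difference:

```r
# Hypothetical wage data with one missing value
w <- c(5580, 5600, NA, 4607, 5034)

median(w)                 # NA, because of the missing value
median(w, na.rm = TRUE)   # median of the four observed values
```

With na.rm=TRUE the missing value is dropped first, so the result is the average of the two middle observed values, 5034 and 5580.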

Next, we consider the case when we have continuous frequency data. In case
of continuous frequency data, the median class is computed first, where the
median class is the class in which the cumulative frequency is just greater than
$N/2$, where $N = \sum_{i=1}^{n} f_i$. Also, recall that the median is computed using the
following formula:

Median $= l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$, …(9.10)

where, l is the lower limit of the median class,


f is the frequency of the median class,
h is the magnitude of the median class,
C is the cumulative frequency of the class preceding the median
class.
Now we shall illustrate the method of computing the median of the following
continuous frequency data of wages using formula:
Wages (in Rs) Number of Workers
200-300 4
300-400 6
400-500 25
500-600 15
600-700 8

We can create a data frame to hold the given wages data, but for a change
we now assign the given data to three vectors named wages_ll, wages_ul
and fi, consisting of the lower limits, upper limits and frequencies
respectively.
#Assigning the data into different vectors
> wages_ll <- seq(200, 600, 100)
> wages_ul <- seq(300, 700, 100)
> fi <- c(4, 6, 25, 15, 8)

Thereafter, we compute the cumulative frequencies using the cumsum()
function and the total of the frequencies. Also, we assign them to Cum_fi and N
as follows:
#Computing cumulative frequencies
> Cum_fi <- cumsum(fi); Cum_fi
[1] 4 10 35 50 58

#Computing total frequency
> N <- sum(fi); N
[1] 58

Next, we use Cum_fi and N to find the median class, i.e., the first class at
which the cumulative frequency is just greater than N/2, using the min() and
which() functions. Also, we assign the computed index of the median class to
ind as follows:
#Computing the median class
> ind <- min(which(Cum_fi>N/2)); ind
[1] 3

After computing the median class, we next compute the magnitude of the
median class and assign it to h as follows:
#Computing the magnitude of the median class
> h <- wages_ul[ind]-wages_ll[ind]; h
[1] 100

Finally, after computing all these quantities, we compute the median using
(9.10) as follows:
#Computing the median
> wages_ll[ind]+(h/fi[ind])*(N/2-Cum_fi[ind-1])
[1] 476

Hence, 476 is the median of the given data.
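
The steps above can be collected into one function. The sketch below is only an illustration: the function name grouped_median and its argument names are hypothetical. It also covers one case the step-by-step statements do not: when the median class is the first class, there is no preceding class, Cum_fi[ind-1] is empty, and so C is taken as 0.

```r
# Hypothetical helper (not from the unit): median of continuous frequency data
grouped_median <- function(ll, ul, fi) {
  N <- sum(fi)
  cum_fi <- cumsum(fi)
  ind <- min(which(cum_fi > N / 2))          # index of the median class
  h <- ul[ind] - ll[ind]                     # magnitude of the median class
  C <- if (ind == 1) 0 else cum_fi[ind - 1]  # no preceding class when ind is 1
  ll[ind] + (h / fi[ind]) * (N / 2 - C)      # formula (9.10)
}

# Wages data from the text
grouped_median(seq(200, 600, 100), seq(300, 700, 100), c(4, 6, 25, 15, 8))

# Made-up data in which the first class is the median class
grouped_median(c(0, 10, 20), c(10, 20, 30), c(10, 3, 2))
```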

SAQ 2
Create a function to compute the median of the ungrouped data.

9.4 MODE
It is well known that the mode is the value which occurs most frequently in
the data. But there may be cases when the maximum occurrence concept does
not work, for example when the maximum frequency is attained by more than
one value. In such situations, the problem is solved using the method of
grouping (not discussed here). Here, we consider only the case when the
distribution of the data is unimodal and the mode can be computed just by
checking the maximum occurrence of an observation.
For the illustration purpose we now discuss the method of computing the
mode of the following data:
15, 16, 16, 15, 15, 15, 14, 12, 15, 15, 16, 14, 13, 12, 13, 12, 15, 11, 16, 15,
12, 13, 17, 14, 14, 13, 14, 12, 16, 17, 13, 15, 11, 15, 15, 13, 11, 17, 16, 14,
16, 12, 14, 15, 15, 14, 14, 12, 13, 14, 14, 14, 15, 14, 16, 16, 14, 13, 13, 14,
11, 15, 17, 15, 15, 17, 16, 15, 16, 13, 14, 17, 16, 15, 13, 11, 15, 13, 13, 12,
14, 14, 14, 15, 13, 14, 16, 12, 16, 13, 14, 17, 12, 13, 15, 14, 16, 14, 14, 13,
14, 17, 14, 16, 15
Firstly, we create a vector of the data and name it x in the following
manner:

#Assigning the data
> x<-c(15, 16, 16, 15, 15, 15, 14, 12, 15, 15, 16, 14, 13, 12,
+ 13, 12, 15, 11, 16, 15, 12, 13, 17, 14, 14, 13, 14, 12, 16,
+ 17, 13, 15, 11, 15, 15, 13, 11, 17, 16, 14, 16, 12, 14, 15,
+ 15, 14, 14, 12, 13, 14, 14, 14, 15, 14, 16, 16, 14, 13, 13,
+ 14, 11, 15, 17, 15, 15, 17, 16, 15, 16, 13, 14, 17, 16, 15,
+ 13, 11, 15, 13, 13, 12, 14, 14, 14, 15, 13, 14, 16, 12, 16,
+ 13, 14, 17, 12, 13, 15, 14, 16, 14, 14, 13, 14, 17, 14, 16,
+ 15)

To get the frequencies corresponding to each unique number present in the
data, we use the table() function (already discussed in Unit 5 of the MST-015
course) as follows:
#Computing the frequencies
> table(x)
x
11 12 13 14 15 16 17
 5 10 17 26 23 16  8

From this output it is clear that the most frequently occurring value is 14,
which occurs 26 times. Therefore, the mode of the data is 14.
The same result can be obtained with the help of the which() function as
follows:
#Computing the maximum occurring number with its index
> which(table(x)==max(table(x)))
14
4

From this output we conclude that the maximum frequency 26 corresponds to
the number 14, which means 14 is the mode and is present at the 4th place of
the table() function output.
t ble(

SAQ 3
Explain the execution of the which() function with an example.

9.5 VARIANCE AND STANDARD DEVIATION


Variance is used to describe the variability or dispersion of the observations.
Two or more data sets may have the same mean and yet differ widely in the
spread of their distributions, which can be seen with the help of the
variance. Variance is defined as the arithmetic mean of the squares
of the deviations of the data from its arithmetic mean. Variance in R can be
computed using the var() function available in the stats package, but this
function uses the denominator (n-1) instead of n.
Recall that the variance for the ungrouped and grouped data are obtained by
using the following formulae:
Variance $= \dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$, for ungrouped data …(9.11)

and

Variance $= \dfrac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2$, for grouped data …(9.12)

Now we illustrate the method of computing the variance. To do so, we once
more consider the Income data discussed in Section 9.2 and compute the
variance of the data using the formula given in (9.11) as follows:
#Computing the variance
> sum((Income-mean(Income))^2)/length(Income)
[1] 60.86222

Note that here to compute the variance, we have used the sum(), mean()
and length() functions. The variance can also be computed using the same
formula with the help of mean() function only as follows:
#Computing the variance
> mean((Income-mean(Income))^2)
[1] 60.86222

Hence, using both the approaches we get the same variance.
Next, we use the var() function to compute the variance of the Income
data and see what we get:
#Computing the variance using var() function
> var(Income)
[1] 65.20952

A clear difference between the computed values of the variance, 60.86222
and 65.20952, can be seen. This difference arises due to the change in
denominator, because the var() function uses the denominator (n-1)
instead of n. This can be verified from the following R statement:
#Computing the length
> n <- length(Income)

#An alternative to var() function
> sum((Income-mean(Income))^2)/(n-1)
[1] 65.20952

The variance with denominator n can be computed by multiplying the output of
var() by (n-1) and dividing it by n as follows:
#Computing the variance with denominator n using var() function
> (n-1)*var(Income)/n
[1] 60.86222

Hence, verified.
Note: The var() function also supports the na.rm argument for handling
missing values.
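
A short sketch of the na.rm argument of var() on made-up data is given below. The names y and n_obs are illustrative. Note that when rescaling to the denominator n, the count should include only the observed (non-missing) values.

```r
# Hypothetical data with a missing value
y <- c(2, 4, NA, 6)

var(y)                # NA
var(y, na.rm = TRUE)  # variance of 2, 4, 6 with denominator (n-1)

# Variance with denominator n, counting only the observed values
n_obs <- sum(!is.na(y))
(n_obs - 1) * var(y, na.rm = TRUE) / n_obs
```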
Now, we shall illustrate the method of computing the variance for
continuous frequency data. For the illustration purpose consider the following
example in which the breaking strength of 79 'test pieces' of a certain alloy is
given.
Breaking Strength Number of Pieces
54-56 4
56-58 22
58-60 25
60-62 19
62-64 9

Now, we illustrate the method of computing the variance of the breaking
strength of the alloy. To do so, we first create 3 vectors BS_ll, BS_ul and fi
consisting of the lower limits, upper limits and frequencies of the breaking
strength data as follows:
#Assigning breaking strength data
> BS_ll <- seq(54, 62, 2)
> BS_ul <- BS_ll+2
> fi <- c(4, 22, 25, 19, 9)

Next, we compute the middle values using lower and upper limits and assign
them to xi as follows:
#Computing the middle values
> xi <- (BS_ll+BS_ul)/2
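After assigning all these variables, we use them to compute the variance
of the breaking strength data using the formula given in (9.12). Note that the
mean in (9.12) is the mean of the grouped data, i.e., the frequency-weighted
mean of the middle values, so we compute it with weighted.mean(xi, fi) rather
than mean(xi):
#Computing the variance of the grouped data
> sum(fi*(xi-weighted.mean(xi, fi))^2)/sum(fi)
[1] 4.677456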
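
One way to sanity-check a grouped-data variance is to expand the grouped data into a long vector, repeating each middle value as many times as its frequency, and then rescale the output of var() to the denominator N. The sketch below does this for the breaking strength data. It takes the mean in (9.12) to be the frequency-weighted mean sum(fi*xi)/N; the names xbar, v_grouped, x_long and v_check are illustrative.

```r
# Breaking strength data from the text
BS_ll <- seq(54, 62, 2)
BS_ul <- BS_ll + 2
fi <- c(4, 22, 25, 19, 9)
xi <- (BS_ll + BS_ul) / 2
N <- sum(fi)

# Formula (9.12) with the frequency-weighted mean of the middle values
xbar <- sum(fi * xi) / N
v_grouped <- sum(fi * (xi - xbar)^2) / N

# Cross-check: repeat each middle value fi times and rescale var()
x_long <- rep(xi, times = fi)
v_check <- (N - 1) * var(x_long) / N

c(v_grouped, v_check)   # the two values agree
```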

Standard Deviation:
The standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of the data from its arithmetic mean.
The standard deviation of the data can be computed using the following
formula:

Standard Deviation $= \sqrt{\text{Variance}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$, …(9.13)

where n is the length of the data.


With the help of the standard deviation and mean, we can also compute a
popular relative measure of dispersion, the 'coefficient of variation', for
comparing the variability of two or more data sets. We have already discussed
the method of computing variance, so to compute the standard deviation, we just
need to compute its square root. For the illustration purpose, we now compute
the standard deviation of the Income data as follows:
#Computing standard deviation
> sqrt(mean((Income-mean(Income))^2))
[1] 7.801424

Hence, the standard deviation of the data is 7.801424.
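
The coefficient of variation mentioned above is the standard deviation expressed as a percentage of the mean. A minimal sketch is given below, assuming the standard deviation with denominator n as in (9.13). The function name cv and the two small data sets a and b are made up for illustration.

```r
# Hypothetical helper (not from the unit): coefficient of variation in percent,
# using the standard deviation with denominator n as in (9.13)
cv <- function(x) {
  s <- sqrt(mean((x - mean(x))^2))
  100 * s / mean(x)
}

# Two made-up data sets with the same mean but different spread
a <- c(48, 50, 52, 50, 50)
b <- c(30, 70, 50, 40, 60)
cv(a)
cv(b)   # larger value, i.e., more variable relative to its mean
```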


Alternatively, the standard deviation of data can be computed directly by using
the sd() function available in the stats package. But this function uses
(n-1) in the denominator of the formula instead of n,
which means the sd() function computes the following expression:

$\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$.

Therefore, if we want to compute the expression given in (9.13), we need
to multiply the output computed from the sd() function by $\sqrt{\dfrac{n-1}{n}}$.
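
The relation between the two versions of the standard deviation can be checked numerically. The following sketch uses a small made-up vector z; the names sd_n and sd_n1 are illustrative.

```r
# Made-up data for the check
z <- c(12, 15, 11, 18, 14, 20)
n <- length(z)

sd_n  <- sqrt(mean((z - mean(z))^2))   # expression (9.13), denominator n
sd_n1 <- sd(z)                         # denominator (n-1)

all.equal(sd_n, sd_n1 * sqrt((n - 1) / n))   # TRUE
```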

The internal structure of the sd() function is as follows:

#Internal structure
> str(sd)
function (x, na.rm = FALSE)

From this internal structure it is clear that this function also supports the logical
argument na.rm with default value FALSE. This argument can be used on
similar lines (as earlier) when missing values are present.
Note: The sd() function gives the same result as sqrt(var()). For
more clarification we compute the standard deviation of the Income data
using both approaches.
#Computing the standard deviation using sd() function
> sd(Income)
[1] 8.075241

#Verification of the obtained result
> sqrt(var(Income))
[1] 8.075241

Hence, it is verified that the sd() function returns the square root of the
var() function result.
Furthermore, the cov() function can also be used to compute the variance of
data using the following fact:
Variance(X) = Covariance(X, X) …(9.14)
So the variance of the income data using the cov() function can also be
computed as follows:
#Computing the variance using the cov() function
> cov(Income, Income) #Same as var(Income)
[1] 65.20952

Note: Like the var() function, the cov() function also uses (n-1) in the
denominator of the expression.

SAQ 4
Write R code to compute the standard deviation of the following data:

116, 151, 116, 179, 141, 197, 191, 197, 160, 175, 162, 137, 122, 194,
128, 115, 140, 165, 123, 151, 179, 178, 189, 185, 143, 152, 195, 152,
117, 165, 199, 163, 173, 178, 172, 173, 179, 159, 191, 158

9.6 RANGE, QUARTILE DEVIATION AND MEAN DEVIATION
Measures of dispersion (scatteredness) are mainly used to have an idea about
the homogeneity or heterogeneity in the data. In this section we shall discuss
the method of computing range, quartile deviation and mean deviation of the
data.

9.6.1 Range
The range of ungrouped data can be computed in R in several ways.
We know that the range of the data is defined as the difference between the
maximum and minimum values. So, we can easily compute the range in R by
using the max() and min() functions. For the illustration purpose we now
compute the range of the following monthly number of grocery items
purchased data:
500, 600, 250, 700, 650, 800, 790
#Assigning the data
> purchase <- c(500, 600, 250, 700, 650, 800, 790)
#Computing range
> max(purchase)-min(purchase)
[1] 550

Hence, the range of the data is 550. The range of the data can also be computed
using the range() and summary() functions. The range() and summary()
functions are available in the base package. The range() function returns the
smallest and the greatest observation present in the data and its internal
structure is as follows:
#Internal structure of the range() function
> str(range)
function (..., na.rm = FALSE)
Note: From the obtained internal structure, it is clear that the range() function
also supports the na.rm argument to handle missing values.
Next, we compute the range of the purchase data using the range() function
as follows:
#Using range() function on data
> range(purchase)
[1] 250 800

The first value of this output is the smallest observation of the data and the
second value is the largest observation of the data. Hence the range()
function is returning the smallest and the largest observations of the data. The
difference between the two computed observations can be obtained using the
diff() function as follows:
#Computing the range
> diff(range(purchase))
[1] 550
Next, we take arbitrary data containing missing values and compute the
range from it using the range() function as follows:
#Computing the range of the data
> range(c(30, NA, 25, 70, 55, 80, NA), na.rm=TRUE)
[1] 25 80
> diff(range(c(30, NA, 25, 70, 55, 80, NA), na.rm=TRUE))
[1] 55

Hence, 55 is the range of the data.

9.6.2 Quartile Deviation


It is well known that the quartile deviation of the data is given by the following
expression: $\dfrac{Q_3 - Q_1}{2}$,
where $Q_1$ and $Q_3$ are the first and third quartiles of the data.
The quartiles can be recalled at a glance from the following figure:

Fig. 9.1: Quartiles positioning

The quartiles of the given data can be computed using the quantile()
function available in the stats package. The main arguments of interest of the
quantile() function are as follows:
#The quantile() function
quantile(x, #numeric vector
probs, #vector of probabilities
names, #whether to show percentile names
na.rm, #to handle missing values
type, #algorithm to compute quantile
...) #other arguments of the function

The quantile() function is used to compute the sample quantiles of the vector
assigned to the x argument corresponding to the given probabilities. These
probabilities are assigned to the probs argument of the function (written as
prob in the statements of this unit, which R accepts through partial matching
of argument names). Also, from the R documentation page we
know that the smallest observation corresponds to a probability of 0 and the
largest to a probability of 1. Additionally, there is one more argument type
with default value 7 which takes integer values from 1 to 9. These integer
values represent the algorithm of the quantile computation. Values 4 to
9 are used for continuous sample quantiles and values 1 to 3 are used for
discontinuous sample quantiles. In this unit we use type as 1.
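
The effect of the type argument can be seen by computing the quartiles of the same data with two algorithms. The sketch below compares type 1 with the default type 7 on the purchase data of Section 9.6.1; the object names p, q1 and q7 are illustrative.

```r
# Purchase data from Section 9.6.1
purchase <- c(500, 600, 250, 700, 650, 800, 790)
p <- c(0.25, 0.50, 0.75)

q1 <- quantile(purchase, probs = p, type = 1)  # always an observed value
q7 <- quantile(purchase, probs = p)            # default type = 7, interpolates
rbind(type1 = q1, type7 = q7)
```

With type 1 the quartiles are 500, 650 and 790, all of which are observations of the data, whereas the default type 7 interpolates between neighbouring observations and gives 550, 650 and 745.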
Now, we use the quantile() function to compute the first quartile (Q1),
median (Q2) and third quartile (Q3) of the purchase data discussed earlier by
assigning the x argument of the quantile() function as the purchase vector,
the prob argument as 25% (for Q1), 50% (for Q2) and 75% (for Q3), and
the type argument as 1 in the following manner:
#Computing the quartiles of the purchase data
> quantile(x=purchase, prob=c(0.25, 0.50, 0.75), type=1)
25% 50% 75%
500 650 790

Hence, 500 is the first quartile, 650 is the median and 790 is the third quartile
of the purchase data.
Then the quartile deviation of the data, i.e., $\dfrac{Q_3 - Q_1}{2}$, can be easily computed
using the diff() function by assigning the lag argument (default as 1) of it
as 2 in the following manner:
#Computing quartile deviation
> diff(quantile(x=purchase, prob=c(0.25,0.50,0.75), type=1,
+ names=FALSE), lag=2)/2
[1] 145

Hence, the quartile deviation of the purchase data is 145.
Note that here we have assigned the names argument (TRUE by default) as
FALSE as we do not want to show the first row consisting of the percentage
names.
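
The difference $Q_3 - Q_1$ is the interquartile range, which the stats package also provides directly through the IQR() function. As a brief sketch, the quartile deviation of the purchase data with the same algorithm type can be obtained as:

```r
purchase <- c(500, 600, 250, 700, 650, 800, 790)

# IQR() returns Q3 - Q1; type = 1 selects the same algorithm as above
IQR(purchase, type = 1) / 2
```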
The first and third quartiles of the data can also be easily computed by
using the summary() function available in the base package. The
summary() function is a generic function also used to produce summaries
of the results of various model fitting functions. Recall that,
when this function is used on a numeric vector, it provides the six number
summary of the data, which includes the minimum, first quartile, median,
mean, third quartile and maximum. We now compute the quartiles of the
purchase data using the summary() function and assign the result to Qcom as
follows:
#Computing the summary of the data
> Qcom <- summary(purchase, quantile.type=1); Qcom
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  250.0   500.0   650.0   612.9   790.0   800.0

#Computing the quartile deviation
> unname((Qcom[5]-Qcom[2])/2)
[1] 145

Hence, again we get the same quartile deviation, 145. Thus, both the
approaches give the same result.
Note: (i) To compute the six number summary of the purchase data, we
just supply the purchase data as an argument to the summary()
function. But note that earlier we computed the quartiles by considering
the type argument as 1 (whose default value is 7). By default, the
summary() function also computes the quartiles using type 7. To
specifically use algorithm type 1, we have assigned the
quantile.type argument of the summary() function as 1.
(ii) The unname() function is used to remove the name of the computed
quartile deviation. Also, Qcom[5] and Qcom[2] are Q3 and Q1 of the data.
(iii) The quartile deviations for ungrouped data and grouped data can be
computed on the same lines as the mean, using their formulae.

9.6.3 Mean Deviation


The mean deviation of the data $x_1, x_2, \ldots, x_n$ about a point $A$ (usually the mean or
median) is computed using the following expression:

Mean deviation $= \dfrac{1}{n}\sum_{i=1}^{n} |x_i - A|$ …(9.15)

This expression can be easily computed in R with the help of the mean() and
abs() functions as follows:
#Mean deviation of the x data about A point
mean(abs(x-A))

Now, for the illustration purpose, we compute the mean deviation about the mean
of the purchase data.
#Computing the mean deviation about mean of the purchase data
> mean(abs(purchase-mean(purchase)))
[1] 139.5918

Thus, the computed mean deviation about mean is 139.5918.


Note: On the similar lines the mean deviation about any point can be computed.
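
For instance, the expression mean(abs(x-A)) can be wrapped into a small function whose second argument is the point A. This is only a sketch: the function name mean_dev is hypothetical, and taking the mean as the default value of A is an assumption made here for convenience.

```r
# Hypothetical helper (not from the unit): mean deviation about a point A
mean_dev <- function(x, A = mean(x)) mean(abs(x - A))

purchase <- c(500, 600, 250, 700, 650, 800, 790)
mean_dev(purchase)                      # about the mean
mean_dev(purchase, median(purchase))    # about the median
```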

SAQ 5
Write the output of the following two statements and also differentiate between
them.
(i) diff(na.omit(c(2,4,NA,6,1)), lag=1, na.rm=TRUE)
(ii) diff(c(2,4,NA,6,1), lag=3, na.rm=TRUE)

9.7 SKEWNESS AND KURTOSIS


In previous sections, we have discussed a number of measures, but these
measures do not reveal everything about the frequency distribution of the
data. Two data sets may have the same mean and standard deviation and yet
differ in the shape of their frequency distributions; in such cases we look at
the skewness and kurtosis of the data.
9.7.1 Skewness
Skewness refers to a departure from symmetry (asymmetry of the distribution).
Skewness plays a very important role because the statistical theory is often
based on the assumption of the normal distribution and normal distribution is a
symmetric distribution.

Fig. 9.2: Types of skewness


The constant $\beta_1 = \mu_3^2 / \mu_2^3$ is used as a measure of skewness. This constant
involves the 3rd and 2nd moments about mean. We know that $\mu_r$, the rth moment
about mean, is given by

$\mu_r = \begin{cases} \dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r, & \text{for ungrouped data} \\[2ex] \dfrac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^r, & \text{for grouped data} \end{cases}$ …(9.16)

As $\beta_1$ is always positive, it does not tell us about the direction of skewness.
Therefore, Karl Pearson's $\gamma_1$ is calculated using $\gamma_1 = \pm\sqrt{\beta_1}$; recall that the sign
of skewness depends upon the value of $\mu_3$.

For the illustration purpose, we consider ungrouped data consisting of the
mother's weight in pounds at last menstrual period, i.e., the lwt variable of the
birthwt data available in the MASS package. To visualize the skewness of
the data we create a density histogram of the lwt variable of the birthwt
data as follows:
#Loading the package
> library(MASS)
#Creating histogram of the lwt variable
> hist(birthwt$lwt, prob=TRUE, col="lightpink")

The obtained density histogram is shown in Fig. 9.3.
From Fig. 9.3, it can be seen that the data is positively skewed. Now, we shall
illustrate both the ways of obtaining a quantitative measure of the skewness
(using the formula and by using a function of the moments package).

Note: Whenever a histogram is created to see the probability distribution, set
the prob argument to TRUE to create a density histogram.

Fig. 9.3: Density histogram of the weight data

To compute the skewness of the ungrouped data x consisting of the lwt
variable of the birthwt data, we first create a function named mom using
(9.16) (for ungrouped data) as follows:
#Assigning x
> x <- birthwt$lwt

#Creating a function to compute the rth moments about mean
> mom <- function(x, r){
+ sum((x-mean(x))^r)/length(x)
+ }

Recall that the creation of a function is already discussed in Unit 6 of the MST-015
course. After creating the function mom, we next use it to compute the 2nd and
3rd moments. Also, we assign the computed values to Mu2 and Mu3 as follows:
#Computing 2nd moment about mean
> Mu2 <- mom(x, 2); Mu2
[1] 930.1509

#Computing 3rd moment about mean
> Mu3 <- mom(x, 3); Mu3
[1] 39455.9

Thus, from the obtained results we get $\mu_3$ = 39455.9 and $\mu_2$ = 930.1509.
Finally, we compute $\beta_1$ and $\gamma_1$. Additionally, we assign them to Beta1 and
Gamma1 as follows:
#Computing the constant beta 1
> Beta1 <- Mu3^2/Mu2^3; Beta1
[1] 1.934478

#Computing the constant gamma 1
> Gamma1 <- sqrt(Beta1); Gamma1
[1] 1.390855

The positive sign of $\mu_3$, together with $\gamma_1$ being greater than zero, indicates
positive skewness. The same can be verified from the density histogram as well.

Alternatively, the skewness() function available in the moments package
can be used to compute the skewness of the data. Note that the moments
package does not come as part of the binaries for the base distribution, so we
first need to install it and load it before using it as follows:
#Installing the package
> install.packages("moments")

#Loading the package
> require(moments)

Then the skewness of the lwt variable of the birthwt data can be computed
using the skewness() function as follows:
#Computing the skewness of the lwt variable
> skewness(birthwt$lwt)
[1] 1.390855

Hence, the obtained result confirms that the skewness() function returns the
value of $\gamma_1$.
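
Since sqrt() always returns a non-negative value, sqrt(Beta1) by itself cannot show negative skewness. One way to keep the direction is to divide the 3rd moment by the 2nd moment raised to the power 3/2, which has the same magnitude as $\sqrt{\beta_1}$ and the sign of $\mu_3$. The sketch below illustrates this on made-up data with a long left tail; the function name skew_signed and the vector left_tail are hypothetical.

```r
# Moment function as defined in this unit
mom <- function(x, r) sum((x - mean(x))^r) / length(x)

# Hypothetical helper: skewness carrying the sign of the 3rd moment
skew_signed <- function(x) mom(x, 3) / mom(x, 2)^(3/2)

# Made-up data with a long left tail
left_tail <- c(-10, 1, 2, 3, 4)
sqrt(mom(left_tail, 3)^2 / mom(left_tail, 2)^3)  # non-negative by construction
skew_signed(left_tail)                           # negative, same magnitude
```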

Next, we discuss the method of computing the skewness for continuous
frequency data. To do so, we again consider the breaking strength data
discussed in Section 9.5 of this unit. Proceeding with the same object
names and assigned data, we get:
#Recalling breaking strength data
> BS_ll <- seq(54, 62, 2) #lower limit
> BS_ul <- BS_ll+2 #upper limit
> fi <- c(4, 22, 25, 19, 9) #frequencies

To compute the skewness of the grouped data, we compute the middle values
first and assign them to xi. Additionally, to compute the moments about mean
we use the already created function momf as follows:
#Computing the middle values
> xi<-(BS_ll+BS_ul)/2

#Recalling the function to compute the moments about mean
> momf <- function(xi, fi, r){
+ N <- sum(fi) #Computing total frequency
+ m <- sum(fi*xi)/N #Computing mean
+ sum(fi*(xi-m)^r)/N #Computing rth moment about mean
+ }

Next, we use it to compute the 2nd and 3rd moments. Also, we assign the
computed values to Muf2 and Muf3 as follows:
#Computing the 3rd moment about mean
> Muf3 <- momf(xi, fi, 3); Muf3
[1] 1.254521

#Computing the 2nd moment about mean
> Muf2 <- momf(xi, fi, 2); Muf2
[1] 4.677456

Thus, from the obtained results we get $\mu_3$ = 1.254521 and $\mu_2$ = 4.677456.
Finally, we compute $\beta_1$ and $\gamma_1$. Also, we assign them to Beta1 and Gamma1 as
follows:
#Computing beta 1
> Beta1 <- Muf3^2/Muf2^3; Beta1
[1] 0.01537897

#Computing gamma 1
> Gamma1 <- sqrt(Beta1); Gamma1
[1] 0.124012
It can be observed that the computed value of $\beta_1$ is close to zero, and $\gamma_1$ is
also close to zero, which indicates that the distribution of the data is
nearly symmetric.

9.7.2 Kurtosis
Recall that the normal curve is known as ‘mesokurtic’ curve. A curve, which is
more peaked than a normal curve is known as ‘leptokurtic’ and the curve
flatter (flat-topped) than the normal curve is known as ‘platykurtic’.

Fig. 9.4: Types of kurtosis


The constant $\beta_2 = \mu_4 / \mu_2^2$ is used as a measure of kurtosis. Karl Pearson's $\gamma_2$
is calculated using $\gamma_2 = \beta_2 - 3$. Further, it is easier to interpret $\gamma_2$ for kurtosis
as compared to $\beta_2$. For the normal (mesokurtic) curve $\beta_2 = 3$ or $\gamma_2 = 0$.
If $\beta_2 > 3$, then the curve of the data is leptokurtic and if $\beta_2 < 3$ then the curve
is platykurtic.
For the illustration purpose, we now compute the kurtosis of the lwt variable
of the birthwt data. Recall that a function to compute the moments about
mean in case of ungrouped data is already created with name mom. Next, we
use mom to compute the 4th moment about mean of the data and assign it to
Mu4 as follows:
#Computing the 4th moment about the mean and assigning to Mu4
> Mu4
4 <-
- mom(x,
, 4);
; Mu4
4
278 [1] 4593401
Descriptive Statistics and Correlation with R
After computing $\mu_4$, we next compute $\beta_2$, using the value of $\mu_2$ that has already been computed and assigned to Mu2, as follows:
#Computing Beta 2
> Beta2 <- Mu4/(Mu2)^2; Beta2
[1] 5.309181

Since the obtained value is more than 3, we can say that the curve of the data
is moderately leptokurtic.
Alternatively, the kurtosis() function available in the moments package can be used to compute the kurtosis of the data as follows:
#Computing the kurtosis of the data
> kurtosis(birthwt$lwt)
[1] 5.309181

Hence, the obtained result confirms that the kurtosis() function returns the value of $\beta_2$.

Next, we discuss the method of computing the kurtosis from grouped data. For illustration, we again consider the breaking strength frequency data discussed earlier. Using the same object names and assigned data, together with the user-defined function momf for grouped data, we compute the kurtosis as follows:
#Computing 4th moment about mean and assigning to Muf4
> Muf4 <- momf(xi, fi, 4); Muf4
[1] 48.65873

#Computing the kurtosis


> Beta2 <- Muf4/(Muf2)^2; Beta2
[1] 2.224034

Since the obtained value is less than 3, we can say the curve of the data is
platykurtic.

SAQ 6
Write R code to compute the skewness of the following data:
0.7 1.2 -0.5 0.8 -0.1 0.6 -1.8 0.9 -2.8 0.2

9.8 CORRELATION AND BIVARIATE PLOTS


Here in this unit, we are mainly interested in the Pearson’s and Spearman’s
correlation coefficients only. The cor() function available in the stats
package is used to find the correlation between two variables, say, X and Y.
Let us first see the internal structure of this function as follows:
#Internal structure
> str(cor)
function (x, y = NULL, use = "everything", method =
c("pearson", "kendall", "spearman"))

The x and y arguments of the cor() function are used to supply the two variables, say X and Y, whose correlation is to be computed. Another important argument is method, which specifies the method used to compute the correlation: "pearson" (the default), "kendall" or "spearman". Moreover, the use argument (with default value "everything") takes a character string that specifies how the computation is handled in the presence of NA values. Additionally, the cor() function can also be used on matrices or data frames. In that case, the correlations between the columns of X and the columns of Y (when X and Y are both matrices or data frames) are computed.
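The behaviour of cor() on data frames described above can be sketched as follows. The built-in mtcars data and the particular columns chosen are assumptions made only for this illustration; they are not used elsewhere in this unit.

```r
# mtcars ships with R (datasets package); the column choice is arbitrary
X <- mtcars[, c("mpg", "hp")]
Y <- mtcars[, c("wt", "qsec")]

# Rows correspond to the columns of X, columns to the columns of Y
cm <- cor(X, Y, method = "pearson")
print(cm)

# Each entry equals the correlation of the corresponding pair of columns
all.equal(cm["mpg", "wt"], cor(mtcars$mpg, mtcars$wt))

# The same call with a rank-based method
cor(X, Y, method = "spearman")
```

The result is a 2 x 2 matrix whose row names are taken from X and whose column names are taken from Y.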

9.8.1 Karl Pearson’s Correlation Coefficient


Karl Pearson’s correlation coefficient is a quantitative measure of the linear relationship between two variables, say X and Y. The British biometrician Karl Pearson developed the formula for this measure, which is as follows:

$$r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2\,\sum_{i}(y_i-\bar{y})^2}}$$ …(9.17)

Next, we consider the following problem based on correlation, in which a psychologist wants to compare two methods, X and Y, of teaching. She selects a random sample of 20 students and groups them into 10 pairs so that the students in a pair have roughly equal scores on an intelligence test. In each pair, one student was taught by method X and the other was taught by method Y. These 10 pairs of students were examined after the course. The marks obtained by each pair of students are as follows:
Pair  1   2   3   4   5   6   7   8   9   10
X     23  30  18  15  29  18  28  29  19  27
Y     36  34  15  26  22  29  18  19  18  10

Now, we compute the Pearson’s correlation coefficient between the two sets of scores using (9.17) and the cor() function. To do so, we first assign the data to x and y vectors as follows:
#Assigning the data
> x <- c(23, 30, 18, 15, 29, 18, 28, 29, 19, 27)
> y <- c(36, 34, 15, 26, 22, 29, 18, 19, 18, 10)

To compute the Pearson’s correlation coefficient using (9.17), we use the


mean(), sum() and sqrt() functions as follows:

#Computing the Karl-Pearson’s correlation coefficient
> sum((x-mean(x))*(y-mean(y)))/sqrt(sum((x-mean(x))^2)*sum((y-mean(y))^2))
[1] -0.05191304

Hence, -0.05191304 is the computed Pearson’s correlation coefficient. A value this close to zero indicates that there is practically no linear correlation between the two sets of scores.

Next, we compute the correlation using the cor() function. To do so, we supply x and y as the first two arguments of the cor() function. Most importantly, to compute Karl Pearson’s correlation coefficient, we set the method to "pearson" in the following manner:
#Computing the Karl-Pearson’s correlation coefficient
> cor(x, y, method="pearson")
[1] -0.05191304

The output verifies that we get the same result as earlier.


In practice, it is not always possible to obtain complete details on each variable, and missing values come into the picture. In such a situation, the learners may have incomplete data for analysis, i.e., data containing NA values. If we compute the correlation of such incomplete data with the default settings, the cor() function returns NA. To handle this, we can set the use argument of the cor() function to "pairwise.complete.obs", which means that only those pairs in which both values are present are used to compute the correlation coefficient. For clarification, consider the following arbitrary data with missing values.
#Data with missing values
> x <- c(23, 30, 18, 15, 29, NA, 28, 29, NA, 27)
> y <- c(36, 34, 15, NA, 22, 29, 18, 19, 18, 10)

#Computing the correlation with missing values


> cor(x, y, method="pearson", use="pairwise.complete.obs")
[1] 0.1323423

It can be seen from the output that computation of correlation coefficient is


done only using those pairs which do not have missing values. The same can
be verified by computing the Karl Pearson’s correlation coefficient between the
pairwise complete observations as follows:
#Verification of the computed result
> x <- c(23, 30, 18, 29, 28, 29, 27)
> y <- c(36, 34, 15, 22, 18, 19, 10)
> cor(x, y, method="pearson")
[1] 0.1323423
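The verification above retypes the complete pairs by hand. A minimal sketch of the same check, assuming the complete.cases() function of the stats package is used to pick out the complete pairs, is given below. The object names x_na, y_na and ok are introduced here only for illustration.

```r
# Data with missing values, as in the text (stored under new, hypothetical names)
x_na <- c(23, 30, 18, 15, 29, NA, 28, 29, NA, 27)
y_na <- c(36, 34, 15, NA, 22, 29, 18, 19, 18, 10)

# TRUE for the pairs in which neither value is missing
ok <- complete.cases(x_na, y_na)
sum(ok)                       # number of complete pairs

# Pearson's correlation on the complete pairs only
r_manual <- cor(x_na[ok], y_na[ok], method = "pearson")

# The same quantity obtained through the use argument
r_pairwise <- cor(x_na, y_na, method = "pearson", use = "pairwise.complete.obs")

all.equal(r_manual, r_pairwise)
```

Seven of the ten pairs are complete, and both computations give the same coefficient.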

9.8.2 Spearman’s Correlation Coefficient


Spearman’s rank correlation coefficient measures the degree of association between two qualitative characteristics. It is generally used when the rank data, say $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, on two variables X and Y are given.
For Spearman’s rank correlation coefficient, we consider two cases: one in which the ranks are not repeated, known as the non-tied rank case, and another in which the ranks are repeated, known as the tied rank case.
In the non-tied rank case, it is assumed that no two individuals or units have the same rank, and the coefficient is computed using the following formula:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$ …(9.18)
where n is the number of values and $d_i$ is the difference between the ranks, i.e., $d_i = x_i - y_i$ for i = 1, 2, …, n.
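Formula (9.18) can be sketched in R as follows. The vectors a and b are made-up scores without repeated values, introduced only for this illustration; the result is compared with the value returned by the cor() function.

```r
# Made-up scores with no repeated values (hypothetical data)
a <- c(10, 20, 30, 40, 50)
b <- c(12, 25, 22, 48, 41)

n <- length(a)
d <- rank(a) - rank(b)                     # differences between the ranks

# Spearman's coefficient from formula (9.18)
rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rs

# Value returned by cor() for comparison
cor(a, b, method = "spearman")
```

Here the rank differences are 0, -1, 1, -1, 1, so the sum of squared differences is 4 and both computations give 0.8.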

Furthermore, in the tied rank case, Spearman’s rank correlation coefficient is computed using the following formula:
$$r_s = 1 - \frac{6\left[\sum d_i^2 + \frac{m(m^2-1)}{12} + \ldots\right]}{n(n^2-1)}$$ …(9.19)
where
n is the number of values;
$d_i$ is the difference between the ranks, i.e., $d_i = x_i - y_i$ for i = 1, 2, …, n;
m is the number of times an item is repeated.
Further, the correction factor $\frac{m(m^2-1)}{12}$ is to be added for each repeated value in both the X and Y data.
The cor() function can also be used to compute Spearman’s rank correlation coefficient. To do so, we set the method argument to "spearman".
Note: In R, we do not need to supply rank data here; only the variables are to be supplied to the cor() function to compute Spearman’s correlation coefficient.
Let us discuss the method of computing Spearman’s rank correlation coefficient of data. For illustration, consider again the students’ examination data discussed in Subsection 9.8.1. Proceeding with the same object names, with x and y holding the complete marks of the 10 pairs, we compute Spearman’s rank correlation coefficient by supplying x and y as the first two arguments of the cor() function and, most importantly, setting the method to "spearman" in the following manner:
#Computing the Spearman’s correlation coefficient
> cor(x, y, method="spearman")
[1] 0.09174355

Learners may recall that, mathematically, Spearman’s rank correlation coefficient is simply Karl Pearson’s correlation coefficient applied to the rank data. The same can be verified from the following statement:
#Computing the Spearman’s correlation coefficient
> cor(rank(x), rank(y), method="pearson")
[1] 0.09174355

Hence, we get the same result.
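Since the students’ marks contain repeated values, formula (9.19) can also be applied to them directly. The following is a sketch under stated assumptions: the helper tie_cf() is hypothetical, it counts repetitions with table(), and the marks of the 10 pairs are re-entered so that the block is self-contained.

```r
# Marks of the 10 pairs from Subsection 9.8.1
x <- c(23, 30, 18, 15, 29, 18, 28, 29, 19, 27)
y <- c(36, 34, 15, 26, 22, 29, 18, 19, 18, 10)

n <- length(x)
d <- rank(x) - rank(y)            # rank() gives average ranks to repeated values

# Hypothetical helper: sum of m(m^2 - 1)/12 over the repeated values of v
tie_cf <- function(v) {
  m <- as.vector(table(v))        # number of times each value occurs
  sum(m * (m^2 - 1) / 12)         # values occurring once contribute 0
}
cf <- tie_cf(x) + tie_cf(y)       # correction factors from both variables

# Spearman's coefficient from formula (9.19)
rs_formula <- 1 - 6 * (sum(d^2) + cf) / (n * (n^2 - 1))

# Value returned by cor()
rs_cor <- cor(x, y, method = "spearman")

c(formula = rs_formula, cor = rs_cor)
```

For these data the sum of squared rank differences is 148.5 and the total correction factor is 1.5, so (9.19) gives 15/165, about 0.0909, whereas cor() returns 0.09174355. The two values differ slightly because cor() applies Pearson’s formula to the average ranks, while (9.19) adjusts the non-tied formula.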


Moreover, a bivariate plot of the students’ examination data can also be created using the plot() function to see the relationship graphically:
#Creating a scatter plot of the data
> plot(x, y, pch=18, col="red", cex=2)

Fig. 9.5: Scatterplot of the student’s data

Hence, the bivariate plot also indicates that there is no correlation between the x and y variables.
As mentioned earlier, the cor() function can also be used on matrices or data frames. This means that the function arguments x and y are supplied as matrices or data frames instead of vectors. In that case, the correlations between the columns of x and the columns of y are computed. There may be situations in which we are interested in finding the correlations among the columns of x only. In that case, it is enough to supply a single matrix or data frame as the function argument, i.e., cor(x,x) or cor(x), as both will yield the same result.
For illustration, we create a correlation matrix of the first 20 rows of the USArrests data using the cor() function. To do so, we first assign the extracted data to ExtUs. Then we compute the correlation matrix between the columns of the ExtUs data as follows:
#Assigning the data
> ExtUs <- USArrests[1:20,]

#Computing the correlation matrix
> cor(ExtUs) #Or cor(ExtUs,ExtUs)
             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.6953169 0.08000693 0.5073718
Assault  0.69531691 1.0000000 0.31796396 0.6996077
UrbanPop 0.08000693 0.3179640 1.00000000 0.3471705
Rape     0.50737177 0.6996077 0.34717050 1.0000000
Note that, by default, Karl Pearson’s correlation coefficients between the columns of the ExtUs data frame are computed. Also, the obtained matrix is a square symmetric matrix. Now, we would like to verify whether the correlation between the Assault and Murder variables of the ExtUs data is 0.6953169 or not. For verification, we compute the correlation between the Assault and Murder variables using the cor() function as follows:
#Computing the correlation between two variables
> cor(ExtUs$Assault, ExtUs$Murder)
[1] 0.6953169

The matrix of scatter plots of the ExtUs data can be created using the pairs() function discussed in Unit 5 of the MST-015 course as follows:
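The call that produces the pairs plot is not reproduced in this text. A minimal sketch, assuming the pairs() function is called on ExtUs with its default arguments, is:

```r
# First 20 rows of the built-in USArrests data, as in the text
ExtUs <- USArrests[1:20, ]

# Matrix of scatter plots for every pair of columns (default settings)
pairs(ExtUs)
```

Each panel of the resulting plot shows one pair of the four variables Murder, Assault, UrbanPop and Rape.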

Fig. 9.6: Pairs plot

From the correlation matrix and scatter plots, it can be seen that the highest correlation is between the Assault and Rape variables (0.6996077), closely followed by that between the Assault and Murder variables (0.6953169), and that the lowest correlation is between the UrbanPop and Murder variables (0.08000693).
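The pairs with the highest and lowest correlation can also be located programmatically instead of by reading the matrix. The following is a sketch only; the object names cm, cm_upper, hi and lo are introduced here for illustration.

```r
# Correlation matrix of the first 20 rows of USArrests, as in the text
ExtUs <- USArrests[1:20, ]
cm <- cor(ExtUs)

# Blank out the diagonal and lower triangle so that each pair appears once
cm_upper <- cm
cm_upper[lower.tri(cm_upper, diag = TRUE)] <- NA

# Positions of the largest and smallest remaining coefficients
hi <- which(cm_upper == max(cm_upper, na.rm = TRUE), arr.ind = TRUE)
lo <- which(cm_upper == min(cm_upper, na.rm = TRUE), arr.ind = TRUE)

hi_pair <- c(rownames(cm)[hi[1, "row"]], colnames(cm)[hi[1, "col"]])
lo_pair <- c(rownames(cm)[lo[1, "row"]], colnames(cm)[lo[1, "col"]])

cat("Highest:", hi_pair, "\n")
cat("Lowest: ", lo_pair, "\n")
```

Since all the coefficients in this matrix are positive, the smallest coefficient is also the weakest correlation.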
Note: After studying this unit, learners can also compute other statistical measures, such as percentiles, quantiles and deciles, and measures of variation, such as the coefficient of variation, on their own.

SAQ 7
Consider the women data available in the datasets package and write R code to do the following tasks:

(i) Extract the first 10 rows of the data frame and assign it under the name ExtWo.
(ii) Compute the Karl Pearson’s correlation between the columns of ExtWo.
(iii) Create a scatter plot of all the variables of ExtWo in a single plot.

9.9 SUMMARY
The main points discussed in this unit are as follows:
Methods of computing different measures of central tendency, such as the arithmetic mean, geometric mean, harmonic mean, median and mode, are discussed.
Methods of computing the variance and standard deviation are discussed.
Computations of skewness and kurtosis are illustrated.
Computations of the range, quartiles, and mean and quartile deviations are illustrated.
Computations of Karl Pearson’s and Spearman’s correlation coefficients are also discussed in this unit.

9.10 TERMINAL QUESTIONS


1. Create a function named average, which computes the arithmetic mean
(AM), geometric mean (GM) and harmonic mean (HM) of a number of
ungrouped observations.
2. Write R code to compute the median of the following discrete frequency
distribution.
Observation Frequency
x1 f1
x2 f2
x3 f3
x4 f4
x5 f5

3. Write R code to compute the variance of the following discrete frequency distribution: $(x_i, f_i)$, i = 1, 2, 3, 4, 5.

4. Write R code to compute the constants $\beta_1$ and $\beta_2$, the constants of skewness and kurtosis, of the ungrouped data x1, x2, x3, x4, x5 using functions and lists. Also, give the printing command using the cat() function for printing the constants $\beta_1$ and $\beta_2$.

5. Write R code to compute the Karl Pearson’s correlation between the


columns of the following data using a data frame:
X Y Z
x1 y1 z1
x2 y2 z2
x3 y3 z3
x4 y4 z4
x5 y5 z5
Also, state what each element of the obtained correlation matrix represents.
6. Which statistical measures does the summary() function provide as output when it is used on numeric data?
7. Create a function named CorrFac to compute the correction factor of
Spearman’s correlation coefficient.
8. State whether TRUE or FALSE

(i) Arithmetic mean for the grouped data can be computed using
weighted.mean() function.
(ii) The cov() function can be used to obtain the variance of the
ungrouped data.
(iii) The R statements sd(x) and sqrt(var(x)), where x is a vector,
will give different outputs.

9.11 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. (i) In this problem we are asked to write a code to compute the
geometric mean of the given discrete frequency data. Clearly, in this
problem we don’t need to write any statement to compute the middle
values as xi’s are already given. So, we first assign xi’s and fi’s to x and
f, respectively.
x <- c(x1,x2,x3,x4,x5)
f <- c(f1,f2,f3,f4,f5)
Then, the total frequency can be obtained using the sum() function as follows:
N <- sum(f)
Finally, the geometric mean of the data can be computed using one of the following two formulae:
(prod(x^f))^(1/N)
Or,
exp(weighted.mean(x=log(x), w=f/N))
(ii) The output of the given statement is 3.5.
2. To compute the median of the ungrouped data, a function named MEDIAN can be created as follows:
MEDIAN <- function(x){
n <- length(x)
sx <- sort(x)
if(n%%2==0) {
cat('Median', (sx[n/2]+sx[(n/2)+1])/2,'\n')
} else {
cat('Median', sx[(n+1)/2],'\n')}
}
3. The which() function is available in the base package. It returns the indices at which a logical object is TRUE. The object can be a vector or an array. For example, the following code will give the indices as 4 and 5 since the elements greater than 13 are present at the 4th and 5th indices.
x<-11:15; which(x>13)

4. We first create a vector of the data and assign it under the name S as
follows:
S<-c(116, 151, 116, 179, 141, 197, 191, 197, 160,
175, 162, 137, 122, 194, 128, 115, 140, 165, 123,
151, 179, 178, 189, 185, 143, 152, 195, 152, 117,
165, 199, 163, 173, 178, 172, 173, 179, 159, 191,
158)
The sd() function uses the divisor n-1. To obtain the standard deviation with the divisor n, we require the length of the data, i.e., n. Therefore, we compute it using the length() function and assign its value under the name n as follows:
n<-length(S); n
Finally, we can compute the standard deviation of the data using the value of n and the sd() function with the following statement:
sd(S)*sqrt((n-1)/n)
5. The outputs of the given statements are as follows:
(i) 2 2 -5
(ii) 4 -3
The difference between the two statements is that, in (i), the na.omit() function is applied to the vector and the lag argument is 1, whereas in (ii), the na.omit() function is not used and the lag argument is specified as 3.
So, in (i), the NA values are first removed by the na.omit() function and we get the vector with elements 2, 4, 6 and 1. Next, as lag=1, the differences between consecutive terms are computed as 2, 2, -5.
In the next statement, lag is 3, so the differences are computed as 6-2 and 1-4 (with a gap of 3 between the observations). Since the NA’s are not removed from the data, the output is 4 and -3.
6. We first assign the given data to a vector named x as follows:
x <- c(0.7,1.2,-0.5,0.8,-0.1,0.6,-1.8,0.9,-2.8,0.2)
Then, by writing the following statement, we can compute the coefficient of skewness $\beta_1 = \mu_3^2/\mu_2^3$.
mean((x-mean(x))^3)^2/mean((x-mean(x))^2)^3
7. First, we extract the data using the following statement.
ExtWo <- women[1:10,]
Then the Pearson’s correlation coefficient can be computed using the cor() function in the following manner.
cor(ExtWo)
Or, cor(ExtWo$height, ExtWo$weight)
Lastly, we can create the scatter plot using the following plot() function
command.
plot(ExtWo$height, ExtWo$weight, col="blue", cex=2)
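For SAQ 6, note that $\beta_1$ is never negative, so it cannot show the direction of skewness. The following sketch, which assumes that the signed coefficient $\gamma_1 = \mu_3/\mu_2^{3/2}$ is also wanted, computes both quantities for the SAQ 6 data. The object names mu2, mu3, beta1 and gamma1 are introduced here only for illustration.

```r
# Data of SAQ 6
x <- c(0.7, 1.2, -0.5, 0.8, -0.1, 0.6, -1.8, 0.9, -2.8, 0.2)

mu2 <- mean((x - mean(x))^2)   # 2nd moment about the mean
mu3 <- mean((x - mean(x))^3)   # 3rd moment about the mean

beta1  <- mu3^2 / mu2^3        # always non-negative
gamma1 <- mu3 / mu2^(3/2)      # keeps the sign of mu3

c(beta1 = beta1, gamma1 = gamma1)
```

For these data the third moment about the mean is negative, so gamma1 is negative (a longer left tail), whereas sqrt(beta1) would return only its magnitude.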

Terminal Questions (TQs)


1. In this problem we are asked to create a function for computing the
arithmetic mean (AM), geometric mean (GM) and harmonic mean (HM)
of a number of ungrouped observations. So, firstly we assign the data
under the name data as follows:
data <- c(x1, x2, x3, x4, x5)
Then we create a function named average to compute AM, GM and
HM of the data as follows:
average <- function(data){
AM <- mean(data) #for computing AM
GM <- (prod(data))^(1/length(data)) #for computing GM
HM <- 1/mean(1/data) #for computing HM
cat("Arithmetic mean=",AM,'\n',"Geometric mean=",
GM,'\n',"Harmonic mean=",HM,'\n') #for printing the result
}
Then the function can be called as follows:
average(data)
2. In this problem we are asked to write a code to compute the median of the given discrete frequency data. Clearly, in this problem we don’t need to write the R statement to compute the middle values as the xi’s are already given. So, we first assign the given data as follows:
x <- c(x1, x2, x3, x4, x5)
f <- c(f1, f2, f3, f4, f5)
Then a null vector can be created with name data as follows:
data <- c()
Next, we create a vector of all the observations using the discrete
frequency data as follows:
for(i in 1:5){
data <- c(data, rep(x[i], f[i]))}
Then either we can compute the median using the median() function or
by using the MEDIAN function created earlier (refer SAQ 2).
median(data) #function call
3. In this problem we are asked to write a general R code to compute the variance of the discrete frequency data. We first assign the given data to variables as follows:
x <- c(x1, x2, x3, x4, x5)
f <- c(f1, f2, f3, f4, f5)
Then, the variance of the data can be computed using the following R statements, where m is the mean of the frequency distribution:
m <- sum(f*x)/sum(f)
sum(f*(x-m)^2)/sum(f)
4. In this problem we are asked to use list specifically. Thus, we can create
a list of the given data as follows:
x <- list(c(x1, x2, x3, x4, x5));x
As this list is having only one element, so the elements can be extracted
using x[[1]]. Then the length of the data can be computed using the
following statement:
N <- length(x[[1]])
Thereafter, the 2nd , 3rd and 4th moments about the mean can be
computed using the following:
Mu2 <- sum((x[[1]]-mean(x[[1]]))^2)/N
Mu3 <- sum((x[[1]]-mean(x[[1]]))^3)/N
Mu4 <- sum((x[[1]]-mean(x[[1]]))^4)/N
The constants $\beta_1$ and $\beta_2$ can be computed using the moments as follows:
Beta1 <- Mu3^2/Mu2^3
Beta2 <- Mu4/(Mu2)^2
The computed constants $\beta_1$ and $\beta_2$ can be printed using the following command.
cat("Beta1=",Beta1,"\t","Beta2=",Beta2,"\n")
5. In this problem we are asked to compute the Karl Pearson’s correlation coefficient between the columns of the given data. As we are asked to use data frames, we create a data frame named df to hold the data as follows:
df <- data.frame(x=c(x1,x2,x3,x4,x5),
y=c(y1,y2,y3,y4,y5), z=c(z1,z2,z3,z4,z5))
By default, if the method is not specified in the cor() function, Karl Pearson’s correlation is computed. Therefore, the correlation between the columns of the data frame df can be computed using the following command:
cor(df) #for computing the correlation matrix
Execution of the cor() function will yield the following matrix of correlation coefficients.
cor(x, x) cor(x, y) cor(x, z)
cor(y, x) cor(y, y) cor(y, z)
cor(z, x) cor(z, y) cor(z, z)
Additionally, what each entry of the correlation matrix represents can be clearly seen from this matrix.
6. When the summary() function is used on numeric data, the following six statistical measures are provided as output:
(i) Minimum
(ii) First Quartile
(iii) Median
(iv) Mean
(v) Third Quartile
(vi) Maximum
7. The correction factor for Spearman’s correlation coefficient can be computed using the following loop, which sorts the data first so that repeated values are adjacent:
CorrFac <- function(x){
x=sort(x)
mx=0;n=length(x)
i=1
while(i<n)
{ m=1
for(j in (i+1):n)
{
if(x[i]==x[j]) m=m+1
}
mx=mx+(m^3-m)/12
i=i+m
}
mx }
CorrFac(x)
8. The answers are as follows:
(i) TRUE
(ii) TRUE
(iii) FALSE
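The solutions to TQ 2, TQ 3 and TQ 7 are written for symbolic data. The following sketch tries the same ideas on a made-up frequency distribution; the numbers in x and f, the test vector for the ties and the helper corr_fac_table() are all hypothetical, and rep() and table() are used in place of explicit loops.

```r
# Hypothetical discrete frequency distribution (values invented for illustration)
x <- c(1, 2, 3, 4, 5)
f <- c(2, 3, 5, 4, 1)

# TQ 2: expand the distribution without a loop, then take the median
obs <- rep(x, times = f)
med <- median(obs)

# TQ 3: variance with divisor N, directly from x and f
N <- sum(f)
m <- sum(f * x) / N
v <- sum(f * (x - m)^2) / N

# The same variance from the expanded observations
v_check <- mean((obs - mean(obs))^2)

# TQ 7: correction factor from the repetition counts given by table()
corr_fac_table <- function(z) {
  m_rep <- as.vector(table(z))      # how many times each value occurs
  sum((m_rep^3 - m_rep) / 12)       # values occurring once contribute 0
}
cf <- corr_fac_table(c(1, 2, 2, 3, 3, 3))

cat("Median =", med, " Variance =", v, " Correction factor =", cf, "\n")
```

For this made-up distribution the median is 3, the variance is 284/225 from both computations, and the correction factor for one value repeated twice and one repeated thrice is 0.5 + 2 = 2.5.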
