Introduction to R Software
Indira Gandhi National Open University
School of Sciences
BLOCK 1
Fundamentals of R Language
BLOCK 2
Functions, Conditional Statements, Loops
and Descriptive Statistics with R
INTRODUCTION TO R SOFTWARE
BLOCK 1: Fundamentals of R Language
Unit 1: Introduction to R
Unit 2: Nitty-Gritty of R
Block 1
FUNDAMENTALS OF R LANGUAGE
UNIT 1
Introduction to R
UNIT 2
Nitty-Gritty of R
UNIT 3
Membership Testing, Coercion and Lists in R
UNIT 4
Data Frames, Reading and Writing in R
UNIT 5
Graphical Representation of Data with R
Curriculum and Course Design Committee
Prof. Sujatha Varma, Former Director, SOS, IGNOU, New Delhi
Prof. Rakesh Srivastava, Department of Statistics, M. S. University of Baroda, Vadodara (GUJ)
Formatted and CRC Prepared by Dr. Taruna Kumari and Ms Preeti, SOS, IGNOU
Course Coordinator: Dr. Taruna Kumari
Programme Coordinators: Dr. Neha Garg and Dr. Prabhat Kumar Sangal
Print Production
Mr. Rajiv Girdhar, Assistant Registrar, MPDD, IGNOU, New Delhi
Mr. Hemant Parida, Section Officer, MPDD, IGNOU, New Delhi
Acknowledgement: From the depth of my heart I render my gratitude to my family, especially my father Mr. Puran
Chand, my mother Mrs. Raj Rani, my husband Mr. Anupam Pathak and my son Prithu for providing me the necessary
comfort to overcome the ups and downs during the development of this material. Also, I extend my thanks to my
former graduate and postgraduate students for their feedback and questions, which enabled me to give
detailed explanations.
April, 2023
© Indira Gandhi National Open University, 2023
ISBN-978-81-266-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means,
without permission in writing from the Indira Gandhi National Open University
Further information on the Indira Gandhi National Open University may be obtained from the University’s Office at
Maidan Garhi, New Delhi-110068 or visit University’s website http://www.ignou.ac.in
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by the Director, School
of Sciences.
INTRODUCTION TO R SOFTWARE
R is a high-level language whose popularity is increasing day by day. It can also
be regarded as an environment designed for statistical analysis of data and for graphics.
You may be astonished to know that the R language has been around since 1993.
The R language is a dialect of the S language.¹ The S language was developed at Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. The evolution of the S
language is described in the four books by John Chambers and coauthors.² For John
Chambers' efforts, the Association for Computing Machinery (ACM) gave him its
Software System Award, whose citation stated that this language “forever altered how people
analyze, visualize and manipulate data”. R was written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland in New Zealand.
There are several reasons for the popularity of R. Some of them are stated here:
R is an interpreted language, which is free.
It is an outstanding software, which is easy to use as well.
It works on Windows, Unix, Mac and Linux.
A number of statistical packages are available for handling statistical data analysis.
It comes with several data sets.
Quality support and back-up are available (via web pages, R documents and books) on
functions and packages.
It is widely accepted by many researchers, industrialists and professors for data
analysis purposes.
The main reason for the impressive growth in the popularity of the R language nowadays is the
emergence of data science as a career, because data is everywhere and experts are needed
to sort and analyze that data. So, together with the knowledge of computing, knowledge of
statistical methods and machine learning is also required.
This course is mainly written for learners who are beginners in R computing software.
Throughout the development of this course, emphasis is given to the packages which
come with the base distribution (i.e., precompiled binary distributions of the base system)
during installation. It is essential for learners to understand the basics of R before
switching to more complicated problems, such as those discussed in the lab courses, i.e.,
MSTL-011: Statistical Computing Using R-I, MSTL-012: Statistical Computing Using R-II,
MSTL-013: Statistical Computing Using R-III and MSTL-015: Statistical Computing Using R-V. The
content of this course is organized into 9 self-explanatory units. The first five units are part of
Block 1 (Fundamentals of R Language) and the next 4 units are part of Block 2
(Functions, Conditional Statements, Loops and Descriptive Statistics with R). These units
can be summarized as follows:
Unit 1 (Introduction to R): It comprises the installation procedure, methods of seeking help
and details on the basic terminology of R.
Unit 2 (Nitty-Gritty of R): The second unit discusses R objects such as different types
of vectors, matrices, factors and arrays. It also throws light on missing values, arithmetic and
logical operations.
Unit 3 (Membership Testing, Coercion and Lists in R): As is clear from the name, this unit
discusses membership testing and coercion of R objects. Additionally, list objects are also
discussed in this unit.
Unit 4 (Data Frames, Reading and Writing in R): This unit gives extensive details on data
frame objects, methods of reading and writing from/to a file and formatting commands.
1 Refer to the “An Introduction to R” manual by the R Core Team.
2 Refer to the “R Language Definition” manual by the R Core Team.
Unit 5 (Graphical Representation of Data with R): Different types of graphical functions that
are used to create plots such as the scatterplot, boxplot, histogram, barplot, stripchart, stem-and-leaf
plot, pie chart, pairs plot, coplot, cloud plot, etc. are discussed in this unit.
Unit 6 (Functions in R): The method of creating your own function is discussed in this unit by
taking some suitable examples.
Unit 7 (Control-Flow Constructs of R): Control-flow constructs such as conditional
statements, different types of loops and the method of putting additional control on loops
using the next and break statements are discussed in this unit with examples.
Unit 8 (Apply Family in R): This unit comprises details on the usage and importance of the
apply family of functions.
Unit 9 (Descriptive Statistics and Correlation with R): Unit 9 comprises details on
measures of central tendency and dispersion together with examples on correlation
computations with R.
To develop this course, we have used the Windows operating system, and the R commands
written in this course were run on R version 4.1.1. In a Windows system, we interact with R
through the R console. Furthermore, the written commands can be easily saved. More details
on this are given in Unit 1 of this course.
In this course, the written code, associated outputs and names of functions, R objects,
packages and operators are written in the ‘Lucida Console’ font and theory is written in the ‘Arial’
font. Additionally, the R commands are written in bold and the associated outputs are
not bold. Note that the lines starting with ‘ # ’ written before the R commands are comments, which
are not executed and are written to give a clear understanding of the code. Furthermore,
while studying this course, do all the illustrations on a computer, preferably by writing the
commands in R script files (in an integrated editor) available in the R Graphical User Interface
(RGui). Then do all the SAQs and TQs without using a computer.
It is important to note that if you use any R function in your research/publications for data
analysis purposes, then you should cite that package in your written work. For example, to
cite the package base, first get the citation details using the citation() function
and then use the obtained reference for citation purposes as follows:
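A minimal sketch of this step is given here. It assumes only the citation() and toBibtex() functions of the utils package, which comes with the base distribution; the name base_ref is a hypothetical variable name chosen for illustration. The printed reference depends on the installed version of R, so no output is shown.

```r
#Getting the citation details of the base package
base_ref <- citation("base")
#Printing the reference in a readable form
print(base_ref)
#Converting the same reference into a BibTeX entry for written work
toBibtex(base_ref)
```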
If the citation details are not accessible (or available) via the citation() function at the
prompt, then learners may visit the CRAN (Comprehensive R Archive Network) page to get
the details of the contributors (such as authors' names, year and title) for citation purposes.
Lastly, in this introduction page I would like to express my deepest gratitude and thanks to
the R core team, Bill Venables, David M. Smith, John Chambers, Robert Gentleman, Ross
Ihaka, Martin Maechler and other contributors for providing access to enormous R sources
and for their substantial contribution in R language, which has extremely benefited the world.
The MST-015 (Introduction to R Software) is a 2 credit self-explained course, which is
developed for self-study. But still if you want to refer to additional books or references on
discussed topics you may refer to the following books and references.
Suggested Further Reading
1. Braun, W. J. & Murdoch, D. J. (2007). A First Course in Statistical Programming with R.
Cambridge.
2. Crawley, M. J. (2012). The R Book. John Wiley & Sons.
3. Albert, J. & Rizzo, M. (2012). R by Example. Springer.
4. Teetor, P. (2011). R Cookbook. O'Reilly.
5. Lafaye de Micheaux, P., Drouilhet, R., & Liquet, B. (2013). The R Software:
Fundamentals of Programming and Statistical Analysis. Springer.
6. Zuur, A., Ieno, E. N., & Meesters, E. (2009). A Beginner's Guide to R. Springer Science
& Business Media.
7. Heumann, C., Schomaker, M. & Shalabh (2016). Introduction to Statistics and Data
Analysis: With Exercises, Solutions and Applications in R. Springer International
Publishing Switzerland.
8. Dalgaard, P. (2002). Introductory Statistics with R. New York: Springer-Verlag.
References
The packages used for the development of this course material can be referred from the
following references:
1. R Core Team (2021). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
2. Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition.
Springer, New York. ISBN 0-387-95457-0.
3. Mirai Solutions GmbH (2023). XLConnect: Excel Connector for R. R package version
1.0.7. https://CRAN.R-project.org/package=XLConnect
4. Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R. Springer, New
York. ISBN 978-0-387-75968-5
5. Lukasz Komsta and Frederick Novomestky (2022). moments: Moments, Cumulants,
Skewness, Kurtosis and Related Tests. R package version 0.14.1.
https://CRAN.R-project.org/package=moments
Expected Learning Outcomes
After completing this course, you should be able to:
install R, seek help on functions and data sets, create R scripts and learn some basic
aspects of R;
create R objects, know the different data types and learn to use membership
testing and coercion functions;
read and write from/to a file;
do graphic representation of data with R;
do looping, create control statements and functions in R; and
compute descriptive statistics and correlation with R.
Structure
1.1 Introduction
Recalling Previous Commands
1.1 INTRODUCTION
This unit provides an introduction to the main features of the R language. We
do not assume any familiarity of the learner with computer programming
while learning from this unit. The present unit lays the groundwork for the
other units. It explains the procedures of downloading, installing and running
R. It also explains some of the important basic concepts related to the R language,
such as objects, classes of objects, case sensitivity of the language, etc.
Most importantly, it explains how to find help on R constants, reserved words,
data sets and functions, which leads to the path of getting the answers to your
queries.
Many books on the R programming language assume that you are familiar with R
fundamentals, such as syntax, functions, operators, data types and so on. The
speciality of the MST-015 (Introduction to R Software) course is that it does not
require prior knowledge of any computing software. The R programming
language is discussed here from scratch.
Note that R is a free (open source) interpreted language. It is specially
designed for handling statistical computations and for graphical representation
of data. It also provides interfaces to other languages and debugging facilities.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
Nowadays R is used by an enormous number of people daily to perform data analysis. R
has now become a tough competitor to almost all commercial statistical
software.
We recommend that you study the course introduction pages to become
aware of the development of the R language and its contributors, which has
tremendously benefited the world.
1.2 DOWNLOADING AND INSTALLING R
The most convenient way to download R on your system is to obtain the base
distribution from the R website, which is as follows:
https://www.r-project.org/
When you go to the above link (assuming you have access to the internet),
it will take you to the following page:
From the screenshot you can observe that several important pieces of information
related to Download, R Project, R Foundation, Help with R, FAQs
(Frequently Asked Questions), R Manuals and others are available on this
website.
To download R, click on CRAN (Comprehensive R Archive Network) (under
Download); you will then be directed to a list of CRAN mirror sites
organized by country. You need to select a site near you.
In some books, you may find instructions to download R directly from the following
link, which directs you to download R by selecting the CRAN mirror site as
Austria.
https://cran.r-project.org/
After selecting the CRAN Mirrors, you will be directed to the following
downloading page:
On the CRAN page you will find some precompiled binary distributions of the
base system and contributed packages for Linux, macOS and Windows
operating systems. Choose one of the suitable options from the available
options under “Download and Install R” to download R. Here, we are
explaining the method of downloading R for Windows, as all the commands
written in this course are executed (or evaluated) on Windows.
This page gets updated from time to time and you will always find the latest version
of R to download on your system (right now R-4.3.1 is the latest version of R,
which is available to download).
After downloading R (see the location where downloads are saved), run the
setup program on PC by double-clicking on the downloaded application .exe
file (see the screenshot of the downloaded application with opened properties
shown below). Then follow the instructions and wait to get it installed
successfully (click on Finish to complete the installation process).
Note: Alternatively, you can get the set-up from your friends or known persons
and run on your PC to install it.
If you are installing R for macOS, then click on “Download R for
macOS” under “Download and Install R”. Then click on the .pkg file for the
latest version of R, download it and install it by double-clicking on the .pkg file.
Otherwise, if you are installing R for Linux, then click on “Download R
for Linux” under “Download and Install R”. The major Linux distributions like
Debian, Red Hat, Ubuntu, etc. have packages for installing R. You just need to
use the system’s package manager to download and install the package.
Note: To download R, you can also type “Download R” on Google and get all
the important links, which help you to download R on your system.
SAQ 1
What is the basic difference between an interpreted language and a compiled
language? Also, give an example of each of them.
Save to save the file (save it as you save any other file, like word, excel etc).
The created file will be saved with .R extension. It can be later accessed as
follows:
Click on File → click on Open script → go to the location (where it is saved)
and select the required file to open.
Furthermore, a number of commands written on the R script editor can be
evaluated by firstly selecting them and then pressing Ctrl+R, which means
pressing control key with R. Or otherwise, if you want to run only a single
command then you can put the cursor at that R command (which you want to
evaluate), then press Ctrl+R. The R script editor is mainly useful when you
want to save retyping, and these files are easily manageable.
Note: The icon on desktop will be visible with its version. If you have not opted
for the creation of icon on desktop, then you can go to Programs and then to
R and thereafter find R icon and double-click on it to run R. Furthermore, it is
always better to visit the CRAN page to get latest version of R.
Note that in a Windows system users interact with R through the R console. When
we double-click on the R icon, the following page will appear:
The paragraph also suggests the method to get the citation details on any
package used. We now try to get the citation details for the base package using the
citation() function by supplying base in double quotes (as a character
string) in the following manner:
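The command described above can be sketched as follows. This is only an illustration: the name cit is a hypothetical variable name, and the fields title and year are assumed to be present in the returned reference, as they are for the base package. The exact text printed depends on the installed version of R.

```r
#Getting citation details of the base package
cit <- citation("base")
#Printing the complete citation
cit
#Extracting the title and year stored in the citation
cit$title
cit$year
```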
Other functions which are shown in the paragraph written on the R console page
are demo(), q(), help() and help.start(). The demo() function is a
user-friendly interface for running some demonstration R scripts; thus, as the
name suggests, it is for demonstration purposes. For more clarification, you can
run the following command:
#Getting demonstration on graphics package
> demo("graphics")
Quitting R:
We can quit R by writing the command q() at the prompt. As you press Enter,
you will be asked if you want to save the current workspace or not (you can
respond yes, no or cancel). If you want to resume your current work later at
the point you are leaving it, then you can select yes, otherwise no. You can
also cancel the quitting request by selecting the cancel option.
Note: An alternative way to interact with R is using RStudio, which can be
downloaded from the following link:
https://www.rstudio.com
RStudio can be downloaded and installed for all the operating systems for
which the R software is downloadable. Like the R software, it also supports a script
editor where we can write complex programs. But for this course, we
recommend the use of the R software.
SAQ 2
Write a command to get the citation details on the lattice package.
#Printing A
> print(A)
[1] 10
#Printing a
> print(a)
Error in print(a) : object 'a' not found
Hence, it is verified that upper- and lower-case letters are different in R, i.e., R
is a case sensitive language. Consider another example in which we assign a
character string "OM" to a variable named name and print it by combining
upper and lower characters of the variable name as follows:
#Assigning a character string
> name <- "OM"
#Printing name
> print(name)
[1] "OM"
> print(NaMe)
Error in print(NaMe) : object 'NaMe' not found
Hence, name, nAME and NaMe are not the same due to the case sensitivity of the
language.
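The case sensitivity discussed above can also be checked without producing an error. The following is a small sketch which uses the exists() function; the variable name Marks is hypothetical, and it is assumed that no object named marks (in lower case) is present in your session.

```r
#Assigning a value to a variable named Marks (a hypothetical name)
Marks <- 10
#Checking whether an object with exactly this name exists
exists("Marks")
#Checking the lower-case spelling, which R treats as a different name
exists("marks")
#Comparing the two spellings as character strings
identical("Marks", "marks")
```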
Help on R:
Recall that the paragraph written on the R console mentions
“help() for online help, or help.start() for an HTML browser interface to
help”. Actually, R has a built-in help facility, which can be easily accessed
using the help() function or the ‘ ? ’ symbol. For illustration,
suppose that we are interested in finding help on a function named
prod(); this can be achieved either by using the help(prod) command
or by writing ?prod as follows:
#Seeking help
> help(prod)
> ?prod #An alternative
Note: To get help using ‘ ? ’, write the name of the function without parentheses
‘ ( ) ’ after ‘ ? ’.
When the help() command is executed, the R Documentation page
consisting of the details on the function and its arguments together with the
examples and other necessary details will pop up as follows:
Hence, from the help page we get that the prod() function is available in the
base package and it is used to compute the product of all the elements
present in its arguments.
Next, we seek help on reserved words (also referred to as keywords), R
constants and data sets using the help() function as follows:
When you want help on a data set, say the USArrests data set, then it
can be done by writing the following command:
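A minimal sketch of such help requests is given here. The topics chosen (the reserved word for, the constant pi and the USArrests data set) are only illustrative. Reserved words are supplied in double quotes because they cannot be written as ordinary names.

```r
#Seeking help on a reserved word
help("for")
#Seeking help on a built-in constant
help("pi")
#Seeking help on a data set
help("USArrests")
```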
Note: Even if you are not connected to the internet, you can still access
R Documentation pages via help.
Next, we discuss the use of the help.search() function. This function is
particularly useful when we do not know the exact name of the function and
only recall a part of the function, data set or keyword name. This
function only accepts a character string as its argument. As an alternative to
this, we can also use a more convenient way of finding help, which is using
‘ ?? ’ in front of the part of the name of the function. For illustration,
suppose we want to seek help on the rowMeans() function, but we can only
recall a part of it, rowMea, so we proceed to seek help in the following
manner:
#Seeking help
> ??rowMea
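The same search can be written with the help.search() function by supplying the recalled part of the name as a character string. The sketch below also uses apropos(), a function of the utils package which is not discussed in this unit, to list the names of available objects that contain the given text.

```r
#Searching the help system with a character string (same search as ??rowMea)
help.search("rowMea")
#Listing the names of available objects that contain "rowMea"
apropos("rowMea")
```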
SAQ 3
Write a command to get help on if reserved word (used in conditional
statements).
Note: (i) The print() function is used to print an R object and it is discussed
in detail in Unit 4 of the MST-015 course.
(ii) If an expression is evaluated in R, say x+y (which is 15), its
value will be lost unless it is assigned to some variable. So, if you want to
reuse any value later, it is better to assign it to some variable.
(iii) The two assignment operators, ‘ <- ’ and ‘ = ’, can be used interchangeably
for assigning a value at the prompt. As the ‘ <- ’ operator is quite convenient
and preferred by many books, from this point onwards we use this operator
for variable assignment.
(iv) The ‘ <- ’ assignment operator consists of two characters ‘ < ’ (less than)
and ‘ - ’ (minus), occurring strictly side-by-side. It should be remembered that
there should not be any blank space between the two characters.
(v) In some reference books you may find ‘ = ’ used as the assignment operator. But do
not confuse the ‘ = ’ and ‘ == ’ operators. The first one is the assignment
operator and the second one is a relational operator.
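The points in the above note can be sketched as follows. The variable names x, y and z and their values are chosen only for illustration, so that x+y equals 15.

```r
#Assigning values using both assignment operators
x <- 10
y = 5
#The sum is evaluated and printed, but it is not stored anywhere
x + y
#Storing the sum in a variable for later use
z <- x + y
#Comparing with the relational operator '=='
z == 15
#A blank space inside '<-' changes the meaning:
#the line below compares x with -1 instead of assigning 1 to x
x < -1
```

Note that after running the last line the value of x is still 10, because no assignment has taken place.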
1.5.2 Writing Comments
Comments in any programming language play the following two very
important roles.
1. They help the user in explaining the R code to other people. Analogously, they
help the programmer to make the R code more readable.
2. There may be situations in which the user would like to prevent the
execution of certain code parts or executable statements (generally
while testing the code); in such cases as well, comments play a very important
role.
Comments in R start with ‘ # ’ and can be put anywhere in the programme.
When any R code gets executed, R ignores the lines or parts of R statements which
start with ‘ # ’ (hence preventing their execution). For illustration, we
now create a variable named pincode to save the pin code of the head office of
IGNOU, and use a comment after it on the same line to specify the location to
which it belongs as follows:
#Assigning pin code
> pincode <- 110068 #IGNOU headquarter pin code
It is important to note that [1] is written at the beginning of the output; it
indicates that the line starts with the first element of the output. It is generally useful when we
have vectors of several elements, which you will observe in the coming units and
lab sessions.
Furthermore, the statement written after ‘ # ’ is not executed. Similarly, ‘ # ’
can be used before any R command as follows:
#Preventing execution of an assignment statement using ‘ # ’
> #x <- 1
> x
Error: object 'x' not found
There may be situations when we would like to remove some specific or maybe
all objects from the R workspace. This can be achieved using the rm()
function. Note that all the objects from the workspace can be removed using
the following command:
#To remove all the objects available to use in the workspace
> rm(list = ls())
To remove specific objects, say objects named x and y, we supply their
names as arguments to the rm() function as follows:
#Removing x and y
> rm(x, y)
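The effect of rm() can be verified with the exists() function, as in the following sketch. The object names score1, score2 and score3 are hypothetical and are used only for illustration.

```r
#Creating three objects with hypothetical names
score1 <- 1
score2 <- 2
score3 <- 3
#Removing two of them by supplying their names
rm(score1, score2)
#Checking which of the three objects still exist
exists("score1")
exists("score2")
exists("score3")
```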
1.5.6 Recalling Previous Commands
We can recall the R commands using the vertical forward, vertical backward,
horizontal right and horizontal left arrows as follows:
1. Vertical forward and backward keys (↑ and ↓) can be used to scroll
forward and backward through a command history to locate a particular
command.
2. After locating the command, the horizontal right and left arrows (→ and
←) can be used to move the cursor within the command for editing
purpose.
It should be noted that a command can be edited either by deleting characters
with the DEL key or by adding other characters.
which means installation of specific packages need the installation of the other
dependent packages first. So, no specific commands need to be given to
download dependencies. They will be downloaded automatically, when a
specific package installation command is executed.
Next, we discuss the method of installing a new package in your R software. If
you are connected to the internet, then the package installation task can be
completed by using the install.packages() command. In the parentheses
‘ ( ) ’ of this function, we should write the name of the package that we want
to install as a character string (in double quotes).
For the illustration purpose, we explain the method of installing the MASS
package. To do so, we should write "MASS" as a function argument to
install.packages() function as follows:
#Installing a package named MASS
> install.packages("MASS")
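When a script is run several times, it may be convenient to install a package only if it is missing. The following is a hedged sketch of that idea; it uses requireNamespace(), a base function not discussed in this unit, and it assumes that an internet connection is available in case the installation is actually needed.

```r
#Installing MASS only if it is not already installed
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
#Checking whether MASS appears among the installed packages
"MASS" %in% rownames(installed.packages())
```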
Alternatively, we can also use the R menu bar to install a package. To do so,
we use the following path:
Go to menu bar → click on Packages → click on Install package(s) → double-click
to select a CRAN mirror for use in this session (a place close to your place)
→ double-click on the package which you want to install.
The packages installed in your R software can be viewed using the
installed.packages() function. We can also see the available packages
from the menu bar of the RGui as follows:
Go to menu bar → click on Packages → click on Load packages → select a
package which is to be loaded from the list.
Note: (i) Any data set or function available in a specific package can also be
accessed using the double colon ‘ :: ’ operator. For example, the ships data
set available in the MASS package can be accessed as MASS::ships.
(ii) The packages currently attached in your session can be listed using the
search() function.
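Both points of the above note can be sketched as follows, assuming that the MASS package is installed on your system. The head() function, used here to print only the first few rows, is part of the utils package.

```r
#Accessing the ships data set without loading the MASS package
head(MASS::ships)
#Listing the packages currently attached in the session
search()
```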
Moreover, we can remove an installed package from the library (where packages
are stored) using the remove.packages() function, which is available in the
utils package.
1.5.9 R Manuals
There are several manuals available on R language written by R core team,
which can be accessed from the menu bar of the R software as follows:
Go to menu bar → click on Help → click on Manuals (in PDF)
1 Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), The New S Language.
Chapman & Hall, New York. This book is often called the “Blue Book”.
So, the first manual is “An Introduction to R”, which will give you an
introduction to the R language, its objects, data types, function and other
important information. Each manual consists of some important aspects of the
R language, which can be accessed according to the requirement of the
learner.
The Manuals on R can also be accessed from the CRAN page using the
following link:
https://cran.r-project.org/manuals.html
Note: In addition to manuals, the menu bar and CRAN page can also be
accessed to read the “FAQ on R” and “FAQ on Windows” (since I am
working on the Windows operating system). Here, FAQ stands for Frequently
Asked Questions. Note that R has three collections of answers to
FAQs, which can be accessed using the following link:
https://cran.r-project.org/faqs.html
SAQ 4
Let us suppose that when you run the ls() function command, you get the
following objects in your working environment. Write a command to remove the
data and Name objects:
> ls()
[1] "A" "data" "Name" "x" "xy"
Note: Learners are advised to visit and read the CRAN page carefully to get R
history and other important details.
1.6 SUMMARY
The main points discussed in this unit are as follows:
We have discussed the method of downloading, installing, running and
quitting R.
Methods of taking help on R reserved words, functions, data sets and R
constants are discussed.
Case sensitivity of the language and the way of accessing the contributions
of the R core team are discussed in this unit.
Other important aspects such as assignment operators, way of writing a
comment, editing a written command, R packages etc, are also
discussed in this unit.
Points to remember when working on R console:
The enter key is used to run or evaluate a typed command (after prompt
‘ > ’) on R console.
Semi-colon (‘ ; ’) is the command separator.
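The use of the semi-colon as a command separator can be sketched as follows. The variable names len, wid and area are hypothetical and are chosen only for illustration.

```r
#Two commands written on one line, separated by a semi-colon
len <- 4; wid <- 3
#Computing and printing the product on a single line
area <- len * wid; print(area)
```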
1.8 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The main difference between interpreted languages and compiled
languages is that an interpreted language converts the commands (source
code) into machine code line by line. So, a single command can be run in
an interpreted language, but in a compiled language you need to write the
entire program first, and then the entire source code (as a program) is
run with a single command (by source code we mean a set of commands
written in any programming language).
The C programming language is an example of a compiled language and
R is an example of an interpreted language.
2. The citation details on the lattice package can be obtained using the
following command:
citation("lattice")
3. Help on the reserved word if can be obtained using the following
command:
help("if")
4. We can use the rm() function to remove the objects named data and
Name as follows:
rm(data, Name)
Terminal Questions (TQs)
1. (i) FALSE
(ii) FALSE
(iii) FALSE
(iv) TRUE
2. (i) R internal type or storage mode
(ii) remove.packages()
(iii) #
3. Refer Subsection 1.5.9.
4. Refer Section 1.3.
5. To load a package named pack, we use the library() and
require() functions as follows:
library(pack)
require(pack) #Alternatively
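The two functions behave differently when the requested package is not installed, which the following sketch illustrates. The package name used here is hypothetical and is assumed not to exist on your system; the argument character.only = TRUE tells both functions that the package name is supplied through a character variable.

```r
#A package name that is assumed not to be installed (hypothetical)
pkg <- "notARealPackage12345"
#require() returns FALSE (with a warning) when the package is missing
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
loaded
#library() stops with an error in the same situation
result <- tryCatch(library(pkg, character.only = TRUE),
                   error = function(e) "library() raised an error")
result
```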
UNIT 2
NITTY-GRITTY OF R
Structure
2.1 INTRODUCTION
In Unit 1 of the MST-015 (Introduction to R Software) course, you have learnt the
installation procedure of the R software, seeking help on built-in data sets,
constants, reserved words and functions using help(), ‘ ? ’ and ‘ ?? ’; and
some important fundamental aspects of R. In this unit, we shall make you
familiar with the nitty-gritty of R, such as how to create vectors, matrices,
arrays and factors in R. Additionally, we shall discuss vector operations,
matrix operations, logical operators and relational operators. Further, we shall
discuss the extraction of elements of vectors. In addition to this, we shall
discuss the extraction of sub-vectors and sub-matrices from matrices and
arrays. Furthermore, we shall illustrate the method of handling missing
values in R. There are two types of missing values in R. The first one is NA, for
values which are not available. The second type of missing value is NaN, for
values which are not numbers.
Before studying this unit of Block 1, we expect that you have studied Unit 1 of
MST-015 thoroughly.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
2.2 VECTORS
In this section, we first define a vector and then discuss some of the commonly
used R functions on vectors. So, let us define a vector: ‘It is the basic type of
data structure/object in R, which is a sequence of elements of the same
class’.
In R there are six types of vectors, namely, numeric, integer, character, logical,
complex and raw. Next, the question arises, ‘How is it created in R’? and the
answer is, vectors in R can be created by several methods. We now discuss
the most commonly used methods to create different types of vectors.
2.2.1 Numeric and Integer Vectors
One of the simplest methods of creating a vector is to use the concatenation
function, i.e., c(). This function creates a vector by concatenating the
elements or vector objects together.
For illustration, let us create a vector with elements 0.4, 0.6, -0.8
and 22.7 using the c() function as follows:
#Creating a numeric vector
> c(0.4, 0.6, -0.8, 22.7)
[1] 0.4 0.6 -0.8 22.7
Note: All the elements of this vector are of the same type, are given to one
decimal place and are separated by commas. That is why, in the output, all the
elements are printed to one decimal place. Also, this vector is of numeric
type, as all of its elements are real/decimal numbers, and it is called a numeric
vector. Additionally, note that numeric vectors are stored as double precision
real numbers.
Furthermore, note that the c() function can also be used to concatenate two
or more vectors or elements. For the illustration purpose, we next create a
vector with elements c(0.4,0.6), c(-0.8,22.7) and 12.3 using the c()
function as follows:
#Concatenating two vectors with a numeric element
> c(c(0.4, 0.6), c(-0.8, 22.7), 12.3)
[1] 0.4 0.6 -0.8 22.7 12.3
In both the created vectors the elements were of the same precision. Let us
now create a numeric vector using the c() function, whose elements are of
different precision, with elements 0.13, 0.3102, -0.110002 and 13.1.
#Creating a numeric vector with different precision elements
> c(0.13, 0.3102, -0.110002, 13.1)
[1] 0.130000 0.310200 -0.110002 13.100000
From this output it is clear that, by default, a numeric vector will be printed with
the same precision as of the highest precision element.
If you are interested in saving the recently created vector under the name x,
then it can be done by assigning the vector to x using the assignment operator
‘ <- ’, which was already discussed in Unit 1 of MST-015, as follows:
#Assigning a vector to x
> x <- c(0.13, 0.3102, -0.110002, 13.1)
After assigning the vector, we now check whether the vector is successfully
assigned to x or not, by printing x. For printing, either we can use the
print() function or simply can write the name of the created vector as
follows:
#Explicit printing
> print(x)
[1] 0.130000 0.310200 -0.110002 13.100000
#Auto printing
> x
[1] 0.130000 0.310200 -0.110002 13.100000
Note that, when the print() function is used to print any R object, the process of
printing is called explicit printing and if we only write the name of an object for
printing, this process is called auto printing. For more details on printing
functions, you can refer to Unit 4 of MST-015. Next, we check the class()
and typeof() of a numeric vector as follows:
#Checking the class and type of a numeric vector
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"
#Internal structure
> str(round)
function (x, digits = 0)
Thus, by using the round() function, an object with name x is rounded to
the number of decimal places specified by digits, whose default value is 0.
Note: The str() function is used to compactly display the internal structure
of an R object. It can be considered as an alternative to the summary() function
(which will be discussed later).
Let us now round off the earlier created vector x to 2 and 1 decimal places as
follows:
#Rounding x till 2 decimal places
> round(x, 2)
[1] 0.13 0.31 -0.11 13.10
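The text above also mentions rounding to 1 decimal place. A minimal sketch of that case, assuming x is the same four-element vector created earlier (the names x1 and x0 are introduced here only for illustration):

```r
# Recreate the vector x used in this subsection
x <- c(0.13, 0.3102, -0.110002, 13.1)

# Round every element of x to 1 decimal place
x1 <- round(x, digits = 1)
print(x1)   # 0.1 0.3 -0.1 13.1

# With the default digits = 0, values are rounded to whole numbers
x0 <- round(x)
print(x0)   # 0 0 0 13
```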
Let us get the length of the earlier created vector x using the length()
function as follows:
#Getting the length of the vector x
> length(x)
[1] 4
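The behaviour of length() with missing values, and the effect of setting the length of a vector, can be sketched as follows. The vector v and its values are hypothetical and are used only for illustration:

```r
# A vector containing a missing value (hypothetical example data)
v <- c(10, NA, 30)
n_before <- length(v)   # 3: the NA is counted as an element

# Increasing the length pads the vector with NA
length(v) <- 5
v_longer <- v           # 10 NA 30 NA NA

# Decreasing the length drops the trailing elements
length(v) <- 2
v_shorter <- v          # 10 NA
```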
From these outputs, it is clear that the length() function also counts NA
(missing) values. Also, the length can be set as smaller or larger than the
original size of the vector. More details on the length() function can be seen
in Unit 3 of MST-015. Next, we discuss integer vectors.
An integer vector in R can be created in several ways; the most popular way
is to append L at the end of each element of the vector. Consider the
following example for illustration, in which we check the class of a
vector whose elements are written by appending L.
#Checking the class of a vector
> class(c(1L, 2L))
[1] "integer"
From this output it is clear that, c(1L, 2L) is an integer vector. Let us next
see, what happens, if we do not append L at the end of each element of it.
#Checking the class of a vector
> class(c(1, 2))
[1] "numeric"
> class(c(1, 2L))
[1] "numeric"
Hence, if we do not append L at the end of each element, then the created
vector will be of numeric type.
An integer vector can also be created using the colon ‘ : ’ operator. So, if you
want to generate a vector using the colon operator, then you need to specify
the first and the last values of the sequence or vector which you intend
to create.
#Generating a sequence from 0 to 10
> 0:10
[1] 0 1 2 3 4 5 6 7 8 9 10
Note that, if the first value is smaller than the last value of the sequence, then
the generated sequence will be an increasing sequence and if the first value is
larger than the last value, then the generated sequence will be a decreasing
sequence. Moreover, both increasing and decreasing sequences are
generated in steps of 1, and the starting and ending values are separated
by a colon. Consider a few more examples for understanding purpose.
#Generating a decreasing sequence from 6 to -6
> 6:-6
[1] 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6
Note that whether the sequence is increasing or decreasing, the values are
increased or decreased by one. Thus, whenever a vector is created using the
colon ‘ : ’ operator, the step of the generated sequence will always be 1 (by
default). If you want the step of the generated sequence to be other than 1,
then you should use some other method to create the vector. One of the
commonly used methods is the seq() function.
#The seq() function
seq(from, #starting value of the sequence
to, #last value of the sequence
by, #steps or increment/decrement
length, #desired length of the sequence (full name: length.out)
along, #a vector whose length is to be used (full name: along.with)
...) #other arguments
It should be noted here that, if the starting value assigned to the from
argument is smaller (larger) than the last value assigned to the to argument,
then a positive (negative) value should be assigned to the by argument of the
seq() function. For example:
#Generating a sequence from 1 to 2 with an increment of 0.1
> seq(from=1, to=2, by=0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
#Generating a sequence from 5 to 1 with a decrement of 2
> seq(from=5, to=1, by=-2)
[1] 5 3 1
Next, we illustrate the use of the along argument of the seq() function.
Consider a vector x with elements 10, 22, 14, 40, 98 and 11. Clearly, the
length of x is 6. In case, if we want to generate a sequence of same length,
i.e., 6, starting from 1 with an increment of 0.1, then we can use the along
argument of the seq() function as follows:
#Generating a sequence using the along argument
> x <- c(10, 22, 14, 40, 98, 11)
> seq(from=1, by=0.1, along=x)
[1] 1.0 1.1 1.2 1.3 1.4 1.5
From this output it is clear that the created vector is of length 6, which is same
as the length of x.
Note: If the length or along argument is specified in seq() function, then
we do not need to assign the to argument of the function.
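The note above can be illustrated with the length argument as well. The following is a minimal sketch that uses the full argument name length.out and the base helper seq_along(); the object names s1, s2 and s3 are introduced here only for illustration:

```r
x <- c(10, 22, 14, 40, 98, 11)

# Six values starting at 1 in steps of 0.1; no 'to' value is required
s1 <- seq(from = 1, by = 0.1, length.out = 6)   # 1.0 1.1 1.2 1.3 1.4 1.5

# Six equally spaced values from 1 to 2; the step is worked out by seq()
s2 <- seq(from = 1, to = 2, length.out = 6)     # 1.0 1.2 1.4 1.6 1.8 2.0

# seq_along() gives the positions 1, 2, ..., length(x)
s3 <- seq_along(x)                              # 1 2 3 4 5 6
```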
Furthermore, we often want to create special type of vectors, whose elements
are repeats of some specific number(s) or character(s). In such situations, the
rep() function available in the base package can be used efficiently. The
main arguments of interest of the rep() function are as follows:
#The rep() function
rep(x, #a vector which is to be replicated
times, #number of times elements are to be repeated
each, #the repetition of each element of x
length) #length of the output vector (full name: length.out)
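A short sketch of how the times, each and length arguments of rep() differ is given below. The input values and the names r1 to r4 are chosen here only for illustration:

```r
# Repeat a single value three times
r1 <- rep(x = 5, times = 3)                  # 5 5 5

# Repeat the whole vector twice
r2 <- rep(x = c(1, 2), times = 2)            # 1 2 1 2

# Repeat each element twice before moving to the next one
r3 <- rep(x = c(1, 2), each = 2)             # 1 1 2 2

# Keep repeating the vector until the output has 5 elements
r4 <- rep(x = c("a", "b"), length.out = 5)   # "a" "b" "a" "b" "a"
```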
Similarly, we can create an integer zero vector of length five by assigning the
mode argument as "integer" and the length argument as 5 in the
following manner:
#Creating an integer vector of zeros
> vector(mode = "integer", length=5)
[1] 0 0 0 0 0
Note: The mode can be "logical" and "character" as well. For more
details, you can see the help on this function.
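As a small sketch of the note above, the vector() function fills the new vector with the "empty" value of the requested mode. The names v_num, v_log and v_chr are introduced here only for illustration:

```r
# Numeric mode: filled with zeros
v_num <- vector(mode = "numeric", length = 3)     # 0 0 0

# Logical mode: filled with FALSE
v_log <- vector(mode = "logical", length = 2)     # FALSE FALSE

# Character mode: filled with empty strings
v_chr <- vector(mode = "character", length = 2)   # "" ""
```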
#Alternative method
> c(T, T, F, F, T)
[1] TRUE TRUE FALSE FALSE TRUE
Next, we create a logical vector with elements TRUE, TRUE, FALSE, FALSE
and FALSE using the c() and rep() functions as follows:
#Creating a logical vector
> c(rep(TRUE, 2), rep(FALSE,3))
[1] TRUE TRUE FALSE FALSE FALSE
Lastly, we check the class and type of a logical vector c(TRUE,FALSE) as
follows:
#Checking the class and type
> class(c(TRUE, FALSE))
[1] "logical"
> typeof(c(TRUE, FALSE))
[1] "logical"
The same vector can also be created using single quotes as follows:
#Creating character vector using single quotes
> c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday')
[1] "Sunday" "Monday" "Tuesday" "Wednesday"
[5] "Thursday" "Friday" "Saturday"
Note that, by default, the output is printed in double quotes. Consider another
example, in which we create a character vector with elements ‘A a 1’, ‘B b 2’,
‘C c 3’ and ‘D d 4’ as follows:
#Creating a character vector
> c('A a 1','B b 2','C c 3', 'D d 4')
[1] "A a 1" "B b 2" "C c 3" "D d 4"
#Creating a vector x
> x <- c(0.13, 0.3102, -0.110002, 13.1)
Next, we discuss the method of extracting the 1st element, then the 1st and 5th
elements together and, thereafter, the 2nd to 4th elements together as follows:
#Extracting 1st element of LETTERS
> LETTERS[1]
[1] "A"
#Extracting 1st and 5th elements of LETTERS
> LETTERS[c(1,5)]
[1] "A" "E"
#Extracting 2nd to 4th elements of LETTERS
> LETTERS[2:4]
[1] "B" "C" "D"
Most importantly, note that a negative sign with position number in the
brackets after the name of the vector is used to drop particular positioned
element(s) as follows:
#Dropping 2nd element of x
> x <- c(0.13, 0.3102, -0.110002, 13.1)
> x[-2]
[1] 0.130000 -0.110002 13.100000
In the next subsection, we shall discuss the method of appending element(s)
to an already created vector.
Note that the elements in a vector can also be appended using the append()
function available in the base package. For illustration, we append the
values 1 and 2 after the 4th element as follows:
#Appending values 1 and 2 after the 4th element
> x <- c(0.13, 0.3102, -0.110002, 13.1)
> append(x, values=c(1,2), after = 4)
[1] 0.130000 0.310200 -0.110002 13.100000 1.000000
[6] 2.000000
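The after argument controls where the new values are placed. A minimal sketch with other positions is given below; the value 99 and the names a1, a2 and a3 are chosen here only for illustration:

```r
x <- c(0.13, 0.3102, -0.110002, 13.1)

# after = 0 places the new values before the first element
a1 <- append(x, values = c(1, 2), after = 0)

# after = 2 places the new value between the 2nd and 3rd elements
a2 <- append(x, values = 99, after = 2)

# Appending at the end can also be done with c()
a3 <- c(x, 1, 2)
```

Here a3 is the same as append(x, c(1, 2)), because the default value of after is the length of x.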
SAQ 1
Write the output of the following statements:
x <- c(0.2, c(0.1, -1.21), c(0.2, 1.3, 1))
(i) print(x[c(2,5)])
(ii) print(x[-5])
(iii) class(x)
(iv) append(x, values=2, after=5)
(v) seq(from=1, to=2, along=x)
(vi) x[c(-2, -5)]
(vii) x[1:5]
Operator   Operation
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Remainder
%/%        Integer division
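Each of these operators works element by element when applied to two vectors of the same length. A minimal sketch, with the vectors p and q chosen here only for illustration:

```r
p <- c(7, 10, 15)
q <- c(2, 3, 4)

add_pq  <- p + q     # 9 13 19
sub_pq  <- p - q     # 5 7 11
mul_pq  <- p * q     # 14 30 60
div_pq  <- p / q     # 3.5 3.333333 3.75
pow_p   <- p ^ 2     # 49 100 225
rem_pq  <- p %% q    # 1 1 3  (remainder after division)
quot_pq <- p %/% q   # 3 3 3  (whole part of the division)
```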
Next, we illustrate the exponent operator. Note that whenever the exponent
operator is used on a vector, the exponent of each element of the vector is
computed. For example, let us compute x^2, where x is a vector, using the ‘ ^ ’
operator as follows:
#Obtaining the positive power of the elements of a vector x
> x <- c(2, 4, 5, 6)
> x^2
[1] 4 16 25 36
We first compute the minimum and maximum of vector z using the min() and
max() functions respectively as follows:
#Computing the minimum
> min(z)
[1] 4
#Creating vector y
> y <- c(pi/6, pi/4, pi/3, pi/2)
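Mathematical functions are also applied to each element of a vector. As a sketch, assuming the vector y of angles (in radians) created above, the sine and cosine of each angle can be obtained as follows (the names sin_y and cos_y are introduced here only for illustration):

```r
y <- c(pi/6, pi/4, pi/3, pi/2)

# Sine of each angle: approximately 0.5, 0.7071068, 0.8660254 and 1
sin_y <- sin(y)

# Cosine of each angle: approximately 0.8660254, 0.7071068 and 0.5;
# the last value is a very small number rather than exactly 0,
# because pi is stored with finite precision
cos_y <- cos(y)
```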
SAQ 2
Write the output of the following statements:
(i) 3+2*c(1,2)
(ii) min(c(0.2, -0.2, 0.0))
(iii) tan(c(pi/6, pi/4))
(iv) c(1,4,8,3)%/%c(2,5,2,2)
2.4 MATRICES
A matrix is a two-dimensional rectangular layout of data
elements of the same class. Matrices in R can be created using several
methods. The most commonly used method of creating a matrix is using the
matrix() function available in the base package. Note that, whenever a
matrix is created using the matrix() function, the elements of the matrix, by
default, will be filled column-wise. Also, the dimension of
the matrix is defined by supplying appropriate values to the nrow
and ncol arguments of the matrix() function. These arguments are used to
specify the number of rows and columns of the matrix. The main arguments of
interest of the matrix() function are as follows:
#The matrix() function
matrix(data, #data vector of matrix elements
nrow, #number of rows of created matrix
ncol, #number of columns of created matrix
byrow, #control the filling of data elements in matrix
dimnames) #gives names to rows and columns
The data argument of the matrix function is used to assign the data vector,
the nrow and ncol arguments are used to assign the number of rows and
columns of the created matrix, the byrow argument is a logical argument of
the function. If byrow =TRUE, then the elements of the data will be filled row-
wise in the created matrix, or otherwise will be filled column-wise. Lastly, the
dimnames argument is used to give names to the rows and columns of the
matrix. Note that, in dimnames a list of two components consisting of the
names of the rows and columns is assigned.
Note: List objects are discussed in Unit 3 of the MST-015 course.
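The use of the dimnames argument can be sketched as follows. The row and column names ("r1", "c1", etc.) and the object name M are hypothetical and are chosen here only for illustration:

```r
# A 2x3 matrix whose rows and columns are given names through a list
M <- matrix(data = c(-1, 3, -2, 5, 4, 2), nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(M)
#    c1 c2 c3
# r1 -1 -2  4
# r2  3  5  2

# Elements can then be referred to by name as well as by position
m23 <- M["r2", "c3"]   # 2
```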
Next, we illustrate the method of creating a matrix of dimension 2x3 with
elements -1, 3, -2, 5, 4 and 2 using the matrix() function. To do so, we
assign the data argument of the function as the vector consisting of these
elements, the nrow argument as 2 and the ncol argument as 3, as follows:
#Creating a matrix in which elements are filled column-wise
> matrix(data=c(-1, 3, -2, 5, 4, 2), nrow=2, ncol=3)
[,1] [,2] [,3]
[1,] -1 -2 4
[2,] 3 5 2
(ii) The class and type of a matrix object can be seen as follows:
#Checking the class and type
> class(matrix(c(-1, 3, -2, 5, 4, 2), 2, 3))
[1] "matrix" "array"
Note that, whenever both the function arguments nrow and ncol are
specified, the product of both of them should be equal to the length of the
data, or otherwise you may get a warning message. Consider the following
example, in which the length of the data is 8, which is larger than the product
of the dimensions 2x3, i.e., 6 for illustration purpose only.
#Creating a matrix
> matrix(c(-1, 3, -2, 5, 4, 2, 4, 5), nrow=2, ncol=3,
byrow=TRUE)
[,1] [,2] [,3]
[1,] -1 3 -2
[2,] 5 4 2
Warning message:
So, from this output we can see that, whenever the length of the data is more
than the product of nrow and ncol in the matrix() function, in that case the
extra data elements will be discarded with a warning message.
Next, we illustrate, what happens if the length of the data is less than the
product of nrow and ncol in the matrix() function. In that case the data will
start to replicate itself until it matches the product of nrow and ncol with a
warning message. For example:
#Creating a matrix
> matrix(c(-1, 3, -2, 5, 4), nrow=2, ncol=3, byrow=TRUE)
[,1] [,2] [,3]
[1,] -1 3 -2
[2,] 5 4 -1
Warning message:
In matrix(c(-1, 3, -2, 5, 4), nrow = 2, ncol = 3, byrow = TRUE) :
data length [5] is not a sub-multiple or multiple of the number of rows [2]
#Assigning matrix A
> A <- matrix(seq(from=1, to=17, by=2), nrow=3); A
[,1] [,2] [,3]
[1,] 1 7 13
[2,] 3 9 15
[3,] 5 11 17
#Assigning matrix B
> B <- matrix(seq(from=1, to=18, by=1.5), nrow=3, ncol=4); B
[,1] [,2] [,3] [,4]
[1,] 1.0 5.5 10.0 14.5
[2,] 2.5 7.0 11.5 16.0
[3,] 4.0 8.5 13.0 17.5
#Assigning matrix C
> C <- matrix(c(2, 3, 1, 4, 5, 1, 2, -1, 5), nrow=3); C
[,1] [,2] [,3]
[1,] 2 4 2
[2,] 3 5 -1
[3,] 1 1 5
Observe that the two matrices A and C are conformable for matrix addition and
subtraction, as they have the same dimension. So, let us perform addition and
subtraction of matrices A and C using the ‘ + ’ and ‘ - ’ operators as follows:
#Performing matrix addition
> A+C
[,1] [,2] [,3]
[1,] 3 11 15
[2,] 6 14 14
[3,] 6 12 22
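Subtraction works in the same element-wise manner. A minimal sketch, assuming A and C are the two 3x3 matrices created above (the name D is introduced here only for illustration):

```r
A <- matrix(seq(from = 1, to = 17, by = 2), nrow = 3)
C <- matrix(c(2, 3, 1, 4, 5, 1, 2, -1, 5), nrow = 3)

# Performing matrix subtraction
D <- A - C
print(D)
#      [,1] [,2] [,3]
# [1,]   -1    3   11
# [2,]    0    4   16
# [3,]    4   10   12
```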
Next, we perform matrix multiplication using the ‘ %*% ’ operator. Since the
matrix C is conformable for matrix multiplication with matrix B (the number of
columns of C equals the number of rows of B), we next compute their product
as follows:
#Performing matrix multiplication
> C%*%B
[,1] [,2] [,3] [,4]
[1,] 20.0 56 92.0 128
[2,] 11.5 43 74.5 106
[3,] 23.5 55 86.5 118
Recall that, whenever a scalar is multiplied with a matrix, all the elements of
that matrix are multiplied by the scalar. So, if k is a scalar, whose value is 3,
and A is the earlier created matrix, then we can compute their product using
the ‘ * ’ operator as follows:
#Assigning scalar
> k <- 3
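The scalar multiplication itself can be sketched as follows, assuming A is the 3x3 matrix created earlier from seq(from=1, to=17, by=2) (the name kA is introduced here only for illustration):

```r
A <- matrix(seq(from = 1, to = 17, by = 2), nrow = 3)
k <- 3

# Multiplying every element of A by the scalar k
kA <- k * A
print(kA)
#      [,1] [,2] [,3]
# [1,]    3   21   39
# [2,]    9   27   45
# [3,]   15   33   51
```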
> L <- matrix(letters[1:20], nrow=4, ncol=5); L
Note that the built-in constant letters consists of the 26 lower-case letters of the
Roman alphabet. We have used only the first 20 for illustration.
Before we illustrate the method of extraction, it is important for you to
understand the terms ‘row indices’ and ‘column indices’. The first place in
brackets ‘ […] ’ after the name of the matrix is known as the place for row
indices (also referred to as margin 1) and the second place, which is separated
by a comma from the row indices, is known as the place for column indices (also
referred to as margin 2). The row and column indices are used to specify
particular row(s) or column(s) or both of a considered matrix.
Now, we show the method of extraction of the 2nd row of the matrix L. To extract
the 2nd row from L, we write 2 at the row indices place and leave the column
indices place empty in brackets as follows:
Similarly, we can also extract the 3rd column of the matrix L by leaving the row
indices place empty and writing 3 at the column indices place in brackets as
follows:
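Both extractions can be sketched as follows, assuming L is the 4x5 matrix of the first 20 lower-case letters created above (the names row2 and col3 are introduced here only for illustration):

```r
L <- matrix(letters[1:20], nrow = 4, ncol = 5)

# Extracting the 2nd row of L (column indices place left empty)
row2 <- L[2, ]    # "b" "f" "j" "n" "r"

# Extracting the 3rd column of L (row indices place left empty)
col3 <- L[, 3]    # "i" "j" "k" "l"
```

Note that in both cases the result is a plain character vector and not a matrix.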
In case, if you are interested in extracting the 4th element of the 3rd row of
matrix L, then, it can be extracted by writing 3 at the row indices place and 4 at
the column indices place in brackets as follows:
#Extracting the 4th element of the 3rd row of L
> L[3,4]
[1] "o"
Most importantly, note that, leaving any indices place (row or column) empty
leads to selection of the full range of that subscript. Moreover, in general, the
extraction of a sub-vector and a particular positioned element from a given
matrix can be done, by writing the following after the name of the matrix:
[i, ] For extracting the ith row vector
[i, j] For extracting the (i, j)th element
You can note that the matrix shown in the rectangular box is appearing in the
1st and 2nd rows of L. In addition to this, its columns are appearing in the 3rd
and 4th columns of L, which means, we can easily extract the matrix shown in
the rectangular box by writing the row indices as c(1,2) and the column
indices as c(3,4) in the following manner:
#Extracting the submatrix shown in the rectangular box
> L[c(1,2), c(3,4)]
[,1] [,2]
[1,] "i" "m"
[2,] "j" "n"
From this print statement, it is clear that the replacement of the 2nd column of
the matrix L is successfully performed.
2.4.3 Matrix Functions
In this subsection, we shall discuss some important matrix functions and
illustrate the execution of each one of them one-by-one by giving some
suitable examples. A list of the most popular matrix functions (with their
objectives in front) is as follows:
Matrix Function   Objective
t()               Obtain the transpose of a matrix.
nrow()            Obtain the number of rows of a matrix.
ncol()            Obtain the number of columns of a matrix.
dim()             Obtain the dimension of a matrix.
rbind()           Combine vectors/matrices vertically.
det()             Compute the determinant of a matrix.
solve()           Obtain the inverse of a matrix.
diag()            Serves multiple purposes depending on the argument supplied to
                  this function. The argument can be a scalar, a vector or a matrix.
[1] 5 -3 8
#Obtaining a vector of row means
> rowMeans(A)
[1] 1.666667 -1.000000 2.666667
Similarly, the sums of the elements of the first, second and third columns of the
matrix A are 3+4+2, 1-4+2 and 1-3+4, i.e., 9, -1 and 2, respectively, which yields
the column means as 9/3, -1/3 and 2/3, respectively. Thus, the vectors of
column sums and column means are (9, -1, 2) and (3, -0.3333333,
0.6666667), respectively. The same can be obtained by using the colSums()
and colMeans() functions as follows:
#Obtaining a vector of column sums
> colSums(A)
[1] 9 -1 2
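The column means can be obtained in the same way. The sketch below assumes that A is the 3x3 matrix with rows (3, 1, 1), (4, -4, -3) and (2, 2, 4), which is consistent with the sums shown in this subsection (the name cm is introduced here only for illustration):

```r
# Matrix A, filled column-wise (assumed from the sums discussed in the text)
A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), nrow = 3)

# Obtaining a vector of column means
cm <- colMeans(A)
print(cm)   # 3.0000000 -0.3333333 0.6666667
```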
Note: If the determinant of a matrix is zero then the matrix is called singular
(which means inverse does not exist), and if the determinant of a matrix is non-
zero then the matrix is called non-singular (which means inverse exists).
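The note above can be checked with the det() function. In the following sketch, A is assumed to be the 3x3 matrix used in this subsection, and S is a hypothetical singular matrix introduced only for comparison:

```r
# Matrix A (assumed from the outputs shown in this subsection)
A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), nrow = 3)
det_A <- det(A)   # -36 (up to rounding error), so A is non-singular

# A hypothetical singular matrix: the second column is twice the first
S <- matrix(c(1, 2, 2, 4), nrow = 2)
det_S <- det(S)   # 0, so S has no inverse
```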
As the computed value of the determinant is non-zero, the matrix A
is non-singular and hence, its inverse exists. Next, we compute the inverse of
A using the solve() function as follows:
#Computing the inverse of a matrix A
> solve(A)
[,1] [,2] [,3]
[1,] 0.2777778 0.05555556 -0.02777778
[2,] 0.6111111 -0.27777778 -0.36111111
[3,] -0.4444444 0.11111111 0.44444444
You can verify whether the computed inverse is correct or not by verifying the
following result:
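One standard way of carrying out such a check is to multiply A by its computed inverse and compare the product with the identity matrix. This is a sketch under the assumption that A is the 3x3 matrix used in this subsection; the names A_inv and P are introduced only for illustration:

```r
A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), nrow = 3)
A_inv <- solve(A)

# The product of a matrix and its inverse should be the identity matrix.
# Rounding removes the tiny errors caused by floating-point arithmetic.
P <- round(A %*% A_inv, digits = 10)
print(P)
```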
Next, we illustrate the use of the diag() function. This function can take 3
types of arguments, namely, a scalar, a vector and a matrix. Whenever a scalar k
is supplied as the argument to the diag() function, it creates a kxk
identity matrix.
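The three uses of the diag() function can be sketched as follows. The matrix A is assumed to be the 3x3 matrix used in this subsection; the other values and names are chosen here only for illustration:

```r
# Scalar argument: a 3x3 identity matrix
I3 <- diag(3)

# Vector argument: a diagonal matrix with the vector on the diagonal
D <- diag(c(1, 2, 3))

# Matrix argument: the diagonal elements of the matrix are extracted
A <- matrix(c(3, 4, 2, 1, -4, 2, 1, -3, 4), nrow = 3)
d <- diag(A)   # 3 -4 4
```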
#Assigning a vector
> x <- c(1, 2, 3, 4) # x <- 1:4
Note that any number of matrices or vectors can be combined vertically (row
wise) or horizontally (column wise), using the rbind() and cbind()
functions, respectively. Another important point is that, whether matrices are
combined or vectors are combined, the obtained output will always be a
matrix object.
#Combining two matrices A and B row wise
> rbind(A,B)
[,1] [,2] [,3]
[1,] 3 1 1
[2,] 4 -4 -3
[3,] 2 2 4
[4,] 1 1 1
[5,] 1 1 1
[6,] 1 1 1
These two matrices can also be combined column wise using the cbind()
function as follows:
#Combining two matrices A and B column wise
> cbind(A,B)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 1 1 1 1
[2,] 4 -4 -3 1 1 1
[3,] 2 2 4 1 1 1
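The same functions can combine vectors, and the result is then also a matrix. A minimal sketch, with the vectors u and v chosen here only for illustration:

```r
u <- c(1, 2, 3)
v <- c(4, 5, 6)

# Each vector becomes a row: a 2x3 matrix
R <- rbind(u, v)

# Each vector becomes a column: a 3x2 matrix
K <- cbind(u, v)

is.matrix(R)   # TRUE
is.matrix(K)   # TRUE
```

The names of the vectors are used as the row names of R and as the column names of K.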
SAQ 3
(a) Write the output of the following code:
A <- matrix(1:4,nrow=2); A
B <- matrix(5:8, nrow=2, ncol=2); B
C <- matrix(rep(1,4),ncol=2); C
A-C+B%*%C
(b) Define a matrix in R and create a matrix with the following elements.
1 0 3
0 1 2
2 4 0
2.5 ARRAYS
From previous sections, you can observe that a vector is a one-dimensional
arrangement of data elements and a matrix is a two-dimensional arrangement
of data elements, i.e., data are presented in rows and columns. Arrays in
R provide a more general way of presenting data in one, two or
more than two dimensions. In fact, arrays with one and two dimensions are the
same as a vector and a matrix, respectively. Arrays in R can be created using
the array() function available in the base package.
#The array() function
array(data, #data vector of elements
dim, #to specify dimension
dimnames) #names for the dimensions
For creating a matrix using the array() function, we should assign the data
argument as vector x and the number of rows and columns of the created
matrix to the dim argument such that the product of the number of rows and
columns should be equal to the number of elements in data argument as
follows:
#Creating an array of two-dimension
> array(data=x, dim=c(3,6))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 4 7 10 13 16
[2,] 2 5 8 11 14 17
[3,] 3 6 9 12 15 18
, , 1
[,1] [,2]
[1,] 1.0 3.0
[2,] 1.5 3.5
[3,] 2.0 4.0
[4,] 2.5 4.5
, , 2
[,1] [,2]
[1,] 5.0 7.0
[2,] 5.5 7.5
[3,] 6.0 8.0
[4,] 6.5 8.5
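The two slices printed above are consistent with a three-dimensional array of dimension 4x2x2 filled with the values 1, 1.5, ..., 8.5. The call that produced them is not shown here, so the following is only an assumed reconstruction (the object name arr is hypothetical):

```r
# A 4x2x2 array: 4 rows, 2 columns and 2 slices (assumed reconstruction)
arr <- array(data = seq(from = 1, to = 8.5, by = 0.5), dim = c(4, 2, 2))
print(arr)

# Extracting the element in the 2nd row and 1st column of the 2nd slice
e212 <- arr[2, 1, 2]   # 5.5

# Extracting the whole first slice, which is a 4x2 matrix
slice1 <- arr[, , 1]
```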
SAQ 4
Create an array of two dimensions with the following elements:
2 0 1
1 1 2
3 0 1
4 1 0
1 2 1
1 0 1
After creating it save it under the name B. Also, extract the row shown in the
rectangle.
2.6 FACTORS
A factor in R is a special type of object, which provides an easy way to specify
a discrete classification of the elements of vectors of the same length. It
provides an easy way of handling categorical (nominal) data. The possible
values it can take can be seen from its levels. For example, statistical data
often contain categorical variables, which indicate the
subdivision of the data under consideration on the basis of social class, cancer
stage, etc. Factors in R are created using the factor() function. For
example, consider an illustration in which we create a factor object consisting
of the data on the social status of 7 individuals using the factor() function as
follows:
#Creating a factor
> factor(c("Medium", "Low", "Medium", "High", "High", "Low",
"Medium"))
[1] Medium Low Medium High High Low Medium
Levels: High Low Medium
Note that, whenever a factor is created, the levels of that factor are displayed
in alphabetical order (the Levels: High, Low, Medium are in alphabetical
order). The levels argument of the function factor() can be used to set
the order of the levels of a factor as follows:
#Setting the order of levels of a factor
> factor(c("Medium", "Low", "Medium", "High", "High", "Low",
"Medium"), levels=c("Low", "Medium", "High"))
[1] Medium Low Medium High High Low Medium
Levels: Low Medium High
You can clearly observe the difference between the presentation of the
Levels. In the first example the levels are printed in alphabetical order, but in
the second example Levels are printed in the order assigned by us.
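Once a factor is created, its levels and the number of observations in each level can be inspected with base functions. A minimal sketch, where the object name status is introduced here only for illustration:

```r
status <- factor(c("Medium", "Low", "Medium", "High", "High", "Low", "Medium"),
                 levels = c("Low", "Medium", "High"))

# The levels, in the order assigned by us
lv <- levels(status)        # "Low" "Medium" "High"

# Number of levels
nl <- nlevels(status)       # 3

# Frequency of each level: Low 2, Medium 3, High 2
freq <- table(status)

# Internally, each element is stored as the position of its level
codes <- as.integer(status) # 2 1 2 3 3 1 2
```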
Let us next discuss another method of creating a factor in R. A factor in R can
also be created using the gl() function available in the base package. The
three main arguments of interest of the gl() function are n, k and labels.
The n argument is an integer used to assign the number of levels, the k
argument is used to assign the number of replications of each level and the
labels argument is used to set the labels that are to be given to the
levels. To understand it more clearly, let us create a factor with levels Low,
Medium and High, such that each level is replicated 2 times, using the gl()
function as follows:
#Generating a factor of length 6 with 3 levels
> gl(n=3, k=2, labels=c("Low", "Medium", "High"))
[1] Low Low Medium Medium High High
Levels: Low Medium High
Hence it can be seen that in the created factor, there are 3 levels, each
level is replicated 2 times and the labels are "Low", "Medium" and
"High".
SAQ 5
Generate a factor of length 10 with 2 levels YES and NO. Each level should be
replicated 5 times.
Observe that the sum of the elements of the vector is coming out as NA, as the
vector contains NA values. So, to compute the sum of non-missing values of
the vector, we can use the na.rm argument of the sum() function as follows:
#Computing the sum by using the na.rm argument
> sum(c(1:4, NA, NA, 7:10), na.rm=TRUE)
[1] 44
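The effect of missing values on computations, and a few common ways of handling them, can be sketched as follows. The vector is the same as in the example above; the object names are introduced here only for illustration:

```r
v <- c(1:4, NA, NA, 7:10)

# Without na.rm, a single NA makes the whole sum NA
s_all <- sum(v)                  # NA

# Counting the missing values with is.na()
n_missing <- sum(is.na(v))       # 2

# The mean of the 8 non-missing values: 44/8
m <- mean(v, na.rm = TRUE)       # 5.5

# NaN arises from undefined operations such as 0/0
z <- 0/0                         # NaN
```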
SAQ 6
Write the output of the following:
(i) is.na(c(NA,NaN))
(ii) is.nan(c(NA,NaN))
Operator   Meaning
==         Equal to
!=         Not equal to
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
Next, we first do the comparison between two scalars a and b using the
relational operators as follows:
#Checking for inequality
> a!=b #10.34 != 20.45
[1] TRUE
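The remaining relational operators can be applied in the same way. The sketch below assumes a = 10.34 and b = 20.45, as indicated by the comment in the example above; the result names are introduced only for illustration:

```r
a <- 10.34
b <- 20.45

r_eq <- a == b   # FALSE
r_lt <- a < b    # TRUE
r_gt <- a > b    # FALSE
r_le <- a <= b   # TRUE
r_ge <- a >= b   # FALSE

# Between two vectors, the comparison is done element by element
r_vec <- c(1, 5, 3) > c(2, 4, 3)   # FALSE TRUE FALSE
```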
Similarly, the other relational operators can be used between x and y.
Also, whenever a relational operator is applied between two vectors, the
result consists of TRUE and FALSE values, which are
computed by element-wise comparison of the vectors. A vector
can also be compared with a scalar; in that case each element of the
vector is compared with the scalar and the result is a vector
of TRUE and FALSE, i.e., a logical vector (the scalar is replicated until its
length becomes equal to the length of the vector). For illustration,
consider the following example:
#Comparing each element of a vector with 20
> c(31.45, 40.23, -14.230, 20) <= 20
[1] FALSE FALSE TRUE TRUE
Next, we discuss the following logical operators, which are available for use in
R programming:
Operator   Meaning
!          Logical NOT
|          Element-wise logical OR
||         Logical OR
&          Element-wise logical AND
&&         Logical AND
Note that these logical operators are applied to expressions which
result in TRUE or FALSE. Also, when a number is used as a logical value, a
non-zero number means TRUE and zero means FALSE. For illustration, let us
first assign a scalar c as -10; then, by using different expressions, we shall
show the execution of these logical operators as follows:
#Assigning c
> c <- -10
#Logical NOT
> !(c<1) #As c<1 is TRUE and !TRUE is FALSE
[1] FALSE
#Logical OR
> (c<1) || (c>2) #As c<1 is TRUE and c>2 is FALSE
[1] TRUE #TRUE || FALSE
#Logical AND
> (c<1) && (c>2) #TRUE && FALSE
[1] FALSE
#Element wise OR
> c(TRUE, FALSE, TRUE, FALSE) | c(TRUE, TRUE, FALSE, FALSE)
[1] TRUE TRUE TRUE FALSE
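The element-wise AND operator and the treatment of numbers as logical values can be sketched in the same manner. The input vectors and the result names are chosen here only for illustration:

```r
# Element-wise AND: TRUE only where both elements are TRUE
and_res <- c(TRUE, FALSE, TRUE, FALSE) & c(TRUE, TRUE, FALSE, FALSE)
# TRUE FALSE FALSE FALSE

# Numbers used as logical values: zero is FALSE, non-zero is TRUE
as_log <- as.logical(c(0, 5, -1))   # FALSE TRUE TRUE

# NOT applied to numbers first converts them to logical values
not_num <- !c(0, 5, -1)             # TRUE FALSE FALSE
```

The operators ‘ && ’ and ‘ || ’ are meant for single TRUE/FALSE values, such as the conditions used earlier with the scalar c, whereas ‘ & ’ and ‘ | ’ are the ones to use with vectors.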
Here, ex1 and ex2 are the expressions which result in TRUE and FALSE.
Now, we illustrate the procedure of extraction of elements with the help of
logical operators. To do so, consider the following arbitrary vector y of 5
elements:
#Creating a vector y
> y <- c(1, 4, 2, 6, 3)
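A logical vector placed in brackets keeps only the elements at the TRUE positions. The following is a minimal sketch on the vector y; the conditions and the result names are chosen here only for illustration:

```r
y <- c(1, 4, 2, 6, 3)

# Elements of y greater than 2
gt2 <- y[y > 2]               # 4 6 3

# Elements greater than 2 AND less than 6 (element-wise AND)
between <- y[y > 2 & y < 6]   # 4 3

# Even elements of y
evens <- y[y %% 2 == 0]       # 4 2 6

# Positions (not values) of the elements greater than 2
pos <- which(y > 2)           # 2 4 5
```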
Next, we obtain a logical matrix of TRUE and FALSE using the relational
operator as follows:
#Getting a logical matrix
> A!=2
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE TRUE
[2,] FALSE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE
This matrix of TRUE and FALSE created with the help of a relational operator can
further be used for extracting specific elements of the matrix A. For example,
elements which are not equal to 2 can be extracted easily as follows:
#Extracting elements which are not equal to 2
> A[A!=2]
[1] 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16
SAQ 7
Write the output of the following statements:
(i) c(TRUE, FALSE) & c(FALSE, TRUE)
(ii) c(TRUE, FALSE) | c(FALSE, TRUE)
(iii) !c(TRUE, FALSE)
(iv) c(1, 0, 1, 0) > c(0, 2,-1, 2)
(v) x <- c(seq(1,10,2),4,2:10); x[x%%2==0]
2.9 SUMMARY
The main points discussed in this unit are as follows:
Methods of creating different types of vectors and associated vector
operations are discussed.
Methods of creating matrices and associated matrix operations are
discussed.
The method of creating an array in R is discussed.
Methods of extraction of elements/subparts from vectors, matrices and
arrays are discussed.
Handling of missing values is discussed.
Different types of arithmetic operators, mathematical functions, relational
and logical operators are discussed.
The method of creating a factor object is discussed.
Finally, extraction of elements using relational operators is discussed.
(v) diag(A)
(vi) diag(k*A)%*%diag(k), where k=2
(vii) rbind(A,B)
(viii) cbind(A,B)
9. Write any two differences between NA and NaN.
2.11 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The outputs of the given statements are as follows:
0.1 1.3
(iii) "numeric"
2. (i) 5 7
(ii) -0.2
(iv) 0 0 4 1
[,1] [,2]
[1,] 12 14
[2,] 15 17
(b) See section 2.4 for definition and the given matrix can be created as
follows: matrix(c(1, 0, 2, 0, 1, 4, 3, 2, 0), ncol=3)
Or
matrix(c(1, 0, 3, 0, 1, 2, 2, 4,0), 3, 3, byrow=TRUE)
4. B <- array(data=c(2, 1, 3, 4, 1, 1, 0, 1, 0, 1, 2, 0,
1, 2, 1, 0, 1, 1), dim=c(6,3))
The row shown in the rectangular box can be extracted using following
code: B[3,]
(iii) A%*%B
[,1] [,2]
[1,] 1 1
[2,] 1 1
(iv) A-B
[,1] [,2]
[1,] 0 -1
[2,] -1 0
8. (i) t(A)
[,1] [,2]
[1,] 1 0
[2,] 0 1
(ii) dim(A)
2 2
(iii) rowSums(A)
1 1
(iv) det(A)
1
(v) diag(A)
1 1
(vi) [,1] [,2]
[1,] 2 2
(vii) cbind(A,B)
[,1] [,2] [,3] [,4]
[1,] 1 0 1 1
[2,] 0 1 1 1
(viii) rbind(A,B)
[,1] [,2]
[1,] 1 0
[2,] 0 1
[3,] 1 1
[4,] 1 1
9. See sec 2.7
UNIT 3
MEMBERSHIP TESTING, COERCION AND LISTS IN R
3.1 INTRODUCTION
In the previous two units (Units 1 and 2) of the MST-015 (Introduction to R
Software) course, you have learnt some important aspects of R programming, such
as the method of creating R objects, namely, vectors, matrices, factors and arrays.
Additionally, you have studied different types of operators, such as arithmetic
operators, relational operators and logical operators; and learnt the method of
using them on scalars and vectors. Moreover, with the help of previous unit,
you got familiar with two types of missing values, namely, NA and NaN.
The main objective of the present unit is to make you familiar with a number of
functions used for testing of membership of different R objects. Here, we shall
also discuss a few functions used for coercion of classes of different R objects.
Moreover, we shall discuss the method of creating a list and methods of
extraction of list components (or elements) and merging them. Lastly, different
attributes of R objects are explained in this unit.
Before studying this unit, we expect that you have studied Units 1 and 2 of
MST-015 thoroughly.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
In the previous unit, you have learnt that, a vector consisting of numeric values
will be of numeric type. The same can be verified using the testing function
is.numeric() as follows:
#Testing for numeric type
> is.numeric(x)
[1] TRUE
Since, the obtained output is TRUE, it confirms that the created vector x is of
numeric type. Let us next observe, what output other testing functions will
give, if we supply x as their argument as follows:
#Testing for integer type
> is.integer(x)
[1] FALSE
From the obtained outputs, you can observe that the output obtained from the
is.vector() testing function is TRUE and the outputs obtained from other
testing functions are FALSE. Thus, testing confirms that x is a vector object.
Note: After performing testing on a vector object, we next perform testing on
other objects, such as matrix, array, factor, data frame, list and time series,
one by one. We supply each of them as an argument to the testing functions and
observe the outputs.
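The testing functions applied to a vector can be sketched together as follows. The name x follows the text, but its values are not shown in this part of the unit, so the values below are assumed for illustration only.

```r
# Hypothetical numeric vector (values are assumed)
x <- c(1.5, 2, 3)

is.numeric(x)    # TRUE
is.integer(x)    # FALSE: the values are stored as double
is.character(x)  # FALSE
is.logical(x)    # FALSE
is.vector(x)     # TRUE

# An integer vector is numeric as well
xi <- 1:3
is.integer(xi)   # TRUE
is.numeric(xi)   # TRUE
```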
To perform membership testing on a matrix object, we create a matrix named
A of order 3x3, with elements 2, 3, 4, 1, 2, 1, 7, 8 and -1 (arranged
column-wise) as follows:
#Creating a matrix
> A<-matrix(c(2, 3, 4, 1, 2, 1, 7, 8, -1), ncol=3); A
[,1] [,2] [,3]
[1,] 2 1 7
[2,] 3 2 8
[3,] 4 1 -1
Note that the output obtained from the is.numeric() testing function is
TRUE and the outputs obtained from other testing functions are FALSE. So,
these testing functions confirm that the matrix A is of numeric type. Next, we
perform testing for different objects as follows:
#Testing for vector object
> is.vector(A)
[1] FALSE
From the obtained outputs, it is observed that, the output of the testing
function is.logical() is TRUE and the output of other testing functions is
FALSE. Hence, the testing confirms that the matrix B is of logical type. Next,
we test for different objects as follows:
#Testing for vector object
> is.vector(B)
[1] FALSE
From the obtained outputs, it is confirmed that, the object B is a matrix or array
object.
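A compact sketch of the matrix tests is given below. The matrix A is the one created above; the logical matrix B is not shown in this part of the text, so the B used here is a hypothetical stand-in.

```r
# Numeric matrix A from the text
A <- matrix(c(2, 3, 4, 1, 2, 1, 7, 8, -1), ncol = 3)

is.numeric(A)  # TRUE
is.vector(A)   # FALSE: A carries a dim attribute
is.matrix(A)   # TRUE
is.array(A)    # TRUE: a matrix is a two-dimensional array

# Hypothetical logical matrix (assumed, for illustration only)
B <- A > 2
is.logical(B)  # TRUE
is.matrix(B)   # TRUE
is.vector(B)   # FALSE
```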
Next, we perform testing on factor and character objects. To do so, we first
create a factor object named fac using the gl() function discussed in Unit 2
of MST-015. We also create a character vector named Blessed with
elements "BEST" and "WISHES" as follows:
#Creating a factor
> fac <- gl(3, 2); fac
[1] 1 1 2 2 3 3
Levels: 1 2 3
Next, we perform testing for factor and character vector, by supplying the fac
and Blessed objects as arguments to the following testing functions:
#Testing for factor
> is.factor(fac)
[1] TRUE
Along similar lines, testing for membership with other objects can be
performed.
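As a minimal sketch, the two objects described above can be created and tested as follows. The elements of Blessed are taken from the description in the text.

```r
# Factor with 3 levels, each repeated twice
fac <- gl(3, 2)

# Character vector described in the text
Blessed <- c("BEST", "WISHES")

is.factor(fac)         # TRUE
is.vector(fac)         # FALSE: a factor carries levels and class attributes
is.character(fac)      # FALSE

is.character(Blessed)  # TRUE
is.vector(Blessed)     # TRUE
is.factor(Blessed)     # FALSE
```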
Next, we perform testing on an array object and observe the obtained outputs.
To do so, we use a built-in data set, datasets::HairEyeColor. Note that
the HairEyeColor data is available in the datasets package, which is why
we have written datasets with ‘ :: ’ and the name of the data set. This data
consists of the distribution of hair color, eye color and sex in 592 statistics
students. For more detail on the data set, you can see the associated R
documentation page, by seeking help on this data set as follows:
#Seeking help of HairEyeColor data
> ?HairEyeColor
starting httpd help server ... done
Recall that a built-in data set of R can be used in the current working session
once its associated package has been loaded, using either the library() or
the require() function. Let us first load the datasets package and view the
HairEyeColor data as follows:
#Loading the datasets package
> require(datasets)
#Viewing the HairEyeColor data
> HairEyeColor
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
Next, we check its membership with R objects. To do so, we supply the name
of the data set to each of the testing functions as follows:
#Testing for vector object
> is.vector(HairEyeColor)
[1] FALSE
#Testing for matrix object
> is.matrix(HairEyeColor)
[1] FALSE
#Testing for array object
> is.array(HairEyeColor)
[1] TRUE
#Testing for list object
> is.list(HairEyeColor)
[1] FALSE
#Testing for data frame object
> is.data.frame(HairEyeColor)
[1] FALSE
Hence, from the obtained outputs it is confirmed that the HairEyeColor data
set is an array object.
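The reason is.matrix() returns FALSE here is that the data has three dimensions. A short sketch that inspects the dimensions and extracts one cell:

```r
# Built-in data set from the datasets package
hec <- datasets::HairEyeColor

is.array(hec)         # TRUE
is.matrix(hec)        # FALSE: a matrix has exactly two dimensions
dim(hec)              # 4 4 2 (Hair x Eye x Sex)
names(dimnames(hec))  # "Hair" "Eye" "Sex"

# Total number of students
sum(hec)              # 592

# Extracting one cell by dimension names
hec["Black", "Brown", "Male"]  # 32
```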
Lastly, we supply a list object and a data frame object as arguments to the
testing functions. The method of creating a list will be discussed later in this
unit, but for now, you must know that a list is created using the list()
function. Let us create a list named ylist with two components (or elements),
class and School, using the list() function as follows:
#Creating a list
> ylist<-list(class=c(1, 2, 3), School=c("X", "Y", "Z")); ylist
$class
[1] 1 2 3
$School
[1] "X" "Y" "Z"
Next, we view the internal structure of the ylist object using the str()
function as follows:
Next, we check its membership with R objects using the testing function as
follows:
#Testing for vector object
> is.vector(sleep)
[1] FALSE
#Testing for matrix object
> is.matrix(sleep)
[1] FALSE
#Testing for array object
> is.array(sleep)
[1] FALSE
#Testing for list object
> is.list(sleep)
[1] TRUE
#Testing for data frame object
> is.data.frame(sleep)
[1] TRUE
#Testing for time series object
> is.ts(sleep)
[1] FALSE
Similarly, testing for membership with other objects can be performed, and
it can be verified that the extra column of the data is of numeric type.
Note: When the membership testing functions for numeric, integer, character,
factor and logical are used on list or data frame objects, the output will be
FALSE, as such an object combines columns (in the case of data frames) or
components (in the case of lists) that may belong to different classes.
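The note above can be checked on the sleep data with a short sketch: type tests on the whole data frame return FALSE, while the same tests on individual columns reflect each column's own class.

```r
# Built-in data frame from the datasets package
sl <- datasets::sleep

is.list(sl)        # TRUE: a data frame is a list of columns
is.data.frame(sl)  # TRUE
is.numeric(sl)     # FALSE: the test applies to the whole object

# Testing the columns individually
sapply(sl, class)      # class of each column
is.numeric(sl$extra)   # TRUE
is.factor(sl$group)    # TRUE
```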
SAQ 1
Consider the factor object fac and a character vector Blessed created in
Section 3.2. Perform membership testing using all the discussed testing
functions and verify the fact that factors are not vectors.
3.3 COERCION FUNCTIONS
We may encounter a situation in which we would like to combine elements or
vectors of different classes under the same name. In such a situation, implicit
type conversion takes place. By implicit coercion, we mean that no specific
command is given by us to change the class or membership of an object.
Whenever implicit coercion takes place, it coerces a vector, matrix or
array in accordance with the highest precision among its elements. The coercion
rule can be viewed from the following figure.
Logical (lowest precision) < Integer < Numeric < Character (highest precision)
For the illustration purpose, let us create a vector by mixing a numeric element
with a character element. Then implicit coercion takes place and the output will
be a character vector, due to the higher precision of character than numeric.
#Mixing numeric and character elements
> c(1.7, "a")
[1] "1.7" "a"
For more clarification, let us create two vectors of different types, say a vector
n of numeric type and another vector s of character type with arbitrary
elements as follows:
#Creating a numeric vector
> n <- c(2, 3, 5)
Next, we combine them using the c() function to create a single vector as
follows:
#Concatenating two vectors of different types
> c(n, s)
[1] "2" "3" "5" "a" "b" "c"
From the obtained outputs, you can observe that implicit type conversion
takes place while binding the vectors row-wise and column-wise. Whether we
bind them row-wise or column-wise, the obtained output will be a character
matrix of suitable order, due to the higher precision of character than
numeric.
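The coercion order described above can be sketched step by step. The vector s is not shown in this part of the text; the values used below are assumed from the concatenation output given earlier.

```r
# Each mix is coerced to the type that is higher in the order
class(c(TRUE, 2L))    # "integer"
class(c(1L, 2.5))     # "numeric"
class(c(1.7, "a"))    # "character"

# Vectors from the text (the values of s are assumed)
n <- c(2, 3, 5)
s <- c("a", "b", "c")

# Binding row-wise and column-wise also coerces to character
R <- rbind(n, s)
C <- cbind(n, s)
dim(R)           # 2 3
dim(C)           # 3 2
is.character(R)  # TRUE
is.character(C)  # TRUE
```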
Next, we illustrate the explicit type of coercion. Note that, explicit coercion is
not done by the software. We give a coercion function command to change the
class or membership of an object to another. We are listing here some of the
most useful coercion functions with their objectives:
Coercion Function Objective
as.numeric() coerce an object to numeric.
as.integer() coerce an object to integer.
as.character() coerce an object to character.
as.factor() coerce an object to factor.
as.logical() coerce an object to logical.
as.vector() coerce an object to vector.
as.matrix() coerce an object to matrix.
as.array() coerce an object to array.
as.list() coerce an object to list.
as.data.frame() coerce an object to data frame.
as.ts() coerce an object to time series.
To illustrate how these coercion functions work, we shall take different types
of objects and coerce them into another type or class of object. Let us first
create an integer vector named x with elements 0 to 5 as follows:
#Creating an integer vector
> x<-0:5
After creating it, we now coerce it to a character vector using the coercion
function as.character() as follows:
#Coercing an integer vector to a character vector
> as.character(x)
[1] "0" "1" "2" "3" "4" "5"
Next, instead of taking vector objects, we shall take a matrix object for the
illustration purpose. Consider the following arbitrary matrix named A of order
2x4.
#Creating a matrix and assigning it to A
> A <- matrix(1:8, nrow=2, ncol=4); A
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
We next coerce this matrix object A to a data frame object using the
as.data.frame() function. We shall also overwrite the matrix A while coercing
as follows:
#Coercing matrix A to a data frame object and overwriting A
> A <- as.data.frame(A); A
V1 V2 V3 V4
1 1 3 5 7
2 2 4 6 8
From the above outputs, you can observe that we are getting the output as
NAs with a warning message, as this coercion is not possible. It also means
that the values after coercion are not available.
Note: The is.numeric() function tests the mode or membership, not the
class, but as.numeric() function coerces to the class.
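A few explicit coercions can be sketched together as follows. The character vector in the last step is a hypothetical example, chosen to show a coercion that is not possible and therefore yields NA with a warning.

```r
# Integer vector from the text coerced to character
x <- 0:5
as.character(x)                # "0" "1" "2" "3" "4" "5"

# Numeric to integer: the fractional part is dropped
as.integer(c(1.1, 0.1, -3.4))  # 1 0 -3

# Numeric to logical: zero becomes FALSE, non-zero becomes TRUE
as.logical(c(0, 1, 2))         # FALSE TRUE TRUE

# Character to numeric: "abc" cannot be coerced, so NA is returned
# (the warning is suppressed here only to keep the output clean)
z <- suppressWarnings(as.numeric(c("10", "abc")))
z                              # 10 NA
```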
Similarly, other explicit coercion functions can be explored by you. In the next
section, we discuss list object of R programming.
SAQ 2
Write the output of the following code:
as.data.frame(matrix(1:9, nrow=3, dimnames=
list(c("x1","x2","x3"),c("y1","y2","y3"))))
3.4 LISTS
List is an R object. It consists of an ordered collection of objects, which are
known as its components. In some situations, lists are very useful specifically,
when we are required to combine a collection of different types of objects
under the same name (so in lists various components which may be referred
as its elements need not be of the same type). As earlier discussed, this
facility is not available in the case of vectors, matrices and arrays. A list could
consist of the following objects as its components:
Numeric vectors/elements
Logical vectors/elements
Character vectors/elements
matrices
Data frames
Lists
Functions, to name a few.
In this section, we first define a list then discuss the method of creating a list
and the method of extraction of its components and specific element.
Additionally, we discuss the procedure of merging two or more lists. Let us
discuss each one of them one-by-one.
3.4.1 Creation of a List
A list object in R is created using the list() function. You should note that
the elements of a list (which can be referred as components of a list as well)
are always numbered. Also, with the help of these numbers, list components
as well as particular element(s) of a list component can be referred. We first
illustrate the method of creating a list of 4 components and name it Std.
These four components of Std consist of the details of two students, namely,
Deepika and Advait. Its first component consists of the names of the students,
the second component consists of the semester in which they are studying,
i.e., VI, and the third component consists of their roll numbers, 50 and 03.
Lastly, the fourth component displays the average marks of the students, i.e.,
80 and 89, respectively.
#Creating a list
> Std <- list(c("Deepika", "Advait"), "VI", c(50, 03), c(80, 89)); Std
[[1]]
[1] "Deepika" "Advait"
[[2]]
[1] "VI"
[[3]]
[1] 50 3
[[4]]
[1] 80 89
After creating Std, we next verify whether the created object is a list object or
any other object by using the testing function is.list() as follows:
#Testing for list
> is.list(Std)
[1] TRUE
Hence, the output confirms that the created object Std is a list object.
Next, we illustrate the method of extracting the jth element of the kth
component, i.e., Std[[k]][j]. To work on that, we extract the 2nd element of
the 1st component and 1st element of the 3rd component from Std as follows:
#Extracting the 2nd element of the 1st component
> Std[[1]][2]
[1] "Advait"
#Extracting the 1st element of the 3rd component
> Std[[3]][1]
[1] 50
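The difference between single and double square brackets is worth seeing side by side. A minimal sketch using the unnamed list Std from the text:

```r
# Unnamed list from the text
Std <- list(c("Deepika", "Advait"), "VI", c(50, 3), c(80, 89))

length(Std)         # 4 components

# [[k]] returns the component itself; [k] returns a list of length 1
Std[[1]]            # "Deepika" "Advait"
is.list(Std[[1]])   # FALSE
is.list(Std[1])     # TRUE

# jth element of the kth component: Std[[k]][j]
Std[[1]][2]         # "Advait"
Std[[4]][2]         # 89
```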
Next, we illustrate the method of extraction of list components using the names
of the list components and the ‘ $ ’ operator. To do so, we first give names to
the components of Std to make it more self-describing (as presently the list
components are not self-describing) as follows:
#Naming the list components
> Std <- list(Name=c("Deepika", "Advait"), Semester="VI",
Rollno=c(50,03), Marks=c(80,89)); Std
$Name
[1] "Deepika" "Advait"
$Semester
[1] "VI"
$Rollno
[1] 50 3
$Marks
[1] 80 89
From the obtained output, it is verified that each component of the list is
properly named. After naming the components, we next extract each
component of the created list Std one-by-one using components names and
‘ $ ’ operator as follows:
#Extracting the 1st component
> Std$Name
[1] "Deepika" "Advait"
#Extracting the 2nd component
> Std$Semester
[1] "VI"
#Extracting the 3rd component
> Std$Rollno
[1] 50 3
#Extracting the 4th component
> Std$Marks
[1] 80 89
> Std[["Semester"]]
[1] "VI"
> Std[["Rollno"]]
[1] 50 3
> Std[["Marks"]]
[1] 80 89
Note: It is important to note that, in the last two methods, we have only
discussed the procedure of extracting list components. The element(s) of the
extracted list components can be easily referred by appending ‘ […] ’ after ‘ $ ’
operator or ‘ [[…]] ’ operator as discussed earlier.
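The note above can be sketched as follows, using the named list Std from the text. The three extraction styles give the same component, and an element is then picked by appending an index.

```r
# Named list from the text
Std <- list(Name = c("Deepika", "Advait"), Semester = "VI",
            Rollno = c(50, 3), Marks = c(80, 89))

# Three equivalent ways of extracting the first component
identical(Std$Name, Std[["Name"]])  # TRUE
identical(Std$Name, Std[[1]])       # TRUE

# Elements of an extracted component
Std$Name[2]         # "Advait"
Std[["Marks"]][1]   # 80

# The extracted component behaves like an ordinary vector
mean(Std$Marks)     # 84.5
```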
[[2]]
[1] "a" "b" "c"
Consider another example in which we create a list with different objects (as
components) like vector, matrix, list and data frame and name it Lst as
follows:
#Creating a list with different objects
> data <- sleep
> Lst<-list(c(1986,2022), c("T", "K"), matrix(rep(1,4),ncol=2),
list("A", "P"), data)
> Lst
[[1]]
[1] 1986 2022
[[2]]
[1] "T" "K"
[[3]]
[,1] [,2]
[1,] 1 1
[2,] 1 1
[[4]]
[[4]][[1]]
[1] "A"
[[4]][[2]]
[1] "P"
[[5]]
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
Hence, a list with name Lst is created, whose 1st component is a numeric
vector, the 2nd component is a character vector, the 3rd component is a
numeric matrix of order 2x2, the 4th component is a list and the last component
is the built-in data frame sleep available in the datasets package.
SAQ 3
Consider the list named Lst created in Section 3.4. Extract its 2nd component
using all the three methods discussed in this section.
3.5 ATTRIBUTES OF OBJECTS
In this section, we shall discuss the following attributes of R objects:
names()
dimnames()
dimensions
class()
length()
We shall discuss each of these attributes one-by-one with the help of suitable
examples. Let us first discuss the names() function.
3.5.1 The names() Function
Names of the R objects can be set using the names() function available in
the base package. Setting names is very useful for writing self-describing
and readable code. When this function is used alone on an R object, it
returns the names of the R object. Note that the names() function accepts
different objects as its argument, such as vectors, matrices, lists and data
frames. Moreover, when the names() function is used with the assignment operator
‘ <- ’ and a character vector of up to the same length as the object, it sets
the names of the R object.
Vector argument supplied to names() function:
Consider the first example in which we assign the names to a vector object
pin consisting of the pin codes of different places.
#Creating a vector of pin codes
> pin <- c(110092, 110032, 201301, 122001, 302001); pin
[1] 110092 110032 201301 122001 302001
After creating a vector named pin, we now illustrate the method of naming
corresponding pin codes using the names() function and the assignment
operator ‘<-’ as follows:
#Naming the pin codes
> names(pin) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon", "Jaipur"); pin
Anand Vihar Shahdara Noida Gurgaon Jaipur
110092 110032 201301 122001 302001
The obtained output confirms that the names to the pin codes are successfully
assigned and now the elements are more self-describing.
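One practical benefit of a named vector is that elements can be extracted by name instead of by position. A minimal sketch using the pin vector from the text; the replacement name in the last step is hypothetical.

```r
# Named vector from the text
pin <- c(110092, 110032, 201301, 122001, 302001)
names(pin) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon", "Jaipur")

# Extraction by name keeps the name attached
pin["Noida"]

# Double brackets return the value only
noida <- pin[["Noida"]]
noida                        # 201301

# Several elements by name
pin[c("Jaipur", "Gurgaon")]

# Renaming a single element (the new name is a made-up example)
names(pin)[2] <- "Shahdara East"
```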
After setting the names, we next illustrate the method of getting the names of
an R object using the names() function. To do so, we simply supply the pin
vector as argument to the names() function as follows:
#Getting the names
> names(pin)
[1] "Anand Vihar" "Shahdara" "Noida" "Gurgaon"
"Jaipur"
Next, we assign names to each of its elements using the names() function.
#Assigning names to matrix elements
> names(P) <- c("Anand Vihar", "Shahdara", "Noida", "Gurgaon"); P
[,1] [,2]
[1,] 110092 201301
[2,] 110032 122001
attr(,"names")
[1] "Anand Vihar" "Shahdara" "Noida" "Gurgaon"
[[1]]
[1] 22200
[[2]]
[1] 23000
[[3]]
[1] 15010
[[4]]
[1] 10000
After creating the list, we next set the names of the components of the list
Lst2 using names() function as follows:
#Setting the names of the list components
> names(Lst2) <- c("Pooja", "Barkha", "Shrawanti", "Shivam"); Lst2
$Pooja
[1] 22200
$Barkha
[1] 23000
$Shrawanti
[1] 15010
$Shivam
[1] 10000
A data frame argument supplied to names() function:
Next, we supply a data frame object as an argument to the names() function
and illustrate the method of getting the names of a data frame object. Consider
the built-in data set sleep discussed in the beginning of this unit. Using the
membership testing function, we have already shown that the sleep data is a
data frame object. Let us now supply the sleep data as argument to the
names() function to get the names of the columns of the data frame as
follows:
#Getting the names of the columns of the sleep data
> names(sleep)
[1] "extra" "group" "ID"
The names of the sleep data can be overwritten using the names() function
on the same lines as discussed earlier.
There may be situations, in which we only want to set the row names or the
column names of a created matrix. In such situations, we use the
rownames() and colnames() functions, respectively and supply the created
After creating it we next set the names of the rows using the rownames()
function as follows:
#Naming the rows only
> rownames(MatE) <- c("R1","R2","R3"); MatE
[,1] [,2]
R1 2 8
R2 4 10
R3 6 12
Note that, in the obtained output, both row and column names appear, as the
names of the rows were already set by the previously used rownames()
function command. Since the column names are set after the row names, both the
row and column names appear in the output of the colnames() function command.
Next, we illustrate the method of setting names of the rows and columns of a
data frame. To do so, we consider the first three rows of the sleep data.
Assign the extracted data to MD and then set names of the rows and columns
using the dimnames() function as follows:
#Extracting and assigning the first 3 rows of the sleep data
> MD <- sleep[1:3,]; MD
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
Now we illustrate the method of setting the names of the rows and columns as
("R1", "R2", "R3") and ("C1", "C2", "C3") respectively of a data frame
object MD as follows:
The dimnames() function can also be used to get the names of the rows and
columns of a data frame. In that case, the obtained output will be a list object
whose first component consists of the names of the rows and whose second
component consists of the names of the columns. For the illustration
purpose, let us get the names of the rows and columns of the MD data frame as
follows:
#Getting the names of the rows and columns of MD data frame
> dimnames(MD)
[[1]]
[1] "R1" "R2" "R3"
[[2]]
[1] "C1" "C2" "C3"
Note: The names of the rows of a data frame can be extracted using the
row.names() function and the names of the columns of a data frame can be
extracted using the names() function.
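A minimal sketch of setting and getting the row and column names of the MD data frame is given below. It uses the row and column names mentioned in the text; the exact command used in the original is not shown here, so the assignment form below is one reasonable way of doing it.

```r
# First three rows of the sleep data
MD <- datasets::sleep[1:3, ]

# Setting row and column names together through a list
dimnames(MD) <- list(c("R1", "R2", "R3"), c("C1", "C2", "C3"))
MD

# Getting them back, together or separately
dimnames(MD)
row.names(MD)   # "R1" "R2" "R3"
names(MD)       # "C1" "C2" "C3"
```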
3.5.3 dimensions
The dimension of R objects like matrices and data frames can be obtained
using the dim() function. This function is already discussed in Unit 2 of
MST-015. In Unit 2, we supplied a matrix object as an argument to the dim()
function. Recall that this function returns the number of rows and columns of a
matrix. Similarly, we can supply a data frame object as its argument. For the
illustration purpose, let us supply a data frame object MD as its argument as
follows:
#Getting the dimensions of a data frame
> dim(MD) #As MD <- sleep[1:3,]
[1] 3 3
Note that, we are getting the output as 3 3. It means that the MD data frame
consists of 3 rows and 3 columns.
Note: Two separate functions, nrow() and ncol() can also be used on a
data frame (on the similar lines as on a matrix) to explicitly get the number of
rows and number of columns of a data frame object.
Note: The effect of the class (if necessary) can be removed temporarily by
using the unclass() function.
3.5.5 The length() Function
The length() function available in the base package is used to get or set
the length of vectors and factors. A list object can also be supplied as its
argument. Other objects on which its execution is defined can also be supplied
as argument to this function.
Let us now supply different objects one-by-one as argument to this function
and observe the obtained output.
#Supplying a vector argument
> length(c(-3, -2, -1, 0, 1, 2, 3, NA))
[1] 8
#Supplying a character vector as argument
> length(c("Aa", "Bb", "Cc", "Dd"))
[1] 4
#Supplying a factor argument
> length(factor(c(1, 1, 1, 2, 3, 2, 3, 1)))
[1] 8
#Supplying a list
> length(list(22200, 23000, 15010, 10000))
[1] 4
Hence, the length of y has been increased from 4 to 7. Next, we decrease the
length of the y vector to 2 using the length() function as follows:
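Setting the length of a vector can be sketched as follows. The values of y are not shown in this part of the text, so the vector below is an assumed example of length 4.

```r
# Hypothetical vector of length 4 (values are assumed)
y <- c(1, 2, 3, 4)
length(y)        # 4

# Increasing the length pads the vector with NA
length(y) <- 7
y                # 1 2 3 4 NA NA NA

# Decreasing the length drops the trailing elements
length(y) <- 2
y                # 1 2
```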
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[17] 17 18 19 20
SAQ 4
Consider the following student’s data:
Name Score
Deepika Sangwan 98
Shivam 97
Anupam Pandey 85
Anadi Vishist 90
Brijesh 85
Siddharth Tondon 82
Harshvardhan 98
Harshit 96
Shivani 85
Monalisa 97
3.6 SUMMARY
The main points discussed in this unit are as follows:
We have discussed several membership testing functions available in R.
Implicit and explicit type of coercion have been discussed.
Different explicit type coercion functions available in R have been
discussed.
The procedure of creating a list is discussed.
Different methods of extracting list components and elements are
discussed.
Different types of attributes of R objects have been discussed with
examples.
(vi) The effect of the class (if necessary) can be removed temporarily by
using the function unclass().
(vii) The length() function can’t be used to increase the length of an
already defined vector.
2. Fill in the blanks:
(i) The internal structure of an R object can be viewed using ……
function.
(ii) Testing for a data frame object is done using ………function.
(iii) The is.matrix() function has/have ……….function argument(s).
(iv) The is.list() function is used to test for ……..
(v) Mixing of character elements and integer elements in a vector
results……………..
(vi) Mixing of character elements and logical elements in a matrix
results……………..
(vii) The output of as.integer(c(1.1, 0.1, -3.4)) is ………
3. Create a vector named w with elements 1.1, 0.1, -3.4, 0.7, 1.8 and 2.2.
Test for its membership with the numeric type of vector and coerce it to
an integer vector.
4. What will be the output of the following R command:
matrix(c(1, FALSE, 0, TRUE),ncol=2)
5. Write R command to get the row names and column names of the
sleep{datasets} data in a single line command.
6. Consider the following data set:
Tree   S. No.   Age   Circumference
1      1        108   43
1      2        494   98
1      3        654   106
2      4        108   42
2      5        494   98
2      6        654   113
3      7        108   52
3      8        494   88
3      9        654   102
3.8 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. The output will be as follows:
Structure
4.1 Introduction The print() Function
4.1 INTRODUCTION
In Unit 2 of MST-015 (Introduction to R Software) course, you have learnt the
method of creating and using R objects, namely, vectors, matrices, arrays and
factors. In the same unit, you also got familiar with the arithmetic operators,
relational operators and logical operators. Thereafter, in Unit 3 you learnt the
method of creating a list object and extraction of its components or elements
of the components. In Unit 3 you have also studied a number of membership
testing and coercion functions.
In the beginning of this unit, we shall discuss the method of creating a data
frame, subsetting of a data frame, ordering, sorting and ranking functions. We
shall also discuss some commonly used functions in which data frame is
supplied as an argument to the function. Later in this unit, we shall make you
familiar with some formatting functions such as print(), paste(), paste0()
and cat(). For the data analysis purpose, it is important to know the way of
reading data from different file formats (such as .txt, .csv, .delim and .xlsx)
and writing the data to a file of a specific format. So, we discuss different
functions used for reading from and writing to a file. We
shall also discuss the commonly used date and time functions, namely,
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
In the given data there are four columns. The first three columns consist of the
names, genders and percentages of marks of the students. The fourth column
is of logical type, indicating whether the age of the student is more than 30 or
not. The rows of the given data are the sample units, sometimes referred to as
observations. Note that, the first column consisting of the names of the
students is of character type, the second column consisting of gender
information is a categorical variable of character type, the third column
consisting of the percentage of marks of the students is of numeric type and
the fourth column, i.e., age is of logical type as it consists of TRUE and FALSE.
Also, note that the given data consists of a missing value corresponding to the
percentage of marks of the student Pehu.
Next, we create a data frame of the given data using the data.frame()
function and name it Adm.data as follows:
#Creating and assigning a data frame
> Adm.data <- data.frame(
+ c("Shreyash","Prithu","Yuvaan","Advika","Pawan","Pehu"),
+ as.factor(c("Male", "Male", "Male", "Female", "Male",
"Female")),
+ c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
+ c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))
Now a data frame named Adm.data is created. Observe that while creating
this data frame, the gender variable is coerced to a factor using the coercion
function as.factor(). Additionally, the columns of the data frame have not
been given names by us. In Unit 3 of the MST-015 course, you have learnt to set
the names of the columns of a data frame using the names() function. Let us use
the names() function here to set the names of the columns of Adm.data as Name,
Gender, Percentage and AgeG30 as follows:
#Setting the column names
> names(Adm.data) <- c("Name","Gender","Percentage","AgeG30")
#Printing the data frame
> print(Adm.data)
Name Gender Percentage AgeG30
1 Shreyash Male 88.55 TRUE
2 Prithu Male 80.13 FALSE
3 Yuvaan Male 85.31 FALSE
4 Advika Female 75.22 FALSE
5 Pawan Male 65.04 TRUE
6 Pehu Female NA FALSE
Hence, the output confirms that the names to the columns of the data frame
are correctly assigned. After creating a data frame and assigning the names,
next we view the internal structure of the Adm.data data frame using the
str() function as follows:
#Internal structure of the data frame
> str(Adm.data)
'data.frame': 6 obs. of 4 variables:
$ Name : chr "Shreyash" "Prithu" "Yuvaan" "Advika" ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1
$ Percentage: num 88.5 80.1 85.3 75.2 65 ...
$ AgeG30 : logi TRUE FALSE FALSE FALSE TRUE FALSE
From the internal structure, it is clear that the obtained information consists of
the complete details on the 6 observations of 4 columns or variables (whose
names are written after the ‘ $ ’ operator). The output depicts that the Name
variable is a character variable (as it is specified by chr), the Gender variable
is a factor variable with two levels (as it is specified by Factor), the
Percentage variable is a numeric variable (as it is specified by num) and the
last variable AgeG30 is a logical variable (as it is specified by logi).
Note: For the sake of convenience the column names are set in the
Adm.data data frame, as these variable names facilitate column extraction
and enhance readability and reference.
Next, we discuss some commonly used functions which are used on data
frames:
Function Objective
Note: The str(), head() and tail() functions are available in the utils
package and other mentioned functions are available in the base package.
Recall that in Unit 3, you have already learnt the use of the names() and
row.names() functions. So, we do not discuss these functions in detail
again. To illustrate the use of these functions, we again consider the created
Adm.data data frame and supply it as an argument to these functions as
follows:
#Getting the number of rows of a data frame
> nrow(Adm.data)
[1] 6
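The other inspection functions mentioned above can be tried in the same way. The following is only a minimal sketch on a small hypothetical data frame (the name demo.df and its values are illustrative and are not part of the admission data):

```r
# Hypothetical data frame used only for illustration
demo.df <- data.frame(Name = c("A", "B", "C", "D"),
                      Score = c(88.5, 80.1, NA, 65.0))
n.rows <- nrow(demo.df)      # number of rows
n.cols <- ncol(demo.df)      # number of columns
dims   <- dim(demo.df)       # rows and columns together
first2 <- head(demo.df, 2)   # first two rows
last2  <- tail(demo.df, 2)   # last two rows
print(dims)
print(first2)
print(last2)
```

Note that head() and tail() return data frames, so their results can be subset again in the usual way.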
Note that when you run the data() command, you will be able to view all
the data sets of the packages which are already loaded in the working environment;
by default, you will view the data sets available in the datasets
package. Moreover, the data() function can also be used to load a specific
data set. To do so, either load the package first using the require() or
library() function, or write the data() command with the package name.
For the illustration purpose we now view
the data sets available in the datasets and MASS libraries together as follows:
#Viewing the data sets available in the datasets and MASS
#libraries
> library("MASS")
> data()
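The second route mentioned above, naming the package inside data() instead of attaching it first, can be sketched as follows. This is an illustrative sketch that assumes the MASS package is installed; the object name ds is hypothetical.

```r
# List the data sets shipped with one package without attaching it
# (assumes the MASS package is installed)
ds <- data(package = "MASS")
print(head(ds$results[, "Item"]))

# Load a single data set from that package into the workspace
data("ships", package = "MASS")
str(ships)
```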
Note that, a blank space at the column indices place is just left to indicate that
all the columns need to be selected. Also, the rows will be extracted in the
same order in which they are written. See for example
#Extracting rows in different order
> Adm.data[c(6,2,5), ]
Name Gender Percentage AgeG30
6 Pehu Female NA FALSE
2 Prithu Male 80.13 FALSE
5 Pawan Male 65.04 TRUE
df.name[c(i,j,k), ]
Extracts the ith , jth and kth rows, while keeping or selecting all
the columns of a data frame.
df.name[ ,1:m]
Extracts the first m columns, while selecting all the rows of a
data frame.
df.name[ ,c(i,j)]
Extracts the ith and jth columns, while selecting all the rows
of a data frame.
On similar lines, the rows, columns or a subpart of a data
frame can also be selected using logical conditions. For example:
df.name[(m>n),c(i,j)]
Extracts the ith and jth columns of those rows of a data frame
which satisfy the logical condition (m>n).
In addition to all these, particular rows and columns can also be
dropped by writing the negative sign in front of the row and column indices.
For example, the 4th column of the data frame can be dropped as follows:
df.name[ ,-4]
Drops the 4th column of a data frame, while keeping all the
rows of a data frame.
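The subsetting forms listed above can be tried on a small example. The sketch below uses a hypothetical data frame (df.demo and its columns x, y and z are illustrative names only):

```r
# Hypothetical data frame used only for illustration
df.demo <- data.frame(x = c(5, 12, 8, 20),
                      y = c("a", "b", "c", "d"),
                      z = c(TRUE, FALSE, TRUE, TRUE))
# Rows where x exceeds 7, keeping the 1st and 2nd columns
sub1 <- df.demo[df.demo$x > 7, c(1, 2)]
# Drop the 3rd column, keep all the rows
sub2 <- df.demo[ , -3]
# Drop the 1st and 4th rows, keep all the columns
sub3 <- df.demo[-c(1, 4), ]
print(sub1)
print(sub2)
print(sub3)
```

Observe that the logical condition is written at the row indices place, so it decides which rows are kept, while the column indices decide which columns are shown.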
Also, we can have a look on the data frame by simply writing the name of the
data frame on the R console. Let us display first few rows of the data frame as
follows:
To illustrate subsetting, we now extract the first six rows of the 2nd and 4th
columns of the USArrests data frame by writing row and column indices of
the data frame as follows:
#Extracting subpart of a data frame
> USArrests[1:6, c(2,4)]
Assault Rape
Alabama 236 21.2
Alaska 263 44.5
Arizona 294 31.0
Arkansas 190 19.5
California 276 40.6
Colorado 204 38.7
Next, we extract the 2nd column of the USArrests data frame.
#Extracting 2nd column of the data frame
> USArrests[ ,2] #Or USArrests$Assault
[1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113
[15] 56 115 109 249 83 300 149 255 72 259 178 109 102 252
[29] 57 159 285 254 337 45 120 151 159 106 174 279 86 188
[43] 201 120 48 156 145 81 53 161
The element in the 3rd row and 2nd column of the USArrests data frame (the
Assault value for Arizona) can be extracted using the following statement.
#Extracting a particular element
> USArrests[3,2]
[1] 294
Lastly, we illustrate the method of extraction of all those rows of a data frame,
for which either the Assault variable is more than 250 or the Murder
variable is more than 16 and select only first three columns as follows:
#Extracting using logical condition
> USArrests[USArrests$Assault>250|USArrests$Murder>16, 1:3]
Murder Assault UrbanPop
Alaska 10.0 263 48
Arizona 8.1 294 80
California 9.0 276 91
After attaching the data frame now we can access the columns of the
USArrests data frame by using the column names only as follows:
#Accessing the Murder and Assault variables
> Murder #Otherwise USArrests$Murder
[1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3
[12] 2.6 10.4 7.2 2.2 6.0 9.7 15.4 2.1 11.3 4.4 12.1
[23] 2.7 16.1 9.0 6.0 4.3 12.2 2.1 7.4 11.4 11.1 13.0
[34] 0.8 7.3 6.6 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2
[45] 2.2 8.5 4.0 5.7 2.6 6.8
> Assault #Otherwise USArrests$Assault
[1] 236 263 294 190 276 204 110 238 335 211 46 120 249 113
[15] 56 115 109 249 83 300 149 255 72 259 178 109 102 252
[29] 57 159 285 254 337 45 120 151 159 106 174 279 86 188
[43] 201 120 48 156 145 81 53 161
It is users’ responsibility to always detach the attached data frame after the
work is over. Whenever data frame is detached, its columns cannot be
accessed just by writing the column names. For the illustration purpose, after
detaching the data frame, we now try to access its columns by their names
and see what we get.
#Accessing columns after detaching the data frame
> Murder
Error: object 'Murder' not found
> Assault
Error: object 'Assault' not found
Thus, after detaching the data frame, we will not be able to access the variable
by simply writing the column names. In the next subsection we discuss about
ordering, sorting and ranking functions.
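The attach and detach cycle described above can be sketched as follows. The with() function shown at the end is an alternative from the base package that is not discussed in this unit; it is included only as an assumption about what a reader may find convenient. The object names mean.attached and mean.with are illustrative.

```r
attach(USArrests)                 # columns become visible by name
mean.attached <- mean(Murder)
detach(USArrests)                 # always undo the attach

# with() evaluates an expression inside the data frame
# without changing the search path
mean.with <- with(USArrests, mean(Murder))
print(c(mean.attached, mean.with))
```

Both approaches give the same value; with() simply avoids the need to remember the detach step.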
Next, we sort the data frame according to the Murder variable of data. To do
so, we first compute the order of the Murder variable using the order()
function. Additionally, we also append the computed orders to data using the
‘ $ ’ operator and named it as OrderM as follows:
#Computing the order of the Murder column and appending it to
#data
> data$OrderM <- order(data$Murder)
> print(data)
Murder Assault UrbanPop Rape OrderM
Alabama 13.2 236 58 21.2 7
Alaska 10.0 263 48 44.5 8
Arizona 8.1 294 80 31.0 6
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Colorado 7.9 204 78 38.7 5
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 1
Florida 15.4 335 80 31.9 9
Georgia 17.4 211 60 25.8 10
Note that, one more column (fifth column) named OrderM, consisting of the
order of the rows, is now appended to data. Also, by default this function gives
the output in ascending order (which means rows will be arranged in
ascending order of the Murder variable). We now try to understand the obtained
output. The values shown under the OrderM variable indicate that the smallest
value of the Murder variable, i.e., 3.3 is present in the 7th row (as OrderM[1]=7) of
the data frame data and the largest value of the Murder variable, i.e., 17.4 is
present in the 10th row (as OrderM[10]=10) of the data frame. On similar
lines, the other elements of the OrderM column can be inferred.
Next, we sort the rows of the data frame data according to the Murder
variable, using the computed OrderM column at row indices place of the data
frame data as follows:
#Sorting data according to Murder variable
> data[data$OrderM, ] #Arranging rows according to OrderM
Murder Assault UrbanPop Rape OrderM
Connecticut 3.3 110 77 11.1 2
From the obtained output we observe that all the rows of the data are now
rearranged according to the Murder variable of data, due to the OrderM
variable. Hence the data frame data is sorted according to the Murder
variable. On the similar lines the data can be sorted according to any column
of the data frame.
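The sorting step above can also be written without storing the order in a column. The following is a minimal sketch; the object names d10, asc, desc and by.two are illustrative, and the decreasing argument and the two-key sort are standard features of order() that go slightly beyond what is shown above.

```r
d10 <- USArrests[1:10, ]
# Ascending sort by Murder
asc <- d10[order(d10$Murder), ]
# Descending sort by Murder
desc <- d10[order(d10$Murder, decreasing = TRUE), ]
# Sort by UrbanPop, breaking ties by Assault
by.two <- d10[order(d10$UrbanPop, d10$Assault), ]
print(rownames(asc))
print(rownames(desc))
print(rownames(by.two))
```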
Next, we discuss the rank() and sort() functions. The ranks of the
elements of any column of the data frame or a vector, can be obtained using
the rank() function, just by supplying the column as an argument to the
function. Additionally, a column of a data frame can be sorted using the
sort() function. Consider the data frame data again.
#Extracting data
> data <- USArrests[1:10,]
For the comparison purpose, we first append the order of the Assault
variable with name AsO to data.
#Appending computed order of the Assault variable
> data$AsO <- order(data$Assault)
Observe from the obtained output that the AsO, AsR and AsS columns
consist of the order, the ranks and the sorted values of the Assault variable, respectively.
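To see the three columns side by side, the comparison can be sketched as follows. The column names AsO, AsR and AsS follow the text, while the data frame name d10 is illustrative only.

```r
d10 <- USArrests[1:10, ]
d10$AsO <- order(d10$Assault)  # row positions that would sort Assault
d10$AsR <- rank(d10$Assault)   # rank of each value within the column
d10$AsS <- sort(d10$Assault)   # the sorted values themselves
print(d10[ , c("Assault", "AsO", "AsR", "AsS")])
```

The order() function answers "which row comes first, second, ...", whereas the rank() function answers "which position does this row take". Indexing Assault by AsO therefore reproduces AsS.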
For the illustration purpose we consider the ships data set available in the
MASS package. We first seek help on the data frame and show some of its
rows as follows:
#Loading MASS package
> library(MASS)
#Seeking help
> ?ships
The details on the ships data set can be read from the R Documentation
page. Next, we have a look on the data frame now:
$D
type year period service incidents
25 D 60 60 251 0
26 D 60 75 105 0
27 D 65 60 288 0
...
$E
type year period service incidents
33 E 60 60 45 0
34 E 60 75 0 0
35 E 65 60 789 7
...
Note that, a specific group can also be extracted by explicitly using the logical
condition. For the illustration purpose, we now extract the rows of the ships
data frame corresponding to the type C as follows:
#Extraction of rows where type=C
> ships[ships$type=="C",]
type year period service incidents
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
20 C 65 75 676 1
21 C 70 60 783 6
22 C 70 75 1948 2
23 C 75 60 0 0
24 C 75 75 274 1
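A grouped output of the kind shown above (one block per level, headed by $D, $E and so on) is what the split() function of the base package produces. The sketch below is an assumption about how such groups can be formed and compares one group with the logical-condition approach; the data frame fleet and its values are hypothetical.

```r
# Hypothetical data frame used only for illustration
fleet <- data.frame(type = factor(c("A", "B", "A", "C", "B")),
                    incidents = c(0, 3, 1, 6, 2))
groups <- split(fleet, fleet$type)   # list with one data frame per level
print(names(groups))
print(groups$B)
# Same rows as groups$B, obtained with a logical condition
same.B <- fleet[fleet$type == "B", ]
print(same.B)
```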
SAQ 1
Consider the admission data discussed in Section 4.2 and create a data
frame consisting of the admission data. After creating the data frame do the
following tasks:
(i) Observe what the output will look like if we do not set the names of the
columns.
(ii) Set suitable row and column names of the data frame in a single
command.
(iii) Sort the data frame according to the percentage variable.
From these internal structures, observe that the only difference between the
paste() and paste0() functions is of the sep argument. For the illustration
purpose, we now concatenate term-by-term first 10 upper-case letters of the
Roman alphabet with 1 to 10 numbers using both the functions as follows:
Note that, the difference between the two outputs is due to the sep argument
only. If we write the sep argument as sep="" (without an empty space) then
we get the same output as of paste0() function, see for example:
#Alternative approach to paste0() function
> paste(LETTERS[1:10], 1:10, sep="")
[1] "A1" "B2" "C3" "D4" "E5" "F6" "G7" "H8" "I9"
[10] "J10"
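The relationship between the two functions can be checked directly on a shorter vector. This is a minimal sketch; the object names p1, p0 and p2 are illustrative.

```r
p1 <- paste(LETTERS[1:3], 1:3)             # default sep = " "
p0 <- paste0(LETTERS[1:3], 1:3)            # no separator
p2 <- paste(LETTERS[1:3], 1:3, sep = "")   # same as paste0()
print(p1)
print(p0)
print(identical(p0, p2))
```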
Next, we illustrate the use of the sep and collapse arguments, so that the
difference between the two can be clearly understood. To do so, we
concatenate the earlier two vectors term-by-term using the sep and
collapse arguments as follows:
#Concatenating two vectors term-by-term
> paste(LETTERS[1:10], 1:10, sep="$", collapse=", ")
[1] "A$1, B$2, C$3, D$4, E$5, F$6, G$7, H$8, I$9, J$10"
From the output, observe that the terms are separated using the collapse
argument and elements of the vectors are separated using the sep argument.
Next, we illustrate the use of the recycle0 argument whose default value
is FALSE.
#If recycle0 is FALSE
> paste("Use of recycle0", vector(mode="character",length=0),
recycle0=FALSE)
[1] "Use of recycle0 "
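For comparison, the following sketch places the sep, collapse and recycle0 arguments next to each other. It assumes R version 4.0.1 or later, in which the recycle0 argument is available; the object names are illustrative.

```r
v <- c("a", "b", "c")
s1 <- paste(v, 1:3, sep = "-")                   # three strings
s2 <- paste(v, 1:3, sep = "-", collapse = "+")   # one string
# recycle0 needs R 4.0.1 or later
r.false <- paste("Use of recycle0", character(0), recycle0 = FALSE)
r.true  <- paste("Use of recycle0", character(0), recycle0 = TRUE)
print(s1)
print(s2)
print(r.false)
print(r.true)
```

With recycle0=FALSE the zero-length vector is treated as an empty string, whereas with recycle0=TRUE the zero-length argument makes the whole result zero-length.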
But this output is not explicitly showing, what these numeric digits are
representing.
The paste() function can be used to print the detailed output and the output
will be printed in double quotes ( " " ) as follows:
#Printing given information using paste() function
> paste("Monthly salary of Pawan is", Pawan, "Rs.")
[1] "Monthly salary of Pawan is 75000 Rs."
> paste("Monthly salary of Advait is", Advait, "Rs.")
[1] "Monthly salary of Advait is 65000 Rs."
Due to the absence of the new line character ‘ \n ’, the outputs are coming in
continuation. This shows the importance of the new line character ‘ \n ’.
The occurrence of the new line character can be controlled using the fill
argument of the function. Recall that, by default, fill=FALSE. So to add a new
line character at the end of each statement we can simply write fill=TRUE
as follows:
#Using fill argument of the cat() function
> cat("Monthly salary of Pawan is", Pawan, "Rs.", fill=TRUE)
Monthly salary of Pawan is 75000 Rs.
> cat("Monthly salary of Advait is", Advait, "Rs.", fill=TRUE)
Monthly salary of Advait is 65000 Rs.
Note: A tab character ‘ \t ’ is used to give a horizontal tab space and a new
line character ‘ \n ’ is used for new line.
Next, we illustrate the use of the sep argument of the cat() function to get
modified outputs using it as follows:
#Using separator as blank space
> cat("ABC", "abc", sep=" ", fill=TRUE)
ABC abc
#Using separator as comma
> cat("ABC", "abc", sep=",", fill=TRUE)
ABC,abc
#Using separator as new line character
> cat("ABC", "abc", sep="\n", fill=TRUE)
ABC
abc
#Using separator as tab character
> cat("ABC", "abc", sep="\t", fill=TRUE)
ABC abc
From the obtained outputs observe that the terms of the output are separated
by the character specified by the sep argument of the cat() function.
Moreover, note that the paste() function can be used as argument of the
cat() function for getting further formatted output as follows:
#Printing using cat() function only
> cat(letters[1:3], 1:3, sep=",", "\n")
a,b,c,1,2,3,
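The combination of the paste() and cat() functions mentioned above can be sketched as follows. The separators used here are arbitrary choices for illustration, and capture.output() is used only to inspect what cat() writes.

```r
# paste() builds each term, cat() prints the terms without quotes
terms <- paste(letters[1:3], 1:3, sep = "=")
cat(terms, sep = ", ", fill = TRUE)
# Capture what cat() writes so that it can be inspected
out <- capture.output(cat(terms, sep = ", "))
print(out)
```

Here paste() pairs each letter with its number, and cat() then joins the pairs; this keeps the pairs together, unlike the cat()-only command above.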
SAQ 2
Write an R command to get the following output:
a##1$, b##3$, c##5$, d##9$
Observe from the screenshot that the location of the .txt file shown in the
image is not same as of our current directory. So, we first set up the path of
the working directory using the setwd() function as follows:
#Setting the path of the working directory
> path="C:/Users/Taruna Kumari/Desktop/Introduction to R Software"
> setwd(path)
Note: An alternative approach to do this is that, we specify the path of the file
while reading it, which will be illustrated soon.
After setting the working directory, we verify whether the working directory is
properly set or not using getwd() again as follows:
#Verify the working directory
> getwd()
[1] "C:/Users/Taruna Kumari/Desktop/Introduction to R Software"
Hence, it is verified that the working directory is properly set. Also note that,
the .txt file named “TKfile1” consists of the following data.
After setting the working directory, next we read the .txt file using the
read.table() function by supplying the name of the file as a character
string with proper extension in the following manner:
#Reading the data from a .txt file
> read.table("TKfile1.txt")
V1 V2 V3 V4
1 x y w z
2 13.2 8.2 11 August
3 12.1 3.1 12 December
4 14.8 6.1 13 June
5 14.2 7.2 10 July
Note: If required the read data can be named using an assignment operator.
From the obtained output it can be noted that by default row names (1 to 5) and
column names (V1, V2, V3 and V4) are shown in the output, but the column
names were x, y, w and z. So, to read the first line of the file as column names
(as header), we assign the header argument (whose default value is FALSE)
of the read.table() function as TRUE in the following manner:
#Reading the data from a .txt file using header argument
> read.table("TKfile1.txt", header=TRUE)
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July
Hence, the column names are now read and default column names are now
replaced with the original ones. Additionally, as the row.names argument of
this function is missing therefore default numbering is given as row numbers.
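The effect of the header argument can be reproduced without depending on a particular folder. The sketch below writes a small hypothetical file to a temporary location (the file contents and object names are illustrative, not the actual TKfile1.txt) and reads it both ways:

```r
# Write a small hypothetical file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("x y w z",
             "13.2 8.2 11 August",
             "12.1 3.1 12 December"), tmp)
no.header   <- read.table(tmp)                 # header = FALSE by default
with.header <- read.table(tmp, header = TRUE)  # first line becomes names
str(no.header)
str(with.header)
unlink(tmp)                                    # remove the temporary file
```

Observe from the str() outputs that without header=TRUE the first line is read as data, so even the numeric columns are no longer read as numeric.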
Note that the read.table(), read.csv() and read.delim() functions
have two more important arguments with different default values, namely, sep
and dec. The sep argument is used to specify the character which separates the
elements of the data and the dec argument is used to specify the character used
as the decimal point. See
the following help page for more clarification:
The same file can be read using the read.table() function as well but by
changing the default value of the dec argument as follows:
#Reading data from a file
> read.table("TKfile2.txt", header=TRUE, dec=",")
x y w z
1 13.2 8.2 11 August
2 12.1 3.1 12 December
3 14.8 6.1 13 June
4 14.2 7.2 10 July
Or otherwise, the file will be read incorrectly and we will get the following
output (as by default dec=".").
> read.table("TKfile2.txt", header=TRUE)
x y w z
1 13,2 8,2 11 August
2 12,1 3,1 12 December
3 14,8 6,1 13 June
4 14,2 7,2 10 July
For more clarification on the dec argument of the function, we create one
more .txt file named TKfile3.txt in the following manner:
Observe that in the TKfile3.txt file the terms are separated using the ‘ , ’
and decimal point is ‘ . ’. So, here it would be better to read this file using the
read.table() function by specifying the dec and sep arguments accordingly
as follows:
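The role of the sep and dec arguments can be checked with two small temporary files. This is only a sketch under stated assumptions: the file contents below are hypothetical and are not the actual TKfile2.txt and TKfile3.txt files.

```r
# Hypothetical file with a decimal comma and blank-separated terms
tmp.comma <- tempfile(fileext = ".txt")
writeLines(c("x y", "13,2 8,2", "12,1 3,1"), tmp.comma)
wrong <- read.table(tmp.comma, header = TRUE)             # dec = "." by default
right <- read.table(tmp.comma, header = TRUE, dec = ",")

# Hypothetical file with comma-separated terms and a decimal point
tmp.csv <- tempfile(fileext = ".txt")
writeLines(c("x,y", "13.2,8.2", "12.1,3.1"), tmp.csv)
csv <- read.table(tmp.csv, header = TRUE, sep = ",", dec = ".")
str(wrong)
str(right)
str(csv)
unlink(c(tmp.comma, tmp.csv))                             # clean up
```

When the dec argument does not match the file, the values are not read as numbers, which is why the columns of wrong are not numeric.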
This package can read, write and manipulate both Excel 97–2003 and Excel
2007/10 spreadsheets. The readWorksheetFromFile() function is used to
read an Excel file and the writeWorksheetToFile() function is used to
write to an Excel file. For the illustration purpose we have created an Excel file
named TKfile5.xlsx in the working directory.
We first set the working directory as the path from where the file is to be read
(if not set earlier), then we read both the sheets of the TKfile5.xlsx file, one-
by-one using the readWorksheetFromFile() function and assign them to
df.one and df.two as follows:
#Setting the current working directory
> setwd("C:/Users/Taruna Kumari/Desktop/Introduction to R Software")
From the above outputs observe that the sheet argument of the function is
assigned as 1 to read sheet number 1 and assigned as 2 to read sheet
number 2. Moreover, the header argument is assigned as TRUE, to read the
first line of the file as its header.
Or otherwise, if you do not want to change the working directory and directly
want to read the file from the location where it is saved, you can simply use
the following commands to read both the sheets as follows:
#Assigning the location of the file in path
> path <- "C:/Users/Taruna Kumari/Desktop/Introduction to R Software/TKfile5.xlsx"
Note: Specific rows and columns from a .xlsx file can be read using the
startRow, endRow, startCol and endCol arguments of the
readWorksheetFromFile() function. The usage of each of these
arguments is self-explanatory.
SAQ 3
Create a .txt file of the admission data discussed in Section 4.2 (Adm.data)
and write an R command to read it.
package. For the illustration purpose we now write the first 6 rows of the
trees data set available in the datasets package to .txt and .csv files.
Note: The write.table() and write.csv() functions also have
arguments such as sep, dec, row.names (with default value TRUE) and
col.names (with default value TRUE). These arguments can be used on the
same lines as discussed earlier.
#Writing first 6 rows of the trees data to .txt file
> write.table(trees[1:6,], "Trees1.txt")
#Writing first 6 rows of the trees data to .txt file using sep
#and dec arguments
> write.table(trees[1:6,], "Trees3.txt", sep=",", dec=".")
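Whether a file has been written as intended can also be checked from within R by reading it back. The following is a minimal sketch that writes to temporary files instead of the working directory; the object names are illustrative, and row.names=FALSE is used in the .csv case only to show the effect of that argument.

```r
out.txt <- tempfile(fileext = ".txt")
out.csv <- tempfile(fileext = ".csv")
write.table(trees[1:6, ], out.txt)                     # blank-separated
write.csv(trees[1:6, ], out.csv, row.names = FALSE)    # comma-separated
back.txt <- read.table(out.txt, header = TRUE)         # read the files back
back.csv <- read.csv(out.csv)
print(back.txt)
print(back.csv)
unlink(c(out.txt, out.csv))                            # clean up
```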
Hence, the screenshot confirms that the first six rows of the trees data are
properly written in the files named Trees1.txt, Trees2.csv and Trees3.txt,
according to the written commands. Next, we shall use the
writeWorksheetToFile() function available in the XLConnect package (see footnote 2)
to write the data in an Excel file.
#Loading the package
> library("XLConnect")
#Setting the path of the file
> path <- "C:/Users/Taruna Kumari/Desktop/Introduction to R Software/Trees4.xlsx"
#Writing data to first sheet of .xlsx format file
> writeWorksheetToFile(path, data=trees[1:6,], sheet="FirstSheet")
2 https://CRAN.R-project.org/package=XLConnect
The saved data can be viewed by opening the excel file as follows:
The location of the file can be seen by viewing the properties of the file.
SAQ 4
Write R statements to write the USArrests data set in .csv, .txt and .xlsx
files.
4.6 DATES AND TIMES
In R a number of functions are available to deal with date and time data. In this
section, we shall discuss the as.Date(), ISOdatetime(), as.POSIXlt()
and as.POSIXct() functions to deal with time data.
The as.Date() function available in the base package is used to convert a
character string representation of date (a calendar date) to an object of Date
class. But this function does not handle times. We first take help on this
function as follows:
#Seeking help
> ?as.Date
starting httpd help server ... done
It would be interesting to view the internal structure of a Date object. See for
example:
#Checking internal structure of date object
> str(as.Date(c("2023/08/13", "2023/10/31")))
Date[1:2], format: "2023-08-13" "2023-10-31"
From this output it is clear that the object supplied to the str() function is a
Date object.
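When a date string is not written in the year-month-day form, the format argument of the as.Date() function can be used to describe its layout. The following sketch uses illustrative strings and object names:

```r
d1 <- as.Date("2023-08-13")                          # standard format
d2 <- as.Date("13/08/2023", format = "%d/%m/%Y")     # day/month/year
d3 <- as.Date("08.13.23",  format = "%m.%d.%y")      # two-digit year
print(c(d1, d2, d3))
print(format(d1, "%d-%m-%Y"))                        # back to a string
```

All three strings describe the same calendar date, so the three Date objects are equal.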
Next we discuss a function named difftime() available in base package,
used to compute the difference in time units such as "auto", "secs",
"mins", "hours", "days" and "weeks". For the illustration purpose
consider the following commands in which we compute the difference in two
different time objects in different units of time.
#Assigning two Date objects time1 and time2
> time1 <- as.Date("2023/08/13")
> time2 <- as.Date("2023/10/31")
> difftime(time2, time1, units="days")
Time difference of 79 days
> difftime(time2, time1, units="hours")
Time difference of 1896 hours
> difftime(time2, time1, units="weeks")
Time difference of 11.28571 weeks
> difftime(time2, time1, units="mins")
Time difference of 113760 mins
> difftime(time2, time1, units="secs")
Time difference of 6825600 secs
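The object returned by the difftime() function carries its unit along with the value. If only the number is required for further computation, it can be extracted with as.numeric(), as in the following sketch (the object names are illustrative):

```r
time1 <- as.Date("2023/08/13")
time2 <- as.Date("2023/10/31")
gap <- difftime(time2, time1, units = "days")
gap.days  <- as.numeric(gap)                    # plain number of days
gap.weeks <- as.numeric(gap, units = "weeks")   # same gap in weeks
gap.minus <- time2 - time1                      # subtraction also gives difftime
print(gap.days)
print(gap.weeks)
print(class(gap.minus))
```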
There are some functions which extract the weekdays, months, quarters and
number of days since some origin of a date (or POSIXt) object. The
weekdays() function is used to get weekdays, the months() function is
used to get months, the quarters() function is used to get quarters and the
julian() function is used to get the number of days since some origin of a
Date object. All these functions come as part of the base package. See the
following illustration:
#Getting weekdays
> weekdays(as.Date(c("2023/08/13", "2023/10/31")))
[1] "Sunday" "Tuesday"
#Getting months
> months(as.Date(c("2023/08/13", "2023/10/31")))
[1] "August" "October"
#Getting quarters
> quarters(as.Date(c("2023/08/13", "2023/10/31")))
[1] "Q3" "Q4"
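The julian() function mentioned above can be sketched in the same manner. By default the origin is January 1, 1970; the second call uses January 1, 2023 as an arbitrary origin chosen only for illustration.

```r
d <- as.Date(c("2023/08/13", "2023/10/31"))
j.default <- julian(d)                                   # days since 1970-01-01
j.custom  <- julian(d, origin = as.Date("2023-01-01"))   # days since a chosen origin
print(j.default)
print(j.custom)
```

The origin used is attached to the result as an attribute, so it is shown along with the printed values.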
In addition to this the length and along arguments of the seq() function
can also be used on the same lines as discussed in the Unit 2 of MST-015
course. For the illustration purpose we now generate a date sequence using
the length argument together with from and to arguments of the seq()
function and assign it to x as follows:
#Generating date sequence using length argument
> x <- seq(from=as.Date("2021/08/10"), to=as.Date("2023/08/13"), length=5); x
[1] "2021-08-10" "2022-02-09" "2022-08-11" "2023-02-10"
[5] "2023-08-13"
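Date sequences can also be generated in calendar steps using the by argument of the seq() function, either alone with a length or together with the to or along arguments. The following is a minimal sketch with illustrative dates and object names:

```r
start <- as.Date("2021/08/10")
by.month <- seq(from = start, by = "month", length.out = 4)    # monthly steps
by.week  <- seq(from = start, to = as.Date("2021/09/07"), by = "week")
along    <- seq(from = start, by = "day", along.with = c("a", "b", "c"))
print(by.month)
print(by.week)
print(along)
```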
The following R Documentation page will pop-up, when we take help on the
ISOdatetime() function.
The year, month, day, hour, min and sec arguments can be interpreted
literally. The tz argument is left empty to get the current time zone, otherwise it
can be "GMT", which is UTC (Coordinated Universal Time). For the illustration
purpose we now create an arbitrary date and time object using the
ISOdatetime() function as follows:
#Creating a time object
> ISOdatetime(year=2021,month=8,day=10,hour=11,min=10,sec=5,tz="")
[1] "2021-08-10 11:10:05 IST"
Next, we see the internal structure of the date and time object using the str()
function as follows:
#Internal structure of the object created by the ISOdatetime() function
> str(ISOdatetime(year=2021,month=8,day=10,hour=11,min=10,sec=5,tz=""))
POSIXct[1:1], format: "2021-08-10 11:10:05"
From this output it is clear that the ISOdatetime() function creates an
object of class POSIXct. Moreover, it can be used in the seq() function to
generate sequences on the same lines as the as.Date() function. Additionally,
the difference in time can also be computed using the difftime() function
in allowed units as follows:
#Creating date and time objects x and y
> x <- ISOdatetime(year=2021, month=8, day=10, hour=11, min=10, sec=5, tz="")
> y <- ISOdatetime(year=2023, month=8, day=13, hour=10, min=5, sec=15, tz="")
#Generating date sequence with time
> seq(from=x, to=y, length=5)
[1] "2021-08-10 11:10:05 IST" "2022-02-09 16:53:52 IST"
[3] "2022-08-11 22:37:40 IST" "2023-02-11 04:21:27 IST"
[5] "2023-08-13 10:05:15 IST"
#Difference between 2 times objects
> difftime(y, x, units="auto")
Time difference of 732.955 days
The output can be interpreted on the same lines as earlier.
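The outputs shown above depend on the time zone of the computer (IST here). For a result that is the same on every system, the time zone can be fixed through the tz argument. The sketch below uses tz="UTC" purely as an illustrative choice; the object names are hypothetical.

```r
# tz = "UTC" is chosen here only to make the result reproducible
x.utc <- ISOdatetime(2021, 8, 10, 11, 10, 5, tz = "UTC")
y.utc <- ISOdatetime(2023, 8, 13, 10, 5, 15, tz = "UTC")
gap.auto  <- difftime(y.utc, x.utc)                    # units chosen automatically
gap.hours <- difftime(y.utc, x.utc, units = "hours")
hourly <- seq(from = x.utc, by = "hour", length.out = 3)
print(gap.auto)
print(gap.hours)
print(hourly)
```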
Next, we discuss the other two classes of objects which are used to
represent date and time. These two classes are "POSIXlt" and "POSIXct".
The functions as.POSIXlt() and as.POSIXct() are used to convert the
objects of other classes, specially, "character" and "Date", to the "POSIXlt"
and "POSIXct" classes. These functions can also be used to manipulate
objects of these classes. The main difference between these two classes is
due to the internal storage of the values. The origin of time for the "POSIXct"
class is January 1, 1970, which means time data is stored as the number of
seconds since January 1, 1970 and the "POSIXlt" class stores the time data
as a list with a number of components, namely, "sec", "min", "hour",
"mday", "mon", "year", "wday", "yday" and "isdst". Consider the
following for understanding purpose:
#Testing for list
> is.list(as.POSIXct("1986-08-13 10:10:00"))
[1] FALSE
> is.list(as.POSIXlt("1986-08-13 10:10:00"))
[1] TRUE
> as.POSIXlt("1986-08-13 10:10:00")$year
[1] 86
> as.POSIXlt("1986-08-13 10:10:00")$wday
[1] 3
> as.POSIXlt("1986-08-13 10:10:00")$yday
[1] 224
These two functions also accept character strings in the following formats, like
the as.Date() function.
Date format: "%Y-%m-%d" or "%Y/%m/%d"
Time format: "%H:%M:%S" or "%H:%M"
The Code and Value for POSIX* class functions: %H for decimal hours, %M for
decimal minutes and %S for decimal seconds. There are other codes as well. To
see them you can take help on the strptime() function.
Other formats are ambiguous to these functions. If the input string is not in the
standard formats, then the format argument of these functions can be used
for conversion.
Note: Unless a list time object is required "POSIXct" is the obvious choice.
Also, you can see the system time, by simply using the function Sys.time().
Now, we show some examples in which we convert the character strings
consisting of dates and times using the as.POSIXct() and as.POSIXlt()
functions.
#Converting different times to POSIX* class
> as.POSIXct(c("2023/08/13","2023/10/31"))
[1] "2023-08-13 IST" "2023-10-31 IST"
> as.POSIXct(c("2021-08-10 11:10:05", "2023-08-13 10:05:15"))
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
> as.POSIXct(c("2021-08-10 11:10", "2023-08-13 10:05"))
[1] "2021-08-10 11:10:00 IST" "2023-08-13 10:05:00 IST"
> as.POSIXct(c("2021/08/10 11:10:05", "2023/08/13 10:05:15"))
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
Now, we see what happens if we convert POSIX* class object to Date class
object as follows:
> as.Date(as.POSIXct(c("2021/08/10 11:10:05", "2023/08/13 10:05:15")))
[1] "2021-08-10" "2023-08-13"
Observe that due to the as.Date() function now the time component is
removed.
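The difference in the internal storage of the two POSIX* classes described earlier can be seen by looking inside the objects. The sketch below fixes tz="UTC" only to make the result reproducible; the object names are illustrative.

```r
ct <- as.POSIXct("1986-08-13 10:10:00", tz = "UTC")
lt <- as.POSIXlt("1986-08-13 10:10:00", tz = "UTC")
secs <- unclass(ct)            # seconds since 1970-01-01 00:00:00 UTC
full.year <- lt$year + 1900    # $year counts years since 1900
month.no  <- lt$mon + 1        # $mon runs from 0 (January) to 11
day.of.year <- lt$yday + 1     # $yday runs from 0
print(as.numeric(secs))
print(c(full.year, month.no, day.of.year))
```

This also explains why the year component of the "POSIXlt" object shown earlier was 86 and not 1986.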
Next, we convert a character vector named Timedata, consisting of date and
time strings in a nonstandard format, to POSIX* class using the as.POSIXct()
function as follows:
#Creating a vector consisting time in nonstandard formats
> Timedata <- c("10/August/2021:11:10:05", "13/August/2023:10:05:15")
> as.POSIXct(Timedata)
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
Observe that we are getting an error message because the strings are not in
standard formats. So we can either use the strptime() function (explore
yourself) to get the strings in standard format before conversion or directly use
the format argument of the as.POSIXct() function and define the format
as follows:
#Converting the character string to POSIX class using format
#argument
> as.POSIXct(Timedata, format="%d/%B/%Y:%H:%M:%S")
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
Moreover, the class of the created time object can be verified using the str()
function and time difference can be checked using difftime() function as
earlier. To do so, we first assign the time object to TD, then check its internal
structure as follows:
#Assigning object to TD
> TD <- as.POSIXct(Timedata, format="%d/%B/%Y:%H:%M:%S"); TD
[1] "2021-08-10 11:10:05 IST" "2023-08-13 10:05:15 IST"
#Checking internal structure
> str(TD)
POSIXct[1:2], format: "2021-08-10 11:10:05" "2023-08-13 10:05:15"
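The strptime() route mentioned above can be sketched as follows. Numeric months (%m) are used here instead of month names (%B), because the matching of month names depends on the language settings of the system; the strings, the tz="UTC" choice and the object names are illustrative assumptions.

```r
# Numeric months avoid the locale dependence of %B
raw <- c("10/08/2021:11:10:05", "13/08/2023:10:05:15")
lt <- strptime(raw, format = "%d/%m/%Y:%H:%M:%S", tz = "UTC")   # POSIXlt
ct <- as.POSIXct(raw, format = "%d/%m/%Y:%H:%M:%S", tz = "UTC") # POSIXct
pretty <- format(ct, "%Y-%m-%d %H:%M")      # back to character strings
bad <- as.POSIXct("31/02/2021:00:00:00",
                  format = "%d/%m/%Y:%H:%M:%S", tz = "UTC")     # invalid date
print(class(lt))
print(pretty)
print(is.na(bad))
```

A string which matches the format but does not describe a valid date is converted to NA, so it is worth checking the result with is.na() after conversion.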
SAQ 5
Write the output of the following:
(i) as.POSIXlt(c("2023/08/13","2023/10/31"))
(ii) as.POSIXlt(c("2021-08-10 11:10:05", "2023-08-13 10:05:15"))
(iii) as.POSIXlt(c("2021-08-10 11:10", "2023-08-13 10:05"))
(iv) as.POSIXlt(c("2021/08/10 11:10:05", "2023/08/13 10:05:15"))
4.7 SUMMARY
The main points discussed in this unit are as follows:
The creation of a data frame object is discussed together with data
frame subsetting.
The main functions used on a data frame object to manipulate data are
discussed.
The functions used to get formatted outputs are discussed.
4.9 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. We first create a data frame named Adm.data using the following
code:
Adm.data <- data.frame( c("Shreyash","Prithu",
"Yuvaan","Advika","Pawan","Pehu"),as.factor(c("Male",
"Male", "Male", "Female", "Male", "Female")),c(88.55,
80.13, 85.31, 75.22, 65.04, NA),c(TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE))
(i) After creating Adm.data print it and observe the output.
(ii) We can set the row and column names in a single command using
the dimnames() function in the following manner:
dimnames(Adm.data) <- list(paste0("R",1:6),
c("Name","Gender","Percentage","AgeG30"))
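The two steps above can be run together and checked as in the following minimal sketch. It reuses the data and names given in this solution; the only addition is the final pair of calls, which confirm that both sets of names were applied.

```r
# Create the data frame with default (automatically generated) column names
Adm.data <- data.frame(
  c("Shreyash", "Prithu", "Yuvaan", "Advika", "Pawan", "Pehu"),
  as.factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
  c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Set row names (first list component) and column names (second component)
dimnames(Adm.data) <- list(paste0("R", 1:6),
                           c("Name", "Gender", "Percentage", "AgeG30"))

rownames(Adm.data)   # row names R1 to R6
colnames(Adm.data)   # the four column names
```

Keeping each name inside a single pair of quotes on one line avoids accidentally embedding a line break in a column name.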
UNIT 5
GRAPHICAL REPRESENTATION
OF DATA WITH R
Structure
5.1 Introduction
    Expected Learning Outcomes
5.2 Line and Scatter Plots
    Line Plot
    Scatter Plot
    Saving a Created Plot
5.3 Pairs Plot
5.4 Stem and Leaf Plot
5.5 Bar Plot
5.6 Histogram
5.7 The curve() Function
5.8 Box Plot
5.9 Pie Chart
5.10 Strip Chart
5.11 Cloud Plot
5.12 Conditional Plot
5.13 Summary
5.14 Terminal Questions
5.15 Solutions/Answers
5.1 INTRODUCTION
In the first four units of Block 1, you have learnt different types of objects of R
like vectors, matrices, arrays, lists and data frames. In addition to this, you
have learnt indexing, the method of subsetting, testing for membership/class
and coercion of classes of R objects.
The main objective of this unit is to make you familiar with the functions which
are most frequently used for the graphical representation of data. Graphical
representations of statistical data help us to present the data in a more
meaningful way, which is easily understandable and helps us to take decisions
and draw conclusions quickly. Often it is essential to present statistical data
graphically during the statistical analysis. R provides various graphical
functions to create plots suited to the type of data.
There are several advantages associated with the graphical representation of
data. Some of them are as follows:
Graphical representations are more acceptable in comparison to data
presentations.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
Let us first discuss the method of creating a line plot using the plot()
function.
For the illustration purpose consider the following arbitrary data of the sales of
steel for the period 2011-2022.
Year    Sale of Steel (in thousand tonnes)
2011    7.9
2012    8.2
2013    9.5
2014    10.5
2015    8.1
2016    9.3
2017    8.6
2018    9
2019    4
2020    5
2021    8.5
2022    13
Now, we discuss the method of creating a line plot of the given sales data in
R. We first assign the year data to vector object Yr and steel sales data to a
vector object Sale as follows:
#Assigning sales data
> Yr <- 2011:2022
> Sale <- c(7.9, 8.2, 9.5, 10.5, 8.1, 9.3, 8.6, 9, 4, 5, 8.5,
13)
To create a line plot using the plot() function, we assign its x argument as
Yr (for x-axis), the y argument as Sale (for y-axis) and the type argument as
"l" (to create a line plot), the xlab argument as ‘Year’, the ylab argument as
‘Sales of steel (in thousand tonnes)’ and the main argument as ‘Sales of steel
for the period 2011-2022’ as follows:
#Creating a line plot
> plot(x=Yr, y=Sale, type="l", xlab="Year",
+ ylab="Sales of steel (in thousand tonnes)",
+ main="Sales of steel for the period 2011-2022")
Fig. 5.1: Plot of the sales of steel data for the period 2011-2022
The type argument of the plot() function can take different values such as
"p", "l", "b", "c", "o", "s", "h" and "n". Each of these types presents a
created plot differently.
You can observe that the line plot shown in Fig. 5.1 does not reflect the points
which are joined by the line segments. Additionally, line color is black (default)
and line width is also 1 (default) only. We now present Fig. 5.1 differently, by
changing the type of the plot as "o" using the type argument, width of the
line using the lwd argument and color of the line using the col argument of
the plot() function as follows:
#Creating a plot
> plot(Yr, Sale, type="o", col="blue", lwd=2, xlab="Year",
+ ylab="Sales of steel (in thousand tonnes)",
+ main="Sales of steel for the period 2011-2022")
The created plot is shown in Fig. 5.2. Note that when the type of the plot is
chosen as "o" the plot() function will create a plot in which points will be
overplotted on the line. Also, lwd argument is used to increase the thickness
of the line and the col argument is used for blue color line here.
Note: A higher value of lwd argument displays the increased width
(thickness) and a smaller value displays less thickness of the line from its
default value 1.
Possible types for the type argument in the plot() function are as follows:
"p" for points,
"l" for lines,
"b" for both points and lines,
"c" for empty points joined by lines,
"o" for overplotted points and lines,
"s" and "S" for stair steps,
"h" for histogram-like vertical lines, and
"n" does not produce any points or lines.
Fig. 5.2: Plot of the sales data with different line thickness, color and type
This plot() function command will only plot a single horizontal line, which
can be verified from the next screenshot.
Note: The range of x and y are chosen suitably for illustration purpose only.
Also, to plot a horizontal line, the same y-axis value is repeated 11 times
using the rep() function (i.e., equal to the length of x).
To display different line types, we shall plot more horizontal lines in the
created plot using the lines() function. This function is mainly used to add
lines in the already created plot. We first take help on this function as follows:
#Seeking help
> ?lines
starting httpd help server ... done
Additionally, the lines() function also supports arguments such as lty,
col, lwd and type. These arguments are used on the same lines as
discussed in plot() function. To add more lines to the created plot, we run a
for loop using the lines() function to draw 5 more horizontal lines after the
plot() function command. Moreover, we use the lty, col and lwd
arguments to differentiate between the lines.
#Displaying different types of lines
> for(i in 2:6){
+ lines(x,rep(i,11), lty=i, col=i, lwd=2) }
The created plot is shown in Fig. 5.2.
Integer value of the lty argument displays the following line types:
0 "blank"
1 "solid"
2 "dashed"
3 "dotted"
4 "dotdash"
5 "longdash"
6 "twodash"
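The complete sequence described above (one horizontal line drawn with plot(), then five more added with lines()) can be sketched as follows. The x range 1 to 11, the y-axis limits and the axis labels are assumptions chosen only so that all six lines are visible; they are not taken from the original listing.

```r
# Assumed x values: 11 points, so each horizontal line needs rep(value, 11)
x <- 1:11

# First horizontal line at height 1 with the default solid line type (lty = 1).
# ylim is widened so that the lines added later fit inside the plot region.
plot(x, rep(1, 11), type = "l", lty = 1, lwd = 2,
     ylim = c(0, 7), xlab = "x", ylab = "Value of lty")

# Add five more horizontal lines, one per line type 2 to 6,
# each in a different color
for (i in 2:6) {
  lines(x, rep(i, 11), lty = i, col = i, lwd = 2)
}
```

Because lines() only adds to an existing plot, the axis limits must already be wide enough when plot() is called; this is why ylim is set in the first command.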
Note: You can have a look at the different available colors using the following R
commands:
#Viewing available colors
> demo("colors")
Next, we illustrate other important graphical parameters such as pch and cex
of the plot() function. Note that the pch argument is used to plot a character
and the cex argument controls the size of the plotting character.
Let us now display the first 25 plotting characters available in R by plotting
them diagonally. To do this, we first assign the x-axis and y-axis values as 1
to 25. To plot each point in a different color we shall use the col argument
and in a different character we shall use the pch argument. Also, we assign the
cex argument as 2 (as its default value is 1) to display the characters bigger
than the default size as follows:
#Assigning x and y
> x <- 1:25
> y <- x
#plotting of a diagonal line consisting of first 25 plotting
#characters
> plot(x, y, pch=1:25, col=1:25, cex=2)
Furthermore, the cex argument is used to enlarge the size of the plotting
characters. To show the importance of the cex argument, we shall vary it by
0.1 in the range 0.6 to 3 (for illustration purpose only) as follows:
#plotting a diagonal line consisting of characters of different
#sizes
> plot(x, y, pch=1:25, col=1:25, cex=seq(0.6,3,0.1))
The created plot is shown in Fig. 5.6. From Fig. 5.6 you can observe that the
plotting characters are appearing in increasing sizes due to the cex argument
(starting plotted characters are smaller than the characters plotted at the end),
in different color due to the col argument and in different characters due to
the pch argument.
5.2.2 Scatter Plot
A scatter plot is used to display bivariate data using characters or symbols,
generally dots. It is mainly used to show the relationship between two
quantitative variables for a set of data. So, using a scatter plot we can
easily check whether the variables are correlated or not. For details on
correlation you can refer to Unit 9 of MST-015.
In R a scatter plot is created using the plot() function. Most importantly, if
the type argument of the plot() function is not specified, then by default it
creates a scatter plot. For the illustration purpose we create a scatter plot
between the Murder and Assault variables of the USArrests data frame
(discussed in Unit 4 of MST-015) using the plot() function. Recall that these
two variables can be extracted from the USArrests data frame using the ‘ $ ’
as USArrests$Murder and USArrests$Assault or otherwise we can use
column numbers to refer them as USArrests[,1] and USArrests[,2].
Then a scatter plot can be created by assigning the first two arguments of the
plot function as the Murder and Assault variables, the xlab argument as
“Murder”, the ylab argument as “Assault” and the main argument as "Scatter
plot of Assault vs Murder" respectively, as follows:
The created plot is shown in Fig. 5.7. The scatter plot depicts a positive
relationship between the Murder and Assault variables, which means that as the
number of murders increases, the number of assault cases also increases.
Note: Different pch, col, and cex can also be used in the plot() function,
while creating a scatter plot. Also recall from Unit 4 of MST-015 that the
columns of a data frame can also be referred as variables of a data frame.
5.2.3 Saving a Created Plot
Till now, you have learnt to create a line plot and a scatter plot. After creating a
plot, you may be interested in saving it in a specific format. A number of ways
are available to save a created plot. In this subsection, we shall discuss the
mainly used methods of saving a created plot at chosen locations.
Recall that, in the beginning of this unit, we have given a sales data. For the
illustration purpose, we now discuss the methods of saving a created plot,
shown in Fig. 5.1, in a specific format. For this, we generally set the working
directory using the setwd() function discussed in Unit 4 of MST-015 course.
We next discuss 3 methods of saving a created plot.
Method 1:
To save a created plot go to the menu bar of the R window and do the
following steps:
Step 1: After creating a plot, click on the graphic window.
Step 2: Click on ‘File’ and then on ‘Save as’ as follows:
Step 2: After Step 1, open the graphics device in one of these formats, BMP,
JPEG, PNG and TIFF, say PNG. So, firstly we write the format name, say,
png, then in parentheses we write the name of the file with format as extension
as a character string, say, “TarunaKumari.png” as follows:
# Opening the graphical device in png format
> png("TarunaKumari.png")
Then we create a plot and close the graphical device window as follows:
#Creating a plot
> plot(x=Yr, y=Sale, type="l", xlab="Year",
+ ylab="Sales of Steel (in thousand tonnes)",
+ main="Sales of steel for the period 2011-2022")
After executing these commands, the plot will be saved in the .png format in
the working directory that was set using the setwd() function.
Note: On the similar lines, a created plot can also be easily saved in any
allowed format. Note that a plot can also be saved in PDF format on the same
lines. For more details you can seek help as follows:
#Seeking help
> ?png
starting httpd help server ... done
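The open-plot-close pattern discussed in this method can be sketched end to end as follows. As a hedged illustration, the sketch uses the pdf() device mentioned in the note above, and it writes to a temporary folder instead of the working directory; the file name steel_sales.pdf is hypothetical. For a PNG file, png() would be used in place of pdf() in the same position.

```r
# Sales data used throughout this section
Yr   <- 2011:2022
Sale <- c(7.9, 8.2, 9.5, 10.5, 8.1, 9.3, 8.6, 9, 4, 5, 8.5, 13)

# Hypothetical output file placed in R's temporary folder
out_file <- file.path(tempdir(), "steel_sales.pdf")

pdf(out_file)                      # Step 1: open the graphics device
plot(x = Yr, y = Sale, type = "l", # Step 2: create the plot
     xlab = "Year",
     ylab = "Sales of steel (in thousand tonnes)",
     main = "Sales of steel for the period 2011-2022")
dev.off()                          # Step 3: close the device to write the file

file.exists(out_file)              # check that the file was created
```

The file is complete only after dev.off() is called; if the device is left open, later plots are sent to the same file.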
SAQ 1
Write a command to create a plot which displays the first twenty plotting
characters in decreasing size diagonally.
Fig. 5.8: Matrix of scatter plots between the 4 variables of the USArrests data
Note that the produced matrix of scatter plots is a symmetric matrix. The upper
triangular part is the same as the lower triangular part of the matrix. The first
row of this matrix depicts the relationships of Murder variable with Assault,
UrbanPop and Rape variables. Proceeding similarly, the second row of this
matrix depicts the relationships of Assault variable with Murder, UrbanPop
and Rape variables. Remaining rows of the matrix can be inferred on the
same lines.
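A matrix of scatter plots like the one in Fig. 5.8 can be produced with the pairs() function, as in the following minimal sketch. The main title used here is an assumption for illustration; the exact arguments of the original command are not shown in this part of the unit.

```r
# Names of the four variables that form the rows and columns of the matrix
names(USArrests)

# One scatter plot for every pair of variables of the data frame
pairs(USArrests,
      main = "Matrix of scatter plots of the USArrests data")
```

Each off-diagonal panel plots the variable named in its row (y-axis) against the variable named in its column (x-axis), which is why the panels above and below the diagonal show the same pairs with the axes interchanged.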
SAQ 2
Write R code to create a suitable plot to depict the relationship between the
first four variables of the following iris data set, whose first 10 rows are as
follows:
the sake of convenience, we first extract the first seven values from Murder
variable and assign it to x, then create a stem and leaf plot using the stem()
function as follows:
#Extracting and assigning first 7 values from Murder variable
> x <- USArrests$Murder[1:7];x
[1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3
2 | 3
4 |
6 | 9
8 | 180
10 | 0
12 | 2
You can also expand the scale of the stem and leaf plot to make it more
readable (if required) using the scale argument (with default value 1) of the
stem() function as follows:
#Creating a stem and leaf plot with scale=2
> stem(x, scale=2)
The decimal point is at the |
3 | 3
4 |
5 |
6 |
7 | 9
8 | 18
9 | 0
10 | 0
11 |
12 |
13 | 2
Note: In the stem() function, when scale=2, the stem and leaf plot is
expanded to roughly twice the default length. Additionally, the values
appearing on the left side are known as stems and the values appearing on the
right side are known as leaves.
SAQ 3
Write R command to create a stem and leaf plot of the UrbanPop variable of
the USArrests data with scale as 2.
In the next section, we shall discuss the method of creating a bar plot. If the
bars are not described which make up the plot, then we use the table()
Observe that the first row of the obtained output is showing different numbers
available in the data and second row is showing the frequency corresponding
to each number appearing in the first row.
Next, we illustrate the method of creating a bar plot using the barplot()
function. To do so, we create a bar plot of the Temp variable of the
airquality data. Let us first view the first 5 rows of the data using the
head() function as follows:
#Viewing the first 5 rows of the data
> head(airquality, 5)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
Note that, before creating a bar plot of the Temp variable, we must compute
the frequency table of the Temp variable using the table() function as
follows:
#Computing frequency table
> table(airquality$Temp) #table(airquality[,4])
56 57 58 59 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
1 3 2 2 3 2 1 2 2 3 4 4 3 1 3 3 5 4 4 9 7
78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96 97
6 6 5 11 9 4 5 5 7 5 3 2 3 2 5 3 2 1 1
After computing the frequency table, to create a bar plot, we supply the first
argument of the barplot() function as the frequency table. Assign the main
argument to add a main title to the plot and the xlab and ylab arguments to
add labels to the x-axis and y-axis, respectively. Moreover, the color argument
col is used to fill color in the bars. In the earlier shown examples, the col
argument was assigned as positive integer values, but colors can also be
assigned as character string(s). So, for the illustration purpose, we now assign
the col argument as "lavender" as follows:
#Creating a bar plot
> barplot(table(airquality$Temp), col="lavender", xlab="Temp",
ylab="Frequency", main="Bar plot of the Temp data")
The obtained bar plot is shown in Fig. 5.9.
Fig. 5.9: Bar plot of the Temp variable of the airquality data
the datasets package. If interested, you can take help on this data and view
what each column of this data is representing.
[Figure: bar plot of the longley data by Year, 1947-1962]
Next, we create a subdivided horizontal bar plot of the same data. To create a
horizontal bar plot instead of a vertical bar plot, we use the horiz argument
of the function and assign it as TRUE and also assign the beside argument as
FALSE. Additionally, we keep the remaining arguments as earlier in the
following manner:
#Creating a horizontal subdivided bar plot
> barplot(cbind(GNP, Unemployed, Employed) ~ Year,
+ data = longley,
+ beside = FALSE,
+ horiz = TRUE,
+ legend.text = c("GNP","Unemployed","Employed"),
+ args.legend = list(x = "bottomright"),
+ col=c("red","blue","green")
+ )
SAQ 4
Write R code to create a multiple bar plot of the following data of the number
of students admitted to the M.Sc and B.Sc in Applied Statistics programme in
different academic years.
Year Admitted to M.Sc Admitted to B.Sc
2016-2017 500 1000
2017-2018 550 600
2018-2019 650 800
2019-2020 800 900
2020-2021 720 950
2021-2022 1000 1200
5.6 HISTOGRAM
A basic frequency histogram of an ungrouped data x can be created in R
using the hist() function of the graphics package. The main
arguments of interest of the hist() function are as follows:
For the illustration purpose, we now create a histogram of the Temp variable
of the airquality data using the hist() function. To do so, we assign the
x argument of the function as the Temp data and give suitable x label and
overall title using the xlab and main arguments as follows:
#Creating a histogram
> hist(airquality$Temp, xlab = "Temp",
+ main = "Histogram of Temp variable")
From Fig. 5.12, observe that the default fill color of the rectangular bars is
grey. Suppose you would like to fill the bars with orange color and want the
borderlines to be blue. Then, it can be done by using the col and border
arguments of the hist() function. Additionally, we can also display the
frequencies by assigning the labels argument as TRUE. Moreover, by default
the axes argument is TRUE, which is used to display the axes. So, to hide the
axes, we assign it as FALSE. In addition to all these we assign the xlim,
xlab and main arguments, to specify x-axis range, label and overall title of
the plot as follows:
#Creating a histogram
> hist(airquality$Temp, #data
+ col = "orange", #color to be filled in bars
+ border = "blue", #border color of the bars
+ labels = TRUE, #displaying frequency
+ axes = FALSE, #to hide x and y axes
+ xlim = range(airquality$Temp), #range of the Temp data
+ xlab = "Temp", #x-axis label
+ main = "Histogram of Temp data" #main title
+ )
$counts
[1] 8 10 15 19 33 34 20 12 2
$density
[1] 0.010457516 0.013071895 0.019607843 0.024836601
[5] 0.043137255 0.044444444 0.026143791 0.015686275
[9] 0.002614379
$mids
[1] 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5
$xname
[1] "airquality$Temp"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
It can be verified that the internal structure of the obtained output is a list. So,
its breaks, counts and mids components can be extracted by appending
‘ […] ’ operator (consisting list component number) as follows:
#Extracting frequency distribution
> hist(airquality$Temp,plot=FALSE)[c(1,2,4)]
$breaks
[1] 55 60 65 70 75 80 85 90 95 100
$counts
[1] 8 10 15 19 33 34 20 12 2
$mids
[1] 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5
Here, $breaks is showing the boundaries of the class intervals, $counts is
showing the frequencies corresponding to each interval and $mids is showing the
middle points of the class intervals.
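The same components can also be extracted by name and arranged as a frequency distribution table, as in the following sketch. The object names h and freq_table and the column names are hypothetical, chosen only for this illustration.

```r
# Compute the histogram components without drawing the plot
h <- hist(airquality$Temp, plot = FALSE)

# There is one more break than there are class intervals, so the lower
# limits drop the last break and the upper limits drop the first break
freq_table <- data.frame(
  lower = head(h$breaks, -1),  # lower limit of each class interval
  upper = h$breaks[-1],        # upper limit of each class interval
  mid   = h$mids,              # middle point of each class interval
  count = h$counts             # frequency of each class interval
)
freq_table
```

Extracting by name (h$breaks, h$counts, h$mids) gives the same values as extracting by position with [c(1,2,4)], but it does not depend on the order of the list components.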
Note that, to fill the histogram with lines, we have assigned the density
argument of the function as 2. If we increase the value of density argument,
then the lines will appear closer and denser.
Next, we illustrate the use of one of the most important arguments, breaks,
of the hist() function. It is used to create a histogram with a specific number
of breaks. For the illustration purpose, we now create the same histogram by
assigning 10 to the breaks argument as follows:
#Creating histogram using breaks argument
> hist(airquality$Temp,
+ col = "lightblue",
+ border = "black",
+ breaks = 10, #number of breaks
+ main = "Histogram of Temp data",
+ xlim = range(airquality$Temp),
+ xlab = "Temp",
+ axes = TRUE,
+ labels = TRUE)$breaks
[1] 55 60 65 70 75 80 85 90 95 100
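The obtained histogram is shown in Fig. 5.16. Note that, when the breaks
argument is assigned as a single number, it specifies the number of cells for
the histogram. R treats this number as a suggestion only and chooses
conveniently rounded break points, which is why the output above shows 9 cells
although 10 was supplied. Also, when the breaks argument is assigned as a
vector, it gives the breakpoints between histogram cells.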
+ xlim = range(airquality$Temp),
+ xlab = "Temp",
+ axes = TRUE,
+ freq = FALSE,
+ labels = TRUE)[c(1,2,3)]
$breaks
[1] 56 65 98
$counts
[1] 18 135
$density
[1] 0.01307190 0.02673797
From Fig. 5.17, observe that we get a density histogram plot as we have
assigned the freq argument as FALSE.
Note: Whenever you are using the breaks argument, it is always better to
see the minimum and maximum values of the data under consideration.
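The advice in the note, together with the way the density values of an unequal-interval histogram are obtained, can be sketched as follows. The break points 56, 65 and 98 are the ones shown in the output above; the object names h2 and manual_density are hypothetical. The density of a class is its frequency divided by the product of the total frequency and the class width.

```r
# Check the minimum and maximum first, so that the breaks cover all the data
range(airquality$Temp)

# Histogram components for two unequal class intervals (no plot is drawn)
h2 <- hist(airquality$Temp, breaks = c(56, 65, 98), plot = FALSE)
h2$counts    # frequency of each class interval
h2$density   # density of each class interval

# Density computed by hand: count / (total count * class width)
manual_density <- h2$counts / (sum(h2$counts) * diff(h2$breaks))
all.equal(h2$density, manual_density)
```

If the supplied breaks do not span the whole range of the data, hist() stops with an error, which is the practical reason for checking the range beforehand.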
SAQ 5
Write R code to create a histogram of the Wind variable of airquality data
with unequal class intervals.
For the illustration purpose, we now create a curve of the following function:
f(x)=x^3+x^2 over the range -5 to 5 using the curve() function. To do so, we
assign the expr argument of the function as f(x), the from argument as -5
and the to argument as 5 as follows:
#Creating a curve of the given function
> curve(expr=x^3+x^2, from=-5, to=5, col="blue", lwd=2,
+ ylab="f(x)=x^3+x^2")
You can observe that the curve() function also supports arguments such as
col, lwd, xlab, ylab and so on. These arguments can be easily used on the
same lines as discussed earlier. The plot of the given function is shown in Fig.
5.18.
SAQ 6
Write an R command to create a density curve of the normal distribution with
mean 2 and variance 16.
To create a box plot using the boxplot() function, we assign the first
argument of the function as the weight variable from the chickwts data. We
also give the label to the y-axis and main title for more clarity as follows:
#Creating a boxplot
> boxplot(x=chickwts$weight, ylab="Weights",
+ main = "Boxplot of Weights data")
The obtained box plot is shown in Fig. 5.21.
Note that, the box plot is appearing vertically (as the default value of the
horizontal argument of the boxplot() function is FALSE). We can also
present the box plot horizontally. To do so we take help of the horizontal
argument of the function. If we assign the horizontal argument as TRUE,
then the box plot will appear horizontally. For the illustration purpose, we now
create the same box plot horizontally using the horizontal argument as
follows:
#Creating a horizontal boxplot
> boxplot(x=chickwts$weight, horizontal=TRUE, xlab="Weights",
+ main = "Boxplot of Weights data")
Fig. 5.23: Side-by-side boxplot of the weights for six different types of feeds
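A side-by-side box plot like the one in Fig. 5.23 can be sketched with the formula method of the boxplot() function, where the weight variable is split by the feed variable of the chickwts data. The axis labels and the main title below are assumptions made for illustration.

```r
# The six feed types that define the six boxes
levels(chickwts$feed)

# One box of weights for each type of feed: weight is split by feed
boxplot(weight ~ feed, data = chickwts,
        xlab = "Feed type", ylab = "Weights",
        main = "Side-by-side boxplot of weights by feed type")
```

In the formula weight ~ feed, the variable on the left is the numeric variable to be summarised and the variable on the right is the grouping factor.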
SAQ 7
Write R command to create box plots of all the variables of the USArrests
data frame using different colors. Also give a main title to the plot.
The x argument of the function is used to assign the values which are
displayed as the areas of pie slices. The labels argument is used to give
labels to the slices, the clockwise argument is used for placing of slices
either counter clockwise (by default) or clockwise and the radius argument is
used to specify the radius of the pie chart.
Note: By default, the pie chart is drawn in the center of a square box whose
sides range from -1 to 1.
Now we illustrate the method of creating a pie chart of the given arbitrary
expenditure data of a company using the pie() function.
Category                  Expenditure (Rs. in Lakh)
Raw materials             1500
Taxes                     560
Other expenses            490
Depreciation              380
Dividends                 100
Manufacturing expenses    790
To create a pie chart of the given data, we create a vector named x of the
expenditure data, then supply it as an argument to the pie() function as
follows:
#Numeric vector data
> x <- c(1500,560,490,380,100,790)
#Creating a pie chart
> pie(x)
From Fig. 5.24, you can observe that in the obtained pie chart default
colors are filled in the slices. Also, the pie chart does not have labels to
enhance the readability, i.e., to show which slice belongs to which category.
Instead of labels, default numbers corresponding to the positions of the
vector elements appear on the pie chart. Moreover, it does not have a main
title as well.
So, to make the pie chart more readable after assigning data to x, we set
names of the elements of x, so that they can be used for naming the slices
using the labels argument of the function. Also, we use the col and main
arguments for more visual clarity as follows:
#Assigning data
> x <- c(1500,560,490,380,100,790)
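The steps just described (naming the elements of x, then using the names as slice labels together with colors and a main title) can be sketched as follows. The colors produced by rainbow() and the wording of the main title are assumptions; the category names are those of the expenditure table.

```r
# Expenditure data (Rs. in Lakh) from the table
x <- c(1500, 560, 490, 380, 100, 790)

# Name each element by its category so the names can label the slices
names(x) <- c("Raw materials", "Taxes", "Other expenses",
              "Depreciation", "Dividends", "Manufacturing expenses")

# Pie chart with slice labels, one color per slice and a main title
pie(x, labels = names(x),
    col = rainbow(length(x)),
    main = "Expenditure of a company (Rs. in Lakh)")
```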
Next, we illustrate the use of the radius and clockwise arguments of the
pie() function. As discussed earlier, the radius argument is used to specify
radius of pie chart and the clockwise is a logical argument with default value
as FALSE. It is used for placing the slices counter-clockwise or clockwise. Now
we create the same pie chart by using these two arguments as follows:
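A minimal sketch of the two arguments is given below. The radius value 0.6, the percentage labels and the main title are assumptions chosen for illustration; the object name share is hypothetical.

```r
# Expenditure data (Rs. in Lakh) with category names
x <- c(1500, 560, 490, 380, 100, 790)
names(x) <- c("Raw materials", "Taxes", "Other expenses",
              "Depreciation", "Dividends", "Manufacturing expenses")

# Percentage share of each category, rounded to one decimal place
share <- round(100 * x / sum(x), 1)

# Smaller pie (radius below 1) with the slices placed clockwise
pie(x, labels = paste0(names(x), " (", share, "%)"),
    radius = 0.6, clockwise = TRUE,
    col = rainbow(length(x)),
    main = "Expenditure of a company (Rs. in Lakh)")
```

A radius smaller than 1 leaves more room around the pie, which helps when the slice labels are long.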
SAQ 8
The following funds were disbursed during 2010 to 2017 by a leading financial
institution.
Write R command to create a pie chart of the following data. Also, add the
colors, main title, labels name and percentages to the created pie chart.
Now we create a strip chart of the data given in x using the stripchart()
function. To do so, we assign the first argument of the stripchart()
function as x, the method argument as "stack" and to enlarge the size of
the plotted symbol (square in this case) we assign the cex argument as 4 in
the following manner:
#Creating a strip chart
> stripchart(x, method="stack", cex=4)
Note that, as the method argument was assigned as "stack", overplotting
did not occur here. But if we want the overplotting to take place, then the
method should be chosen as "overplot" and a specific plotting character
can be chosen as follows:
#Creating an over plotted strip chart
> stripchart(x, method="overplot", col=2, pch="*", cex=5)
After using the method as "overplot" the obtained strip chart will appear
as shown in Fig. 5.28.
SAQ 9
Write an R command to create a strip chart of the given data by controlling
overplotting and using the rep() function.
10 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8 8 7 7
7 7 7 7 7 6 6 6 6 6 6 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1
5.11 CLOUD PLOT
The cloud plot is a three-dimensional scatter plot. It is created using the
cloud() function available in the lattice package. The formula methods do
most of the work here.
For the illustration purpose we now create a cloud plot of the three variables of
the Insurance data frame available in the MASS package. Let us view the first
few rows of the data as follows:
#Viewing first 6 rows of the Insurance data
> require(MASS)
> head(Insurance)
District Group Age Holders Claims
1 1 <1l <25 197 38
2 1 <1l 25-29 264 35
3 1 <1l 30-35 246 20
4 1 <1l >35 1680 156
5 1 1-1.5l <25 284 63
6 1 1-1.5l 25-29 536 84
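A cloud plot of three variables of this data frame can be sketched with a formula of the form z ~ x * y. The choice of Claims, Holders and Age below is an assumption made for illustration, since the three variables used in the original command are not shown here. Age is an ordered factor, so it is converted to its integer codes inside the formula; the axis labels and the main title are also illustrative.

```r
library(lattice)  # provides the cloud() function
library(MASS)     # provides the Insurance data frame

# z ~ x * y: Claims on the vertical axis, Holders and Age on the base axes
p <- cloud(Claims ~ Holders * as.numeric(Age), data = Insurance,
           xlab = "Holders", ylab = "Age group", zlab = "Claims",
           main = "Cloud plot of the Insurance data")

# lattice functions return an object; printing it draws the plot
print(p)
```

At the console the plot appears automatically, but inside a script or a loop the returned object has to be printed explicitly.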
SAQ 10
Write R code to create a cloud plot of the first 10 rows of the randu data
frame. Also, add the following x, y and z axes labels and the main title to the
created plot.
x-axis label: Uniform1
y-axis label: Uniform2
z-axis label: Uniform3
Main title: Cloud Plot
5.12 CONDITIONAL PLOT
Conditional plots in R are created using the coplot() function available in
the graphics package. Using coplot() function, two variants of the
conditioning plots can be framed. Its first argument formula describes the
form of a conditional plot. The two possible ways of assigning the formula
are as follows:
Conditioning on one variable:
A formula of the form y~x|z is used to plot y versus x by conditioning on the z
variable.
Conditioning on two variables:
A formula of the form y~x|z*w is used to plot y versus x by conditioning on z
and w variables.
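Both forms of the formula can be tried on the quakes data frame that ships with R, as in the following sketch. The choice of this data set and of its variables is an assumption made only for illustration: lat is plotted against long, conditioning first on depth and then on depth and mag together.

```r
# Variables available in the quakes data frame
names(quakes)

# Conditioning on one variable: lat versus long for intervals of depth
coplot(lat ~ long | depth, data = quakes)

# Conditioning on two variables: lat versus long for each combination
# of depth intervals and mag intervals
coplot(lat ~ long | depth * mag, data = quakes)
```

When a conditioning variable is numeric, coplot() divides it into overlapping intervals and draws one panel per interval; when it is a factor, one panel is drawn per level.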
For the illustration purpose, we now create a conditional plot of the variables of
iris data. To do so, we plot the Sepal.Length variable against the
Petal.Length variable by conditioning on Species variable of the data as
follows:
#Creating a conditional plot
> coplot(Sepal.Length~Petal.Length|Species, data=iris)
SAQ 11
Write R code to construct the conditional plot of the Sepal.Width against
Petal.Width by conditioning on Species for the iris data.
5.13 SUMMARY
The main points discussed in this unit are as follows:
To create line and scatter plot in R
To create Bar plot, Histogram, Box Plot in R
To create Stem and Leaf plot, Strip chart and Pie chart in R
To create conditional plot and a three-dimensional plot in R.
Usage of different arguments of the discussed functions to make graphs
more attractive, readable and comparable.
Methods of saving a created plot.
5.15 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. plot(1:20,1:20, pch=1:20, col=1:20, cex=seq(3,0.5,-0.1))
2. pairs(iris[, 1:4])
3. stem(USArrests$UrbanPop, scale=2)
4. The R code is as follows:
Ayear <- c("2016-2017","2017-2018", "2018-2019",
"2019-2020","2020-2021","2021-2022")
MSc <- c(500, 550, 650, 800,720,1000)
BSc <- c(1000, 600, 800, 900,950,1200)
barplot(cbind(MSc, BSc) ~ Ayear,
beside=TRUE,
xlab="Academic Year",
legend.text=c("MSc","BSc"),
args.legend=list(x = "topleft"))
5. The histogram of the wind data can be created using the following code:
hist(airquality$Wind,
col = "lightpink",
border = "white",
breaks=c(1,10,15,21),
main = "Histogram of Wind data",
xlim = range(airquality$Wind),
xlab="Wind",
axes = TRUE,
freq=FALSE,
labels = TRUE)
6. The density curve of the normal distribution with mean 2 and variance 16
can be created using the curve() function in a single command as
follows:
curve(expr=dnorm(x, mean=2, sd=4), lwd=2, from=-30,
to=30)
7. The boxplots of all the variables of the USArrests data can be created
using the following code:
boxplot(USArrests, #data frame
col=2:5, #colours to be filled
main = "Side-by-side boxplots of the variables of USArrests data")
8. To create a pie chart of the given data, we first assign the data to a vector
named Amount, then assign the names to its elements. Thereafter, we
compute the percentages and create a pie chart using the pie() function
as follows:
Amount <- c(1434, 1503, 1908, 2232, 3031, 4368, 5725,
6012)
names(Amount) <- as.character(2010:2017)
percentage <- (Amount/sum(Amount))*100
pie(percentage, col=1:8, labels =
paste(names(Amount),round(percentage,2), "%"),
main="Pie chart of percentage of amount disbursed" )
9. The strip chart of the given data can be created using the following code:
stripchart(rep(seq(10,1,-1),times=10:1),
method="stack")
10. The cloud plot can be created using the following command:
cloud(z~x*y, data=randu[1:10,] ,xlab="Uniform1",
ylab="Uniform2", zlab="Uniform3", pch=11, col="red",
main="Cloud Plot")
11. The conditional plot can be created using the coplot() function using the
following command:
coplot(Sepal.Width~Petal.Width|Species, data=iris)
MST-015
Introduction to R Software
Indira Gandhi National Open University
School of Sciences
BLOCK 2: Functions, Conditional Statements, Loops and Descriptive Statistics with R
Unit 6: Functions in R
Unit 7: Control-Flow Constructs of R
Unit 8: Apply Family in R
Unit 9: Descriptive Statistics and Correlation with R
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by the Director, School
of Sciences.
INTRODUCTION TO R SOFTWARE
R is a high-level language whose popularity is increasing day by day. It can also
be described as an environment designed for the statistical analysis of data and for graphics.
You may be astonished to know that the R language has been around since 1993.
The R language is a dialect of the S language.¹ The S language was developed at Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. The evolution of the S
language is described in the four books by John Chambers and coauthors.² For John
Chambers' efforts, the Association for Computing Machinery (ACM) presented him with its
Software System Award, whose citation mentioned that this language “forever altered how people
analyze, visualize and manipulate data”. R was written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland in New Zealand.
There are several reasons for the popularity of R. We are stating some of them here:
R is an interpreted language, which is free.
It is an outstanding piece of software, which is also easy to use.
It works on Windows, Unix, Mac and Linux.
A number of statistical packages are available for statistical data analysis.
It comes with several data sets.
Good support and back-up are available (via web pages, R documents and books) on
functions and packages.
It is widely accepted by researchers, industrialists and professors for data
analysis.
The main reason for the impressive growth in the popularity of the R language nowadays is the
emergence of data science as a career, because data is everywhere and experts are needed
to sort and analyze that data. So, together with the knowledge of computing, knowledge of
statistical methods and machine learning is also required.
This course is mainly written for learners who are beginners in the R computing software.
Throughout the development of this course, emphasis is given to the packages which
come with the base distribution (i.e., precompiled binary distributions of the base system)
during installation. It is essential for learners to understand the basics of R before
switching to more complicated problems, such as those discussed in the lab courses, i.e.,
MSTL-011: Statistical Computing Using R-I, MSTL-012: Statistical Computing Using R-II,
MSTL-013: Statistical Computing Using R-III and MSTL-015: Statistical Computing Using R-V. The
content of this course is organized into 9 self-explanatory units. The first five units are part of
Block 1 (Fundamentals of R Language) and the next 4 units are part of Block 2
(Functions, Conditional Statements, Loops and Descriptive Statistics with R). These units
can be summarized as follows:
Unit 1 (Introduction to R): It comprises the installation procedure, methods of seeking help
and details on the basic terminology of R.
Unit 2 (Nitty-Gritty of R): The second unit discusses R objects such as different types
of vectors, matrices, factors and arrays. It also throws light on missing values and on arithmetic and
logical operations.
Unit 3 (Membership Testing, Coercion and Lists in R): As is clear from the name, this unit
discusses membership testing and coercion of R objects. Additionally, list objects are also
discussed in this unit.
Unit 4 (Data Frames, Reading and Writing in R): This unit gives extensive details on data
frame objects, methods of reading from and writing to a file, and formatting commands.
¹ Refer to the “An Introduction to R” manual by the R Core Team.
² Refer to the “R Language Definition” manual by the R Core Team.
Unit 5 (Graphical Representation of Data with R): Different graphical functions that
are used to create scatterplots, boxplots, histograms, barplots, strip charts, stem-and-leaf
plots, pie charts, pairs plots, coplots, cloud plots, etc. are discussed in this unit.
Unit 6 (Functions in R): The method of creating your own function is discussed in this unit by
taking some suitable examples.
Unit 7 (Control-Flow Constructs of R): Control-flow constructs such as conditional
statements, different types of loops and the method of putting additional control on loops
using the next and break statements are discussed in this unit with examples.
Unit 8 (Apply Family in R): This unit comprises details on the usage and importance of the
apply family of functions.
Unit 9 (Descriptive Statistics and Correlation with R): Unit 9 comprises details on
measures of central tendency and dispersion together with examples on correlation
computations with R.
To develop this course, we have used the Windows operating system, and the R commands
written in this course were run on R version 4.1.1. In a Windows system, we interact with R
through the R console. Furthermore, the written commands can be easily saved. More details
on this are given in Unit 1 of this course.
In this course, the written code, associated outputs and the names of functions, R objects,
packages and operators are written in the ‘Lucida Console’ font and theory is written in the ‘Arial’
font. Additionally, the R commands are written in bold and the associated outputs are
not bold. Note that the lines starting with ‘ # ’ written before the R commands are
comments that are not executed; they are written to give a clear understanding of the code. Furthermore,
while studying this course, do all the illustrations on a computer, preferably by writing the
commands in R script files (in an integrated editor) available in the R Graphical User Interface
(RGui). Then do all the SAQs and TQs without using a computer.
It is important to note that if you use any R function in your research/publications for data
analysis, then you should cite the corresponding package in your written work. For example, to
cite the base package, first get the citation details using the citation() function
and then use the obtained reference for citation purposes as follows:
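A minimal sketch of retrieving the citation details is given below. It uses only the citation() and toBibtex() functions of the utils package, which is attached by default; the object name base_cit is chosen here for illustration, and the printed reference depends on the installed R version, so no output is shown.

```r
# Citation details for R itself (the base distribution)
base_cit <- citation()
print(base_cit)

# The same details as a BibTeX entry, convenient for reference managers
toBibtex(base_cit)

# For a contributed package, pass its name as a string, for example
# citation("pkgname"), where "pkgname" is an installed package
```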
If the citation details are not accessible (or available) via the citation() function at the
prompt, then learners may visit the CRAN (Comprehensive R Archive Network) page to get
the details of the contributors (such as authors' names, year and title) for citation purposes.
Lastly, in this introduction page I would like to express my deepest gratitude and thanks to
the R core team, Bill Venables, David M. Smith, John Chambers, Robert Gentleman, Ross
Ihaka, Martin Maechler and other contributors for providing access to enormous R sources
and for their substantial contribution in R language, which has extremely benefited the world.
The MST-015 (Introduction to R Software) is a 2 credit self-explained course, which is
developed for self-study. But still if you want to refer to additional books or references on
discussed topics you may refer to the following books and references.
Suggested Further Reading
1. Braun, W. J. & Murdoch, D. J. (2007). A First Course in Statistical Programming with R.
Cambridge.
2. Crawley, M. J. (2012). The R book. John Wiley & Sons.
3. Albert, J. & Rizzo, M. (2012). R by Example. Springer
4. Teetor, P. (2011). R Cookbook. O’REILLY.
5. Lafaye de Micheaux, P., Drouilhet, R., & Liquet, B. (2013). The R software:
Fundamentals of programming and statistical analysis. Springer.
6. Zuur, A., Ieno, E. N., & Meesters, E. (2009). A Beginner's Guide to R. Springer Science
& Business Media.
7. Heumann, C., Schomaker, M. & Shalabh (2016). Introduction to statistics and data
analysis: With Exercises, Solutions and Applications in R. Springer International
Publishing Switzerland.
8. Dalgaard, P. (2002) Introductory Statistics with R. New York: Springer- Verlag.
References
The packages used for the development of this course material are listed in the
following references:
1. R Core Team (2021). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
2. Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition.
Springer, New York. ISBN 0-387-95457-0.
3. Mirai Solutions GmbH (2023). XLConnect: Excel Connector for R. R package version
1.0.7. https://CRAN.R-project.org/package=XLConnect
4. Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R. Springer, New
York. ISBN 978-0-387-75968-5
5. Lukasz Komsta and Frederick Novomestky (2022). moments: Moments, Cumulants,
Skewness, Kurtosis and Related Tests. R package version 0.14.1. https://CRAN.R-
project.org/package=moments
Expected Learning Outcomes
After completing this course, you should be able to:
install R, seek help on functions and data sets, create R scripts and learn some basic
aspects of R;
create R objects, know the different data types and use membership
testing and coercion functions;
read and write from/to a file;
do graphic representation of data with R;
do looping, create control statements and functions in R; and
compute descriptive statistics and correlation with R.
Block 2 consists of four units, namely, Units 6, 7, 8 and 9. We strongly recommend the
learners to study Block 1 of the MST-015 (Introduction to R Software) course before
studying Block 2.
Unit 6: After learning from the units of Block 1, you may feel that it is time to write your own
function. A function is mainly required when you want to run a piece of code (some statements
together) for different inputs. Functions save you from retyping the same
code again and again. Some suitable examples from the MSTL-011 (Statistical Computing
Using R-I) lab course are discussed in this unit.
Unit 7: This unit comprises rich detail on the control-flow constructs of R such as different
types of if statements, the for loop, the repeat loop and the while loop. To get additional control
over conditional statements and loops, the next and break statements are also
discussed in this unit with examples.
Unit 8: It is not always necessary to write a loop. Situations may arise in which you
can condense your entire loop code into a single command with the help of the apply family
of functions. So Unit 8 throws light on the lapply(), sapply(),
apply(), tapply() and mapply() functions.
Unit 9: Unit 9 is the last unit of this MST-015 (Introduction to R Software) course. Its
objective is not to discuss the fundamentals or basics of R; rather, it helps you to start with
statistical computing in R using descriptive statistics. It consists of details on the computation of
measures of central tendency and dispersion along with correlation coefficients.
This material has been developed for self-study. We hope you will enjoy studying this block.
use built-in function and manipulate them by using their outputs in your own functions;
create loops in R;
compute the Pearson’s and Spearman’s correlation coefficients using built-in functions
and without using built-in functions.
6.1 INTRODUCTION
In any programming language, a function is a self-contained piece of code
(with or without a name) that carries out some specific, well-defined task.
If a name is not given to a function, then it is called an anonymous
function. A function contains some executable statements, which are
written to accomplish a particular task, for example, to display a
message, to compute some expression, to compute the coefficient of variation
and so forth.
In this unit, we shall discuss two categories of functions, namely built-in
functions (which come with packages) and user-defined functions (which
are created by users). Throughout the MST-015 (Introduction to R Software)
and MSTL-011 (Statistical Computing Using R-I) courses, we have used
built-in and user-defined functions. So, it is important for you to know the
difference between the two. The main difference between these two
categories of functions is that built-in functions need not be
written by the user, as they come as part of some package and are already
written to accomplish a predefined task, whereas user-defined functions
are developed by the user to accomplish an intended task. Also, user-defined
functions can be modified according to the requirements of the
user. In any user-defined function, we can use any already defined
function (either built-in or user-defined) at the time of writing the code.
There are several advantages of using functions, some of them are as
follows:
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
Expected Learning Outcomes
After completing this unit, you should be able to:
distinguish between the built-in and user-defined functions;
learn the advantages of the user-defined functions;
differentiate between actual and formal arguments;
learn the concept of argument matching; and
learn to create user-defined functions.
6.2 USER DEFINED FUNCTIONS
In Block 1 of the MST-015 (Introduction to R Software) course, we have
discussed a number of objects of R programming, like vectors, matrices,
arrays, lists, data frames, expression and null objects. In this unit, we will
cover one more object of R programming, namely function objects. It is
surprising but true that functions are R objects, which also have a class and
a type like other R objects. A user-defined function can be defined
anywhere in the code. Before using any user-defined function, we need to
make sure that it is defined in advance. Function objects, either built-in or
user-defined, have three components, which are:
1. A formal list of arguments.
2. Body of the function.
3. An environment.
A user-defined function can be known or anonymous. When a function
has a name, we call it a known function; if it does not have a name, then we
call it anonymous. An anonymous function in R is created using the
following syntax.
#Syntax for writing an anonymous function
function(arglist) body
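As a minimal sketch of this syntax, the block below defines the same squaring function once with a name and twice anonymously. The names square, res_direct and res_vec are hypothetical and are used only for this illustration.

```r
# A known (named) function: the function object is bound to the name 'square'
square <- function(x) x^2

# The same function used anonymously: defined and called in one step
res_direct <- (function(x) x^2)(4)

# An anonymous function passed as an argument to sapply()
res_vec <- sapply(1:5, function(x) x^2)

res_direct  # 16
res_vec     # 1 4 9 16 25
```

Anonymous functions are convenient when a function is needed only once, for instance as an argument to another function.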
Note that a name is given to the function in such a manner that it
describes the purpose of creating it. In addition to this, the empty parentheses
after the keyword function show that the function does not have any
arguments; thus arglist is empty. The function body consists of only
a single statement, which is why the braces are skipped (we can skip the
braces if the function consists of only a single executable statement).
Next, we check the class and type of the user-defined function display
as follows:
#Checking class and type of user-defined function
> class(display)
[1] "function"
> typeof(display)
[1] "closure"
Hence, the class of a user-defined function is "function" and its type is
"closure". The word closure has its own importance, which will be
discussed with the function environment.
Note that the body and the list of formal arguments of a function (user-
defined and built-in) can be extracted using the body() and formals()
functions (available in the base package) as follows:
#Extracting body of the display() function
> body(display)
[1] "Functions in R"
#Extracting formal arguments of display() function
> formals(display)
NULL
Hence, the output verifies that the body of the display function consists of
only one statement and that the display() function does not have any formal
arguments. The third basic component of the function, i.e., the
function environment, will be discussed at the end of this unit.
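The three components can also be inspected for a function that does have arguments. The following is a minimal sketch, assuming a hypothetical function cv (coefficient of variation) that is not part of the course material.

```r
# Hypothetical function used only for illustration
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}

arg_names <- names(formals(cv))  # names of the formal arguments
fn_body   <- body(cv)            # unevaluated body of the function
fn_env    <- environment(cv)     # environment in which cv was created

arg_names        # "x" "na.rm"
formals(cv)$na.rm  # default value of the na.rm argument
fn_body
fn_env
```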
SAQ 1
Consider the following names for the user-defined functions. Write which
function names are appropriate and which are inappropriate ones with
reasons:
(i) function
(ii) mean
(iii) sum of squares
(iv) Est$x
(v) Varx
Before using any built-in function, it is advisable to view its help page and
see the purpose of the function, its arguments, arguments with defaults,
examples quoted at the end of the help page and other important details.
6.4 RETURN STATEMENT
In a call to a user-defined function, at most one return statement is executed, and
writing a return statement is optional. In R, return is itself a function, so the
parentheses around the returned expression are required.
The syntax for the return statement is as follows:
#Return statement
return(expression)
If the function does not have a return statement, then by default the value of the last
statement evaluated in the function body is returned. For illustration,
we now create a function which computes the following
expression.
$^{n}C_{x}\, p^{x} q^{n-x}$, where $q = 1 - p$ ...(6.1)
The first question which may come to your mind would be “what should
the function arguments be?”. You can observe that to evaluate the given
expression, we need the values of x, n and p. So, while creating a
user-defined function, we use them as function arguments. Only when
these values are supplied to the function will (6.1) be
computed. We now create two functions named Ex1 and Ex2, one
with a return statement and another without a return statement, as follows:
#Creating a function with a return statement
> Ex1 <- function(x, n, p){
+ y <- choose(n,x)*p^x*(1-p)^(n-x)
+ return(y)
+ }
It can be seen from the user-defined function that Ex1 is the name of the
function and x, n and p are its arguments. The function body consists of
two statements, of which the first computes the given
expression and assigns it to y, and the second is the return
statement, which returns the value of y to the function call (discussed in the next
section). The same task could have been done without a return statement
as follows:
#Creating a function without a return statement
> Ex2 <- function(x, n, p){
+ choose(n,x)*p^x*(1-p)^(n-x)
+ }
The Ex2 function is the second user-defined function, with arguments x, n
and p. The body of the function consists of only one statement and
there is no return statement. So by default, the function will return the
value of the last evaluated statement, choose(n,x)*p^x*(1-p)^(n-x).
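The two ways of returning a value can also be combined in one function. The following is a minimal sketch, assuming a hypothetical function safe_ratio that is not part of the course material: an explicit return() is used for an early exit, and otherwise the value of the last evaluated statement is returned.

```r
# Hypothetical function: ratio of two numbers with an early exit
safe_ratio <- function(a, b) {
  if (b == 0) {
    return(NA_real_)  # early exit: the remaining statements are skipped
  }
  a / b               # value of the last evaluated statement is returned
}

safe_ratio(10, 4)  # 2.5
safe_ratio(10, 0)  # NA
```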
Recall that in the last section, we have created two functions, namely Ex1
and Ex2 for computing the expression given in (6.1). Both the functions
were having 3 function arguments. Now, we create a function call
corresponding to these two functions as follows:
#Calling Ex1 function
> Ex1(2, 10, 0.5)
[1] 0.04394531
#Calling Ex2 function
> Ex2(2, 10, 0.5)
[1] 0.04394531
Observe that in the function call we have not used tags (argument
names); the actual values 2, 10 and 0.5 are assigned to the formal
arguments x, n and p by positionally matching the arguments (refer to
Section 6.6 for formal and actual arguments of functions). Due to
positional matching of the arguments, the first value 2 in the function call
is supplied to the first formal argument in the function definition, which is
x. Similarly, the other values are supplied according to the positions of
the arguments.
Note: The function call for Ex1 with tags will be like Ex1(x=2, n=10,
p=0.5). Also, Ex1 can be computed for different values of the x, n and p
arguments.
Now, we update the user-defined function Ex2 by adding more statements
to it, to check whether any argument of the function is missing or not using
the missing() function (available in the base package). We give the name
Ex2MArg to the updated function as follows:
#Checking for missing arguments
> Ex2MArg <- function(x, n, p){
+ cat("Is the value of x missing?", missing(x), "\n")
+ cat("Is the value of n missing?", missing(n), "\n")
+ cat("Is the value of p missing?", missing(p), "\n")
+ #choose(n,x)*p^x*(1-p)^(n-x)
+ }
The Ex2MArg function consists of three statements, which test for
missing arguments. The missing() function returns TRUE if the value of its argument
is missing in the evaluation frame of the function and FALSE if the value is
available. Moreover, as we are not interested in evaluating the expression
(6.1), we have prefixed it with ‘ # ’ (so that it is treated as a
comment and does not get evaluated).
#Creating a function call
> Ex2MArg(n=10, p=0.5)
Is the value of x missing? TRUE
Is the value of n missing? FALSE
Is the value of p missing? FALSE
From the output, it is clear that, when exact matching on tags (refer
Section 6.6 for argument matching) will be conducted, the value of x will
be missing and n and p will be available due to the function call
Ex2MArg(n=10, p=0.5). The same can be verified from the output.
The output confirms that the value of x argument is missing, as TRUE is
returned by the missing() function command. Also, the arguments n
and p are not missing, as FALSE is returned corresponding to these two
arguments.
Note that, the earlier defined two functions Ex1 and Ex2 are evaluating
the probability mass function (pmf) of the binomial distribution, which can
be computed using the built-in function dbinom() as well (refer Session
3 of MSTL-011 (Statistical Computing Using R-I) course for more detail).
The same can be verified by passing its arguments as x, size and prob
as follows:
#Computing the pmf of binomial distribution
> dbinom(x=2, size=10, prob=0.5)
[1] 0.04394531
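The agreement with dbinom() can be checked for all values of x at once, since the arithmetic in Ex1 is vectorised. The sketch below repeats the definition of Ex1 from this unit so that it runs on its own; the object names x_vals, pmf_user and pmf_builtin are chosen here for illustration.

```r
# Ex1 as defined in this unit, repeated so that the block is self-contained
Ex1 <- function(x, n, p) {
  y <- choose(n, x) * p^x * (1 - p)^(n - x)
  return(y)
}

x_vals      <- 0:10
pmf_user    <- Ex1(x_vals, n = 10, p = 0.5)
pmf_builtin <- dbinom(x_vals, size = 10, prob = 0.5)

all.equal(pmf_user, pmf_builtin)  # TRUE
sum(pmf_user)                     # the probabilities add up to 1
```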
SAQ 2
Consider the following code:
x <- list(runif(5), runif(10), runif(15));x
y <- (x[[1]]-mean(x[[1]]))/sd(x[[1]]);y
z <- (x[[2]]-mean(x[[2]]))/sd(x[[2]]);z
w <- (x[[3]]-mean(x[[3]]))/sd(x[[3]]);w
Create a function to do the same task and call it accordingly.
So, 3, 5 and 0.8 are the actual arguments. Also, the value 3 is supplied to
x, 5 to n and 0.8 to p. This process of calling a function by supplying
actual arguments is known as call-by-value.
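A consequence of call-by-value is that changes made to a formal argument inside a function do not affect the object supplied by the caller. The following is a minimal sketch, assuming a hypothetical function add_one and a hypothetical vector marks.

```r
# Hypothetical function that modifies its formal argument
add_one <- function(v) {
  v <- v + 1  # changes only the local copy inside the function
  v
}

marks   <- c(3, 5, 0.8)
updated <- add_one(marks)

marks    # unchanged: 3.0 5.0 0.8
updated  # 4.0 6.0 1.8
```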
SAQ 3
Consider the following code and write the formal and actual arguments:
Line <- function(ch, n)
{
for(i in 1:n) cat(ch)
cat("\n ")
}
Line("*", 50)
#Creating a function
> ArgMat1 <- function(x, y, z){
+ x <- x+1; y <- y+1; z <- z+1
+ x*y*z
+ }
Hence, the output shows that, the tags on actual and formal arguments
are matched when the function is called and the function is evaluated.
Also, it can be observed that if tags are used, then the position of the
arguments in function call does not matter.
Next, we call the ArgMat1 function without using tags. We shall show that
in this case of positional matching, the value appearing at the first place in
the function call is supplied to the first formal argument. Similarly,
the values appearing (actual arguments) at the second
and third places in the function call are supplied to the second and third
formal arguments in the function definition.
#Function call of ArgMat1 with positional matching arguments
> ArgMat1(2, 1, 4)
[1] 30
Also, note that if any argument that is used in the function body and has no
default value is left unmatched, you will get an
error message. The same can be verified from the following function call:
#An argument left unmatched leads to an error
> ArgMat1(1, 4)
Error in ArgMat1(1, 4) : argument "z" is missing, with no
default
Since, there are two supplied values for the z argument, that is why, we
get an error message. Hence, when there is more than one supplied value
for any of the actual argument an error occurs or vice-versa.
In the next example, we increase the number of formal arguments of
ArgMat1 by adding one more argument to the function, which is an
So, the only difference between ArgMat1 and ArgMat2 is of the 4th
formal argument w whose value depends on x, y and z arguments. So, to
make a function call of ArgMat2, it is enough to pass x, y and z
arguments only as follows:
#Function call of ArgMat2
> ArgMat2(x=2, y=1, z=4)
Is the value of w missing? TRUE
[1] 30
Note that the w argument is not used in the function body, so its default
expression is never evaluated: a function argument is evaluated only when
its value is required (which is known as lazy evaluation of the function argument).
Moreover, since no value was supplied for w in the function call,
missing(w) returns TRUE. For more clarification, let
us modify ArgMat2 and create a new function named ArgMat3 as follows:
#Creating a function
> ArgMat3 <- function(x, y, z, w=x+y+z){
+ x <- x+1; y <- y+1; z <- z+1
+ x*y*z*w
+ }
It can be seen that the only difference between ArgMat2 and ArgMat3 is
in the last statement. The last statement of ArgMat3 uses the value
of w, so w gets evaluated at that point, after x, y and z have each been
increased by 1. Hence w = 3+2+5 = 10 and the value of the product
x*y*z*w = 3*2*5*10 is returned as follows:
#Function call
> ArgMat3(x=2, y=1, z=4)
[1] 300
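The effect of lazy evaluation on a default argument can be isolated in a smaller example. The following is a minimal sketch, assuming two hypothetical functions lazy_demo and eager_demo; the second uses the force() function of the base package to evaluate the argument before x is changed.

```r
# Hypothetical function: the default of w is evaluated when w is first used
lazy_demo <- function(x, w = x * 10) {
  x <- x + 1  # x is updated before w is touched
  w           # the default x * 10 is evaluated here, with the updated x
}

# force() evaluates the argument immediately, before x is changed
eager_demo <- function(x, w = x * 10) {
  force(w)
  x <- x + 1
  w
}

lazy_demo(2)   # 30
eager_demo(2)  # 20
```

This is the same mechanism by which w equals 10 rather than 7 in the ArgMat3 call above.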
+ cat("\n English-", English, "\n Hindi-", Hindi,
+ "\n Mathematics-", Mathematics,"\n")
+ }
This function can simply be called using tags and without using tags, as
discussed earlier. Now, we call this function by using incomplete tags as
follows:
#Calling the StdData1 function with incomplete tags
> StdData1(Eng=100, Hin=50, Math=90)
English- 100
Hindi- 50
Mathematics- 90
> StdData1(Hin=50, Eng=100, Math=90)
English- 100
Hindi- 50
Mathematics- 90
Hence, the output verifies that incomplete tags can also be used to call a
function, but this should not be encouraged.
Next, to discuss partial matching, we consider the StdData2 function and
call it by matching exactly one argument completely as follows:
#Creating user-defined function StdData2
> StdData2 <- function(Management, Mathematics){
+ cat("\n Management-", Management, "\n Mathematics-",
Mathematics,"\n")
+ }
#Calling the StdData2 function by partial matching on a tag
> StdData2(Management=50, Ma=100)
Management- 50
Mathematics- 100
Hence, the output verifies that partial matching on tags took place. It will
be interesting to check how far this partial matching on tags is allowed.
Consider the following function call:
#Calling the StdData2 function
> StdData2(Man=50, Ma=100)
Error in StdData2(Man = 50, Ma = 100) : formal argument
"Management" matched by multiple actual arguments
You can see that an error occurs here. The reason behind the error is
that both actual tags, Man and Ma, partially match the formal argument
Management (Ma is also a prefix of Mathematics), so R cannot decide which
value to assign to it. Therefore, the error “formal argument "Management"
matched by multiple actual arguments” appears.
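The two calls can be reproduced without stopping the session by catching the error. The following is a minimal sketch, assuming a hypothetical function marks_fn that has the same formal arguments as StdData2 but returns its inputs instead of printing them; tryCatch() and conditionMessage() are functions of the base package.

```r
# Hypothetical function with the same formal arguments as StdData2
marks_fn <- function(Management, Mathematics) {
  c(Management = Management, Mathematics = Mathematics)
}

# Management is matched exactly first, so Ma can only match Mathematics
ok <- marks_fn(Management = 50, Ma = 100)
ok

# Both tags are partial here, so the matching is ambiguous and an error occurs;
# the error message is captured as a character string instead of stopping
msg <- tryCatch(
  marks_fn(Man = 50, Ma = 100),
  error = function(e) conditionMessage(e)
)
msg
```

Writing the tags in full avoids this kind of ambiguity altogether.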
In the next illustration, we use a very interesting object of R as a
function argument, which is dot-dot-dot (...). Recall that the ...
argument allows a function to take any number of arguments. To illustrate it, we
now create a function named Mn, which accepts any number of arguments
together with the n and m arguments as follows:
#Creating a function
> Mn <- function(..., n, m){
+ (sum(...)-n)/m
+ }
From this function definition, it is clear that the Mn function computes the
sum of its arguments except for the n and m arguments, then subtracts n
from the sum and thereafter divides the remaining value by m. Let us create
a function call for Mn by supplying 3 vector arguments in addition to n and
m as follows:
#Calling the Mn function
> x <- 1:10; y <- 11:20; z <- 21:30
> Mn(x, y, z, n=3, m=2)
[1] 231
#Verification of the obtained result
> Mn(1:30, n=3, m=2)
[1] 231
Hence, the output verifies that the n and m arguments were first matched by their tags and
then the remaining arguments were absorbed by the ... argument.
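Arguments that come after ... in the function definition have to be supplied with their tags, otherwise they are absorbed by ... as well. The following is a minimal sketch, assuming a hypothetical function count_args that collects the ... arguments into a list.

```r
# Hypothetical function: collects the ... arguments into a list
count_args <- function(..., n) {
  extras <- list(...)  # everything not matched to n
  c(n_extras = length(extras), n = n)
}

# n is supplied with its tag, the three vectors go to ...
res <- count_args(1:10, 11:20, 21:30, n = 3)
res

# Without the tag, 3 is absorbed by ... and n stays missing, so an error occurs;
# the error message is captured as a character string
err <- tryCatch(count_args(1:10, 3), error = function(e) conditionMessage(e))
err
```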
SAQ 4
Create a function which returns the minimum of each column of the data
frame argument.
6.8 RECURSION
In programming languages, recursion is a process in which a function calls
itself repeatedly, until some specified condition is satisfied. Whenever you
solve a problem using recursion, two conditions must be satisfied. First,
the function should call itself again and again and second, the recursive
function should include a stopping criterion (generally an if statement).
The absence of stopping criteria leads to an infinite recursion. For the
illustration purpose, now we create a recursive function named nthderiv
to compute the nth order derivative of a simple function using the D()
function (refer Session 2 of MSTL-011 for more detail).
#Creating a function to compute the nth order derivative
> nthderiv <- function(fx, ch, n){
+ y <- D(fx, ch)
+ n <- n-1
+ if(n>0){
+ nthderiv(y, ch, n)
+ } else {
+ return(y)}
+ }
Note: An R object of type "expression" is created using the
expression() function available in the base package. Expression
objects are unevaluated R statements. They can
be evaluated by using the eval() function (refer to Session 2 of MSTL-011
(Statistical Computing Using R-I) for more detail).
Next, we call (invoke) the created function by passing the three arguments
as follows:
The first argument as an expression object eobj.
The second argument as a character variable with respect to
which derivative is to be computed.
The last argument as 2, to compute 2nd order derivative.
#Invoking a function (creating a function call)
> nthderiv(eobj, "x", 2)
4 * (3 * x^2)
Hence, we get the required result, i.e., the second-order derivative of eobj
with respect to x is $4 \cdot 3x^{2} = 12x^{2}$.
Note: Recursion in R is not used frequently, due to the availability of a
number of built-in functions.
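The two conditions of recursion, a self-call and a stopping criterion, can be seen in a compact example. The following is a minimal sketch, assuming a hypothetical function fact that computes the factorial of a non-negative integer; the result is compared with the built-in factorial() function.

```r
# Hypothetical recursive function: factorial of a non-negative integer
fact <- function(n) {
  if (n <= 1) {
    return(1)      # stopping criterion
  }
  n * fact(n - 1)  # the function calls itself with a smaller input
}

fact(5)                  # 120
fact(5) == factorial(5)  # TRUE, agrees with the built-in function
```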
SAQ 5
Create a recursive function which will call itself n number of times to print
the following message by Dr. APJ Abdul Kalam ji.
"Dream Transform into Thoughts and Thoughts Result in
Action"
6.9 ENVIRONMENTS AND SCOPE
Whenever we start R and create some objects in R, by default they are
created in the global environment. So, the user’s workspace is the global
environment. The current environment in R can be seen using the
environment() function available in the base package. For the illustration
purpose, we now create three arbitrary R objects, namely, x, y and z.
Then we list all of them using ls() function and thereafter check in which
environment these objects are available using environment() function
in the following manner:
#Removing all the objects
> rm(list=ls())
#Creating R objects
> x <- matrix(1:4, 2, 2)
> y <- list(x, as.logical(x))
> z <- as.data.frame(x)
#Listing all objects present in the current R environment
> ls()
[1] "x" "y" "z"
#Checking the current R environment
> environment()
<environment: R_GlobalEnv>
Note that in the created function, there is only one unbound object, which
is p. Additionally, to get the evaluation environment, the environment()
function is called inside the function. Since p is not found in the
environment of the function, its value is searched for in the
Note: The search path for any R object can be seen using the search()
function available in the base package as follows:
#Looking for the search path
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
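The lookup of an unbound object can be sketched as follows. The names p, f and a are hypothetical and are used only for illustration; the function discussed in the text is not reproduced here.

```r
# Hypothetical names: p is defined outside the function,
# f is a function in which p is an unbound (free) object
p <- 10
f <- function(a) {
  env <- environment()             # evaluation environment of this call
  list(value = a + p, env = env)   # p is not local, so R looks it up outside
}

res <- f(5)
print(res$value)

# The parent of the evaluation environment is the environment
# in which f was created; that is where the search for p continues
print(identical(parent.env(res$env), environment(f)))
```

When this code is run at the R console, the environment in which f was created is the global environment, so p is found there.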
6.10 SUMMARY
The main points discussed in this unit are as follows:
We have discussed the general syntax of creating a user-defined
function;
The usage of function arguments and argument matching is
discussed;
The method of creating a function call by taking care of argument
matching is discussed in this unit;
Function environment and scope are also discussed;
Several user-defined functions are illustrated; and
The differences between built-in and user-defined functions are
explained.
6.11 TERMINAL QUESTIONS
1. Define built-in functions. Give an example of a built-in function.
2. Write the names of at least two built-in functions which are
available in the base package.
3. Differentiate between the user defined and built-in functions.
4. Create a function to check whether the given number is an even
number or an odd number.
5. Create a function to obtain Fibonacci series of size n.
6. If A is a square matrix of order 3x3 and I is an identity matrix of the
same order, then create a function named Mult which computes
the following expression for any arbitrary matrix A:
$A^3 + 3I$
7. Write any two advantages of user-defined functions.
6.12 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
UNIT 7
CONTROL-FLOW CONSTRUCTS OF R
Structure
7.1 Introduction
  Expected Learning Outcomes
7.2 Versions of if statements
  An if Statement
  An if-else Statement
  Nested if Statements
  The ifelse() Function
7.3 Loops
  The for Loop
  The while Loop
  The repeat Loop
  Nested Control-Flow Constructs
7.4 The next and break Statements
7.5 Summary
7.6 Terminal Questions
7.7 Solutions/Answers
7.1 INTRODUCTION
It is not always possible to write your entire code (to do a particular task) as a
single sequence of R statements. The problems encountered in real-life data
analysis are rarely that simple. Often we need to execute one set of statements
under a specific condition and a different set of statements under other
conditions. Additionally, we may encounter situations in which a program
or set of statements has to be executed or evaluated more than once. Such
situations can be handled smoothly and efficiently with the help of the
control-flow constructs of R. Moreover, in R the statements written inside
conditional statements and loops are executed in the same order in
which they appear, as in any other language.
The control-flow constructs determine the sequence, in which the statements
are executed in a code portion. Generally, the computations consist of
sequentially evaluating statements. Further, the statements are either
separated by a semicolon ( ; ) or a new line. Syntactically complete and
correct statements in R are evaluated when we press the enter key, i.e., when
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
the new line is encountered at the end of the statement (on the R console). For
error-free execution of a code portion, every statement should be
syntactically complete and correct.
In addition to this, note that in R more than one statement can be grouped
together using the braces ‘ { } ’ and the group of statements is referred to as
a block.
The if, for, while, repeat, break and next words used in conditional
statements and loops are reserved words. These are the basic control-flow
constructs of the R language.
Note: The if statements are also known as conditional execution statements
and the loops are also known as repeated execution statements.
write and use various types of loop functions, such as for, while and
repeat;
7.2 VERSIONS OF if STATEMENTS
In this section, we shall discuss different versions of if statements (also
known as conditional execution statements). In any conditional execution
statement, one out of two or more possible actions is carried out on the basis
of a test condition. In conditional execution, we first frame a test condition.
The test condition can be a value, variable, or an expression that yields TRUE
or FALSE or a numeric value. Note that, if a test condition is a numeric value,
then a non-zero value means TRUE and a zero value means FALSE.
Moreover, a test condition may include different types of operators. Recall
that different types of operators have already been discussed in Unit 2 of the
MST-015 (Introduction to R Software) course. So, before reading this unit, you
must refer to operators first and remember the precedence of different
operators used in the test conditions. By precedence we mean that
higher-precedence operators are evaluated before lower-precedence
operators. For a quick reference of operators see the following table:
Next, we shall show some examples of test conditions. Recall that a test
condition should always yield one value, either TRUE, FALSE or any numeric
value, which means the length of the test condition is only one. To make you
understand the framing of the test conditions, we now frame some test
conditions and see their interpretation as TRUE or FALSE.
Before framing the test conditions using different types of operators, it is
important to have some assigned variables, say, i and f as numeric
variables; and j as a character variable as follows:
#Assigning variables
> i <- 7; f <- 5; j <- "z"
Then some of the possible test conditions which can be framed using these
assigned variables could be as follows:
Test Condition    Interpretation    Numeric Value
(i >= 5) && (j == "z") TRUE 1
(i >= 5) && (j == "z") FALSE 0
(i >= 5) || (j == "z") TRUE 1
(f < 15) || (i >2) TRUE 1
(j != "x") && ((i+f )<12) FALSE 0
(i<2) && (j>3) || (j == "z") TRUE 1
On the similar lines using different operators and assigned variables, we can
frame any number of test conditions.
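Test conditions such as these can be checked directly in R. The sketch below evaluates the distinct conditions framed above with the same assigned values; the names c1 to c5 and res are hypothetical and are introduced only to store the results.

```r
# Assigned variables, as in the text
i <- 7; f <- 5; j <- "z"

# Hypothetical names c1..c5 store the result of each test condition
c1 <- (i >= 5) && (j == "z")
c2 <- (i >= 5) || (j == "z")
c3 <- (f < 15) || (i > 2)
c4 <- (j != "x") && ((i + f) < 12)         # i + f is 12, so 12 < 12 is FALSE
c5 <- (i < 2) && (j > 3) || (j == "z")     # && is evaluated before ||

res <- c(c1, c2, c3, c4, c5)
print(res)               # logical interpretation of each condition
print(as.numeric(res))   # numeric value: 1 for TRUE, 0 for FALSE
```

Each condition yields a single logical value, which is what an if statement expects.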
7.2.1 An if Statement
An if statement is generally used when we get the answer to a testing
condition (or question) in terms of yes (TRUE) and no (FALSE). The general
conditional structure involves the if-else statement. But often we may
encounter problems in which the else part is not required. This means a
number of statements (or a single statement) are to be executed if the test
condition is TRUE and the same statements will be skipped if the test condition
is FALSE.
The if statement begins with the reserved word if and consists of a test
condition, which is an expression written as TestCond within the parentheses
‘ ( ) ’ and which must result in TRUE or FALSE or a numeric value. After the
parentheses, the R statements which are to be executed are written in braces
‘ { } ’.
Note: We can get help on the if statement by writing the following command:
#Seeking help
> ?'if'
if (TestCond) expression
Note that, since the assigned value was positive, the test condition
x>0 yields TRUE. Due to this, the executable statements written in braces
are executed and the square root of x is computed as 3.162278. This
computed value is assigned to a variable y and is printed using the cat()
function. After the execution of the if statement, the sum of x and y is
computed and printed using x+y as 13.16228 (without any error message).
Note: The value of the y variable will be known if the square root of x is
computed successfully otherwise it will be unknown and an error message will
be produced.
Next, we assign x with a negative value, say -2 and run the same code again
as follows:
#Computing square root of a number
> x <- -2
> if(x>0){
+ y <- sqrt(x)
+ cat("Square root of x is ", y, "\n")
+ }
Observe from the output that, as the assigned value of x is negative, the
executable statements written in the braces of the if statement are
skipped and y is unknown. Therefore, the statement print(x+y) gives an
error message.
The same if statement can be written more concisely (without the braces),
by not assigning the square root to y as follows:
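A concise form along these lines might look like the following sketch. The value 10 for x is an assumption taken from the earlier illustration, in which the square root was computed as 3.162278.

```r
# Assumption: x is positive, as in the earlier illustration
x <- 10

# Single-statement if: no braces and no intermediate variable y
if (x > 0) cat("Square root of x is ", sqrt(x), "\n")
```

Because no variable y is created, a later statement that refers to y would still fail; the concise form is suitable only when the square root is needed for printing alone.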
In the next illustration, we shall show that a non-zero number depicts TRUE
and a zero number depicts FALSE. To do so, we first assign a variable x as
zero. Then overwrite it by incrementing it by 1 using an if statement and
finally print it as follows:
#When the test condition is numeric 0
> x <- 0 #assigning x
> if(x) x <- x+1 #incrementing x by 1
> print(x) #printing x
[1] 0
Note that, we get a 0 output because the test condition, which is a zero value,
yields FALSE, therefore the statement written after the if statement was
skipped. Next, we assign a non-zero value 3 to x and observe the output as
follows:
#When the test condition is a non-zero value
> x <- 3
> if(x) x <- x+1
> print(x)
[1] 4
Note that, as the test condition, which is a non-zero value, yields TRUE,
the if statement is executed and the value of x is overwritten by x+1.
Thus, the value of x is incremented by 1 and printed as 4.
7.2.2 An if-else Statement
In the previous subsection, you must have observed that an if statement does
nothing (or just skips the executable statements written in the braces of the if
statement) if the TestCond is FALSE. But we may encounter situations in
which we want to execute some statements (as an alternative action) if the
TestCond is FALSE. In such situations we use an if-else statement.
The general syntax of an if-else statement is as follows:
#An if-else statement
if (TestCond) {
True block executable statement(s)
} else {
False block executable statement(s)
}
The obtained output shows the decision as rejected, as the experience of the
candidate is less than 5 years and the credit card will be given if and only if
both the conditions are satisfied at a time or together.
In the next illustration, we would like to compute the tax, which varies
according to the type of the item. If considered item is an essential item, then
the person has to pay 5% of the price of the item as tax. Or otherwise, if the
item is a luxury item, then the person has to pay 18% of the price of the item
as tax. Now, we frame an if-else statement to compute the tax as follows:
#Framing an if-else statement
if(item == "Essential") {
tax = 0.05 * price
} else {
tax = 0.18 * price }
#Or otherwise
if(item == "Luxury") {
tax = 0.18 * price
} else {
tax = 0.05 * price }
The same if-else statement can also be written concisely as follows:
#Writing if-else statement more concisely
if(item=="Essential") tax = 0.05 * price else tax = 0.18 * price
#Or otherwise
if(item=="Luxury") tax = 0.18 * price else tax = 0.05 * price
Note: You can use any one out of the four written if-else statements to
compute the tax as all will give the same computed tax.
Next, we assign an item’s information arbitrarily and observe the output as
follows:
#Assigning information of an item
> item <- "Luxury"
> price <- 700000
#Computing the tax
> if (item == "Essential") {
+ tax = 0.05 * price
+ } else {
+ tax = 0.18 * price }
> print(tax)
[1] 126000 #Computed tax
Since the item is a luxury item, the calculated tax is 700000 x 0.18
= 126000, which is the same as the obtained result.
if(TestCond1){
-
if(TestCond2){
Executable statement(s) 1
} else {
Executable statement(s) 2
}
-
} else {
Executable statement(s) 3
}
$$y = \begin{cases} x - 2, & x \ge 4 \\ x + 4, & 2 < x < 4 \\ x^2, & x \le 2 \end{cases} \qquad \ldots(A)$$
One of the possible ways of solving given problem (A) is by writing a nested
if-else statement. But to run the if-else statement, we need an assigned
value of x in advance, say, x as 1.5. Then the problem can be solved using
the following code:
#Assigning x
> x <- 1.5
#Writing nested if-else statement
> if(x>2){
+ if(x<4)
+ y <- x+4
+ else
+ y <- x-2
+ } else
+ y <- x^2 #braces are skipped due to single statement
#Printing the output
> print(y)
[1] 2.25 #Output
Note: On the similar lines the false block of an if-else statement can be
nested with an if statement as well.
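To check every branch of the nested if-else statement, the same logic can be wrapped in a function and called with one value from each range. This is only a sketch; the function name piecewise_y and the test values 1.5, 3 and 5 are hypothetical choices.

```r
# Hypothetical helper wrapping the nested if-else statement for problem (A)
piecewise_y <- function(x) {
  if (x > 2) {
    if (x < 4) {
      y <- x + 4     # branch for 2 < x < 4
    } else {
      y <- x - 2     # branch for x >= 4
    }
  } else {
    y <- x^2         # branch for x <= 2
  }
  y
}

# One value from each range of x
vals <- sapply(c(1.5, 3, 5), piecewise_y)
print(vals)
```

The first value reproduces the result 2.25 obtained above for x as 1.5.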
Syntax 3: The false block of an if-else statement is nested with an if-
else statement.
#Nesting a false block of an if-else statement with an if-else
#statement
if(TestCond1){
Executable statement(s) 1
} else {
-
if(TestCond2){
Executable statement(s) 2
} else {
Executable statement(s) 3 }
-
}
Hence, the result verifies that we get the same result as earlier.
Syntax 4: Both the true and false blocks of an if-else statement are nested with
if-else statements.
#Nesting the true and false blocks of an if-else statement with
#if-else statements
if(TestCond1){
-
if(TestCond2){
Executable statement(s) 1
} else {
Executable statement(s) 2 }
-
} else {
-
if(TestCond3){
Executable statement(s) 3
} else {
Executable statement(s) 4 }
-
}
In this layout, we have only 4 if-else, just to make you understand the
execution process. If the TestCondi, where i=1,2,3,4 is TRUE then
executable statement i will be evaluated and other executable statements will
be skipped. If none of the four test conditions is TRUE (or if all the test
conditions are false) then the last executable statement 5 will be evaluated.
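The execution process described above, with four test conditions and a final else block, can be sketched with a hypothetical example that maps marks to a grade. The function name grade_of, the cut-off values and the grade labels are all assumptions made for illustration.

```r
# Hypothetical example: four test conditions followed by a final else
grade_of <- function(marks) {
  if (marks >= 80) {
    "A"          # executable statement 1
  } else if (marks >= 60) {
    "B"          # executable statement 2
  } else if (marks >= 50) {
    "C"          # executable statement 3
  } else if (marks >= 40) {
    "D"          # executable statement 4
  } else {
    "F"          # executable statement 5: all four test conditions are FALSE
  }
}

grades <- sapply(c(85, 65, 55, 45, 20), grade_of)
print(grades)
```

Only the statement belonging to the first TRUE test condition is evaluated; the remaining conditions are not tested at all.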
7.2.4 The ifelse() Function
Consider a situation in which the length of the result of a test condition is more
than one, which means, it returns a vector instead of a single value. For those
cases there is a vectorized version of the if-else statement, i.e., ifelse()
function available in the base package. The main arguments of interest of the
ifelse() function are as follows:
#The ifelse() function
ifelse(test, yes, no)
This function returns a vector of the same length as test, where test
represents the test condition. The yes and no are function arguments
representing the values or expressions to be used according to the test
condition. For each element of test that is TRUE, the corresponding element of
yes is returned; otherwise the corresponding element of no is returned.
For illustration, we consider a vector x with elements 1, 2, 3, 4, 5,
-1, -2 and -3. The problem is to compute the square root of the elements of x. The
test condition x>0 will be tested for each element of x. If the test
condition is TRUE (the value of the element of x is greater than 0), then x will
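As a minimal sketch of the vectorized behaviour of ifelse(), the vector x from the illustration can be processed as follows. Returning NA for the non-positive elements is an assumption made here; the value used in the original illustration is not shown in this section.

```r
# Vector from the illustration
x <- c(1, 2, 3, 4, 5, -1, -2, -3)

# Assumption: NA is returned where the test condition x > 0 is FALSE.
# sqrt(x) is computed for the whole vector before the selection is made,
# so the NaN warning for the negative entries is suppressed here.
y <- suppressWarnings(ifelse(x > 0, sqrt(x), NA))
print(y)
```

The result has the same length as the test condition, one element per element of x.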
SAQ 1
(i) Write a concise if statement to evaluate the following:
if (x == 999) {
z <- 2*x+sqrt(x)+x/2
cat("The computed value of z is ", z, "\n")
}
(ii) Write an if-else statement to compute the value of y, where
4x3 and increase x by
y 1,, if x 4
y 2
4(x 1) and decrease x by 1, if x 4.
7.3 LOOPS
In R programming the looping facility is provided by three types of loops,
namely, for, while and repeat. The for and while loops are entry-controlled
loops and repeat is an exit-controlled loop. Additionally, there are two flow
control statements, next and break, which facilitate additional control over
the evaluation of these loops.
Note: Loop functions belonging to the apply family provide implicit looping
and will be discussed in Unit 8 of the MST-015 course. Note that in R
looping may sometimes not be necessary, as many arithmetic functions are vectorized.
But looping may be necessary for those functions which are not vectorized.
In this syntax, an object can be either vector or a list. When a for loop runs,
it assigns all the items/elements present in the iterable object to name one-
by-one and executes the body of the for loop for each item.
Note that, in the for loop, the name variable used for assignment
is usually a new variable in the scope where the for statement is coded. The
value of the name variable can be changed inside the loop, but it will
automatically be set to the next item in the sequence when the control returns
to the first line of the loop again. It should be noted here that the name
variable remains in existence even after the loop is concluded. The last item of
the object will remain as the value of the name variable.
Note: In R programming the for loop is used much less than in
compiled languages. Further, code that takes a ‘whole object’ as an argument
is likely to be faster and clearer in the R programming language. Apply family
functions are great substitutes for the loops.
Now, we present an illustration in which, we shall write a for loop to print
numbers 1, 2, 3, 4, 5 as follows:
#Writing for loop to print the numbers 1 to 5
> for(i in 1:5){
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
On comparing the written loop with the general format of the for loop, we get
that i is representing name and 1:5 is a vector representing object. There is
only one executable statement in the block of the for loop, which is print(i).
During the execution of the for loop the i variable will be assigned each
element of the object 1:5 sequentially, i.e., one-by-one. Then the body of the
loop will be evaluated for each element of the object.
Note that, the variable name i will remain in existence even after the complete
execution of the for loop and its value is the last used value (last element of
object) in the loop. It can be verified by printing i, outside the loop (after the
loop ends) as follows:
#Checking the value of i after completing the for loop
> print(i)
[1] 5
Next, we present another illustration in which a string will be printed using a
for loop.
#Concatenating strings together using for loop
> for(x in c("Hello", "R", "Programming")){
+ cat(x, "\t")
+ if(x == "Programming") cat("\n")
+ }
Hello R Programming #Output
On comparing this loop with the general syntax, we get x as name and the
character vector c("Hello","R","Programming") as object. So, x will
be assigned the elements/items from the character vector one-by-one in
sequence. Then for each element printing will be performed using cat()
function (to concatenate the output). Also, the elements are separated due to
the tab character ‘ \t ’. Moreover, an if statement is nested in the for loop
to pass the control to the new line ‘ \n ’ at the encounter of the last element of
the object, i.e., "Programming". The step-by-step execution of the for loop
can be understood from the following table:
Step No. | Execution according to the steps involved in the loop | Changes in output
1 | Firstly, x is assigned as the first element of the character vector, i.e., "Hello". |
2 | Then the cat() function is executed to print "Hello" and a horizontal tab space is appended due to the tab character ‘ \t ’. | Hello
3 | Next, the if statement is tested and as x is not the same as "Programming", the test condition, i.e., x == "Programming" results FALSE. Accordingly, the cat("\n") statement is skipped. |
4 | After executing the entire for loop for the first item of the character vector, the control is again passed to the first line of the for loop and x is assigned as the second element of the character vector, i.e., "R". |
5 | Next, the cat() function is executed for the second time to print "R" in continuation (due to the tab character) to "Hello" and a horizontal tab space is again appended at the end of "R" due to the tab character ‘ \t ’. | Hello R
6 | Then the if statement is tested for the second item again and as x is not the same as "Programming", the test condition, i.e., x == "Programming" results FALSE. Accordingly, the cat("\n") is once more skipped. |
7 | After executing the entire for loop for the first two items of the character vector, the control is again passed to the first line of the for loop and x is assigned as the third (last) element of the character vector, i.e., "Programming". |
8 | The cat() function is executed to print "Programming" in continuation to "Hello R " and a horizontal tab space is appended. | Hello R Programming
9 | Lastly, the if statement is tested again and as x is the same as "Programming", the test condition, i.e., x == "Programming" results TRUE. Due to this, the cat("\n") statement is executed, control passes to the next line (or new line) and the for loop ends. |
For more clarification, we next create another for loop to compute the sum of
the squares of all the elements of a vector consisting of the elements 2, 4, 6 and 8.
#Compute 2^2 + 4^2 + 6^2 + 8^2 using for loop
> sum <- 0
> for(i in c(2,4,6,8)){
+ sum <- sum + i^2
+ }
> cat("2^2+4^2+6^2+8^2 =", sum, "\n")
2^2+4^2+6^2+8^2 = 120 #Output
The sum variable with initial value 0 before the for loop is used to save the
sum of the square of the elements of the vector. In addition to this, the step-
by-step execution of the for loop can be understood from the following table:
Step No. | Execution according to the steps involved in loop | Changes in output
1 | Firstly, sum is assigned as 0. Then the evaluation of the for loop starts and the i variable is assigned as 2. |
2 | Next, the statement sum <- sum+i^2 is evaluated with the initial value of sum as 0 and i as 2. Due to this statement sum is overwritten by the value 4 (computed from 0+2^2). | sum = 4
3 | As the loop consists of only a single statement, the control of the loop is again passed to the first line of the for loop and i is assigned as 4. |
4 | After that, the statement sum <- sum+i^2 is evaluated again with the updated value of sum as 4 and i as 4. Due to this statement sum is overwritten by the value 20 (computed from 4+4^2). | sum = 20
5 | The control of the loop is again passed to the first line of the for loop and i is assigned a value 6. |
6 | Thereafter, the statement sum <- sum+i^2 is evaluated for the 3rd time, with the updated value of sum as 20 and i as 6. Due to this statement sum is overwritten by the value 56 (computed from 20+6^2). | sum = 56
7 | Next, i is assigned as the last element of the numeric vector, i.e., 8. |
8 | Then the statement sum <- sum+i^2 is evaluated with the updated value of sum as 56 and i as 8. Due to this statement sum is overwritten by the value 120 (computed from 56+8^2) and the for loop ends. | sum = 120
9 | After the for loop ends, we run the printing statement and the final computed value of sum is printed due to the cat() function. | 2^2+4^2+6^2+8^2 = 120
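Since arithmetic operators in R are vectorized, the same sum of squares can also be obtained without an explicit loop. The following sketch compares the two approaches; the names v, total and total_vec are hypothetical, and total is used instead of sum so that the built-in sum() function is not masked.

```r
v <- c(2, 4, 6, 8)

# Loop version: accumulate the squares one element at a time
total <- 0
for (i in v) {
  total <- total + i^2
}

# Vectorized version: square the whole vector, then add the elements
total_vec <- sum(v^2)

print(c(total, total_vec))
```

Both approaches give 120; the vectorized form is shorter and is usually preferred in R when it is available.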
for(name1 in object1){
-
for(name2 in object2){
Executable statement(s)
}
-
}
When a nested for loop runs, it assigns the first element present in the
iterable object1 to name1, then control will pass to the nested for loop. The
nested for loop runs and assigns all the elements present in the iterable
object2 to name2 one-by-one and executes the nested for loop for each
item. Then the control again passes to the outer for loop, and it assigns the
second element present in the iterable object1 to name1, then control
once more passes to the nested for loop. Thereafter, for the second time the
nested for loop runs and assigns all the elements present in the iterable
object2 to name2 one-by-one and executes the nested for loop for each
item again. This process will continue until all the elements of object1
are assigned to name1.
For comparison purpose, we now again discuss the for loop. Also, recall that
the mean() function and the USArrests data frame were discussed in Unit 6
of the MST-015 course. We know that the USArrests data frame consists of four
columns, which can be verified using the length() function as follows:
length()
ength
#Computing the number of columns of the USArrests data frame
> length(USArrests)
[1] 4
Next, we illustrate the method of computing the mean of each column of the
USArrests data frame using the for loop by controlling only column
subscript (column indices) of a data frame as follows:
#Computing the column means
> for(i in 1:length(USArrests)){
+ cat(names(USArrests)[i],"=",mean(USArrests[ ,i]),"\n")
+ }
+ }
Murder = 7.788
Assault = 170.76
UrbanPop = 65.54
Rape = 21.232
Now, we explain its execution. In this for loop the i variable will take values
from 1:4 one-by-one and print the computed mean of the ith column of the
data frame. Also, observe that the printing of the computed column means
(with name of the columns in front) is done using the cat() function, which is
already discussed in Unit 4 of MST-015 course.
The computed results can be verified using the colMeans() function
discussed in the Unit 4 of MST-015 course. The colMeans() function gives
us an alternative way to compute a vector of column means of a data frame in
a single command. Note that, the column means can also be computed using
the apply family functions discussed in the Unit 8 of MST-015 course.
#Alternative approach
> colMeans(USArrests)
Murder Assault UrbanPop Rape
7.788 170.760 65.540 21.232
Thus from the computed outputs we conclude that using both the approaches,
(for loop and colMeans() function) we get the same result. But, note that
here single for loop was used as the control was put on column subscript
only. But if we want to put control on column and row subscripts together, in
that case we use the nested for loop.
For the illustration purpose we now extract the highlighted entries appearing in
the sixth and ninth rows of the Murder and Rape columns of the USArrests
data frame using a nested for loop and compute the following sum of the
extracted elements:
$$\sum_{i \in \{6,9\}} \sum_{j \in \{1,4\}} \binom{i}{j} x_{ij},$$
where x represents the USArrests data frame.
Since the highlighted entries are appearing in the 6th and 9th rows of the 1st
and 4th columns therefore the required sum can be easily done using nested
for loops as follows:
#Controlling two subscripts of a data frame using nested for
#loop
> sum <- 0
> for(i in c(6,9)){
+ for(j in c(1,4)){
+ sum = sum + choose(i,j)*USArrests[i,j] }
+ }
> cat("sum=", sum, "\n")
sum= 4785.9
Note that while is a reserved word and the executable statements written
inside the body of the while loop are executed repeatedly as long as the test
condition results in TRUE. This process stops when TestCond is FALSE.
For the illustration purpose, we now write a while loop to print the values
assigned to a variable i and then decrement its value by 2 each time the body of the
loop is executed as follows:
#Initialization of the iterable variable
> i <- 7
Note: Here, the body of the while loop will be executed 3 times.
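A while loop matching this description can be sketched as follows, using the test condition i>2 and a decrement of 2. The counter named count is a hypothetical addition, used only to confirm how many times the body is executed.

```r
i <- 7        # initialization of the iterable variable
count <- 0    # hypothetical counter of loop executions

# Entry-controlled loop: the condition is tested before every execution
while (i > 2) {
  print(i)            # prints 7, then 5, then 3
  i <- i - 2          # decrement by 2
  count <- count + 1
}

print(count)   # number of times the body was executed
print(i)       # value of i when the test condition became FALSE
```

The loop stops as soon as i becomes 1, because the test condition i>2 is then FALSE at the entry.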
The step-by-step execution of the illustrated while loop can be understood
from the following table:
Step No. | Execution according to the steps involved in loop | Changes in output
1 | Firstly, the variable i is assigned as 7. |
2 | Recall that the while loop is an entry-controlled loop. After the assignment of i, the first line of the while loop is executed and the test condition i>2 is tested at the entry. As the assigned value
7.3.4 The repeat Loop
The repeat loop is the third type of loop available in R. It causes the
repeated evaluation of the body of the repeat loop until a break is explicitly
requested. This loop should always be handled carefully, as there is a high
chance of creating an infinite loop due to the repeated execution of
the loop body. Generally, the body of the repeat loop consists of more than
two executable statements. The general syntax of the repeat loop is as
follows:
#The repeat loop
Note that the control-flow constructs break and next will be discussed in the
next section of this unit. For now, you can understand that in the repeat loop
the break statement is used to exit from the repeat loop.
Step No. | Execution according to the steps involved in loop | Changes in output
1 | Firstly, x is assigned as 10. |
2 | Then the first line of the repeat loop is executed. As there is no loop-entry test condition, the executable statements written inside the body of the repeat loop are executed for the first time. Since the first statement is a print statement, the value of x is printed as 10. | 10
3 | Then due to the decrement statement x <- x/2, x is overwritten by x/2 and becomes 5. |
4 | Next, the test condition of the if statement, i.e., x<1 is tested for the updated value of x as 5 and it results FALSE. Thus, the break statement is not executed and control will again pass to the repeat loop. |
5 | The body of the repeat loop is executed for the second time and the value of x is printed as 5 and x is overwritten by x/2 and becomes 2.5. | 5
6 | Next, the test condition of the if statement is tested for the updated value of x as 2.5 and results FALSE. So, the break statement is not executed this time as well and control will again pass to the repeat loop. |
7 | Thereafter, the body of the repeat loop is executed for the third time and the value of x is printed as 2.5. Also, x is overwritten by x/2 and becomes 1.25. | 2.5
8 | The test condition of the if statement is tested for the latest value of x as 1.25 and results FALSE. So, the break statement is again not executed and control will again pass to the repeat loop. |
9 | Finally, the body of the repeat loop is executed for the fourth time and the value of x is printed as 1.25. Also, x is overwritten by x/2 and becomes 0.625. | 1.25
Note that, in the given loop the i variable can take values 1, 2 and 3 and the j
variable can take values 1, 2, 3, 4, 5. The illustrated loop can be understood
from the following table:
Step No. | Execution according to the steps involved in loop | Changes in output
1 | The first line of the outer for loop is executed first and i is assigned as 1. |
2 | Then control is passed to the nested for loop and j is assigned the values from 1 to 5, one-by-one. For each pair of values (1,1), (1,2), (1,3), (1,4) and (1,5) the test condition of the if statement is tested and results as follows: |
i j TRUE/FALSE
1 1 FALSE
1 2 FALSE
1 3 FALSE
1 4 FALSE
1 5 FALSE
Functions, Conditional Statements, Loops and Descriptive Statistics with R
8 i j TRUE/FALSE
3 1 FALSE
3 2 FALSE
3 3 FALSE
3 4 FALSE
3 5 FALSE
SAQ 2
Write a loop to compute the product of the squares of the following terms:
x^2, x = 2, 4, 6 and 8. Also, write the step-by-step execution of the loop.
SAQ 3
State whether the following statements are TRUE or FALSE:
(i) for and if are the simple words in R.
(ii) break is an entry-controlled loop.
(iii) Nesting can be done in both the TRUE and FALSE blocks of the if-
statement.
7.5 SUMMARY
The main points discussed in this unit are as follows:
Firstly, in this unit we have discussed various types of if statements,
such as an if statement without an else part, an if-else statement,
nested if statements and multiple if statements. Focus is given to the
clarity of the concept of ‘how an if statement is selected on the basis
of the given problem’.
Different types of loops such as for, while, repeat and their nesting
are discussed in this unit. The differences between them and their usage
are discussed using the general syntax and suitable examples.
Lastly, the way of imposing additional control on control structures
using the break and next statements is explained.
7.7 Solutions/Answers
Self-Assessment Questions (SAQs)
1. (i) The given code can be written concisely as follows:
if (x == 999) cat("The computed value of the z is ",
2*x+sqrt(x)+x/2, "\n")
(ii) The value of y for a given value of x using the given conditions can
be computed using the if-else statement. We first need to assign the
value of x, then y can be computed using the following:
if (x <= 4){
y <- 4 * x^3
x <- x+1
} else {
y <- 4 * (x-1)^2
x <- x-1
}
The computed result can be printed using the cat() function in the
following manner:
cat("x =", x, "\t", "y=", y, "\n")
2. The product of the squares of the vector elements 2, 4, 6 and 8 can be
computed using any one of the three loops. Here, we consider the for loop to
evaluate it. The product using the for loop can be computed using the
following code:
prod <- 1
for(k in c(2,4,6,8)){
prod <- prod * k^2
}
cat("Product =", prod, "\n")
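As a cross-check on the loop above, the same product can also be obtained without an explicit loop. The sketch below is not part of the original solution; the names terms, loop_result and vector_result are chosen only for this illustration.

```r
terms <- c(2, 4, 6, 8)

# Loop version, as in the solution (with a different variable name)
loop_result <- 1
for (k in terms) {
  loop_result <- loop_result * k^2
}

# Vectorised cross-check using the built-in prod() function
vector_result <- prod(terms^2)

cat("Loop:", loop_result, " Vectorised:", vector_result, "\n")
```

Both approaches give 147456, i.e., 4 * 16 * 36 * 64.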
The step wise execution can be understood from the following table:
Step No. | Execution according to the steps involved in loop | Changes in the output
1 | Firstly, prod is assigned a value 1 and k is assigned a value 2. |
2 | Then, the statement prod <- prod * k^2 is executed or evaluated with the initial value of prod as 1 and k as 2. Due to this statement prod is overwritten by the value 4 (computed from 1 * 2^2). | prod = 4
3 | As the loop consists of a single statement only, the control of the loop is again passed to the for loop and k is assigned a value 4. |
4 | Then, the statement prod <- prod * k^2 is executed with the initial value of prod as 4 and k as 4. Due to this statement prod is overwritten by the value 64 (computed from 4 * 4^2). | prod = 64
APPLY FAMILY IN R
Structure
8.1 Introduction
Expected Learning Outcomes
8.2 The lapply() Function
8.3 The sapply() Function
8.4 The apply() Function
8.5 The tapply() Function
8.6 The mapply() Function
8.7 Summary
8.8 Terminal Questions
8.9 Solutions/Answers
8.1 INTRODUCTION
The apply family functions in R are very powerful because they allow us to
conduct a series of operations on data using a condensed form. These loop
functions can sometimes be used as an alternative to the control-flow
constructs for, while and repeat, discussed in Unit 7 of MST-015
(Introduction to R Software). A major advantage of these functions is that we
don’t have to write multiple R statements to do a particular task. A call to
one of these functions is generally a one-line R statement. These loop
functions save the time of the user and show an efficient way to code.
The apply family functions come as a part of the base package. Note that
whenever a predefined function is supplied to the FUN argument of the
apply family functions, only the name of the function is used. If
the function needs other arguments, then they are supplied as
additional arguments to the apply family function.
In this unit we shall discuss the lapply(), sapply(), apply(), tapply()
and mapply() functions of the apply family.
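The point about passing only the function name to FUN, and forwarding its other arguments, can be sketched as follows. The list scores is hypothetical and is used only for this illustration.

```r
# Hypothetical list used only for this illustration
scores <- list(a = c(4, 8, NA), b = c(1, 2, 3, 4))

# FUN is given by name only; na.rm is forwarded to mean() by lapply()
res <- lapply(X = scores, FUN = mean, na.rm = TRUE)
print(res)
```

Here mean is written without parentheses, and na.rm=TRUE travels through lapply() to each call of mean().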
Expected Learning Outcomes
After studying this unit, you should be able to:
learn and use lapply() function;
learn and use sapply() function;
learn and use apply() function;
learn and use tapply() function; and
learn and use mapply() function.
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
$y
[,1] [,2]
[1,] 10 5
[2,] -2 1
After creating a list, we next use the min() function to compute the minimum
of each component of Lst1. To do so, we assign the X argument of the
lapply() function as the list Lst1 and the FUN argument of the function as
the min() function in the following manner:
#Computing minimum of each component of a list
> lapply(X=Lst1, FUN=min)
$x
[1] 1
$y
[1] -2
Observe that, when the min() function is applied on the first component of
the list, i.e., x, it returns the minimum of the vector elements, which is 1 and
when it is applied on second component of the list, i.e., y, it returns the
minimum of the elements of the matrix y, which is -2.
Next, we assign other functions such as sum() and as.logical() to the
FUN argument and observe the obtained outputs as follows:
#Computing sum of the elements of each component of a list
> lapply(X=Lst1, FUN=sum)
$x
[1] 325
$y
[1] 14
You can easily verify that the sum of the elements of the vector x is 325 and
the sum of the elements of the matrix y is 14.
#Coercing each component of a list to logical object
> lapply(X=Lst1, FUN=as.logical)
$x
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[23] TRUE TRUE TRUE
$y
[1] TRUE TRUE TRUE TRUE
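The statement that creates Lst1 is not visible in this part of the text. The sketch below assumes one definition that is consistent with the printed results (a vector x of 25 elements with minimum 1 and sum 325, and the 2 x 2 matrix y shown above); the actual definition used in the unit may differ.

```r
# Assumed reconstruction of Lst1, chosen to match the printed results
Lst1 <- list(x = 1:25,
             y = matrix(c(10, -2, 5, 1), nrow = 2))

mins <- lapply(X = Lst1, FUN = min)   # minimum of each component
sums <- lapply(X = Lst1, FUN = sum)   # sum of each component
print(mins)
print(sums)
```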
[[2]]
y1 y2
1 -2 4
2 1 9
3 3 6
Next, we create a function named Ext to extract 2nd and 3rd rows of a data
frame as follows:
#Creating a function to extract 2nd and 3rd rows of a data frame
> Ext <- function(x) x[c(2,3),] #Function definition
[[2]]
y1 y2
2 1 9
3 3 6
An alternative way of doing the above task, without separately defining a
user-defined function, is as follows:
#Using user-defined function anonymously
> lapply(X=Lst2, FUN=function(x) x[c(2,3),])
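That the named and the anonymous forms give the same result can be checked with a small sketch. The list below is a hypothetical stand-in for Lst2: its second data frame follows the values printed above, while the first one is invented for this illustration.

```r
# Hypothetical list of two small data frames (stand-in for Lst2)
Lst2 <- list(data.frame(x1 = 1:3, x2 = c(5, 7, 9)),
             data.frame(y1 = c(-2, 1, 3), y2 = c(4, 9, 6)))

Ext <- function(x) x[c(2, 3), ]                              # named function
named <- lapply(X = Lst2, FUN = Ext)
anon  <- lapply(X = Lst2, FUN = function(x) x[c(2, 3), ])    # anonymous form

print(identical(named, anon))
```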
[[2]]
[1] 0
[[3]]
[1] 1
[[4]]
[1] 1.414214
For the next illustration, we consider the built-in painters data available in
the MASS package. Let us first take help on the data as follows:
#Loading the package
> library(MASS)
#Seeking help
> ?painters
starting httpd help server ... done
The following R Documentation page will pop up when we seek help on the
painters data.
Note: The information on the painters data and about its columns can be
read from the R Documentation page.
Next, we shall display some of the rows of the painters data frame but
before that we now see the internal structure of the data frame as follows:
#Internal structure of the painters data frame
> str(painters)
'data.frame': 54 obs. of 5 variables:
$ Composition: int 10 15 8 12 0 15 8 15 4 17 ...
$ Drawing : int 8 16 13 16 15 16 17 16 12 18 ...
$ Colour : int 16 4 16 9 8 4 4 7 10 12 ...
From the obtained output it is clear that its first four variables, namely,
Composition, Drawing, Colour and Expression are of integer type and
the last variable School is of factor type. Next, we display first few rows of the
data frame in the following screenshot:
Now we illustrate the method of computing the 25%, 50% and 75% quantiles
of the first four columns of the painters data, using the quantile()
function (the quantile() function is used to compute the quantiles of a
data).
To compute the quantiles of the first four columns of the painters data
frame, we assign the X argument of the function as painters[,1:4], the
FUN argument as quantile() function. Note that quantile() function has
a probs argument, which is used to assign the probabilities. So to compute
the quantiles, we supply this argument as an additional argument to the
lapply() function as follows:
#Computing the quantiles of the first four columns of painters
#data frame using lapply() function
> lapply(X=painters[,1:4], FUN=quantile, probs=c(0.25, 0.50,
0.75))
$Composition
25% 50% 75%
8.25 12.50 15.00
$Drawing
25% 50% 75%
10.0 13.5 15.0
$Colour
25% 50% 75%
7.25 10.00 16.00
$Expression
25% 50% 75%
4.0 6.0 11.5
From the obtained output, note that the quantiles of each column of the
painters data frame are computed in a single line command.
Apply Family In R
$B
Composition Drawing Colour Expression School
F. Zucarro 10 13 8 8 B
Fr. Salviata 13 15 8 8 B
Parmigiano 10 15 6 6 B
...
$C
Composition Drawing Colour Expression School
Barocci 14 15 6 10 C
Cortona 16 14 12 6 C
Josepin 10 10 6 2 C
...
$D
Composition Drawing Colour Expression School
Bassano 6 8 17 0 D
Bellini 4 6 14 0 D
Giorgione 8 9 18 4 D
...
$E
Composition Drawing Colour Expression School
Albani 14 14 10 6 E
Caravaggio 6 6 16 0 E
Corregio 13 13 15 12 E
...
$F
Composition Drawing Colour Expression School
Durer 8 10 10 8 F
Holbein 9 10 16 13 F
Pourbus 4 15 6 6 F
...
$G
Composition Drawing Colour Expression School
Diepenbeck 11 10 14 6 G
J. Jordaens 10 8 16 6 G
Otho Venius 13 14 10 10 G
...
$H
Composition Drawing Colour Expression School
Bourdon 10 8 8 4 H
Le Brun 16 16 8 16 H
Le Suer 15 15 4 15 H
...
$H
Composition Drawing Colour Expression
14.0 14.0 6.5 12.5
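SAQ 1
Write R code to create a list with two matrix components A and B, where
$$A = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 4 & 3 \\ 1 & -2 & 2 \\ 8 & 6 & 5 \end{pmatrix}.$$
Also, compute the transpose of both the matrices in a single line command.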
You can compare the obtained output with previously obtained output (in
Section 8.2) and observe, why we use the term simplified output.
Also, it can be verified that the assignment simplify=FALSE will give the
same output as obtained by using the lapply() function in Section 8.2.
#An alternative to the lapply() function
> sapply(X=painters[,1:4], FUN=quantile, probs=c(0.25, 0.50,
0.75), simplify=FALSE)
In the next illustration we shall use the sapply() function (to get simplified
output) together with the split() function to compute the column means of
the first four columns of the painters data frame according to the groups
defined by the factor variable School as follows:
#Loading the MASS package
> require(MASS)
#Splitting the painters data frame according to School variable
> split.data <- split(painters, painters$School)
#Creating a function to compute column means
> Fun1 <- function(x) colMeans(x[,1:4])
Compare the obtained output with the output obtained in previous section and
observe the difference.
Next, we use the testing function is.matrix() (discussed in the Unit 3 of
MST-015 course), to verify whether the obtained output is a matrix object or
not as follows:
#Testing for matrix object
> is.matrix(sapply(X=split.data, FUN=Fun1, simplify=TRUE))
[1] TRUE
Since the obtained output is a matrix object, we can use matrix functions on
it. Let us compute the transpose of the obtained matrix using the t() function
discussed in Unit 2 of the MST-015 course as follows:
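The split-then-sapply pattern described above can be sketched on a data set that ships with base R. The built-in iris data is used here as a stand-in for painters (an assumption made only so that the sketch runs without the MASS package), and the names split_iris, col_means and m are chosen for this illustration.

```r
# iris is used here as a stand-in data set (an assumption for illustration)
split_iris <- split(iris, iris$Species)          # one data frame per species
col_means  <- function(x) colMeans(x[, 1:4])     # means of the numeric columns

m <- sapply(X = split_iris, FUN = col_means)     # simplified to a matrix
print(is.matrix(m))
print(t(m))                                      # groups as rows after transposing
```

Each column of m holds the four column means of one group, so transposing with t() places the groups in rows.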
SAQ 2
Consider the admission data given in the Unit 4 of MST-015 course and write
code to compute the average percentage score of the students according to
the Gender variable of the data.
8.4 The apply() Function
The apply() function is mainly used on array or matrix objects (it also
accepts data frames). The output of the apply() function is either a vector,
an array or a list object. This function is mainly used to apply a function
assigned to the FUN argument on the margins (specified by the MARGIN
argument) of the array or matrix object, X. The mainly used arguments of the
apply() function are as follows:
#The apply() function
apply (X, #matrix or array or data frame
MARGIN, #an integer vector specifying the margins
FUN, #function to be applied on the margins of X
simplify, #whether to simplify the result or not
...) #Other arguments of FUN and apply functions
Note: The MARGIN argument of the apply() function can take values 1, 2 or
c(1,2). The value 1 is used to indicate rows, value 2 is used to indicate
columns and c(1,2) indicates both rows and columns. The margin c(1,2)
is mainly used if the X argument is an array.
Next, we illustrate its execution. To do so, we consider a matrix with the
following elements.
2 4 1
0 3 2
8 1 2
Next, we use the apply() function to compute the minimum of each row and
maximum of each column of the matrix. To do so, we assign the X argument of
the function as the matrix, the MARGIN argument as 1 (for rows) and 2 (for
columns). Also, we assign the FUN argument as min() (for minimum) and
max() (for maximum) as follows:
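The two calls described above can be sketched as follows. The name M is chosen for this illustration, and the matrix is assumed to be filled row by row exactly as displayed in the text.

```r
# Matrix entered row by row, as displayed in the text
M <- matrix(c(2, 4, 1,
              0, 3, 2,
              8, 1, 2), nrow = 3, byrow = TRUE)

row_min <- apply(X = M, MARGIN = 1, FUN = min)   # minimum of each row
col_max <- apply(X = M, MARGIN = 2, FUN = max)   # maximum of each column
print(row_min)   # 1 0 1
print(col_max)   # 8 4 2
```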
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
, , 3
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18
Now, we compute the sum of each row and column of Arr as follows:
#Computing the sum of each row
> apply(Arr, 1, function(x) sum(x))
[1] 51 57 63
From the obtained output, observe that 51 is the sum of the first rows of the 3
matrices of Arr, similarly, remaining output can be understood.
Next, we compute the index/position wise sum of the elements of the three
matrices of Arr. To do so, we assign the MARGIN argument as 1:2 in the
following manner:
From the obtained output, observe that 21 is the sum of the elements present
at the first row and first column in each of the 3 matrices of the Arr, similarly,
other terms of the output can be understood.
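The three margin choices discussed for Arr can be sketched together. The definition of Arr below is an assumption: a 3 x 2 x 3 array filled with the integers 1 to 18, which matches the slices printed above.

```r
# Arr is assumed to be a 3 x 2 x 3 array filled with 1 to 18
Arr <- array(1:18, dim = c(3, 2, 3))

row_sums  <- apply(Arr, 1, sum)     # one sum per row index
col_sums  <- apply(Arr, 2, sum)     # one sum per column index
cell_sums <- apply(Arr, 1:2, sum)   # position-wise sums over the three matrices
print(row_sums)    # 51 57 63
print(col_sums)    # 72 99
print(cell_sums)
```

With MARGIN=1:2 the result keeps the 3 x 2 shape of a single slice, and its first entry is 1 + 7 + 13 = 21.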
Moreover, as discussed earlier recall that the apply() function also accepts
a data frame argument. So, we next assign a data frame argument to the X
argument of the apply() function. For the illustration purpose, we consider
the subpart (consisting of first 20 rows of the data) of the built-in data set
airquality available in datasets package. But first we seek help on the
data set as follows:
#Seeking help on the data set
> ?airquality
starting httpd help server ... done
You can see the details on the airquality data frame and about its
columns from the R documentation page. For the sake of convenience, before
using the apply() function on the subpart of the data set, we assign it to an
object named SubAir as follows:
#Extracting and assigning first 20 rows of airquality data
> SubAir <- airquality[1:20,];SubAir
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
Next, we use the summary() function to compute the summary (with missing
values) of the variables of the SubAir data as follows:
#Computing the summary of the variables of the SubAir data
> summary(SubAir)
Ozone Solar.R Wind Temp
Min. : 6.00 Min. : 19.0 Min. : 6.90 Min. :56.00
1st Qu.:11.25 1st Qu.: 99.0 1st Qu.: 9.05 1st Qu.:61.75
Median :17.00 Median :194.0 Median :11.50 Median :66.00
Mean :19.22 Mean :197.1 Mean :11.64 Mean :65.15
3rd Qu.:26.75 3rd Qu.:299.0 3rd Qu.:13.35 3rd Qu.:68.25
Max. :41.00 Max. :334.0 Max. :20.10 Max. :74.00
NA's :2 NA's :3
Month Day
Min. :5 Min. : 1.00
1st Qu.:5 1st Qu.: 5.75
Median :5 Median :10.50
Mean :5 Mean :10.50
3rd Qu.:5 3rd Qu.:15.25
Max. :5 Max. :20.00
From the summary() function output, it can be seen that the Ozone and
Solar.R variables have 2 and 3 NA’s, respectively. Therefore, before
using any function on the columns or rows of the SubAir data, it is necessary
to remove the NA’s (otherwise we will get NA output).
The na.omit() function available in the stats package can be used
efficiently to remove all the rows of the data frame SubAir that contain
NA’s. Additionally, after the removal of the NA’s, we assign the updated data
to SubAirUp as follows:
#Removing NA’s in a single command
> SubAirUp <- na.omit(SubAir);SubAirUp
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
On comparing the SubAirUp and SubAir data frames, you will find that
the row numbers 5, 6, 10 and 11 consisting of NA values are now
removed. So, in total four rows have been removed by the na.omit()
function. The same can be verified using the nrow() function as follows:
#Difference between the number of rows of SubAir and SubAirUp
> nrow(SubAir)-nrow(SubAirUp)
[1] 4
Hence, the obtained output confirms that four rows have been removed by the
na.omit() function. After removing the missing values, we now compute the
row means and column sums from the SubAirUp data frame using the
apply() function. To compute the row means, we assign the X argument of
the apply() function as SubAirUp, the MARGIN argument as 1 (for rows)
and the FUN argument as mean in the following manner:
#Computing the row means after omitting rows consisting NA’s
> apply(X=SubAirUp, MARGIN=1, FUN=mean) #Or rowMeans(SubAirUp)
1 2 3 4 7 8
51.90000 40.16667 42.60000 68.91667 67.93333 33.96667
9 12 13 14 15 16
20.35000 61.28333 65.70000 64.31667 29.03333 74.08333
17 18 19 20
73.50000 30.40000 75.91667 25.28333
If you do not want to use the na.omit() function, then alternatively you can
ignore the NA elements of the SubAir data frame while the computations are
carried out, by using the na.rm argument of the mean() function (discussed in
Unit 2 of MST-015). To do so, we supply the na.rm argument as an additional
argument to the apply() function as follows:
#Computing the row means by removing NA’s elements
> apply(X=SubAir, MARGIN=1, FUN=mean, na.rm=TRUE)
1 2 3 4 5 6
51.90000 40.16667 42.60000 68.91667 20.07500 23.98000
7 8 9 10 11 12
67.93333 33.96667 20.35000 57.32000 20.78000 61.28333
13 14 15 16 17 18
65.70000 64.31667 29.03333 74.08333 73.50000 30.40000
19 20
75.91667 25.28333
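The two ways of handling NA’s discussed above, and the column sums mentioned earlier, can be put side by side in one sketch. It uses the built-in airquality data exactly as in the text; the names dropped, row_means, col_sums and row_means_all are chosen only for this illustration.

```r
SubAir   <- airquality[1:20, ]     # first 20 rows of the built-in data set
SubAirUp <- na.omit(SubAir)        # drop every row that contains an NA

# Row labels dropped by na.omit()
dropped <- setdiff(rownames(SubAir), rownames(SubAirUp))
print(dropped)

row_means <- apply(X = SubAirUp, MARGIN = 1, FUN = mean)  # same as rowMeans()
col_sums  <- apply(X = SubAirUp, MARGIN = 2, FUN = sum)   # same as colSums()
print(col_sums)

# na.rm = TRUE is forwarded to mean(), so no row has to be dropped
row_means_all <- apply(X = SubAir, MARGIN = 1, FUN = mean, na.rm = TRUE)
print(row_means_all)
```

The first approach returns 16 row means, while the second keeps all 20 rows and averages only the available values in each row.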
Next, we assign the quantile() function to the FUN argument to compute
the quantiles of the columns of the SubAirUp data as follows:
#Computing the quantiles of the columns of SubAirUp
> apply(SubAirUp, 2, quantile)
Ozone Solar.R Wind Temp Month Day
0% 6.00 19.00 7.400 57.00 5 1.00
25% 11.75 93.75 9.575 61.75 5 6.25
50% 17.00 223.00 11.500 65.50 5 12.50
75% 24.75 301.00 12.750 68.00 5 16.25
100% 41.00 334.00 20.100 74.00 5 20.00
SAQ 3
Write R code to compute the row sums, column means, minimum and
maximum of the rows of SubAirUp data frame (shown in Section 8.4).
For more clarification, we now present another illustration in which we shall get
the subsets of each column of a data frame one-by-one according to a factor
variable and apply the mean() function on each subset using the tapply()
function. Consider again the painters data frame discussed in the beginning of
this unit. We now use the tapply() function on the first four variables
of the painters data frame using a for loop as follows:
#Loading the package
> library(MASS)
Observe that, while printing, we have rounded the output to 2 decimal places
(for the sake of convenience).
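The pattern of calling tapply() inside a for loop, once per column, can be sketched on a base R data set. The built-in airquality data grouped by Month is used here in place of painters grouped by School (an assumption made so that the sketch runs without the MASS package); the names group and means_by_month are chosen for this illustration.

```r
# airquality grouped by Month stands in for painters grouped by School
group <- factor(airquality$Month)
means_by_month <- list()

for (nm in names(airquality)[1:4]) {
  # group-wise mean of one column; na.rm is forwarded to mean()
  means_by_month[[nm]] <- tapply(airquality[[nm]], group, mean, na.rm = TRUE)
  cat(nm, ":\n")
  print(round(means_by_month[[nm]], 2))   # rounded to 2 decimal places
}
```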
SAQ 4
Write R code to create a data frame named warp_breaks consisting of the
subpart of a built-in data set warpbreaks{datasets} of R, where the wool
and tension columns are of factor type and the breaks variable is of
numeric type:
breaks wool tension
26 A L
30 A L
54 A L
25 A L
70 A L
52 B M
51 B M
26 B M
67 B M
18 B M
Write R code, using the tapply() function, to compute the wool-wise and
tension-wise minimum of the data.
Note that the mapply() function runs the function assigned to FUN argument
by taking values from the supporting arguments consecutively.
#The mapply() function
mapply (FUN, #function to be applied
MoreArgs, #list of other arguments of FUN
SIMPLIFY, #TRUE by default. It indicates whether to
#simplify the result or not
...) #Other arguments of FUN and mapply functions
Suppose we want to generate five sequences starting from 1 and ending at 10, but
with jumps of 1, 2, 3, 4 and 5, respectively. This can be done by using
five seq() function commands as follows:
#Generating five sequences using seq() function commands
> seq(from=1, to=10, by=1) #Jump=1
[1] 1 2 3 4 5 6 7 8 9 10
The mapply() loop function is a very efficient loop function, which will
generate all these 5 sequences in a single command. To do so, we assign the
FUN argument as seq() function and its supporting arguments, namely,
from as 1, to as 10 and the by argument as 1 to 5 in the following manner:
#Generating five sequences using mapply() function
> mapply(seq, from=1, to=10, by=1:5, SIMPLIFY=FALSE)
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] 1 3 5 7 9
[[3]]
[1] 1 4 7 10
[[4]]
[1] 1 5 9
[[5]]
[1] 1 6
Observe that the values of the from and to arguments of the seq() function
are fixed as 1 and 10 (so they recycled themselves according to the length of
> seq(from=3, to=30, by=2)
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29
> seq(from=6, to=60, by=3)
[1] 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60
The same task can be done very efficiently using the mapply() function as
follows:
#Generating sequences using mapply() function
> mapply(seq, from=c(1,3,6,9,12), to=c(10,30,60,90,120), by=1:5,
SIMPLIFY=FALSE)
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29
[[3]]
[1] 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60
[[4]]
[1] 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81
[20] 85 89
[[5]]
[1] 12 17 22 27 32 37 42 47 52 57 62 67 72 77
[15] 82 87 92 97 102 107 112 117
[[2]]
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[[3]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
[7,] 13 14
[8,] 15 16
Hence, we get 3 matrices of orders 2x8, 4x4 and 8x2 using the mapply()
function. Also, in all the above mapply() function statements, we have
assigned SIMPLIFY as FALSE to get each output separately.
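The statement that produced the three matrices is not visible in this part of the text. One call that would give matrices of orders 2x8, 4x4 and 8x2 filled row-wise with 1 to 16 is sketched below; it is an assumed reconstruction, and it also shows how fixed arguments can be passed through MoreArgs.

```r
# Assumed reconstruction: 16 values filled row-wise into three shapes
mats <- mapply(FUN = matrix,
               nrow = c(2, 4, 8),
               ncol = c(8, 4, 2),
               MoreArgs = list(data = 1:16, byrow = TRUE),
               SIMPLIFY = FALSE)

print(lapply(mats, dim))   # dimensions of the three matrices
```

The nrow and ncol vectors are consumed element by element, whereas everything inside MoreArgs is passed unchanged to every call of matrix().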
SAQ 5
The runif(n,min,max) function is used to generate n random numbers in
the range min to max. Use mapply() function to generate 5, 10 and 15
uniform numbers in the range 0 to 1 in a single command.
8.7 SUMMARY
The main points discussed in this unit are as follows:
Different methods of using lapply() function are discussed.
Advantage of using sapply() function as compared to the lapply()
function is discussed and illustrated.
The apply() and tapply() functions with their general syntaxes and
methods of their usage are discussed and illustrated.
The advantage of using the mapply() function is discussed and illustrated,
while keeping the execution of the function easy to understand.
8.9 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. We first create a list named Lst as follows:
Lst <- list(diag(c(2,3,5)), matrix(c(5,1,8,4,-2,6,3,2,5), ncol=3))
Then we can compute the transpose of both the matrices using the
lapply() function with the following R statement:
lapply(X=Lst, FUN=t)
2. We first create a data frame using the following code:
Adm.data <- data.frame(Name= c("Shreyash","Prithu",
"Yuvaan","Advika","Pawan","Pehu"),
Gender= as.factor(c("Male", "Male", "Male", "Female",
"Male", "Female")),
Percentage= c(88.55, 80.13, 85.31, 75.22, 65.04, NA),
AgeG30= c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))
Then to compute the average percentage score of the students
according to the Gender variable of the data, we split the data in the
following manner and assigned to split.data:
split.data <- split(Adm.data, Adm.data$Gender)
print(split.data)
Thereafter, we create a function named Fun1 to get the 3rd column,
which is showing the percentages, i.e., x[,3]. Then we use the
sapply() function to get average percentage score of the students
according to the Gender variable of the data as follows:
Fun1 <- function(x) mean(x[,3], na.rm=TRUE)
sapply(X=split.data, FUN=Fun1)
3. We can compute the vectors of row sums, column means, minimum and
maximum of the SubAirUp data frame using following commands:
For a vector of row sums:
apply(X=SubAirUp, MARGIN=1, FUN=sum)
For a vector of column means:
apply(X=SubAirUp, MARGIN=2, FUN=mean)
For a vector of minimum of rows:
apply(X=SubAirUp, MARGIN=1, FUN=min)
For a vector of maximum of rows:
apply(X=SubAirUp, MARGIN=1, FUN=max)
4. Firstly, the data frame named warp_breaks can be created as
follows:
warp_breaks <-
data.frame(breaks=c(26,30,54,25,70,52,51,26,67,18),
wool=factor(c(rep("A",5),rep("B",5))),
tension=factor(c(rep("L",5),rep("M",5))))
Then the group-wise minimum according to the wool variable (2nd column)
can be obtained using the following command:
tapply(warp_breaks[,1], warp_breaks[,2], min)
Then the group-wise minimum according to the tension variable (3rd
column) can be obtained using the following command:
tapply(warp_breaks[,1], warp_breaks[,3], min)
5. Uniform random numbers of sizes 5, 10 and 15 in the range 0 to 1 can
be generated in a single command using the mapply() function by
writing the following statement:
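The statement itself is not visible in this part of the text. One possible form, given here as an assumed sketch rather than the original answer, passes the three sample sizes through n and the fixed limits through MoreArgs; set.seed() is added only so that a run can be repeated.

```r
set.seed(1)   # fixed seed only so that the run is repeatable

# n varies over 5, 10 and 15; min and max are the same for every call
u <- mapply(FUN = runif, n = c(5, 10, 15),
            MoreArgs = list(min = 0, max = 1), SIMPLIFY = FALSE)

print(lengths(u))   # 5 10 15
```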
UNIT 9
DESCRIPTIVE STATISTICS AND CORRELATION WITH R
Structure
9.1 Introduction Skewness
9.1 INTRODUCTION
In this unit, we shall discuss the methods of computing the mean, median,
mode, variance, standard deviation, range, quartiles, quartile deviation and
mean deviation. We shall also discuss skewness and kurtosis to infer the
shape and spread of the data. Thereafter, to find whether the two variables are
correlated or not, correlations and bivariate plots are discussed. To do so, we
shall consider previous eight units of MST-015 (Introduction to R Software)
course as basics and illustrate the computations of the aforementioned
statistical measures in this unit.
Moreover, in this unit the methods of computing the aforementioned statistical
measures are illustrated for the ungrouped and grouped data. Recall that, the
*Dr. Taruna Kumari, School of Sciences, Indira Gandhi National Open University, New Delhi
9.2 MEAN
Means or averages are single values that describe the characteristics of the
entire data. Mean and median are the most useful measures of central
tendency as they show the tendency of some central value around which data
clusters. They also facilitate comparisons.
In this section, we shall discuss the method of computation of the mean of
grouped and ungrouped data. Recall that to compute the mean of the
continuous frequency data, we must have Xi’s (middle values) in advance.
But, in the case of the discrete frequency data, we can
take the given Xi’s directly for the computation purpose and then the mean can be
computed on the same lines as for the continuous frequency data.
Then we use the sum() function to get the sum of the observations and
length() function to get n. Then compute the mean of the data using (9.1)
as follows:
#Computing the arithmetic mean of the income data
> sum(Income)/length(Income)
[1] 41.26667
It can be verified that the mean is computed using the non-missing values only
by writing the following mean() function statement:
#Verification
> mean(x=c(1, 2, 3, 5))
[1] 2.75
Hence, it is verified.
Note: If we do not use the na.rm argument of the mean() function then we
get NA output. See the following for clarification:
#Computing the mean of the data consisting of NA value
> mean(x=c(1, 2, 3, NA, 5))
[1] NA
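The effect of the na.rm argument can be seen by running both calls on the same vector. The names v, m_default and m_na_rm are chosen only for this illustration.

```r
v <- c(1, 2, 3, NA, 5)

m_default <- mean(v)                 # NA, because one value is missing
m_na_rm   <- mean(v, na.rm = TRUE)   # mean of 1, 2, 3 and 5
print(m_default)
print(m_na_rm)   # 2.75
```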
Next, we discuss the computations of the arithmetic mean for the grouped
data. Recall that the mean of the grouped data is computed using the following
formula.
$$\text{Arithmetic mean} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \qquad \ldots(9.2)$$
where $N = \sum_{i=1}^{n} f_i$ and n is the number of class intervals.
For the illustration purpose, we consider the following data representing the
figures of incentives (in thousand) earned by 1400 employees of a company:
To compute the average incentives using data frames, we first create a data
frame named data with three columns. The 1st column named ll represents
the lower limits of the intervals, the 2nd column named ul represents the upper
limits of the intervals and freq column represents given frequencies in the
following manner:
#Creating a data frame of the incentives data
> data <- data.frame(ll=seq(20, 140, 20), ul=seq(40, 160, 20),
freq=c(500, 300, 280, 120, 100, 80, 20))
Let us first attach the data frame data to use its column names without data
frame name as follows:
#Attaching the data frame
> attach(data)
Then the arithmetic mean can be easily computed using given formula. To do
so, we compute the middle values and thereafter use (9.2) to compute the
mean as follows:
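The computation described above can be sketched as follows. The class limits and frequencies are those given in the text; the data frame is named incentives here, and with() is used instead of attach(), both being choices of this sketch rather than of the original listing.

```r
# Same class limits and frequencies as in the text
incentives <- data.frame(ll   = seq(20, 140, 20),
                         ul   = seq(40, 160, 20),
                         freq = c(500, 300, 280, 120, 100, 80, 20))

incentives$mid <- (incentives$ll + incentives$ul) / 2          # middle values x_i
grouped_mean <- with(incentives, sum(freq * mid) / sum(freq))  # formula (9.2)
print(grouped_mean)
```

The same value is returned by weighted.mean(incentives$mid, incentives$freq), which can serve as a quick check.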
Descriptive Statistics and Correlation with R
Note: This problem could have been solved without using the data frames as
well. In that case, class intervals and frequencies will be assigned to vectors
first. After that, Xi’s can be computed. Thereafter, using Xi’s and frequencies
mean can be computed.
Next, we discuss the method of computing the geometric mean of the given
ungrouped data. To do so, we consider the following production data of fans
noted for 7 days and compute the geometric mean from it.
Day 1 2 3 4 5 6 7
No. of units produced 4000 2000 1500 3500 2000 1900 3000
To compute the geometric mean, we first assign the number of units produced
to a vector named Production. Then use the prod() and length()
functions to compute the geometric mean using the formula given in (9.3) as
follows:
#Assigning the production data
> Production <- c(4000, 2000, 1500, 3500, 2000, 1900, 3000)
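A minimal sketch of the prod() and length() approach described above is given here. It assumes the usual definition of the geometric mean as the n-th root of the product of the observations; the names `n` and `GM` are illustrative.

```r
# Production data of fans for 7 days, as given in the text
Production <- c(4000, 2000, 1500, 3500, 2000, 1900, 3000)

n  <- length(Production)       # number of observations
GM <- prod(Production)^(1/n)   # n-th root of the product of the observations
GM
```

For long vectors of large values the product can exceed the range of double precision numbers, in which case a logarithm-based computation is the safer route.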
Or,
\[ \text{Geometric mean} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \log_e x_i \right) \tag{9.4} \]
Note that the formula of the geometric mean appears analogous to the mean
formula given in (9.1); here, instead of the $x_i$'s, we have the $\log_e x_i$'s. So, the
geometric mean given in (9.4) can be computed using the mean(), log()
and exp() functions as follows:
#Computing the geometric mean using logarithms
> exp(mean(log(Production)))
[1] 2414.789
You can observe that using both the approaches we get the same geometric
mean as 2414.789.
Next, we discuss the computation of the geometric mean for the grouped data.
We already know that the geometric mean for the grouped data is computed
using the following formula:
\[ GM = \left( x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n} \right)^{1/N}, \tag{9.5} \]
where $N = \sum_{i=1}^{n} f_i$ and $n$ is the number of class intervals.
To compute the geometric mean from this grouped data using data frames, we
first create a data frame named data with three columns consisting of lower
limits, upper limits and frequencies as follows:
#Creating a data frame of the data
> data <- data.frame(ll=seq(1, 13, 2), ul=seq(3, 15, 2),
freq=c(4, 3, 2, 1, 1, 8, 2))
Note that the first two columns, namely ll and ul, of the data frame
represent the lower and upper limits of the intervals. Also, the third column
freq represents the frequencies given in the data.
Additionally, since here we have not attached the data frame data, so we
need to use data frame name together with the column name to use any
column (or otherwise we can use indices of the data frame). Here to compute
the geometric mean, we first compute the middle values and assign them to
xi. Then we extract the frequencies and assign them to fi. Moreover, we
compute the sum of the frequencies and assign it to N as follows:
#Computing the mid values
> xi <- (data$ll+data$ul)/2; xi
[1] 2 4 6 8 10 12 14
#Assigning frequencies
> fi <- data$freq;fi
[1] 4 3 2 1 1 8 2
#Computing the sum of the frequencies
> N <- sum(fi); N
[1] 21
After getting all these quantities, finally we compute the geometric mean using
the formula given in (9.5) as follows:
#Computing the geometric mean
> (prod(xi^fi))^(1/N)
[1] 6.735228
An alternative method of computing the geometric mean for the grouped data
by using logarithms is as follows:
\[ \log_e(\text{Geometric mean}) = \frac{1}{N} \sum_{i=1}^{n} f_i \log_e x_i, \]
Or,
\[ \text{Geometric mean} = \exp\left( \frac{1}{N} \sum_{i=1}^{n} f_i \log_e x_i \right). \tag{9.6} \]
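A minimal sketch of the logarithmic version (9.6) for the same grouped data is given here, with the middle values and frequencies typed in directly so that the block runs on its own. The names `GM_log` and `GM_wm` are illustrative; the second computation uses weighted.mean(), which normalises the weights by their total.

```r
# Middle values and frequencies of the grouped data used above
xi <- c(2, 4, 6, 8, 10, 12, 14)
fi <- c(4, 3, 2, 1, 1, 8, 2)
N  <- sum(fi)                             # total frequency

# Formula (9.6): exponential of the frequency-weighted mean of the logarithms
GM_log <- exp(sum(fi * log(xi)) / N)
GM_log

# Same quantity through weighted.mean() with the frequencies as weights
GM_wm <- exp(weighted.mean(x = log(xi), w = fi))
GM_wm
```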
Hence, by using both the versions of the formula we get the same result.
In addition to all these note that we can also use the weighted.mean()
function to compute the geometric mean. To do so, we assign the x argument
of the weighted.mean() function as the logarithms of the middle values of
and,
\[ \text{Harmonic mean} = \frac{N}{\sum_{i=1}^{n} \dfrac{f_i}{x_i}}, \quad \text{for grouped data} \tag{9.8} \]
Now, we discuss the method of computing the harmonic mean. For the
illustration purpose, we compute the harmonic mean of the following arbitrary
ungrouped data:
15, 25, 35, 70, 50
We first assign the data to a vector named x, then use the mean() function to
compute the harmonic mean using the formula given in (9.7) as follows:
#Assigning the data under the name x
> x <- c(15, 25, 35, 70, 50)
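A minimal sketch of the mean()-based computation is given here. It assumes that (9.7) is the usual ungrouped form, i.e., the number of observations divided by the sum of their reciprocals, which equals the reciprocal of the mean of the reciprocals; the name `HM` is illustrative.

```r
# Arbitrary ungrouped data from the text
x <- c(15, 25, 35, 70, 50)

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals
HM <- 1 / mean(1 / x)
HM

# Equivalent form: n divided by the sum of the reciprocals
length(x) / sum(1 / x)
```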
To compute the harmonic mean from given data, we first create a data frame
named data1 on the same lines as earlier. Then we compute the middle
values and assign them to xi. Also, assign the frequencies to fi and total of
the frequencies to N as follows:
Finally, we compute the harmonic mean using the formula given in (9.8) and
by using the weighted.mean() function as follows:
#Computing the harmonic mean using formula
> N/sum(fi/xi)
[1] 13.96277
SAQ 1
(i) Write R code to compute the geometric mean of the following data:
Observation Frequency
x1 f1
x2 f2
x3 f3
x4 f4
x5 f5
9.3 MEDIAN
Median is a positional average, which comes under the measures of central
tendency. The median appears in the ‘middle’ of an ordered sequence of values.
It is a value that divides the data in such a way that half of the
observations in the data are lower than it and half are greater than it. So, after
sorting the data, the median of ungrouped data is given by the following
formula:
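\[ \text{Median} = \begin{cases} \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ observation}, & \text{if } n \text{ is odd} \\[2ex] \dfrac{\left(\dfrac{n}{2}\right)^{\text{th}} + \left(\dfrac{n}{2}+1\right)^{\text{th}}}{2} \text{ observation}, & \text{if } n \text{ is even} \end{cases} \tag{9.9} \]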
So, depending on n (the number of observations) two options are available to
compute the median. One out of two is to be selected. So, to compute the
median we can create a conditional statement, i.e., an if-else statement.
For the illustration purpose, we use (9.9) to compute the median of the
following data of wages of 8 workers:
5580, 5600, 4600, 4607, 5034, 4666, 5612, 5123
To do so, we first assign the data to a vector named x and compute its length
using the length() function as follows:
#Assigning the data under the name x
> x<-c(5580, 5600, 4600, 4607, 5034, 4666, 5612, 5123)
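The if-else computation based on (9.9) can be sketched as follows. This is an illustrative sketch, not the original listing; the names `n`, `sx` and `med` are chosen here, and the result is compared with the built-in median() function.

```r
# Wages of 8 workers, as given in the text
x <- c(5580, 5600, 4600, 4607, 5034, 4666, 5612, 5123)

n  <- length(x)   # number of observations
sx <- sort(x)     # data sorted in ascending order

# Formula (9.9): choose the branch according to whether n is even or odd
if (n %% 2 == 0) {
  med <- (sx[n/2] + sx[n/2 + 1]) / 2   # average of the two middle observations
} else {
  med <- sx[(n + 1)/2]                 # the single middle observation
}
med

# The built-in function needs neither the sorting nor the branching
median(x)
```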
Hence, you can observe that while using the median() function we just need
to supply the vector as an argument to the function, and we do not need to sort
it either. Moreover, this function works well for an even as well as an odd number of
observations; see the following statements for more clarification.
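Next, we consider the case when we have continuous frequency data. In case
of continuous frequency data, the median class is computed first, where the
median class is the class in which the cumulative frequency is just greater than
$N/2$, where $N = \sum_{i=1}^{n} f_i$. Also, recall that the median is computed using the
following formula:
\[ \text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right), \tag{9.10} \]
where, as used in the computation that follows, $l$ is the lower limit of the median
class, $h$ is its width, $f$ is its frequency and $C$ is the cumulative frequency of the
class preceding it.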
We can create a data frame to assign the given wages data, but for a change
we now assign the given data to three vectors named wages_ll, wages_ul
and fi, consisting of lower limits, upper limits and frequency data
respectively.
#Assigning the data into different vectors
> wages_ll <- seq(200, 600, 100)
> wages_ul <- seq(300, 700, 100)
> fi <- c(4, 6, 25, 15, 8)
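The quantities N and Cum_fi used in the steps below can be obtained with the sum() and cumsum() functions. The following self-contained sketch assumes that this is how they are defined and runs the whole computation of (9.10) in one place; the name `med` is illustrative.

```r
# Wages data: class limits and frequencies as given in the text
wages_ll <- seq(200, 600, 100)
wages_ul <- seq(300, 700, 100)
fi <- c(4, 6, 25, 15, 8)

N      <- sum(fi)      # total frequency
Cum_fi <- cumsum(fi)   # cumulative frequencies

# Median class: first class whose cumulative frequency exceeds N/2
ind <- min(which(Cum_fi > N/2))

# Width of the median class and the median as in formula (9.10)
h   <- wages_ul[ind] - wages_ll[ind]
med <- wages_ll[ind] + (h / fi[ind]) * (N/2 - Cum_fi[ind - 1])
med
```

Note that Cum_fi[ind - 1] is empty when the first class is the median class, so that case would need the preceding cumulative frequency to be set to 0 explicitly.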
Next, we use Cum_fi and N to compute the median class at which the
cumulative frequency is just greater than N/2 using min() and which()
functions. Also assign the computed index of the middle class to ind as
follows:
#Computing the median class
> ind <- min(which(Cum_fi>N/2)); ind
[1] 3
After computing the median class, we next compute the magnitude of the
median class and assign it to h as follows:
#Computing the magnitude of the median class
> h <- wages_ul[ind]-wages_ll[ind]; h
[1] 100
Finally, after computing all these quantities, we compute the median using
(9.10) as follows:
#Computing the median
> wages_ll[ind]+(h/fi[ind])*(N/2-Cum_fi[ind-1])
[1] 476
SAQ 2
Create a function to compute the median of the ungrouped data.
9.4 MODE
It is well known that the mode represents the value which occurs most
frequently in the data. But there may be cases when the maximum-occurrence
concept does not work, such as the case when the maximum frequency is
repeated, among others. In such situations, we solve the problems using the method
of grouping (not discussed here). Here, consideration is given to the
case when the distribution of the data is unimodal and the mode can be
computed just by checking the maximum occurrence of an observation.
For the illustration purpose we now discuss the method of computing the
mode of the following data:
15, 16, 16, 15, 15, 15, 14, 12, 15, 15, 16, 14, 13, 12, 13, 12, 15, 11, 16, 15,
12, 13, 17, 14, 14, 13, 14, 12, 16, 17, 13, 15, 11, 15, 15, 13, 11, 17, 16, 14,
16, 12, 14, 15, 15, 14, 14, 12, 13, 14, 14, 14, 15, 14, 16, 16, 14, 13, 13, 14,
11, 15, 17, 15, 15, 17, 16, 15, 16, 13, 14, 17, 16, 15, 13, 11, 15, 13, 13, 12,
14, 14, 14, 15, 13, 14, 16, 12, 16, 13, 14, 17, 12, 13, 15, 14, 16, 14, 14, 13,
14, 17, 14, 16, 15
Firstly, we create a vector of the data and name it x in the following
manner:
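A minimal sketch of this step and of the frequency count that follows is given here. It assumes that the frequencies are obtained with the table() function, and uses which.max() to pick the most frequent value; the name `freq` is illustrative.

```r
# Data from the text, assigned to x
x <- c(15, 16, 16, 15, 15, 15, 14, 12, 15, 15, 16, 14, 13, 12, 13, 12, 15, 11, 16, 15,
       12, 13, 17, 14, 14, 13, 14, 12, 16, 17, 13, 15, 11, 15, 15, 13, 11, 17, 16, 14,
       16, 12, 14, 15, 15, 14, 14, 12, 13, 14, 14, 14, 15, 14, 16, 16, 14, 13, 13, 14,
       11, 15, 17, 15, 15, 17, 16, 15, 16, 13, 14, 17, 16, 15, 13, 11, 15, 13, 13, 12,
       14, 14, 14, 15, 13, 14, 16, 12, 16, 13, 14, 17, 12, 13, 15, 14, 16, 14, 14, 13,
       14, 17, 14, 16, 15)

freq <- table(x)   # frequency of each distinct value
freq

# Value with the highest frequency, and that frequency
names(freq)[which.max(freq)]
max(freq)
```

Since which.max() returns only the first maximum, this shortcut is suitable only for unimodal data, as assumed in this section.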
From this output it is clear that the most frequently occurring value is 14, which
has occurred 26 times. Therefore, the mode of the data is 14.
The same result can be computed by taking the help of the which() function as
follows:
#Computing the maximum occurring number with its index
> which(table(x)==max(table(x)))
14
4
SAQ 3
Explain the execution of the which() function with an example.
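\[ \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \text{for ungrouped data} \tag{9.11} \]
and
\[ \text{Variance} = \frac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2, \quad \text{for grouped data} \tag{9.12} \]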
Note that here to compute the variance, we have used the sum(), mean()
and length() functions. The variance can also be computed using the same
formula with the help of mean() function only as follows:
#Computing the variance
> mean((Income-mean(Income))^2)
[1] 60.86222
Hence, verified.
Note: The var() function also supports the na.rm argument for handling
missing values.
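The relation between formula (9.11) and the var() function can be sketched as follows. The vector `v` is hypothetical data chosen only for illustration (the Income data are not repeated here); var() divides by (n-1), so it has to be rescaled to match (9.11).

```r
# Hypothetical data, used only for illustration
v <- c(12, 15, 11, 18, 14)
n <- length(v)

pop_var  <- mean((v - mean(v))^2)   # formula (9.11), divisor n
samp_var <- var(v)                  # var() uses divisor (n - 1)

# The two agree after rescaling by (n - 1)/n
all.equal(pop_var, samp_var * (n - 1) / n)

# na.rm = TRUE drops missing values before the computation
var(c(v, NA), na.rm = TRUE)
```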
Now, we shall illustrate the method of computing the variance for the
continuous frequency data. For the illustration purpose consider the following
Next, we compute the middle values using lower and upper limits and assign
them to xi as follows:
#Computing the middle values
> xi <- (BS_ll+BS_ul)/2
After assigning all these variables, now we use them to compute the variance
of the breaking strength data using the formula given in (9.12) as follows:
#Computing the variance of the grouped data
#(the mean in (9.12) is the frequency-weighted mean of the middle values)
> sum(fi*(xi-sum(fi*xi)/sum(fi))^2)/sum(fi)
[1] 4.677456
Standard Deviation:
The standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of the data from its arithmetic mean.
The standard deviation of the data can be computed using the following
formula.
\[ \text{Standard Deviation} = \sqrt{\text{Variance}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{9.13} \]
#Internal structure
> str(sd)
function (x, na.rm = FALSE)
From this internal structure it is clear that this function also supports the logical
argument na.rm with default value as FALSE. This argument can be used on
the similar lines (as earlier) in case of presence of missing values.
Note: The sd() function gives the same results as of sqrt(var()). For
more clarification we compute the standard deviation of the Income’s data
using both approaches.
#Computing the standard deviation using sd() function
> sd(Income)
[1] 8.075241
Hence, it is verified that the sd() function returns the square root of the
var() function result.
Furthermore, the cov() function can also be used to compute the variance of
data using the following fact:
Variance(X) = Covariance(X, X) …(9.14)
So the variance of the income data using the cov() function can also be
computed as follows:
#Computing the variance using the cov() function
> cov(Income, Income) #Same as var(Income)
[1] 65.20952
Note: The cov() function also uses (n-1) in the denominator of the
expression, as the var() function does.
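The identities used in this section can be checked on any numeric vector. The following sketch uses the hypothetical vector `v` (not the Income data) and also shows how the divisor-n standard deviation of (9.13) can be recovered from sd().

```r
# Hypothetical data, used only for illustration
v <- c(12, 15, 11, 18, 14)
n <- length(v)

all.equal(sd(v), sqrt(var(v)))   # sd() is the square root of var()
all.equal(cov(v, v), var(v))     # fact (9.14): Variance(X) = Covariance(X, X)

# Standard deviation with divisor n, as in (9.13)
sd_n <- sd(v) * sqrt((n - 1) / n)
sd_n
```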
SAQ 4
Write R code to compute the standard deviation of the following data:
116, 151, 116, 179, 141, 197, 191, 197, 160, 175, 162, 137, 122, 194,
128, 115, 140, 165, 123, 151, 179, 178, 189, 185, 143, 152, 195, 152,
117, 165, 199, 163, 173, 178, 172, 173, 179, 159, 191, 158
9.6.1 Range
The range of the ungrouped data in R can be computed in several ways.
We know that the range of the data is defined as the difference between the
maximum and minimum values. So, we can easily compute the range in R by
using the max() and min() functions. For the illustration purpose we now
compute the range of the following data on the monthly number of grocery items
purchased:
500, 600, 250, 700, 650, 800, 790
#Assigning the data
> purchase <- c(500, 600, 250, 700, 650, 800, 790)
#Computing range
> max(purchase)-min(purchase)
[1] 550
Hence, the range of the data is 550. The range of the data can also be computed
using the range() and summary() functions. The range() and summary()
functions are available in the base package. The range() function returns the
smallest and the greatest observation present in the data and its internal
structure is as follows:
#Internal structure of the range() function
> str(range)
function (..., na.rm = FALSE)
Note: From the obtained internal structure, it is clear that the range() function
also supports the function argument na.rm to handle missing values.
Next, we compute the range of the purchase data using the range() function
as follows:
#Using range() function on data
> range(purchase)
[1] 250 800
The first value of this output is the smallest observation of the data and the
second value is the largest observation of the data. Hence, the range()
function returns the smallest and the largest observations of the data. The
difference between the two computed observations can be obtained using the
diff() function as follows:
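A minimal sketch of this step, with the purchase data repeated so that the block runs on its own (the name `rng` is illustrative):

```r
# Monthly number of grocery items purchased, as given in the text
purchase <- c(500, 600, 250, 700, 650, 800, 790)

# range() returns c(minimum, maximum); diff() subtracts the first from the second
rng <- diff(range(purchase))
rng
```

This gives the same value as max(purchase)-min(purchase).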
The quartiles of the given data can be computed using the quantile()
function available in stats package. The main arguments of interest of the
quantile() function are as follows:
#The quantile() function
quantile(x, #numeric vector
probs, #vector of probabilities
names, #logical; whether to show percentile names
na.rm, #to handle missing values
type, #algorithm to compute quantile
...) #other arguments of the function
The quantile() function is used to compute the sample quantiles of a vector
assigned to the x argument of the function, corresponding to the given
probabilities. These probabilities are assigned to the probs argument of the
function (abbreviated to prob in the statements below, which R accepts
through partial argument matching). Also, from the R documentation page we
know that the smallest observation corresponds to a probability of 0 and the
largest to a probability of 1. Additionally, there is one more argument type
with default value 7 which takes integer values from 1 to 9. These integer
values represent the algorithm of the quantile computation. Values 4 to
9 are used for continuous sample quantiles and values 1 to 3 are used for
discontinuous sample quantiles. So, it is better to use type as 1 here.
Now, we use the quantile() function to compute the first quartile (Q1),
median (Q2) and third quartile (Q3) of the purchase data discussed earlier by
assigning the x argument of the quantile() function as the purchase vector and
the prob argument as 25% (for Q1), 50% (for Q2, the median) and 75% (for Q3).
Also, we assign the type argument as 1 in the following manner:
#Computing the quartiles of the purchase data
> quantile(x=purchase, prob=c(0.25, 0.50, 0.75), type=1)
25% 50% 75%
500 650 790
Hence, 500 is the first quartile, 650 is the median and 790 is the third quartile
of the purchase data.
Then the quartile deviation of the data, i.e., $(Q_3 - Q_1)/2$, can be easily computed
using the diff() function by assigning the lag argument (default as 1) of it
as 2 in the following manner:
#Computing quartile deviation
> diff(quantile(x=purchase, prob=c(0.25,0.50,0.75), type=1,
names=FALSE), lag=2)/2
[1] 145
Hence, again we get the same quartile deviation as 145. Thus, the result
verifies that using both the approaches we get the same result.
Note: (i) To compute the six number summary of the purchase data, we
have just supplied the purchase data as an argument to the summary()
function. But note that, earlier we have computed the quartile by considering
the type argument as 1 (whose default value is 7). So, by default,
summary() function will also give quartiles using the type of the quartile as
7. To specifically use algorithm type as 1, we have assigned the
quantile.type argument of the summary() function as 1.
(ii) The unname() function is used to remove the name of the computed
quartile deviation. Also, Qcom[5] and Qcom[2] are Q3 and Q1 of the data.
(iii) The quartile deviations for ungrouped data and grouped data can be
computed on the same lines of mean using their formulae.
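The summary()-based approach referred to in this note can be sketched as follows. This is an illustrative reconstruction, not the original listing: it assumes that the summary is stored under the name Qcom mentioned in the note, and `QD` is a name chosen here.

```r
# Monthly number of grocery items purchased, as given in the text
purchase <- c(500, 600, 250, 700, 650, 800, 790)

# Six number summary with quartiles computed by algorithm type 1
Qcom <- summary(purchase, quantile.type = 1)
Qcom

# Qcom[2] is Q1 and Qcom[5] is Q3; unname() removes the carried-over name
QD <- unname((Qcom[5] - Qcom[2]) / 2)
QD
```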
Now, for the illustration purpose, we compute the mean deviation about mean
of the purchase data.
#Computing the mean deviation about mean of the purchase data
> mean(abs(purchase-mean(purchase)))
[1] 139.5918
SAQ 5
Write the output of the following two statements and also differentiate between
them.
(i) diff(na.omit(c(2,4,NA,6,1)), lag=1, na.rm=TRUE)
(ii) diff(c(2,4,NA,6,1), lag=3, na.rm=TRUE)
9.7.1 Skewness
Skewness refers to a departure from symmetry (asymmetry of the distribution).
Skewness plays a very important role because the statistical theory is often
based on the assumption of the normal distribution and normal distribution is a
symmetric distribution.
The sign of $\mu_3$ is positive and $\gamma_1$ is greater than zero, which indicates positive
skewness. The same can be verified from the density histogram as well.
Then the skewness of the lwt variable of the birthwt data can be computed
using the skewness() function as follows:
#Computing the skewness of the lwt variable
> skewness(birthwt$lwt)
[1] 1.390855
Hence, the obtained result confirms that the skewness() function returns the
value of $\gamma_1$.
Next, we discuss the method of computing the skewness for the continuous
frequency data. To do so, we again consider the breaking strength data
discussed in Section 9.5 of this unit. So, proceeding with the same objects
names and assigned data, we get.
#Recalling breaking strength data
> BS_ll <- seq(54, 62, 2) #lower limit
> BS_ul <- BS_ll+2 #upper limit
> fi <- c(4, 22, 25, 19, 9) #frequencies
To compute the skewness of the grouped data, we compute the middle values
first and assign them to xi. Additionally, to compute the moments about mean
we use the already created function momf as follows:
#Computing the middle values
> xi<-(BS_ll+BS_ul)/2
Next, we use it to compute the 2nd and 3rd moments. Also, we assign the
computed values to Muf2 and Muf3 as follows:
#Computing the 3rd moment about mean
> Muf3 <- momf(xi, fi, 3); Muf3
[1] 1.254521
#Computing gamma 1
> Gamma1 <- sqrt(Beta1); Gamma1
[1] 0.124012
It can be observed that the computed value of $\beta_1$ is close to zero, and $\gamma_1$ is
also close to zero, which indicates the moderate symmetry of the distribution of
the data.
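The whole chain of computations for the grouped data can be sketched in one block. The definition of momf() below is a hypothetical re-creation (the r-th moment about the frequency-weighted mean); it is an assumption made for illustration, as is the use of $\beta_1 = \mu_3^2/\mu_2^3$ for Beta1.

```r
# Hypothetical re-creation of momf(): r-th moment about the mean for grouped data
momf <- function(x, f, r) {
  xbar <- sum(f * x) / sum(f)      # frequency-weighted mean
  sum(f * (x - xbar)^r) / sum(f)   # r-th central moment
}

# Breaking strength data, as recalled in the text
BS_ll <- seq(54, 62, 2)            # lower limits
BS_ul <- BS_ll + 2                 # upper limits
fi <- c(4, 22, 25, 19, 9)          # frequencies
xi <- (BS_ll + BS_ul) / 2          # middle values

Muf2 <- momf(xi, fi, 2)            # 2nd moment about mean
Muf3 <- momf(xi, fi, 3)            # 3rd moment about mean
Beta1  <- Muf3^2 / Muf2^3          # assumed definition of beta 1
Gamma1 <- sqrt(Beta1)              # gamma 1; its sign follows the sign of Muf3
c(Muf2 = Muf2, Muf3 = Muf3, Beta1 = Beta1, Gamma1 = Gamma1)
```

Under these assumptions the values of Muf3 and Gamma1 agree with the outputs printed above.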
9.7.2 Kurtosis
Recall that the normal curve is known as ‘mesokurtic’ curve. A curve, which is
more peaked than a normal curve is known as ‘leptokurtic’ and the curve
flatter (flat-topped) than the normal curve is known as ‘platykurtic’.
Since the obtained value is more than 3, we can say that the curve of the data
is moderately leptokurtic.
Alternatively, the kurtosis() function available in the moments package
can be used to compute the kurtosis of the data as follows:
#Computing the kurtosis of the data
> kurtosis(birthwt$lwt)
[1] 5.309181
Hence, the obtained result confirms that the kurtosis() function returns the
value of $\beta_2$.
Next, we discuss the method of computing the kurtosis from the grouped
data. For the illustration purpose we again consider the breaking strength
frequency data discussed earlier and we use the same object names and
assigned data with the user-defined function momf for grouped data to compute
the kurtosis as follows:
#Computing 4th moment about mean and assigning to Muf4
> Muf4 <- momf(xi, fi, 4); Muf4
[1] 48.65873
Since the obtained value is less than 3, we can say the curve of the data is
platykurtic.
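The value compared with 3 here can be sketched as follows, assuming that it is $\beta_2 = \mu_4/\mu_2^2$ and that momf() is defined as the r-th moment about the frequency-weighted mean (a hypothetical re-creation, labelled as such in the code).

```r
# Hypothetical re-creation of momf(): r-th moment about the mean for grouped data
momf <- function(x, f, r) {
  xbar <- sum(f * x) / sum(f)
  sum(f * (x - xbar)^r) / sum(f)
}

xi <- seq(55, 63, 2)        # middle values of the breaking strength classes
fi <- c(4, 22, 25, 19, 9)   # frequencies

Muf2 <- momf(xi, fi, 2)     # 2nd moment about mean
Muf4 <- momf(xi, fi, 4)     # 4th moment about mean
Beta2 <- Muf4 / Muf2^2      # assumed definition of beta 2
Beta2
Beta2 < 3                   # TRUE indicates a platykurtic curve
```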
SAQ 6
Write R code to compute the skewness of the following data:
0.7 1.2 -0.5 0.8 -0.1 0.6 -1.8 0.9 -2.8 0.2
The x and y arguments of the cor() function are used to assign the two
variables, say X and Y, whose correlation is to be computed. Another
important argument of the cor() function is the method argument, it is used
to specify the method to be used to compute the correlation. Moreover, the
use argument (with default value "everything") of the function is used to
assign a character string which specifies the method of computing in the
presence of NA values. Additionally, the cor() function can also be used on
matrices or data frames. In that case, the correlations between the columns of
X and the columns of Y (when X and Y both are matrices or data frames) are
computed.
\[ r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}} \tag{9.17} \]
Now, we compute the Pearson’s correlation coefficient between the two sets
of scores using (9.17) and the cor() function. To do so, we first assign the
data to x and y vectors as follows:
#Assigning the data
> x <- c(23, 30, 18, 15, 29, 18, 28, 29, 19, 27)
> y <- c(36, 34, 15, 26, 22, 29, 18, 19, 18, 10)
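A minimal sketch of the two computations mentioned above, i.e., formula (9.17) written out term by term and the cor() function, is given here. The names `dx`, `dy` and `r` are illustrative.

```r
# The two sets of scores, as given in the text
x <- c(23, 30, 18, 15, 29, 18, 28, 29, 19, 27)
y <- c(36, 34, 15, 26, 22, 29, 18, 19, 18, 10)

dx <- x - mean(x)   # deviations of x from its mean
dy <- y - mean(y)   # deviations of y from its mean

# Formula (9.17)
r <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r

# The same coefficient through the cor() function
cor(x, y, method = "pearson")
```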
\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \tag{9.18} \]
where $n$ is the number of values and $d_i$ is the difference between the ranks,
i.e., $x_i - y_i$ for $i = 1, 2, \ldots, n$.
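Formula (9.18) can be sketched as follows. The vectors `a` and `b` are hypothetical scores without ties, chosen only for illustration; with tied values the simple formula and cor() with method="spearman" need not agree exactly, because cor() works on average ranks.

```r
# Hypothetical scores without ties, used only for illustration
a <- c(10, 20, 30, 40, 50)
b <- c(12, 25, 20, 48, 41)

n  <- length(a)
d  <- rank(a) - rank(b)                    # differences between the ranks
rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))   # formula (9.18)
rs

# The same coefficient through the cor() function
cor(a, b, method = "spearman")
```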
Hence, the bivariate plot also indicates that there is no correlation between x
and y variables.
As mentioned earlier, the cor() function can also be used on matrices or
data frames. This means the function arguments x and y will be assigned as
matrices or data frames instead of vectors. In that case, the correlation
between the columns of x and the columns of y will be computed. There may
be situations in which we might be interested in finding the correlation in
between the columns of x only. In that case, it is enough to supply only one
matrix or data frame object as function argument, cor(x,x) or cor(x), as
both will yield the same result.
For the illustration purpose, we create a correlation matrix using the cor()
function of the first 20 rows of the USArrests data. To do so, we first assign
the extracted data to ExtUs. Then compute the correlation matrix between the
columns of the ExtUs data as follows:
#Assigning the data
> ExtUs <- USArrests[1:20,]
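A minimal sketch of the correlation matrix step is given here. The name `cm` and the rounding to three decimals are choices made for illustration.

```r
# First 20 rows of the USArrests data (datasets package)
ExtUs <- USArrests[1:20, ]

# Correlation matrix between the columns of ExtUs
cm <- cor(ExtUs)
round(cm, 3)

# Supplying the same object twice yields the same matrix
all.equal(cm, cor(ExtUs, ExtUs))
```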
The scatter plot matrix of the ExtUs data can be created using the
pairs() function discussed in Unit 5 of MST-015 course as follows:
From the correlation matrix and scatter plots, it can be seen that the highest
correlation is between the Assault and Murder variables and that the lowest
correlation is between the UrbanPop and Murder variables.
urd
Note: After studying this unit, learners can also compute other statistical
measures such as percentiles, quantiles and deciles, and measures of variation such
as the coefficient of variation and other measures, on their own.
SAQ 7
Consider the following women data available in the datasets package and
write R code to do the following tasks:
(i) Extract the first 10 rows of the data frame and assign it under the name
ExtWo.
(ii) Compute the Karl Pearson’s correlation between the columns of ExtWo.
(iii) Create scatter plot of all the variables of ExtWo in single plot.
9.9 SUMMARY
The main points discussed in this unit are as follows:
Methods of computing different measures of central tendency such as
arithmetic mean, geometric mean, harmonic mean, median and mode
are discussed.
(i) Arithmetic mean for the grouped data can be computed using
weighted.mean() function.
(ii) The cov() function can be used to obtain the variance of the
ungrouped data.
(iii) The R statements sd(x) and sqrt(var(x)), where x is a vector,
will give different outputs.
9.11 SOLUTIONS/ANSWERS
Self-Assessment Questions (SAQs)
1. (i) In this problem we are asked to write code to compute the
geometric mean of the given discrete frequency data. Clearly, in this
problem we don’t need to write any statement to compute the middle
values as the xi’s are already given. So, we first assign the xi’s and fi’s to x and
f, respectively.
x <- c(x1,x2,x3,x4,x5)
f <- c(f1,f2,f3,f4,f5)
Then, the total frequency can be obtained using the sum() function as
follows:
N <- sum(f)
Finally, the geometric mean of the data can be computed using one of
the following two formulae:
(prod(x^f))^(1/N)
Or,
exp(weighted.mean(x=log(x), w=f/N))
(ii) The output of the given statement is 3.5.
2. To compute the median of the ungrouped data a function named
MEDIAN can be created as follows:
MEDIAN <- function(x){
n <- length(x)
sx <- sort(x)
if(n%%2==0) {
cat('Median', (sx[n/2]+sx[(n/2)+1])/2,'\n')
} else {
cat('Median', sx[(n+1)/2],'\n')}
}
3. The which() function is available in the base package. It gives us
TRUE indices of a logical object. The object can be a vector or an array.
For example, the following code will give the indices as 4 and 5 since the
elements greater than 13 are present at the 4th and 5th indices.
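An example consistent with this description can be sketched as follows; the vector `z` is hypothetical, chosen so that the elements greater than 13 sit at the 4th and 5th positions.

```r
# Hypothetical vector, used only for illustration
z <- c(10, 12, 13, 14, 15)

z > 13              # logical vector: FALSE FALSE FALSE TRUE TRUE
idx <- which(z > 13)   # positions at which the condition is TRUE
idx
```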
4. We first create a vector of the data and assign it under the name S as
follows:
S<-c(116, 151, 116, 179, 141, 197, 191, 197, 160,
175, 162, 137, 122, 194, 128, 115, 140, 165, 123,
151, 179, 178, 189, 185, 143, 152, 195, 152, 117,
165, 199, 163, 173, 178, 172, 173, 179, 159, 191,
158)
The sd() function uses (n-1) in the denominator, whereas (9.13) uses n. So,
to compute the standard deviation as per (9.13) using the sd() function, we
require the length of the data, i.e., n. Therefore, we compute it
using the length() function and assign its value under the name n as
follows:
n<-length(S); n
Finally, we can compute the standard deviation of the data using the value
of n and the sd() function using the following statement.
sd(S)*sqrt((n-1)/n)
5. The outputs of the given statements are as follows:
(i) 2 2 -5
(ii) 4 -3
The difference between the two statements is that, in (i) we have the
na.omit() function used on the vector and the lag argument is 1, but
in (ii), we do not have the na.omit() function and the lag argument is
specified as 3.
So, in (i) firstly the NA values will be removed due to the na.omit()
function and we get the vector with elements 2, 4, 6 and 1. Next, as
lag=1, the consecutive differences between the terms will be computed
as 2, 2, -5.
In the next statement lag is 3, so the differences will be computed as 6-2
and 1-4 (with gap between the observations as 3). Although the NA is not
removed from the data, neither of these differences involves it, so the
output will be 4 and -3.
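The two outputs can be checked with the following sketch. The na.rm=TRUE part of the original statements is left out here on the assumption that it has no effect on the default diff() method; the names `d1` and `d2` are illustrative.

```r
# (i) NA removed first, then consecutive differences (lag = 1)
d1 <- diff(na.omit(c(2, 4, NA, 6, 1)), lag = 1)
as.numeric(d1)

# (ii) NA kept, differences between elements three positions apart (lag = 3)
d2 <- diff(c(2, 4, NA, 6, 1), lag = 3)
d2
```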
6. We first assign the given data to a vector named x as follows:
x <- c(0.7,1.2,-0.5,0.8,-0.1,0.6,-1.8,0.9,-2.8,0.2)
Then by writing the following two statements we can easily compute the
coefficient of skewness.
n <- length(x)
mean((x-mean(x))^3)^2/mean((x-mean(x))^2)^3
7. First, we extract the data using the following statement.
ExtWo <- women[1:10,]
Then the Pearson’s correlation coefficient can be computed using the
cor() function in following manner.
cor(ExtWo)
Or, cor(ExtWo$height, ExtWo$weight)
Lastly, we can create the scatter plot using the following plot() function
command.
plot(ExtWo$height, ExtWo$weight, col="blue", cex=2)