Sem-IV

CLASS-1

The R environment
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. Among other things it has
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either directly at the computer
or on hardcopy, and
• a well developed, simple and effective programming language (called ‘S’)
which includes conditionals, loops, user defined recursive functions and input
and output facilities. (Indeed most of the system supplied functions are
themselves written in the S language.)
The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.
R commands, case sensitivity, etc.
Technically R is an expression language with a very simple syntax. It is case
sensitive as are most UNIX based packages, so A and a are different symbols and would
refer to different variables. Normally all alphanumeric symbols are allowed (and in
some countries this includes accented letters) plus ‘.’ and ‘_’, with the restriction that a
name must start with ‘.’ or a letter, and if it starts with ‘.’ the second character must
not be a digit. Names are effectively unlimited in length.
Elementary commands consist of either expressions or assignments. If an expression is
given as a command, it is evaluated, printed (unless specifically made invisible), and
the value is lost. An assignment also evaluates an expression and passes the value to a
variable but the result is not automatically printed.
Commands are separated either by a semi-colon (‘;’), or by a newline. Elementary
commands can be grouped together into one compound expression by braces (‘{’ and
‘}’). Comments can be put almost anywhere, starting with a hashmark (‘#’), everything
to the end of the line is a comment.
If a command is not complete at the end of a line, R will give a different prompt, by
default + on second and subsequent lines and continue to read input until the command
is syntactically complete. This prompt may be changed by the user.
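The rules above can be seen in a short session. This is only a sketch; the variable names a, A, b and total are illustrative and do not come from the notes.

```r
# Two commands on one line, separated by a semicolon.
# R is case sensitive, so a and A are different variables.
a <- 3; A <- 10

# An expression used as a command is evaluated and printed, but not stored
a + A

# Braces group several commands into one compound expression;
# its value is the value of the last command inside the braces
total <- {
  b <- a * 2   # everything after the hashmark is a comment
  b + A
}
total
```

An assignment such as total <- {...} prints nothing; typing the name on its own line shows the stored value.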
Using R as a simple calculator
> aa=45%%2    # %% gives the remainder after division
> aa
[1] 1
> aa1=45%/%2  # %/% gives the integer part of the quotient
> aa1
[1] 22
Vectors and assignment
R operates on named data structures. The simplest such structure is the numeric vector,
which is a single entity consisting of an ordered collection of numbers. To set up a
vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7,
use the R command
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
This is an assignment statement using the function c() which in this context can take an
arbitrary number of vector arguments and whose value is a vector got by concatenating
its arguments end to end.
A number occurring by itself in an expression is taken as a vector of length one.
Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less
than”) and ‘-’ (“minus”) occurring strictly side-by-side, and it ‘points’ to the object
receiving the value of the expression. In most contexts the ‘=’ operator can be used as
an alternative.
Assignment can also be made using the function assign(). An equivalent way of making
the same assignment as above is with:
> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
The usual operator, <-, can be thought of as a syntactic short-cut to this.
Assignments can also be made in the other direction, using the obvious change in the
assignment operator. So the same assignment could be made using
> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
If an expression is used as a complete command, the value is printed and lost. So now
if we were to use the command
> 1/x
the reciprocals of the five values would be printed at the terminal (and the value of x, of
course, unchanged).
The further assignment
> y <- c(x, 0, x)
would create a vector y with 11 entries consisting of two copies of x with a zero in the
middle place.
Negative indices can be used to exclude certain elements.
For example, we can select all but the second and fifth elements of x.
> x=1:40
> y=x[-c(2,5)]
> y
[1] 1 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[32] 34 35 36 37 38 39 40
The third through fifth elements of x can be excluded in the same way.
> z=x[-(3:5)]
> z
[1] 1 2 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[32] 35 36 37 38 39 40

Extracting some elements of x.

> s1=x[4:8]
> s1
[1] 4 5 6 7 8
> x1=c(a=4,b=7,9)
> x2=c(3,8,9)
> names(x2)=c("a","b")
> class(x1)
[1] "numeric"
> x3=c('a',3,4)
> x1;x2;x3
a b
4 7 9
a b <NA>
3 8 9
[1] "a" "3" "4"

Vectors can be combined via the function c(). For example, the following two vectors
n and s are combined into a new vector containing elements from both vectors.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")

Combining numeric and character vector


> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Resultant vector is character type.
x4=c(3,'b',5)
> x4
[1] "3" "b" "5"

Combining numeric, character and logical vector


> xx1=c('a',TRUE,5)
> xx1
[1] "a" "TRUE" "5"
Resultant vector is character type.
Combining numeric and logical vector

> xx2=c(TRUE,5,8)
> xx2
[1] 1 5 8

Resultant vector is numeric type (TRUE becomes 1, FALSE becomes 0).


> xx2=c(FALSE,5,8)
> xx2
[1] 0 5 8

CLASS-2
Character vectors
Scalars and vectors can be made up of strings of characters. All elements of a vector
must be of the same type.
There are two basic operations we want to perform on character vectors.
The first is substr(), which takes the arguments substr(x,start,stop), where x is a
vector of character strings, and start and stop say which characters to keep. For
example, to print the first two letters of each element of cols, use
> cols=c("red","blank","blue","white","green")
> substr(cols,1,2)
[1] "re" "bl" "bl" "wh" "gr"

The second operation is paste(), which builds up strings by concatenation.

> paste(cols,"flowers")
[1] "red flowers" "blank flowers" "blue flowers" "white flowers" "green flowers"
There are two optional parameters to paste(). The sep parameter
controls what goes between the components being pasted together.
We might not want the default space.

> paste(cols,"flowers",sep="")
[1] "redflowers" "blankflowers" "blueflowers" "whiteflowers" "greenflowers"
> paste("beautiful",cols,"flowers",sep="")
[1] "beautifulredflowers" "beautifulblankflowers" "beautifulblueflowers"
[4] "beautifulwhiteflowers" "beautifulgreenflowers"
> paste("beautiful",cols,"flowers")
[1] "beautiful red flowers" "beautiful blank flowers" "beautiful blue flowers"
[4] "beautiful white flowers" "beautiful green flowers"

The collapse parameter to paste() allows all the components of the resulting vector to
be collapsed into a single string.
> paste("beautiful",cols,"flowers",collapse = ",")
[1] "beautiful red flowers,beautiful blank flowers,beautiful blue flowers,beautiful white flowers,beautiful green flowers"

Scalar multiplication, addition, power or use of any other operator on a numeric vector

> v1=1:10
> v2=c(v1)
> v3=v2[c(1:10)]+5
> v3
[1] 6 7 8 9 10 11 12 13 14 15
> v4=v2[c(1:10)]*2
> v4
[1] 2 4 6 8 10 12 14 16 18 20
> v5=v2[c(1:10)]^2
> v5
[1] 1 4 9 16 25 36 49 64 81 100

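Creating upper limits from lower limits

> ll=seq(0,80,10)  # lower limits; assumed here to match the output, the notes do not show how ll was defined
> ul=ll+10
> ul
[1] 10 20 30 40 50 60 70 80 90

> uL=ll[]+10  # ll[] selects every element, so this is the same as ll+10
> uL
[1] 10 20 30 40 50 60 70 80 90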

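How to convert character data into numeric codes. Factors offer an alternative way to
store character data. A factor can be an efficient way of storing character data when
there are repeats among the vector elements. Applying as.integer() to the character
vector itself gives only NA values, whereas applying it to the factor gives the integer
code of each level.

> factor(cols)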
[1] red blank blue white green
Levels: blank blue green red white
> as.integer(cols)
[1] NA NA NA NA NA
> g1=factor(cols)
> as.integer(g1)
[1] 4 1 2 5 3
> levels(g1)
[1] "blank" "blue" "green" "red" "white"
> levels(g1)[1]="black"
> g1
[1] red black blue white green
Levels: black blue green red white
As shown above, the levels() function can be used to change factor labels as well as
to list them.
> f1=factor(rep(c("g1","g2"),c(10,12)))
> f1
[1] g1 g1 g1 g1 g1 g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2
Levels: g1 g2
> f2=as.integer(f1)
> f2
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2

> a1=rep(c(0,1),c(3,4))
> a2=rep(a1,2)
> a2
[1] 0 0 0 1 1 1 1 0 0 0 1 1 1 1
> a3=factor(a2)
> a3
[1] 0 0 0 1 1 1 1 0 0 0 1 1 1 1
Levels: 0 1
> levels(a3)=c("Male","Female")
> a3
[1] Male Male Male Female Female Female Female Male Male Male Female
[12] Female Female Female
Levels: Male Female
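One point that follows from the as.integer() examples above: when the labels of a factor look like numbers, as.integer() still returns the internal level codes, not the numbers themselves. A minimal sketch, using a made-up factor f that is not part of the notes:

```r
# A factor whose labels happen to look like numbers (illustrative data)
f <- factor(c("10", "20", "10", "30"))

codes  <- as.integer(f)                # internal level codes: 1 2 1 3
values <- as.numeric(as.character(f))  # the labels as numbers: 10 20 10 30
same   <- as.numeric(levels(f))[f]     # an equivalent way to get the values

codes
values
```

Converting through as.character() first is the usual way to recover the original numbers.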

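To read data typed at the console, give scan() the empty file name "", or call it without any
arguments. Note that what="int" below is a character string, so the values are read as character
strings; use what=integer() to read them as integers.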

> x <- scan("",what="int")


1: 43 #input 43 from the screen
2:
Read 1 item
> x
[1] "43"

> x <- scan("",what="int")


1: 43 #input 43 from the screen
2: 22
3: 67
4:
Read 3 items
> x
[1] "43" "22" "67"

Large data can be scanned in by copy and paste, for example from Excel.
> x <- scan()
Then use "ctrl+v" to paste the data. When no what argument is given, scan() expects numeric values.

The scan() function reads data from the screen or from a file.

scan(file = "", what = double(), sep = "")

• file: the name of a file; if "", input is read from stdin (the console)

• what: the type of data to read, including logical, integer, numeric, complex, character, raw, ...

• sep: the character that separates data values (e.g. "," or ":"); the default "" means whitespace
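The effect of the what and sep arguments can be tried without typing at the console by passing the input through the text argument of scan(). This is a sketch only; the variable names x_int, x_chr and x_csv are illustrative.

```r
# what = integer() reads the values as integers
x_int <- scan(text = "43 22 67", what = integer(), quiet = TRUE)

# what = "int" is itself a character string, so the values are read as strings
x_chr <- scan(text = "43 22 67", what = "int", quiet = TRUE)

# sep = "," splits on commas; the default what = double() gives numeric values
x_csv <- scan(text = "1.5,2.5,4", sep = ",", quiet = TRUE)

typeof(x_int)
typeof(x_chr)
typeof(x_csv)
```

The quiet = TRUE argument only suppresses the "Read n items" message.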

Vector arithmetic
Vectors can be used in arithmetic expressions, in which case the operations are
performed element by element. Vectors occurring in the same expression need not all
be of the same length. If they are not, the value of the expression is a vector with the
same length as the longest vector which occurs in the expression. Shorter vectors in
the expression are recycled as often as need be (perhaps fractionally) until they match
the length of the longest vector. In particular a constant is simply repeated. So with
the above assignments the command
> v <- 2*x + y + 1
generates a new vector v of length 11 constructed by adding together, element by
element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.
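The recycling described above can be checked directly. The sketch below redefines x and y as in the earlier assignments; v_explicit is an illustrative name for the same calculation with the recycling written out by hand.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)   # length 5
y <- c(x, 0, x)                     # length 11

# R recycles the shorter operands but warns here, because 11 is not a
# multiple of 5; suppressWarnings() only keeps the output quiet
v <- suppressWarnings(2*x + y + 1)
length(v)

# The same result with the recycling written out explicitly
v_explicit <- rep(2*x, length.out = 11) + y + rep(1, 11)
all.equal(v, v_explicit)
```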
The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a
power. In addition all of the common arithmetic functions are
available. log, exp, sin, cos, tan, sqrt, and so on, all have their usual meaning.
max and min select the largest and smallest elements of a vector respectively.
range is a function whose value is a vector of length two,
namely c(min(x), max(x)).
length(x) is the number of elements in x,
sum(x) gives the total of the elements in x, and
prod(x) their product.
Two statistical functions are mean(x) which calculates the sample mean,
which is the same as sum(x)/length(x), and
var(x) which gives
sum((x-mean(x))^2)/(length(x)-1)
or sample variance. If the argument to var() is an n-by-p matrix the value is a p-by-p sample covariance
matrix got by regarding the rows as independent p-variate sample vectors.
sort(x) returns a vector of the same size as x with the elements arranged in increasing order; however there
are other more flexible sorting facilities available.
Note that max and min select the largest and smallest values in their arguments, even if they are given
several vectors. The parallel maximum and minimum functions pmax and pmin return a vector (of length
equal to their longest argument) that contains in each element the largest (smallest) element in that position
in any of the input vectors.
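A small sketch of the difference between max/min and pmax/pmin, and of the variance formula given above. The vectors u and w are made up for illustration.

```r
u <- c(3, 9, 2)
w <- c(5, 1, 7)

max(u, w)     # largest value over both vectors: 9
pmax(u, w)    # position-by-position maximum: 5 9 7
pmin(u, w)    # position-by-position minimum: 3 1 2
range(u)      # c(min(u), max(u)): 2 9

# var() agrees with the formula sum((x-mean(x))^2)/(length(x)-1)
manual_var <- sum((u - mean(u))^2) / (length(u) - 1)
all.equal(var(u), manual_var)
```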
For most purposes the user will not be concerned if the “numbers” in a numeric vector
are integers, reals or even complex. Internally calculations are done as double
precision real numbers, or double precision complex numbers if the input data are
complex.
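The length of a vector can be changed by assigning to length(); here cc1 is truncated to
the length of cc2:
> cc1<- c(5,8,9,10,25)
> cc2<- c(7,8,9,12)
> length(cc1)=length(cc2);cc1
[1] 5 8 9 10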
Generating regular sequences
R has a number of facilities for generating commonly used sequences of numbers. For
example 1:30 is the vector c(1, 2, …, 29, 30). The colon operator has high priority
within an expression, so, for example 2*1:15 is the vector c(2, 4, …, 28, 30). Put n
<- 10 and compare the sequences 1:n-1 and 1:(n-1).

The construction 30:1 may be used to generate a sequence backwards.
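The comparison suggested above can be run as follows; it shows that the colon operator is applied before subtraction and multiplication. The names s_a, s_b, s_c and s_d are illustrative.

```r
n <- 10

s_a <- 1:n-1     # read as (1:n) - 1, giving 0, 1, ..., 9
s_b <- 1:(n-1)   # giving 1, 2, ..., 9
s_c <- 2*1:15    # read as 2 * (1:15), giving 2, 4, ..., 30
s_d <- 30:1      # a backwards sequence 30, 29, ..., 1

s_a
s_b
```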


The function seq() is a more general facility for generating sequences. It has five
arguments, only some of which may be specified in any one call. The first two
arguments, if given, specify the beginning and end of the sequence, and if these are
the only two arguments given the result is the same as the colon operator. That
is seq(2,10) is the same vector as 2:10.
Arguments to seq(), and to many other R functions, can also be given in named form,
in which case the order in which they appear is irrelevant. The first two arguments
may be named from=value and to=value; thus seq(1,30), seq(from=1,
to=30) and seq(to=30, from=1) are all the same as 1:30. The next two arguments
to seq() may be named by=value and length=value, which specify a step size and a
length for the sequence respectively. If neither of these is given, the default by=1 is
assumed.
For example
> seq(-5, 5, by=.2) -> s3
> s3
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0
[17] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2
[33] 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
[49] 4.6 4.8 5.0
Similarly,
> s4 <- seq(length=51, from=-5, by=.2); s4
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0
[17] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2
[33] 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
[49] 4.6 4.8 5.0
generates the same vector in s4.
The fifth argument may be named along=vector, which is normally used as the only
argument to create the sequence 1, 2, …, length(vector), or the empty sequence if
the vector is empty (as it can be).
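A short sketch of the along argument, and of why it is safer than 1:length(vector) when the vector may be empty. The vector v is illustrative; seq_along() is the base R shortcut for the same thing.

```r
v <- c(10.4, 5.6, 3.1)

seq(along = v)               # 1 2 3
seq_along(v)                 # the same sequence

empty <- numeric(0)
seq(along = empty)           # an empty sequence, as described above
1:length(empty)              # 1 0, which is usually not what is wanted

seq(0, 1, length = 5)        # 0.00 0.25 0.50 0.75 1.00
```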
A related function is rep() which can be used for replicating an object in various
complicated ways. The simplest form is
> s5 <- rep(x, times=5)
which will put five copies of x end-to-end in s5. Another useful version is
> s6 <- rep(x, each=5)
which repeats each element of x five times before moving on to the next.

s5<- rep(seq(4,24,4),2);s5
[1] 4 8 12 16 20 24 4 8 12 16 20 24

A few examples of seq() and rep()

> x=c(2,3,4)
> y=c(x,6,7,8)
> x5=rep(c(4,5),c(5,5))
> x5
[1] 4 4 4 4 4 5 5 5 5 5
> x5=rep(c(4,5),each=5)
> x5
[1] 4 4 4 4 4 5 5 5 5 5
> length(y)
[1] 6
> v1=1:11
> v2=seq(1,60,5)
> v2
[1] 1 6 11 16 21 26 31 36 41 46 51 56
> length(v2)
[1] 12
> v1=c(v1,12)
> fre=c(5,15,25,40,28,13,3)
> marks=c(1,2,3,4,5,6,7)
> ch=rep(marks,fre)  # repeat each mark as many times as its frequency
> z=c(1,4,5,NA,8,9,NA)
> z
[1] 1 4 5 NA 8 9 NA
> is.na(z)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
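The result of is.na() is a logical vector, so it can be used to count or drop the missing values. A minimal sketch using the same z; the names n_missing and z_clean are illustrative.

```r
z <- c(1, 4, 5, NA, 8, 9, NA)

n_missing <- sum(is.na(z))       # TRUE counts as 1, so this counts the NAs
z_clean   <- z[!is.na(z)]        # keep only the non-missing values

mean(z)                          # NA, because missing values propagate
mean(z, na.rm = TRUE)            # mean of the non-missing values
```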
2. Basic Data Types
We look at some of the ways that R can store and organize data. This is a basic
introduction to a small subset of the different data types recognized by R and is not
comprehensive in any sense. The main goal is to demonstrate the different kinds of
information R can handle. It is assumed that you know how to enter data or read data
files which is covered in the first chapter.

Variable Types
2.1.1. Numbers
The way to work with real numbers has already been covered in the first chapter and is
briefly discussed here. The most basic way to store a number is to make an assignment
of a single number:

> a <- 3
>
The “<-” tells R to take the number to the right of the symbol and store it in a variable
whose name is given on the left. You can also use the “=” symbol. When you make an
assignment R does not print out any information. If you want to see what value a
variable has just type the name of the variable on a line and press the enter key:

> a
[1] 3
This allows you to do all sorts of basic operations and save the numbers:

> b <- sqrt(a*a+3)


> b
[1] 3.464102
If you want to get a list of the variables that you have defined in a particular session you
can list them all using the ls command:

> ls()
[1] "a" "b"
You are not limited to just saving a single number. You can create a vector (an ordered
collection of numbers) using the c command:

> a <- c(1,2,3,4,5)


> a
[1] 1 2 3 4 5
> a+1
[1] 2 3 4 5 6
> mean(a)
[1] 3
> var(a)
[1] 2.5
You can get access to particular entries in the vector in the following manner:

> a <- c(1,2,3,4,5)


> a[1]
[1] 1
> a[2]
[1] 2
> a[0]
numeric(0)
> a[5]
[1] 5
> a[6]
[1] NA
Note that indexing with zero returns an empty vector of the same type, which indicates how
the data is stored. The first entry in the vector is the first number, and if you try to get
a number past the last number you get “NA.”

Examples of the sort of operations you can do on vectors are given in a later chapter.

To initialize a vector of numbers the numeric command can be used. For example, to
create a vector of 10 numbers, initialized to zero, use the following command:

> a <- numeric(10)


> a
[1] 0 0 0 0 0 0 0 0 0 0
If you wish to determine the data type used for a variable, use the typeof command:

> typeof(a)
[1] "double"
2.1.2. Strings
You are not limited to just storing numbers. You can also store strings. A string is
specified by using quotes. Both single and double quotes will work:

> a <- "hello"


> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"
The name of the type given to strings is character:

> typeof(a)
[1] "character"
> a = character(20)
> a
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
2.1.3. Factors
Another important way R can store data is as a factor. Often times an experiment
includes trials for different levels of some explanatory variable. For example, when
looking at the impact of carbon dioxide on the growth rate of a tree you might try to
observe how different trees grow when exposed to different preset concentrations of
carbon dioxide. The different levels are also called factors.

Assuming you know how to read in a file, we will look at the data file given in the first
chapter. Several of the variables in the file are factors:

> summary(tree$CHBR)
A1 A2 A3 A4 A5 A6 A7 B1 B2 B3 B4 B5 B6 B7 C1 C2 C3 C4 C5 C6
3 1 1 3 1 3 1 1 3 3 3 3 3 3 1 3 1 3 1 1
C7 CL6 CL7 D1 D2 D3 D4 D5 D6 D7
1 1 1 1 1 3 1 1 1 1
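Because the set of options given in the data file corresponding to the “CHBR” column
are not all numbers, R treats the column as a factor when the file is read with
stringsAsFactors = TRUE (the default before R 4.0.0; later versions read such columns as
character vectors unless asked otherwise). When you use summary on a factor it does not
print out the five point summary; rather it prints out the possible values and the
frequency with which they occur.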

In this data set several of the columns are factors, but the researchers used numbers to
indicate the different levels. For example, the first column, labeled “C,” is a factor. Each
tree was grown in an environment with one of four different possible levels of carbon
dioxide. The researchers quite sensibly labeled these four environments as 1, 2, 3, and
4. Unfortunately, R cannot determine that these are factors and must assume that they
are regular numbers.

This is a common problem and there is a way to tell R to treat the “C” column as a set of
factors. You specify that a variable is a factor using the factor command. In the following
example we convert tree$C into a factor:

> tree$C
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 2.519 3.000 4.000
> tree$C <- factor(tree$C)
> tree$C
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
1 2 3 4
8 23 10 13
> levels(tree$C)
[1] "1" "2" "3" "4"
Once a vector is converted into a set of factors then R treats it differently. A set of
factors have a discrete set of possible values, and it does not make sense to try to find
averages or other numerical descriptions. One thing that is important is the number of
times that each factor appears, called their “frequencies,” which is printed using the
summary command.

Logical vectors
As well as numerical vectors, R allows manipulation of logical quantities. The
elements of a logical vector can have the values TRUE, FALSE, and NA (for “not
available”, see below). The first two are often abbreviated as T and F, respectively.
Note however that T and F are just variables which are set to TRUE and FALSE by
default, but are not reserved words and hence can be overwritten by the user. Hence,
you should always use TRUE and FALSE.
Logical vectors are generated by conditions. For example
> temp <- x > 13
sets temp as a vector of the same length as x with values FALSE corresponding to
elements of x where the condition is not met and TRUE where it is.
The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In
addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection
(“and”), c1 | c2 is their union (“or”), and !c1 is the negation of c1.
Logical vectors may be used in ordinary arithmetic, in which case they
are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1. However
there are situations where logical vectors and their coerced numeric counterparts are
not equivalent, for example see the next subsection.
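A brief sketch of how a logical vector is generated by a condition, coerced in arithmetic, and used to pick out elements. It reuses the five-number vector x from the earlier example; n_large, large and middle are illustrative names.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

temp <- x > 13             # FALSE FALSE FALSE FALSE TRUE
n_large <- sum(temp)       # TRUE is coerced to 1, so this counts matches

large  <- x[temp]          # elements where the condition holds
middle <- x[x > 5 & x < 11]   # combine conditions with & ("and")

temp
n_large
```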
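Note that there is a difference between operators that act element by element on a vector
(| and &) and operators that work on single values (|| and &&). The || and && operators
expect operands of length one; since R 4.3.0, giving them longer vectors is an error, so
single elements are selected below:

> a = c(TRUE,FALSE)
> b = c(FALSE,FALSE)
> a|b
[1] TRUE FALSE
> a[1]||b[1]
[1] TRUE
> xor(a,b)
[1] TRUE FALSE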
There are a large number of functions that test to determine the type of a variable. For
example the is.numeric function can determine if a variable is numeric:
> a = c(1,2,3)
> is.numeric(a)
[1] TRUE
> is.factor(a)
[1] FALSE
2.1.4. Data Frames
A data frame is a table or a two-dimensional array-like structure in which
each column contains values of one variable and each row contains one set
of values from each column. It stores data in an easily viewed rectangular
grid: each row corresponds to the measurements or values of one instance,
while each column is a vector containing data for a specific variable. Following
are the characteristics of a data frame.

• The column names should be non-empty.

• The row names should be unique.

• The data stored in a data frame can be of numeric, factor or character type.

• Each column should contain the same number of data items.

Useful Functions for Exploring Data Frames

> b1=1:5
> b2=c(12.5,23,19,34,16)
> b3=c("a","b","c","d","e")
> b4=as.Date(c("2014-01-18","2005-06-12","2007-09-27","2006-10-09","2011-05-21"))
> # stringsAsFactors=TRUE stores b3 as a factor (the default behaviour before R 4.0.0)
> d1=data.frame(b1,b2,b3,b4,stringsAsFactors=TRUE);d1
b1 b2 b3 b4
1 1 12.5 a 2014-01-18
2 2 23.0 b 2005-06-12
3 3 19.0 c 2007-09-27
4 4 34.0 d 2006-10-09
5 5 16.0 e 2011-05-21

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns). The
output is a vector.

> dim(d1)
[1] 5 4
Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can
get the same information by extracting the first and second element of the output vector from
dim().

> nrow(d1)  # same as dim(d1)[1]
[1] 5
> ncol(d1)  # same as dim(d1)[2]
[1] 4

The names() function will return the column headers


> names(d1)
[1] "b1" "b2" "b3" "b4"
> d1$b1
[1] 1 2 3 4 5

Use head() to obtain the first n observations and tail() to obtain the last n observations; by default,
n = 6. These are good commands for obtaining an intuitive idea of what the data look like without
revealing the entire data set, which could have millions of rows and thousands of columns.

> head(d1)
b1 b2 b3 b4
1 1 12.5 a 2014-01-18
2 2 23.0 b 2005-06-12
3 3 19.0 c 2007-09-27
4 4 34.0 d 2006-10-09
5 5 16.0 e 2011-05-21
> class(d1)
[1] "data.frame"

The str() function returns many useful pieces of information, including the above outputs
and the type of data in each column. In this example, “int” denotes that the variable “b1”
holds integers, “num” denotes that “b2” is numeric (continuous), “Factor” denotes that “b3”
is categorical with 5 categories or levels, and “Date” denotes that “b4” holds dates.

> str(d1)
'data.frame': 5 obs. of 4 variables:
$ b1: int 1 2 3 4 5
$ b2: num 12.5 23 19 34 16
$ b3: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ b4: Date, format: "2014-01-18" "2005-06-12" ...
>
To obtain all of the categories or levels of a categorical variable, use the levels() function.
> levels(d1$b3)
[1] "a" "b" "c" "d" "e"

When applied to a data frame, the summary() function is essentially applied to each column, and
the results for all columns are shown together. For a continuous (numeric) variable like “b2”,
it returns the minimum, the quartiles, the median, the mean and the maximum. If there are any
missing values (denoted by “NA” for a particular datum), it also provides a count of them. In this
example there are no missing values, so no count of NA’s is displayed. For a categorical variable
like “b3”, it returns the levels and the number of observations in each level. The first three
columns are summarized here:
> summary(d1[,1:3])
b1 b2 b3
Min. :1 Min. :12.5 a:1
1st Qu.:2 1st Qu.:16.0 b:1
Median :3 Median :19.0 c:1
Mean :3 Mean :20.9 d:1
3rd Qu.:4 3rd Qu.:23.0 e:1
Max. :5 Max. :34.0

Another way that information is stored is in data frames. This is a way to take many
vectors of different types and store them in the same variable. For example, a data
frame may contain many columns, and each column might be a vector of factors, strings,
or numbers.

There are different ways to create and manipulate data frames. Most are beyond the
scope of this introduction. They are only mentioned here to offer a more complete
description. Please see the first chapter for more information on data frames.

One example of how to create a data frame is given below:

> a <- c(1,2,3,4)


> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a,
second=b,
f=levels)
> bubba
first second f
1 1 2 A
2 2 4 B
3 3 6 A
4 4 8 B
> summary(bubba)
first second f
Min. :1.00 Min. :2.0 A:2
1st Qu.:1.75 1st Qu.:3.5 B:2
Median :2.50 Median :5.0
Mean :2.50 Mean :5.0
3rd Qu.:3.25 3rd Qu.:6.5
Max. :4.00 Max. :8.0
> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B
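Besides the $ notation, rows and columns of a data frame can be selected with [row, column] indexing. A minimal sketch that rebuilds the same bubba data frame; the column name third and the variable names row2 and second_A are illustrative additions.

```r
bubba <- data.frame(first  = c(1, 2, 3, 4),
                    second = c(2, 4, 6, 8),
                    f      = factor(c("A", "B", "A", "B")))

row2     <- bubba[2, ]                       # the second row, all columns
second_A <- bubba[bubba$f == "A", "second"]  # "second" where f is "A"

# A new column can be added by assigning to a name that does not exist yet
bubba$third <- bubba$first + bubba$second

second_A
bubba
```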
subset() function

The subset function is available in base R and can be used to return subsets of a vector, matrix, or data frame
which meet a particular condition. Some examples of the subset function follow.

> numvec = c(2,5,8,9,0,6,7,8,4,5,7,11)


> charvec = c("David","James","Sara","Tim","Pierre",
+ "Janice","Sara","Priya","Keith","Mark",
+ "Apple","Sara")
> gender = c("M","M","F","M","M","M","F","F","F","M","M","F")
> state = c("CO","KS","CA","IA","MO","FL","CA","CO","FL","CA","WY","AZ")
>
> subset(numvec, numvec > 7)
[1] 8 9 8 11
> subset(numvec, numvec < 9 & numvec > 4)
[1] 5 8 6 7 8 5 7
> subset(numvec, numvec < 3 |numvec > 9)
[1] 2 0 11
>
> df_n = data.frame(var1=c(numvec), var2=c(charvec),
+ gender=c(gender), state=c(state))
>
> subset(df_n, var1 < 5)
var1 var2 gender state
1 2 David M CO
5 0 Pierre M MO
9 4 Keith F FL
> subset(df_n, var2 == "Sara")
var1 var2 gender state
3 8 Sara F CA
7 7 Sara F CA
12 11 Sara F AZ
> subset(df_n, var1==5, select=c(var2, state))
var2 state
2 James KS
10 Mark CA
> subset(df_n, var2 != "Sara" & gender == "F" & var1 > 5)
var1 var2 gender state
8 8 Priya F CO

Matrices in R
Matrices are two-dimensional arrays that hold homogeneous data (all elements of the
same type) in a tabular format. We can perform arithmetic operations on some elements
of the matrix or on the whole matrix itself in R.
Let us see how to convert a one-dimensional vector into a two-dimensional
array using the matrix() function:
> v1_1<-seq(1,100)
> m1_1=matrix(v1_1,nrow = 10)
> m1_1
Matrices can represent the binding of two or more vectors of equal length. If
we have the X and Y coordinates for five quadrats within a grid, we can
use cbind()(combine by columns) or rbind() (combine by rows) functions to
combine them into a single matrix, as follows:

X <- c(16.92, 24.03, 7.61, 15.49, 11.77)


> Y <- c(8.37, 12.93, 16.65, 12.2, 13.12)
> XY <- cbind(X, Y);XY
X Y
[1,] 16.92 8.37
[2,] 24.03 12.93
[3,] 7.61 16.65
[4,] 15.49 12.20
[5,] 11.77 13.12

X <- c(16.92, 24.03, 7.61, 15.49, 11.77)


> Y <- c(8.37, 12.93, 16.65, 12.2, 13.12)
> XY <- rbind(X, Y);XY
[,1] [,2] [,3] [,4] [,5]
X 16.92 24.03 7.61 15.49 11.77
Y 8.37 12.93 16.65 12.20 13.12

4.1. Indexing Matrices in R


As with vectors, we can index matrices using vectors of positive integers,
negative integers, character strings and logical values.
The difference is that matrices have two dimensions (height and width), so
indexing requires a pair of indices, while vectors have one dimension (length),
so each element is indexed by a single number.
Matrix indices take the form [row.indices, col.indices].
Below are a few examples of matrix indexing:

m1_1[3, ] – It displays entire third row.

m1_1[, 2 ] – It displays entire second column.

m1_1[,-2] – It displays all columns except second


m1_1[3, ]
[1] 3 13 23 33 43 53 63 73 83 93
> m1_1[, 2 ]
[1] 11 12 13 14 15 16 17 18 19 20
> m1_1[,-2]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 21 31 41 51 61 71 81 91
[2,] 2 22 32 42 52 62 72 82 92
[3,] 3 23 33 43 53 63 73 83 93
[4,] 4 24 34 44 54 64 74 84 94
[5,] 5 25 35 45 55 65 75 85 95
[6,] 6 26 36 46 56 66 76 86 96
[7,] 7 27 37 47 57 67 77 87 97
[8,] 8 28 38 48 58 68 78 88 98
[9,] 9 29 39 49 59 69 79 89 99
[10,] 10 30 40 50 60 70 80 90 100
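A few more indexing forms on the same 10-by-10 matrix, as a sketch. The matrix is rebuilt here under the illustrative name m so that the example stands alone.

```r
m <- matrix(1:100, nrow = 10)    # filled column by column, like m1_1

m[3, 2]                # single element: row 3, column 2
m[1:2, c(1, 10)]       # rows 1-2 of columns 1 and 10, a 2 x 2 matrix
m[m[, 1] > 8, ]        # logical index: rows whose first entry exceeds 8
dim(m[, -2])           # dropping one column leaves 10 rows and 9 columns

# Selecting one row gives a plain vector; drop = FALSE keeps it a matrix
dim(m[3, , drop = FALSE])
```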
Lists in R
Lists are R data types that store collections of objects of differing lengths and types;
they are created using the list() function.
For example, we can create many isolated vectors, like temperature, shade, and
names, to represent data from a single experiment and group them as components
of a list object, as follows (SITE, TEMPERATURE and SHADE are assumed to have been
defined already):

> EXPERIMENT <- list(SITE = SITE, COORDINATES = paste(X, Y, sep = ","),
+ TEMPERATURE = TEMPERATURE, SHADE = SHADE)
The list created in the above example consists of four components:

• SITE, which is a character vector.

• COORDINATES, which is a character vector of XY coordinates for sites A, B, C, D, and E.

• TEMPERATURE, which is a numeric vector.

• SHADE, which holds the shade data.
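A self-contained sketch of the list above. X and Y are the quadrat coordinates used earlier; the values given to SITE, TEMPERATURE and SHADE are hypothetical, chosen only so that the code runs.

```r
X <- c(16.92, 24.03, 7.61, 15.49, 11.77)
Y <- c(8.37, 12.93, 16.65, 12.2, 13.12)
SITE        <- c("A", "B", "C", "D", "E")         # hypothetical values
TEMPERATURE <- c(21.5, 19.8, 23.1, 20.4, 22.0)    # hypothetical values
SHADE       <- c(TRUE, FALSE, FALSE, TRUE, TRUE)  # hypothetical values

EXPERIMENT <- list(SITE = SITE,
                   COORDINATES = paste(X, Y, sep = ","),
                   TEMPERATURE = TEMPERATURE,
                   SHADE = SHADE)

names(EXPERIMENT)                  # the four component names
EXPERIMENT$TEMPERATURE             # access a component by name with $
EXPERIMENT[["COORDINATES"]][1]     # first element of a component
```

Components are reached with $name or [[ ]]; single brackets [ ] would return a shorter list rather than the component itself.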

sample() function
The sample() function draws a random sample, so if you run it repeatedly you will get
different results every time. This is the correct behavior in most cases, but sometimes you may
want to get repeatable results every time you run the function.

> sample(numvec,5,replace=FALSE)
[1] 11 9 6 7 8
One can use sample() to take samples from a data frame (here the data frame df_n). In this case,
you may want to use the argument replace=FALSE. Because this is the default value of the replace
argument, you don’t need to write it explicitly:

Applying sample() function on data frame


> index<- sample(1:nrow(df_n),5,replace=FALSE);df_n[index, ]
var1 var2 gender state
7 7 Sara F CA
9 4 Keith F FL
1 2 David M CO
10 5 Mark M CA
4 9 Tim M IA
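One common way to get repeatable results from sample() is to fix the state of the random number generator with set.seed() before each call. A minimal sketch; the seed value 123 is arbitrary, and s1 and s2 are illustrative names.

```r
numvec <- c(2, 5, 8, 9, 0, 6, 7, 8, 4, 5, 7, 11)

set.seed(123)              # arbitrary seed value
s1 <- sample(numvec, 5)

set.seed(123)              # the same seed again
s2 <- sample(numvec, 5)

identical(s1, s2)          # the two samples are the same
```

Without the calls to set.seed(), the two samples would generally differ.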

Tables
Another common way to store information is in a table. Here we look at how to define
both one way and two way tables. We only look at how to create and define tables; the
functions used in the analysis of proportions are examined in another chapter.

2.2.1. One Way Tables


The first example is for a one way table. One way tables are not the most interesting
example, but it is a good place to start. One way to create a table is using the table
command. The argument it takes is a factor (or a vector), and it calculates the frequency
with which each level occurs. Here is an example of how to create a one way table:

> a <- factor(c("A","A","B","A","B","B","C","A","C"))


> results <- table(a)
> results
a
A B C
4 3 2
> attributes(results)
$dim
[1] 3

$dimnames
$dimnames$a
[1] "A" "B" "C"

$class
[1] "table"

> summary(results)
Number of cases in table: 9
Number of factors: 1
If you know the number of occurrences for each factor then it is possible to create the
table directly, but the process is, unfortunately, a bit more convoluted. There is an
easier way to define one-way tables (a table with one row), but it does not extend easily
to two-way tables (tables with more than one row). You must first create a matrix of
numbers. A matrix is like a vector in that it is a list of numbers, but it is different in that
you can have both rows and columns of numbers. For example, in our example above
the number of occurrences of “A” is 4, the number of occurrences of “B” is 3, and the
number of occurrences of “C” is 2. We will create one row of numbers. The first column
contains a 4, the second column contains a 3, and the third column contains a 2:
> occur <- matrix(c(4,3,2),ncol=3,byrow=TRUE)
> occur
     [,1] [,2] [,3]
[1,]    4    3    2
At this point the variable “occur” is a matrix with one row and three columns of numbers.
To dress it up and use it as a table we would like to give it a label for each column, just
like in the previous example. Once that is done we convert the matrix to a table using
the as.table command:

> colnames(occur) <- c("A","B","C")
> occur
     A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
  A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3

$dimnames
$dimnames[[1]]
[1] "A"

$dimnames[[2]]
[1] "A" "B" "C"

$class
[1] "table"
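
The text mentions an easier route for one-way tables without showing it. One possibility (an assumption here, not necessarily the route the author had in mind) is to pass a named numeric vector straight to as.table(), which yields a one-dimensional table. The names counts and occur1 are introduced only for this sketch.

```r
# Counts taken from the example above; the names become the category labels.
counts <- c(A = 4, B = 3, C = 2)
occur1 <- as.table(counts)

occur1
dim(occur1)   # a single dimension of length 3, unlike the 1 x 3 matrix version
```

As the text notes, this shortcut does not carry over to two-way tables, which is why the matrix route is shown in detail.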

2.2.2. Two Way Tables


If you want to add rows to your table, just pass another vector as an argument to the
table command. In the example below we have two questions. In the first question the
responses are labeled “Never,” “Sometimes,” or “Always.” In the second question the
responses are labeled “Yes,” “No,” or “Maybe.” The vectors “a” and “b” contain
the responses of each person. The third item in “a” is how the third person
responded to the first question, and the third item in “b” is how the third person
responded to the second question.

> a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1
The table command allows us to do a very quick calculation, and we can immediately
see that two people who said “Sometimes” to the first question also said “Maybe” to the
second question.
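
Individual cells of a two-way table can be read off by indexing with the row and column labels. The sketch below rebuilds the table from the two response vectors and extracts the number of people who answered “Sometimes” to the first question and “Maybe” to the second; the variable name n_sometimes_maybe is introduced here only for illustration.

```r
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
results <- table(a, b)

# Row label first (question 1), column label second (question 2).
n_sometimes_maybe <- results["Sometimes", "Maybe"]
n_sometimes_maybe
```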

Just as in the case of one-way tables, it is possible to enter two way tables manually.
The procedure is exactly the same as above except that we now have more than one
row. We give a brief example below to demonstrate how to enter a two-way table that
includes a breakdown of a group of people by both their gender and whether or not they
smoke. You enter all of the data as one long list but tell R to break it up into some
number of columns:

> sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
> rownames(sexsmoke)<-c("male","female")
> colnames(sexsmoke)<-c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
       smoke nosmoke
male      70     120
female    65     140

> x1=c(6,9,8)
> x2=c(12,11,15,18)
> x3=c(14,11,15,18,12,10)
> x4=c(12,8,11,9,10)
> z1=list(x,y,x1,x2,x3,x4)
> lapply(z1,mean)
[[1]]
[1] 12.6

[[2]]
[1] 12

[[3]]
[1] 7.666667

[[4]]
[1] 14

[[5]]
[1] 13.33333

[[6]]
[1] 10

> sapply(z1,mean)
[1] 12.600000 12.000000 7.666667 14.000000 13.333333 10.000000

> for (i in 0:7){
+ den_bin=(factorial(n)/(factorial(i)*factorial(n-i)))*(prob_succ^i)*(prob_fail^(n-i))
+ den_bin
+ print(den_bin)}
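
The loop above evaluates the binomial probability mass function one term at a time. As a cross-check, the same values can be obtained from the built-in dbinom(). The sketch assumes n = 7 and prob_succ = 0.3, neither of which is given in the notes, and takes prob_fail to be 1 - prob_succ.

```r
n         <- 7      # assumed number of trials (the loop runs over 0:7)
prob_succ <- 0.3    # assumed success probability, for illustration only
prob_fail <- 1 - prob_succ

k <- 0:n
by_formula <- choose(n, k) * prob_succ^k * prob_fail^(n - k)  # same formula, vectorised
by_dbinom  <- dbinom(k, size = n, prob = prob_succ)           # built-in density

all.equal(by_formula, by_dbinom)   # the two approaches agree
sum(by_formula)                    # probabilities over 0..n add up to 1
```

choose(n, k) computes the same binomial coefficient as the ratio of factorials used in the loop.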

Syntax for Writing Functions in R

func_name <- function (argument) {
statement
}

In a function call, the formal arguments are matched to the actual arguments
in positional order.
This means that in a call that passes 8 and 2, in that order, to a function with
formal arguments x and y, x is assigned 8 and y is assigned 2.
We can also call a function using named arguments.
When calling a function in this way, the order of the actual arguments
doesn’t matter. For example, passing x = 8, y = 2 is equivalent to passing
y = 2, x = 8.
Furthermore, we can use named and unnamed arguments in a single
call.
In such a case, all the named arguments are matched first and then the
remaining unnamed arguments are matched in positional order.
For example, passing 8 together with y = 2, or passing 2 together with x = 8,
gives x the value 8 and y the value 2 in both cases.
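
The matching rules can be tried out on a small function. The function pow below is hypothetical, introduced only to illustrate the rules; it is not one of the functions defined in these notes.

```r
# Hypothetical helper: raises x to the power y.
pow <- function(x, y) {
  x^y
}

pow(8, 2)           # positional: x = 8, y = 2
pow(x = 8, y = 2)   # named
pow(y = 2, x = 8)   # named, order does not matter
pow(8, y = 2)       # mixed: y is matched by name, 8 goes to x
pow(2, x = 8)       # mixed: x is matched by name, 2 goes to y
```

Every one of these calls evaluates 8^2 and returns 64.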

mf1<-function(x,y){
+ p1<- x^y
+ p2<-x^2+x^y+y^2+y*exp(x)
+ p3<-seq(x,y)
+ p4<-sample(y,x)
+ print(p1);print(p2);print(p3);print(p4)}
> mf1(2,10)
[1] 1024
[1] 1201.891
[1] 2 3 4 5 6 7 8 9 10
[1] 10 4

Default Values for Arguments

We can assign default values to the arguments of a function in R.
This is done by giving the formal argument a value in the function declaration.
An argument with a default value becomes optional when calling the function.
Here, x and y are optional and take the values 2 and 4 when not provided.
mf1<-function(x=2,y=4){
+ p1<- x^y
+ p3<-seq(x,y)
+ print(p1);print(p3)}
> mf1()
[1] 16
[1] 2 3 4
> mf1(3,4)
[1] 81
[1] 3 4

Example
v1_1<-seq(1,100)
mf1<-function(x){
m1=mean(x)
md=median(x)
m3=sum(x)
print(m1);print(md);print(m3)}
mf1(v1_1)

R if statement

IFELSE statement
res=c(30,48,59,68,18,59)
> ifelse(res<=40,"FAIL","PASS")
[1] "FAIL" "PASS" "PASS" "PASS" "FAIL" "PASS"

Compound IFELSE statement


Comparing all values of vector a and b.
> a=c(1,4,5)
> b=c(56,75,82)
> ifelse(a==4 & b>=60,"good","poor")
[1] "poor" "good" "poor"
> ifelse(a>=4 & b>=60,"good","poor")
[1] "poor" "good" "good"

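Comparing only the first value of vectors a and b. The && operator works on single
values: since R 4.3.0 it raises an error when given vectors longer than one (older
versions silently used only the first elements), so the first elements are selected
explicitly.

> ifelse(a[1]>=4 && b[1]>=60,"good","poor")
[1] "poor"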

The syntax of the if statement is:

if (test_expression) {
statement
}
If the test_expression is TRUE, the statement gets executed. But if
it’s FALSE, nothing happens.
Here, test_expression must be a single logical or numeric value. Older versions of R
accepted a longer vector and used only its first element; since R 4.2.0 a condition of
length greater than one is an error.
In the case of a numeric value, zero is taken as FALSE and any other number as TRUE.
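
A short sketch of how a numeric test expression is treated by if. The helper name describe is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: reports whether if() treats a value as TRUE or FALSE.
describe <- function(value) {
  result <- "treated as FALSE"
  if (value) {
    result <- "treated as TRUE"
  }
  result
}

describe(0)      # zero counts as FALSE
describe(3.5)    # any non-zero number counts as TRUE
describe(-1)     # negative numbers are non-zero, hence TRUE
```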

Flowchart of if statement
Example: if statement
x <- 5 ;x1<-20
> if(x < x1){
+ print(x1)
+ }
[1] 20

if…else statement
The syntax of if…else statement is:

if (test_expression) {

statement1

} else {

statement2

}
The else part is optional and is only evaluated
if test_expression is FALSE.
It is important to note that else must be on the same line as the closing
brace of the if statement.
if (x > x1) {
+ print(x)
+ } else {
+ print(x1)
+ }
[1] 20

Flowchart of if…else statement

An if…else conditional can also be written in a single line as follows.

> if(x >= 0) print("Non-negative number") else print("Negative number")
[1] "Non-negative number"

Because if…else is an expression that returns a value, R allows us to write constructs
such as the one shown below.

> x <- -2
> y <- if(x > 0) 1 else 0
> y
[1] 0

if…else Ladder
The if…else ladder (if…else…if) statement allows you to execute one block of
code among more than two alternatives.
The syntax of the if…else ladder is:

if (test_expression1) {
statement1
} else if (test_expression2) {
statement2
} else if (test_expression3) {
statement3
} else {
statement4
}
Only one statement will get executed, depending upon the
test_expressions.

Example of nested if…else


x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else{
print("Zero")}
[1] "Zero"

for(i in 1:30) {
if(i<=10){ # i-th element of `u1` squared into `i`-th position of `usq`
usq[i] <- u1[i]*u1[i]
print(usq[i])
}else if(i>10 && i<=20){ # elements 11 to 20 are doubled
usq[i] <- 2*u1[i]
print(usq[i])
}else{ # elements 21 to 30 are set to zero
usq[i]=0
}
}
usq
 [1]  1.199174587  0.001471373  1.488226525  0.641106497  0.001976277  0.146392429
 [7]  0.146102733  0.003291107  2.735986866  3.312243036 -0.906243243 -0.643024231
[13]  0.330566852 -5.501139218  0.762957134  2.840590767 -0.508421993  4.727894961
[19]  2.252671368 -2.228362085  0.000000000  0.000000000  0.000000000  0.000000000
[25]  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000

Use of curve function

curve(3*x+2, 0, 3,xlim=c(0,6))
par(new=TRUE)
curve(2*x-.5*x^2,3,6,xlim=c(0,6), col = "violet")
[Figure: plot of the two curves, 3*x+2 for x from 0 to 3 and 2*x-.5*x^2 for x from 3 to 6 (violet), with the x-axis running from 0 to 6.]
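
Note that par(new=TRUE) starts a fresh plot on top of the old one, so each curve is drawn against its own y-axis scale. An alternative sketch, assuming both pieces are meant to share one set of axes, uses the add = TRUE argument of curve(). The ylim range, the ylab label and the names piece1 and piece2 are choices made here, not taken from the notes.

```r
# First piece, drawn on axes wide enough for both curves.
# ylim = c(-6, 11) covers 3*x + 2 on [0, 3] and 2*x - 0.5*x^2 on [3, 6].
piece1 <- curve(3*x + 2, from = 0, to = 3,
                xlim = c(0, 6), ylim = c(-6, 11), ylab = "f(x)")

# Second piece added to the same axes instead of a new overlaid plot.
piece2 <- curve(2*x - 0.5*x^2, from = 3, to = 6,
                add = TRUE, col = "violet")
```

curve() invisibly returns the x and y values it plotted, which is why the results can be stored in piece1 and piece2 for inspection.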

u1 <- rnorm(30)
print("This loop calculates the square of the first 10 elements of vector u1")

# Initialize `usq`
usq <- 0
u1sum<-0
for(i in 1:10) {
# i-th element of `u1` squared into `i`-th position of `usq`
usq[i] <- u1[i]*u1[i]
print(usq[i])
}
[1] 1.199175
[1] 0.001471373
[1] 1.488227
[1] 0.6411065
[1] 0.001976277
[1] 0.1463924
[1] 0.1461027
[1] 0.003291107
[1] 2.735987
[1] 3.312243

for(i in 1:30) {
if(i<=10){# i-th element of `u1` squared into `i`-th position of `usq`
usq[i] <- u1[i]*u1[i]
print(usq[i])
}else{
usq[i] <- 2*u1[i]
print(usq[i])}
}
usq
[1] 1.199174587 0.001471373 1.488226525 0.641106497 0.001976277 0.146392429
[7] 0.146102733 0.003291107 2.735986866 3.312243036 -0.906243243 -0.643024231
[13] 0.330566852 -5.501139218 0.762957134 2.840590767 -0.508421993 4.727894961
[19] 2.252671368 -2.228362085 2.065822758 0.509759047 0.639117330 0.857311001
[25] 0.692543199 1.025886933 1.949234891 -1.185488173 -2.395137389 2.456806074
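
The loop above can also be written without explicit iteration, using the vectorised ifelse() shown earlier in these notes. The sketch draws its own u1 with an arbitrary seed (an assumption, since the notes do not set one), so its numbers will differ from the output shown above. The names idx, usq_vec and usq_loop are introduced only for this comparison.

```r
set.seed(1)                  # arbitrary seed, for repeatability only
u1  <- rnorm(30)
idx <- seq_along(u1)

# Square the first 10 elements, double the remaining ones.
usq_vec <- ifelse(idx <= 10, u1^2, 2 * u1)

# Same computation with the explicit loop, for comparison.
usq_loop <- numeric(length(u1))
for (i in idx) {
  if (i <= 10) {
    usq_loop[i] <- u1[i] * u1[i]
  } else {
    usq_loop[i] <- 2 * u1[i]
  }
}

all.equal(usq_vec, usq_loop)  # both approaches give the same vector
```

Pre-allocating usq_loop with numeric(length(u1)) avoids growing the vector one element at a time inside the loop.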
