
Finding help for functions and operators in R:

Remember that if you have questions about a particular R function, you can access its
documentation with a question mark followed by the function name:
?function_name_here.
However, in the case of an operator like the colon used above, you must enclose the
symbol in backticks like this: ?`:`.
(NOTE: The backtick (`) key is generally located in the top left corner of a keyboard,
above the Tab key. If you don't have a backtick key, you can use regular quotes.)

seq() function: seq(1, 10) gives the same result as the `:` operator (1:10).

length() function: gives the length of a vector or list.

seq(along.with = my_seq) and seq_along(my_seq) both generate a sequence with the same
length as my_seq (1, 2, ..., length(my_seq)).
If we're interested in creating a vector that contains 40 zeros, we can use rep(0,
times = 40).
To join two vectors y and z, use c(y, z).
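
The sequence helpers above can be tried together in a short sketch. The vectors my_seq, y and z below are made-up example values, not data from these notes.

```r
# my_seq, y and z are hypothetical example vectors
my_seq <- seq(5, 10, length.out = 30)

a <- 1:10                          # colon operator
b <- seq(1, 10)                    # same values as 1:10

idx1 <- seq(along.with = my_seq)   # 1, 2, ..., length(my_seq)
idx2 <- seq_along(my_seq)          # shorthand for the same thing

zeros <- rep(0, times = 40)        # a vector of 40 zeros

y <- c(1, 2, 3)
z <- c(4, 5)
joined <- c(y, z)                  # 1 2 3 4 5
```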

Difference between double and integer vectors


There are multiple classes that are grouped together as "numeric" classes, the 2 most
common of which are double (for double precision floating point numbers) and integer. R
will automatically convert between the numeric classes when needed, so for the most
part it does not matter to the casual user whether the number 3 is currently stored as an
integer or as a double. Most math is done using double precision, so that is often the
default storage.
Sometimes you may want to specifically store a vector as integers if you know that they
will never be converted to doubles (used as ID values or indexing) since integers require
less storage space. But if they are going to be used in any math that will convert them to
double, then it will probably be quickest to just store them as doubles to begin with.

Vectors:
Atomic Vectors :
Contains exactly one data type.

Logical Operators : The `<` and `>=` symbols in these examples are
called 'logical operators'. Other logical operators include `>`, `<=`, `==` for
exact equality, and `!=` for inequality.

Character objects : Character vectors are also very common in R. Double
quotes are used to distinguish character objects.
my_char<-c("My","name","is")
1. paste() joins strings together.
2. The `collapse` argument to the paste() function tells R that when we join together
the elements of the my_char character vector, we'd like to separate them with
single spaces: paste(my_char, collapse = " ").
3. To add to or concatenate vectors, use the c() function.

paste(sep, collapse): sep and collapse explained


a <- "apple"
b <- "banana"
# Put a and b together, with a space in between:
paste(a, b)
#> [1] "apple banana"
# With no space, use sep="", or use paste0():
paste(a, b, sep="")
#> [1] "applebanana"
paste0(a, b)
#> [1] "applebanana"
# With a comma and space:
paste(a, b, sep=", ")
#> [1] "apple, banana"

# With a vector
d <- c("fig", "grapefruit", "honeydew")
# If the input is a vector, use collapse to put the elements together:
paste(d, collapse=", ")
#> [1] "fig, grapefruit, honeydew"
# If the input is a scalar and a vector, it puts the scalar with each
# element of the vector, and returns a vector:
paste(a, d)
#> [1] "apple fig"        "apple grapefruit" "apple honeydew"
# Use sep and collapse:
paste(a, d, sep="-", collapse=", ")
#> [1] "apple-fig, apple-grapefruit, apple-honeydew"
(sep goes between the arguments being pasted; collapse joins the elements of the
resulting vector into a single string.)

If vectors are of different lengths, R recycles the shorter vector to match the longer one (vector recycling).
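
A minimal sketch of recycling, using two made-up vectors. The shorter vector is repeated until it matches the length of the longer one; R issues a warning when the longer length is not a multiple of the shorter.

```r
# short_vec and long_vec are hypothetical example vectors
short_vec <- c(1, 2)
long_vec <- c(10, 20, 30, 40)

# short_vec is recycled to 1, 2, 1, 2 before the addition
recycled_sum <- long_vec + short_vec   # 11 22 31 42

# A single number is recycled the same way
scaled <- long_vec * 2                 # 20 40 60 80
```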

Lists:
Can contain multiple data types.

Dimensions are extra attributes applied to a vector to turn it into a matrix or a higher
dimensional array, so dim() of a plain list such as int_list is NULL.

length()
> mylist <- list (1:10)
> length (mylist)
[1] 1

In such a case you are not looking for the length of the list, but of its first element :
> length (mylist[[1]])
[1] 10

This is a "true" list :


> mylist <- list(1:10, rnorm(25), letters[1:3])
> length (mylist)
[1] 3

Also, it seems that R considers a data.frame as a list :


> df <- data.frame (matrix(0, ncol = 30, nrow = 2))
> typeof (df)
[1] "list"

In such a case you may be interested in ncol() and nrow() rather than length() :
> ncol (df)
[1] 30
> nrow (df)
[1] 2

Though length() will also work (but it's a trick when your data.frame has only one
column) :
> length (df)
[1] 30
> length (df[[1]])
[1] 2

Missing Values [NA and NaN]

NA

In R, NA is used to represent any value that is 'not available' or 'missing'.

The is.*() functions test a variable; is.na() checks whether each value is NA.
NA is a placeholder for something that does not exist; it is not a value.

NaN:
Stands for 'not a number'. It results from undefined operations such as 0/0 or Inf - Inf
(note that 1/0 gives Inf, not NaN).
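
The behaviour of NA and NaN can be checked with a few calls. This is a small sketch on a made-up vector x.

```r
# x is a hypothetical vector with two missing values
x <- c(3, NA, 7, NA)

missing_flags <- is.na(x)     # FALSE TRUE FALSE TRUE
n_missing <- sum(is.na(x))    # TRUE counts as 1, so this is 2

undefined1 <- 0 / 0           # NaN
undefined2 <- Inf - Inf       # NaN
infinite <- 1 / 0             # Inf, not NaN

# is.na() is TRUE for NaN as well, but is.nan() is FALSE for NA
is.na(NaN)                    # TRUE
is.nan(NA)                    # FALSE
```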

Subsetting vectors: extracting elements from vectors

The way you tell R that you want to select some particular elements (i.e.
a 'subset') from a vector is by placing an 'index vector' in
square brackets immediately following the name of the vector. For example, x[1:10] picks the
first 10 elements.

Index vectors come in four different flavors: logical vectors, vectors of positive
integers, vectors of negative integers, and vectors of character strings.

Logical vector subsetting

y <- x[!is.na(x)] selects all elements that are not NA.

Combining multiple conditions in subsetting : x[!is.na(x) & x > 0]
Many programming languages use what's called 'zero-based indexing', which
means that the first element of a vector is considered element 0. R uses 'one-based indexing', which (you guessed it!) means the first element of a vector is element 1.

Positive and negative integer subsetting

R accepts negative integer indexes. Whereas x[c(2, 10)] gives us ONLY the 2nd
and 10th elements of x, x[c(-2, -10)] gives us all elements of x EXCEPT for the 2nd
and 10th elements. Try x[c(-2, -10)] now to see this.
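
The index flavors described above can be compared side by side. The vector x below is a made-up example, shorter than the one the notes refer to.

```r
# x is a hypothetical example vector
x <- c(5, NA, -3, 8, NA, 0, 12)

first_three <- x[1:3]               # 5 NA -3 (one-based indexing)
not_missing <- x[!is.na(x)]         # 5 -3 8 0 12
positive <- x[!is.na(x) & x > 0]    # 5 8 12
only_2_and_4 <- x[c(2, 4)]          # NA 8
all_but_2_and_4 <- x[c(-2, -4)]     # 5 -3 NA 0 12
```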

Named vector subsetting :

names(x) gives the names of the elements of the vector x.

For named vectors, subsetting is slightly different:

vect
 foo  bar norf
  11    2   NA
vect["bar"] gives the element named bar: bar 2
vect[2] gives the 2nd element: bar 2

Matrices and data frames


Matrices contain only one type of data.
Matrices are vectors with a dimension attribute.
They are used to represent tabular data with rows and columns.

When using the `:` operator, c() is not required to create a vector (e.g. 1:12).

dim() of a plain vector is NULL, unlike for matrices.
To assign the dim attribute of a vector, use dim(vec) <- c(3, 4). This turns a vector of
12 elements into a 3x4 matrix.
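
A short sketch of turning a vector into a matrix by setting its dim attribute. The name vec is only an example. R fills the matrix column by column.

```r
# vec is a hypothetical 12-element vector
vec <- 1:12
dim_before <- dim(vec)     # NULL: a plain vector has no dimensions

dim(vec) <- c(3, 4)        # 3 rows, 4 columns
is_mat <- is.matrix(vec)   # TRUE

# Filled column-wise, so column 3 holds 7, 8, 9
elem <- vec[2, 3]          # 8
```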

Classes in R
class() of a numeric vector gives "numeric".

Data frames
Data frames are represented as a special type of list where every element of
the list has to have the same length. Each element of the list can be thought of
as a column and the length of each element of the list is the number of rows.
Unlike matrices, data frames can store different classes of objects in each
column. Matrices must have every element be the same class (e.g. all integers or
all numeric).

For two columns foo and bar, length(foo) has to equal length(bar).
The length of foo equals the number of rows of the data frame.
A data frame belongs to the class data.frame.

To assign column names:
> colnames(my_data) <- cnames

where cnames is a character vector of column names.
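
As a small sketch of the points above, a data frame can be built from two equal-length vectors of different classes and then renamed. The values in foo and bar and the new column names are invented for illustration.

```r
# foo and bar are hypothetical columns of equal length
foo <- c(1, 2, 3)
bar <- c("a", "b", "c")

my_data <- data.frame(foo, bar)
class(my_data)              # "data.frame"
nrow(my_data)               # 3, the length of each column

# cnames is a character vector with one name per column
cnames <- c("number", "letter")
colnames(my_data) <- cnames
names(my_data)              # "number" "letter"
```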

Logic in R
There are two logical values in R, also called Boolean values. They are TRUE and FALSE.
In R you can construct logical expressions which will evaluate to either TRUE or FALSE.
In order to negate Boolean expressions you can use the NOT operator. An
exclamation point `!` will cause !TRUE (say: not true) to evaluate to FALSE and
!FALSE (say: not false) to evaluate to TRUE.
You can use the `&` operator to evaluate AND across a vector. The `&&` version
of AND works on single values: older versions of R evaluated only the first member of a
vector, while R 4.3.0 and later raise an error if either operand has length greater than one.
Let's test `&` for practice.
Type the expression TRUE & c(TRUE, FALSE, FALSE).
The OR operator follows a similar set of rules. The `|` version of OR evaluates OR
across an entire vector, while the `||` version of OR works only on single values, in the
same way as `&&`.
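
The difference between the vectorized and the single-value operators can be seen in a short sketch. Only length-one operands are passed to `&&` and `||` here, so the code behaves the same in old and new versions of R.

```r
# Vectorized forms work element by element
and_vec <- TRUE & c(TRUE, FALSE, FALSE)    # TRUE FALSE FALSE
or_vec <- TRUE | c(TRUE, FALSE, FALSE)     # TRUE TRUE TRUE

# Single-value forms, used with length-one operands only
and_single <- TRUE && FALSE                # FALSE
or_single <- FALSE || TRUE                 # TRUE

# NOT is vectorized as well
negated <- !c(TRUE, FALSE)                 # FALSE TRUE
```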

Precedence of Operators
AND operators are evaluated before OR operators.

R functions for Logical operations

isTRUE() : If the argument evaluates to TRUE, the
function will return TRUE. Otherwise, the function will return FALSE.

The function identical() will return TRUE if the two R objects passed to it as
arguments are identical.
xor() : The xor() function takes two arguments and
stands for exclusive OR. If one argument evaluates to TRUE and one argument
evaluates to FALSE, then this function will return TRUE, otherwise it will return
FALSE (if both arguments are TRUE, the result is FALSE):
TRUE | TRUE is TRUE
xor(TRUE, TRUE) is FALSE

which() : The which() function takes a logical vector as an argument and returns
the indices of the vector that are TRUE. For example, which(c(TRUE, FALSE, TRUE))
would return the vector c(1, 3).
The functions any() and all() take logical vectors as their argument. The any()
function will return TRUE if one or more of the elements
in the logical vector is TRUE. The all() function will return TRUE if every element in the
logical vector is TRUE.
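
The logical helper functions described above can be exercised on a small example. The vector ints is made up for illustration.

```r
isTRUE(6 > 4)                       # TRUE
isTRUE(c(TRUE, TRUE))               # FALSE: not a single TRUE value

identical(c(1, 2), c(1, 2))         # TRUE

xor(TRUE, FALSE)                    # TRUE
xor(TRUE, TRUE)                     # FALSE

# ints is a hypothetical example vector
ints <- c(3, -1, 7, 0)
positive_idx <- which(ints > 0)     # 1 3
any_negative <- any(ints < 0)       # TRUE
all_positive <- all(ints > 0)       # FALSE
```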

Functions

Functions are one of the fundamental building blocks of the R language. They are
small pieces of reusable code that can be treated like any other R object. A function is
called by writing its name followed by parentheses.

Creating a function in R (write it in an R script, then use submit() in the
console to run it)
function_name <- function(arg1, arg2){
  # Manipulate arguments in some way
  # Return a value
}

To view the source code of a function you created, type function_name (no
parentheses).
If arguments are named explicitly, they can be given in whichever order:
func_name(a, b) is the same as func_name(y = b, x = a) when the parameters are named x and y.
args(func_name) returns the arguments of the function.
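
A minimal sketch of defining and calling a function. The function remainder and its default value are illustrative choices, not something defined elsewhere in these notes.

```r
# remainder() is a hypothetical example function;
# divisor has a default value of 2
remainder <- function(num, divisor = 2) {
  num %% divisor   # the last evaluated expression is returned
}

by_position <- remainder(11, 4)             # 3
by_name <- remainder(divisor = 4, num = 11) # 3, order does not matter
with_default <- remainder(11)               # 1, uses divisor = 2

args(remainder)   # shows the argument list
remainder         # no parentheses: prints the source code
```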

Using functions as arguments to functions

evaluate <- function(func, dat){
  func(dat)
}

where func is a function that is called with dat as its argument.

Anonymous functions :
Passing a function into another function without first defining the argument function
under a name.
> evaluate(function(x){x+1},6)
[1] 7

Using lapply and anonymous functions to get the 2nd element
lapply(unique_vals, function(elem) elem[2])

Passing an ellipsis (...) as an argument and unpacking it :

mad_libs <- function(...){
  # Unpack the named arguments passed through ...
  args <- list(...)
  place <- args[["place"]]
  adjective <- args[["adjective"]]
  noun <- args[["noun"]]
  paste("News from", place, "today where", adjective,
        "students took to the streets in protest of the new", noun,
        "being installed on campus.")
}

Create new binary operators such as %p% to perform the operations you want:
"%p%" <- function(a, b){ # Remember to add arguments!
  paste(a, b)
}

Loops
For loop: create a placeholder for the results before the loop starts.
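
A minimal sketch of the placeholder idea: the result vector is allocated once at its final length and filled inside the loop. The input values are made up.

```r
# values is a hypothetical input vector
values <- c(2, 4, 6)

# Placeholder for the results, created before the loop
results <- numeric(length(values))

for (i in seq_along(values)) {
  results[i] <- values[i]^2
}

results   # 4 16 36
```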

Loop functions : lapply and sapply

These powerful functions, along with their close relatives (vapply() and tapply(), among
others) offer a concise and convenient means of implementing the Split-Apply-Combine
strategy for data analysis.
Each of the *apply functions will SPLIT up some data into smaller pieces, APPLY a function
to each piece, then COMBINE the results. A more detailed discussion of this strategy is
found in Hadley Wickham's Journal of Statistical Software paper titled 'The Split-Apply-Combine Strategy for Data Analysis'.

lapply()
The lapply() function takes a list as input, applies a function to each element of the list,
then returns a list of the same length as the original one.
A data frame is a list of vectors.
Since a data frame is really just a list of vectors (you can see this with as.list(flags)), we
can use lapply() to apply the class() function to each column of the flags dataset.

sapply():
sapply() allows you to automate the conversion of that list by calling lapply() behind the
scenes, but then attempting to simplify (hence the s in sapply) the result for you. For
example, if the result is a list of character vectors of length one, sapply() returns a single
character vector automatically, as opposed to lapply(), which returns a list.
It simplifies the result to a vector or a matrix where possible.

If every element of the result has length 1, sapply() returns a vector; if every element has
the same length greater than 1, it returns a matrix. When sapply() cannot simplify (i.e. when
the lengths of the elements are not equal), it works the same way as lapply() and returns a list.
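
The three simplification cases can be seen on a toy data frame. The data frame df below is invented for illustration and stands in for the flags dataset, which is not reproduced in these notes.

```r
# df is a hypothetical stand-in for a real dataset
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(1.5, 2.5, 3.5))

# lapply() always returns a list, one element per column
cls_list <- lapply(df, class)

# Each result has length 1, so sapply() simplifies to a vector
cls_vec <- sapply(df, class)

# Each result has length 2, so sapply() simplifies to a matrix
rng <- sapply(df[c("id", "score")], range)

# Results of unequal length cannot be simplified: a list is returned
uneven <- sapply(list(1:2, 1:3), function(v) v * 2)
```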

Subsetting : how to get all rows and only some columns

Use flag_colors <- flags[, 11:17] to extract the columns containing the color data and
store them in a new data frame called flag_colors. (Note the comma before 11:17. This
subsetting command tells R that we want all rows, but only
columns 11 through 17.)

vapply and tapply

Whereas sapply() tries to guess the correct format of the result, vapply() allows you to
specify it explicitly. If the result doesn't match the format
you specify, vapply() will throw an error, causing the operation to stop. This can
prevent significant problems in your code that might be caused by getting
unexpected return values from sapply().

vapply(flags, class, character(1)). The character(1) argument tells R that we
expect the class function to return a character vector of length 1 when applied to
EACH column of the flags dataset.
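
A small sketch of how vapply() enforces the declared result format. The data frame df is a made-up stand-in for flags, and tryCatch() is used here only so the deliberate error does not stop the script.

```r
# df is a hypothetical stand-in for a real dataset
df <- data.frame(id = 1:3, score = c(1.5, 2.5, 3.5))

# Declared format matches: class() returns one string per column
ok <- vapply(df, class, character(1))   # id "integer", score "numeric"

# Declared format does not match (numeric expected, character returned),
# so vapply() raises an error; here it is caught and replaced by a marker
bad <- tryCatch(vapply(df, class, numeric(1)),
                error = function(e) "error")
```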

tapply
As a data analyst, you'll often wish to split your data up into groups based on the value of
some variable, then apply a function to the members of each group. The next function
we'll look at, tapply(), does exactly that.

table() in R: The table() function simply needs an object that can be
interpreted as a categorical variable (called a factor in R), and counts the
instances of each value of the variable in the dataset.

Factors
Factors are used to represent categorical data and can be unordered or ordered.
One can think of a factor as an integer vector where each integer has a label.
Use tapply(flags$animate, flags$landmass, mean) to apply the mean function to the
'animate' variable separately for each of the six landmass groups, thus giving
us the proportion of flags containing an animate image WITHIN each landmass group.
> tapply(flags$animate, flags$landmass, mean)

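        1         2         3         4         5         6
0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000

In this call, the first argument (flags$animate) is the numerical variable or vector, and the
second argument (flags$landmass) is the categorical variable or factor.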
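
The same split-and-apply pattern can be tried on toy data. The vectors landmass and animate below are invented and much smaller than the flags columns of the same name.

```r
# Hypothetical toy data: five flags in two landmass groups
landmass <- factor(c(1, 1, 2, 2, 2))
animate <- c(1, 0, 1, 1, 0)

# Count how many flags fall in each group
group_counts <- table(landmass)                   # group 1: 2, group 2: 3

# Mean of animate within each group, i.e. the proportion of 1s
group_means <- tapply(animate, landmass, mean)    # group 1: 0.5, group 2: 2/3
```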

Working with data

Type ls() to list the variables in your workspace, or ls(dataname) to list the
variables of a specific data set.

nrow() gives the number of rows and ncol() gives the
number of columns.
names() returns the names attribute of an object (vector, list, data frame etc.); for a
data frame it gives the variable names as a character vector.
summary(data_name) gives a summary of the data.
summary() provides different output for each
variable, depending on its class. For numeric data such as
Precip_Min, summary() displays the minimum, 1st quartile,
median, mean, 3rd quartile, and maximum. These values
help us understand how the data are distributed.

For categorical variables (called 'factor' variables in R),
summary() displays the number of times each value (or 'level')
occurs in the data. For example, each value of Scientific_Name
only appears once, since it is unique to a specific plant. In
contrast, the summary for Duration (also a factor variable) tells
us that our dataset contains 3031 Perennial plants, 682 Annual
plants, etc.

Understanding the structure of your data

str()
The beauty of str() is that it combines many of the features of the other functions you've
already seen, all in a concise and readable format. At the very top, it tells us that the
class of plants is 'data.frame' and that it has 5166 observations and 10 variables. It then
gives us the name and class of each variable, as well as a preview of its contents. str() can
be applied to most objects in R.

Simulation in R
Simulate rolling four six-sided dice: sample(1:6, 4, replace = TRUE).
> sample(1:6, 4, replace = TRUE)

The first argument (1:6) is the vector from which to choose, and the second argument (4) is
how many times to choose.

Sampling with replacement simply
means that each number is
"replaced" after it is selected, so
that the same number can show up
more than once.

Binomial random variable (? needs further revision)

A binomial random variable represents the number of successes (heads) in a given
number of independent trials (coin flips).
Therefore, we can generate a single random variable that represents the number of heads
in 100 flips of our unfair coin using rbinom(1, size = 100, prob = 0.7).
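
A short sketch of both simulations. The seed value is an arbitrary choice made only so the run is repeatable; the actual numbers drawn are not shown because they depend on the random number generator.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# Four rolls of a six-sided die, sampled with replacement
rolls <- sample(1:6, 4, replace = TRUE)

# One binomial draw: number of heads in 100 flips of an unfair coin
heads <- rbinom(1, size = 100, prob = 0.7)

# The same experiment flip by flip: 100 draws of a single trial each
flips <- rbinom(100, size = 1, prob = 0.7)
heads_from_flips <- sum(flips)
```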

Normal Distribution

Dates and times in R

Dates are represented by the 'Date' class and times are represented by the 'POSIXct'
(large integer) and 'POSIXlt' (list) classes. Internally, dates are stored as the number of
days since 1970-01-01 and times are stored as either the number of seconds since
1970-01-01 (for 'POSIXct') or a list of seconds, minutes, hours, etc. (for 'POSIXlt').

as.Date() -> for coercing R objects into dates.

strptime() converts character vectors to POSIXlt. In that sense,
it is similar to as.POSIXlt(), except that the input doesn't have to
be in a particular format (YYYY-MM-DD).
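
The internal representation can be inspected directly. The dates and the format string below are made-up examples; a numeric month format is used so the result does not depend on the locale.

```r
# One day after the origin is stored as the number 1
d <- as.Date("1970-01-02")
days_since_origin <- unclass(d)           # 1

# Subtracting dates gives a difference in days
gap <- as.numeric(as.Date("1970-01-11") - as.Date("1970-01-01"))   # 10

# strptime() parses a string that is not in YYYY-MM-DD form
when <- strptime("17/10/1986 08:24", format = "%d/%m/%Y %H:%M", tz = "UTC")
class(when)                               # "POSIXlt" "POSIXt"
when$hour                                 # 8
```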

cars$variable -> gives the data of that variable (column)

R Data Frames :
Besides storing a single number in a variable, you can also store text, vectors, matrices,
and even whole data sets, which R users call DATA FRAMES.

Continuous Variables :
A CONTINUOUS VARIABLE is one that can assume any value within the scope of the
problem. For the 'cars' data set:
price, city MPG, and weight are all continuous variables, as there is no
inherent restriction on the possible values of each of the three variables (except for
negative numbers, of course!).

Discrete Variables :
A DISCRETE VARIABLE is a variable that may take on one of a limited, and
usually fixed, number of possible values. A CATEGORICAL VARIABLE is similar to a
discrete variable, however, instead of assuming a fixed value, the variable is ascribed a
fixed categorical description.
In our 'cars' data set, the number of passengers is a discrete variable, while the type and
drive train are examples of categorical variables.
Notice that the number of passengers is a discrete variable since the capacity of each car
may only be described in whole numbers.

Descriptive statistics:
Before a statistician engages in a thorough analysis of the data set, it
is useful to first visualize the data.
By organizing the data into a PLOT or GRAPH, a statistician is able to
explore and summarize some basic properties of the data set. The discipline of
quantitatively describing the main properties of a data set is known as DESCRIPTIVE
STATISTICS.

Central Tendency :
One of the most common descriptive statistics is a MEASURE OF CENTRAL
TENDENCY. For a specific data set, CENTRAL TENDENCY seeks to locate a central value or
a typical value for a certain variable.
It may also be referred to as the AVERAGE or the CENTER of the data.
The most common measures of central tendency are the MEAN, MEDIAN, and MODE of a
set of data.

Mean or Average : The ARITHMETIC MEAN, or simply just the MEAN or
AVERAGE, is the most common measurement of the center of the data. To
calculate the mean, you must first sum all of the values of interest, and
then divide that sum by the total number of values used in the sum.

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

R command : mean(cars$price)

Median (middle number): When the total number of values is odd, finding the
median, or 'middle' value, is quite easy. However, if the total number of values is
even, there is no middle value amongst the
actual data. In this case, you must calculate the average of the two middle values
amongst the data points. In other words, in the case of an even number of values, the
MEDIAN is reported as the MEAN of the two middle values of the data set.

R command : median(cars$price)

R command :
summary(cars$mpgCity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.00 19.00 21.00 23.31 28.00 46.00
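
The mean and median definitions above can be checked by hand on a small made-up vector with an even number of values.

```r
# x is a hypothetical example with six values
x <- c(16, 19, 21, 28, 46, 20)

mean_by_hand <- sum(x) / length(x)   # 150 / 6 = 25
mean(x)                              # 25

sorted <- sort(x)                    # 16 19 20 21 28 46
# Even number of values: average the two middle ones
median_by_hand <- mean(sorted[3:4])  # (20 + 21) / 2 = 20.5
median(x)                            # 20.5
```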

Types of Plots:
Dot Plots(good for smaller data sets):

The simplest type of plot is the DOT PLOT, which is used to visually convey the values of
one variable. In a dot plot, there is only a horizontal x-axis, and the data points are
represented as dots above this axis.
Since dot plots effectively display the specific numerical value of one
variable for each individual in the data set, they are a particularly useful tool for smaller
data sets.

Range :
This discussion of the dot plot brings me to our first descriptive statistic, the RANGE of
a data set. Just as the name would seem to imply, the range is the difference between the
maximum and minimum values of the data set. Range = Max - Min
R command : range(cars$price) returns the minimum and maximum values;
diff(range(cars$price)) gives their difference.

Data Analysis

Inferential Statistics VS Descriptive Statistics


The purpose of analyzing a sample is to draw conclusions about the population from
which the sample was selected. This is called INFERENCE and is the primary goal of
INFERENTIAL STATISTICS.
In order to make any inferences about the population, we first need to describe the
sample. This is the primary goal of DESCRIPTIVE STATISTICS.

Central Tendency
Arithmetic Mean
The ARITHMETIC MEAN, or simply the MEAN or AVERAGE, is the most common
measurement of central tendency. To calculate the mean of a dataset, you first sum all of
the values and then divide that sum by the total number of values in the dataset.

Median
An alternative to the mean, which is not influenced at all by extreme values, is the
MEDIAN. The median is computed by sorting all values from least to greatest and then
selecting the middle value. If there is an even number of values, then there are actually 2
middle values. In this case, the MEDIAN is equal to the MEAN of the 2 middle values.

MODE
Finally, we may be most interested in finding the value that shows up the most in our
dataset. In other words, what is the most common value in our dataset? This is called the
MODE and it is found by counting the number of times that each value appears in the
dataset and selecting the most frequent value.
- Use the table() function to count how often each value occurs.
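
Base R has no built-in function for the statistical mode, so one possible sketch uses table() as suggested above. The vector x is a made-up example; if several values tie for most frequent, which.max() picks the first of them.

```r
# x is a hypothetical example vector
x <- c(4, 6, 6, 8, 6, 4)

counts <- table(x)   # 4 appears twice, 6 three times, 8 once

# Name of the most frequent value, converted back to a number
mode_value <- as.numeric(names(counts)[which.max(counts)])   # 6
```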

Dispersion
While measures of central tendency are used to estimate the middle values of a dataset,
measures of dispersion are important for describing the spread of the data.
The term dispersion refers to the degree to which the data values are scattered around an
average value. Dispersion is synonymous with other words such as variability and
spread.

Range
The first descriptive statistic that can describe the variability of a data set is known as
the RANGE. The range is the difference between the maximum and minimum values of the
data set.
Range = max_value - min_value

Variance
The second important measure of variability is known as VARIANCE. Mathematically,
VARIANCE is
the average of the squared differences from the mean. More simply, variance represents
how far, on average, the data lie from the mean, measured in squared units.

Standard deviation

$$\mathrm{SD} = \sqrt{\mathrm{Var}}$$

The standard deviation is very important when analyzing our data set. A small standard
deviation indicates that the data points tend to be located near the mean value, while a
large standard deviation indicates that the data points are spread further from the mean.
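
The definitions can be worked through on a small made-up vector. Note that R's var() and sd() compute the sample versions, which divide by n - 1 rather than n, so they differ slightly from the plain average of squared differences.

```r
# x is a hypothetical example vector
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

m <- mean(x)                        # 5
squared_diffs <- (x - m)^2          # 9 1 1 1 0 0 4 16, summing to 32

pop_var <- mean(squared_diffs)      # 32 / 8 = 4 (divides by n)
pop_sd <- sqrt(pop_var)             # 2

sample_var <- var(x)                # 32 / 7 (divides by n - 1)
sample_sd <- sd(x)                  # square root of var(x)
```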

Visual representations of Dispersion


A BOX PLOT, also called a BOX-AND-WHISKER PLOT, is used to summarize the main
descriptive statistics of a particular data set and this type of plot helps illustrate the
concept of variability.
A box plot is used to visually represent the MINIMUM, FIRST QUARTILE (Q1),
MEDIAN, THIRD QUARTILE (Q3), and MAXIMUM of a data set.

The height of each box is referred to as the INTERQUARTILE RANGE (IQR). The more
variability within the data, the larger the IQR. On the other hand, less variability within
the data means a smaller IQR. The bottom of the box in the box plot corresponds to the
value of the first quartile (Q1), and the top of the box corresponds to the value of the
third quartile (Q3).
Each whisker spans roughly the lowest and the highest 25% of the data (from the minimum
to Q1, and from Q3 to the maximum).
Visualising Data (for continuous and discrete data)

Dot Plots
Histogram

Here I have created a histogram using the miles per gallon data for all of our cars. As you
may notice, the values of the MPG along the x-axis are partitioned into bins with a range
of 5.
The second bin, for example, groups together all of the cars that get 21-25 MPG in the
city, and so forth.
Note that the bin to the left of this contains those cars with 20 MPG since this value
cannot be counted in both bins. The frequency of values in each bin, or the number of
cars in each of the intervals, is reported along the y-axis.
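
The binning rule can be reproduced without drawing anything by asking hist() for the counts only. The MPG values below are invented and are not the cars data; the breaks are chosen so each bin has a width of 5.

```r
# mpg_city is a hypothetical example vector
mpg_city <- c(16, 19, 20, 21, 23, 25, 28, 46)

# plot = FALSE returns the bin counts instead of drawing the histogram
h <- hist(mpg_city, breaks = seq(15, 50, by = 5), plot = FALSE)

# By default each bin includes its right edge, so 20 is counted in the
# 15-20 bin and 25 in the 20-25 bin
h$counts   # 3 3 1 0 0 0 1
```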

Significance of Histograms
Data Density
Taller bars signify the range of values in which the majority of the data is located,
whereas shorter bars represent a range of values in which only a little bit of the data is
located. In other words, histograms provide a view of the DATA DENSITY.

Skewness
Histograms are particularly useful in viewing and describing the shape of the distribution
of the data. A distribution of data may have a left skew, a right skew, or no skew at all.
SKEWNESS is a measure of the extent to which the distribution of the data 'leans' to one
side or the other.
A distribution that has a left skew is one in which the left TAIL of the plot is longer. In
other words, on a histogram the majority of the distribution is located to the right of the
mean.

Left skew > Mean is less than the median.

Right skew > Mean is greater than the median.

Stem and Leaf Plot


A special type of histogram is known as a STEM-AND-LEAF PLOT. This plot organizes
numerical data in order of decimal place value. The left-hand column of the plot contains
the STEMS, or the numerical values of the tens digit for each of the data points,
organized vertically in increasing order.
The LEAVES are located in the right-hand column of the plot and are the values of the
ones digit for each data point of the corresponding stem, organized horizontally in
increasing order.

Statistical Inference
Introduction

Regression Models
Introduction

First, plot the data: plot(child ~ parent, galton)

plot(jitter(child, 4) ~ parent, galton)
By using R's function "jitter" on the children's heights, we can spread out the data
to simulate the measurement errors and make high
frequency heights more visible.

Getting and Cleaning data

The dplyr package

The package needs to be installed once and then loaded in each session in which you work
with data: library(dplyr).

The first step of working with data in dplyr is to load the data into what the
package authors call a 'data frame tbl' or 'tbl_df'. Use the following code to create
a new tbl_df called cran: cran <- tbl_df(mydf). (In recent versions of dplyr, tbl_df() is
deprecated in favour of as_tibble().)

Printing a tbl_df produces a more compact view of the data in the output.

"The dplyr philosophy is to have small functions that each do one thing well."
Specifically, dplyr supplies five 'verbs' that cover most fundamental data
manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

Vectorization in R:
R is a high-level, interpreted computer language. This means that R takes care
of a lot of basic computer tasks for you. For instance, when you type
i <- 5.0

you don't have to tell your computer:

That 5.0 is a floating-point number

That i should store numeric-type data

To find a place in memory to put 5

To register i as a pointer to that place in memory

R needs to answer these questions itself. Compare
this to a compiled language like C.

Comparison with C
int i;
i = 5;

This tells the computer that i will store data of the type int (integers), and
assigns the value 5 to it. If I try to assign 5.5 to it, something will go wrong.

Depending on my set-up, it might throw an error, or just silently assign 5 to i.

But C doesn't have to figure out what type of data is represented by i, and
this is part of what makes it faster.
Many internal R functions, such as fft(), are wrappers around compiled C code, hence their
faster processing time.
However, R still has to work out the types of the arguments on each call, which costs time.

Solution: vectorization for faster loops


If you need to run a function over all the values in a vector, you could pass a
whole vector through the R function to the compiled code, or you could call
the R function repeatedly for each value. If you do the latter, R has to do the
figuring out stuff, as well as the translation, each time. But if you call it once,
with a vector, the figuring out part happens just once.
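
The two approaches can be contrasted in a minimal sketch. The vector x is a made-up example; both versions give the same result, but the vectorized call hands the whole vector to compiled code in a single step.

```r
# x is a hypothetical example vector
x <- c(1, 4, 9, 16)

# Calling the function repeatedly, once per value
looped <- numeric(length(x))
for (k in seq_along(x)) {
  looped[k] <- sqrt(x[k])
}

# Passing the whole vector in one call
vectorized <- sqrt(x)              # 1 2 3 4

same <- identical(looped, vectorized)   # TRUE
```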

When to use vectorization:

With just a few function calls that are actually calling compiled code, it'll
be more efficient than if you write a long program with the added
overhead of many function calls.
BLAS (Basic Linear Algebra Subprograms, the library R uses for linear algebra) is
generally designed to be highly efficient and has things like
built-in parallel processing, hardware-specific implementation, and a
host of other tricks. So if your calculations can be expressed in actual
linear algebra terms, such as matrix multiplication, vectorizing them is likely to be faster.

Apply functions:
Another thing that the *apply functions help with is avoiding what are known
as side effects. When you run an *apply function, everything happens inside that
function, and nothing changes in your working environment (this is known as
functional programming). In a for loop, on the other hand, when you do
something like for(i in 1:10), you get the leftover i in your environment.
This is sometimes considered bad practice.
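
The leftover loop variable can be observed in a short sketch. The variable names are arbitrary; the anonymous function's argument k exists only inside the function call.

```r
# for loop: the counter i stays in the environment after the loop ends
squares_loop <- numeric(3)
for (i in 1:3) {
  squares_loop[i] <- i^2
}
i   # 3, left over from the loop

# sapply: k is local to the anonymous function and leaves nothing behind
squares_apply <- sapply(1:3, function(k) k^2)

identical(squares_loop, squares_apply)   # TRUE
```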