LECTURE BY
DR.A.SHANTHINI
ASSOCIATE PROFESSOR
DEPT OF DSBS
SRM IST
Introduction to R
The annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of
a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file.
This dataset is stored in the R variable sales using the assignment operator <-.
# import a CSV file of the total annual sales for each customer
# (reads the yearly_sales dataset into the variable sales)
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset: head() shows the first 6 rows,
# tail() the last 6 rows, and summary() the per-column summary statistics
head(sales)
tail(sales)
summary(sales)
# scatter plot of num_of_orders vs. sales_total, then a histogram of sales_total
plot(sales$num_of_orders, sales$sales_total, main="Number of Orders vs. Sales")
hist(sales$sales_total)
R Graphical User Interfaces
Histogram plot Scatter plot
Data Import and Export
► Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram plot to the file, and then closes the file. The png(), bmp(), pdf(), and postscript() functions can be used in the same way to save the image in other formats.
> jpeg(file="c:/Users/SHANTHINI A/Documents/sales_hist.jpeg")
> hist(sales$num_of_orders)
> dev.off()
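The import, inspect, and save-a-plot steps can be tried end to end without the original file. The sketch below is a minimal, self-contained version under stated assumptions: the five-row data frame, the cust_id column, and the temporary file paths are made up for illustration (only the column names sales_total and num_of_orders come from the slides), and pdf() is used in place of jpeg() simply because it is available on every R build.

```r
# Made-up stand-in for yearly_sales.csv (values are illustrative only)
demo <- data.frame(
  cust_id       = 1:5,
  sales_total   = c(800.64, 217.53, 74.58, 498.60, 723.11),
  num_of_orders = c(3, 3, 2, 3, 4)
)

# Export: write the data frame to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Import: read it back, as read.csv() does for the real file
sales <- read.csv(csv_path)
head(sales)
summary(sales)

# Save a histogram to a file: open the device, draw, then close it
plot_path <- tempfile(fileext = ".pdf")
pdf(file = plot_path)
hist(sales$num_of_orders)
dev.off()

file.exists(plot_path)
```

Forgetting dev.off() leaves the file open and incomplete, so the open, draw, close order matters.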
► Vectors
► Factors
► Array and matrices
► List
► Data frames
Vectors:
Creating Vectors
The c() function can be used to create vectors of objects by concatenating things together.
> x
[1] 0 0 0 0 0 0 0 0 0 0
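The call that created x is missing from the slide. Ten zeros are what vector("numeric", length = 10) returns, so x was presumably initialized that way; this is an assumption. The sketch below shows both c() and vector(), with illustrative values:

```r
# c() concatenates its arguments into a vector
num <- c(0.5, 0.6)        # numeric
lgl <- c(TRUE, FALSE)     # logical
chr <- c("a", "b", "c")   # character
int <- 9:12               # integer sequence, built with the : operator

# vector() creates a vector of a given type and length,
# filled with that type's default value (0 for numeric)
x <- vector("numeric", length = 10)
x
```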
Factors:
Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer
vector where each integer has a label. Factors are important in statistical modeling and are treated specially by modelling
functions like lm() and glm().
> x
Levels: no yes
> table(x)
 no yes
  2   3
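The slide does not show how x was created. A minimal sketch, assuming five yes/no responses chosen so that the levels are "no" and "yes" with two and three observations respectively:

```r
# Assumed input: five yes/no responses (2 "no", 3 "yes")
x <- factor(c("yes", "yes", "no", "yes", "no"))
x
levels(x)      # levels are sorted alphabetically by default
table(x)       # counts per level
unclass(x)     # the underlying integer codes, with the labels as an attribute

# The level order can be set explicitly with the levels argument
x2 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(x2)
```

Setting the level order explicitly matters in modeling functions such as lm(), where the first level is used as the baseline.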
Arrays and Matrix
Arrays
> A = array(1:10)
> A1 = array(1:8, c(2,4))
> A1
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> is.array(A1)
> as.array(A1)
Matrix
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, number of columns).
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
► Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
cbind() and rbind()
Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
List
Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them
well. Lists, in combination with the various “apply” functions discussed later, make for a powerful combination.
Lists can be explicitly created using the list() function, which takes an arbitrary number of
arguments.
We can also create an empty list of a prespecified length with the vector() function.
> x <- vector("list", length = 5)
> x
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
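The slides mention list() without showing it. A short sketch of explicit list creation next to the vector("list", ...) form above; the element values and the names name and married_year are illustrative only:

```r
# A list can hold elements of different classes
x <- list(1, "a", TRUE, 1 + 4i)
sapply(x, class)

# Elements can be named and then accessed with $ or [[ ]]
person <- list(name = "Rahul", married_year = 2016)
person$name
person[["married_year"]]

# vector("list", n) gives n NULL slots that can be filled later
slots <- vector("list", length = 5)
slots[[1]] <- "first"
length(slots)
```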
Data frames
Data frames are used to store tabular data in R. They are an important type of object in R and are used in a
variety of statistical modeling applications.
Binding in data frames:
> df1 = data.frame(name = c("Rahul","joe","Adam","Brendon"), married_year = c(2016,2015,2016,2008))
> df1
name married_year
1 Rahul 2016
2 joe 2015
3 Adam 2016
4 Brendon 2008
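> df2 = data.frame(Birth_place = c("Delhi","Seattle","London","Moscow"), Birth_year = c(1988,1990,1989,1984))
> df2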
Birth_place Birth_year
1 Delhi 1988
2 Seattle 1990
3 London 1989
4 Moscow 1984
> cbinded_df = cbind(df1,df2)
> cbinded_df
name married_year Birth_place Birth_year
1 Rahul 2016 Delhi 1988
2 joe 2015 Seattle 1990
3 Adam 2016 London 1989
4 Brendon 2008 Moscow 1984
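cbind() joins data frames side by side; rbind() stacks them. The sketch below rebuilds the two data frames from the values printed above and adds a row-binding step. The extra row ("Maria", 2012) is hypothetical and only there to show the call.

```r
df1 <- data.frame(name = c("Rahul", "joe", "Adam", "Brendon"),
                  married_year = c(2016, 2015, 2016, 2008))
df2 <- data.frame(Birth_place = c("Delhi", "Seattle", "London", "Moscow"),
                  Birth_year = c(1988, 1990, 1989, 1984))

# Column-binding needs the same number of rows in both data frames
cbinded_df <- cbind(df1, df2)
dim(cbinded_df)

# Row-binding needs matching column names; this extra row is hypothetical
new_row <- data.frame(name = "Maria", married_year = 2012)
rbinded_df <- rbind(df1, new_row)
rbinded_df
```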
► Null Hypothesis – Hypothesis testing is carried out to test the validity of a claim or assumption made about a larger population.
► The claim being tested is known as the null hypothesis.
► The null hypothesis is denoted by H0.
► Alternative Hypothesis – The alternative hypothesis is the claim considered valid if the null hypothesis is rejected.
► The evidence in the trial is the data and the statistical computations that accompany it.
► The alternative hypothesis is denoted by H1 or Ha.
Example - Null hypothesis and Alternative hypothesis
Hypothesis Testing in R
► Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, or in other words, what the data are telling you about the population.
► The p-value ranges between 0 and 1.
It can be interpreted in the following way:
► A small p-value (typically ≤ 0.05) indicates strong evidence against the null
hypothesis, so you reject it.
► A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so
you fail to reject it.
► A p-value very close to the cutoff (0.05) is considered to be marginal and could go
either way.
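As a minimal sketch, the decision rule can be written as a single comparison. The cutoff alpha = 0.05 is the conventional choice named above; the p-value is the one reported in the two-sample t-test output in these slides.

```r
alpha   <- 0.05      # significance level (the conventional cutoff)
p_value <- 0.08547   # p-value from the two-sample t-test output in these slides

# Reject H0 only when the p-value is at or below the cutoff
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"
decision
# [1] "fail to reject H0"
```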
Difference of means:
Output?
Two Sample t-test
data: x and y
t = -1.7828, df = 28, p-value = 0.08547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.1611557 0.4271893
sample estimates:
mean of x mean of y
102.2136 105.0806
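The call that produced this output is not shown. The title "Two Sample t-test" and df = 28 are consistent with a pooled-variance test on two samples of sizes 10 and 20, so a sketch might look like the following. The sample sizes, means, standard deviation, and seed are all assumptions, and the numbers it prints will differ from the output above.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Assumed setup: two simulated normal samples of sizes 10 and 20
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

# Student's t-test uses a pooled variance, hence var.equal = TRUE
res <- t.test(x, y, var.equal = TRUE)
res

res$parameter   # degrees of freedom: 10 + 20 - 2 = 28
res$p.value     # compare against the 0.05 cutoff
```

In the output above, p-value = 0.08547 is larger than 0.05, so the null hypothesis of equal means is not rejected at the 5% level.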
The commands used in the Student’s t-test
► Listed below are the commands used in the Student’s t-test and their explanation:
► t.test(data.1, data.2) – The basic method of applying a t-test is to compare two
vectors of numeric data.
► var.equal = FALSE – If the var.equal instruction is set to TRUE, the variance is
considered to be equal and the standard test is carried out. If the instruction is set
to FALSE (the default), the variance is considered unequal and the Welch
two-sample test is carried out.
► mu = 0 – If a one-sample test is carried out, mu indicates the mean against which
the sample should be tested.
► alternative = "two.sided" – It sets the alternative hypothesis. The default value is "two.sided", but "greater" or "less" can also be specified. You can abbreviate the instruction.
► conf.level = 0.95 – It sets the confidence level of the interval (default = 0.95).
► paired = FALSE – If set to TRUE, a matched-pair t-test is carried out.
► t.test(y ~ x, data, subset) – The required data can be specified as a formula of the form response ~ predictor. In this case, the data frame should be named and a subset of the predictor variable can be specified.
► subset = predictor %in% c("sample.1", "sample.2") – If the data is in the form response ~ predictor, the two samples to be selected from the predictor column of the data should be specified by the subset instruction.
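The options listed above can be sketched on a small simulated data set. The data frame d, its column names, the group labels, and all numeric values are hypothetical; only the t.test() arguments come from the list.

```r
set.seed(2)  # arbitrary seed
# Hypothetical long-format data: one response column, one grouping column
d <- data.frame(
  response  = c(rnorm(8, 10), rnorm(8, 12), rnorm(8, 11)),
  predictor = rep(c("sample.1", "sample.2", "sample.3"), each = 8)
)

# Formula interface, restricted to two of the three groups via subset
two_grp <- t.test(response ~ predictor, data = d,
                  subset = predictor %in% c("sample.1", "sample.2"))

# One-sample test against a hypothesised mean, with a one-sided alternative
s1 <- d$response[d$predictor == "sample.1"]
one_smp <- t.test(s1, mu = 10, alternative = "greater", conf.level = 0.90)

# Matched-pair test on two equal-length vectors
s2 <- d$response[d$predictor == "sample.2"]
paired <- t.test(s1, s2, paired = TRUE)

c(two_grp$method, one_smp$method, paired$method)
```

Because var.equal defaults to FALSE, the two-group call runs the Welch test unless var.equal = TRUE is passed.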
Welch’s t-test
When the equal population variance assumption is not justified in performing Student’s t test for the
difference of means, Welch’s t-test can be used.
In Welch’s test, under the remaining assumptions of random samples from two normal populations with
the same mean, the distribution of T is approximated by the t distribution.
The following R code performs Welch’s t-test on the same set of data analyzed in the earlier Student’s t-test example:
t.test(x, y, var.equal=FALSE) # run Welch’s t-test
Output?
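The original x and y are not included in the slides, so the exact output cannot be reproduced here. A sketch that regenerates assumed samples (sizes 10 and 20, standard deviation 5, arbitrary seed) and compares the Welch result with the pooled-variance one:

```r
set.seed(1)  # arbitrary seed
# Assumed data: simulated normal samples of sizes 10 and 20
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

welch   <- t.test(x, y, var.equal = FALSE)  # Welch's t-test (the default)
student <- t.test(x, y, var.equal = TRUE)   # pooled-variance test, for comparison

welch$method
welch$parameter     # Welch-Satterthwaite df: at most 28, usually not an integer
student$parameter   # always 10 + 20 - 2 = 28
```

The visible difference in the printed output is the title ("Welch Two Sample t-test") and a fractional df value; the sample means are the same in both tests.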
► Two types of error can occur in hypothesis testing:
► Type I Error – the rejection of the null hypothesis when the null hypothesis is TRUE
► Type II Error – the acceptance of (failure to reject) the null hypothesis when the null hypothesis is FALSE
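Both error rates can be estimated by simulation. The sketch below is illustrative only: the sample sizes, means, standard deviation, number of repetitions, and seed are assumptions, and the helper rejects() is a hypothetical function written for this example.

```r
set.seed(42)   # arbitrary seed
alpha  <- 0.05
n_sims <- 2000

# Hypothetical helper: TRUE when the pooled t-test rejects H0 at level alpha
rejects <- function(mean_y) {
  x <- rnorm(10, mean = 100, sd = 5)
  y <- rnorm(20, mean = mean_y, sd = 5)
  t.test(x, y, var.equal = TRUE)$p.value <= alpha
}

# H0 true (equal means): every rejection is a Type I error
type1_rate <- mean(replicate(n_sims, rejects(100)))

# H0 false (means differ by 5): every non-rejection is a Type II error
type2_rate <- 1 - mean(replicate(n_sims, rejects(105)))

c(type1 = type1_rate, type2 = type2_rate)
```

The estimated Type I rate should land close to alpha, while the Type II rate depends on the true difference in means and the sample sizes.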
Wilcoxon Ranksum Test