
Research Methods for Commerce

HIMANSHU GOEL

2023-01-16

Table of Contents
Introduction to R
Basics of R
Numeric, integer, Character, and logical data type
Examine the characteristic of a variable
Simple Common Operators
Simple Functions
Vectors
Scalar operations
Indexing and subsetting vectors
Operations on Vectors
Array and Matrices
Matrix multiplication
Factors
Lists
User Defined Functions
Functions in R
Function with no argument and no return
Function with no argument but it returns a value
Function with arguments but no return value
Function with arguments with return value
Conditional Statements
Types of Statements
If Statement
If Else Statement
Nested If Else Statement
Loops In R
For Loop
While Loop
Repeat Loop
Data Preparation
Exploring Data in R
R Functions For understanding Data in Data Frames
Reading data
Subsetting dataframe
Data Summary
Descriptive Statistics
Tidyverse
Spreading across the tibbles
Visualise changes over time
Pivot Wider
Separating and uniting
Missing Values
Quick plots
A beautiful barplot here
A pleasant histogram here
A delightful data frame here
Date
Parse Date,Time,Months,and Year
Creating Date and Time
Time Spans and Duration
Assignment 1
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Assignment 2
Q1 Define a user defined function to check whether the word is a palindrome.
Q2 Identify whether the number is divisible by 3.
Q3 Identify all prime numbers less than a 100.
Q4 Calculate factorial of n.
Q5 Find g.c.d of (x,y)
Hypothesis Testing
One Sample T-Test
Paired Sample T-Test
Independent Sample T-Test
ANOVA
Simple Regression
Step-wise Regression
Multiple Regression

Introduction to R
Basics of R
Numeric, integer, Character, and logical data type
#Create a numeric variable
i=1.5
j=4+i
ab=1.3
k=data.class(i)
data.class(k)

## [1] "character"

#Create a character variable
Country="INDIA"
Country

## [1] "INDIA"

View(Country)

#Create a logical variable
flag=F
Examine the characteristic of a variable
#Abstract class
class(i)

## [1] "numeric"

#Storage type
typeof(i)

## [1] "double"

class(Country)

## [1] "character"

typeof(Country)

## [1] "character"

class(flag)

## [1] "logical"

typeof(flag)

## [1] "logical"

p=1
class(p)
## [1] "numeric"
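
As a small illustration of why class() and typeof() can differ, the sketch below compares a number typed without the L suffix with an integer literal. The names p_double and p_integer are made up for this example.

```r
# A number typed without a suffix is stored as a double
p_double <- 1
# The L suffix asks R for an integer instead
p_integer <- 1L

class(p_double)    # "numeric"
typeof(p_double)   # "double"
class(p_integer)   # "integer"
typeof(p_integer)  # "integer"
```

Both values print as 1; only the storage type differs.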

Simple Common Operators


#Subtraction
4-3

## [1] 1

#Multiplication
5*3

## [1] 15

#Division
4/3

## [1] 1.333333

#Exponential
4^3

## [1] 64

#Nesting Operators
(4-3)^2

## [1] 1

(5-3)*2

## [1] 4

#"NA" stands for Not Available
s=c(5,7,9,11)
s

## [1] 5 7 9 11

g=c(s,14)
g

## [1] 5 7 9 11 14

#Length of an object
length(s)

## [1] 4

#Indexing beyond the end of a vector returns NA
s[5]

## [1] NA

s[11]

## [1] NA

length(i)

## [1] 1

#Infinite value
3^1250

## [1] Inf

-3^1250

## [1] -Inf

3/0

## [1] Inf

log(0)

## [1] -Inf

#List of stored objects
ls()

## [1] "ab" "Country" "flag" "g" "i" "j" "k"
## [8] "p" "s"

#Object removal
rm(r)

## Warning in rm(r): object 'r' not found

Simple Functions
#Natural log
log(15)

## [1] 2.70805

exp(4)

## [1] 54.59815

sqrt(2)

## [1] 1.414214

sin(15)

## [1] 0.6502878

abs(-56)

## [1] 56

#Round off the value to the given number of digits
round(19.5438, 3)

## [1] 19.544

round(19.5432, 1)

## [1] 19.5

#Log to the base b
log(15, 2)

## [1] 3.906891

ceiling(15.45432)

## [1] 16

floor(15.85432)

## [1] 15

#Testing the variable type and coercion into other types----
is.logical(i)

## [1] FALSE

is.numeric(i)

## [1] TRUE

j=3+4i
data.class(j)

## [1] "complex"

is.complex(j)

## [1] TRUE

#i was created as a numeric variable, not an integer
is.integer(i)

## [1] FALSE

a=as.integer(Country)

## Warning: NAs introduced by coercion

ac=as.character(i)
data.class(ac)

## [1] "character"

is.integer(a)

## [1] TRUE

j=1.5
is.integer(j)

## [1] FALSE

#Converting numeric variable into integer
j=as.integer(j);j

## [1] 1

is.integer(j)

## [1] TRUE

View(j)

#Length of variable
#Creating a vector
i=c(2,3,4,5)
k=c(2,"india")
i=35
#Concatenate vectors of any type; the result is automatically converted to one class
g=c(i,k)
f=c(7,8,9,10, 14, 12, 13, 14)
#Operations and functions on variables
#Addition works element by element; a shorter vector (here the single value i) is recycled
i+f

## [1] 42 43 44 45 49 47 48 49

i-f

## [1] 28 27 26 25 21 23 22 21

max(i)

## [1] 35

min(i)

## [1] 35

prod(i)

## [1] 35

prod(f)

## [1] 154103040

prod(i,f)

## [1] 5393606400

mean(i)

## [1] 35

#The variance of a single value is NA
var(i)

## [1] NA

sort(i)

## [1] 35

sort.list(i)

## [1] 1

length(i)

## [1] 1

length(Country)

## [1] 1

length(flag)

## [1] 1

s=2+3i
is.complex(s)

## [1] TRUE

class(s)

## [1] "complex"

sqrt(s)

## [1] 1.674149+0.895977i

#NaN means "not a number"
sqrt(-17)

## Warning in sqrt(-17): NaNs produced

## [1] NaN
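
The concatenation g=c(i,k) above mixes a number with character strings. A short sketch of how c() coerces mixed input to one common type follows; the variable names are illustrative only.

```r
# Mixing a number and a string: everything becomes character
mixed_chr <- c(2, "india")
class(mixed_chr)   # "character"

# Mixing integer, double and logical: everything becomes double
mixed_num <- c(1L, 2.5, TRUE)
class(mixed_num)   # "numeric"

# Converting text that is not a number gives NA (with a warning)
not_a_number <- suppressWarnings(as.integer("INDIA"))
is.na(not_a_number)   # TRUE
```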

Vectors
#Basic building block for data in R
#Simple R variables are actually vectors
#A vector holds values of a single class----
e=c(2,3,4,5)
I=c(2L,3L,4L,5L)
h=c("2","3", "4","5")
y=e==I
is.vector(i)

## [1] TRUE

is.vector(Country)

## [1] TRUE

is.vector(flag)

## [1] TRUE

#Creation and manipulation of vectors


#Using the combine function c() or the colon operator ':'
v=c("Cricket","Badminton","Football")
v

## [1] "Cricket" "Badminton" "Football"

View(v)
v[1]

## [1] "Cricket"

v[2]

## [1] "Badminton"

v[8]

## [1] NA

length(v)

## [1] 3

class(v)

## [1] "character"

vn=c(2,3,4,5)
class(vn)

## [1] "numeric"

length(vn)

## [1] 4

#Generating a sequence
v1=1:5;v1

## [1] 1 2 3 4 5

View(v1)
v1[3]

## [1] 3

sum(v1)

## [1] 15
###seq() function
#seq() is a more general facility for generating a sequence.
#It can take four arguments: (from=value), (to=value), (by=value) and (length=value)
vs=seq(1,20, by=0.5);vs

## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
## [31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0

seq

## function (...)
## UseMethod("seq")
## <bytecode: 0x000000003b091538>
## <environment: namespace:base>

vs1=seq(2,3, by=0.2)
vs2=seq(2, by=0.2, length=8)
vs1;vs2

## [1] 2.0 2.2 2.4 2.6 2.8 3.0

## [1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4
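
The calls above fix the step with by. As a complementary sketch, the number of points can be fixed instead with length.out (the full name of the length argument), and seq_len() and seq_along() build index sequences. The variable names are illustrative only.

```r
# Five evenly spaced points between 0 and 1
grid <- seq(0, 1, length.out = 5)
grid

# Index sequences: 1..n, and 1..length(x)
idx <- seq_len(4)
sports <- c("Cricket", "Badminton", "Football")
pos <- seq_along(sports)
pos
```

seq_len(0) returns an empty vector, whereas 1:0 counts down to c(1, 0), which makes seq_len() the safer choice inside loops.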

##rep() function##----
#v1 is repeated thrice
vr=rep(v1, times=3)
#Element-wise repetition
vr1=rep(v1, each=3)
vr;vr1

## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

##Another way of defining a vector----
v5=vector(mode="numeric", length=4);v5

## [1] 0 0 0 0

v5[3]=1.4;v5

## [1] 0.0 0.0 1.4 0.0

v5[1]=2.3;v5

## [1] 2.3 0.0 1.4 0.0

v5[2]=4.7;v5

## [1] 2.3 4.7 1.4 0.0

v5[4]=5;v5

## [1] 2.3 4.7 1.4 5.0

v6=vector(mode="integer", length=3);v6

## [1] 0 0 0

v6[1]=2L
v6[2]=3
#Assigning a string coerces the whole vector to character
v6[3]="India"
v7=vector(mode="character", length=4)
v7

## [1] "" "" "" ""

v7[1]=0;v7

## [1] "0" "" "" ""

length(v6)

## [1] 3

#Vectors look like one-dimensional arrays,
#but they have no dim attribute
length(v5)

## [1] 4

dim(v5)

## NULL

Scalar operations
#Multiplication with a constant; addition and division
v2=v1*2;v2;length(v2)

## [1] 2 4 6 8 10

## [1] 5

va=v1+2;va

## [1] 3 4 5 6 7

print(va)

## [1] 3 4 5 6 7

vb=v1/2;vb

## [1] 0.5 1.0 1.5 2.0 2.5

vs=v1-2;vs

## [1] -1 0 1 2 3

v1

## [1] 1 2 3 4 5

#Addition/subtraction of two vectors----
v1=c(9, 10, 11, 12);v1

## [1] 9 10 11 12

v2=c(8, 11, 12, 13);v2

## [1] 8 11 12 13

v3=v1+v2;v3

## [1] 17 21 23 25

k=c(2,3,4,5,6,7, 10,11)
length(k)

## [1] 8

#If the lengths differ, R recycles the shorter vector to match the longer one;
#it warns only when the longer length is not a multiple of the shorter
a=v1+k
a

## [1] 11 13 15 17 15 17 21 23
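
The result above comes from R's recycling rule. A minimal sketch of the rule, using illustrative names short_vec and long_vec, including the case where R issues a warning:

```r
short_vec <- c(9, 10, 11, 12)
long_vec  <- c(2, 3, 4, 5, 6, 7, 10, 11)

# Length 8 is a multiple of length 4, so short_vec is silently used twice
recycled <- short_vec + long_vec
same_as_rep <- identical(recycled, rep(short_vec, times = 2) + long_vec)
same_as_rep

# Length 4 is not a multiple of length 3: R still recycles but warns
warning_text <- tryCatch(
  c(1, 2, 3) + c(1, 2, 3, 4),
  warning = function(w) conditionMessage(w)
)
warning_text
```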

Indexing and subsetting vectors


v1=1:10
v1

## [1] 1 2 3 4 5 6 7 8 9 10

#An individual element of the vector can be accessed
v1[3]

## [1] 3

#Output gives all items of the vector except the third
v1[-3]

## [1] 1 2 4 5 6 7 8 9 10

#Output gives all items of the vector except the third and fourth
d=v1[-3:-4]; d

## [1] 1 2 5 6 7 8 9 10

View(d)
v

## [1] "Cricket" "Badminton" "Football"

h=c(v[-3], v[-1]);h

## [1] "Cricket" "Badminton" "Badminton" "Football"

h=v1[-3]
h1=h[-4];h1

## [1] 1 2 4 6 7 8 9 10

v3

## [1] 17 21 23 25

#Identify which values in v3 are greater than 8
v3>8

## [1] TRUE TRUE TRUE TRUE

#Display values less than 18
v3[v3<18]

## [1] 17

#Extract elements greater than or equal to 22
subset(v3, v3>=22)

## [1] 23 25

v3[v3>=22]

## [1] 23 25

#Extract a random sample
sample(v3, 2)

## [1] 17 25

#Results can differ each time you execute
sample(v3,2)

## [1] 17 25

sample(v3, 3, replace=F)

## [1] 17 23 25

#Greater than 20 or less than 18
v3[v3>20|v3<18]

## [1] 17 21 23 25

#Greater than 15 and less than 18
v3[v3>15 & v3<18]

## [1] 17
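
Two small additions to the subsetting examples above, given as a sketch: which() returns the positions at which a condition holds, and set.seed() makes a random sample repeatable. The seed value 42 and the variable names are arbitrary choices for this example.

```r
scores <- c(17, 21, 23, 25)

# Positions (not values) of the elements that satisfy the condition
high_pos <- which(scores >= 22)
high_pos
scores[high_pos]

# Setting the same seed before each call reproduces the same sample
set.seed(42)
first_draw <- sample(scores, 2)
set.seed(42)
second_draw <- sample(scores, 2)
identical(first_draw, second_draw)
```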

Operations on Vectors
#Multiplication of two vectors----
v4=v2*v3;v2;v3

## [1] 8 11 12 13

## [1] 17 21 23 25

r=s*k
v4;s;k

## [1] 136 231 276 325

## [1] 2+3i

## [1] 2 3 4 5 6 7 10 11

r

## [1] 4+ 6i 6+ 9i 8+12i 10+15i 12+18i 14+21i 20+30i 22+33i

#Division of two vectors----
v5=v3/v2;v5

## [1] 2.125000 1.909091 1.916667 1.923077

v5

## [1] 2.125000 1.909091 1.916667 1.923077

#Quotient of division
v6=v3%/%v2
v6

## [1] 2 1 1 1

#Remainder of division
v7=v3%%v2
v7

## [1] 1 10 11 12

#Exponent in vectors
v8=v5^v2
v8

## [1] 415.7875 1227.7013 2457.8710 4919.9029

class(v8)

## [1] "numeric"

typeof(v8)

## [1] "double"

as.integer(v8)

## [1] 415 1227 2457 4919

#Functional transformations of vectors----
v1=c(2,3,4,5)
v9=log(v1);v9

## [1] 0.6931472 1.0986123 1.3862944 1.6094379

v10=sin(v1)
v10

## [1] 0.9092974 0.1411200 -0.7568025 -0.9589243

v11=sqrt(v1);v11

## [1] 1.414214 1.732051 2.000000 2.236068

###LOGICAL VECTORS###----
x=c(1,2,3)
y=c(5,6,3)
x==y

## [1] FALSE FALSE TRUE

x>=y

## [1] FALSE FALSE TRUE

y>=x

## [1] TRUE TRUE TRUE

###CHARACTER VECTOR###----
##paste() function##
#It takes an arbitrary number of arguments,
#converts them to character strings
#and concatenates them element by element
vr

## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

p=paste(vr,1:5, sep="")
p

## [1] "11" "22" "33" "44" "55" "11" "22" "33" "44" "55" "11" "22" "33" "44"
## [15] "55"

Array and Matrices


#The array() function can be used to restructure a vector into an array
#dim=c(4,3,3) gives 4 rows, 3 columns and 3 layers (e.g. 4 states, 3 quarters and 3 years)
A=array(0, dim= c(4,3,3));A

## , , 1
##
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0

A[2,2,2]=1000;
A[1,2,3]=5000;
A[1,1,1]=800;
A

## , , 1
##
## [,1] [,2] [,3]
## [1,] 800 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 1000 0
## [3,] 0 0 0
## [4,] 0 0 0
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 0 5000 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0

dim(A)

## [1] 4 3 3

#A 2D array is matrix----
#matrix()function
M=matrix(0, nrow=3, ncol=3)
M[1,1]=1
M[1,2]=2
M[1,3]=5;M

## [,1] [,2] [,3]


## [1,] 1 2 5
## [2,] 0 0 0
## [3,] 0 0 0

M[2,]=c(11,13,0)
M[3,]=c(5,4,3)
M

## [,1] [,2] [,3]


## [1,] 1 2 5
## [2,] 11 13 0
## [3,] 5 4 3

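#By default the matrix is filled column by column
M1=matrix(c(1,2,2,6,0,5,4,4,8), nrow=3, ncol=3);M1

## [,1] [,2] [,3]
## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8

#With byrow=T the same values fill the matrix row by row
#(the call that defines M2 does not appear in the text; this call reproduces the output printed for it)
M2=matrix(c(1,2,2,6,0,5,4,4,8), nrow=3, ncol=3, byrow=T);M2

## [,1] [,2] [,3]
## [1,] 1 2 2
## [2,] 6 0 5
## [3,] 4 4 8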

Mk=matrix(c(1,2,2,6,0,5,4,4,8), nrow=3, ncol=3);Mk

## [,1] [,2] [,3]


## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8

Ml=matrix(c(1,2,3,6), nrow=2, ncol=2, byrow=T);Ml

## [,1] [,2]
## [1,] 1 2
## [2,] 3 6

##Transpose of a Matrix----
tM1=t(M1)
tM1

## [,1] [,2] [,3]
## [1,] 1 2 2
## [2,] 6 0 5
## [3,] 4 4 8

#cbind() and rbind() attach vectors together column-wise and row-wise respectively.----
#The shorter vector P=1:2 is recycled to match the length of Q and R
Mv=cbind(P=1:2, Q=5:8, R=9:12);Mv

## P Q R
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 1 7 11
## [4,] 2 8 12

Mv1=rbind(P=1:4, Q=5:8, R=9:12);Mv1

## [,1] [,2] [,3] [,4]


## P 1 2 3 4
## Q 5 6 7 8
## R 9 10 11 12
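
To make the column-wise default and the byrow option concrete, here is a small sketch with an illustrative vector vals. It also shows that filling by row gives the transpose of filling by column with the dimensions swapped.

```r
vals <- 1:6

# Default: values run down the columns first
by_col <- matrix(vals, nrow = 2)
by_col[1, ]

# byrow = TRUE: values run along the rows first
by_row <- matrix(vals, nrow = 2, byrow = TRUE)
by_row[1, ]

# Transposing the row-filled 2 x 3 matrix gives the column-filled 3 x 2 matrix
same_layout <- identical(t(by_row), matrix(vals, nrow = 3))
same_layout
```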

Matrix multiplication
library(matrixcalc)

#Element-wise multiplication
mat=M1*M2;M1;M2;mat

## [,1] [,2] [,3]


## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8

## [,1] [,2] [,3]


## [1,] 1 2 2
## [2,] 6 0 5
## [3,] 4 4 8

## [,1] [,2] [,3]


## [1,] 1 12 8
## [2,] 12 0 20
## [3,] 8 20 64

# %*% is used in R to multiply matrices


M3=M1%*%M2;M3

## [,1] [,2] [,3]


## [1,] 53 18 64
## [2,] 18 20 36
## [3,] 64 36 93

View(M3)
M1%*%M1

## [,1] [,2] [,3]


## [1,] 21 26 60
## [2,] 10 32 40
## [3,] 28 52 92

#Inverse of a matrix: matrix.inverse() from the matrixcalc package----
matrix.inverse(M1)

## [,1] [,2] [,3]


## [1,] 0.7142857 1.00 -0.8571429
## [2,] 0.2857143 0.00 -0.1428571
## [3,] -0.3571429 -0.25 0.4285714
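
matrix.inverse() comes from the matrixcalc package loaded above. As an alternative sketch that needs no extra package, base R's solve() returns the inverse when it is given a single square, non-singular matrix. M1 is rebuilt here so that the block runs on its own; M1_inv is an illustrative name.

```r
M1 <- matrix(c(1, 2, 2, 6, 0, 5, 4, 4, 8), nrow = 3, ncol = 3)

# solve() with one argument returns the inverse of M1
M1_inv <- solve(M1)
M1_inv

# Multiplying the inverse by the matrix gives the identity (up to rounding)
check <- all.equal(M1_inv %*% M1, diag(3))
check
```

all.equal() is used rather than == because floating-point arithmetic leaves tiny rounding errors in the product.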

#Power of matrix
M4=matrix.power(M1,5);M4

## [,1] [,2] [,3]


## [1,] 27017 53386 90076
## [2,] 19498 37700 64536
## [3,] 42484 83224 141232

#Transpose of a matrix: t()----
t(M1);

## [,1] [,2] [,3]


## [1,] 1 2 2
## [2,] 6 0 5
## [3,] 4 4 8
M1

## [,1] [,2] [,3]


## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8

M1[2,]

## [1] 2 0 4

M1[,3]

## [1] 4 4 8

M1[1,3]

## [1] 4

M1[2,2]

## [1] 0

#A single index counts down the columns, so M1[3] is the third element of column 1
M1[3]

## [1] 2

a1=seq(1:4)
a2=rep(1:2,2)
a3=c(3,4,5,6)
MAT1=matrix(cbind(a1,a2,a3),nrow=3, ncol=4, byrow=TRUE);MAT1

## [,1] [,2] [,3] [,4]


## [1,] 1 2 3 4
## [2,] 1 2 1 2
## [3,] 3 4 5 6

Mat2=matrix(cbind(a1,a2,a3),nrow=3, ncol=4);Mat2

## [,1] [,2] [,3] [,4]


## [1,] 1 4 1 4
## [2,] 2 1 2 5
## [3,] 3 2 3 6

#Variables in the current R session
ls()

## [1] "a" "A" "a1" "a2" "a3" "ab" "ac"


## [8] "Country" "d" "e" "f" "flag" "g" "h"
## [15] "h1" "i" "I" "j" "k" "M" "M1"
## [22] "M2" "M3" "M4" "mat" "MAT1" "Mat2" "Mk"
## [29] "Ml" "Mv" "Mv1" "p" "r" "s" "tM1"
## [36] "v" "v1" "v10" "v11" "v2" "v3" "v4"
## [43] "v5" "v6" "v7" "v8" "v9" "va" "vb"
## [50] "vn" "vr" "vr1" "vs" "vs1" "vs2" "x"
## [57] "y"

ChMat=matrix(c("a","b","c","d"), nrow=2, ncol=2)
ChMat

## [,1] [,2]
## [1,] "a" "c"
## [2,] "b" "d"

Factors
#A factor stores categorical data as a fixed set of levels.
#In statistical data, categorical variables are used to identify subdivisions (groups) of the data.

U=c(0,4,1,1,2)
#The value 4 is not among the levels 0:3, so it becomes NA
f=factor(U, levels=0:3)
levels(f)=c("none", "more","medium","large")
f

## [1] none <NA> more more medium
## Levels: none more medium large

#Converting factor into numeric for computation (returns the level codes)
as.numeric(f)

## [1] 1 NA 2 2 3

Gender=c(1,1,1,1,2,2,2,1,2,3,2,2,2,2)
f1=factor(Gender, levels=1:3)
levels(f1)=c("Male", "Female", "prefer not to say")
f1
## [1] Male Male Male Male
## [5] Female Female Female Male
## [9] Female prefer not to say Female Female
## [13] Female Female
## Levels: Male Female prefer not to say
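
The two-step pattern above (factor() followed by levels()=) can also be written in one call with the labels argument, and table() then counts the observations per level. This is a sketch with a shortened, made-up code vector gender_codes.

```r
gender_codes <- c(1, 1, 2, 3, 2)

# levels lists the codes found in the data; labels names them in the same order
gender <- factor(gender_codes,
                 levels = 1:3,
                 labels = c("Male", "Female", "Prefer not to say"))
gender

# Frequency of each category
counts <- table(gender)
counts

# The underlying integer codes are still available
codes <- as.integer(gender)
codes
```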

Lists
#A list is a collection of objects of various types, including other lists
#list() function
#Double bracket notation [[ ]]
i=c(5L,4L)
l=1.5
k=as.integer(l)
typeof(i)

## [1] "integer"

class(i)

## [1] "integer"

j=4L
class(j)

## [1] "integer"

M=matrix(0, nrow=3, ncol=3)
class(M)

## [1] "matrix" "array"

v=c("Cricket","Badminton","Football")
class(v)

## [1] "character"

L=list("Delhi",i,j,v,M);L

## [[1]]
## [1] "Delhi"
##
## [[2]]
## [1] 5 4
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] "Cricket" "Badminton" "Football"
##
## [[5]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

l1=list(M1, M);l1

## [[1]]
## [,1] [,2] [,3]
## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

l2=list(L,l1);l2

## [[1]]
## [[1]][[1]]
## [1] "Delhi"
##
## [[1]][[2]]
## [1] 5 4
##
## [[1]][[3]]
## [1] 4
##
## [[1]][[4]]
## [1] "Cricket" "Badminton" "Football"
##
## [[1]][[5]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
##
##
## [[2]]
## [[2]][[1]]
## [,1] [,2] [,3]
## [1,] 1 6 4
## [2,] 2 0 4
## [3,] 2 5 8
##
## [[2]][[2]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

class(L)

## [1] "list"

length(L)

## [1] 5

L[5]

## [[1]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

L[[5]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

#Single brackets return a list of length 1; double brackets return the element itself
class(L[5])

## [1] "list"

length(L[5])

## [1] 1

class(L[[4]])

## [1] "character"

length(L[[4]])

## [1] 3

str(L)

## List of 5
## $ : chr "Delhi"
## $ : int [1:2] 5 4
## $ : int 4
## $ : chr [1:3] "Cricket" "Badminton" "Football"
## $ : num [1:3, 1:3] 0 0 0 0 0 0 0 0 0

#Append the list----


L1=append(L, list(n=c(2,3,4,5)), after=3)
L1

## [[1]]
## [1] "Delhi"
##
## [[2]]
## [1] 5 4
##
## [[3]]
## [1] 4
##
## $n
## [1] 2 3 4 5
##
## [[5]]
## [1] "Cricket" "Badminton" "Football"
##
## [[6]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

str(L1)

## List of 6
## $ : chr "Delhi"
## $ : int [1:2] 5 4
## $ : int 4
## $ n: num [1:4] 2 3 4 5
## $ : chr [1:3] "Cricket" "Badminton" "Football"
## $ : num [1:3, 1:3] 0 0 0 0 0 0 0 0 0

n1=list(c("delhi", "mumbai"), 1L, c(11,12,13,14))
str(n1)

## List of 3
## $ : chr [1:2] "delhi" "mumbai"
## $ : int 1
## $ : num [1:4] 11 12 13 14

L2=append(L,n1, after=4)
str(L2)

## List of 8
## $ : chr "Delhi"
## $ : int [1:2] 5 4
## $ : int 4
## $ : chr [1:3] "Cricket" "Badminton" "Football"
## $ : chr [1:2] "delhi" "mumbai"
## $ : int 1
## $ : num [1:4] 11 12 13 14
## $ : num [1:3, 1:3] 0 0 0 0 0 0 0 0 0

a1=c(2,7,3,8)
a2=1:8
l1=list(a1,a2)
View(l1)
#Random sample of 2 elements from the list
sample(L1,2)

## [[1]]
## [1] "Cricket" "Badminton" "Football"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0

sample(L1, 3, replace = FALSE)

## $n
## [1] 2 3 4 5
##
## [[2]]
## [1] 5 4
##
## [[3]]
## [1] 4

sample(L1, 3, replace = T)

## [[1]]
## [1] "Delhi"
##
## [[2]]
## [1] "Cricket" "Badminton" "Football"
##
## [[3]]
## [1] 5 4

#convert list in to vector


LU=unlist(L1)
LU

## n1 n2
## "Delhi" "5" "4" "4" "2" "3"
## n3 n4
## "4" "5" "Cricket" "Badminton" "Football" "0"
##
## "0" "0" "0" "0" "0" "0"
##
## "0" "0"

#Mathematical operations cannot be performed directly on a list,
#because a list can contain a combination of numeric, character and logical elements.
#However, the lapply() function allows you to apply
#a mathematical operation to each numeric element of a list.

l1

## [[1]]
## [1] 2 7 3 8
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8

lapply(l1, function(x)sqrt(x))

## [[1]]
## [1] 1.414214 2.645751 1.732051 2.828427
##
## [[2]]
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427

lapply(l1, function(x) sum(x))

## [[1]]
## [1] 20
##
## [[2]]
## [1] 36

l1[1]

## [[1]]
## [1] 2 7 3 8
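
lapply() always returns a list. When every result is a single number, sapply() simplifies the result to a vector, and Filter() can select the numeric elements of a mixed list first. This is a sketch; number_list and mixed_list are made-up names that mirror l1 and L above.

```r
number_list <- list(c(2, 7, 3, 8), 1:8)

# lapply returns a list, sapply simplifies to a numeric vector
sums_as_list <- lapply(number_list, sum)
sums_as_vec  <- sapply(number_list, sum)
sums_as_vec

# Keep only the numeric elements of a mixed list, then average each one
mixed_list <- list("Delhi", c(5L, 4L), 4L)
numeric_only <- Filter(is.numeric, mixed_list)
means <- sapply(numeric_only, mean)
means
```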

User Defined Functions


Functions in R
Function with no argument and no return
#A user-defined function is designed according to the needs of a user
SF=function(){
  a=10
  b=10
  c=a^b*(a+b-500)
  print("value of function")
  print(c)
}
SF()

## [1] "value of function"


## [1] -4.8e+12

Function with no argument but it returns a value


SumF1=function(){
  a=10
  b=10
  c=a+b
  print("Addition of two numbers")
  return(c)
}
SumF1()

## [1] "Addition of two numbers"

## [1] 20

Function with arguments but no return value


SumF3=function(a,b){
  #a and b are reassigned here, so any values passed by the caller are ignored
  a=10
  b=10
  c=a+b
  print("Addition of two numbers")
  print(c)
}
SumF3()

## [1] "Addition of two numbers"

## [1] 20

SumF3(1900, 300)

## [1] "Addition of two numbers"

## [1] 20
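
SumF3(1900, 300) prints 20 because the body of SumF3 reassigns a and b. A sketch of a variant that keeps 10 and 10 as default values but uses whatever the caller passes is given below. The name sum_with_defaults is hypothetical and not part of the original notes.

```r
# Default argument values apply only when the caller omits the argument
sum_with_defaults <- function(a = 10, b = 10) {
  total <- a + b
  print("Addition of two numbers")
  print(total)
  invisible(total)   # hand the value back without printing it a second time
}

sum_with_defaults()            # uses the defaults: prints 20
sum_with_defaults(1900, 300)   # uses the supplied values: prints 2200
```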

Function with arguments with return value

SumF4=function(a,b){
  c=a+b
  print("Addition of two numbers")
  return(c)
}
SumF4(100,200)

## [1] "Addition of two numbers"

## [1] 300

MF=function(a,b,c){
  d=a*b*c
  print("Multiplication of three variables")
  print(d)
}
MF(1,2,3)

## [1] "Multiplication of three variables"

## [1] 6

Conditional Statements
Types of Statements
If Statement
#Structure: if (condition){Statements}
m=6
m=as.integer(m)
if(m>0){"The entered number is positive"}

## [1] "The entered number is positive"

If Else Statement
#Structure: If (condition){Statements}else {Statements}
#Example 1
m=-6
m=as.integer(m)
if(m>0){"The entered number is positive"}else {"The entered number is not positive"}

## [1] "The entered number is not positive"

#Example 2
n=8
n=as.integer(n)
if((n%%2)==0){
  print("The entered number is even.")
}else {
  print("The entered number is odd.")
}

## [1] "The entered number is even."

Nested If Else Statement


#Structure:if(condition){Statements}
#else if(condition){statements}else{statements}

#Example 1
#Find the greatest number among three numbers
a=5
a=as.integer(a)
b=10
b=as.integer(b)
c=15
c=as.integer(c)
if ((a>b)&(a>c))
{
  print("First number is the greatest")
}else if (b>c)
{
  print("Second number is the greatest")
} else {
  print("Third number is the greatest")
}

## [1] "Third number is the greatest"

#Example 2
PMarks=85
PMarks=as.integer(PMarks)
if ((PMarks<50))
{
  print("Student has failed in annual examination.")
} else if ((50<= PMarks) &(PMarks<60))
{
  print("Student passed with second division.")
} else
{
  print("Student passed with first division.")
}

## [1] "Student passed with first division."
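
if and else test a single value. To grade a whole vector of marks in one step, the vectorised ifelse() function can be nested in the same way. This sketch reuses the cut-offs 50 and 60 from the example above; the marks themselves are made up.

```r
marks <- c(45, 55, 85)

# ifelse() evaluates the condition for every element of marks
division <- ifelse(marks < 50, "Fail",
            ifelse(marks < 60, "Second division", "First division"))
division
```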


Loops In R
For Loop
#Structure: for(name in expression){Statements}
#Example 1
for(i in 1:5)
{print(1:i)}

## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5

#Example 2
for(n in c(2,5,10,20,50)) {
  print(2^n)
}

## [1] 4
## [1] 32
## [1] 1024
## [1] 1048576
## [1] 1.1259e+15
While Loop
#Structure: Loop variable initialization
# while(condition){Statements
# loop variable increment/decrement}

#Example 1
i=1 #initialization of variable i
while (i<10)
{
  print(4+i)
  i=i+1
}
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13

#Example 2
#Fibonacci Series
print("Fibonacci Series")

## [1] "Fibonacci Series"

#Starting values; the function reads them from the global environment on its first pass
a=-1
b=1
Fibbo=function(n)
{
  while(n>0)
  {
    s=a+b
    print(s)
    a=b
    b=s
    n=n-1
  }
}
n=10
n=as.integer(n)
Fibbo(n)

## [1] 0
## [1] 1
## [1] 1
## [1] 2
## [1] 3
## [1] 5
## [1] 8
## [1] 13
## [1] 21
## [1] 34
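
Fibbo() prints each term and depends on the global variables a and b. A self-contained sketch that returns the terms as a vector instead is shown below; fib_sequence is a hypothetical name, and the starting values -1 and 1 are kept from the example above.

```r
fib_sequence <- function(n) {
  terms <- numeric(n)      # pre-allocate the result vector
  prev <- -1               # same starting values as in Fibbo()
  curr <- 1
  for (k in seq_len(n)) {  # seq_len(0) is empty, so n = 0 is handled safely
    nxt <- prev + curr
    terms[k] <- nxt
    prev <- curr
    curr <- nxt
  }
  terms
}

fib_sequence(10)
```

Returning a vector means the result can be stored, summed or plotted rather than only read from the console.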

Repeat Loop

Repeat Statement
#Structure: Loop variable initialization
# repeat{Statements
# if(condition){break}}
#Example 1
print("Repeat Statement")

## [1] "Repeat Statement"

i=1
repeat
{
  print(paste("i=", i))
  i=i+1
  if(i==10)
  {
    #break statement
    break
  }
}

## [1] "i= 1"


## [1] "i= 2"
## [1] "i= 3"
## [1] "i= 4"
## [1] "i= 5"
## [1] "i= 6"
## [1] "i= 7"
## [1] "i= 8"
## [1] "i= 9"

Break and Next Statement


#Example 2
#Use of break and next statement
print("Break statement")

## [1] "Break statement"

i=1
while(i<10)
{
  print(paste("i=", i))
  i=i+1
  if(i==5) #break statement
    break
}

## [1] "i= 1"
## [1] "i= 2"
## [1] "i= 3"
## [1] "i= 4"

print("The loop terminated when i reaches 5")

## [1] "The loop terminated when i reaches 5"

print("--------------------------------------")

## [1] "--------------------------------------"

print("Next Statement")

## [1] "Next Statement"

print("The loop continue execute after j reaches 5")

## [1] "The loop continue execute after j reaches 5"

for(j in 1:10)
{
  print (paste("j=", j))
  if(j==5)
    #next statement
    next
}

## [1] "j= 1"


## [1] "j= 2"
## [1] "j= 3"
## [1] "j= 4"
## [1] "j= 5"
## [1] "j= 6"
## [1] "j= 7"
## [1] "j= 8"
## [1] "j= 9"
## [1] "j= 10"
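
In the loop above, next comes after print(), so j= 5 is still printed and every value from 1 to 10 appears. A sketch of the same loop with the test placed first, so that the iteration for j equal to 5 is actually skipped, follows. The vector printed_values is added only to record what was printed.

```r
printed_values <- c()
for (j in 1:10) {
  if (j == 5) next                       # skip the rest of this iteration
  printed_values <- c(printed_values, j)
  print(paste("j=", j))
}
printed_values
```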

Number in a list
#Example 3
#Searching number in a list
n=25
n=as.integer(n)
for (i in 1:50)
{
if (i==n) {d=1
print("Search is successful")
break }
else
{d=0
}
}

## [1] "Search is successful"

if (d==0)
{print("Number is not in the list")

}
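The same search can be written without a loop using the `%in%` operator and `match()`. This is a sketch under the same assumptions as Example 3 (searching for 25 in the numbers 1 to 50); the names `values`, `found` and `position` are illustrative.

```r
n <- 25L
values <- 1:50

found <- n %in% values         # TRUE if n occurs anywhere in values
position <- match(n, values)   # index of the first match, NA if absent

if (found) {
  print(paste("Search is successful at position", position))
} else {
  print("Number is not in the list")
}
```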
Data Preparation
Exploring Data in R
#Creating Data Frame
EmpNo= c(1000, 1001, 1002, 1003, 1004)
EmpName=c("Jack", "Jane", "Margaritta", "Joe", "Dave")
ProjName= c("PO1", "PO2", "PO3", "PO4", "PO5")

Employee=data.frame(EmpNo, EmpName, ProjName)


View(Employee)
#Data Frame Access
#By Providing the Index Number in Square Brackets
Employee[2]

## EmpName
## 1 Jack
## 2 Jane
## 3 Margaritta
## 4 Joe
## 5 Dave

Employee[1:2]

## EmpNo EmpName
## 1 1000 Jack
## 2 1001 Jane
## 3 1002 Margaritta
## 4 1003 Joe
## 5 1004 Dave

Employee [3,]

## EmpNo EmpName ProjName


## 3 1002 Margaritta PO3

Employee[3]

## ProjName
## 1 PO1
## 2 PO2
## 3 PO3
## 4 PO4
## 5 PO5

Employee[,3]

## [1] "PO1" "PO2" "PO3" "PO4" "PO5"

# New row names
row.names(Employee) = c("Employee 1", "Employee 2", "Employee 3", "Employee 4", "Employee 5")
row.names (Employee)

## [1] "Employee 1" "Employee 2" "Employee 3" "Employee 4" "Employee 5"

##Row identification in data frame


Employee

## EmpNo EmpName ProjName


## Employee 1 1000 Jack PO1
## Employee 2 1001 Jane PO2
## Employee 3 1002 Margaritta PO3
## Employee 4 1003 Joe PO4
## Employee 5 1004 Dave PO5

Employee ["Employee 1",]

## EmpNo EmpName ProjName


## Employee 1 1000 Jack PO1

Employee [c ("Employee 3", "Employee 5"),]

## EmpNo EmpName ProjName


## Employee 3 1002 Margaritta PO3
## Employee 5 1004 Dave PO5

#Column identification in data frame


Employee[["EmpName"]]

## [1] "Jack" "Jane" "Margaritta" "Joe" "Dave"

Employee$EmpName

## [1] "Jack" "Jane" "Margaritta" "Joe" "Dave"

Employee[c("EmpNo", "ProjName")]

## EmpNo ProjName
## Employee 1 1000 PO1
## Employee 2 1001 PO2
## Employee 3 1002 PO3
## Employee 4 1003 PO4
## Employee 5 1004 PO5

#new column
Employee$EmpExpYears =c(5, 9, 6, 12, 7)
Employee

## EmpNo EmpName ProjName EmpExpYears


## Employee 1 1000 Jack PO1 5
## Employee 2 1001 Jane PO2 9
## Employee 3 1002 Margaritta PO3 6
## Employee 4 1003 Joe PO4 12
## Employee 5 1004 Dave PO5 7

#Ordering the Data Frames----


#Ascending order
Employee[order(Employee$EmpExpYears),]

## EmpNo EmpName ProjName EmpExpYears


## Employee 1 1000 Jack PO1 5
## Employee 3 1002 Margaritta PO3 6
## Employee 5 1004 Dave PO5 7
## Employee 2 1001 Jane PO2 9
## Employee 4 1003 Joe PO4 12

#Descending order
Employee[order(-Employee$EmpExpYears),]

## EmpNo EmpName ProjName EmpExpYears


## Employee 4 1003 Joe PO4 12
## Employee 2 1001 Jane PO2 9
## Employee 5 1004 Dave PO5 7
## Employee 3 1002 Margaritta PO3 6
## Employee 1 1000 Jack PO1 5
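`order()` also accepts several keys, which helps when rows tie on the first column. The sketch below uses a small stand-in data frame `emp`; its project values are changed from the original (an assumption made here) so that several employees share a project.

```r
# Stand-in for the Employee data frame; ProjName values are altered (assumption)
# so that the first sort key contains ties
emp <- data.frame(
  EmpName = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("PO1", "PO2", "PO1", "PO2", "PO1"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# Sort by project (ascending), then by experience (descending) within a project
emp_sorted <- emp[order(emp$ProjName, -emp$EmpExpYears), ]
emp_sorted

# decreasing = TRUE reverses the ordering and also works for character columns
emp[order(emp$EmpName, decreasing = TRUE), ]
```

The minus sign only works for numeric columns; for character columns `decreasing = TRUE` is the usual choice.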
Functions For understanding Data in Data Frames
#1. Dimension
dim(Employee)

## [1] 5 4

#2. nrow() function returns the number of rows
nrow(Employee)

## [1] 5

#3. ncol() function returns the number of columns
ncol(Employee)

## [1] 4

#4. str() function compactly displays the internal structure of R objects
str (Employee)

## 'data.frame': 5 obs. of 4 variables:
## $ EmpNo : num 1000 1001 1002 1003 1004
## $ EmpName : chr "Jack" "Jane" "Margaritta" "Joe" ...
## $ ProjName : chr "PO1" "PO2" "PO3" "PO4" ...
## $ EmpExpYears: num 5 9 6 12 7

#5. summary() function to return result summaries for each column
summary (Employee)


## EmpNo EmpName ProjName EmpExpYears
## Min. :1000 Length:5 Length:5 Min. : 5.0
## 1st Qu.:1001 Class :character Class :character 1st Qu.: 6.0
## Median :1002 Mode :character Mode :character Median : 7.0
## Mean :1002 Mean : 7.8
## 3rd Qu.:1003 3rd Qu.: 9.0
## Max. :1004 Max. :12.0

#Mode: the most frequent value (the first one found when several values tie)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(Employee$EmpExpYears)

## [1] 5

#6. names() function returns the names of the objects.
names (Employee)

## [1] "EmpNo" "EmpName" "ProjName" "EmpExpYears"

#7. head() function is used to obtain the first n observations,
#where n is set as 6 by default.
head(Employee)

## EmpNo EmpName ProjName EmpExpYears


## Employee 1 1000 Jack PO1 5
## Employee 2 1001 Jane PO2 9
## Employee 3 1002 Margaritta PO3 6
## Employee 4 1003 Joe PO4 12
## Employee 5 1004 Dave PO5 7

data()
View(JohnsonJohnson)
View(mtcars)
head(mtcars)

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

tail(mtcars)

## mpg cyl disp hp drat wt qsec vs am gear carb


## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2

#Structure of data set
str(mtcars)

## 'data.frame': 32 obs. of 11 variables:


## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

head(mtcars, n=8)

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

#8. tail() function is used to obtain the last n observations,
#where n is set as 6 by default.
tail(Employee)

## EmpNo EmpName ProjName EmpExpYears


## Employee 1 1000 Jack PO1 5
## Employee 2 1001 Jane PO2 9
## Employee 3 1002 Margaritta PO3 6
## Employee 4 1003 Joe PO4 12
## Employee 5 1004 Dave PO5 7

tail(Employee, n=2)

## EmpNo EmpName ProjName EmpExpYears


## Employee 4 1003 Joe PO4 12
## Employee 5 1004 Dave PO5 7

#9. edit() function will invoke the text editor on the R object.
#Dynamic editing of data set
edit(Employee)
## EmpNo EmpName ProjName EmpExpYears
## Employee 1 1000 Jack PO1 5
## Employee 2 1001 Jane PO2 9
## Employee 3 1002 Margaritta PO3 6
## Employee 4 1003 Joe PO4 12
## Employee 5 1004 Dave PO5 7

#Save changes in data set
fix(Employee)
View(Employee)

Reading data
library(tidyverse)


## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 1.0.0
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.5.0
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

library(datasets)
data()
library(readxl)
data()
table1

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

table2

## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

#Choose the Excel file interactively
df=read_excel(file.choose())
getwd()

## [1] "D:/Om DUbey"

#Read the file by giving its full path
df1=read_excel("D:/Om DUbey/SedanCar.xlsx")
getwd()

## [1] "D:/Om DUbey"

#Set the working directory, then read the file by name only
setwd("D:/Om DUbey")
getwd()

## [1] "D:/Om DUbey"

df3=read_excel("SedanCar.xlsx")

mtcars

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

carsdataframe=mtcars
View(carsdataframe)

Subsetting dataframe
#Subsetting data frame
submtcars=subset(carsdataframe, hp>=100)
View(submtcars)

#To subset the rows (hp>=100) and display only selected columns (carb and gear)
subset(carsdataframe, hp>=100, select = c(carb,gear))

## carb gear
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Hornet 4 Drive 1 3
## Hornet Sportabout 2 3
## Valiant 1 3
## Duster 360 4 3
## Merc 280 4 4
## Merc 280C 4 4
## Merc 450SE 3 3
## Merc 450SL 3 3
## Merc 450SLC 3 3
## Cadillac Fleetwood 4 3
## Lincoln Continental 4 3
## Chrysler Imperial 4 3
## Dodge Challenger 2 3
## AMC Javelin 2 3
## Camaro Z28 4 3
## Pontiac Firebird 2 3
## Lotus Europa 2 5
## Ford Pantera L 4 5
## Ferrari Dino 6 5
## Maserati Bora 8 5
## Volvo 142E 2 4

subset

## function (x, ...)


## UseMethod("subset")
## <bytecode: 0x000000001481ffe0>
## <environment: namespace:base>
#To subset the data frame and display only the rows where gear is 5 or 3
subset(carsdataframe, gear == 5 | gear == 3)

## mpg cyl disp hp drat wt qsec vs am gear carb


## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

subset

## function (x, ...)


## UseMethod("subset")
## <bytecode: 0x000000001481ffe0>
## <environment: namespace:base>

#Read from a tab separated and csv file
tab_sep=read.table("D:/Om DUbey/tabsep.txt", sep=" ", header=TRUE)
csvitem=read.csv("D:/Om DUbey/item.csv", sep=",", header=TRUE)

#merging data frame
merged=merge(x=csvitem, y=tab_sep)
merged

## Itemcode ItemCategory ItemPrice Itemcode.ItemQtyonHand.ItemReorderLv1
## 1 I1001 Electronics 700 I1001\t75\t25
## 2 I1002 Desktop supplies 300 I1001\t75\t25
## 3 I1003 Office supplies 350 I1001\t75\t25
## 4 I1001 Electronics 700 I1002\t30\t25
## 5 I1002 Desktop supplies 300 I1002\t30\t25
## 6 I1003 Office supplies 350 I1002\t30\t25
## 7 I1001 Electronics 700 I1003\t35\t25
## 8 I1002 Desktop supplies 300 I1003\t35\t25
## 9 I1003 Office supplies 350 I1003\t35\t25

#Note: tabsep.txt is tab separated but was read with sep=" ", so each line is stored
#in a single column (the \t characters are still visible in the output).
#The two data frames then share no column name, and merge() returns all 9 row combinations.
#Reading the file with sep="\t" gives merge() a common Itemcode column to match on.
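The output above has nine rows because the two data frames share no column name, so `merge()` combines every row of one with every row of the other. A sketch of a keyed merge is given below. The two data frames are hypothetical in-memory copies built from the values visible in that output; the column names are inferred from its header.

```r
# Hypothetical in-memory copies of the two files (values taken from the output above)
csvitem <- data.frame(
  Itemcode = c("I1001", "I1002", "I1003"),
  ItemCategory = c("Electronics", "Desktop supplies", "Office supplies"),
  ItemPrice = c(700, 300, 350)
)
tab_sep <- data.frame(
  Itemcode = c("I1001", "I1002", "I1003"),
  ItemQtyonHand = c(75, 30, 35),
  ItemReorderLv1 = c(25, 25, 25)
)

# by = "Itemcode" makes the key explicit; only rows with matching codes are combined
merged <- merge(x = csvitem, y = tab_sep, by = "Itemcode")
merged
```

With a shared key the result has one row per item and five columns. Adding `all.x = TRUE` would keep items from `csvitem` that have no match in `tab_sep`.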

Data Summary
#Data Summary#----
summary(Employee)

## EmpNo EmpName ProjName EmpExpYears


## Min. :1000 Length:5 Length:5 Min. : 5.0
## 1st Qu.:1001 Class :character Class :character 1st Qu.: 6.0
## Median :1002 Mode :character Mode :character Median : 7.0
## Mean :1002 Mean : 7.8
## 3rd Qu.:1003 3rd Qu.: 9.0
## Max. :1004 Max. :12.0

min(Employee[4]) #4th Column

## [1] 5

max(Employee[4])

## [1] 12

range(Employee[4])

## [1] 5 12

Employee[,4]

## [1] 5 9 6 12 7

mean(Employee[,4])

## [1] 7.8

median(Employee[,4])

## [1] 7

#median absolute deviation
mad(Employee[,4])

## [1] 2.9652

IQR(Employee[,4])

## [1] 3

quantile(Employee[,4])

## 0% 25% 50% 75% 100%
## 5 6 7 9 12

#sapply() function is used to obtain the descriptive statistics
sapply(Employee, mean, na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## EmpNo EmpName ProjName EmpExpYears
## 1002.0 NA NA 7.8

sapply(mtcars, mean, trim=0.05, na.rm=TRUE)

## mpg cyl disp hp drat wt
## 19.9533333 6.2000000 228.0000000 143.5666667 3.5800000 3.2005000
## qsec vs am gear carb
## 17.7920000 0.4333333 0.4000000 3.6666667 2.7000000

sapply(mtcars, mean, na.rm=TRUE)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500

#Finding the Missing Values
x =c(2,5,86,9,NA,45,3)
#The quoted string "NA" is ordinary text, not a missing value
y = c("red",NA,"NA")

#is.na function is used to find and create missing values
is.na(x)

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE

is.na(y)

## [1] FALSE TRUE FALSE


#na.action provides options for treating the missing data.
#Possible na.action settings include:
#na.omit, na.exclude: These functions return the object after removing
#the observations that contain missing values.
#na.pass: This function returns the object unchanged, even with missing values.
#na.fail: This function returns the object only if it has no missing values; otherwise it signals an error.

C= as.data.frame (matrix(c(1:5,NA),ncol=2))
C

## V1 V2
## 1 1 4
## 2 2 5
## 3 3 NA

na.omit(C)

## V1 V2
## 1 1 4
## 2 2 5

na.exclude(C)

## V1 V2
## 1 1 4
## 2 2 5

na.pass(C)

## V1 V2
## 1 1 4
## 2 2 5
## 3 3 NA
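`na.fail()` is the one setting not demonstrated above, because it stops with an error when missing values are present. The sketch below wraps it in `tryCatch()` so that the error becomes a message; the message text and the name `result` are illustrative.

```r
C <- as.data.frame(matrix(c(1:5, NA), ncol = 2))

# na.fail() stops with an error when any value is missing;
# tryCatch() turns that error into a message so the script can continue
result <- tryCatch(
  na.fail(C),
  error = function(e) "na.fail: missing values found"
)
print(result)

# complete.cases() flags the rows that na.omit() would keep
complete.cases(C)
```

In modelling functions such as `lm()`, the same choices are passed through the `na.action` argument, for example `na.action = na.omit`.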

library(tidyverse)
who

## # A tibble: 7,240 x 60
## country iso2 iso3 year new_s~1 new_s~2 new_s~3 new_s~4 new_s~5 new_s~6
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 NA NA NA NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA NA NA NA
## # ... with 7,230 more rows, 50 more variables: new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, ...

v=na.exclude(who)
#Invalid Values and outliers
# An invalid value can be NA, NaN, Inf or -Inf.
#Base R provides anyNA(x), is.na(x), is.nan(x), is.finite(x) and is.infinite(x) for these checks,
#where the value of x can be a vector, matrix or array.
#anyInvalid(x) and is.invalid(x) are not base R functions; they need an add-on package or a user-defined helper.
#Here, anyNA function returns a TRUE value if the input has any NA or NaN values.
#Else, it returns a FALSE value. This function is equivalent to any(is.na(x)).

anyNA(c(-9,NaN,9))
## [1] TRUE

is.finite(c(-9, Inf,9))
## [1] TRUE FALSE TRUE

is.nan(c(-9, Inf,9))
## [1] FALSE FALSE FALSE

is.nan(c(-9, Inf, NaN))
## [1] FALSE FALSE TRUE
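The base R checks shown above can be combined into one small helper that counts each kind of invalid value. This is a sketch only; `count_invalid` is a hypothetical, user-defined name and not a base R function.

```r
# count_invalid() is a hypothetical helper, not a base R function
count_invalid <- function(x) {
  c(
    na_only = sum(is.na(x) & !is.nan(x)),   # NA but not NaN
    nan     = sum(is.nan(x)),
    inf     = sum(is.infinite(x)),          # Inf and -Inf
    valid   = sum(is.finite(x))
  )
}

res <- count_invalid(c(-9, NA, NaN, Inf, -Inf, 9))
res
```

Note that `is.na()` is TRUE for both NA and NaN, which is why the first count excludes NaN explicitly.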

custdata=read.table('D:/Om DUbey/custdata.txt', header = TRUE, sep ='\t')
View(custdata)


summary(custdata)

## custid sex is.employed income


## Min. : 2068 Length:1000 Mode :logical Min. : -8700
## 1st Qu.: 345667 Class :character FALSE:73 1st Qu.: 14600
## Median : 693403 Mode :character TRUE :599 Median : 35000
## Mean : 698500 NA's :328 Mean : 53505
## 3rd Qu.:1044606 3rd Qu.: 67000
## Max. :1414286 Max. :615000
##
## marital.stat health.ins housing.type recent.move
## Length:1000 Mode :logical Length:1000 Mode :logical
## Class :character FALSE:159 Class :character FALSE:820
## Mode :character TRUE :841 Mode :character TRUE :124
## NA's :56
##
##
##
## num.vehicles age state.of.res
## Min. :0.000 Min. : 0.0 Length:1000
## 1st Qu.:1.000 1st Qu.: 38.0 Class :character
## Median :2.000 Median : 50.0 Mode :character
## Mean :1.916 Mean : 51.7
## 3rd Qu.:2.000 3rd Qu.: 64.0
## Max. :6.000 Max. :146.7
## NA's :56

summary(custdata$income)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8700 14600 35000 53505 67000 615000

summary(custdata$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 38.0 50.0 51.7 64.0 146.7

t=hist(custdata$income, xlab="Income", col = "red")
#plot() takes no data argument for a density object, so only the density itself is passed
plot(density(custdata$income))


hist(custdata$income, breaks=500, xlim=c(-8000, 615000), col="lightblue", main="Histogram of Income",
xlab ="Income", ylab="Density", probability= TRUE)
lines(density(custdata$income))

min(custdata$income)
## [1] -8700

max(custdata$income)
## [1] 615000
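The summaries above show values that cannot be right, such as a negative income and an age of 146.7. One common way to handle them is to mark such values as NA before further analysis. The sketch below uses a small hypothetical data frame `cust` instead of the custdata file, and the plausibility limits are assumptions chosen for illustration.

```r
# Hypothetical sample with the same kind of problems as custdata
cust <- data.frame(
  income = c(-8700, 14600, 35000, 615000),
  age    = c(0, 38, 50, 146.7)
)

# Assumed plausibility rules: income must not be negative, age between 18 and 110
bad_income <- cust$income < 0
bad_age    <- cust$age < 18 | cust$age > 110

# Replace implausible values with NA so that later summaries can skip them
cust$income[bad_income] <- NA
cust$age[bad_age] <- NA

summary(cust)
colSums(is.na(cust))   # number of values set to NA in each column
```

Whether to set such values to NA, correct them, or drop the rows depends on the study; the limits used here are not taken from the source data.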

Descriptive Statistics
# Calculate the range of engine displacement (disp).
dis=mtcars$disp
#Apply max and min function to return the range
Range=max(dis) - min(dis);Range

## [1] 400.9

#Frequencies and Mode
head(subset(mtcars, select = "gear"))

## gear
## Mazda RX4 4
## Mazda RX4 Wag 4
## Datsun 710 4
## Hornet 4 Drive 3
## Hornet Sportabout 3
## Valiant 3

factor(mtcars$gear)
## [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
## Levels: 3 4 5

w= table(mtcars$gear)
w

##
## 3 4 5
## 15 12 5

cbind(w) #cbind()function can be used to display the result in column format

## w
## 3 15
## 4 12
## 5 5

#Create the function
getmode = function(y){
uniqy = unique(y)
uniqy[which.max(tabulate(match(y,uniqy)))]
}
# Define the input vector values
v = c(5,6,4,8,5,7,4,6,5,8,3,2,1)
#Calculate the mode with user-defined functions
resultmode= getmode(v)
print(resultmode)

## [1] 5

#Define characters as input vector values
charv = c("as","is","is","it","in")
#Calculate mode using user-defined function
resultmode = getmode(charv)
print(resultmode)
## [1] "is"
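`getmode()` returns a single value, so when several values share the highest frequency only the first is reported. A sketch of a variant that returns every tied value follows; `all_modes` is a hypothetical name and the approach via `table()` is one option among several.

```r
# all_modes() is a hypothetical helper name
all_modes <- function(y) {
  freq <- table(y)                          # frequency of each distinct value
  modes <- names(freq)[freq == max(freq)]   # keep every value with the top frequency
  # table() stores the values as character names, so convert back for numbers
  if (is.numeric(y)) as.numeric(modes) else modes
}

all_modes(c(5, 6, 4, 8, 5, 7, 4, 6, 5, 8, 3, 2, 1))
all_modes(c(1, 1, 2, 2, 3))
all_modes(c("as", "is", "is", "it", "in"))
```

For the second vector both 1 and 2 occur twice, so both are returned, whereas `getmode()` would report only 1.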

cardata=mtcars
str(cardata)

## 'data.frame': 32 obs. of 11 variables:


## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

#conversion to factor
cardata$vs=as.factor(cardata$vs)
str(cardata)

## 'data.frame': 32 obs. of 11 variables:


## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

levels(cardata$vs)=c("V shaped", "Straight or Inline")
str(cardata)

## 'data.frame': 32 obs. of 11 variables:


## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "V shaped","Straight or Inline": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

cardata$am=as.factor(cardata$am)
levels(cardata$am)=c("Automatic", "Manual")
count=table(cardata$vs)
count

##
## V shaped Straight or Inline
## 18 14

barplot(count)
bar=barplot(count, main="Cars of Engine Type", xlab="Type of Engine", ylab="Number of Cars",
col="yellow")
#Add value labels to the existing bars: cex=1 keeps the label text at its default size,
#pos=3 places each label just above the given point (here y=0, the base of the bar)
text(bar, 0, count, cex=1, pos=3)
#for horizontal bar plots the axis labels are swapped, because the counts run along the x axis
bar1=barplot(count, main="Cars of Engine Type", xlab="Number of Cars",
ylab="Type of Engine", col="yellow", horiz = TRUE)

bar2=barplot(count, main="Cars of Engine Type", xlim=c(0,20), xlab="Number of Cars",
ylab="Type of Engine", col="yellow", horiz = TRUE)

# Clustered Bar chart
stbc=table(cardata$am,cardata$vs )
stbc

##
## V shaped Straight or Inline
## Automatic 12 7
## Manual 6 7

#beside=TRUE draws the bars side by side (clustered)
barplot(stbc, main="Cars by Engine Type and Transmission Type",xlab="Engine Type and Transmission Type",
ylab="Number of cars", legend= rownames(stbc), col=c("Lightblue", "green"),beside=TRUE)
#without beside=TRUE the bars are stacked
barplot(stbc, main="Cars by Engine Type and Transmission Type",xlab="Engine Type and Transmission Type",
ylab="Number of cars", legend=rownames(stbc), col=c("Lightblue", "green"))
#args.legend controls the position and box of the legend
barplot(stbc, main="Cars by Engine Type and Transmission Type",xlab="Engine Type and Transmission Type",
ylab="Number of cars", legend=rownames(stbc), col=c("Lightblue", "green"),
args.legend = list(x="topright", bty="o", inset=c(0,-0.1)))

barplot(stbc, xlim= c(0,1), width=0.2, main="Cars by Engine Type and Transmission Type",xlab="Engine Type and Transmission Type",
ylab="Number of cars", legend=rownames(stbc), col=c("Lightblue", "green"),
args.legend = list(x="topright", bty="o", inset=c(0,0)))
# Histogram
#A histogram shows the shape of a distribution and is used for judging normality.
#Normality is a precondition for some statistical tests.
#Therefore, it is important to examine normality before choosing any statistical technique.
# we cannot draw a histogram for a factor variable, so hist(cardata$vs) gives an error
hist(cardata$mpg)
View(cardata)
hist(cardata$mpg, breaks=10, col="lightblue", main="Histogram of Miles per Gallon",
xlab="mpg", ylab="Frequency")
hist(cardata$mpg, breaks=10, xlim=c(10,35), col="lightblue", main="Histogram of Miles per Gallon",
xlab="mpg", ylab="Frequency")
# For relative frequency
hist(cardata$mpg, breaks=10, xlim=c(10,35), col="lightblue", main="Histogram of Miles per Gallon",
xlab="mpg", ylab="Probability", probability= TRUE)
lines(density(cardata$mpg))
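A histogram gives only a visual impression of normality. It can be complemented with a normal Q-Q plot and a formal test. The sketch below uses `qqnorm()`, `qqline()` and `shapiro.test()` from the stats package on the mpg variable; the 5% significance level is an assumption chosen for illustration.

```r
mpg <- mtcars$mpg

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(mpg, main = "Normal Q-Q plot of mpg")
qqline(mpg, col = "blue")

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(mpg)
sw$p.value

# Assumed 5% significance level
if (sw$p.value > 0.05) {
  print("No evidence against normality at the 5% level")
} else {
  print("Normality is rejected at the 5% level")
}
```

With small samples the test has little power, so the plot and the test are best read together.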

#Box Plot
boxplot(cardata$mpg)
boxplot(cardata$mpg, main="Mileage per Gallon", ylab="Miles per Gallon", col="green")
fix(cardata)
View(cardata)
boxplot(cardata$mpg~cardata$am, main="Mileage per Gallon", ylab="Miles per Gallon", col="green")

#boxplot.stats() needs a numeric vector, so it is applied to mpg only (am is a factor)
boxplot(cardata$mpg~cardata$am, main="Mileage per Gallon",
sub=paste("Outlier values:", paste(boxplot.stats(cardata$mpg)$out, collapse=", ")),
ylab="Miles per Gallon", col="green")


#Pie chart
carcyl=table(cardata$cyl)
perc=round(carcyl/sum(carcyl)*100,2)
lbl=paste(names(carcyl), "cylinders", perc, "%", sep=" ")
pie(carcyl, main="Cars by number of cylinders", labels=lbl, col=c("red", "blue", "green"))
#Scatter plot
plot (cardata$wt, cardata$mpg, main="Scatter plot of Weight versus Mileage", xlab=" weight(1000lbs)",
ylab="Miles per gallon")
#Regression line can be generated by executing the following command
abline(lm(cardata$mpg~cardata$wt), col="blue")
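The line drawn by `abline()` comes from a fitted linear model, and the same model object can be inspected directly. The sketch below fits the model on the built-in mtcars data; the name `fit` and the example weight of 3000 lbs are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                 # intercept and slope of the fitted line
summary(fit)$r.squared    # share of the variance in mpg explained by wt

# Predicted mileage for a hypothetical car weighing 3000 lbs (wt is in 1000 lbs)
predict(fit, newdata = data.frame(wt = 3))
```

A negative slope means that heavier cars tend to have lower mileage, which matches the downward trend in the scatter plot.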

#Pairwise scatter plot can be generated
pairs(~mpg+wt+hp, data=cardata, main="Scatter Plot Matrix")
x=seq(1,25,0.1)
y=sin(x)
plot(x,y)
Tidyverse
Spreading across the tibbles
library(tidyverse)
library(ggplot2)
data()
View(who)
table1

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

table2

## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

table3

## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# Spread across two tibbles
# cases
table4a

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

# population
table4b

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583

# Only table1 is tidy, as it follows the tidy data rules:
#Put each data set in a tibble.
#Put each variable in a column.
# Compute rate per 10,000: mutate() will add a column to table1
table1

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

#pipe operator %>%
table1 %>% mutate(rate = cases / population * 10000)

## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
View(who)
t=mutate(table1, rate=cases/population*10000)
t

## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67

mtcars

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

mtcars %>% head() %>% summary()
## mpg cyl disp hp drat
## Min. :18.10 Min. :4 Min. :108.0 Min. : 93.0 Min. :2.760
## 1st Qu.:19.27 1st Qu.:6 1st Qu.:160.0 1st Qu.:106.2 1st Qu.:3.098
## Median :21.00 Median :6 Median :192.5 Median :110.0 Median :3.500
## Mean :20.50 Mean :6 Mean :211.8 Mean :117.2 Mean :3.440
## 3rd Qu.:21.30 3rd Qu.:6 3rd Qu.:249.8 3rd Qu.:110.0 3rd Qu.:3.888
## Max. :22.80 Max. :8 Max. :360.0 Max. :175.0 Max. :3.900
## wt qsec vs am gear
## Min. :2.320 Min. :16.46 Min. :0.0 Min. :0.0 Min. :3.0
## 1st Qu.:2.684 1st Qu.:17.02 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:3.0
## Median :3.045 Median :17.82 Median :0.5 Median :0.5 Median :3.5
## Mean :2.988 Mean :18.13 Mean :0.5 Mean :0.5 Mean :3.5
## 3rd Qu.:3.384 3rd Qu.:19.23 3rd Qu.:1.0 3rd Qu.:1.0 3rd Qu.:4.0
## Max. :3.460 Max. :20.22 Max. :1.0 Max. :1.0 Max. :4.0
## carb
## Min. :1.000
## 1st Qu.:1.000
## Median :1.500
## Mean :2.167
## 3rd Qu.:3.500
## Max. :4.000

# Compute total cases per country (count() sums the values given in wt)
table1 %>%
count(country, wt = cases)

## # A tibble: 3 x 2
## country n
## <chr> <int>
## 1 Afghanistan 3411
## 2 Brazil 118225
## 3 China 426024

Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "black") +
geom_point(aes(colour = country))
#Longer: A common problem is a dataset where some of the column names----
#are not names of variables, but values of a variable.
table4a

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

tidy4a = table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4a

## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766

table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583

tidy4b = table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
tidy4b

## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583

tidy4 = left_join(tidy4a, tidy4b); tidy4

## Joining, by = c("country", "year")

## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Pivot Wider
#Wider: pivot_wider() is the opposite of pivot_longer().----
#You use it when an observation is scattered across multiple rows.

table2

## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

table2 %>% pivot_wider(names_from = type, values_from = count)

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
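
Because pivot_wider() undoes pivot_longer(), a round trip should give back the original table. The sketch below checks this on a small made-up tibble; the object names wide, long and back and the country codes "X" and "Y" are illustrative assumptions, not part of the course data.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- tibble::tibble(
  country = c("X", "Y"),
  `1999` = c(10L, 30L),
  `2000` = c(20L, 40L)
)

# Longer: the year column names become values of a "year" variable
long <- wide %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Wider: spread the "year" values back into columns
back <- long %>%
  pivot_wider(names_from = year, values_from = cases)

long
back
```

The long table has one row per country and year (four rows here), and back has the same columns and values as wide.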

M1=matrix(c(2,2,3,3,4,4,5,5,6,6,7,8,9,10,11,11), nrow=4, ncol=4)


rownames(M1)=c("A", "B", "C","D")
colnames(M1)=c("E", "F", "G", "I")
M1

## E F G I
## A 2 4 6 9
## B 2 4 6 10
## C 3 5 7 11
## D 3 5 8 11

t=as_tibble(M1)
t

## # A tibble: 4 x 4
## E F G I
## <dbl> <dbl> <dbl> <dbl>
## 1 2 4 6 9
## 2 2 4 6 10
## 3 3 5 7 11
## 4 3 5 8 11

t %>% pivot_wider(names_from = E, values_from = F)

## # A tibble: 4 x 4
## G I `2` `3`
## <dbl> <dbl> <dbl> <dbl>
## 1 6 9 4 NA
## 2 6 10 4 NA
## 3 7 11 NA 5
## 4 8 11 NA 5
Separating and uniting
#Separate
#separate() pulls apart one column into multiple columns,
#by splitting wherever a separator character appears.
table3

## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583

table3 %>% separate(rate, into = c("cases", "population"))

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

table3 %>% separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE)

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table3 %>% separate(year, into = c("century", "year"), sep = 2)

## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583

# Unite: unite() is the inverse of separate():
# it combines multiple columns into a single column.
table5

## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583

#By default unite() places an underscore (_) between the values from different columns
table5 %>% unite(new, century, year)

## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583

table5 %>% unite(new, century, year, sep = "/")

## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19/99 745/19987071
## 2 Afghanistan 20/00 2666/20595360
## 3 Brazil 19/99 37737/172006362
## 4 Brazil 20/00 80488/174504898
## 5 China 19/99 212258/1272915272
## 6 China 20/00 213766/1280428583
Missing Values
#Missing values----
#Changing the representation of a dataset brings up an important subtlety of missing values.
#Surprisingly, a value can be missing in one of two possible ways:
#Explicitly, i.e. flagged with NA.
#Implicitly, i.e. simply not present in the data.

stocks = tibble(
  year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks

## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66

#There are two missing values in this dataset:
#1. The return for the fourth quarter of 2015 is explicitly missing,
#because the cell where its value should be instead contains NA.
#2. The return for the first quarter of 2016 is implicitly missing,
#because it simply does not appear in the dataset.

stocks %>% pivot_wider(names_from = year, values_from = return)

## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66

t = stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )
t

## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66

stocks %>% complete(year, qtr)#complete() takes a set of columns, and finds all unique combinations.

## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66

#row-wise readable tibble
treatment = tribble(
  ~person, ~treatment, ~response,
  "Derrick Whitmore", 1, 7,
  NA, 2, 10,
  NA, 3, 9,
  "Katherine Burke", 1, 4
)
treatment

## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 Katherine Burke 1 4

# adding column names to matrix
a <- matrix(m,nrow=1)
colnames(a) <- colnames(data)

#fill in missing values with fill().
#It takes a set of columns where you want missing values to be
#replaced by the most recent non-missing value
#(sometimes called last observation carried forward).

treatment %>% fill(person)


## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
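
fill() carries values downwards by default, and its .direction argument changes that. The sketch below contrasts the default with .direction = "up" on a small made-up tibble; the names visits, filled_down and filled_up and the values in it are illustrative assumptions.

```r
library(tidyr)

# Hypothetical data: the person is recorded only on the first visit
visits <- tibble::tribble(
  ~person, ~visit,
  "Asha",  1,
  NA,      2,
  "Ravi",  1,
  NA,      2
)

# Default: last observation carried forward (downwards)
filled_down <- visits %>% fill(person)

# Upwards: each NA takes the next non-missing value below it
filled_up <- visits %>% fill(person, .direction = "up")

filled_down
filled_up
```

With .direction = "up" the last row stays NA because there is no later value to borrow, which is worth checking before relying on a fill.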

View(who)
str(who)

## tibble [7,240 x 60] (S3: tbl_df/tbl/data.frame)


## $ country : chr [1:7240] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ iso2 : chr [1:7240] "AF" "AF" "AF" "AF" ...
## $ iso3 : chr [1:7240] "AFG" "AFG" "AFG" "AFG" ...
## $ year : int [1:7240] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
## $ new_sp_m014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_m65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sp_f65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_m65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_sn_f65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_m65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ new_ep_f65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_m65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f014 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f1524: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f2534: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f3544: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f4554: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f5564: int [1:7240] NA NA NA NA NA NA NA NA NA NA ...
## $ newrel_f65 : int [1:7240] NA NA NA NA NA NA NA NA NA NA ...

who1 = who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = "key", values_to = "cases",
  values_drop_na = TRUE
)
View(who1$key)
who1 %>% count(key)

## # A tibble: 56 x 2
## key n
## <chr> <int>
## 1 new_ep_f014 1032
## 2 new_ep_f1524 1021
## 3 new_ep_f2534 1021
## 4 new_ep_f3544 1021
## 5 new_ep_f4554 1017
## 6 new_ep_f5564 1017
## 7 new_ep_f65 1014
## 8 new_ep_m014 1038
## 9 new_ep_m1524 1026
## 10 new_ep_m2534 1020
## # ... with 46 more rows
who2 = who1 %>% mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
str(who2)

## tibble [76,046 x 6] (S3: tbl_df/tbl/data.frame)


## $ country: chr [1:76046] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ iso2 : chr [1:76046] "AF" "AF" "AF" "AF" ...
## $ iso3 : chr [1:76046] "AFG" "AFG" "AFG" "AFG" ...
## $ year : int [1:76046] 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
## $ key : chr [1:76046] "new_sp_m014" "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" ...
## $ cases : int [1:76046] 0 10 6 3 5 2 0 5 38 36 ...

tail(who2)

## # A tibble: 6 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Zimbabwe ZW ZWE 2013 new_rel_f1524 2069
## 2 Zimbabwe ZW ZWE 2013 new_rel_f2534 4649
## 3 Zimbabwe ZW ZWE 2013 new_rel_f3544 3526
## 4 Zimbabwe ZW ZWE 2013 new_rel_f4554 1453
## 5 Zimbabwe ZW ZWE 2013 new_rel_f5564 811
## 6 Zimbabwe ZW ZWE 2013 new_rel_f65 725

who3 = who2 %>% separate(key, c("new", "type", "sexage"), sep = "_")
View(who3)
who3 %>% count(new)

## # A tibble: 1 x 2
## new n
## <chr> <int>
## 1 new 76046

who4 = who3 %>% select(-new, -iso2, -iso3)
View(who4)
who5 = who4 %>% separate(sexage, c("sex", "age"), sep = 1)
View(who5)

#Imputation of mean in place of NA
df = who
df$new_sp_m014[is.na(df$new_sp_m014)] = mean(df$new_sp_m014, na.rm = TRUE)
df

## # A tibble: 7,240 x 60
## country iso2 iso3 year new_s~1 new_s~2 new_s~3 new_s~4 new_s~5 new_s~6
## <chr> <chr> <chr> <int> <dbl> <int> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 83.7 NA NA NA NA NA
## 2 Afghanistan AF AFG 1981 83.7 NA NA NA NA NA
## 3 Afghanistan AF AFG 1982 83.7 NA NA NA NA NA
## 4 Afghanistan AF AFG 1983 83.7 NA NA NA NA NA
## 5 Afghanistan AF AFG 1984 83.7 NA NA NA NA NA
## 6 Afghanistan AF AFG 1985 83.7 NA NA NA NA NA
## 7 Afghanistan AF AFG 1986 83.7 NA NA NA NA NA
## 8 Afghanistan AF AFG 1987 83.7 NA NA NA NA NA
## 9 Afghanistan AF AFG 1988 83.7 NA NA NA NA NA
## 10 Afghanistan AF AFG 1989 83.7 NA NA NA NA NA
## # ... with 7,230 more rows, 50 more variables: new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, ...

str(who5)

## tibble [76,046 x 6] (S3: tbl_df/tbl/data.frame)


## $ country: chr [1:76046] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int [1:76046] 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
## $ type : chr [1:76046] "sp" "sp" "sp" "sp" ...
## $ sex : chr [1:76046] "m" "m" "m" "m" ...
## $ age : chr [1:76046] "014" "1524" "2534" "3544" ...
## $ cases : int [1:76046] 0 10 6 3 5 2 0 5 38 36 ...
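
The intermediate objects who1 to who5 can also be built in a single pipeline. The sketch below chains the same steps shown above; the name who_tidy is an assumption introduced here, and the code assumes the tidyr, dplyr and stringr packages are installed.

```r
library(tidyr)
library(dplyr)

who_tidy <- who %>%
  # gather the 56 case-count columns into key/value pairs, dropping NAs
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key", values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  # make the "newrel" prefix consistent with the other keys
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  # split the key at underscores, then drop redundant columns
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  # split sex (first character) from the age group
  separate(sexage, c("sex", "age"), sep = 1)

str(who_tidy)
```

The result should have the same six columns as who5 (country, year, type, sex, age, cases) without leaving five intermediate objects in the workspace.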

Quick plots
###Quick plots
#qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=,
#facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
summary(mtcars)

## mpg cyl disp hp


## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
transmission = factor(mtcars$am, levels=c(0, 1),
labels=c("Automatic", "Manual"))

mtcars$cyl = factor(mtcars$cyl, levels=c(4, 6, 8),
                    labels=c("4 cylinders", "6 cylinders", "8 cylinders"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))

qplot(wt,mpg, data=mtcars, alpha=0.8 ,facets=am~cyl, size=hp)

data(singer, package="lattice")
View(singer)


qplot(height, data=singer, geom=c("density"), facets=voice.part~., fill=voice.part)
qplot(height, data=singer,colour = I("yellow"), geom=c("density"), facets=voice.part~.,
fill=voice.part)

# Dashboard ## A nice qplot here
qplot(factor(cyl),wt,data=mtcars,geom=c("boxplot","jitter"),
fill=factor(cyl),
main="Box plots with superimposed data",
xlab="Number of cylinders", ylab="Weight (1000 lbs)")

A beautiful barplot here
cardata=mtcars
count=table(cardata$vs)
count

##
## 0 1
## 18 14
barplot(count)
A pleasant histogram here
hist(cardata$mpg)

hist(cardata$mpg,breaks=10,col="Sky blue",
main="histogram of miles per gallon",
xlab="mpg",ylab="frequency")

A delightful data frame here


df = data.frame(
  A = c('A', 'B', 'C', 'J', 'E', NA, 'M'),
  B = c(12.5, 9, 16.5, NA, 9, 20, 14.5),
  C = c(NA, 3, 2, NA, 1, NA, 0))
print(df)

## A B C
## 1 A 12.5 NA
## 2 B 9.0 3
## 3 C 16.5 2
## 4 J NA NA
## 5 E 9.0 1
## 6 <NA> 20.0 NA
## 7 M 14.5 0

colnames(df)[2] <- "Om"
rownames(df)[2] <- "Dubey"
print(df)

## A Om C
## 1 A 12.5 NA
## Dubey B 9.0 3
## 3 C 16.5 2
## 4 J NA NA
## 5 E 9.0 1
## 6 <NA> 20.0 NA
## 7 M 14.5 0

Date
Parse Date,Time,Months,and Year
library(tidyverse)
library(lubridate)

## Loading required package: timechange

##
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':


##
## date, intersect, setdiff, union

library(nycflights13)
today()

## [1] "2023-01-16"

now()

## [1] "2023-01-16 21:18:56 IST"

#Date
ymd("2022-11-09")

## [1] "2022-11-09"

mdy("November 09th, 2022")

## [1] "2022-11-09"

dmy("09-Nov-2022")

## [1] "2022-11-09"

ymd(20221109)

## [1] "2022-11-09"

Creating Date and Time
ymd_hms("2022-11-09 20:11:59")

## [1] "2022-11-09 20:11:59 UTC"


mdy_hm("11/09/2022 08:01")

## [1] "2022-11-09 08:01:00 UTC"

#Greenwich England
ymd(20221109, tz = "UTC")

## [1] "2022-11-09 UTC"

ymd(20221109, tz = "EST")

## [1] "2022-11-09 EST"

as.Date("2022-11-09")

## [1] "2022-11-09"

unclass(as.Date("2022-11-09"))

## [1] 19305

Time Spans and Duration


#Time spans
h_age = today() - ymd(20031125)
h_age

## Time difference of 6992 days

#Durations represent an exact number of seconds.
#Duration in seconds
as.duration(h_age)

## [1] "604108800s (~19.14 years)"

dseconds(15)

## [1] "15s"

dminutes(10)

## [1] "600s (~10 minutes)"

dhours(c(12, 24))

## [1] "43200s (~12 hours)" "86400s (~1 days)"

ddays(0:5)

## [1] "0s" "86400s (~1 days)" "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"

dweeks(3)

## [1] "1814400s (~3 weeks)"


#Multiplication
2 * dyears(1)

## [1] "63115200s (~2 years)"

#Addition
dyears(1) + dweeks(12) + dhours(15)

## [1] "38869200s (~1.23 years)"

# Add and subtract durations to and from days
tomorrow = today() + ddays(1)
last_year = today() - dyears(1)

one_pm = ymd_hms("2022-11-16 13:00:00", tz = "America/New_York")
one_pm

## [1] "2022-11-16 13:00:00 EST"

new = ymd_hms("2022-11-16 13:00:00", tz = "UTC")
diff = new - one_pm; diff

## Time difference of -5 hours

one_pm + ddays(1)

## [1] "2022-11-17 13:00:00 EST"

#Periods are time spans but don't have a fixed length in seconds,
#instead they work with "human" times, like days and months.
one_pm

## [1] "2022-11-16 13:00:00 EST"

one_pm + days(1)

## [1] "2022-11-17 13:00:00 EST"

seconds(15)

## [1] "15S"

minutes(10)

## [1] "10M 0S"

hours(c(12, 24))

## [1] "12H 0M 0S" "24H 0M 0S"

days(7)

## [1] "7d 0H 0M 0S"

months(1:6)

## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"

weeks(3)

## [1] "21d 0H 0M 0S"

years(1)

## [1] "1y 0m 0d 0H 0M 0S"

#Addition and multiplication of period


10 * (months(6) + days(1))

## [1] "60m 10d 0H 0M 0S"

days(50) + hours(25) + minutes(2)

## [1] "50d 25H 2M 0S"

# A leap year
ymd("2016-01-01") + dyears(1)

## [1] "2016-12-31 06:00:00 UTC"

ymd("2016-01-01") + years(1)

## [1] "2017-01-01"

# Daylight Savings Time
one_pm + ddays(1)

## [1] "2022-11-17 13:00:00 EST"

one_pm + days(1)

## [1] "2022-11-17 13:00:00 EST"
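
The two results above are identical because no clock change happens between 16 and 17 November 2022. A duration and a period only differ when the interval crosses a daylight saving transition. The sketch below uses 12 March 2016, the day before clocks moved forward in New York; the name one_pm_dst and the chosen date are assumptions added for illustration.

```r
library(lubridate)

# 1 pm on the day before US daylight saving time began in 2016
one_pm_dst <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Duration: adds exactly 86400 seconds, so the clock time shifts by an hour
after_duration <- one_pm_dst + ddays(1)

# Period: adds one calendar day, so the clock time is preserved
after_period <- one_pm_dst + days(1)

after_duration
after_period
```

Here the duration lands on 14:00 local time and the period on 13:00, which is why periods are usually the safer choice for "same time tomorrow" arithmetic.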

Assignment 1
Q1
#Q1----
v1=c(6,4,8,11,3,2,10,12,2,18)
v2=c(12, 14, 18, 11, 9,4, 2, 1, 10, 9)
#(A)
v1*8

## [1] 48 32 64 88 24 16 80 96 16 144

#(B)
v1+v2

## [1] 18 18 26 22 12 6 12 13 12 27

#(C)
log(v2)

## [1] 2.4849066 2.6390573 2.8903718 2.3978953 2.1972246 1.3862944 0.6931472
## [8] 0.0000000 2.3025851 2.1972246

#(D)
v1[-7]

## [1] 6 4 8 11 3 2 12 2 18

#(E)
v2[2]

## [1] 14

#(F)
sample(v2,4)

## [1] 1 9 12 18

#(G)
v1[v1>4]

## [1] 6 8 11 10 12 18

Q2
#Q2----
M1=matrix(c(5,4,1,8,9,2,3,6,7),nrow=3,ncol=3,byrow = T);M1

## [,1] [,2] [,3]


## [1,] 5 4 1
## [2,] 8 9 2
## [3,] 3 6 7

M2=matrix(c(11, 19, 14, 17, 13, 12),nrow = 2,ncol = 3,byrow = T);M2

## [,1] [,2] [,3]


## [1,] 11 19 14
## [2,] 17 13 12

M3=matrix(c(12, 9, 3, 11, 7, 5, 13, 5, 7),nrow=3,ncol=3);M3

## [,1] [,2] [,3]


## [1,] 12 11 13
## [2,] 9 7 5
## [3,] 3 5 7

M4=matrix(c(2,3,4,1,5,7),nrow=3,ncol=2);M4

## [,1] [,2]
## [1,] 2 1
## [2,] 3 5
## [3,] 4 7

#(A)
M1*M3

## [,1] [,2] [,3]


## [1,] 60 44 13
## [2,] 72 63 10
## [3,] 9 30 49

#(B)
M1%*%M3

## [,1] [,2] [,3]


## [1,] 99 88 92
## [2,] 183 161 163
## [3,] 111 110 118

M1%*%M4

## [,1] [,2]
## [1,] 26 32
## [2,] 51 67
## [3,] 52 82

M2%*%M1

## [,1] [,2] [,3]


## [1,] 249 299 147
## [2,] 225 257 127

M2%*%M3

## [,1] [,2] [,3]


## [1,] 345 324 336
## [2,] 357 338 370

M2%*%M4

## [,1] [,2]
## [1,] 135 204
## [2,] 121 166

M3%*%M1

## [,1] [,2] [,3]


## [1,] 187 225 125
## [2,] 116 129 58
## [3,] 76 99 62

M3%*%M4
## [,1] [,2]
## [1,] 109 158
## [2,] 59 79
## [3,] 49 77

M4%*%M2

## [,1] [,2] [,3]


## [1,] 39 51 40
## [2,] 118 122 102
## [3,] 163 167 140

#(C)
M1+M3

## [,1] [,2] [,3]


## [1,] 17 15 14
## [2,] 17 16 7
## [3,] 6 11 14

#(D)
t(M1)

## [,1] [,2] [,3]
## [1,] 5 8 3
## [2,] 4 9 6
## [3,] 1 2 7

t(M3)

## [,1] [,2] [,3]


## [1,] 12 9 3
## [2,] 11 7 5
## [3,] 13 5 7

#(E)
M2[1,]

## [1] 11 19 14

#(F)
M1[sample(nrow(M2),size=2),sample(ncol(M1),size=2)]

## [,1] [,2]
## [1,] 5 4
## [2,] 8 9

#(G)
M1[,2][M1[,2]<5]="na" ;M1

## [,1] [,2] [,3]


## [1,] "5" "na" "1"
## [2,] "8" "9" "2"
## [3,] "3" "6" "7"

#(H)
M1[2,][M1[2,]<5]="na" ;M1

## [,1] [,2] [,3]


## [1,] "5" "na" "1"
## [2,] "8" "9" "na"
## [3,] "3" "6" "7"

Q3
#Q3----
C1=c(1,2,3,4,5)#roll number
class(C1)

## [1] "numeric"

C2=c("a","b","c","d","e")#first name
class(C2)

## [1] "character"

C3=c("aa","bb","cc","dd","ee")#last name
class(C3)

## [1] "character"

result=c(1,1,2,1,2)
C4=factor(result,levels = 1:2)
levels(C4)=c("pass","fail")#pass fail
C4

## [1] pass pass fail pass fail


## Levels: pass fail

#(A)
firstprac1=list(C1,C2,C3,C4);firstprac1

## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c" "d" "e"
##
## [[3]]
## [1] "aa" "bb" "cc" "dd" "ee"
##
## [[4]]
## [1] pass pass fail pass fail
## Levels: pass fail

firstprac1=data.frame(C1,C2,C3,C4)
colnames(firstprac1)=c("rno","first name","last name","pass status")#change column names
firstprac1

## rno first name last name pass status


## 1 1 a aa pass
## 2 2 b bb pass
## 3 3 c cc fail
## 4 4 d dd pass
## 5 5 e ee fail

#(B)
firstprac1["first name"]

## first name
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e

#(C)
firstprac1[firstprac1$`pass status`=="pass",]

## rno first name last name pass status


## 1 1 a aa pass
## 2 2 b bb pass
## 4 4 d dd pass

#(D)
sample(firstprac1,2,replace = F)

## rno pass status


## 1 1 pass
## 2 2 pass
## 3 3 fail
## 4 4 pass
## 5 5 fail

Q4
#Q4----
C5=c(58,59,24,88,10)
C6=c(77,98,20,56,22)
enrolled=c(1,2,1,1,2)
C7=factor(enrolled,levels = 1:2)
levels(C7)=c("science","commerce")
#(A)
firstprac2=list(C5,C6,C7);firstprac2
## [[1]]
## [1] 58 59 24 88 10
##
## [[2]]
## [1] 77 98 20 56 22
##
## [[3]]
## [1] science commerce science science commerce
## Levels: science commerce

firstprac2=data.frame(C5,C6,C7);firstprac2

## C5 C6 C7
## 1 58 77 science
## 2 59 98 commerce
## 3 24 20 science
## 4 88 56 science
## 5 10 22 commerce

colnames(firstprac2)=c("mathematics","english","student enrolled")
firstprac2

## mathematics english student enrolled


## 1 58 77 science
## 2 59 98 commerce
## 3 24 20 science
## 4 88 56 science
## 5 10 22 commerce

#(B)
Combinelist=c(firstprac1,firstprac2);Combinelist

## $rno
## [1] 1 2 3 4 5
##
## $`first name`
## [1] "a" "b" "c" "d" "e"
##
## $`last name`
## [1] "aa" "bb" "cc" "dd" "ee"
##
## $`pass status`
## [1] pass pass fail pass fail
## Levels: pass fail
##
## $mathematics
## [1] 58 59 24 88 10
##
## $english
## [1] 77 98 20 56 22
##
## $`student enrolled`
## [1] science commerce science science commerce
## Levels: science commerce

#(C)
unlist(firstprac1)
## rno1 rno2 rno3 rno4 rno5 first name1
## "1" "2" "3" "4" "5" "a"
## first name2 first name3 first name4 first name5 last name1 last name2
## "b" "c" "d" "e" "aa" "bb"
## last name3 last name4 last name5 pass status1 pass status2 pass status3
## "cc" "dd" "ee" "1" "1" "2"
## pass status4 pass status5
## "1" "2"

Q5
#Q5----
Employeeno=c(1,2,3,4,5,6,7,8,9,10)
Designation=c("Sr Manager","Sr Manager","Manager","Sr Manager","Manager",
              "Sr Executive","Sr Executive","Manager","Sr Executive","Sr Executive")
Experience=c(15,14,12,17,13,10,9,10,7,8)
Education=c("PostGraduate","PostGraduate","PostGraduate","PostGraduate","PostGraduate",
            "Graduate","Graduate","Graduate","PostGraduate","PostGraduate")
Induction=c("F","F","T","F","F","T","T","T","T","T")
ex251.csv=data.frame(Employeeno,Designation,Experience,Education,Induction)
ex251.csv

## Employeeno Designation Experience Education Induction


## 1 1 Sr Manager 15 PostGraduate F
## 2 2 Sr Manager 14 PostGraduate F
## 3 3 Manager 12 PostGraduate T
## 4 4 Sr Manager 17 PostGraduate F
## 5 5 Manager 13 PostGraduate F
## 6 6 Sr Executive 10 Graduate T
## 7 7 Sr Executive 9 Graduate T
## 8 8 Manager 10 Graduate T
## 9 9 Sr Executive 7 PostGraduate T
## 10 10 Sr Executive 8 PostGraduate T

Training1=c(50,52,62,65,54,78,65,59,81,57)
Training2=c(77,NA,NA,78,NA,88,72,NA,90,62)
Training3=c(65,55,67,69,NA,89,75,67,91,63)
Training4=c(55,57,NA,NA,71,79,72,63,NA,61)
Training5=c(62,NA,NA,69,NA,84,81,NA,88,65)#values taken from the printed output below
ex252.csv=data.frame(Training1,Training2,Training3,Training4,Training5)
ex252.csv

## Training1 Training2 Training3 Training4 Training5


## 1 50 77 65 55 62
## 2 52 NA 55 57 NA
## 3 62 NA 67 NA NA
## 4 65 78 69 NA 69
## 5 54 NA NA 71 NA
## 6 78 88 89 79 84
## 7 65 72 75 72 81
## 8 59 NA 67 63 NA
## 9 81 90 91 NA 88
## 10 57 62 63 61 65

#(A)
merged=data.frame(ex251.csv,ex252.csv)
merged

## Employeeno Designation Experience Education Induction Training1


## 1 1 Sr Manager 15 PostGraduate F 50
## 2 2 Sr Manager 14 PostGraduate F 52
## 3 3 Manager 12 PostGraduate T 62
## 4 4 Sr Manager 17 PostGraduate F 65
## 5 5 Manager 13 PostGraduate F 54
## 6 6 Sr Executive 10 Graduate T 78
## 7 7 Sr Executive 9 Graduate T 65
## 8 8 Manager 10 Graduate T 59
## 9 9 Sr Executive 7 PostGraduate T 81
## 10 10 Sr Executive 8 PostGraduate T 57
## Training2 Training3 Training4 Training5
## 1 77 65 55 62
## 2 NA 55 57 NA
## 3 NA 67 NA NA
## 4 78 69 NA 69
## 5 NA NA 71 NA
## 6 88 89 79 84
## 7 72 75 72 81
## 8 NA 67 63 NA
## 9 90 91 NA 88
## 10 62 63 61 65

#(B)
#(i)
merged[,3]

## [1] 15 14 12 17 13 10 9 10 7 8

#(ii)
merged[3,]

## Employeeno Designation Experience Education Induction Training1 Training2
## 3 3 Manager 12 PostGraduate T 62 NA
## Training3 Training4 Training5
## 3 67 NA NA

#(iii)
merged[rowSums(is.na(merged))<3,]

## Employeeno Designation Experience Education Induction Training1


## 1 1 Sr Manager 15 PostGraduate F 50
## 2 2 Sr Manager 14 PostGraduate F 52
## 4 4 Sr Manager 17 PostGraduate F 65
## 6 6 Sr Executive 10 Graduate T 78
## 7 7 Sr Executive 9 Graduate T 65
## 8 8 Manager 10 Graduate T 59
## 9 9 Sr Executive 7 PostGraduate T 81
## 10 10 Sr Executive 8 PostGraduate T 57
## Training2 Training3 Training4 Training5
## 1 77 65 55 62
## 2 NA 55 57 NA
## 4 78 69 NA 69
## 6 88 89 79 84
## 7 72 75 72 81
## 8 NA 67 63 NA
## 9 90 91 NA 88
## 10 62 63 61 65

#(iv)
totscore=Training3+Training4+Training5
mergedf=data.frame(merged,totscore)
mergedf

## Employeeno Designation Experience Education Induction Training1


## 1 1 Sr Manager 15 PostGraduate F 50
## 2 2 Sr Manager 14 PostGraduate F 52
## 3 3 Manager 12 PostGraduate T 62
## 4 4 Sr Manager 17 PostGraduate F 65
## 5 5 Manager 13 PostGraduate F 54
## 6 6 Sr Executive 10 Graduate T 78
## 7 7 Sr Executive 9 Graduate T 65
## 8 8 Manager 10 Graduate T 59
## 9 9 Sr Executive 7 PostGraduate T 81
## 10 10 Sr Executive 8 PostGraduate T 57
## Training2 Training3 Training4 Training5 totscore
## 1 77 65 55 62 182
## 2 NA 55 57 NA NA
## 3 NA 67 NA NA NA
## 4 78 69 NA 69 NA
## 5 NA NA 71 NA NA
## 6 88 89 79 84 252
## 7 72 75 72 81 228
## 8 NA 67 63 NA NA
## 9 90 91 NA 88 NA
## 10 62 63 61 65 189
Q6
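#Q6----
#(A)
f1=c(1,1,0,1,1,0,0,1,1,1,0,1,1,0,1)
#f1 is coded 1/0, so the levels must be 1 and 0; levels = 1:2 would turn every 0 into NA
#(assumes 1 = pass and 0 = fail)
A=factor(f1,levels = c(1,0))
levels(A)=c("pass","fail")
f2=c(1,1,2,1,2,3,3,2,1,2,3,2,1,2,1)
B=factor(f2,levels = 1:3)
levels(B)=c("Arts","Commerce","Science")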

#(C)
f1[9]

## [1] 1

#(D)
f2[15]

## [1] 1

Q7
#Q7----
vna=c(12, 3, NA, 6, 4, 19, NA, NA, 11,10,5, NA, NA, 18, 8)
#(A)
sum(is.na(vna))

## [1] 5

#(B)
na.omit(vna)

## [1] 12 3 6 4 19 11 10 5 18 8
## attr(,"na.action")
## [1] 3 7 8 12 13
## attr(,"class")
## [1] "omit"

mean(na.omit(vna))

## [1] 9.6

#(C)
vna[is.na(vna)]=9.6
vna

## [1] 12.0 3.0 9.6 6.0 4.0 19.0 9.6 9.6 11.0 10.0 5.0 9.6 9.6 18.0 8.0
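
Typing the mean (9.6) by hand works for this vector but has to be redone for every new one. A small helper can compute the mean and fill the gaps in one step. This is a minimal sketch; the function name impute_mean and the object filled are assumptions introduced here.

```r
# Replace every NA in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

vna <- c(12, 3, NA, 6, 4, 19, NA, NA, 11, 10, 5, NA, NA, 18, 8)
filled <- impute_mean(vna)
filled
```

Mean imputation leaves the overall mean unchanged but shrinks the variance, so it is best treated as a quick fix rather than a general remedy for missing data.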
Assignment 2
Q1 Define a user defined function to check whether the word is a palindrome.

#Q1----
Pal=function(a){
  b=stringi::stri_reverse(a)
  if(a==b){
    print("Given word is a palindrome")
  } else{
    print("Given word is not a palindrome")
  }
}
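
The same check can be written in base R, without the stringi package, by splitting the word into characters and comparing it with its reverse. The sketch below is one possible variant; the function name is_palindrome and the choice to ignore letter case are assumptions, not part of the original answer.

```r
# Return TRUE when a word reads the same forwards and backwards
is_palindrome <- function(word) {
  chars <- strsplit(tolower(word), "")[[1]]  # split into single characters
  identical(chars, rev(chars))
}

is_palindrome("Level")
is_palindrome("commerce")
```

Returning TRUE or FALSE instead of printing a message makes the function easier to reuse, for example inside sapply() over a vector of words.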

Q2 Identify whether the number is divisible by 3.


#Q2----
n=96
n=as.integer(n)
if (n%%3==0){
  print("The number is divisible by three.")
} else {
  print("The number is not divisible by three.")
}

## [1] "The number is divisible by three."


Q3 Identify all prime numbers less than a 100.
prime_numbers <- function(n) {
  if (n >= 2) {
    x = seq(2, n)
    prime_nums = c()
    for (i in seq(2, n)) {
      if (any(x == i)) {
        prime_nums = c(prime_nums, i)
        x = c(x[(x %% i) != 0], i)
      }
    }
    return(prime_nums)
  } else {
    stop("Input number should be at least 2.")
  }
}
prime_numbers(100)

## [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
Q4 Calculate factorial of n.
n = 96
factorial = 1
if(n < 0) {
  print("factorial does not exist for negative numbers")
} else if(n == 0) {
  print("The factorial of 0 is 1")
} else {
  for(i in 1:n) {
    factorial = factorial * i
  }
  print(paste("The factorial of", n ,"is",factorial))
}

## [1] "The factorial of 96 is 9.91677934870949e+149"
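
The loop can be cross-checked against base R, which offers prod() and a built-in factorial() function. The sketch below compares the two for n = 96 and shows where double-precision numbers run out; the variable names are assumptions introduced for illustration.

```r
n <- 96

# Product of 1..n, equivalent to the loop above
loop_style <- prod(seq_len(n))

# Base R's built-in factorial function
built_in <- base::factorial(n)

# The two agree up to floating-point tolerance
agree <- isTRUE(all.equal(loop_style, built_in))
agree

# Doubles overflow beyond 170!, so larger products become Inf
is.finite(prod(seq_len(170)))
is.infinite(prod(seq_len(171)))
```

Calling base::factorial explicitly avoids any confusion with a variable named factorial, as used in the answer above.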


Q5 Find g.c.d of (x,y)
hcf <- function(x, y) {
  if(x > y) {
    smaller = y
  } else {
    smaller = x
  }
  for(i in 1:smaller) {
    if((x %% i == 0) && (y %% i == 0)) {
      hcf = i
    }
  }
  return(hcf)
}

num1 = as.integer(69)
num2 = as.integer(96)
print(paste("The H.C.F. of", num1,"and", num2,"is", hcf(num1, num2)))

## [1] "The H.C.F. of 69 and 96 is 3"
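
The function above tries every candidate divisor up to the smaller number. Euclid's algorithm reaches the same answer in far fewer steps by repeatedly replacing the pair (x, y) with (y, x mod y). The sketch below is an alternative to the answer above, not the method it uses; the function name gcd_euclid is an assumption.

```r
# Greatest common divisor by Euclid's algorithm
gcd_euclid <- function(x, y) {
  while (y != 0) {
    remainder <- x %% y
    x <- y
    y <- remainder
  }
  x
}

gcd_euclid(69, 96)
```

For 69 and 96 this returns 3, the same value as the trial-division version.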

Hypothesis Testing
One Sample T-Test
# Hypothesis - H0: mu = 155 (Average Height)
# Output
# One Sample t-test
# data: Height
# t = 7.5105, df = 99, p-value = 2.641e-11
# alternative hypothesis: true mean is not equal to 155
# 95 percent confidence interval:
# 157.6696 159.5866
# sample estimates:
# mean of x
# 158.6281
# Interpretation
# The p-value of the test is 2.641e-11, which is less than the significance level of 0.05. We reject H0 and
# conclude that the mean height is significantly different from 155.
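
The output above comes from R's t.test() function. The sketch below shows the form of the call on simulated data, since the original Height data are not included here; the simulated values, the seed and the object name result are assumptions, so the numbers it prints will not match the output above.

```r
# Simulated heights standing in for the original data (assumption)
set.seed(42)
Height <- rnorm(100, mean = 158.6, sd = 4.8)

# Two-sided one-sample t-test of H0: mu = 155
result <- t.test(Height, mu = 155)

result$statistic   # t value
result$parameter   # degrees of freedom (n - 1)
result$p.value     # p-value
result$conf.int    # 95 percent confidence interval
```

With 100 observations the test has 99 degrees of freedom, matching the df reported above.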
Paired Sample T-Test
# Hypothesis - There is no significant difference between serum levels before and after treatment
# Output
# Paired t-test
# data: Serum_Prolactin_mVL_before and Serum_Prolactin_mVL_after
# t = 4.5876, df = 21, p-value = 0.0001595
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# 48.45684 128.81588
# sample estimates:
# mean of the differences
# 88.63636
# Interpretation
# The p-value of the test is 0.0001595, which is less than the significance level of 0.05. We reject the null hypothesis and conclude that the serum levels before treatment are significantly different from the serum levels after treatment.
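A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this on simulated data; the numbers are generated with rnorm() and are not the serum measurements from the text.

```r
set.seed(1)  # simulated data for illustration only
before <- rnorm(22, mean = 300, sd = 80)
after  <- before - rnorm(22, mean = 90, sd = 90)

paired  <- t.test(before, after, paired = TRUE)   # paired t-test
on_diff <- t.test(before - after, mu = 0)         # one-sample test on the differences

print(paired$statistic)
print(on_diff$statistic)
print(paired$parameter)   # degrees of freedom = number of pairs - 1
```

Both calls give the same t statistic and the same degrees of freedom, which is why the output above reports df = 21 for 22 pairs.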

Independent Sample T-Test


# Hypothesis - CPH levels are the same for both genders
# Output
# Welch Two Sample t-test
# data: CPHlevels by Gender
# t = 0.20026, df = 9.334, p-value = 0.8456
# alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
# 95 percent confidence interval:
# -1.066094 1.274428
# sample estimates:
# mean in group Female mean in group Male
# 2.391667 2.287500
# Interpretation
# The p-value is 0.8456, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis. We conclude that the mean in group Female and the mean in group Male are not significantly different.
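The decision rule can be checked directly from the reported t statistic and degrees of freedom. This sketch uses only numbers copied from the output above; the Welch test gives fractional degrees of freedom, which pt() accepts.

```r
# Values copied from the Welch two-sample t-test output
t_stat <- 0.20026
dof    <- 9.334
alpha  <- 0.05

p_value   <- 2 * pt(-abs(t_stat), dof)   # two-sided p-value
reject_h0 <- p_value < alpha

print(round(p_value, 4))
print(reject_h0)   # FALSE means we fail to reject the null hypothesis
```

The 95% confidence interval (-1.066094, 1.274428) contains 0, which leads to the same conclusion.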

ANOVA
# Hypothesis - There is no significant difference between groups A, B, and C
# Output
# Rcmdr> AnovaModel.1 <- aov(Sales.data.for.territory.1 ~ Terr, data=Dataset)
# Rcmdr> summary(AnovaModel.1)
#             Df Sum Sq Mean Sq F value Pr(>F)
# Terr         2   9.16   4.580   1.215  0.313
# Residuals   27 101.81   3.771
# Rcmdr> with(Dataset, numSummary(Sales.data.for.territory.1, groups=Terr, statistics=c("mean", "sd")))
#       mean       sd data:n
# A 3.600000 2.065591     10
# B 4.571429 1.272418      7
# C 4.846154 2.115268     13
# Interpretation
# As the p-value (0.313) is greater than the significance level of 0.05, we fail to reject the null hypothesis and conclude that there are no significant differences between groups A, B, and C.
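The F value and p-value in the ANOVA table follow from the sums of squares and degrees of freedom. The sketch below recomputes them from the numbers in the table above; the variable names are chosen here for illustration.

```r
# Values copied from the ANOVA table
ss_between <- 9.16
ss_within  <- 101.81
df_between <- 2
df_within  <- 27

ms_between <- ss_between / df_between   # mean square for Terr
ms_within  <- ss_within / df_within     # mean square for residuals
f_value    <- ms_between / ms_within
p_value    <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```

Since the recomputed p-value is well above 0.05, the null hypothesis of equal group means is not rejected.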


Simple Regression
# Hypothesis - There is a significant relationship between x and y
# Output
# Rcmdr> RegModel.3 <- lm(y~x, data=Dataset)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  53.9411    45.2139   1.193  0.27172
# x             0.5874     0.1523   3.857  0.00623 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 11.26 on 7 degrees of freedom
# Multiple R-squared: 0.68, Adjusted R-squared: 0.6343
# F-statistic: 14.88 on 1 and 7 DF, p-value: 0.006234
# Interpretation
# The p-value of the F-statistic is 0.006234, which is less than 0.05. Since the model has a single predictor, this means that x is significantly related to the outcome variable y.
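The quantities in the regression summary are linked: the t value is the estimate divided by its standard error, and in a simple regression the F-statistic equals the square of the slope's t value. The sketch below checks this using only numbers copied from the output above.

```r
# Values copied from the coefficient table
estimate <- 0.5874
std_err  <- 0.1523
dof      <- 7        # residual degrees of freedom
r2       <- 0.68
n        <- dof + 2  # observations = residual df + 2 estimated coefficients

t_value <- estimate / std_err
p_value <- 2 * pt(-abs(t_value), dof)          # two-sided p-value for the slope
f_value <- t_value^2                           # F = t^2 with one predictor
adj_r2  <- 1 - (1 - r2) * (n - 1) / (n - 2)    # adjusted R-squared

print(round(t_value, 3))
print(round(p_value, 5))
print(round(f_value, 2))
print(round(adj_r2, 4))
```

Small differences from the printed output are due to rounding of the inputs.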


Step-wise Regression
# Hypothesis - Model is highly significant
# Output
# Rcmdr> stepwise(RegModel.2, direction='backward/forward', criterion='BIC')
# Direction: backward/forward
# Criterion: BIC
# Start: AIC=64.55
# Sales ~ Adcost + Boys + Competitor + Customer + Outlets + Varieties
# Df Sum of Sq RSS AIC
# - Competitor 1 0.036 313.57 61.848
# - Adcost 1 11.283 324.82 62.377
# - Customer 1 25.600 339.14 63.024
# - Varieties 1 28.738 342.27 63.162
# - Boys 1 39.971 353.51 63.646
# <none> 313.54 64.554
# - Outlets 1 269.144 582.68 71.142
# Step: AIC=61.85
# Sales ~ Adcost + Boys + Customer + Outlets + Varieties
# Df Sum of Sq RSS AIC
# - Adcost 1 14.95 328.52 59.838
# - Customer 1 35.05 348.62 60.729
# - Boys 1 40.06 353.63 60.943
# - Varieties 1 40.47 354.04 60.961
# <none> 313.57 61.848
# + Competitor 1 0.04 313.54 64.554
# - Outlets 1 339.77 653.34 70.151
# Step: AIC=59.84
# Sales ~ Boys + Customer + Outlets + Varieties
# Df Sum of Sq RSS AIC
# - Varieties 1 54.78 383.30 59.444
# - Customer 1 54.99 383.51 59.452
# <none> 328.52 59.838
# - Boys 1 91.04 419.56 60.800
# + Adcost 1 14.95 313.57 61.848
# + Competitor 1 3.70 324.82 62.377
# - Outlets 1 585.94 914.46 72.486
# Step: AIC=59.44
# Sales ~ Boys + Customer + Outlets
# Df Sum of Sq RSS AIC
# - Customer 1 18.88 402.18 57.457
# <none> 383.30 59.444
# + Varieties 1 54.78 328.52 59.838
# + Adcost 1 29.26 354.04 60.961
# - Boys 1 133.15 516.45 61.208
# + Competitor 1 0.87 382.43 62.118
# - Outlets 1 534.55 917.85 69.834
# Step: AIC=57.46
# Sales ~ Boys + Outlets
# Df Sum of Sq RSS AIC
# <none> 402.18 57.457
# + Adcost 1 38.87 363.32 58.641
# + Customer 1 18.88 383.30 59.444
# + Varieties 1 18.67 383.51 59.452
# + Competitor 1 2.40 399.78 60.075
# - Boys 1 219.04 621.22 61.271
# - Outlets 1 855.76 1257.95 71.854
# Call:
# lm(formula = Sales ~ Boys + Outlets, data = Dataset)
# Coefficients:
# (Intercept) Boys Outlets
# -11.817 1.640 1.753
# Interpretation
# Step-wise selection retains only Boys and Outlets. In the full multiple regression, Boys is not significant, which suggests that the other predictors overlap with Boys and mask its effect.
# It can be concluded that the best model is the one with only two variables, Boys and Outlets.
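The column is labelled AIC even though the criterion is BIC, because the generalised formula n * log(RSS / n) + k * edf is used with the penalty k = log(n) instead of k = 2. The sketch below reproduces the first and last values from the output, assuming n = 15 observations (inferred from 8 residual degrees of freedom and 7 coefficients in the full model).

```r
n <- 15   # assumed: 8 residual df + 7 estimated coefficients in the full model

# Criterion as reported in the step-wise output: n*log(RSS/n) + k*edf
criterion <- function(rss, edf, k) {
  n * log(rss / n) + k * edf
}

# Full model: RSS = 313.54, 7 coefficients; final model: RSS = 402.18, 3 coefficients
bic_full  <- criterion(313.54, edf = 7, k = log(n))
bic_final <- criterion(402.18, edf = 3, k = log(n))

print(round(bic_full, 3))
print(round(bic_final, 3))
```

The final model has a higher RSS than the full model but a lower criterion value, because the penalty for four extra coefficients outweighs the improvement in fit.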


Multiple Regression
# Hypothesis - Whole model is significant except outlets
# Output
# Call: lm(formula = Sales ~ Adcost + Boys + Competitor + Customer +
# Outlets + Varieties, data = Dataset)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  6.37195   32.58644   0.196   0.8498
# Adcost       0.69922    1.30317   0.537   0.6062
# Boys         0.91879    0.90979   1.010   0.3421
# Competitor   0.06665    2.21103   0.030   0.9767
# Customer     0.24157    0.29890   0.808   0.4423
# Outlets      1.62045    0.61836   2.621   0.0306 *
# Varieties   -1.97797    2.30988  -0.856   0.4167
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 6.26 on 8 degrees of freedom
# Multiple R-squared: 0.9534, Adjusted R-squared: 0.9184
# F-statistic: 27.25 on 6 and 8 DF, p-value: 0.00006579
# Interpretation
# The p-value of the F-statistic is 0.00006579, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable. Among the individual predictors, only Outlets is significant at the 0.05 level (p = 0.0306).
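The overall F-test and the adjusted R-squared can be recomputed from the summary line. The sketch below uses only numbers copied from the output above, with n = 15 inferred from 8 residual degrees of freedom and 7 estimated coefficients.

```r
# Values copied from the multiple regression summary
f_value <- 27.25
df1     <- 6      # number of predictors
df2     <- 8      # residual degrees of freedom
r2      <- 0.9534
n       <- df2 + df1 + 1   # observations: residual df + predictors + intercept

p_model <- pf(f_value, df1, df2, lower.tail = FALSE)   # p-value of the overall F-test
adj_r2  <- 1 - (1 - r2) * (n - 1) / df2                # adjusted R-squared

print(p_model)
print(round(adj_r2, 4))
```

A significant overall F-test alongside mostly non-significant individual t-tests is a common sign that the predictors are correlated with each other.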
