
EMPIRICAL SOFTWARE ENGINEERING (SWE504)
PRACTICAL FILE

Delhi Technological University
Shahbad Daulatpur Village, Rohini, Delhi-110042

Submitted To: Dr. Abhilasha Sharma, Assistant Professor, CSE Dept.
Submitted By: Yankit Kumar, M.Tech (SWE), 2K19/SWE/16
List of Experiments

1. Introduction to R Programming.
2. To summarize descriptive statistics for each variable considering a suitable dataset:
   a) Types of variables
   b) Frequency distribution for the variables (counts & percentages).
3. To generate measures of central tendency & measures of dispersion for each attribute in the dataset.
4. To calculate univariate outliers for each variable using box plots and z-scores, considering a suitable dataset.
5. To calculate correlation between two data samples:
   a) Pearson’s correlation coefficient to summarize the linear relationship.
   b) Spearman’s correlation coefficient to summarize the monotonic relationship.

Experiment - 1
Aim:
Introduction to R Programming

Introduction:
R is a powerful language widely used for data analysis and statistical computing. It was developed in the early 1990s, and since then continuous efforts have been made to improve its user interface. The journey of the R language from a rudimentary text editor to the interactive RStudio and, more recently, Jupyter Notebooks has engaged data science communities across the world.
This was possible only because of generous contributions by R users globally. The inclusion of powerful packages has made R more and more capable with time. Packages such as dplyr, tidyr, readr, data.table, SparkR, and ggplot2 have made data manipulation, visualization, and computation much faster.

Why Learn R?
1. The coding style is quite easy to pick up.
2. It is open source; there is no need to pay any subscription charges.
3. Instant access to over 7,800 packages customized for various computation tasks.
4. The community support is overwhelming; there are numerous forums to help you out.
5. High-performance computing experience (with the appropriate packages).
6. It is one of the skills most highly sought after by analytics and data science companies.

How To Install R / RStudio?


1. Go to https://www.rstudio.com/products/rstudio/download/


2. In ‘Installers for Supported Platforms’ section, choose and click the R Studio installer
based on your operating system. The download should begin as soon as you click.

3. Click Next, then Next again, and finally click Install.


4. Installation Complete.

5. To start RStudio, click its desktop icon or use Windows search to access the program.
RStudio looks like this:

The interface of R Studio:


1. R Console: This area shows the output of the code you run. You can also write code directly in the console, but code entered there cannot be traced later; this is where the R script comes in.


2. R Script: As the name suggests, this is where you get space to write code. To run it, select the line(s) of code and press Ctrl + Enter. Alternatively, you can click the little ‘Run’ button located at the top right corner of the R script pane.
3. R Environment: This space displays the set of external elements added, including data sets, variables, vectors, and functions. To check whether data has been loaded properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. Besides graphs, you can select packages and seek help from R’s embedded official documentation.

How To Install R Packages?


To install a package, simply type:
install.packages("package name")
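
For example, to install and then load a package such as ggplot2 (the package name here is just an illustration; installation requires an internet connection):
> install.packages("ggplot2")
> library(ggplot2)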

Basic Computations in R:
1. Addition
> 2 + 3
[1] 5

2. Division
> 6 / 3
[1] 2


3. Multiplication
> (3 * 8) / (2 * 3)
[1] 4

4. Logarithmic
> log(12)
[1] 2.484907

5. Square Root
> sqrt (121)
[1] 11

6. Creating Variables Using <- or = Operator


> x <- 8 + 7
> x
[1] 15

> y = 15 - 9
> y
[1] 6


Objects in R:
Everything you see or create in R is an object: a vector, matrix, data frame, even a variable is an object, and R treats it that way. R has five basic or ‘atomic’ classes of objects:
1. Character
2. Numeric (Real Numbers)
3. Integer (Whole Numbers)
4. Complex
5. Logical (True / False)
An object can have the following attributes:
1. names, dimension names
2. dimensions
3. class
4. length
The attributes of an object can be accessed using the attributes() function, as illustrated below. The most basic object in R is the vector. We can create an empty vector using vector(), or create a vector using the c() (concatenate) command.
> a <- c(1.8, 4.5) #numeric
> b <- c(1 + 2i, 3 - 6i) #complex
> d <- c(23L, 44L) #integer (the L suffix creates integers; c(23, 44) would be numeric)
> e <- vector("logical", length = 5)
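
For instance, a minimal sketch of inspecting an object’s attributes and length (the vector v is illustrative):
> v <- c(a = 1, b = 2)
> attributes(v)
$names
[1] "a" "b"
> length(v)
[1] 2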


Data Types in R:
1. Vector:
A vector contains objects of the same class. But if we try to mix objects of different classes in a vector, coercion occurs: the objects are ‘converted’ into a single common class. For example:
> qt <- c("Time", 24, "January", TRUE, 3.33) #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character

To check the class of any object, use the class() function, e.g. class(qt).


> class(qt)
[1] "character"

To convert the class of a vector, you can use the as.* family of functions, such as as.numeric() or as.character().


> bar <- 0:5
> class(bar)
[1] "integer"
> as.numeric(bar)
[1] 0 1 2 3 4 5
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> as.character(bar)
[1] "0" "1" "2" "3" "4" "5"
> bar <- as.character(bar)
> class(bar)
[1] "character"

Similarly, we can change the class of any vector.


2. List:
A list is a special type of vector which can contain elements of different data types. For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22

[[2]]
[1] "ab"

[[3]]
[1] TRUE

[[4]]
[1] 1+2i

As we can see, the output of a list is printed differently from a vector, because the objects are of different types. The double bracket [[1]] shows the index of the first element, and so on. Hence, we can easily extract elements of a list by their index. Like this:
> my_list[[3]]
[1] TRUE

You can use single brackets [] too. But that returns a list containing the element, rather than the element itself as above. Like this:
> my_list[3]
[[1]]
[1] TRUE


3. Matrices:
When a vector is given rows and columns, i.e. a dimension attribute, it becomes a matrix. A matrix is represented by a set of rows and columns; it is a two-dimensional data structure and consists of elements of the same class. Let’s create a matrix of 3 rows and 2 columns:
> my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
> my_matrix
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

The dimensions of a matrix can be obtained using either dim() or attributes() command.
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2

To extract a particular element from a matrix, simply use the index shown above. For
example:
> my_matrix[,2] #extracts second column
[1] 4 5 6
> my_matrix[,1] #extracts first column
[1] 1 2 3
> my_matrix[2,] #extracts second row
[1] 2 5
> my_matrix[1,] #extracts first row
[1] 1 4

We can also create a matrix from a vector.


> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16
> dim(age) <- c(2, 3)
> age
[,1] [,2] [,3]
[1,] 23 15 31
[2,] 44 12 16
> class(age)
[1] "matrix"

We can also join two vectors using the cbind() and rbind() functions. Note that y below has only five elements while x has six, so R recycles the first value of y (with a warning) to fill the sixth position:
> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60)
> cbind(x,y)
x y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 20
> rbind(x,y)
[,1] [,2] [,3] [,4] [,5] [,6]
x 1 2 3 4 5 6
y 20 30 40 50 60 20
> class(cbind(x,y))
[1] "matrix"
> class(rbind(x,y))
[1] "matrix"


4. Data Frame:
This is the most commonly used member of the data-types family. It is used to store tabular data and is different from a matrix: in a matrix every element must have the same class, but in a data frame you can put a list of vectors containing different classes. This means every column of a data frame acts like a list. Whenever we read data into R, it is stored as a data frame, so it is important to understand the most commonly used commands on data frames:
> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,
56,87,91))
> df
name score
1 ash 67
2 jane 56
3 paul 87
4 mark 91
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of 2 variables:
$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
$ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
> mean(df$score)
[1] 75.25

df is the name of the data frame. dim() returns the dimensions of the data frame: 4 rows and 2 columns. str() returns the structure of the data frame, i.e. the list of variables stored in it. nrow() and ncol() return the number of rows and columns in the data set, respectively. mean() returns the mean value of the selected column.


Experiment - 2
Aim:
To summarize descriptive statistics for each variable considering suitable dataset:
a) Types of variables
b) Frequency distribution for the variables (counts & percentages)

What is Summary Statistics/Descriptive Statistics?


All the data gathered for an analysis is useful only when it is properly represented, so that it is easily understandable by everyone and supports proper decision making. After we carry out the data analysis, we delineate its summary so as to understand it better. This is known as summarizing the data.
With the help of descriptive statistics, we can represent information about our datasets. Descriptive statistics also form the platform for carrying out more complex computations and analysis. Therefore, even though they are produced with simple methods, they play a crucial role in the process of analysis.
R has 5 basic classes of objects. This includes:
1. Character
2. Numeric (Real Numbers)
3. Integer (Whole Numbers)
4. Complex
5. Logical (True / False)
The frequency distribution of a data variable is a summary of the data occurrence in a collection
of non-overlapping categories.
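
As a quick base-R illustration with toy data (separate from the experiment code below), table() and prop.table() produce counts and proportions:
> school <- c("A", "B", "A", "C", "A")
> table(school)
school
A B C 
3 1 1 
> prop.table(table(school))
school
  A   B   C 
0.6 0.2 0.2 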

Dataset Used:
R’s built-in data frame named “painters”. It is a compilation of assessments of a number of classical painters by an eighteenth-century critic. The data set belongs to the “MASS” package, which has to be loaded into the R workspace (with library(MASS)) prior to use.

Source Code:
library(MASS)
library(janitor)
library(tibble)
painter<-rownames_to_column(painters, var="Painter")

cat("\n\nTypes of Variables\n\n")
painter.class <- sapply(painter,class)
print(painter.class)

cat("\nFrequency Distribution for the variables (counts & percentages)")


for(x in names(painters)) {
  print(tabyl(painter, x))
}

Output:
Types of Variables

Painter Composition Drawing Colour Expression School


"character" "integer" "integer" "integer" "integer" "factor"

Frequency Distribution for the variables (counts & percentages)

Composition n percent
0 1 0.01851852
4 3 0.05555556
5 1 0.01851852
6 3 0.05555556
8 6 0.11111111
9 1 0.01851852
10 6 0.11111111
11 2 0.03703704
12 4 0.07407407
13 5 0.09259259
14 3 0.05555556
15 14 0.25925926
16 2 0.03703704
17 1 0.01851852
18 2 0.03703704
Drawing n percent
6 5 0.09259259
8 5 0.09259259
9 2 0.03703704
10 7 0.12962963
12 3 0.05555556
13 5 0.09259259
14 7 0.12962963
15 10 0.18518519
16 5 0.09259259
17 4 0.07407407
18 1 0.01851852
Colour n percent
0 1 0.01851852
4 4 0.07407407
5 1 0.01851852
6 6 0.11111111
7 2 0.03703704
8 5 0.09259259
9 3 0.05555556
10 7 0.12962963
12 3 0.05555556
13 2 0.03703704
14 3 0.05555556
15 2 0.03703704
16 8 0.14814815
17 5 0.09259259
18 2 0.03703704
Expression n percent
0 5 0.09259259
2 1 0.01851852
3 2 0.03703704
4 7 0.12962963
5 2 0.03703704
6 12 0.22222222
7 1 0.01851852
8 6 0.11111111
9 1 0.01851852
10 3 0.05555556
12 2 0.03703704
13 4 0.07407407
14 2 0.03703704
15 2 0.03703704
16 1 0.01851852
17 2 0.03703704
18 1 0.01851852
School n percent
A 10 0.18518519
B 6 0.11111111
C 6 0.11111111
D 10 0.18518519
E 7 0.12962963
F 4 0.07407407
G 7 0.12962963
H 4 0.07407407
Results and Discussions:
We successfully summarized the descriptive statistics for each variable of the given dataset: the types of the variables and their frequency distributions (counts and percentages).

Learning and findings:


In this experiment we learned how to summarize descriptive statistics such as the types of variables and their frequency distributions.


Experiment - 3
Aim:
To generate measures of central tendency & measures of dispersion for each attribute in the dataset

Theory:
After data collection, descriptive statistics can be used to summarize and analyze the nature of the data. Descriptive statistics describe the data, for example by identifying attributes with very few data points or by determining the spread of the data.

Measures of Central Tendency


Measures of central tendency are used to summarize the average values of the attributes. These measures include the mean, median, and mode. They are known as measures of central tendency because they provide an idea of the central values of the data, around which all the other values tend to gather.
1. Mean
The mean is the average value of the data set, defined as the ratio of the sum of the values of the data points to the total number of data points.

2. Median
The median is the value that divides the data into two halves: half of the data points lie below the median and half lie above it. For an odd number of data points, the median is the central value; for an even number, it is the mean of the two central values.
The median is not useful if the number of categories in an ordinal scale is very low. In such cases, the mode is the preferred measure of central tendency.
3. Mode
The mode is the value that has the highest frequency in the distribution. R has no built-in function for the statistical mode, so it has to be computed by hand, as sketched below.
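
A minimal sketch of one way to compute the mode (the helper name stat_mode is illustrative):
> stat_mode <- function(v) {
+   uv <- unique(v)
+   uv[which.max(tabulate(match(v, uv)))]
+ }
> stat_mode(c(2, 3, 3, 5))
[1] 3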


Measures of Dispersion
The measures of dispersion indicate the spread or the range of the distributions in the data set.
Measures of dispersion include range, standard deviation, variance, and quartiles.
1. Range
The range is defined as the difference between the highest value and the lowest value in
the distribution. It is the easiest measure that can be quickly computed.
Range = Maximum Value – Minimum Value
2. Standard deviation
The standard deviation is a measure of variation commonly used with interval/ratio data. It is a measurement of how close the observations in the data set are to the mean.

For normally distributed data, 68% of data points fall within the mean ± 1 standard deviation, 95% fall within the mean ± 2 standard deviations, and 99.7% fall within the mean ± 3 standard deviations.
The standard deviation may not be appropriate for skewed data.
3. Standard error of the mean
The standard error of the mean estimates how close a calculated mean is likely to be to the true mean of the population. It is commonly used in tables or plots where multiple means are presented together.

The standard error is the standard deviation of a data set divided by the square root of the number of observations, as computed in the sketch below. Like the standard deviation, it may not be appropriate for skewed data.
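
A minimal computation from this definition (toy data):
> x <- c(4, 8, 6, 5, 9)
> sd(x) / sqrt(length(x))
[1] 0.9273618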
4. Five-number summary, quartiles, percentiles
The median is the same as the 50th percentile, because 50% of values fall below it. Other percentiles of a data set can be identified to provide more information. Typically, the 0th, 25th, 50th, 75th, and 100th percentiles are reported; this is sometimes called the five-number summary. These values can also be called the minimum, 1st quartile, 2nd quartile, 3rd quartile, and maximum.


The five-number summary is a useful measure of variation for skewed interval/ratio data
or for ordinal data. 25% of values fall below the 1st quartile and 25% of values fall above
the 3rd quartile. This leaves the middle 50% of values between the 1st and 3rd quartiles,
giving a sense of the range of the middle half of the data. This range is called the
interquartile range (IQR).

Percentiles and quartiles are relatively robust, as they aren’t affected much by a few
extreme values. They are appropriate for both skewed and unskewed data.
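
In R, the five-number summary and the IQR can be obtained with quantile() and IQR() (toy data):
> x <- c(4, 8, 6, 5, 9, 12, 3)
> quantile(x)
  0%  25%  50%  75% 100% 
 3.0  4.5  6.0  8.5 12.0 
> IQR(x)
[1] 4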

Dataset Used:
R’s built-in data frame named “painters”. It is a compilation of assessments of a number of classical painters by an eighteenth-century critic. The data set belongs to the “MASS” package, which has to be loaded into the R workspace (with library(MASS)) prior to use.

Source Code:
library(MASS)
library(pastecs)

#Storing painters dataset into a variable


pnt <- painters

cat("\n\nMeasures of central tendency and dispersion (std deviation, std error of the
mean)\n\n")
pnt.tend <- stat.desc(pnt)
print(pnt.tend)

cat("\n\nMeasures of dispersion (five-number summary)\n\n")


pnt.disp <- summary(pnt)
print(pnt.disp)

Output:
Measures of central tendency and dispersion (std deviation, std error of the mean)

Composition Drawing Colour Expression School


nbr.val 54.0000000 54.0000000 54.0000000 54.0000000 NA
nbr.null 1.0000000 0.0000000 1.0000000 5.0000000 NA
nbr.na 0.0000000 0.0000000 0.0000000 0.0000000 NA
min 0.0000000 6.0000000 0.0000000 0.0000000 NA
max 18.0000000 18.0000000 18.0000000 18.0000000 NA
range 18.0000000 12.0000000 18.0000000 18.0000000 NA
sum 624.0000000 673.0000000 591.0000000 414.0000000 NA
median 12.5000000 13.5000000 10.0000000 6.0000000 NA
mean 11.5555556 12.4629630 10.9444444 7.6666667 NA
SE.mean 0.5561841 0.4704496 0.6330169 0.6528976 NA
CI.mean.0.95 1.1155641 0.9436024 1.2696712 1.3095468 NA
var 16.7044025 11.9514326 21.6383648 23.0188679 NA
std.dev 4.0871020 3.4570844 4.6517056 4.7977982 NA
coef.var 0.3536915 0.2773886 0.4250289 0.6257998 NA


Measures of dispersion (five-number summary)

Composition Drawing Colour Expression School


Min. : 0.00 Min. : 6.00 Min. : 0.00 Min. : 0.000 A :10
1st Qu.: 8.25 1st Qu.:10.00 1st Qu.: 7.25 1st Qu.: 4.000 D :10
Median :12.50 Median :13.50 Median :10.00 Median : 6.000 E : 7
Mean :11.56 Mean :12.46 Mean :10.94 Mean : 7.667 G : 7
3rd Qu.:15.00 3rd Qu.:15.00 3rd Qu.:16.00 3rd Qu.:11.500 B : 6
Max. :18.00 Max. :18.00 Max. :18.00 Max. :18.000 C : 6
(Other): 8

Results and Discussions:


We successfully calculated the measures of central tendency and measures of dispersion for each variable of the given dataset: the mean, median, standard deviation, standard error of the mean, and five-number summary.

Learning and findings:


In this experiment we learned how to compute measures of central tendency and measures of dispersion.


Experiment - 4
Aim:
To calculate univariate outliers for each variable using box plots and z-scores, considering a suitable dataset

Theory:
Outlier analysis is carried out to detect data points that are overly influential and must be considered for removal from the data set. Outliers can be divided into three types: univariate, bivariate, and multivariate.
Univariate outliers are influential data points that occur within a single variable. Once the outliers are detected, the researcher must decide whether to include or exclude each identified outlier. Outliers generally signal the presence of anomalies, but they may sometimes reveal interesting patterns to the researchers. The decision is based on the reason for the outlier’s occurrence.
Box plots, z-scores, and scatter plots can be used for detecting univariate outliers.
1. Boxplot
Box plots are based on the median and quartiles; they are constructed using the upper and lower quartiles.

Box Plot Example

The two boundary lines (whiskers) signify the start and end of the tails and correspond to ±1.5 × IQR beyond the quartiles, where IQR = Q3 − Q1. Thus, once the value of the IQR is known, it is multiplied by 1.5. Values inside these boundaries are within the expected range and are not considered extreme; data points beyond the start and end of the whiskers are considered outliers. The same rule can also be applied numerically, as sketched below.
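
A minimal sketch of computing the 1.5 × IQR fences directly (using the precip data introduced below; note that boxplot.stats() uses hinges, which may differ slightly from quantile()):
> q <- quantile(precip, c(0.25, 0.75))
> iqr <- q[[2]] - q[[1]]
> precip[precip < q[[1]] - 1.5 * iqr | precip > q[[2]] + 1.5 * iqr] #values beyond the fences
> boxplot.stats(precip)$out #outliers as drawn by boxplot()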
2. Z-Score
The z-score is another method to identify outliers. It depicts the relationship of a value to the mean of its distribution, and is given as follows:

z = (x − mean) / standard deviation
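
A minimal sketch of flagging outliers by z-score (using the precip data introduced below; the cutoff of 3 is a common convention, not part of the experiment):
> z <- scale(precip)
> precip[abs(z) > 3]
named numeric(0)
Here no city exceeds |z| = 3, which is consistent with the z-scores printed in the output below (all lie between −2.04 and 2.35).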

Dataset Used:
R’s built-in data frame named “precip”. It is a compilation of the average annual precipitation (rainfall) in inches for each of 70 United States (and Puerto Rico) cities. It comes pre-loaded in base R, so no package needs to be loaded before use.

Source Code:
#Storing precip dataset into a variable
rain <- precip

#Plotting boxplots for each variable


boxplot(rain,
main = "Annual Precipitation in US Cities",
xlab = "Precipitation (rainfall) in inches",
ylab = "US (and Puerto Rico) cities",
horizontal = TRUE,
notch = TRUE
)

#Calculating z-scores for each variable


rain.zscore<-scale(rain)
colnames(rain.zscore)<-"Z-Score"
print(rain.zscore)

Output:


Z-Score
Mobile 2.342971149
Juneau 1.445596523
Phoenix -2.034466051
Little Rock 0.993261346
Los Angeles -1.523765044
Sacramento -1.290301727
San Francisco -1.034951224
Denver -1.596722331
Hartford 0.621179184
Wilmington 0.387715866
Washington 0.292871394
Jacksonville 1.431005066
Miami 1.817678685
Atlanta 0.978669888
Honolulu -0.874445193
Boise -1.706158261
Chicago -0.035436396
Peoria 0.015633704
Indianapolis 0.278279936
Des Moines -0.298082628
Wichita -0.312674086
Louisville 0.599291998
New Orleans 1.598806825
Portland 0.431490238
Baltimore 0.504447525
Boston 0.555517626
Detroit -0.283491171
Sault Ste. Marie -0.232421070
Duluth -0.341857000
Minneapolis/St Paul -0.655573333
Jackson 1.044331446
Kansas City 0.154252549
St Louis 0.073999534
Great Falls -1.450807758
Omaha -0.341857000
Reno -2.019874594
Concord 0.095886720
Atlantic City 0.774389486
Albuquerque -1.976100222
Albany -0.108393683
Buffalo 0.088590991
New York 0.387715866
Charlotte 0.570109083
Raleigh 0.555517626
Bismark -1.363259014
Cincinnati 0.300167122
Cleveland 0.008337976
Columbus 0.154252549
Oklahoma City -0.254308256
Portland 0.198026921
Philadelphia 0.365828680
Pittsburg 0.095886720
Providence 0.577404812
Columbia 0.840051044
Sioux Falls -0.743122077
Memphis 1.037035718
Nashville 0.810868129
Dallas 0.073999534
El Paso -1.976100222
Houston 0.971374160
Salt Lake City -1.436216300
Burlington -0.174055241
Norfolk 0.716023656
Richmond 0.562813354
Seattle Tacoma 0.285575665
Spokane -1.275710270
Charleston 0.431490238
Milwaukee -0.422110016
Cheyenne -1.479990672
San Juan 1.773904313
attr(,"scaled:center")
[1] 34.88571
attr(,"scaled:scale")
[1] 13.70665

Results and Discussions:


We successfully calculated the z-scores and plotted the box plot for the given dataset, thereby performing a univariate outlier analysis.

Learning and findings:


In this experiment we learned how to perform univariate outlier analysis using box plots and z-scores.


Experiment - 5
Aim:
To calculate correlation between two data samples:
a) Pearson’s correlation coefficient to summarize the linear relationship.
b) Spearman’s correlation coefficient to summarize the monotonic relationship.

Theory:
There are different methods to perform correlation analysis:
• Pearson correlation (r), which measures a linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data; it should be used only when x and y come from normal distributions. The plot of y = f(x) is called the linear regression curve.
• Kendall’s tau and Spearman’s rho, which are rank-based (non-parametric) correlation coefficients.
Pearson correlation formula:

r = Σ (x − mx)(y − my) / √( Σ (x − mx)² · Σ (y − my)² )

where x and y are two vectors of length n, and mx and my correspond to the means of x and y, respectively.
The p-value (significance level) of the correlation can be determined:
a) by using the correlation coefficient table for df = n − 2 degrees of freedom, where n is the number of observations in the x and y variables;
b) or by calculating the t value as follows:

t = r√(n − 2) / √(1 − r²)

The corresponding p-value is then determined using the t distribution table for df = n − 2. If the p-value is < 5%, then the correlation between x and y is significant.
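
As a quick check of this formula against the output below: with r = −0.8676594 and n = 32, t = −0.8676594 × √30 / √(1 − 0.8676594²) ≈ −9.559, which matches the t value reported by cor.test().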


Spearman correlation formula:

rho = Σ (x′ − mx′)(y′ − my′) / √( Σ (x′ − mx′)² · Σ (y′ − my′)² )

where x′ = rank(x) and y′ = rank(y). In other words, the Spearman method computes the Pearson correlation between the ranks of the x and y variables.
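
A minimal sketch verifying this equivalence on toy data:
> x <- c(2, 5, 1, 4)
> y <- c(10, 30, 5, 40)
> cor(x, y, method = "spearman")
[1] 0.8
> cor(rank(x), rank(y), method = "pearson")
[1] 0.8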

Dataset Used:
R’s built-in data frame named “mtcars”. The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It comes pre-loaded in base R, so no package needs to be loaded before use.

Source Code:
car <- mtcars

#Correlation test between mpg and wt variables

pearson_result <- cor.test(car$wt, car$mpg, method = "pearson")

print(pearson_result)

#Spearman rank correlation coefficient

spearman_result <-cor.test(car$wt, car$mpg, method = "spearman")

print(spearman_result)

Output:
Pearson's product-moment correlation

data: car$wt and car$mpg


t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
Spearman's rank correlation rho

data: car$wt and car$mpg


S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.886422


Results and Discussions:


We successfully calculated Pearson’s and Spearman’s correlation coefficients between the wt and mpg variables of the mtcars dataset, and performed the corresponding correlation tests.

Learning and findings:


In this experiment we learned how to calculate the correlation between two data samples using Pearson’s and Spearman’s correlation tests.

