EMPIRICAL SOFTWARE ENGINEERING (SWE504)
PRACTICAL FILE
SWE504 EMPIRICAL SOFTWARE ENGINEERING
Experiment - 1
Aim:
Introduction of R Programming
Introduction:
R is a powerful language widely used for data analysis and statistical computing. It was developed
in the early 1990s, and since then continuous efforts have been made to improve its user interface.
The journey of R from a rudimentary text editor to the interactive RStudio and, more recently,
Jupyter Notebooks has engaged data science communities across the world.
This was possible only because of generous contributions by R users globally. The inclusion of
powerful packages has made R more capable over time. Packages such as dplyr, tidyr, readr,
data.table, SparkR, and ggplot2 have made data manipulation, visualization, and computation
much faster.
Why Learn R?
1. The coding style is quite easy to pick up.
2. It is open source; there are no subscription charges.
3. Instant access to over 7,800 packages customized for various computation tasks.
4. The community support is overwhelming; there are numerous forums to help you out.
5. High-performance computing is available through dedicated packages.
6. It is one of the most sought-after skills by analytics and data science companies.
Installing RStudio:
2. In the ‘Installers for Supported Platforms’ section, choose and click the RStudio installer
for your operating system. The download should begin as soon as you click.
4. Installation Complete.
5. To start RStudio, click its desktop icon or use ‘search windows’ to access the
program.
RStudio looks like this:
2. R Script: As the name suggests, this is the space where you write code. To run code,
simply select the line(s) and press Ctrl + Enter. Alternatively, you can click the
little ‘Run’ button located at the top right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added, including
data sets, variables, vectors, functions, etc. To check whether data has been loaded
properly in R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data
analysis. Beyond graphs, you can also select packages here and seek help from R’s
embedded official documentation.
Basic Computations in R:
1. Addition
> 2 + 3
[1] 5
2. Division
> 6 / 3
[1] 2
3. Multiplication
> (3 * 8) / (2 * 3)
[1] 4
4. Logarithmic
> log(12)
[1] 2.484907
5. Square Root
> sqrt (121)
[1] 11
> y = 15 - 9
> y
[1] 6
Objects in R:
Everything you see or create in R is an object: a vector, a matrix, a data frame, even a variable.
R has five basic or ‘atomic’ classes of objects:
1. Character
2. Numeric (Real Numbers)
3. Integer (Whole Numbers)
4. Complex
5. Logical (True / False)
An object can have the following attributes:
1. names, dimension names
2. dimensions
3. class
4. length
The attributes of an object can be accessed using the attributes() function. The most basic object
in R is the vector. An empty vector can be created with vector(), and a populated vector with the
c() (concatenate) function.
> a <- c(1.8, 4.5) #numeric
> b <- c(1 + 2i, 3 - 6i) #complex
> d <- c(23L, 44L) #integer (the L suffix forces integer storage)
> e <- vector("logical", length = 5)
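The class of each object created above can be verified with class(); a quick sketch:

```r
# Verify the atomic class of each vector created above.
a <- c(1.8, 4.5)
b <- c(1 + 2i, 3 - 6i)
d <- c(23L, 44L)          # the L suffix forces integer storage
e <- vector("logical", length = 5)

print(class(a))  # "numeric"
print(class(b))  # "complex"
print(class(d))  # "integer"
print(class(e))  # "logical"
```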
Data Types in R:
1. Vector:
A vector contains objects of the same class. But we can mix objects of different classes too.
When objects of different classes are mixed in a vector, coercion occurs: the objects are
‘converted’ into a single common class. For example:
> qt <- c("Time", 24, "January", TRUE, 3.33) #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character
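Coercion always picks the most flexible class that can represent every element (roughly logical < integer < numeric < complex < character), which can be confirmed with class():

```r
# Coercion picks the most flexible class that can hold every element.
qt <- c("Time", 24, "January", TRUE, 3.33)
ab <- c(TRUE, 24)
cd <- c(2.5, "May")

print(class(qt))  # "character"
print(class(ab))  # "numeric"
print(class(cd))  # "character"
print(ab)         # TRUE was coerced to the number 1
```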
2. List:
A list is a special type of vector that can contain elements of different data types. For
example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
As we can see, the output of a list is different from that of a vector, because the objects are
of different types. The double bracket [[1]] shows the index of the first element, and so on.
Hence, we can easily extract elements of a list by their index. Like this:
> my_list[[3]]
[1] TRUE
You can use single brackets [] too, but that returns a sub-list containing the element (with
its own index) instead of the element itself. Like this:
> my_list[3]
[[1]]
[1] TRUE
3. Matrices:
When a vector is given rows and columns, i.e. a dimension attribute, it becomes a
matrix. A matrix is represented by a set of rows and columns; it is a 2-dimensional data
structure and consists of elements of the same class. Let’s create a matrix of 3 rows and 2
columns:
> my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
> my_matrix
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
The dimensions of a matrix can be obtained using either dim() or attributes() command.
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2
To extract a particular element from a matrix, simply use the index shown above. For
example:
> my_matrix[,2] #extracts second column
[1] 4 5 6
> my_matrix[,1] #extracts first column
[1] 1 2 3
> my_matrix[2,] #extracts second row
[1] 2 5
> my_matrix[1,] #extracts first row
[1] 1 4
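A single element is addressed with both indices, my_matrix[row, column]; and note that matrix() fills column by column unless byrow = TRUE is given. A small sketch:

```r
# Single-element extraction and the byrow argument.
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

print(my_matrix[2, 2])   # element in row 2, column 2 -> 5
print(my_matrix[3, 1])   # element in row 3, column 1 -> 3

# By default matrix() fills column by column; byrow = TRUE fills row by row.
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
print(by_row[1, ])       # first row -> 1 2
```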
We can also join two vectors using the cbind() and rbind() functions. The vectors should be
of equal length; otherwise R recycles the shorter one and issues a warning.
> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x,y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70
> rbind(x,y)
  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    2    3    4    5    6
y   20   30   40   50   60   70
> class(cbind(x,y))
[1] "matrix"
> class(rbind(x,y))
[1] "matrix"
4. Data Frame:
This is the most commonly used member of the data types family, used to store tabular
data. It differs from a matrix: in a matrix, every element must have the same class, but a
data frame can hold vectors of different classes. This means every column of a data frame
acts like a list. Every time we read data into R, it is stored as a data frame, so it is
important to understand the most commonly used commands on data frames:
> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,
56,87,91))
> df
name score
1 ash 67
2 jane 56
3 paul 87
4 mark 91
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of 2 variables:
$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
$ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
> mean(df$score)
[1] 75.25
df is the name of the data frame. dim() returns the dimensions of the data frame: 4 rows and 2
columns. str() returns the structure of the data frame, i.e. the list of variables stored in it.
nrow() and ncol() return the number of rows and the number of columns respectively. mean()
returns the mean value of the selected column.
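A few more commands are handy on data frames; a short sketch on the same df shows column extraction, row filtering, and summary():

```r
# Column extraction, row filtering and summary statistics on a data frame.
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91))

print(df$score)              # extract the 'score' column as a vector
print(df[df$score > 80, ])   # rows where score exceeds 80 (paul, mark)
print(summary(df$score))     # min, quartiles, mean, max in one call
```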
Experiment - 2
Aim:
To summarize descriptive statistics for each variable considering suitable dataset:
a) Types of variables
b) Frequency distribution for the variables (counts & percentages)
Dataset Used:
R built-in data frame named “painters”. It is a compilation of technical information on a few
eighteenth-century classical painters. The data set belongs to the “MASS” package, which has to
be loaded into the R workspace prior to use.
Source Code:
library(MASS)
library(janitor)
library(tibble)
painter<-rownames_to_column(painters, var="Painter")
cat("\n\nTypes of Variables\n\n")
painter.class <- sapply(painter,class)
print(painter.class)
cat("\n\nFrequency Distribution (counts & percentages)\n\n")
# The loop header was missing from the original listing; a plausible
# reconstruction that tabulates every attribute column:
for (x in colnames(painter)[-1]) {
print(tabyl(painter, !!as.name(x)))
}
Output:
Types of Variables
Composition n percent
0 1 0.01851852
4 3 0.05555556
5 1 0.01851852
6 3 0.05555556
8 6 0.11111111
9 1 0.01851852
10 6 0.11111111
11 2 0.03703704
12 4 0.07407407
13 5 0.09259259
14 3 0.05555556
15 14 0.25925926
16 2 0.03703704
17 1 0.01851852
18 2 0.03703704
Drawing n percent
6 5 0.09259259
8 5 0.09259259
9 2 0.03703704
10 7 0.12962963
12 3 0.05555556
13 5 0.09259259
14 7 0.12962963
15 10 0.18518519
16 5 0.09259259
17 4 0.07407407
18 1 0.01851852
Colour n percent
0 1 0.01851852
4 4 0.07407407
5 1 0.01851852
6 6 0.11111111
7 2 0.03703704
8 5 0.09259259
9 3 0.05555556
10 7 0.12962963
12 3 0.05555556
13 2 0.03703704
14 3 0.05555556
15 2 0.03703704
16 8 0.14814815
17 5 0.09259259
18 2 0.03703704
Expression n percent
0 5 0.09259259
2 1 0.01851852
3 2 0.03703704
4 7 0.12962963
5 2 0.03703704
6 12 0.22222222
7 1 0.01851852
8 6 0.11111111
9 1 0.01851852
10 3 0.05555556
12 2 0.03703704
13 4 0.07407407
14 2 0.03703704
15 2 0.03703704
16 1 0.01851852
17 2 0.03703704
18 1 0.01851852
School n percent
A 10 0.18518519
B 6 0.11111111
C 6 0.11111111
D 10 0.18518519
E 7 0.12962963
F 4 0.07407407
G 7 0.12962963
H 4 0.07407407
Results and Discussions:
We successfully summarized the descriptive statistics for each variable of the given dataset,
covering the types of variables and their frequency distributions.
Experiment - 3
Aim:
To generate measures of central tendency & measures of dispersion for each attributes in the
dataset
Theory:
After data collection, descriptive statistics can be used to summarize and analyze the nature of the
data. The descriptive statistics are used to describe the data, for example, extracting attributes with
very few data points or determining the spread of the data.
Measures of Central Tendency
1. Mean
The mean is the arithmetic average of the data points: the sum of all values divided by the
number of values. It is sensitive to extreme values (outliers) in the data.
2. Median
The median is the value that divides the data into two halves: half of the data points lie
below it and half above it. For an odd number of data points, the median is the central
value; for an even number of data points, it is the mean of the two central values.
The median is not useful if the number of categories in an ordinal scale is very low. In
such cases, the mode is the preferred measure of central tendency.
3. Mode
Mode gives the value that has the highest frequency in the distribution.
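Base R has no built-in function for the statistical mode (the built-in mode() reports an object's storage mode), so a small helper is usually written. A minimal sketch, where stat_mode is our own name:

```r
# R's mode() returns the storage mode, not the statistical mode,
# so define a helper (ties return the first value encountered).
stat_mode <- function(x) {
  freq <- table(x)               # frequency of each distinct value
  names(freq)[which.max(freq)]   # value with the highest count
}

print(stat_mode(c(2, 4, 4, 4, 7, 9)))   # "4"
```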
Measures of Dispersion
The measures of dispersion indicate the spread or the range of the distributions in the data set.
Measures of dispersion include range, standard deviation, variance, and quartiles.
1. Range
The range is defined as the difference between the highest value and the lowest value in
the distribution. It is the easiest measure that can be quickly computed.
Range = Maximum Value – Minimum Value
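In R, range() returns the minimum and maximum, and diff() of that pair gives the range itself; a quick sketch with a made-up vector:

```r
# range() returns c(min, max); diff() of that gives the range value.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

print(range(x))        # -21 54
print(diff(range(x)))  # 54 - (-21) = 75
```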
2. Standard deviation
The standard deviation is a measure of variation which is commonly used with
interval/ratio data. It’s a measurement of how close the observations in the data set are to
the mean.
For normally distributed data – 68% of data points fall within the mean ± 1 standard
deviation, 95% of data points fall within the mean ± 2 standard deviations, and 99.7% of
data points fall within the mean ± 3 standard deviations.
Standard deviation may not be appropriate for skewed data.
3. Standard error of the mean
Standard error of the mean is a measure that estimates how close a calculated mean is likely
to be to the true mean of that population. It is commonly used in tables or plots where
multiple means are presented together.
The standard error is the standard deviation of a data set divided by the square root of the
number of observations. Standard error of the mean may not be appropriate for skewed
data.
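Both measures are one-liners in R: sd() is built in, and the standard error of the mean follows directly from the definition above (base R has no built-in SEM function). A sketch with made-up data:

```r
# Standard deviation with sd(), and the standard error of the mean
# computed as sd(x) / sqrt(n) per the definition above.
x <- c(12, 15, 11, 14, 18, 13, 16)
n <- length(x)

print(sd(x))            # sample standard deviation
print(sd(x) / sqrt(n))  # standard error of the mean
```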
4. Five-number summary, quartiles, percentiles
The median is the same as the 50th percentile, because 50% of values fall below this value.
Other percentiles for a data set can be identified to provide more information. Typically,
the 0th, 25th, 50th, 75th, and 100th percentiles are reported. This is sometimes called the
five-number summary. These values can also be called the minimum, 1st quartile, 2nd
quartile, 3rd quartile, and maximum.
The five-number summary is a useful measure of variation for skewed interval/ratio data
or for ordinal data. 25% of values fall below the 1st quartile and 25% of values fall above
the 3rd quartile. This leaves the middle 50% of values between the 1st and 3rd quartiles,
giving a sense of the range of the middle half of the data. This range is called the
interquartile range (IQR).
Percentiles and quartiles are relatively robust, as they aren’t affected much by a few
extreme values. They are appropriate for both skewed and unskewed data.
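These quantities map directly onto base R functions: fivenum() for the five-number summary, quantile() for arbitrary percentiles, and IQR() for the interquartile range. A sketch with a made-up, right-skewed vector:

```r
# Five-number summary, quartiles and IQR for a skewed sample.
x <- c(1, 3, 4, 5, 7, 9, 12, 15, 18, 40)   # skewed by the value 40

print(fivenum(x))                  # min, Q1, median, Q3, max
print(quantile(x, c(0.25, 0.75)))  # 1st and 3rd quartiles
print(IQR(x))                      # Q3 - Q1
```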
Dataset Used:
R built-in data frame named “painters”. It is a compilation of technical information on a few
eighteenth-century classical painters. The data set belongs to the “MASS” package, which has to
be loaded into the R workspace prior to use.
Source Code:
library(MASS)
library(pastecs)
pnt <- painters    # 'pnt' was not defined in the original listing; the painters data frame is assumed
cat("\n\nMeasures of central tendency and dispersion (std deviation, std error of the
mean)\n\n")
pnt.tend <- stat.desc(pnt)
print(pnt.tend)
Output:
Measures of central tendency and dispersion (std deviation, std error of the mean)
Experiment - 4
Aim:
To calculate univariate outliers for each variable using box plot and z-scores considering suitable
dataset
Theory:
Outlier analysis is carried out to detect data points that are overly influential and should be
considered for removal from the data set. Outliers can be divided into three types: univariate,
bivariate, and multivariate.
Univariate outliers are influential data points that occur within a single variable. Once outliers
are detected, the researcher must decide whether to include or exclude each identified outlier.
Outliers generally signal the presence of anomalies, but they may sometimes reveal interesting
patterns to the researcher. The decision is based on the reason for the occurrence of the
outlier.
Box plots, z-scores, and scatter plots can be used for detecting univariate outliers.
1. Boxplot
Box plots are based on the median and quartiles, and are constructed using the upper and
lower quartiles.
The two boundary lines (whiskers) signify the start and end of the tails. They extend to
Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 − Q1.
Thus, once the value of the IQR is known, it is multiplied by 1.5. The values shown inside
the box plot are within the boundaries and hence are not considered extreme. Data points
beyond the whiskers are considered outliers.
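The whisker fences described above can be obtained programmatically: boxplot.stats() returns, among other things, the points lying beyond the 1.5 × IQR whiskers without drawing anything. A minimal sketch with made-up data:

```r
# boxplot.stats()$out lists the points beyond the 1.5*IQR whiskers,
# i.e. the values a box plot would draw as outliers.
x <- c(5, 7, 8, 9, 10, 11, 12, 13, 45)   # 45 lies far beyond Q3 + 1.5*IQR

print(boxplot.stats(x)$out)   # 45
```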
2. Z-Score
The z-score is another method to identify outliers. It depicts the relationship of a value to
the mean of its variable:
z = (x − μ) / σ
where μ is the mean and σ is the standard deviation of the variable. Data points whose
absolute z-score exceeds about 3 are commonly flagged as outliers.
Dataset Used:
R built-in data frame named “precip”. It is a compilation of the average annual precipitation
(rainfall) in inches for each of 70 United States (and Puerto Rico) cities. It comes pre-loaded
in the R workspace.
Source Code:
#Storing precip dataset into a variable
rain <- precip
Output:
Z-Score
Mobile 2.342971149
Juneau 1.445596523
Phoenix -2.034466051
Little Rock 0.993261346
Los Angeles -1.523765044
Sacramento -1.290301727
San Francisco -1.034951224
Denver -1.596722331
Hartford 0.621179184
Wilmington 0.387715866
Washington 0.292871394
Jacksonville 1.431005066
Miami 1.817678685
Atlanta 0.978669888
Honolulu -0.874445193
Boise -1.706158261
Chicago -0.035436396
Peoria 0.015633704
Indianapolis 0.278279936
Des Moines -0.298082628
Wichita -0.312674086
Louisville 0.599291998
New Orleans 1.598806825
Portland 0.431490238
Baltimore 0.504447525
Boston 0.555517626
Detroit -0.283491171
Sault Ste. Marie -0.232421070
Duluth -0.341857000
Minneapolis/St Paul -0.655573333
Jackson 1.044331446
Kansas City 0.154252549
St Louis 0.073999534
Great Falls -1.450807758
Omaha -0.341857000
Reno -2.019874594
Concord 0.095886720
Atlantic City 0.774389486
Albuquerque -1.976100222
Albany -0.108393683
Buffalo 0.088590991
New York 0.387715866
Charlotte 0.570109083
Raleigh 0.555517626
Bismark -1.363259014
Cincinnati 0.300167122
Cleveland 0.008337976
Columbus 0.154252549
Oklahoma City -0.254308256
Portland 0.198026921
Philadelphia 0.365828680
Pittsburg 0.095886720
Providence 0.577404812
Columbia 0.840051044
Sioux Falls -0.743122077
Memphis 1.037035718
Nashville 0.810868129
Dallas 0.073999534
El Paso -1.976100222
Houston 0.971374160
Salt Lake City -1.436216300
Burlington -0.174055241
Norfolk 0.716023656
Richmond 0.562813354
Seattle Tacoma 0.285575665
Spokane -1.275710270
Charleston 0.431490238
Milwaukee -0.422110016
Cheyenne -1.479990672
San Juan 1.773904313
attr(,"scaled:center")
[1] 34.88571
attr(,"scaled:scale")
[1] 13.70665
Experiment - 5
Aim:
To calculate correlation between two data samples:
a) Pearson’s correlation coefficient to summarize the linear relationship.
b) Spearman’s correlation coefficient to summarize the monotonic relationship.
Theory:
There are different methods to perform correlation analysis:
• Pearson correlation (r), which measures the linear dependence between two variables (x and
y). It is also known as a parametric correlation test because it depends on the distribution
of the data; it should be used only when x and y come from normal distributions. The plot
of y = f(x) is called the linear regression line.
• Kendall’s tau and Spearman’s rho, which are rank-based (non-parametric) correlation
coefficients.
Pearson correlation formula:
r = Σ (x − mx)(y − my) / √( Σ (x − mx)² · Σ (y − my)² )
where
x and y are two vectors of length n
mx and my correspond to the means of x and y, respectively
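The formula can be checked by computing r directly and comparing it with R's built-in cor(); the vectors below are made up for illustration:

```r
# Computing Pearson's r from the formula and checking it against cor().
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)
mx <- mean(x); my <- mean(y)

r_manual <- sum((x - mx) * (y - my)) /
            sqrt(sum((x - mx)^2) * sum((y - my)^2))

print(r_manual)
print(cor(x, y, method = "pearson"))   # same value
```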
The p-value (significance level) of the correlation can be determined:
a) by using the correlation coefficient table for df = n − 2 degrees of freedom
Dataset Used:
R built-in data frame named “mtcars”. The data was extracted from the 1974 Motor Trend US
magazine, and comprises fuel consumption and 10 aspects of automobile design and performance
for 32 automobiles (1973–74 models). It comes pre-loaded in the R workspace.
Source Code:
car <- mtcars
# The lines computing the results were missing from the original listing;
# mpg vs wt is an assumed variable pair for illustration:
pearson_result <- cor.test(car$mpg, car$wt, method = "pearson")
spearman_result <- cor.test(car$mpg, car$wt, method = "spearman")
print(pearson_result)
print(spearman_result)
Output:
Pearson's product-moment correlation