
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF CHEMICAL ENGINEERING




ASSIGNMENT REPORT
PROBABILITY AND STATISTICS
Semester: 222
Class: CC07 – Group: 05
Lecturer: Dr. NGUYEN TIEN DUNG

No. Full name Student ID Faculty


1 Lê Thị Thu Ngọc 2053271 Chemical Engineering
2 Trần Nguyễn Bảo Ngọc 2153626 Chemical Engineering
3 Huỳnh Ngọc Minh Anh 2153156 Chemical Engineering
4 Nguyễn Thị Ngọc Ánh 2153180 Chemical Engineering
5 Trần Anh Khoa 2153472 Chemical Engineering

Ho Chi Minh City - 2023


Member list and Workload

No.  Full name               Student ID  Tasks                     Percentage of work
1    Lê Thị Thu Ngọc         2053271     Code, model analysis      20%
2    Trần Nguyễn Bảo Ngọc    2153626     Code, data analysis       20%
3    Huỳnh Ngọc Minh Anh     2153156     Code, data visualisation  20%
4    Nguyễn Thị Ngọc Ánh     2153180     Code, data visualisation  20%
5    Trần Anh Khoa           2153472     Theory, dataset overview  20%
CONTENTS
1. THEORY
  1.1. Logistic Regression Analysis
  1.2. Method of Maximum Likelihood Estimation (MLE)
  1.3. ROC - AUC Method
2. OBJECTIVES AND METHODS
  2.1. Objectives
  2.2. Methods
    2.2.1. Model selection
    2.2.2. Evaluate the overall meaning of the model
    2.2.3. Make predictions
3. ANALYZE THE DATA
  3.1. Dataset Overview
  3.2. Import Data
  3.3. Data cleaning
  3.4. Descriptive statistics
    3.4.1. Calculate the descriptive statistics
    3.4.2. Make a table of the quantity statistics
  3.5. Data Visualization
    3.5.1. Plot histograms showing the distribution of quantitative variables, and plot statistical bar plots for each classifier
    3.5.2. Plot a histogram showing the distribution of pH of high/low milk quality
    3.5.3. Plot a histogram showing the distribution of Temperature of high/low milk quality
    3.5.4. Plot a histogram showing the distribution of Colour of high/low milk quality
    3.5.5. Plot a barplot chart with quantitative statistics of Taste and Odor of high/low milk grade
    3.5.6. Plot a barplot chart with quantitative statistics of Fat and Turbidity of high/low milk grade
  3.6. Build up a logistic regression model to evaluate the milk quality
  3.7. Make predictions
4. CONCLUSION
5. REFERENCES
1. THEORY
1.1. Logistic Regression Analysis
Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable based on one or more independent variables. The dependent variable in logistic regression is binary, meaning it takes one of two values, usually coded as 0 and 1. In many cases, the dependent variable is not a continuous measurement but a binary outcome: yes/no, ill/healthy, deceased/alive, occurred/did not occur, and so on, while the independent variables can be continuous or discrete.
In practice, the 0s and 1s code whether the event happened (1) or not (0). Given an event observed x times among n subjects, we can estimate the probability of that event as p = x/n.

The odds of an event is defined as the ratio of the probability of the event occurring to the probability of the event not occurring:

O = P / (1 − P)

or, equivalently, P = O / (1 + O), where O is the odds and P is the probability that the event occurs. For example, if P = 0.8, then O = 0.8/0.2 = 4.
The logistic regression model is built on the logit transformation:

logit(p) = log(p / (1 − p))

Because the relationship between p and logit(p) is continuous, we can graph this function.

Figure 1: The graph demonstrates the relationship between p and logit(p)

Given an independent variable x (x can be continuous or discrete), α and β are two linear parameters that need to be estimated from the sample data. In summary, the simple logistic regression model can be written with the two following equations:

logit(p) = log(p / (1 − p)) = α + βx

odds(p) = p / (1 − p) = e^(α + βx)

When the odds equal 1, the two outcomes have equal probability. If the odds are lower than 1, the event is less likely to occur than not, and conversely. Moreover, if we have more than one independent variable (x1, x2, ..., xp−1), we obtain the multiple logistic regression model, expressed as:

log(p / (1 − p)) = β0 + β1x1 + … + βp−1xp−1

1.2. Method of Maximum Likelihood Estimation (MLE)


Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a model based on observed data. The basic idea is to find the values of the model parameters that maximize the likelihood of the observed data.
The likelihood function L(β) is simply the probability of the entire observed data set, viewed as a function of the parameters. To "fit" the model, that is, to estimate the parameters β0, β1, ..., βp−1, we maximize

L(β0, β1, ..., βp−1) = ∏_{i=1}^{n} p_i^{y_i} (1 − p_i)^{1 − y_i}

where p_i is the model's predicted probability for observation i and y_i ∈ {0, 1} is the observed outcome.
1.3. ROC - AUC Method
AUC is a performance metric used in logistic regression to evaluate the accuracy of a binary classification model. It is a measure of the overall quality of the model and of its ability to correctly classify both positive and negative examples.
In logistic regression, the model assigns a probability score to each example,
representing the predicted probability that the example belongs to the positive class. The
AUC metric measures the model's ability to distinguish between positive and negative
examples by plotting the true positive rate (TPR) against the false positive rate (FPR) at
different probability thresholds.
The ROC (Receiver Operating Characteristic) curve is a graphical representation
of the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
for different threshold values. The AUC is the area under the ROC curve, which is a
measure of the model's ability to distinguish between positive and negative examples.
A perfect classifier has an AUC of 1.0, while a random classifier has an AUC of 0.5.
Therefore, a higher AUC indicates better performance of the logistic regression model
in discriminating between positive and negative examples.

Figure 2: Description of ROC - AUC


In summary, AUC is a metric used to evaluate the accuracy of a binary classification model in logistic regression by measuring the model's ability to distinguish between positive and negative examples based on the probability scores the model assigns. For this dataset, the ROC-AUC method is applied in R to assess the classification of milk quality.

2. OBJECTIVES AND METHODS


2.1. Objectives
In the food industry, grading products is very important. Grading guarantees product quality, which affects both the selling price and food hygiene and safety. Hence, based on a dataset of factors related to the product, we need to find the best way to classify product quality. For this reason, our group decided to work with a dataset on the most common dairy product: milk. Milk is produced all over the world, so there are many sources of milk of different quality, used for different purposes. Thus, the classification steps have to be fast and accurate.
In our dataset, there are many factors which can impact the milk grade, such as pH, Temperature, Taste, Odor, Fat, Turbidity, and Colour. We will then write an R program which can predict and classify the grade of milk.
For this assignment, we will build the best model for our dataset and then make predictions to verify that this model classifies the milk's grade accurately.
2.2. Methods
2.2.1. Model selection
In this report, we choose the logistic regression model, since logistic regression is typically used when the dependent variable is binary or dichotomous, meaning it can take on only two values, such as 0 or 1, Yes or No, True or False. In particular, in our dataset there are only two options for a product's milk grade: High or Low.
After the logistic regression model is chosen, we use R to find the best model by removing factors which are not statistically significant, i.e., which show no effect on the milk's quality (p-value > 0.05).
2.2.2. Evaluate the overall meaning of the model

The logistic regression model is built from the data of a sample taken from the population, so it can be affected by sampling error. Therefore, we must perform hypothesis testing to conclude that there is a statistically significant relationship between the predictor variable (x) and the response variable (y). Let us denote the null hypothesis and the alternative hypothesis as

H0: β1 = 0;  H1: β1 ≠ 0

Then, we calculate the overall chi-square statistic of the model as

χ² = Null deviance − Residual deviance

with degrees of freedom

df = Null degrees of freedom − Residual degrees of freedom

The p-value of the model is the upper-tail probability of this statistic, computed in R as 1 − pchisq(χ², df). If the p-value is less than the significance level (p < 0.05), the model is highly useful for predicting the probability of the response. Conversely, if the p-value is greater than the significance level (p > 0.05), we cannot conclude that the predictor variable (x) and the response variable (y) are related.

2.2.3. Make predictions


In this case, our team chooses the ROC-AUC method to assess the model's ability to distinguish, or classify, the grades of milk products. To determine the best prediction model, we need to consider the accuracy, the specificity, the TPR (true positive rate, or sensitivity), and the FPR (false positive rate): the accuracy, sensitivity, and specificity should be as high as possible, while the FPR should be as low as possible. These quantities are calculated from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) by the following equations.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Specificity = TN / (TN + FP)

TPR = Sensitivity = TP / (TP + FN)

FPR = 1 − Specificity
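As an illustration (not part of the original report's code), these four metrics can be computed in R directly from the confusion-matrix counts; the values of TP, TN, FP, and FN below are placeholders.

> TP <- 2; TN <- 3; FP <- 1; FN <- 1             #placeholder counts
> (TP + TN) / (TP + FP + FN + TN)                #accuracy
> TP / (TP + FN)                                 #TPR = sensitivity
> TN / (TN + FP)                                 #specificity
> 1 - TN / (TN + FP)                             #FPR = 1 - specificity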

3. ANALYZE THE DATA


3.1. Dataset Overview
Our dataset contains several factors that can impact the milk quality:
• pH: A measure of the acidity or basicity of a solution, defined by the concentration of hydrogen ions (H+) in the milk. The pH of milk ranges from 3 to 9.5.
• Temperature: Temperature of the milk, which ranges from 34°C to 90°C.
• Taste: Whether the taste is acceptable; in our dataset, taste can be 0 (Bad) or 1 (Good).
• Odor: Smell or aroma, a property of substances detected by our senses. In our dataset, odor can be 0 (Bad) or 1 (Good).
• Fat: The fat content of the milk, which can be 0 (Low) or 1 (High).
• Turbidity: The degree of cloudiness or hazy appearance of the milk, caused by the presence of suspended particles such as fat globules, casein micelles, and other proteins. This value can be 0 (Bad) or 1 (Good).
• Colour: Variations in hue and brightness, ranging from 240 to 255.
The target variable is Grade, the quality of the milk, which is either high or low.

3.2. Import Data
Firstly, let's import our data with the readr library:

> library(readr)
> milkgrade <- read_csv("BACH KHOA UNIVERSITY/SECOND
YEAR/PROBABILITY AND STATISTICS/milkgrade.csv")
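If the readr package is unavailable, base R can read the same file (a sketch, assuming milkgrade.csv is in the working directory):

> milkgrade <- read.csv("milkgrade.csv")  #base-R equivalent of read_csv()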

To view the dataset as a table, we use the View() command:


> View(milkgrade)

Figure 3: The table of the “milkgrade.csv” dataset

Let’s check the first 5 rows of the data.

> head(milkgrade, 5) #Show the first 5 lines of "milkgrade"

Figure 4: The first 5 rows of “milkgrade”

3.3. Data cleaning
Practically, we need to rename the data columns to make it easier to grasp how the attributes relate to the goal. Since Grade is the output target in our data, the other attributes can be denoted Xi and Grade denoted Y:

> colnames(milkgrade)[1]= "Y"


> colnames(milkgrade)[2]= "X1"
> colnames(milkgrade)[3]= "X2"
> colnames(milkgrade)[4]= "X3"
> colnames(milkgrade)[5]= "X4"
> colnames(milkgrade)[6]= "X5"
> colnames(milkgrade)[7]= "X6"
> colnames(milkgrade)[8]= "X7"
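Equivalently, all eight columns can be renamed in one step (a sketch, assuming the same column order as above):

> colnames(milkgrade) <- c("Y", paste0("X", 1:7))  #Y = Grade, X1..X7 = the seven predictors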

Check our output:


> milkgrade

Figure 5: Rename the attributes

Before beginning to analyze the data, we first need to determine whether any NA
(Not Available) values are present.
> apply(is.na(milkgrade),2,sum) #Check for NA values and show the total
number of NAs in each column

Figure 6: Check the NA values

Comment: We can see that the dataset doesn’t have NA values.

Next, we check whether there are any duplicated rows:
> sum(duplicated(milkgrade)) #give the number of duplicated rows

Output:

Figure 7: Check duplicates

Then, we remove the duplicates using the unique() command. The unique()
function in R eliminates duplicate values or rows in a vector, data frame, or matrix.

> milkgrade <- unique(milkgrade) #remove the duplicated rows


> View(milkgrade)

Figure 8: After removing the duplicated rows

Comment: After cleaning, we have a new dataset with 49 rows and 8 variables,
so the duplicated rows have been removed.

3.4. Descriptive statistics


Obtaining the descriptive statistics for any dataset is the first step in any analysis.
However, the Y column of our data is of character type, so to make the calculations
easier we must convert the data to a numeric representation.
The Grade column, renamed Y, takes only two values (high or low). The idea is
therefore to convert it to a binary number (1 for high and 0 for low); in other words,
we are "encoding" the data. So:
> milkgrade$Y[milkgrade$Y == 'low'] <- 0
> milkgrade$Y[milkgrade$Y == 'high'] <- 1

Then, let’s see


> milkgrade

Figure 9: Encoding data

Next, we convert each variable to numeric and check the datatype of each variable
with the sapply() command:

> milkgrade$Y <- as.numeric(milkgrade$Y)


> milkgrade$X1 <- as.numeric(milkgrade$X1)
> milkgrade$X2 <- as.numeric(milkgrade$X2)
> milkgrade$X3 <- as.numeric(milkgrade$X3)
> milkgrade$X4 <- as.numeric(milkgrade$X4)
> milkgrade$X5 <- as.numeric(milkgrade$X5)
> milkgrade$X6 <- as.numeric(milkgrade$X6)
> milkgrade$X7 <- as.numeric(milkgrade$X7)
> sapply(milkgrade,class) #Check the class of each variable

Figure 10: Check the datatype
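The eight explicit conversions above can also be written in one step (a sketch):

> milkgrade[] <- lapply(milkgrade, as.numeric)  #convert every column to numeric at once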

3.4.1. Calculate the descriptive statistics
For the continuous variables "pH", "Temperature", and "Colour", we compute
descriptive statistics and export the results in table form.

> ##Perform the descriptive statistic for variables


> mean <- apply(milkgrade[,c("X1", "X2", "X7")],2,mean) #Determine
the sample mean
> sd <- apply(milkgrade[,c("X1", "X2", "X7")],2,sd) #Determine the
standard deviation
> Q1 <- apply(milkgrade[,c("X1", "X2", "X7")],2,quantile, probs=0.25)
#Determine the 1st quantile_Q1 (25%)
> median <- apply(milkgrade[,c("X1", "X2", "X7")],2,median)
#Determine the median value
> Q3 <- apply(milkgrade[,c("X1", "X2", "X7")],2,quantile, probs=0.75)
#Determine the 3rd quantile_Q3 (75%)
> min <- apply(milkgrade[,c("X1", "X2", "X7")],2,min) #Determine the
minimum value
> max <- apply(milkgrade[,c("X1", "X2", "X7")],2,max) #Determine the
maximum value
> t(data.frame(mean,sd,Q1,median,Q3,min,max)) #Draw a table of
descriptive statistic

Figure 11: Descriptive statistic of the dataset
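Base R's summary() gives a quick cross-check of most of these values:

> summary(milkgrade[, c("X1", "X2", "X7")])  #min, quartiles, median, mean, max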

3.4.2. Make a table of the quantity statistics


For the categorical variables "Taste", "Odor", "Fat", "Turbidity", and "Grade", we
make a statistical table.

> table(milkgrade$X3)

Figure 12: R code and results when performing quantitative statistics for the variable "Taste"

Comments:
• 20 samples do not meet the optimal taste condition (value 0).
• 29 samples meet the optimal taste condition (value 1).

> table(milkgrade$X4)

Figure 13: R code and results when performing quantitative statistics for the variable "Odor"

Comments:
• 21 samples do not meet the optimal odor condition.
• 28 samples meet the optimal odor condition.

> table(milkgrade$X5)

Figure 14: R code and results when performing quantitative statistics for the variable "Fat"

Comments:
• 10 samples do not meet the optimal fat condition.
• 39 samples meet the optimal fat condition.

> table(milkgrade$X6)

Figure 15: R code and results when performing quantitative statistics for the variable
"Turbidity"

Comments:
• 19 samples do not meet the optimal turbidity condition.
• 30 samples meet the optimal turbidity condition.

> table(milkgrade$Y)

Figure 16: R code and results when performing quantitative statistics for the variable "Grade"

Comments:
• 26 samples have low-quality milk.
• 23 samples have high-quality milk.

3.5. Data Visualization


3.5.1. Plot histograms showing the distribution of quantitative variables,
and plot statistical bar plots for each classifier

> ##Plot the histogram to illustrate the distribution of variables


> hist(milkgrade$X1, xlab = "X1", ylab = "Frequency", main =
"Histogram graph for frequency of pH",col="hotpink1", label= T,
ylim= c(0,50)) #Plot a histogram to illustrate the distribution for
variable pH
> hist(milkgrade$X2, xlab = "X2", ylab = "Frequency", main =
"Histogram graph for frequency of Temperature",col="springgreen4",
label= T, ylim= c(0,50)) #Plot a histogram to illustrate the
distribution for variable Temperature
> hist(milkgrade$X7, xlab = "X7", ylab = "Frequency", main =
"Histogram graph for frequency of Colour",col="lightblue2", label=
T, ylim= c(0,50)) #Plot a histogram to illustrate the distribution
for variable Colour

Figure 17: The result when plotting the histogram of the variable “pH”
Figure 18: The result when plotting the histogram of the variable “Temperature”

Figure 19: The result when plotting the histogram of the variable “Colour”

Comments:
• The "Histogram graph for frequency of pH" clearly shows that the frequency
is unevenly distributed. The pH range 6-7 has the highest frequency, about 8 times
higher than the others.
• In the "Histogram graph for frequency of Temperature", the frequency decreases
from 30 to 90 degrees Celsius. The frequencies are high between 30 and 50 degrees
Celsius but substantially lower between 50 and 90 degrees Celsius. Furthermore,
there is an outlier at temperatures from 70 to 80 degrees.
• In the "Histogram graph for frequency of Colour", there is a significant contrast
between the colour frequencies. Most values lie in the range 254 to 256; on the other
hand, there is an outlier in the colour-code range 250 to 252.

> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$X3), xlab = "Taste", ylab = "Frequency",
main = "Barplot of Taste", col= "salmon1") #Plot a barplot to
illustrate the distribution for variable Taste
> barplot(table(milkgrade$X4), xlab = "Odor", ylab = "Frequency",
main = "Barplot of Odor", col= "aquamarine2") #Plot a barplot to
illustrate the distribution for variable Odor

Figure 20: The result when plotting the barplots of the variable “Taste” and “Odor”

> par(mfrow = c(1,3)) #Set the 1x3 matrix for the graphs
> barplot(table(milkgrade$X5), xlab = "Fat", ylab = "Frequency", main
= "Barplot of Fat", col= "midnightblue") #Plot a barplot to
illustrate the distribution for variable Fat
> barplot(table(milkgrade$X6), xlab = "Turbidity", ylab =
"Frequency", main = "Barplot of Turbidity", col= "lemonchiffon2")
#Plot a barplot to illustrate the distribution for variable Turbidity
> barplot(table(milkgrade$Y), xlab = "Grade", ylab = "Frequency",
main = "Barplot of Grade", col = "steelblue") #Plot a barplot to
illustrate the distribution for variable Grade

Figure 21: The result when plotting the barplots of the variable “Fat”, “Turbidity”
and “Grade”

3.5.2. Plot a histogram showing the distribution of pH of high/low milk quality

> library(ggplot2)
> library(plyr)
> mu_pH <- ddply(milkgrade, "Y", summarise, grp.mean=mean(X1))
> ggplot(milkgrade, aes(x=X1, color= as.factor(Y), fill=
as.factor(Y))) + geom_histogram(position = "identity", alpha=0.5)
+ geom_vline(data = mu_pH, aes(xintercept=grp.mean,
color=as.factor(Y)), linetype= "twodash") +
scale_color_manual(values = c("rosybrown2",
"lightblue2","palegreen4")) + scale_fill_manual(values =
c("rosybrown2", "lightblue2","palegreen4")) + labs(title =
"Histogram of pH for Grade of milk", x="pH", y="Frequency") +
theme_light()

Figure 22: Histogram results show the distribution of pH of good/bad milk quality

Comment: The average pH of low-grade milk is higher than the average pH of
high-grade milk, which suggests that higher-grade milk tends to have a pH below
the overall average.

3.5.3. Plot a histogram showing the distribution of Temperature of high/low


milk quality

> mu_Temperature <- ddply(milkgrade, "Y", summarise,


grp.mean=mean(X2))
> ggplot(milkgrade, aes(x=X2, color= as.factor(Y), fill=
as.factor(Y))) + geom_histogram(position = "identity", alpha=0.5)
+ geom_vline(data = mu_Temperature, aes(xintercept=grp.mean,
color=as.factor(Y)), linetype= "twodash") +
scale_color_manual(values = c("rosybrown2",
"lightblue2","palegreen4")) + scale_fill_manual(values =
c("rosybrown2", "lightblue2", "palegreen4")) + labs(title =
"Histogram of Temperature for Grade of milk", x="Temperature",
y="Frequency") + theme_light()

Figure 23: Histogram results show the distribution of Temperature of good/bad milk quality

Comments:
• The average temperature of low-grade milk is higher than that of high-grade milk.
• The temperature of high-grade milk is approximately normally distributed around
35-45 degrees.

3.5.4. Plot a histogram showing the distribution of Colour of high/low milk quality

> mu_Colour <- ddply(milkgrade, "Y", summarise, grp.mean=mean(X7)) #Group mean of Colour (X7) for each grade


> ggplot(milkgrade, aes(x=X7, color= as.factor(Y), fill=
as.factor(Y))) + geom_histogram(position = "identity", alpha=0.5)
+ geom_vline(data = mu_Colour, aes(xintercept=grp.mean,
color=as.factor(Y)), linetype= "twodash") +
scale_color_manual(values = c("rosybrown2",
"lightblue2","palegreen4")) + scale_fill_manual(values =
c("rosybrown2", "lightblue2", "palegreen4")) + labs(title =
"Histogram of Colour for Grade of milk", x="Colour",
y="Frequency") + theme_light()

Figure 24: Histogram results show the distribution of Colour of good/bad milk quality

Comments:
• The colour of milk fluctuates mildly between 246 and 255.
• At a colour value of 255, the frequency of high-grade milk is higher than that of
low-grade milk.
• At 246, the frequency of high-grade milk is significantly lower than that of
low-grade milk.

3.5.5. Plot a barplot chart with quantitative statistics of Taste and Odor of
high/low milk grade

> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$Y, milkgrade$X3), xlab = "Taste", ylab =
"Frequency", main = "Barplot of Taste for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X3)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for the variable Taste
> barplot(table(milkgrade$Y, milkgrade$X4), xlab = "Odor", ylab =
"Frequency", main = "Barplot of Odor for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X4)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Odor

Figure 25: Result of the barplot graph of the quantity of two variables “Taste”, “Odor”

Comments:
• For samples with good taste, the frequency of high-grade milk is higher than for
samples with bad taste; a good flavour is associated with more high-grade milk.
• For samples with an appealing odor, the frequency of low-grade milk is smaller
than that of high-grade milk; thus, an appealing odor is also associated with higher
milk grades.

3.5.6. Plot a barplot chart with quantitative statistics of Fat and Turbidity of
high/low milk grade

> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$Y, milkgrade$X5), xlab = "Fat", ylab =
"Frequency", main = "Barplot of Fat for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X5)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Fat
> barplot(table(milkgrade$Y, milkgrade$X6), xlab = "Turbidity", ylab
= "Frequency", main = "Barplot of Turbidity for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X6)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Turbidity

Figure 26: Result of the barplot graph of the quantity of two variables “Fat”, “Turbidity”

Comments:
• At a high fat content, the frequency of low-grade milk is lower than the frequency
of high-grade milk. At a low fat content, the frequency of low-grade milk is
significantly greater than the frequency of high-grade milk. Consequently, a high fat
content is associated with higher milk grades; fat has a direct effect on the milk grade.
• At high turbidity, the frequency of low-grade milk is larger than that of high-grade
milk, and the two frequencies are comparable at low turbidity. This suggests that
turbidity has little effect on the milk grade.
3.6. Build up a logistic regression model to evaluate the milk quality
Firstly, we load the "caTools" library, which provides the sample.split() function
used to split the data. After that, we set the seed to a fixed number ("1000" is chosen
in this case) so that the results are reproducible.
> library(caTools)

> set.seed(1000)

Next, we use the split method to separate the sample into "train" and "test" datasets
with a split ratio of 0.85. This means 85% of our dataset goes into the training dataset
and 15% into the testing dataset.

The result of the split is a logical vector of TRUE and FALSE values.

> split = sample.split(milkgrade$Y, SplitRatio = 0.85)

> split

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRU
E TRUE
[14] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TR
UE TRUE
[27] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FAL
SE TRUE
[40] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

Figure 27: Results after the sample is split with a ratio of 0.85

We use the subset() command so that the train dataset gets all the data points
where the split value is TRUE, and the test dataset gets all the data points where it
is FALSE.

> train <- subset(milkgrade, split == TRUE)

> test <- subset(milkgrade, split == FALSE)

Moving on to the next step, we use the training dataset to create the logistic
regression model with the glm() (generalized linear model) function. Then, we use the
summary() command to show the statistical values for our independent variables after
the model is generated.

#Build up logistic regression model


> models <- glm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, family =
"binomial", data = train) #Fit a generalized linear model (logistic
regression) on the training data
> summary(models)

Call:
glm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, family = "binomial",
data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.88573 -0.58048 -0.01818 0.59463 1.87784

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.65420 32.99509 -0.656 0.5116
X1 0.61270 0.39169 1.564 0.1178
X2 -0.22091 0.09451 -2.337 0.0194 *
X3 0.12670 1.10510 0.115 0.9087
X4 0.20436 1.13409 0.180 0.8570
X5 3.59796 1.85357 1.941 0.0522 .
X6 -0.54890 1.11432 -0.493 0.6223
X7 0.09618 0.12381 0.777 0.4373
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 58.129 on 41 degrees of freedom


Residual deviance: 32.754 on 34 degrees of freedom
AIC: 48.754

Number of Fisher Scoring iterations: 6

Figure 28: The different statistical values for the logistic regression model

Comments:

• In this model, only X2 (Temperature) has a statistically significant impact on the
milk grade (p = 0.0194 < 0.05). The other variables are not statistically significant.
• Keeping only one predictor variable (X2) and one response variable (Y), we can
use simple logistic regression, which estimates the relationship between the variables
with the formula:

log(p / (1 − p)) = β0 + β1X

Then, the formula for the most optimal model can be expressed as

log(p / (1 − p)) = −21.6542 − 0.2209·X2
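Strictly speaking, a reduced model should be refit on the training data rather than reusing the coefficients from the full model; a sketch of that refit (not shown in the original report):

> models_X2 <- glm(Y ~ X2, family = "binomial", data = train)  #Temperature-only model
> summary(models_X2)  #its intercept and slope will differ slightly from the full model's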

However, we need to perform hypothesis testing in order to conclude with certainty that
there is a statistically significant relationship between the predictor variable (x) and the
response variable (y). First, the null hypothesis and the alternative hypothesis are

H0 : β1 = 0; H1 : β1 ≠ 0

Next, we compute pchisq(Null deviance − Residual deviance, Null df − Residual df),
the chi-square distribution function evaluated at the observed statistic; the p-value is
then 1 minus this value.

#Hypothesis testing

> 1-pchisq(58.129-32.754, 41-34)

[1] 0.0006509515

Figure 29: The p-value of overall Chi-square statistic

Comment: Since the p-value is less than the significance level of 0.05, the
null hypothesis can be rejected. In other words, our model is highly useful for
predicting the probability that a given sample is high-grade milk.
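The same test can also be run directly from the fitted model object, using the deviances and degrees of freedom stored in it (a sketch):

> with(models, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))  #p-value of the overall chi-square test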

3.7. Make predictions

After the model is created and fitted, we make predictions with the predict()
function, which returns the predicted probability p for each observation; for the
reduced model kept above, p is determined from the equation

log(p / (1 − p)) = −21.6542 − 0.2209·X2

Initially, we make predictions on the training dataset and use the summary()
command to get the statistical values.
#Prediction on train dataset
> pred_train <- predict(models, type = "response", newdata = train)
> summary(pred_train)

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000913 0.1031407 0.5217235 0.4761905 0.8248221 0.9432828

Figure 30: The results when making predictions on train dataset

Next, we continue by predicting on the testing dataset, which contains the unseen data values.
After the predictions on the test dataset are made, we build a confusion matrix with a threshold value of 0.5.

#Prediction on test dataset


> pred_test <- predict(models, type = "response", newdata = test)
> table(test$Y, pred_test > 0.5) # Assuming threshold to be 0.5

FALSE TRUE
0 3 1
1 1 2

Figure 31: The confusion matrix table of this model

Comment: The rows show the observed milk grades (0 = low, 1 = high), while the
columns FALSE and TRUE show the predicted grades (FALSE = predicted low-grade,
TRUE = predicted high-grade).

Hence, to read the result more easily, let's rename the components of this
confusion matrix. After that, we use the t(table) command to see the result.

#Change the names of rows in the table

> table <- data.frame(table(test$Y, pred_test > 0.5))

> rownames(table)[1]="Observed low-grade milk"

> rownames(table)[2]="Predicted low-grade milk"

> rownames(table)[3] ="Observed high-grade milk"

> rownames(table)[4]="Predicted high-grade milk"

> t(table)
Observed low-grade milk Predicted low-grade milk Observed high-grade milk
Var1 "0" "1" "0"
Var2 "FALSE" "FALSE" "TRUE"
Freq "3" "1" "1"
Predicted high-grade milk
Var1 "1"
Var2 "TRUE"
Freq "2"

Figure 32: The confusion matrix after changing names
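The row labels above mix observed and predicted categories, so a more transparent alternative (not in the original report) is to label the table's dimensions directly, assuming both classes appear in the test set:

> tab <- table(Observed = test$Y, Predicted = pred_test > 0.5)
> dimnames(tab) <- list(Observed = c("low-grade", "high-grade"),
                        Predicted = c("low-grade", "high-grade"))
> tab  #rows: observed grade; columns: predicted grade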

Recall that the standard confusion matrix form is shown below:

Figure 33: The standard form of the confusion matrix

Hence, our confusion matrix states that the true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN) are 2, 3, 1, and 1, respectively. From the formulas
below, we calculate the accuracy, the true positive rate (TPR, or sensitivity), the
specificity, and the false positive rate (FPR). Recall the crucial equations:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Specificity = TN / (TN + FP)

TPR = Sensitivity = TP / (TP + FN)

FPR = 1 − Specificity
In particular, the accuracy shows how much we predicted correctly, and it should be
as high as possible. In this case, with the split ratio of 0.85, the model achieves an
accuracy of 71.4%.

> (2+3)/nrow(test) # accuracy

[1] 0.7142857
Figure 34: The accuracy of this model
The true positive rate (TPR), or sensitivity, is a measure of the probability that an
actual positive instance will be classified as positive. Similarly, the false positive rate
(FPR) is a measure of how often an actual negative instance will be classified
as positive. The FPR is calculated as (1 − Specificity), where the specificity measures
the proportion of actual negatives that are correctly identified. Thus, TPR and
specificity should be as high as possible and, conversely, FPR should be as low as
possible in order to obtain a good prediction model.

> 2/(2+1) # tpr = sensitivity

[1] 0.6666667
Figure 35: The true positive rate

> 3/(3+1) # specificity

[1] 0.75

Figure 36: The specificity

> 1-3/(3+1) #fpr

[1] 0.25

Figure 37: The false positive rate

Comments:

• TPR, specificity, and FPR are determined as 66.7%, 75%, and 25%, respectively.
• With the split ratio of 0.85, the model gives a good accuracy of over 71%.
Moreover, the TPR and FPR values are reasonable for a prediction model.

Let's load the "ROCR" library to visualize the performance of scoring classifiers,
such as ROC graphs and sensitivity/specificity curves.

> library(ROCR)

The ROC (Receiver Operating Characteristic) curve can help in deciding the best
threshold value. A high threshold value gives high specificity and low sensitivity;
conversely, a low threshold value gives low specificity and high sensitivity. Then, we
use the plot command to draw the ROC curve for the test data of Grade (Y), with FPR
on the x-axis and TPR on the y-axis.

> ROCRpred <- prediction(pred_test, test$Y)


> ROCRperf <- performance(ROCRpred, "tpr", "fpr")
> plot(ROCRperf)

Here is the graph of our predicted model

Figure 38: ROC curve
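ROCR can also colour the curve by threshold and print selected cutoff values, which makes it easier to pick an operating point:

> plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1))  #annotate thresholds along the curve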

AUC (Area Under the ROC Curve) measures the entire two-dimensional area below the ROC
curve. This metric assesses the quality of the model's predictions regardless of the
classification threshold chosen. The AUC ranges over [0, 1]; as a rule of thumb, an AUC
above 0.8 indicates a good model. In this case, the AUC for Grade (Y) is determined to be 83.3%.

> as.numeric(performance(ROCRpred, "auc")@y.values)

[1] 0.8333333

Figure 39: The AUC value for our model


Comment: The AUC value is 83.3%, which means our model discriminates quite
well between low-grade and high-grade milk.

4. CONCLUSION
• In this assignment, we visualized the milk grade dataset via descriptive
statistics and graphs.
• Then, we successfully built a logistic regression model and chose the best
one by comparing the p-value of each factor and removing the factors with p > 0.05.
• Lastly, we created a prediction model to evaluate how well our model works
and assessed its ability to distinguish between the milk grade classes.

5. REFERENCES

1. Shrijayan, R. (July 2023). Milk Quality Prediction [Dataset]. Kaggle. Retrieved from https://www.kaggle.com/datasets/cpluzshrijayan/milkquality.

2. Tuan, N.V. (2015). Phân tích dữ liệu với R [Data analysis with R]. Nhà xuất bản Tổng hợp Thành phố Hồ Chí Minh.

3. Zach (29 November 2021). Understanding the Null Hypothesis for Logistic Regression. Statology. Retrieved from https://www.statology.org/null-hypothesis-of-logistic-regression/.

4. How to plot AUC ROC curve in R (15 December 2022). ProjectPro. Retrieved from https://www.projectpro.io/recipes/plot-auc-roc-curve-r?

5. Receiver operating characteristic. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/ROC.

6. Logistic regression. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/LRM.

7. Hồi quy Logistic sử dụng phần mềm R [Logistic regression with R software]. YouTube. Retrieved from https://www.youtube.com/watch?v=wywgW3SejVI&t=632s.