MAY - 2021
ACKNOWLEDGMENT
This report was possible because of the considerable support of those who motivated us throughout this valuable period, encouraged us to overcome enormous obstacles, and gave sensible advice.
First, we would like to express our sincere gratitude to our supervisors, PhD. Phan Thi Huong and MSc. Thai Ba Ngoc, whose stimulating suggestions, guidance, and determination motivated us to work harder and to complete this scientific report, especially the writing, throughout the course.
We would also like to thank the Faculty of Geology and Petroleum for giving us the precious opportunity to take this subject and broaden our academic knowledge, and our classmates for their assistance.
Finally, we are deeply grateful to all the other unnamed people who helped us in countless ways to complete this report.
Contents
ACKNOWLEDGMENT
REQUIREMENT
1 Linear Regression
1.1 Theory
1.1.1 Linear regression model
1.1.2 Hypothesis tests in simple linear regression
1.2 R Programming
3 Advertising Effectiveness
3.1 Input the data from the table
3.2 Running chisq.test() to compare the given significance level
4 Late Arrivals
4.1 Establish the following hypotheses
4.2 Input the data and create the given table
4.3 ANOVA Test
4.4 Chi-Square Test
REFERENCE
REQUIREMENT
• Each group works on the assigned topic.
• In each report, there must be student names and IDs on the cover page, the table of
contents, and the questions.
• R-Studio must be used to analyze the data set and the codes must be inside framed
environments. Detailed explanations must be provided to receive full credit.
• Deadline: 28/05/2021.
Chapter 1
Linear Regression
X 50 130 170 270 90 210 50 130 270 24 170 210 90 210 90 240 50 240
Y 15 115 215 335 95 295 55 155 295 315 175 275 75 255 115 35 275 315
1.1 Theory
1.1.1 Linear regression model
The case of simple linear regression considers a single regressor variable or predictor variable X
and a dependent or response variable Y. Suppose that the true relationship between Y and X is a
straight line and that the observation Y at each level of X is a random variable. We assume that
each observation, Y, can be described by the model:
y = β0 + β1 x + ε
where the intercept β0 and the slope β1 are unknown regression coefficients, and ε is a random
error with mean zero and (unknown) variance σ². The random errors corresponding to different
observations are also assumed to be uncorrelated random variables.
For a dataset of n observations (x1, y1), ..., (xn, yn), the sum of squares for errors is defined by:

SSE = Σ_{i=1}^{n} e_i^2 = Σ_{i=1}^{n} [y_i − (β̂_0 + β̂_1 x_i)]^2
The least-squares method finds the estimates β̂_0 and β̂_1 by minimizing SSE; these are called
the least squares estimates:

β̂_1 = S_xy / S_xx = [Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)/n] / [Σ_{i=1}^{n} x_i^2 − (Σ_{i=1}^{n} x_i)^2/n]   and   β̂_0 = ȳ − β̂_1 x̄
The coefficient of determination is the proportion of the variation in the response variable that
is explained by the different values of the independent variable, compared to the total variation:

R² = SSR/SST = β̂_1 S_xy / S_yy

Note: a value of R² near 1 indicates that most of the variation in the response data is explained
by the different values of the independent variable; in other words, the linear regression model
explains the relationship between y and x well.
Probability and Statistics Project 1 - Topic 6
Step 5: Conclusion.
The P-value is the smallest significance level α at which the null hypothesis can be rejected;
because of this, it is also referred to as the observed significance level (OSL) of the data.
If the P-value is smaller than α, we reject H0 and have enough evidence to support H1.
1.2 R Programming
Step 1: Import data (the predictor variable X and the response variable Y).
In RStudio, we use the command c() to import the raw data and the command matrix() to
visualize the variables X and Y. We can then easily inspect the numbers and check whether
the variables were imported correctly.
We use data.frame() to format the table of X and Y, compute the totals of X and Y with
sum(), and display the values with print(). We also apply summary() to compute descriptive
parameters for each variable.
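As a sketch of this step, using the X and Y values from the data table at the start of this chapter, the import and summary commands could look like:

```r
# Import the raw data with c() (values from the data table above)
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)

# Visualize both variables as a 2 x 18 matrix to check the import
print(matrix(c(x, y), nrow = 2, byrow = TRUE))

# Format the variables as a table and compute their totals
df <- data.frame(X = x, Y = y)
print(sum(df$X))   # total of X
print(sum(df$Y))   # total of Y

# Descriptive parameters (min, quartiles, median, mean, max) for each variable
summary(df)
```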
7
Probablity and Statistics Project 1 - Topic 6
These are important parameters for building the linear regression model demonstrated below
in R. We use lm() to fit the linear regression model, and summary() to display the estimates
of β0 and β1 together with the residuals.
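A minimal sketch of the model fit (with x and y as imported in Step 1):

```r
# x and y as imported in Step 1
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)

# Fit the simple linear regression y = beta0 + beta1 * x
model <- lm(y ~ x)

# Coefficient estimates, residual quartiles, R-squared, and p-values
summary(model)
```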
• Residuals
The residuals are the differences between the actual and predicted values. The median residual
is 0.132, which is close to the expected value of 0.
The first quartile is -30.495, the median of the lower half of the residuals: about 25% of the
residuals lie below -30.495 and about 75% lie above. The third quartile is 44.307, the median
of the upper half: about 75% of the residuals lie below 44.307 and about 25% lie above.
In conclusion, the distribution of the residuals is not quite symmetrical and is somewhat
right-skewed.
• Coefficients
The slope is defined as β̂_1 = S_xy / S_xx = [Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)/n] / [Σ_{i=1}^{n} x_i^2 − (Σ_{i=1}^{n} x_i)^2/n].
The p-value is 0.002682, which is smaller than the default significance level α = 0.05;
therefore, we have enough evidence to conclude that there is a relationship between the
variables X and Y.
• Coefficient of determination
R² is equal to 0.4052, which means that approximately 40.52% of the variance in the
response variable Y is explained by the predictor variable X.
Furthermore, par() and dev.off() are used to arrange the graphs produced by plot(),
which show the relationship between X and Y.
• Residuals vs Fitted Graph
The graph shows that the red line is approximately horizontal and that the average magnitude
of the residuals does not change much as a function of the fitted values. This suggests the
linear regression model is reasonably good.
• Scale-Location Graph
The Scale-Location plot shows whether the residuals are spread equally along the range of the
predictor. The line is roughly horizontal with randomly spread points, so the model is
reasonably good.
• Normal Q-Q Graph
A Q-Q plot plots the quantiles of two distributions against each other, or a plot based on
estimates of the quantiles; the pattern of points is used to compare the two distributions.
Here the points fall roughly along a straight line, so the residuals appear to follow a
Normal distribution.
• Residuals vs Leverage Graph
The Residuals vs Leverage plot is used to identify influential cases: extreme values might
change the regression results when included in or excluded from the analysis. In this case,
two points lie outside the dashed Cook's distance line and could therefore influence the
regression results.
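The four diagnostic plots discussed above can be produced in a single figure; a sketch (the pdf() call simply opens a graphics device for non-interactive use, which RStudio normally does automatically):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)

pdf(tempfile(fileext = ".pdf"))  # open a graphics device for headless use
par(mfrow = c(2, 2))             # arrange the four plots in a 2 x 2 grid
plot(model)                      # Residuals vs Fitted, Normal Q-Q,
                                 # Scale-Location, Residuals vs Leverage
dev.off()                        # close/reset the graphics device
```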
Also, we use predict() to compare the fitted values from the linear regression model with the
observed response variable Y. More specifically, we can determine the 95% prediction interval
from the lwr and upr bounds.
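A sketch of the prediction-interval computation:

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)

# Fitted values with 95% prediction intervals: columns fit, lwr, upr
pred <- predict(model, interval = "prediction", level = 0.95)
head(pred)
```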
First, we assign the intercept, the slope, and the coefficient of determination to Beta 0,
Beta 1, and R2 respectively with list(). We then create the linear regression equation with
substitute(). Finally, function() is used to combine the conditions above so the regression
equation can be added to the plot.
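A sketch of the equation-label helper described above (the helper name lm_eqn is illustrative, not from the original code):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)

# Build a plotmath string for the regression equation and R-squared
lm_eqn <- function(m) {
  vals <- list(b0 = format(unname(coef(m)[1]), digits = 3),
               b1 = format(unname(coef(m)[2]), digits = 3),
               r2 = format(summary(m)$r.squared, digits = 3))
  eq <- substitute(italic(y) == b0 + b1 %.% italic(x) * "," ~ italic(R)^2 ~ "=" ~ r2, vals)
  as.character(as.expression(eq))   # usable by geom_text(..., parse = TRUE)
}
lm_eqn(model)
```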
We load the ggplot2 package with library() to sketch the diagram of the relationship between
x and y. With ggplot() we put the data on the diagram and then draw the regression line with
geom_smooth(). Furthermore, geom_line() is used to draw the boundary of the prediction
interval (the blue dashed lines).
For more detail, we apply labs() to add a title and axis names, and geom_text() for the
equation constructed in the previous step. Other commands include geom_rect(), which draws a
frame around the regression equation, and theme(), which customizes the background, colors,
and size of the graph.
The grey band indicates the 95% confidence interval, while the blue dashed lines mark the 95%
prediction interval for the future relationship between x and y.
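A sketch of the described figure, assuming ggplot2 is installed (aesthetic details are illustrative):

```r
library(ggplot2)

x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 24, 170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315, 175, 275, 75, 255, 115, 35, 275, 315)
df <- data.frame(x = x, y = y)
model <- lm(y ~ x, data = df)
pred <- as.data.frame(predict(model, interval = "prediction"))

p <- ggplot(cbind(df, pred), aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +                                     # line + 95% confidence band
  geom_line(aes(y = lwr), linetype = "dashed", colour = "blue") +  # lower prediction bound
  geom_line(aes(y = upr), linetype = "dashed", colour = "blue") +  # upper prediction bound
  labs(title = "Simple linear regression of Y on X", x = "X", y = "Y")
p
```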
Chapter 2
A researcher wanted to examine the response times of men and women to different types of
signals. The subjects were asked to press the ENTER key on the computer keyboard as soon as
they recognized the signal. The duration (in seconds) between the time the signal was emitted
and the time the subject hit the key was recorded. Here are the results for 15 men and 15
women.
GENDER SOUND LIGHT PULSE
MALE 10.0 6.0 9.1
7.2 3.7 5.8
6.8 5.1 6.0
6.0 4.0 4.0
5.0 3.2 5.2
FEMALE 10.5 6.6 7.3
8.8 4.9 6.1
9.2 2.5 5.2
8.1 4.2 2.5
13.4 1.8 3.9
Draw a conclusion at the significance level of α = 5%. Do the factors gender and signal interact?
Solution: Establish the following hypotheses for each signal
Value (from the gl() documentation):
The result has levels from 1 to n, with each value replicated in groups of length k, out to a
total length of length.
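For this dataset, the gender factor could be built with gl() as follows:

```r
# Two levels (MALE, FEMALE), each replicated in one group of 15, total length 30
gender <- gl(n = 2, k = 15, length = 30, labels = c("MALE", "FEMALE"))
table(gender)   # 15 observations per level
```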
The package dplyr provides functions to convert and manipulate data after it has been loaded.
With group_by() chained with summarize(), initial statistics of the dataset can be computed
and presented in a table.
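A sketch using the sound-response times from the table, assuming dplyr is installed (the summary column names are illustrative):

```r
library(dplyr)

# Sound-response times from the table (5 men, 5 women)
dat <- data.frame(gender = rep(c("MALE", "FEMALE"), each = 5),
                  sound  = c(10.0, 7.2, 6.8, 6.0, 5.0,
                             10.5, 8.8, 9.2, 8.1, 13.4))

# Initial statistics per gender
stats <- dat %>%
  group_by(gender) %>%
  summarize(mean = mean(sound), sd = sd(sound), n = n())
stats
```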
The null hypothesis of this test is that the data are normally distributed. Since the p-values
are greater than 0.05, the null hypothesis is not rejected.
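The text does not name the test, but a normality check of this kind is presumably shapiro.test() (Shapiro-Wilk); a sketch on the male sound-response times:

```r
sound_male <- c(10.0, 7.2, 6.8, 6.0, 5.0)

# H0: the data are normally distributed; reject if p-value < 0.05
shapiro.test(sound_male)
```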
2.3 F-test
This performs an F test to compare the variances of two samples from normal populations. All
p-values of the F-tests are greater than α = 0.05; therefore, equality of variances is assumed.
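A sketch for the sound signal, comparing male and female variances with var.test():

```r
sound_male   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
sound_female <- c(10.5, 8.8, 9.2, 8.1, 13.4)

# H0: the two population variances are equal
var.test(sound_male, sound_female)
```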
2.4 T-test
Performs one and two sample t-tests on vectors of data.
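A sketch for the sound signal, with equal variances assumed as justified by the F-test above:

```r
sound_male   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
sound_female <- c(10.5, 8.8, 9.2, 8.1, 13.4)

# H0: equal mean response times for men and women
t.test(sound_male, sound_female, var.equal = TRUE)   # p-value ≈ 0.044
```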
Conclusion
From the tests above, the p-values for the light response (p = 0.6985) and the pulse response
(p = 0.4262) are well above the significance level α = 0.05, so we fail to reject the null
hypothesis for these two signals: there is no significant difference between men and women in
response time to light or pulse signals. For the sound signal, the p-value (p = 0.04403) is
below α = 0.05, so we reject the null hypothesis: response time to sound differs significantly
between men and women.
Chapter 3
Advertising Effectiveness
To study whether the size of a company affects the advertising effectiveness, a survey of 356
customers’ opinions was collected and the following table was obtained.
At the significance of α = 0.1, is there enough evidence to conclude that the company size affects
the advertising effectiveness?
Solution:
To solve the problem, first establish the hypotheses:
Description: The function data.frame() creates data frames, tightly coupled collections of
variables which share many of the properties of matrices and of lists, used as the fundamental
data structure by most of R's modeling software.
We create row names for the table with row.names(), which gets and sets the row names of a
data frame, using the company sizes, and then use a function to export the table.
The Chi-square statistic measures how a model compares to actual observed data: it quantifies
the discrepancies between the expected and actual results, given the sample size and the number
of variables in the relationship. The function chisq.test() performs chi-squared contingency
table tests and goodness-of-fit tests, so it is well suited to this problem.
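The survey table itself did not carry over from the original report, so the counts below are placeholder values (summing to 356 customers) purely to illustrate the workflow:

```r
# Placeholder counts: rows = company size, columns = advertising effectiveness
ads <- data.frame(High = c(20, 50, 40), Medium = c(30, 60, 35), Low = c(40, 45, 36))
row.names(ads) <- c("Small", "Midsize", "Large")

# H0: company size and advertising effectiveness are independent
chisq.test(ads)
```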
Conclusion
Using the chi-square test, we obtained p-value = 0.01, which is lower than α = 0.1. The null
hypothesis H0 is therefore rejected: company size and advertising effectiveness are
statistically significantly associated.
Chapter 4
Late Arrivals
The number of students arriving late at five high schools on different days of the week is
given in the following table:
Is there any significant difference in the number of late arrivals among the different days of
the week at the significance level α = 5%?
The hypothesis we want to test is whether H1 is "likely" true. So, there are two possible
outcomes:
• H0 : There is no difference in the number of late arrivals among different days of the week.
• H1 : There is a difference in the number of late arrivals among different days of the week.
cbind: combines (generalized) vectors or matrices by columns. These can be given as named
arguments. Other R objects may be coerced as appropriate, or S4 methods may be used: see the
'Details' and 'Value' sections. (For the data.frame method of cbind, these can be further
arguments to data.frame, such as stringsAsFactors.)
Stack: Stacking vectors concatenates multiple vectors into a single vector along with a factor
indicating where each observation originated. Unstacking reverses this operation.
After the ANOVA test runs, the summary() function creates the table and outputs the data shown in the picture.
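The late-arrival counts did not carry over from the original table, so the numbers below are placeholders (5 schools × 5 weekdays) used only to sketch the workflow:

```r
# Placeholder counts: rows = schools, columns = days of the week
late <- data.frame(Mon = c(5, 4, 6, 7, 5), Tue = c(4, 5, 5, 6, 4),
                   Wed = c(6, 5, 4, 5, 6), Thu = c(5, 6, 5, 4, 5),
                   Fri = c(7, 6, 6, 5, 6))

# stack() turns the table into (values, ind) pairs, where ind is the day of the week
stacked <- stack(late)

# One-way ANOVA: H0 = the mean number of late arrivals is the same on every day
fit <- aov(values ~ ind, data = stacked)
summary(fit)
```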
Conclusion:
With p-value > α = 0.05, we fail to reject the null hypothesis H0. It can be concluded that
there is no significant difference in the number of late arrivals among the different days of
the week.
The chi-square test gives p-value = 0.9472, which is still larger than α = 0.05 and consistent
with the ANOVA result. Again, we fail to reject the null hypothesis H0: there is no significant
difference in the number of late arrivals among the different days of the week.
REFERENCE
1. RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC,
Boston, MA. http://www.rstudio.com/
2. Montgomery, D. C., & Runger, G. C. (2007). Applied Statistics and Probability for Engineers.
Hoboken, NJ: Wiley.
3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical
Learning: with Applications in R. New York: Springer.