
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

PROBABILITY AND STATISTICS


PROJECT 1 - TOPIC 6

Lecturers: Dr. Phan Thi Huong


MSc. Thai Ba Ngoc

Group members: Truong Nguyen Cao Ngoc Hai (L) 1852350


Nguyen Hoang Thao Ly 1852561
Nguyen Ngoc Thanh Tin 1953023
Tran Kim Ngan 1852606

MAY - 2021
ACKNOWLEDGMENT
This report was made possible by the considerable support of those who motivated us
throughout this valuable period, encouraged us to overcome enormous obstacles, and gave
us sensible advice.

First, we would like to express our sincere gratitude to our supervisors, Dr. Phan Thi
Huong and MSc. Thai Ba Ngoc, whose stimulating suggestions, guidance, and determination
motivated us to work harder and complete this scientific report, especially the writing,
throughout the course.

We would also like to thank the Faculty of Geology and Petroleum for giving us the precious
opportunity to take this subject and broaden our academic knowledge, and our classmates
for their assistance.

Finally, we sincerely thank all the unnamed people who helped us in countless ways to
complete this report.
Contents

ACKNOWLEDGMENT

REQUIREMENT

1 Linear Regression
  1.1 Theory
    1.1.1 Linear regression model
    1.1.2 Hypothesis tests in simple linear regression
  1.2 R Programming

2 Signal Response Time
  2.1 Input data
  2.2 Shapiro-Wilk Test
  2.3 F-test
  2.4 T-test

3 Advertising Effectiveness
  3.1 Input the data from the table
  3.2 Running chisq.test() to compare with the given significance level

4 Late Arrivals
  4.1 Establish the following hypotheses
  4.2 Input the data and create the given table
  4.3 Anova Test
  4.4 Chi Square Test

REFERENCE

REQUIREMENT
• Each group works on the assigned topic.

• In each report, there must be student names and IDs on the cover page, the table of
content, and the questions.

• R-Studio must be used to analyze the data set and the codes must be inside framed
environments. Detailed explanations must be provided to receive full credit.

• Deadline: 28/05/2021.
Chapter 1

Linear Regression

Using linear regression, model the following data

X  50  130  170  270  90  210  50  130  270  240  170  210  90  210  90  240  50  240
Y  15  115  215  335  95  295  55  155  295  315  175  275  75  255  115  35  275  315

1.1 Theory
1.1.1 Linear regression model
The case of simple linear regression considers a single regressor variable or predictor variable X
and a dependent or response variable Y. Suppose that the true relationship between Y and X is a
straight line and that the observation Y at each level of X is a random variable. We assume that
each observation, Y, can be described by the model:
y = β0 + β1 x + ε

Where the intercept β0 and the slope β1 are unknown regression coefficients, and ε is a random
error with mean zero and (unknown) variance σ 2 . The random errors corresponding to different
observations are also assumed to be uncorrelated random variables.

For a dataset of n observations $(x_1, y_1), \ldots, (x_n, y_n)$, the sum of squared errors is defined by:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$$

The least-squares method finds the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ by minimizing SSE. Those
estimates are called the least squares estimates:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
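As a sketch, these formulas can be evaluated directly in R for this chapter's data. (We take the tenth X value as 240 rather than the printed 24, since that is the only value consistent with the totals Σx = 2910 and Σx² = 574300 reported later in the section.)

```r
# Least-squares estimates computed directly from the definitions above.
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
n   <- length(x)
Sxy <- sum(x * y) - sum(x) * sum(y) / n   # corrected sum of cross products
Sxx <- sum(x^2)   - sum(x)^2 / n          # corrected sum of squares of x
b1  <- Sxy / Sxx                          # slope estimate, about 0.9241
b0  <- mean(y) - b1 * mean(x)             # intercept estimate, about 40.05
```

Running this reproduces the slope and intercept that lm() reports later in the chapter.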

The coefficient of determination is the proportion of the variation in the response variable that
is explained by the different values of the independent variable, relative to the total variation:

$$R^2 = \frac{SSR}{SST} = \frac{\hat{\beta}_1 S_{xy}}{S_{yy}}$$

Note: a value of R2 near 1 indicates that most of the variation in the response data is explained by
the different values of the independent variable. In other words, the linear regression model explains
the relationship between y and x well.

Probability and Statistics Project 1 - Topic 6

1.1.2 Hypothesis tests in simple linear regression


Step 1: State the hypotheses H0 and H1
• The null hypothesis denoted by H0 is the claim that is initially assumed to be true.

• The alternative hypothesis denoted by H1 is the assertion that is contradictory to H0 .

Step 2: State the confidence level (1 − α).


Step 3: Compute the test statistic.
$$T_{\beta_0} = \frac{\hat{\beta}_0 - b_0}{SE(\hat{\beta}_0)} \sim t(n-2), \qquad \text{where } SE(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}$$

Step 4: Determine the rejection region or compute the p-value.


Alternative hypothesis     Rejection region                 P-value
H1: β1 ≠ b1                |t_{β1}| > t_{α/2}^{(n−2)}       p = 2P(T_{n−2} ≥ |t_{β1}|)
H1: β1 > b1                t_{β1} > t_{α}^{(n−2)}           p = P(T_{n−2} ≥ t_{β1})
H1: β1 < b1                t_{β1} < −t_{α}^{(n−2)}          p = P(T_{n−2} ≤ t_{β1})

Step 5: Conclusion.
The P-value is the smallest significance level α at which the null hypothesis can be rejected. Because
of this, the P-value is alternatively referred to as the observed significance level (OSL) for the data.
If the P-value is smaller than α, we reject H0 and have enough evidence to support H1.
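As an illustration of the five steps, here is a sketch of the t-test for the slope of this chapter's model (H0: β1 = 0 against H1: β1 ≠ 0), computed by hand from the definitions above:

```r
# t-test for the slope, H0: beta1 = 0 vs H1: beta1 != 0, at alpha = 0.05.
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
n   <- length(x)
Sxx <- sum(x^2)   - sum(x)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
Syy <- sum(y^2)   - sum(y)^2 / n
b1  <- Sxy / Sxx
SSE <- Syy - b1 * Sxy                  # residual sum of squares
s   <- sqrt(SSE / (n - 2))             # residual standard error
t1  <- b1 / (s / sqrt(Sxx))            # test statistic ~ t(n - 2)
p   <- 2 * pt(-abs(t1), df = n - 2)    # two-sided p-value
```

The resulting p-value matches the slope p-value reported by summary(lm(y ~ x)) in the next section, so we reject H0 at α = 0.05.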

1.2 R Programming
Step 1: Import data (the predictor variable X and response variable Y)

In RStudio, we use the function c() to import the raw data and matrix() to display the variables X and Y.

We can then easily observe the numbers and check whether the variables were imported correctly.

Step 2: Data Visualization

We use data.frame() to format the table of variables X and Y, then compute the totals of X and Y
with sum() and display the values with print(). We also apply summary() to compute descriptive
statistics for each variable.


These are important parameters for fitting the linear regression model demonstrated below in R.

$$\sum_{i=1}^{18} x_i = 2910, \quad \sum_{i=1}^{18} y_i = 3410, \quad \sum_{i=1}^{18} x_i^2 = 574300, \quad \sum_{i=1}^{18} x_i y_i = 647250, \quad \bar{x} = 161.7, \quad \bar{y} = 189.4$$

Step 3: Fitting a linear regression model

We use lm() to fit the linear regression model, and summary() to display the estimates of β0
and β1 along with the residuals.
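A minimal sketch of this step (x and y hold the data from the problem table, with the tenth X value taken as 240, consistent with the reported totals):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)  # fit y = beta0 + beta1*x + eps by least squares
summary(model)      # residual quartiles, coefficients, R-squared, p-values
```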

The output printed on the console can be interpreted as follows.

• Residuals

The residuals are the differences between the actual values and the fitted values. Their median
is 0.132, which is close to the expected value of 0.

The first quartile is -30.495 which is the median of the lower half of the data set. This means
that about 25% of the numbers in the data set lie below -30.495 and about 75% lie above.

The third quartile is 44.307 which is the median of the upper half of the data set. This means
that about 75% of the numbers in the data set lie below 44.307 and about 25% lie above.

In conclusion, the distribution of the residuals is not perfectly symmetric and is slightly right-skewed.


• Coefficients

The slope is defined as

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

The intercept is defined as $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Therefore, the fitted linear regression model is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 40.05 + 0.9241x$


• Residual standard error

The R console reports a residual standard error of s = 83.95 on 16 degrees of freedom.


• P-value

The p-value is 0.002682, which is smaller than the default significance level α = 0.05.
Therefore, we have enough evidence to conclude that there is a relationship between the
variables X and Y.
• Coefficient of determination

The adjusted R2 is 0.4052, which means that approximately 40.52% of the variance in the
response variable Y is explained by the predictor variable X.

Furthermore, par() and dev.off() are used to arrange the diagnostic graphs produced by plot(),
which illustrate the quality of the fit between X and Y.
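The plotting commands above can be sketched as follows (x and y as before):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model)           # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
dev.off()             # close the graphics device and reset the layout
```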
• Residuals vs Fitted Graph

The graph shows that the red line is approximately horizontal and that the magnitude of the
standardized residuals does not change much as a function of the fitted values. This suggests
the linear regression model is reasonably good.
• Scale-Location Graph

The Scale-Location plot shows whether the residuals are spread equally along the range of the
predictor. The line is roughly horizontal and the points are randomly spread, so the model
is adequate.


• Normal Q-Q Graph

A Q–Q plot plots the quantiles of two distributions against each other, or a plot based on
estimates of the quantiles; the pattern of points is used to compare the two distributions.
Here the points fall approximately along a straight line, suggesting that the residuals follow
a Normal distribution.

• Residuals vs Leverage Graph

The Residuals vs Leverage plot is used to identify influential cases, i.e., extreme values that
might change the regression results when included in or excluded from the analysis. In this case,
there are two points beyond the dashed Cook's distance lines that could influence the regression
results.

We also use predict() to compare the fitted values from the linear regression model with the
observed responses y. In particular, we can obtain the 95% prediction interval bounds lwr
and upr.
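A sketch of the predict() call (same x and y as before; R warns that prediction intervals on the training data refer to future responses, which is expected here):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)
# fitted values with 95% prediction bounds (columns fit, lwr, upr)
pred <- suppressWarnings(predict(model, interval = "prediction", level = 0.95))
head(pred)
```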

Step 4: Diagram of a linear regression model


We construct the text of the linear regression equation so that it can be displayed next to the regression line on the graph.

First, we store the intercept, the slope, and the coefficient of determination as Beta 0, Beta 1,
and R2 with list(). We then build the regression-equation label with substitute(). Last but not
least, function() is used to combine these steps into a helper that adds the regression equation
to the plot.

We load the ggplot2 package with library() to sketch the diagram of the relationship between
x and y. With ggplot(), we put the data on the diagram and then draw the regression line with
geom_smooth(). Furthermore, geom_line() is used to draw the boundaries of the prediction
interval as blue dashed lines.
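The ggplot2 step can be sketched as follows (requires the ggplot2 package; the variable names are our own, not the report's):

```r
library(ggplot2)

x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
df  <- data.frame(x, y)
fit <- lm(y ~ x, data = df)
# attach the 95% prediction bounds (columns fit, lwr, upr) to the data frame
df  <- cbind(df, suppressWarnings(
  as.data.frame(predict(fit, interval = "prediction"))))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +                 # line + 95% CI band
  geom_line(aes(y = lwr), linetype = "dashed", colour = "blue") +
  geom_line(aes(y = upr), linetype = "dashed", colour = "blue") +
  labs(title = "Linear regression of Y on X", x = "X", y = "Y")
print(p)
```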


For more detail on the diagram, we apply labs() to add a title and axis names, and geom_text()
to display the equation constructed at the start of step 4. Other functions include geom_rect(),
which draws a frame around the regression equation, and theme(), which customizes the background,
colors, and sizes of the graph.

The grey band indicates the 95% confidence interval for the regression line, while the blue dashed
lines mark the 95% prediction interval for future observations of y given x.

Chapter 2

Signal Response Time

A researcher wanted to examine the response times of men and women to different types of
signals. The subjects were asked to press the ENTER key on the computer keyboard as soon
as they recognized the signal. The duration (measured in seconds) between the time the signal
was emitted and the time the subject hit the button was recorded. Here are the results for 15 men
and 15 women.
GENDER SOUND LIGHT PULSE
MALE 10.0 6.0 9.1
7.2 3.7 5.8
6.8 5.1 6.0
6.0 4.0 4.0
5.0 3.2 5.2
FEMALE 10.5 6.6 7.3
8.8 4.9 6.1
9.2 2.5 5.2
8.1 4.2 2.5
13.4 1.8 3.9

Draw a conclusion at the significance level of α = 5%. Do the factors gender and signal interact?
Solution: Establish the following hypotheses for each signal:

• H0 : Gender plays no role in the response to that signal.

• H1 : Gender plays a role in the response to that signal.


2.1 Input data


Enter the data for males and females and the signals they responded to as a matrix, then build
a data frame; RStudio will then display a table like the one in the problem statement.

Usage: gl(n, k, length = n*k, labels = seq_len(n))

Arguments:
  n       an integer giving the number of levels.
  k       an integer giving the number of replications.
  labels  an optional vector of labels for the resulting factor levels.

Value:
The result has levels from 1 to n, with each value replicated in groups of length k, out to a
total length of length.

The dplyr package provides functions to convert and manipulate data after it has been loaded.
By chaining group_by() with summarise(), initial statistics of the dataset can be computed
and presented in a table.

The table is re-represented below:

Group    Count   Sound           Light           Pulse
                 Mean     SD     Mean     SD     Mean     SD
Male       5      7.0    1.88     4.4    1.13     6.0    1.90
Female     5     10.0    2.09     4.0    1.92     5.0    1.87
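The steps above can be sketched as follows (requires the dplyr package; variable names are ours):

```r
library(dplyr)

# Response times by signal: 5 male subjects followed by 5 female subjects
sound  <- c(10.0, 7.2, 6.8, 6.0, 5.0, 10.5, 8.8, 9.2, 8.1, 13.4)
light  <- c(6.0, 3.7, 5.1, 4.0, 3.2, 6.6, 4.9, 2.5, 4.2, 1.8)
pulse  <- c(9.1, 5.8, 6.0, 4.0, 5.2, 7.3, 6.1, 5.2, 2.5, 3.9)
gender <- gl(2, 5, labels = c("MALE", "FEMALE"))  # factor: 5 males, then 5 females
dat <- data.frame(gender, sound, light, pulse)

smry <- dat %>%
  group_by(gender) %>%
  summarise(count = n(),
            sound_mean = mean(sound), sound_sd = sd(sound),
            light_mean = mean(light), light_sd = sd(light),
            pulse_mean = mean(pulse), pulse_sd = sd(pulse))
print(smry)
```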

2.2 Shapiro-Wilk Test


We perform the Shapiro-Wilk test of normality using shapiro.test() nested inside with().


The null hypothesis of this test is that the data are normally distributed. All p-values are
greater than 0.05, so the null hypothesis is not rejected and normality can be assumed.
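A sketch for one signal (the sound column; the other signals are tested the same way):

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: the sample comes from a normal distribution
shapiro.test(male_sound)
shapiro.test(female_sound)
```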

2.3 F-test
We perform an F-test to compare the variances of two samples from normal populations. All
p-values of the F-tests are greater than α = 0.05; therefore, equality of variances can be assumed.
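A sketch for the sound signal using var.test(), R's F-test for comparing two variances:

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: the two population variances are equal (ratio of variances = 1)
var.test(male_sound, female_sound)
```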


2.4 T-test
We perform two-sample t-tests on the vectors of data using t.test().
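A sketch for the sound signal (a pooled two-sample t-test, since the F-test accepted equal variances):

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: equal mean response times for men and women; equal variances assumed
t.test(male_sound, female_sound, var.equal = TRUE)
```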


Conclusion
From the tests above, the p-values for the light response (p = 0.6985) and the pulse response
(p = 0.4262) are well above the significance level α = 0.05. Therefore, we fail to reject the null
hypothesis for these two signals: there is no evidence that gender plays a role in the response
times to light and pulse signals.

Meanwhile, the p-value for sound (p = 0.04403) is lower than α = 0.05, so we reject the null
hypothesis for this category: gender does appear to play a role in the response time to sound
signals.

Chapter 3

Advertising Effectiveness

To study whether the size of a company affects its advertising effectiveness, a survey of 356
customers' opinions was collected and the following table was obtained.

Company size category    Advertising effectiveness
                         High    Moderate    Low
Small                     20        52        32
Medium                    53        47        28
Large                     67        32        25

At the significance of α = 0.1, is there enough evidence to conclude that the company size affects
the advertising effectiveness?

Solution:
To solve the problem, first establish the hypotheses:

• H0 : The size of the company does not relate to effectiveness in advertising.

• H1 : The size of the company relates to effectiveness in advertising.

3.1 Input the data from the table


We input the dataset using the vector function c() and data.frame(), which creates a data frame
of the advertising-effectiveness counts.

Description: The function data.frame() creates data frames, tightly coupled collections of vari-
ables which share many of the properties of matrices and of lists, used as the fundamental data

structure by most of R’s modeling software.

We create row names for the table with row.names(), which gets and sets the row names of a
data frame, using the company sizes, and then print the table.

3.2 Running chisq.test() to compare with the given significance level
Application of the chi-square test:

The chi-square statistic measures how well a model matches the observed data: it compares the
size of any discrepancies between the expected results and the actual results, given the size of
the sample and the number of variables in the relationship. The function chisq.test() performs
chi-squared contingency-table tests and goodness-of-fit tests, so it is well suited to this problem.

We use chisq.test() to find the p-value and compare it with the given significance level.
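A sketch of this step on the survey table:

```r
# Chi-square test of independence: company size vs advertising effectiveness
tbl <- matrix(c(20, 52, 32,
                53, 47, 28,
                67, 32, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Small", "Medium", "Large"),
                              c("High", "Moderate", "Low")))
res <- chisq.test(tbl)
res  # X-squared ~ 29.64 on 4 degrees of freedom
```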

Conclusion
Using the chi-square test, we obtained:

χ2 = 29.638, degrees of freedom (df) = 4, p-value ≈ 5.8 × 10−6.

Since the p-value is far below α = 0.1, the null hypothesis H0 is rejected: the size of a company
and its advertising effectiveness are statistically significantly associated.

Chapter 4

Late Arrivals

The numbers of students arriving late at four high schools on different days of the week are
given in the following table:

Day of week    High school
               A    B    C    D
Monday         5    4    5    7
Tuesday        4    5    3    2
Wednesday      4    3    4    5
Thursday       4    4    3    2

Is there any significant difference in the number of late arrivals among the different days of the
week at the significance level α = 5%?

Theory - Hypothesis Test


A statistical hypothesis is an assertion or conjecture concerning one or more populations. To prove
that a hypothesis is true, or false, with absolute certainty, we would need absolute knowledge. That
is, we would have to examine the entire population. Instead, hypothesis testing concerns how to
use a random sample to judge if it is evidence that supports or not the hypothesis. Hypothesis
testing is formulated in terms of two hypotheses:

• H0 : The null hypothesis.

• H1 : The alternate hypothesis.

The hypothesis we want to test is if H1 is “likely” true. So, there are two possible outcomes:

• Reject H0 and accept H1 because of sufficient evidence in the sample in favor of H1 .

• Do not reject H0 because of insufficient evidence to support H1 .


4.1 Establish the following hypotheses


• H0 : There is no difference in the number of late arrivals among different days of the week.

• H1 : There is a difference in the number of late arrivals among different days of the week.

4.2 Input the data and create the given table


The data is entered in matrix style according to the given dataset in R.
We create a matrix by combining the columns, convert it with the data.frame() function, and
stack the data vertically:


cbind(): combines (generalized) vectors or matrices column by column. These can be given as
named arguments; other R objects may be coerced as appropriate, or S4 methods may be used
(see the 'Details' and 'Value' sections of the documentation). For the data.frame method of
cbind, further arguments such as stringsAsFactors can be passed on to data.frame.

stack(): stacking vectors concatenates multiple vectors into a single vector, along with a factor
indicating where each observation originated. Unstacking reverses this operation.
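A sketch of the data-building step (variable names are ours):

```r
# One column per school, one row per day (Monday through Thursday)
A <- c(5, 4, 4, 4); B <- c(4, 5, 3, 4)
C <- c(5, 3, 4, 3); D <- c(7, 2, 5, 2)
wide <- data.frame(A, B, C, D,
                   row.names = c("Monday", "Tuesday", "Wednesday", "Thursday"))
long <- stack(wide)                       # columns: values, ind (the school)
long$day <- gl(4, 1, length = 16,
               labels = rownames(wide))   # day repeats down each stacked column
```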

4.3 Anova Test


Compute analysis of variance (or deviance) tables for one or more fitted model objects.

Usage: anova(object, ...)

Arguments:
  object  an object containing the results returned by a model-fitting function (e.g., lm or glm).
  ...     additional objects of the same type.

After the ANOVA test runs, the summary() function produces the table shown in the output.
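A sketch of the one-way ANOVA across days (a compact equivalent of the stacked data frame above):

```r
# Does the mean late-arrival count differ across days of the week?
counts <- c(5, 4, 5, 7,   # Monday   (schools A-D)
            4, 5, 3, 2,   # Tuesday
            4, 3, 4, 5,   # Wednesday
            4, 4, 3, 2)   # Thursday
day <- gl(4, 4, labels = c("Mon", "Tue", "Wed", "Thu"))
fit <- aov(counts ~ day)
summary(fit)  # F statistic on 3 and 12 degrees of freedom; p-value > 0.05
```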

Conclusion:
With a p-value > α = 0.05, we fail to reject the null hypothesis H0 . We conclude that there is
no significant difference in the number of late arrivals among the different days of the week.


4.4 Chi Square Test


A chi-square test can also be applied to this dataset. However, the R console warns that the
χ2 approximation may be inaccurate: some of the expected counts are small (below 5), so the
p-value approximation could be poor. The test nevertheless gives χ2 = 3.5458 and p-value =
0.9387 > α = 0.05, which supports the null hypothesis H0 .

A second evaluation gives p-value = 0.9472, still much larger than α = 0.05 and leading to the
same conclusion as the ANOVA test.
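A sketch of the chi-square test on the day-by-school table (the warning about small expected counts is suppressed here, as discussed above):

```r
tbl <- matrix(c(5, 4, 5, 7,
                4, 5, 3, 2,
                4, 3, 4, 5,
                4, 4, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Mon", "Tue", "Wed", "Thu"),
                              c("A", "B", "C", "D")))
# Expected counts fall below 5, so R warns the approximation may be inaccurate
res <- suppressWarnings(chisq.test(tbl))
res  # X-squared ~ 3.55 on 9 degrees of freedom
```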

Conclusion
From both tests, with all p-values > α = 0.05, we fail to reject the null hypothesis H0 . We
conclude that there is no significant difference in the number of late arrivals among the
different days of the week.

REFERENCE
1. RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC,
Boston, MA. URL: http://www.rstudio.com/

2. Montgomery, D. C., & Runger, G. C. (2007). Applied Statistics and Probability for Engineers.
Hoboken, NJ: Wiley.

3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical
Learning: with Applications in R. New York: Springer.
