
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

PROBABILITY AND STATISTICS


PROJECT 1 - TOPIC 6

Lecturers: Dr. Phan Thi Huong


MSc. Thai Ba Ngoc

Group members: Truong Nguyen Cao Ngoc Hai (L) 1852350


Nguyen Hoang Thao Ly 1852561
Nguyen Ngoc Thanh Tin 1953023
Tran Kim Ngan 1852606

MAY - 2021
ACKNOWLEDGMENT
This report was made possible by the considerable support of those who motivated us
throughout this valuable period, encouraged us to overcome enormous obstacles, and gave
us sensible advice.

First, we would like to express our sincere gratitude to our supervisors, Dr. Phan Thi
Huong and MSc. Thai Ba Ngoc, whose stimulating suggestions, guidance, and determination
motivated us to work harder and complete this scientific report, especially the writing,
throughout the course.

We would also like to thank the Faculty of Geology and Petroleum for giving us the precious
opportunity to take this subject and broaden our academic knowledge, and our classmates
for their assistance.

Finally, we sincerely thank all the unnamed people who helped us in countless ways to
complete this report.
Contents

ACKNOWLEDGMENT

REQUIREMENT

1 Linear Regression
  1.1 Theory
    1.1.1 Linear regression model
    1.1.2 Hypothesis tests in simple linear regression
  1.2 R Programming

2 Signal Response Time
  2.1 Input data
  2.2 Shapiro-Wilk Test
  2.3 F-test
  2.4 T-test

3 Advertising Effectiveness
  3.1 Input the data from the table
  3.2 Running chisq.test() to compare with the given significance level

4 Late Arrivals
  4.1 Establish the following hypotheses
  4.2 Input the data and create the given table
  4.3 Anova Test
  4.4 Chi Square Test

REFERENCE

REQUIREMENT
• Each group works on the assigned topic.

• In each report, there must be student names and IDs on the cover page, the table of
content, and the questions.

• R-Studio must be used to analyze the data set and the codes must be inside framed
environments. Detailed explanations must be provided to receive full credit.

• Deadline: 28/05/2021.
Chapter 1

Linear Regression

Using linear regression, model the following data

X  50  130  170  270  90  210  50  130  270  240  170  210  90  210  90  240  50  240
Y  15  115  215  335  95  295  55  155  295  315  175  275  75  255  115  35  275  315

1.1 Theory
1.1.1 Linear regression model
The case of simple linear regression considers a single regressor variable or predictor variable X
and a dependent or response variable Y. Suppose that the true relationship between Y and X is a
straight line and that the observation Y at each level of X is a random variable. We assume that
each observation, Y, can be described by the model:
y = β0 + β1 x + ε

Where the intercept β0 and the slope β1 are unknown regression coefficients, and ε is a random
error with mean zero and (unknown) variance σ 2 . The random errors corresponding to different
observations are also assumed to be uncorrelated random variables.

For a dataset of n observations $(x_1, y_1), \ldots, (x_n, y_n)$, the sum of squared errors is defined by:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$$

The least-squares method finds the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ by minimizing SSE. Those
estimates are called the least squares estimates:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
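As a sketch, these formulas can be evaluated directly in R for this chapter's data. (We take the tenth X value as 240 rather than the printed 24, since that is the only value consistent with the totals Σx = 2910 and Σx² = 574300 reported later in the section.)

```r
# Least-squares estimates computed directly from the definitions above.
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
n   <- length(x)
Sxy <- sum(x * y) - sum(x) * sum(y) / n   # corrected sum of cross products
Sxx <- sum(x^2)   - sum(x)^2 / n          # corrected sum of squares of x
b1  <- Sxy / Sxx                          # slope estimate, about 0.9241
b0  <- mean(y) - b1 * mean(x)             # intercept estimate, about 40.05
```

Running this reproduces the slope and intercept that lm() reports later in the chapter.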

The coefficient of determination is the proportion of the variation in the response variable that
is explained by the different values of the independent variable, relative to the total variation:

$$R^2 = \frac{SSR}{SST} = \frac{\hat{\beta}_1 S_{xy}}{S_{yy}}$$

Note: a value of R2 near 1 indicates that most of the variation in the response data is explained by
the different values of the independent variable. In other words, the linear regression model explains
the relationship between y and x well.

Probability and Statistics Project 1 - Topic 6

1.1.2 Hypothesis tests in simple linear regression


Step 1: State the hypotheses H0 and H1
• The null hypothesis denoted by H0 is the claim that is initially assumed to be true.

• The alternative hypothesis denoted by H1 is the assertion that is contradictory to H0 .

Step 2: State the confidence level (1 − α).


Step 3: Compute the test statistic.
$$T_{\beta_0} = \frac{\hat{\beta}_0 - b_0}{SE(\hat{\beta}_0)} \sim t(n-2), \qquad \text{where } SE(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}$$

Step 4: Determine the rejection region or compute the p-value.


Alternative hypothesis     Rejection region                 P-value
H1: β1 ≠ b1                |t_{β1}| > t_{α/2}^{(n−2)}       p = 2P(T_{n−2} ≥ |t_{β1}|)
H1: β1 > b1                t_{β1} > t_{α}^{(n−2)}           p = P(T_{n−2} ≥ t_{β1})
H1: β1 < b1                t_{β1} < −t_{α}^{(n−2)}          p = P(T_{n−2} ≤ t_{β1})

Step 5: Conclusion.
The P-value is the smallest significance level α at which the null hypothesis can be rejected. Because
of this, the P-value is alternatively referred to as the observed significance level (OSL) for the data.
If the P-value is smaller than α, we reject H0 and have enough evidence to support H1.
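As an illustration of the five steps, here is a sketch of the t-test for the slope of this chapter's model (H0: β1 = 0 against H1: β1 ≠ 0), computed by hand from the definitions above:

```r
# t-test for the slope, H0: beta1 = 0 vs H1: beta1 != 0, at alpha = 0.05.
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
n   <- length(x)
Sxx <- sum(x^2)   - sum(x)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
Syy <- sum(y^2)   - sum(y)^2 / n
b1  <- Sxy / Sxx
SSE <- Syy - b1 * Sxy                  # residual sum of squares
s   <- sqrt(SSE / (n - 2))             # residual standard error
t1  <- b1 / (s / sqrt(Sxx))            # test statistic ~ t(n - 2)
p   <- 2 * pt(-abs(t1), df = n - 2)    # two-sided p-value
```

The resulting p-value matches the slope p-value reported by summary(lm(y ~ x)) in the next section, so we reject H0 at α = 0.05.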

1.2 R Programming
Step 1: Import data (the predictor variable X and response variable Y)

In RStudio, we use the function c() to import the raw data and matrix() to display the variables X and Y.

We can then easily observe the numbers and check whether the variables were imported correctly.

Step 2: Data Visualization

We use data.frame() to format the table of variables X and Y, then compute the totals of X and Y
with sum() and display the values with print(). We also apply summary() to compute descriptive
statistics for each variable.


These are important parameters for fitting the linear regression model demonstrated below in R.

$$\sum_{i=1}^{18} x_i = 2910, \quad \sum_{i=1}^{18} y_i = 3410, \quad \sum_{i=1}^{18} x_i^2 = 574300, \quad \sum_{i=1}^{18} x_i y_i = 647250, \quad \bar{x} = 161.7, \quad \bar{y} = 189.4$$

Step 3: Fitting a linear regression model

We use lm() to fit the linear regression model, and summary() to display the estimates of β0
and β1 along with the residuals.
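A minimal sketch of this step (x and y hold the data from the problem table, with the tenth X value taken as 240, consistent with the reported totals):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)  # fit y = beta0 + beta1*x + eps by least squares
summary(model)      # residual quartiles, coefficients, R-squared, p-values
```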

The output printed on the console can be interpreted as follows.

• Residuals

The residuals are the differences between the actual values and the fitted values. Their median
is 0.132, which is close to the expected value of 0.

The first quartile is -30.495 which is the median of the lower half of the data set. This means
that about 25% of the numbers in the data set lie below -30.495 and about 75% lie above.

The third quartile is 44.307 which is the median of the upper half of the data set. This means
that about 75% of the numbers in the data set lie below 44.307 and about 25% lie above.

In conclusion, the distribution of the residuals is not perfectly symmetric and is slightly right-skewed.


• Coefficients

The slope is defined as

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

The intercept is defined as $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Therefore, the fitted linear regression model is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 40.05 + 0.9241x$


• Residual standard error

The R console reports a residual standard error of s = 83.95 on 16 degrees of freedom.


• P-value

The p-value is 0.002682, which is smaller than the default significance level α = 0.05.
Therefore, we have enough evidence to conclude that there is a relationship between the
variables X and Y.
• Coefficient of determination

The adjusted R2 is 0.4052, which means that approximately 40.52% of the variance in the
response variable Y is explained by the predictor variable X.

Furthermore, par() and dev.off() are used to arrange the diagnostic graphs produced by plot(),
which illustrate the quality of the fit between X and Y.
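The plotting commands above can be sketched as follows (x and y as before):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model)           # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
dev.off()             # close the graphics device and reset the layout
```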
• Residuals vs Fitted Graph

The graph shows that the red line is approximately horizontal and that the magnitude of the
standardized residuals does not change much as a function of the fitted values. This suggests
the linear regression model is reasonably good.
• Scale-Location Graph

The Scale-Location plot shows whether the residuals are spread equally along the range of the
predictor. The line is roughly horizontal and the points are randomly spread, so the model
is adequate.


• Normal Q-Q Graph

A Q–Q plot plots the quantiles of two distributions against each other, or a plot based on
estimates of the quantiles; the pattern of points is used to compare the two distributions.
Here the points fall approximately along a straight line, suggesting that the residuals follow
a Normal distribution.

• Residuals vs Leverage Graph

The Residuals vs Leverage plot is used to identify influential cases, i.e., extreme values that
might change the regression results when included in or excluded from the analysis. In this case,
there are two points beyond the dashed Cook's distance lines that could influence the regression
results.

We also use predict() to compare the fitted values from the linear regression model with the
observed responses y. In particular, we can obtain the 95% prediction interval bounds lwr
and upr.
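A sketch of the predict() call (same x and y as before; R warns that prediction intervals on the training data refer to future responses, which is expected here):

```r
x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
model <- lm(y ~ x)
# fitted values with 95% prediction bounds (columns fit, lwr, upr)
pred <- suppressWarnings(predict(model, interval = "prediction", level = 0.95))
head(pred)
```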

Step 4: Diagram of a linear regression model


We construct the text of the linear regression equation so that it can be displayed next to the regression line on the graph.

First, we store the intercept, the slope, and the coefficient of determination as Beta 0, Beta 1,
and R2 with list(). We then build the regression-equation label with substitute(). Last but not
least, function() is used to combine these steps into a helper that adds the regression equation
to the plot.

We load the ggplot2 package with library() to sketch the diagram of the relationship between
x and y. With ggplot(), we put the data on the diagram and then draw the regression line with
geom_smooth(). Furthermore, geom_line() is used to draw the boundaries of the prediction
interval as blue dashed lines.
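The ggplot2 step can be sketched as follows (requires the ggplot2 package; the variable names are our own, not the report's):

```r
library(ggplot2)

x <- c(50, 130, 170, 270, 90, 210, 50, 130, 270, 240,
       170, 210, 90, 210, 90, 240, 50, 240)
y <- c(15, 115, 215, 335, 95, 295, 55, 155, 295, 315,
       175, 275, 75, 255, 115, 35, 275, 315)
df  <- data.frame(x, y)
fit <- lm(y ~ x, data = df)
# attach the 95% prediction bounds (columns fit, lwr, upr) to the data frame
df  <- cbind(df, suppressWarnings(
  as.data.frame(predict(fit, interval = "prediction"))))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +                 # line + 95% CI band
  geom_line(aes(y = lwr), linetype = "dashed", colour = "blue") +
  geom_line(aes(y = upr), linetype = "dashed", colour = "blue") +
  labs(title = "Linear regression of Y on X", x = "X", y = "Y")
print(p)
```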


For more detail on the diagram, we apply labs() to add a title and axis names, and geom_text()
to display the equation constructed at the start of step 4. Other functions include geom_rect(),
which draws a frame around the regression equation, and theme(), which customizes the background,
colors, and sizes of the graph.

The grey band indicates the 95% confidence interval for the regression line, while the blue dashed
lines mark the 95% prediction interval for future observations of y given x.

Chapter 2

Signal Response Time

A researcher wanted to examine the response times of men and women to different types of
signals. The subjects were asked to press the ENTER key on the computer keyboard as soon
as they recognized the signal. The duration (measured in seconds) between the time the signal
was emitted and the time the subject hit the button was recorded. Here are the results for 15 men
and 15 women.
GENDER SOUND LIGHT PULSE
MALE 10.0 6.0 9.1
7.2 3.7 5.8
6.8 5.1 6.0
6.0 4.0 4.0
5.0 3.2 5.2
FEMALE 10.5 6.6 7.3
8.8 4.9 6.1
9.2 2.5 5.2
8.1 4.2 2.5
13.4 1.8 3.9

Draw a conclusion at the significance level of α = 5%. Do the factors gender and signal interact?
Solution: Establish the following hypotheses for each signal:

• H0 : Gender plays no role in the response to that signal.

• H1 : Gender plays a role in the response to that signal.


2.1 Input data


Enter the data for males and females and the signals they responded to as a matrix, then build
a data frame; RStudio will then display a table like the one in the problem statement.

Usage: gl(n, k, length = n*k, labels = seq_len(n))

Arguments:
  n       an integer giving the number of levels.
  k       an integer giving the number of replications.
  labels  an optional vector of labels for the resulting factor levels.

Value:
The result has levels from 1 to n, with each value replicated in groups of length k, out to a
total length of length.

The dplyr package provides functions to convert and manipulate data after it has been loaded.
By chaining group_by() with summarise(), initial statistics of the dataset can be computed
and presented in a table.

The table is re-represented below:

Group    Count   Sound           Light           Pulse
                 Mean     SD     Mean     SD     Mean     SD
Male       5      7.0    1.88     4.4    1.13     6.0    1.90
Female     5     10.0    2.09     4.0    1.92     5.0    1.87
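The steps above can be sketched as follows (requires the dplyr package; variable names are ours):

```r
library(dplyr)

# Response times by signal: 5 male subjects followed by 5 female subjects
sound  <- c(10.0, 7.2, 6.8, 6.0, 5.0, 10.5, 8.8, 9.2, 8.1, 13.4)
light  <- c(6.0, 3.7, 5.1, 4.0, 3.2, 6.6, 4.9, 2.5, 4.2, 1.8)
pulse  <- c(9.1, 5.8, 6.0, 4.0, 5.2, 7.3, 6.1, 5.2, 2.5, 3.9)
gender <- gl(2, 5, labels = c("MALE", "FEMALE"))  # factor: 5 males, then 5 females
dat <- data.frame(gender, sound, light, pulse)

smry <- dat %>%
  group_by(gender) %>%
  summarise(count = n(),
            sound_mean = mean(sound), sound_sd = sd(sound),
            light_mean = mean(light), light_sd = sd(light),
            pulse_mean = mean(pulse), pulse_sd = sd(pulse))
print(smry)
```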

2.2 Shapiro-Wilk Test


We perform the Shapiro-Wilk test of normality using shapiro.test() nested inside with().


The null hypothesis of this test is that the data are normally distributed. All p-values are
greater than 0.05, so the null hypothesis is not rejected and normality can be assumed.
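A sketch for one signal (the sound column; the other signals are tested the same way):

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: the sample comes from a normal distribution
shapiro.test(male_sound)
shapiro.test(female_sound)
```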

2.3 F-test
We perform an F-test to compare the variances of two samples from normal populations. All
p-values of the F-tests are greater than α = 0.05; therefore, equality of variances can be assumed.
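A sketch for the sound signal using var.test(), R's F-test for comparing two variances:

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: the two population variances are equal (ratio of variances = 1)
var.test(male_sound, female_sound)
```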


2.4 T-test
We perform two-sample t-tests on the vectors of data using t.test().
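A sketch for the sound signal (a pooled two-sample t-test, since the F-test accepted equal variances):

```r
male_sound   <- c(10.0, 7.2, 6.8, 6.0, 5.0)
female_sound <- c(10.5, 8.8, 9.2, 8.1, 13.4)
# H0: equal mean response times for men and women; equal variances assumed
t.test(male_sound, female_sound, var.equal = TRUE)
```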


Conclusion
From the tests above, the p-values for the light response (p = 0.6985) and the pulse response
(p = 0.4262) are well above the significance level α = 0.05. Therefore, we fail to reject the null
hypothesis for these two signals: there is no evidence that gender plays a role in the response
times to light and pulse signals.

Meanwhile, the p-value for sound (p = 0.04403) is lower than α = 0.05, so we reject the null
hypothesis for this category: gender does appear to play a role in the response time to sound
signals.

Chapter 3

Advertising Effectiveness

To study whether the size of a company affects its advertising effectiveness, a survey of 356
customers' opinions was collected and the following table was obtained.

Company size category    Advertising effectiveness
                         High    Moderate    Low
Small                     20        52        32
Medium                    53        47        28
Large                     67        32        25

At the significance of α = 0.1, is there enough evidence to conclude that the company size affects
the advertising effectiveness?

Solution:
To solve the problem, first establish the hypotheses:

• H0 : The size of the company does not relate to effectiveness in advertising.

• H1 : The size of the company relates to effectiveness in advertising.

3.1 Input the data from the table


We input the dataset using the vector function c() and data.frame(), which creates a data frame
of the advertising-effectiveness counts.

Description: The function data.frame() creates data frames, tightly coupled collections of vari-
ables which share many of the properties of matrices and of lists, used as the fundamental data

structure by most of R’s modeling software.

We create row names for the table with row.names(), which gets and sets the row names of a
data frame, using the company sizes, and then print the table.

3.2 Running chisq.test() to compare with the given significance level
Application of the chi-square test:

The chi-square statistic measures how well a model matches the observed data: it compares the
size of any discrepancies between the expected results and the actual results, given the size of
the sample and the number of variables in the relationship. The function chisq.test() performs
chi-squared contingency-table tests and goodness-of-fit tests, so it is well suited to this problem.

We use chisq.test() to find the p-value and compare it with the given significance level.
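A sketch of this step on the survey table:

```r
# Chi-square test of independence: company size vs advertising effectiveness
tbl <- matrix(c(20, 52, 32,
                53, 47, 28,
                67, 32, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Small", "Medium", "Large"),
                              c("High", "Moderate", "Low")))
res <- chisq.test(tbl)
res  # X-squared ~ 29.64 on 4 degrees of freedom
```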

Conclusion
Using the chi-square test, we obtained:

χ2 = 29.638, degrees of freedom (df) = 4, p-value ≈ 5.8 × 10−6.

Since the p-value is far below α = 0.1, the null hypothesis H0 is rejected: the size of a company
and its advertising effectiveness are statistically significantly associated.

Chapter 4

Late Arrivals

The numbers of students arriving late at four high schools on different days of the week are
given in the following table:

Day of week    High school
               A    B    C    D
Monday         5    4    5    7
Tuesday        4    5    3    2
Wednesday      4    3    4    5
Thursday       4    4    3    2

Is there any significant difference in the number of late arrivals among the different days of the
week at the significance level α = 5%?

Theory - Hypothesis Test


A statistical hypothesis is an assertion or conjecture concerning one or more populations. To prove
that a hypothesis is true, or false, with absolute certainty, we would need absolute knowledge. That
is, we would have to examine the entire population. Instead, hypothesis testing concerns how to
use a random sample to judge if it is evidence that supports or not the hypothesis. Hypothesis
testing is formulated in terms of two hypotheses:

• H0 : The null hypothesis.

• H1 : The alternate hypothesis.

The hypothesis we want to test is if H1 is “likely” true. So, there are two possible outcomes:

• Reject H0 and accept H1 because of sufficient evidence in the sample in favor of H1 .

• Do not reject H0 because of insufficient evidence to support H1 .


4.1 Establish the following hypotheses


• H0 : There is no difference in the number of late arrivals among different days of the week.

• H1 : There is a difference in the number of late arrivals among different days of the week.

4.2 Input the data and create the given table


The data is entered in matrix style according to the given dataset in R.
We create a matrix by combining the columns, convert it with the data.frame() function, and
stack the data vertically:


cbind(): combines (generalized) vectors or matrices column by column. These can be given as
named arguments; other R objects may be coerced as appropriate, or S4 methods may be used
(see the 'Details' and 'Value' sections of the documentation). For the data.frame method of
cbind, further arguments such as stringsAsFactors can be passed on to data.frame.

stack(): stacking vectors concatenates multiple vectors into a single vector, along with a factor
indicating where each observation originated. Unstacking reverses this operation.
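A sketch of the data-building step (variable names are ours):

```r
# One column per school, one row per day (Monday through Thursday)
A <- c(5, 4, 4, 4); B <- c(4, 5, 3, 4)
C <- c(5, 3, 4, 3); D <- c(7, 2, 5, 2)
wide <- data.frame(A, B, C, D,
                   row.names = c("Monday", "Tuesday", "Wednesday", "Thursday"))
long <- stack(wide)                       # columns: values, ind (the school)
long$day <- gl(4, 1, length = 16,
               labels = rownames(wide))   # day repeats down each stacked column
```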

4.3 Anova Test


Compute analysis of variance (or deviance) tables for one or more fitted model objects.

Usage: anova(object, ...)

Arguments:
  object  an object containing the results returned by a model-fitting function (e.g., lm or glm).
  ...     additional objects of the same type.

After the ANOVA test runs, the summary() function produces the table shown in the output.
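A sketch of the one-way ANOVA across days (a compact equivalent of the stacked data frame above):

```r
# Does the mean late-arrival count differ across days of the week?
counts <- c(5, 4, 5, 7,   # Monday   (schools A-D)
            4, 5, 3, 2,   # Tuesday
            4, 3, 4, 5,   # Wednesday
            4, 4, 3, 2)   # Thursday
day <- gl(4, 4, labels = c("Mon", "Tue", "Wed", "Thu"))
fit <- aov(counts ~ day)
summary(fit)  # F statistic on 3 and 12 degrees of freedom; p-value > 0.05
```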

Conclusion:
With a p-value > α = 0.05, we fail to reject the null hypothesis H0 . We conclude that there is
no significant difference in the number of late arrivals among the different days of the week.


4.4 Chi Square Test


A chi-square test can also be applied to this dataset. However, the R console warns that the
χ2 approximation may be inaccurate: some of the expected counts are small (below 5), so the
p-value approximation could be poor. The test nevertheless gives χ2 = 3.5458 and p-value =
0.9387 > α = 0.05, which supports the null hypothesis H0 .

A second evaluation gives p-value = 0.9472, still much larger than α = 0.05 and leading to the
same conclusion as the ANOVA test.
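A sketch of the chi-square test on the day-by-school table (the warning about small expected counts is suppressed here, as discussed above):

```r
tbl <- matrix(c(5, 4, 5, 7,
                4, 5, 3, 2,
                4, 3, 4, 5,
                4, 4, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Mon", "Tue", "Wed", "Thu"),
                              c("A", "B", "C", "D")))
# Expected counts fall below 5, so R warns the approximation may be inaccurate
res <- suppressWarnings(chisq.test(tbl))
res  # X-squared ~ 3.55 on 9 degrees of freedom
```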

Conclusion
From both tests, with all p-values > α = 0.05, we fail to reject the null hypothesis H0 . We
conclude that there is no significant difference in the number of late arrivals among the
different days of the week.

REFERENCE
1. RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC,
Boston, MA. URL: http://www.rstudio.com/

2. Montgomery, D. C., & Runger, G. C. (2007). Applied Statistics and Probability for Engineers.
Hoboken, NJ: Wiley.

3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical
Learning: with Applications in R. New York: Springer.
