
HANOI UNIVERSITY

FACULTY OF MANAGEMENT AND TOURISM




BUSINESS AND ECONOMICS STATISTICS


CASE STUDY:
ACADEMIC PERFORMANCE OF UNIVERSITY STUDENTS

Tutor: Mrs. Lai Hoai Phuong Tutorial: Tut 2

Group: 02

Nguyễn Thị Bích 1904040014

Nguyễn Hà Linh 1904040066

Nguyễn Bùi Huyền My 1904000081

Nguyễn Quốc Phi 1904000093

Trịnh Huyền Thương 1904000109

Nguyễn Thảo Linh 1904050020

Nguyễn Thị Trang 1904000113

Content pages: 10 pages

Date: 10 November 2021


TABLE OF CONTENTS

A. Scenario

B. Questions and Answers

1. Clarifying inference technique

2. Producing descriptive statistics for the dataset

3. Checking the assumptions

4. Performing the Two-way ANOVA test

5. Drawing and interpreting the interaction plot

6. Credibility of the interpretations and conclusions

C. Peer Evaluation

LIST OF FIGURES

Figure 1: Box plot of the dataset

Figure 2: Mean plot of the dataset

Figure 3: Q-Q plot of the dataset

Figure 4: Interaction plot of the dataset

A. Scenario

The Vietnam Household Living Standards Survey (VHLSS) is conducted nationwide every two
years to systematically monitor the living standards of Vietnamese society. In 2018, the survey
was carried out with a sample of 46,995 households in 3,133 communes/wards, which were
representative at national, regional, urban, rural and provincial levels. The household
questionnaire contained many sections, each of which covered a separate aspect of household
activities, and education was one important indicator. In the survey, household heads were asked
to specify their place of residence (province), the schooling level of their children (edulevel) and
their expenditure on education per child over the past 12 months in thousands of VND (eduspend). The
objective of our study is to test for any significant interaction between place of residence and
schooling level, and to test for any significant differences in education expenditure due to these
two variables, at the 0.05 level of significance.

The dataset, which consists of 150 observations, is provided in the file named “Case 1.csv”.

obs province edulevel eduspend


1 HaiDuong Nursery School 1440
2 HaiDuong Nursery School 1795
3 HaiDuong Nursery School 1330
4 HaiDuong Nursery School 930
5 HaiDuong Nursery School 2450
6 HaiDuong Nursery School 3344
7 HaiDuong Nursery School 3880
8 HaiDuong Nursery School 3215
9 HaiDuong Nursery School 1360
10 HaiDuong Nursery School 1385
11 HaiDuong Nursery School 1645
12 HaiDuong Nursery School 1730
13 HaiDuong Nursery School 410
14 HaiDuong Nursery School 1810
15 HaiDuong Nursery School 2625
16 HaiDuong Nursery School 1260
17 HaiDuong Nursery School 1535
18 HaiDuong Nursery School 630
19 HaiDuong Nursery School 1600
20 HaiDuong Nursery School 1304
21 HaiDuong Nursery School 1450
22 HaiDuong Nursery School 5322
23 HaiDuong Nursery School 1819
24 HaiDuong Nursery School 1619
25 HaiDuong Nursery School 1959

B. Questions and Answers

Question 1: What inference technique should be considered for this study? Explain.

In this report, a two-way ANOVA test should be applied for the following two reasons:

• The case study involves two factors (independent variables): place of residence
(province) and schooling level (edulevel)
• The case study involves one continuous dependent variable: education expenditure
(eduspend)

The analysis is carried out for the following purposes:

• Examine if there is a significant interaction between the place of residence and schooling
levels.
• Examine whether there are any significant differences in education expenditure due to
these two variables or not.

Question 2: Produce descriptive statistics for the dataset. You are expected to generate as many
relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of
this course. Remember to provide appropriate interpretations for the descriptive statistics. Try
not to include unnecessary or irrelevant descriptive statistics.

In this project, the descriptive statistics were produced in RStudio.

First, we set the working directory to the folder where the CSV file is located, then import “Case 1.csv”
into R as the input for all later work:

➢ case1 <- read.table("Case1.csv", header = TRUE, sep = ",",
stringsAsFactors = FALSE)
➢ str(case1)

To apply the statistical method, the two grouping variables (province and edulevel) must be
converted from character variables to factors. Therefore, we use the factor() function to do this:

➢ case1$province <- factor(case1$province, levels = c("HaiDuong", "HaiPhong"))
➢ case1$edulevel <- factor(case1$edulevel, levels = c("Nursery School",
"Primary School", "Secondary School"))

Thus, the structure of data is transformed into:

➢ str(case1)

From the R output, there are 150 observations of 4 variables: observation number (obs), province,
education level (edulevel) and education spending (eduspend). The output also shows the two
factors with their levels: ‘province’ has 2 levels (HaiDuong and HaiPhong) and ‘edulevel’ has
3 levels (Nursery School, Primary School and Secondary School). In addition, the output gives
the type of the remaining variables, which is integer for both ‘obs’ and ‘eduspend’.
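The conversion described above can be tried on a small stand-in data frame. This is only a sketch: the four rows in `demo` are invented for illustration and are not taken from the case file; only the level names come from the report.

```r
# Hypothetical miniature of the data set (rows invented for illustration).
demo <- data.frame(
  province = c("HaiDuong", "HaiPhong", "HaiPhong", "HaiDuong"),
  edulevel = c("Primary School", "Nursery School",
               "Secondary School", "Nursery School"),
  eduspend = c(3000, 4000, 11000, 2000),
  stringsAsFactors = FALSE
)

# Convert only the grouping variables; eduspend stays numeric.
demo$province <- factor(demo$province, levels = c("HaiDuong", "HaiPhong"))
demo$edulevel <- factor(demo$edulevel,
                        levels = c("Nursery School", "Primary School",
                                   "Secondary School"))

str(demo)             # province and edulevel are now factors
levels(demo$edulevel) # levels appear in the order given above
```

Giving the levels explicitly fixes their order, which in turn fixes the order of groups in later tables and plots.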

Also in RStudio, we use the table() function to produce a cross-tabulation of the sample sizes
for the two variables ‘province’ and ‘edulevel’:

➢ table(case1$province, case1$edulevel)

In total, there are 6 combinations of province and education level for the 150 observations, and
each combination has the same size of 25 observations. Such equal sample sizes (a balanced design)
are a good sign for the hypothesis tests later, because they are favourable for performing the test.
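A balanced layout of this kind can be checked programmatically. The sketch below rebuilds only the group labels (2 provinces, 3 levels, 25 observations per cell, as reported above); the row order is an assumption and no spending values are involved.

```r
# Rebuild the group labels only: 2 provinces x 3 levels x 25 observations.
# The ordering of rows is assumed for illustration.
province <- rep(c("HaiDuong", "HaiPhong"), each = 75)
edulevel <- rep(rep(c("Nursery School", "Primary School", "Secondary School"),
                    each = 25), times = 2)

cell_counts <- table(province, edulevel)
print(cell_counts)

# The design is balanced when every cell has the same count.
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```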

In preparation for the hypothesis tests later, we calculated the mean and standard deviation
of each of the combinations defined above:

➢ by(case1$eduspend, list(case1$province,case1$edulevel), mean)

➢ by(case1$eduspend, list(case1$province,case1$edulevel), sd)

Clearly, there is a large disparity between the groups: the lowest mean belongs to the
Hai Duong Nursery School group (approximately 1914), while Hai Phong Secondary School has
the highest mean at nearly 10867. In terms of standard deviation, the lowest value is that of
Secondary School in Hai Duong, at about 1040, while Primary School in Hai Phong has the highest,
at approximately 6144.
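As a check on one cell, the 25 eduspend values listed in the Scenario for HaiDuong / Nursery School can be summarised directly. This sketch uses only those listed values; the data frame `demo` is a helper built here for illustration and is not the full case file.

```r
# eduspend values for HaiDuong / Nursery School (obs 1-25 in the Scenario listing).
x <- c(1440, 1795, 1330, 930, 2450, 3344, 3880, 3215, 1360, 1385,
       1645, 1730, 410, 1810, 2625, 1260, 1535, 630, 1600, 1304,
       1450, 5322, 1819, 1619, 1959)

cell_mean <- mean(x)  # 1913.88, in line with the "approximately 1914" reported
cell_sd   <- sd(x)    # sample standard deviation of the same cell
cat("mean:", cell_mean, " sd:", round(cell_sd, 1), "\n")

# The same summary by group, mirroring the by() calls in the report.
demo <- data.frame(province = "HaiDuong", edulevel = "Nursery School",
                   eduspend = x)
print(tapply(demo$eduspend, list(demo$province, demo$edulevel), mean))
```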
In addition, to examine the distribution and spread of the six combinations (Hai Duong-Nursery
School, Hai Duong-Primary School, Hai Duong-Secondary School, Hai Phong-Nursery School,
Hai Phong-Primary School and Hai Phong-Secondary School), a box plot and a mean plot were
produced.

➢ boxplot(eduspend ~ province + edulevel, data = case1,
xlab = "Province along with level of education",
ylab = "Spending for education",
col = c("red", "skyblue", "yellow", "grey", "green", "pink"),
main = "Box plot")

Figure 1: Box plot of the dataset

➢ library(gplots)
➢ plotmeans(eduspend ~ interaction(province, edulevel), data = case1,
xlab = "Province and level of education",
ylab = "Spending for education",
main = "Mean Plot with 95% CI")

Figure 2: Mean plot of the dataset

The box plot shows the minimum, maximum, quartiles and spread (including the interquartile
range and whiskers) of each group. From the box plot, the medians clearly differ between the
groups: Hai Phong Secondary School has the highest median of about 10000 and Hai Duong
Nursery School has the lowest median of about 2000. The spread of the groups also differs
clearly; in particular, Hai Phong Secondary School has a much wider range than the other groups,
while Hai Duong Nursery School has the narrowest range of all.

Based on both the mean plot and the box plot, the groups have differently shaped
distributions. More specifically, education expenditure for Hai Duong Nursery School and
Hai Phong Secondary School tends to be right-skewed, while education expenditure for
Hai Duong Primary School is skewed to the left. In addition, many outliers appear in the box
plot, and these affect the calculated group means.
Question 3: Check all the assumptions of the inference technique you suggest in Question 1. Are
the assumptions satisfied? Explain.

In order to ensure the validity of the two-way ANOVA test, the following three assumptions are
required:

Assumption 1: Independent, simple random samples.

Assumption 2: All populations are normally distributed.

Assumption 3: All population standard deviations are the same, or equivalently all population variances are equal ($\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$).

The total sample of 150 respondents was divided into six equal-sized groups:
HaiDuong-Nursery School, HaiDuong-Primary School, HaiDuong-Secondary School,
HaiPhong-Nursery School, HaiPhong-Primary School and HaiPhong-Secondary School. Moreover, the
responses from the 150 households were recorded separately, so the education expenditure
of one group did not affect that of another. Therefore, the samples were treated as independent
and randomly selected.

In addition, a valid ANOVA test requires checking the normality of the populations. Normality
can be assessed with a Q-Q plot of the model residuals, using the following code:

➢ library(car)
➢ qqPlot(lm(eduspend ~ province + edulevel + province*edulevel,
data = case1), simulate=T, main="Q-Q Plot", labels=F)

Figure 3: Q-Q plot of the dataset

In a Q-Q plot, the scatter of the points shows whether the population can be regarded as
normally distributed. Although many of the points on this plot fell outside the blue confidence
band, which suggests some departure from normality, the points broadly followed a straight line.
As a result, the normality of residuals was considered only approximately satisfied.

Finally, the assumption of equal standard deviations was checked by comparing the ratio of the
largest standard deviation to the smallest one with 2. If the ratio is smaller than 2, it can be
concluded that the population standard deviations are equal.

$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{6143.572}{1040.453} = 5.9047$$

In this case, the largest standard deviation is 6143.572 and the smallest standard deviation is
1040.453. Consequently, the ratio is 5.9047, which is greater than 2, so we conclude that the
population standard deviations are not equal.
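The rule of thumb above is a one-line calculation. A minimal sketch using the two standard deviations reported in this section (the variable names are chosen here for illustration):

```r
# Largest and smallest group standard deviations, as reported above.
sd_largest  <- 6143.572
sd_smallest <- 1040.453

ratio <- sd_largest / sd_smallest
print(round(ratio, 4))   # 5.9047

# Rule of thumb used in this report: treat the SDs as equal only if ratio < 2.
equal_sd_plausible <- ratio < 2
print(equal_sd_plausible)   # FALSE
```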

Another approach to checking the assumption of equal standard deviations is Levene’s
test. This test is designed to check the homogeneity of variance, that is, to test
whether the variances of the groups are approximately equal. The decision rule is to compare
the p-value of Levene’s test with our significance level (α = 0.05): if the p-value is larger than α,
the test is non-significant and equal variances are assumed; otherwise, equal variances are not
assumed. However, Levene’s test is essential only when the ratio of standard deviations is
inconclusive (between 2 and 3), so in this case it is not strictly necessary. The code to run
Levene’s test is provided below:

➢ install.packages("car")
➢ library(car)
➢ leveneTest(case1$eduspend,
interaction(case1$province, case1$edulevel), center = mean)

The p-value calculated by Levene’s test is smaller than the level of significance;
hence we conclude that the population variances are not equal, which means the population
standard deviations are not equal either.
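To make the mechanism of Levene's test concrete: with center = mean, it amounts to a one-way ANOVA on the absolute deviations of each observation from its group mean. The sketch below reproduces that idea in base R on simulated data; the groups "A", "B", "C" and their values are hypothetical and are not the case data.

```r
set.seed(1)

# Hypothetical data: three groups of 25, the third with a much larger spread.
group <- factor(rep(c("A", "B", "C"), each = 25))
y <- c(rnorm(25, mean = 10, sd = 1),
       rnorm(25, mean = 10, sd = 1),
       rnorm(25, mean = 10, sd = 6))

# Absolute deviation of each observation from its own group mean
# (ave() uses the mean by default).
abs_dev <- abs(y - ave(y, group))

# One-way ANOVA on the absolute deviations: this F test is Levene's test
# with center = mean.
levene_fit <- anova(lm(abs_dev ~ group))
p_value <- levene_fit[["Pr(>F)"]][1]

alpha <- 0.05
print(levene_fit)
print(p_value < alpha)   # TRUE means equal variances are not assumed
```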

Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all
the necessary steps. What are your interpretations and conclusions? Explain.

Step 1: Hypothesis testing

- To test whether there are differences in Education expenditure due to Place of residence.

Ho: There is no effect of Place of residence on Education expenditure.

Ha: There is an effect of Place of residence on Education expenditure.

- To test whether there are differences in Education expenditure due to Schooling levels.

Ho: There is no effect of Schooling levels on Education expenditure.

Ha: There is an effect of Schooling levels on Education expenditure.

- To test whether there is an interaction between Place of residence and Schooling levels.

Ho: There is no significant interaction between Place of residence and Schooling levels.

Ha: There is a significant interaction between Place of residence and Schooling levels.
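The three pairs of hypotheses above can also be written in terms of a standard two-way ANOVA model. The notation below is an assumption introduced here for illustration (it does not appear in the case materials): $i$ indexes province, $j$ indexes schooling level, $k$ indexes households within a cell, and the effects satisfy the usual sum-to-zero constraints.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1, 2; \quad j = 1, 2, 3; \quad k = 1, \dots, 25

% Place of residence
H_0 : \alpha_1 = \alpha_2 = 0
\qquad H_a : \text{at least one } \alpha_i \neq 0

% Schooling level
H_0 : \beta_1 = \beta_2 = \beta_3 = 0
\qquad H_a : \text{at least one } \beta_j \neq 0

% Interaction
H_0 : (\alpha\beta)_{ij} = 0 \text{ for all } i, j
\qquad H_a : \text{at least one } (\alpha\beta)_{ij} \neq 0
```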

Step 2: Check assumptions

• The samples are independent, simple random samples
• All populations are normally distributed
• All populations have the same variance ($\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$)
⇒ The details of the above assumptions have been discussed in Question 3.

Step 3: Test statistic and P-value

We use RStudio to calculate the test statistics and p-values; the following commands produce
the R output for the two-way ANOVA test:

➢ case1.result <- aov(eduspend ~ province*edulevel, data = case1)
➢ summary(case1.result)

Step 4: Level of significance: α= 0.05

Step 5: Comparison

We use the p-value approach to make the decision; therefore, we reject Ho if
p-value < α. Based on the R output, we find p-value = 0.0223.

p-value = 0.0223 < α = 0.05

⇒ Therefore, we reject Ho.
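The same decision rule can be applied to each row of an aov() table. The sketch below runs on simulated data with the same layout as the case (2 provinces, 3 levels, 25 per cell); the spending values are generated at random, so the p-values it prints are not those of the case study.

```r
set.seed(42)

# Hypothetical balanced layout; expand.grid() turns the labels into factors.
demo <- expand.grid(rep = 1:25,
                    province = c("HaiDuong", "HaiPhong"),
                    edulevel = c("Nursery School", "Primary School",
                                 "Secondary School"))

# Invented cell means: spending rises with level, faster in HaiPhong.
base  <- unname(c(HaiDuong = 2000, HaiPhong = 4000)[as.character(demo$province)])
slope <- unname(c(HaiDuong = 1000, HaiPhong = 3500)[as.character(demo$province)])
demo$eduspend <- base + slope * (as.integer(demo$edulevel) - 1) +
  rnorm(nrow(demo), sd = 1500)

fit <- aov(eduspend ~ province * edulevel, data = demo)
tab <- summary(fit)[[1]]

# p-values for province, edulevel and the interaction (first three rows).
p_values <- tab[["Pr(>F)"]][1:3]
names(p_values) <- trimws(rownames(tab))[1:3]

alpha <- 0.05
decision <- ifelse(p_values < alpha, "reject H0", "do not reject H0")
print(data.frame(p_value = p_values, decision = decision))
```

Reading all three rows this way makes it explicit which null hypothesis each p-value belongs to.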

Step 6: Conclusion

At the 5% significance level, there is sufficient evidence to conclude that the interaction between
Place of residence and Schooling levels is significant; there are differences in Education
expenditure due to Place of residence and Schooling levels.

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the
conclusions made in Question 4?

The following command is used to draw the interaction plot between place of residence and schooling levels:
➢ interaction.plot(case1$province, case1$edulevel, case1$eduspend,
type="b", col=c("black", "pink", "gold"), pch=c(16, 18), main
="Interaction between province and education level")

Figure 4: Interaction plot of the dataset

The graph above illustrates how mean education expenditure (eduspend) varies with place of
residence (province) for each schooling level (edulevel).

It can be clearly seen that the three lines are not parallel to each other; therefore, the effect of
place of residence on education expenditure depends on schooling level. In both Hai Phong and
Hai Duong, households spend the most on Secondary School, with nearly 11,000 and more than
4,000 respectively. The effect of place of residence on education expenditure is also visible.
As the plot shows, there is a considerable difference in spending on Nursery School between the
two provinces: households in Hai Phong spend approximately 4,000 at this level, compared with
nearly 2,000 in Hai Duong. Similarly, households in Hai Phong spend more on Primary School,
with nearly 8,000 compared with about 3,000 in Hai Duong. The gap is largest for Secondary
School, where spending in Hai Phong is about 11,000, almost three times the figure of just over
4,000 in Hai Duong.

We can conclude that the plot suggests an interaction effect of place of residence and schooling
level on education expenditure. This is consistent with the conclusion reached in
Question 4 at the level of significance α = 0.05.
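"Not parallel" can be made concrete by comparing the gap between the two provinces at each schooling level. The sketch below uses the rounded values read off the plot in the text above, so the figures are approximate readings rather than exact cell means.

```r
# Approximate cell means as read off the interaction plot (rounded values).
cell_means <- matrix(c(2000, 3000,  4000,
                       4000, 8000, 11000),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(
                       province = c("HaiDuong", "HaiPhong"),
                       edulevel = c("Nursery School", "Primary School",
                                    "Secondary School")))

# Gap between provinces at each level; parallel lines would give equal gaps.
gap <- cell_means["HaiPhong", ] - cell_means["HaiDuong", ]
print(gap)   # 2000 5000 7000

lines_parallel <- length(unique(gap)) == 1
print(lines_parallel)   # FALSE
```

The gap widens from Nursery to Secondary School, which is the pattern that the non-parallel lines display.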

Question 6: Discuss the credibility of the interpretations and conclusions of Question 4. Is there
anything we should be concerned about? Explain.

In this question, we discuss the reliability of the interpretations and conclusions drawn in
Question 4. Credibility is considered the most important criterion for establishing
trustworthiness, because it essentially requires the researcher to explicitly link
the findings of the study to the facts in order to demonstrate that the findings are true.

As mentioned above, the two-way ANOVA test is useful for checking whether the effect of
place of residence (province) on education expenditure (eduspend) depends on schooling level
(edulevel), that is, whether the two factors interact.

In Question 4, all the steps of the test were followed and the assumptions were checked.
With a p-value of 0.0223, which is less than the significance level (α = 0.05), we
rejected the null hypothesis. We concluded that there is a significant interaction between place
of residence and schooling levels, as well as differences in education expenditure due to these
two factors.

The ANOVA test is very useful for testing the interaction between two factors, but it still has
some limitations in this case. As mentioned in Question 3, when the Q-Q plot was used to check
the normality of the populations, the results showed that the normality of residuals was only
approximately satisfied. We then compared the ratio of the largest standard deviation to the
smallest standard deviation with 2; the ratio (5.9047) is far greater than 2, so the
population standard deviations cannot be regarded as equal. Moreover, the test assumes that the
sample is a simple random sample, but there is no evidence that the sample was drawn at
random from its population. These points imply that the ANOVA test is not really appropriate
for this case study.

In addition, this case study considers only two factors that affect education expenditure,
place of residence and schooling level, although other factors may also influence
it. Those factors include the income level of each household, different
cultures in different regions…

C. Peer Evaluation

Name    Student ID    Contribution (…/100%)    Signature

Nguyễn Thị Bích 1904040014

Nguyễn Hà Linh 1904040066

Nguyễn Bùi Huyền My 1904000081

Nguyễn Quốc Phi 1904000093

Trịnh Huyền Thương 1904000109

Nguyễn Thảo Linh 1904050020

Nguyễn Thị Trang 1904000113
