
I. Introduction
- A survey of first-year students at DUT has given us interesting
data about students’ entrance scores, height, weight, and so on. We
decided to carry out research on the data from this survey.
In this report, we focus on the students' self-study time and
entrance scores. Based on what we have learned about descriptive
and inferential statistics, we summarize and visualize the data
and use estimation and hypothesis testing to answer the problems
posed. We also use R to analyze the data and clarify the problems.
- Group 6 has six members, and their tasks in this project are shown
below:

Task                                  Members
Descriptive Statistics (graph)        Truong Phuoc Thinh, Vo Dai Thuan
Inferential Statistics                Le Gia Vinh
Cleaning and Distributing the Data    Le Quang Kiet, Phan Duong Gia Bao
Writing Report                        Phan Duong Gia Bao, Le Gia Vinh, Pham Minh Chien

II. Data Collection and Description


-Our team uses the freshman survey data from
“FASR_Data_4Groups” of the Faculty of Advanced Science and Technology.
-After cleaning, the data consist of 2 classes (groups 3 and 4 in the given
data). The variables include: Total score, Math score, Score 1, Score 2,
Self-study time and Number of Social Networks.
● Class 1 is 21ES and has 52 students.
● Class 2 is 21ECE and has 49 students.
III. Analyzing the results
1. Descriptive statistics
1.1 Total Entrance Score:

❖ Comment:
➢ The highest entrance score belongs to the ES class (26.70), with the
ECE class only slightly behind (26.00).
➢ The median score of ECE is greater than that of ES, but its mean
score is smaller. The difference, however, is small.
Check whether the data are normally distributed:
❖ Comment:
➢ The linear pattern in each plot supports the assumption that the
total score distributions of the two classes are normal.
➢ The plots also support the conclusion that the Total Entrance
Score variable of the 2 classes follows the normal distribution, since
almost all points in the 2 charts lie in the gray area.
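
As an illustration of what a normal QQ plot compares, the sketch below pairs each sorted observation with the quantile a fitted normal distribution would predict; points close to the line y = x correspond to the "linear pattern" described above. It is written in Python rather than R, the helper `qq_points` is a hypothetical name, the plotting positions (i - 0.5) / n are one common convention, and the scores are made-up values, not the survey data.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    xs = sorted(sample)
    n = len(xs)
    # Reference normal distribution fitted with the sample mean and SD.
    ref = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n avoid the endpoints 0 and 1.
    return [(ref.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical total scores for illustration only (NOT the survey data).
scores = [21.5, 22.0, 22.75, 23.0, 23.25, 23.5, 24.0, 24.5, 25.0, 26.0]
pairs = qq_points(scores)

for theoretical, observed in pairs:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the observed values track the theoretical ones closely, the normality assumption is plausible; systematic curvature at the ends suggests skewness or heavy tails.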

Boxplot:

❖ Comment:
➢ ES has a slightly lower median.
➢ Class ES has 2 outliers (18 and 19.5), and class ECE has 1 outlier
(17).
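
Boxplot outliers are usually flagged with the 1.5 × IQR rule. The sketch below shows that rule in Python under stated assumptions: `iqr_outliers` is a hypothetical helper, the quartiles use the "inclusive" method of `statistics.quantiles` (R's `boxplot` uses hinges, which can differ slightly), and the scores are made-up values rather than the survey data.

```python
from statistics import quantiles

def iqr_outliers(sample):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sample if x < low or x > high]

# Hypothetical total scores for illustration only (NOT the survey data).
scores = [17.0, 22.0, 22.5, 23.0, 23.25, 23.5, 24.0, 24.5, 25.0, 25.5]
outliers = iqr_outliers(scores)
print(outliers)  # [17.0]
```

Here Q1 = 22.625 and Q3 = 24.375, so the fences are 20.0 and 27.0 and only the score 17.0 falls outside them.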
The histogram:

❖ Comment:
➢ Most students in the two classes have scores ranging from 22 to 25,
with only a few exceptions below 18 points.

1.2 Score for Each Subject:


-ECE:

-ES:
❖ Comment:
➢ The highest Math score and the highest Score 3 are the same in ES
and ECE (9.2 and 9.6, respectively), while in Score 2 the highest score
in ES is 1 point higher than in ECE (9.6 versus 8.6).
➢ The lowest score across the 2 classes is in Score 2 of ECE (4.0).
➢ Overall, most medians and means of ES are larger than those of
ECE.

Check for Normal Distribution:

❖ Comment:
➢ The QQ plots support the conclusion that the Math Score
variable of the 2 classes follows the normal distribution, since almost
all points in the 2 charts lie in the gray area.

The Boxplot:
❖ Comment:
➢ The median is greater for ES (8.2).
➢ Class ECE has 1 outlier (6).
➢ The interquartile range is the same for the two classes.
➢ In ECE the 3rd quartile lies much farther from the median, whereas
in ES it is much closer.

The Histograms:

-ES:
-ECE:

❖ Comment:
➢ Most students in the 2 classes have scores ranging from 7.0 to 8.6.
1.3 Self-study Time & Social Media:

❖ Comment:
➢ In ES (class 1), most students use 3 or 4 kinds of social media, while
only 1 student uses a single one.
➢ In ECE (class 2), the number of students who use 3 kinds of social
media is the highest, and no one uses only 1.
Check whether Self-study time is normally distributed:

❖ Comment:
➢ The first QQ plot suggests that the Self-Study Hour variable of ECE
may follow the normal distribution, because almost all points in the
first chart lie in the gray area.
➢ The second QQ plot suggests that the Self-Study Hour variable of
Class2 does not follow the normal distribution, because almost all
points in the second chart lie outside the gray area.
The boxplot:

❖ Comment:
➢ ES has many outliers above 40.
➢ In both classes, the 1st quartile is much closer to the median than
the 3rd quartile, especially in the ES class.

2. Probability Distribution
Total score and Self-study time distribution:
❖ 21ES + 21ECE
❖ 21ES
➢ Table of Distribution:

➢ Parameter:
■ Total score:
+ Expected value: 23.389
+ Variance: 3.055
+ Standard deviation: 1.7469
■ Self-study time:
+ Expected value: 9.783
+ Variance: 125.37
+ Standard deviation: 11.196

❖ 21ECE
➢ Table of Distribution:
➢ Parameter:
■ Total score:
+ Expected value: 23.326
+ Variance: 3.006
+ Standard deviation: 1.73
■ Self-study time:
+ Expected value: 10.73
+ Variance: 83.6
+ Standard deviation: 9.13
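
The parameters above follow from a distribution table through E(X) = Σ x·p(x) and V(X) = Σ (x - E(X))²·p(x), with the standard deviation being the square root of the variance. A minimal Python sketch of that calculation is given below; the `moments` helper and the small table are hypothetical (the report's own tables are not reproduced here), and the last line only re-checks one reported pair of values.

```python
from math import sqrt, isclose

def moments(table):
    """table maps each value x to its probability p(x)."""
    assert isclose(sum(table.values()), 1.0), "probabilities must sum to 1"
    ex = sum(x * p for x, p in table.items())
    var = sum((x - ex) ** 2 * p for x, p in table.items())
    return ex, var, sqrt(var)

# Hypothetical distribution table for illustration only (NOT the survey data).
table = {5: 0.25, 10: 0.5, 15: 0.25}
ex, var, sd = moments(table)
print(ex, var, round(sd, 3))

# Consistency check on a reported pair: 21ES self-study variance 125.37.
report_sd = sqrt(125.37)
print(round(report_sd, 3))  # close to the reported 11.196
```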

3. Inferential Statistics
- Topic: Score prediction model based on self-study time
Develop a score prediction model based on self-study time, using
statistical methods to estimate the impact of self-study on university
results.
- Direction:
1. Define the objectives
Identifying the association between time use (internet time,
self-study time) and the studying process.
2. Data collection
Gathering the relevant data, including academic scores and the time
spent on studying.
3. Data preprocessing
Removing missing values, outliers and other issues in the data, so
that the data become clean and suitable for analysis.
4. Data exploration
Understanding the trend and distribution of the data, and exploring
the correlation between self-study time and scores.
5. Statistical analysis
Using statistical methods to quantify the relationship between
self-study time and scores. This may involve correlation analysis,
t-tests or regression analysis, depending on the nature of the data.
6. Interpretation of results
Interpreting the results of the model: understanding coefficients,
graphs and feature importance.
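
The statistical analysis step outlined above can be sketched with a simple least-squares line and the Pearson correlation coefficient. This is only an illustration in Python, not the model fitted in this project: `fit_line` is a hypothetical helper and the hours and scores are made-up values, not the survey data.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept, plus Pearson correlation r."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical weekly self-study hours and scores (NOT the survey data).
hours = [2, 5, 8, 10, 14, 20]
score = [21.0, 22.5, 23.0, 23.5, 24.0, 25.5]

slope, intercept, r = fit_line(hours, score)
predicted = intercept + slope * 12  # predicted score for 12 hours per week
print(round(slope, 3), round(intercept, 3), round(r, 3), round(predicted, 2))
```

The slope estimates how much the score changes per extra hour of self-study, and r measures how strongly the two variables move together.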

Problem 1: From the descriptive statistics, the average self-study time of
first-year students at DUT is about 10 hours per week. We claim that the true
average time students spend on self-study (denoted by μ) exceeds 10 hours
per week. We carry out a test of hypotheses to see whether the data support
this claim at a significance level of 0.05.
● Since the survey covers 101 students, which exceeds the rule-of-thumb
sample size of 40, the CLT can be applied. Thus, the sample mean has
approximately a normal distribution.
● Because our test is about the population mean μ and the sample mean
is approximately normal, the test falls under Case II:
Large-Sample Tests.
● Steps for testing hypotheses about the true average self-study time of
first-year students at DUT:
1. Parameter of interest: μ = true average self-study time per week
2. Null hypothesis: H0: μ = 10
3. Alternative hypothesis: Ha: μ > 10 ⇒ upper-tailed test
4. Formula for test statistic value: z = (x̄ - 10) / (s / √n)
5. With significance level 0.05, reject H0 when z ≥ 1.645
6. With n = 101, x̄ = 10.245 and s = 10.21: z ≈ 0.24
7. Comment: Since z ≈ 0.24 is less than 1.645, we fail to reject the null
hypothesis. The data do not provide sufficient evidence that the true
average self-study time exceeds 10 hours per week.
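
The arithmetic of this upper-tailed large-sample test can be re-checked with a few lines of Python. The sketch uses only the summary values reported above (n = 101, sample mean 10.245, s = 10.21); the variable names are our own, and the p-value is an extra quantity not reported in this project.

```python
from math import sqrt
from statistics import NormalDist

# Summary values reported in Problem 1.
n, xbar, s = 101, 10.245, 10.21
mu0, alpha = 10.0, 0.05

z = (xbar - mu0) / (s / sqrt(n))          # test statistic, about 0.24
z_crit = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value, about 1.645
p_value = 1 - NormalDist().cdf(z)         # upper-tail p-value
reject = z >= z_crit

print(round(z, 3), round(z_crit, 3), round(p_value, 3), reject)
```

Because z is far below the critical value, the null hypothesis is not rejected.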

Problem 2: Find the 95% confidence interval for the mean self-study time per
week of FAST’s students (21ES + 21ECE).
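
A large-sample 95% confidence interval for the mean has the form x̄ ± z(0.025)·s/√n. The Python sketch below applies it to the summary values reported in Problem 1 (n = 101, sample mean 10.245, s = 10.21); the variable names are our own. The result should land close to the interval (8.24, 12.24) reported in the conclusion, with small differences possibly due to rounding.

```python
from math import sqrt
from statistics import NormalDist

# Summary values reported in Problem 1.
n, xbar, s = 101, 10.245, 10.21
confidence = 0.95

z_half = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96
margin = z_half * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin

print(round(lower, 2), round(upper, 2))
```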

Problem 3:
We state that the true average entering total scores of the 2 programs of
FAST are the same. Let’s carry out a test of hypotheses, using the data of
21ES and 21ECE, to see whether the data support this statement at a
significance level of 0.05.
1. Let μ1 denote the average total score of 21ES and μ2 the average
total score of 21ECE
2. Null hypothesis: H0: μ1 - μ2 = 0 (no difference in the true average
total entering score for the 2 programs)
3. Alternative hypothesis: Ha: μ1 - μ2 ≠ 0 ⇒ two-tailed test
4. Formula for test statistic value: z = (x̄ - ȳ) / √(σ1²/m + σ2²/n)
5. With significance level 0.05, reject H0 when z ≥ 1.96 or z ≤
-1.96
6. With m = 52, x̄ = 23.39, σ1² = 3.055; n = 49, ȳ = 23.33, σ2² = 3.006:
z = 0.176

7. Since -1.96 < z < 1.96, we fail to reject the null hypothesis at a
significance level of 0.05. So, the data show no statistically
significant difference between the entering total scores of 21ES and
21ECE.
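
The two-sample test statistic can be re-computed in Python from the rounded summary values quoted in step 6; the variable names are our own. With these rounded inputs z comes out near 0.17, slightly below the reported 0.176, which may have been computed from unrounded means. The decision is the same either way.

```python
from math import sqrt
from statistics import NormalDist

# Rounded summary values quoted in step 6.
m, xbar, var1 = 52, 23.39, 3.055   # 21ES
n, ybar, var2 = 49, 23.33, 3.006   # 21ECE
alpha = 0.05

z = (xbar - ybar) / sqrt(var1 / m + var2 / n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value, about 1.96
reject = abs(z) >= z_crit

print(round(z, 3), round(z_crit, 2), reject)
```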

Conclusion:
Based on the sample data of FAST’s freshmen (21ES and 21ECE), this project
gives a general picture of the students through the following variables: math
score, score2, score3, total score, self-study time per week, and so on. By
applying our knowledge of probability and statistics, and verifying the hand
calculations with RStudio, we can draw the following conclusions:
❖ Firstly, the sample data are presented with meaningful visual techniques
(histograms, pie charts, bar graphs, box plots, and so on) and some numerical
summary measures. For the inferential statistics, because a sample of about
100 students is a large sample, we apply the large-sample test case. From the
data, the sample mean is 10.245 and the standard deviation is 10.21. We test
the hypothesis that the average self-study time per week of a student is 10
hours and find no evidence to reject it. Moreover, the 95% confidence interval
is (8.24, 12.24).
❖ To sum up, this project helps us better understand the students around us.
