
MAS291 Final Project

Group: 7

Members:

Nguyễn Thái Nguyên - QS170069; Lương Hoàng Duy - DE170114; Trần Đinh Khang - DS170082; Bùi Nữ Vân Nhi - DA160062

I. Dataset:

The dataset provided by the teacher contains two variables: the weight (kg) and height (cm) of 30 female students. Here is an example of the dataset:

In this project we conduct descriptive and inferential statistical analysis of the given data, and then introduce a regression model to describe the mathematical relationship between the two variables.

II. Descriptive and Inferential Statistical Analysis:

1. Descriptive Analysis:

Using the Data Analysis tool provided by Microsoft Excel, a descriptive analysis table was generated.

Statistic                   Weight (kg)    Height (cm)
Mean                        59.73333       165.6667
Standard Error              1.231701       1.132928
Median                      59.5           166
Mode                        58             170
Standard Deviation          6.746306       6.2053
Sample Variance             45.51264       38.50575
Kurtosis                    0.524973       -0.12615
Skewness                    0.837299       -0.10425
Range                       27             25
Minimum                     50             154
Maximum                     77             179
Sum                         1792           4970
Count                       30             30
Confidence Level (95.0%)    2.519112       2.317097
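Several rows of the table are tied together by simple identities: the standard error is the standard deviation divided by the square root of the count, the sample variance is the square of the standard deviation, and the range is the maximum minus the minimum. The following is a minimal Python sketch that recomputes these rows from the values reported by Excel. The dictionary layout and the function name `recompute` are illustrative assumptions, not part of the original analysis.

```python
import math

# Values copied from the Excel descriptive table (weight in kg, height in cm).
N = 30
SUMMARY = {
    "weight": {"stdev": 6.746306, "minimum": 50, "maximum": 77},
    "height": {"stdev": 6.2053, "minimum": 154, "maximum": 179},
}


def recompute(stats, n):
    """Recompute standard error, sample variance and range from the reported values."""
    return {
        "std_error": stats["stdev"] / math.sqrt(n),    # s / sqrt(n)
        "variance": stats["stdev"] ** 2,               # s^2
        "range": stats["maximum"] - stats["minimum"],  # max - min
    }


for name, stats in SUMMARY.items():
    derived = recompute(stats, N)
    print(name, {key: round(value, 4) for key, value in derived.items()})
```

The recomputed values should agree with the Standard Error, Sample Variance and Range rows of the table up to rounding.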

a. Variable Weight (kg):

The average weight of the students was found to be approximately 59.73 kg, with a standard

deviation of 6.75 kg, indicating a moderate amount of variability in the weights. The mode is 58

kg, indicating that it is the weight that appears most often in the dataset.

The skewness value of 0.84 indicates that the weight distribution is positively skewed. This

means that the tail of the distribution extends more towards higher weights, suggesting that there

might be a few students with relatively higher weights compared to the majority.

The range of weights observed in the dataset was 27 kg, from a minimum of 50 kg to a maximum of 77 kg. The sample variance, a measure of dispersion, was calculated to be approximately 45.51 kg^2.


[Figure: Histogram of weight. X-axis: Bin (50; 55.4; 60.8; 66.2; 71.6; More). Y-axis: Frequency, with Cumulative % on the secondary axis.]

The majority of students have weights ranging from 55.4 kg to 71.6 kg, with the highest frequency occurring within the 55.4 kg to 60.8 kg range. Most observations are concentrated at the lower weights, with lower frequencies in the higher weight ranges, which is consistent with the positive skewness reported above.

b. Variable Height (cm):

The average height of the students was calculated to be approximately 165.67 cm, with a

standard deviation of 6.21 cm. This suggests that the heights of the students varied around the

mean, indicating a moderate level of diversity within the group. The height distribution exhibited

a nearly symmetrical shape, as evidenced by a small negative skewness value.

The analysis revealed that the range of heights in the dataset spanned 25 cm, with the shortest

height recorded at 154 cm and the tallest at 179 cm. The most common height among the

students was 170 cm, reflecting a frequently occurring value within the group. The sample

variance, a measure of height dispersion, was computed to be approximately 38.51 cm^2,

providing an understanding of the spread of heights around the mean.


[Figure: Histogram of height. X-axis: Bin (154; 159; 164; 169; 174; More). Y-axis: Frequency, with Cumulative % on the secondary axis.]

The most prevalent range of heights is 164 cm to 169 cm. Frequencies are lower in the taller height ranges, and the small skewness value (-0.10) indicates that the distribution is close to symmetric.

2. Inferential Statistics:

Hypothesis Testing and Confidence Interval of the mean of a population

Population: All female students

Sample: This dataset

Analysis: Construct a 95% confidence interval (alpha = 5%) for the average height of all female students. Here we use the t-distribution because the population standard deviation sigma is unknown.

C.I for average height of all female students

Use t-distribution (sigma is unknown)
alpha                      5%
n                          30
sample mean x              165.67
sample stdev s             6.21
t(alpha/2, n-1)            2.05
left bound                 163.35
right bound                167.98
C.I for average height:    163.35 <= average height <= 167.98

After the analysis, we can conclude that a 95% confidence interval for the average height of all female students, based on this data, is (163.35, 167.98) cm.
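The bounds follow from the formula x ± t(alpha/2, n-1) * s / sqrt(n). Below is a minimal Python sketch of that calculation from the summary statistics. The function name `t_confidence_interval` is an illustrative assumption, and the critical value 2.045 is taken from a standard t-table for 29 degrees of freedom (the report rounds it to 2.05) rather than computed.

```python
import math


def t_confidence_interval(mean, stdev, n, t_crit):
    """Two-sided confidence interval for a mean when sigma is unknown."""
    margin = t_crit * stdev / math.sqrt(n)  # t * s / sqrt(n)
    return mean - margin, mean + margin


# Sample statistics from the descriptive table; t(0.025, 29) from a t-table.
lower, upper = t_confidence_interval(mean=165.6667, stdev=6.2053, n=30, t_crit=2.045)
print(f"{lower:.2f} <= average height <= {upper:.2f}")
```

The margin of error here, about 2.32 cm, matches the "Confidence Level (95.0%)" row that Excel reports for height.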

Research question for Hypothesis Testing: Average Height of all female students is 164 cm.

Test the claim with significance level of 10% based on the data.

Hypotheses:

Null Hypothesis (H0): The average height of all female students is 164 cm.

Alternative Hypothesis (H1): The average height of all female students is not 164 cm.

alpha                      10%
n                          30
sample mean x              165.67
sample stdev s             6.21
mean0                      164
Test statistic t0          1.471115
t(alpha/2, n-1)            1.70
-t(alpha/2, n-1)           -1.70

Because t0 = 1.47 lies within the acceptance range (-1.70, 1.70), we fail to reject the null hypothesis. There is not enough evidence to conclude that the average height of all female students differs from 164 cm.
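The test statistic is t0 = (x - mean0) / (s / sqrt(n)), and the null hypothesis is rejected only when |t0| exceeds the critical value. A minimal Python sketch of this decision rule follows. The function name `one_sample_t` is an illustrative assumption, and the critical value 1.699 is read from a standard t-table for 29 degrees of freedom at alpha/2 = 0.05.

```python
import math


def one_sample_t(mean, mu0, stdev, n):
    """Test statistic for H0: mu = mu0 when sigma is unknown."""
    return (mean - mu0) / (stdev / math.sqrt(n))


T_CRIT = 1.699  # t(0.05, 29) from a t-table, for a two-sided test at alpha = 10%

t0 = one_sample_t(mean=165.6667, mu0=164, stdev=6.2053, n=30)
reject_h0 = abs(t0) > T_CRIT
print(f"t0 = {t0:.4f}, reject H0: {reject_h0}")
```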

Hypothesis Testing for population proportion P

Population: All female students

Sample: This data

Analysis: Test the claim that the percentage of female students in the world with weight under 65 kg is 60%, at the 10% significance level.

Hypotheses:

Null Hypothesis (H0): The percentage of female students in the world with weight under 65 kg is equal to 60%.

Alternative Hypothesis (H1): The percentage of female students in the world with weight under 65 kg is not equal to 60%.

p hat                            0.80
test statistic z0                2.24
z(alpha/2) = right critical      1.645
-z(alpha/2) = left critical      -1.645

Because z0 = 2.24 falls outside the acceptance range (-1.645, 1.645), we reject the null hypothesis (H0). The data provide evidence that the percentage of female students in the world with weight under 65 kg is different from 60%.
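The test statistic for a proportion is z0 = (p hat - p0) / sqrt(p0 * (1 - p0) / n). The sketch below reproduces it in Python using only the standard library. The function name `one_proportion_z` is an illustrative assumption; for a two-sided test at the 10% level the standard normal critical value is about 1.645, and the rejection decision is the same with the stricter 5% value of 1.96.

```python
import math
from statistics import NormalDist


def one_proportion_z(p_hat, p0, n):
    """Test statistic for H0: p = p0 using the normal approximation."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


ALPHA = 0.10
std_normal = NormalDist()  # standard normal distribution: mean 0, sd 1

z0 = one_proportion_z(p_hat=0.80, p0=0.60, n=30)
z_crit = std_normal.inv_cdf(1 - ALPHA / 2)          # two-sided critical value
p_value = 2 * (1 - std_normal.cdf(abs(z0)))         # two-sided p-value
reject_h0 = abs(z0) > z_crit
print(f"z0 = {z0:.2f}, critical = {z_crit:.3f}, reject H0: {reject_h0}")
```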

III. Constructing Simple Linear Regression and Analyzing Result:

For Linear Regression analysis, we set the variable height(cm) as “X” and variable weight(kg) as

“Y”. Using Regression Analysis from Analysis Tool in Excel, a summary output was generated:

Regression Statistics
Multiple R            0.875052
R Square              0.765716
Adjusted R Square     0.757349
Standard Error        3.323203
Observations          30

The multiple correlation coefficient (R) indicates the strength of the linear relationship between the predictor variable (height) and the response variable (weight). In this case, the multiple R value is approximately 0.88. Together with the positive slope of the fitted line, this suggests a strong positive correlation between height and weight.

The R Square value is approximately 0.77, indicating that around 77% of the variability in weight can be explained by height in the regression model.

The adjusted R Square takes into account the number of predictor variables and the sample size to provide a more conservative measure of the proportion of variance explained. The adjusted R Square value of approximately 0.76 suggests that height explains about 76% of the variance in weight, considering the model's complexity and sample size.
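The adjusted value can be recovered from R Square with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors (here p = 1). The Python sketch below checks this against the Excel output; the function name `adjusted_r_square` is an illustrative assumption.

```python
import math


def adjusted_r_square(r_square, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)


R_SQUARE = 0.765716  # from the Regression Statistics table
N_OBS = 30
N_PREDICTORS = 1     # height is the only predictor

multiple_r = math.sqrt(R_SQUARE)
adj_r_square = adjusted_r_square(R_SQUARE, N_OBS, N_PREDICTORS)
print(f"Multiple R = {multiple_r:.6f}, Adjusted R Square = {adj_r_square:.6f}")
```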

The standard error represents the typical deviation of the observed values from the regression line. In this case, the standard error is approximately 3.32 kg. It provides an estimate of the typical distance between the actual weights and the weights predicted by the regression model.

The number of observations indicates the sample size used in the regression analysis. In this

case, the analysis is based on 30 observations.

ANOVA table:

An ANOVA (analysis of variance) was performed to assess the significance of the regression model. The results showed that the regression model was highly significant (p < 0.001), indicating that the predictor variable (height) has a strong linear association with the response variable (weight).

Further examination of the coefficients showed that the intercept was significant and negative, with a value of approximately -97.87. The predictor variable (referred to as X Variable 1, the height) had a significant positive coefficient of approximately 0.95. This means that for every 1 cm increase in height, the predicted weight increases by approximately 0.95 kg.

From the regression analysis above, we can derive a formula for predicting Y (weight, kg) from X (height, cm): Y = 0.951343*X - 97.87254

Example: Given a student in the class whose height is 170 cm, predict her weight.

Answer: Her predicted weight is Y = 0.951343*170 - 97.87254 = 63.8557 (kg)
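The same prediction can be written as a small Python function. This is only a sketch using the coefficients reported by Excel; the function name `predict_weight` is an illustrative assumption, and the formula should only be trusted for heights within the observed range of 154 cm to 179 cm.

```python
SLOPE = 0.951343       # kg per cm, coefficient of X Variable 1
INTERCEPT = -97.87254  # kg


def predict_weight(height_cm):
    """Predicted weight (kg) from height (cm) using the fitted regression line."""
    return SLOPE * height_cm + INTERCEPT


print(f"Predicted weight at 170 cm: {predict_weight(170):.4f} kg")
```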


Based on this formula, we checked whether the regression model can correctly predict the students' weights from their heights. The line fit plot is given below.

[Figure: X Variable 1 Line Fit Plot. X-axis: X Variable 1 (150 to 185). Y-axis: Y (0 to 100). Series: Y, Predicted Y.]

As we can see, the predicted values are quite close to the observed values. This means that the regression model describes the relationship between the weights and heights of the 30 examined female students well.

IV. Conclusion:

In conclusion, this data analysis project focused on examining the heights and weights of 30

female students. The project utilized descriptive statistics to summarize and analyze the data,

inferential statistics to draw conclusions and make predictions, and visual representations to

effectively communicate the findings.

The descriptive analysis revealed that the average height of the female students was

approximately 165.67 cm, with a standard deviation of 6.21 cm, indicating a moderate level of

variability. The weight data showed an average weight of approximately 59.73 kg, with a

standard deviation of 6.75 kg. The height distribution was nearly symmetric, while the weight distribution was positively skewed.

Inferential statistics were employed to explore relationships and determine the statistical significance of certain variables. The regression results indicated a strong positive correlation between height and weight.

Visual representations, including graphs, tables, and frequency distributions, were created to

enhance the communication of the findings. These visuals aided in effectively presenting the

data, allowing for easier interpretation and comprehension.

Overall, this project successfully analyzed the heights and weights of the female students,

providing valuable insights into their distributions, relationships, and characteristics. The

findings can inform further research, interventions, or decision-making processes related to

height and weight management among female students.
