Assignment No. 3
Aim: To predict the annual sugar consumption in India in the coming years using power analysis, the
chi-square test, hypothesis testing, and regression analysis.
Problem statement: Determine a suitable sample size using power analysis, check whether the data follow
a normal distribution using the chi-square test, and predict the sugar consumption with respect to the
population in the coming years.
Objectives:
1) To learn different techniques for predicting the future trend of any data set.
2) To use power analysis to find the suitable sample size for the prediction.
3) To use the chi-square test to determine whether the data follow a normal distribution.
4) To use hypothesis testing to decide whether the hypothesis considered can be rejected.
5) To find the future trend of sugar consumption with respect to the population growth.
Data obtained: For the selected problem statement, population data since the year 1955 were taken
along with the annual consumption of sugar.
Methodology: First, the required sample size was found using power analysis in Minitab. Then a
chi-square test was performed to check whether the data follow a normal distribution. Next,
hypothesis testing was done using a t-test in Excel, which tells whether the hypothesis can be rejected.
Finally, regression analysis was done to predict the future sugar consumption with respect to
population growth.
A) Power Analysis
Power analysis is an important aspect of experimental design. It allows us to determine the sample
size required to detect an effect of a given size with a given degree of confidence. Conversely, it
allows us to determine the probability of detecting an effect of a given size with a given level of
confidence, under sample size constraints. If the probability is unacceptably low, we would be wise
to alter or abandon the experiment.
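The two directions described above (sample size for a target power, and power for a fixed sample size) can be sketched with a normal approximation. This is only an illustration: the effect size `delta` is a hypothetical value not given in the report, the test type (two-sided, one-sample) is an assumption, and Minitab bases its calculation on the t distribution, so its output may differ slightly.

```python
import math
from statistics import NormalDist


def sample_size_one_sample(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


def achieved_power(n, delta, sigma, alpha=0.05):
    """Approximate power for a fixed n (ignores the negligible far tail)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(delta * math.sqrt(n) / sigma - z_alpha)


sigma = 8.085  # standard deviation reported later in this assignment
delta = 5.0    # HYPOTHETICAL difference to detect; not stated in the report

n_required = sample_size_one_sample(delta, sigma)
power_at_25 = achieved_power(25, delta, sigma)
print("required n:", n_required)
print("power with n = 25:", round(power_at_25, 3))
```

A smaller `delta` or a higher target power increases the required sample size, which is the trade-off the paragraph above describes.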
The following screenshot of the Minitab window shows the output of the calculated sample
size.
Based on these calculations, a sample size of 25 was selected; the data required for the study
follow.
B) Chi-square Test
Mean: 15.36
Std dev: 8.085
Calculated chi-square: 13.1835069
Chi-square from table: 13.277 for alpha = 0.01
Here, since the calculated value of chi-square (13.18) is less than the value obtained from the
chi-square table (13.277), we fail to reject the hypothesis of normality and can treat the selected
data as following a normal distribution.
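The comparison above can be sketched as follows. The bin edges and observed counts are hypothetical, since the report does not list the binned data; only the mean, the standard deviation and the table value 13.277 come from the report. In a standard chi-square table, 13.277 at alpha = 0.01 corresponds to 4 degrees of freedom.

```python
from statistics import NormalDist


def expected_normal_counts(edges, mean, std_dev, n):
    """Expected count per bin under Normal(mean, std_dev).

    `edges` are the interior cut points; the first and last bins are open-ended.
    """
    dist = NormalDist(mean, std_dev)
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


edges = [2.5, 7.5, 12.5, 17.5, 22.5, 27.5]  # HYPOTHETICAL cut points (7 bins)
observed = [2, 3, 5, 5, 5, 3, 2]            # HYPOTHETICAL counts, total 25

expected = expected_normal_counts(edges, mean=15.36, std_dev=8.085, n=25)
statistic = chi_square_statistic(observed, expected)
critical = 13.277  # table value quoted in the report for alpha = 0.01

print("chi-square:", round(statistic, 3))
print("fail to reject normality:", statistic < critical)
```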
C) Hypothesis Testing
                Variable 1     Variable 2
Mean            15.35804       15.35804
Variance        65.3625255     65.36252546
Observations    25             25
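Since the P one-tail and P two-tail values are both greater than alpha (0.05), we fail to reject
our hypothesis. Equivalently, the absolute value of t Stat is smaller than the corresponding
t Critical values. Failing to reject does not prove the hypothesis; it only means the data give
no evidence against it.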
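As a rough check, the t statistic can be recomputed from the summary values in the table. The report does not say which Excel t-test variant was used, so this sketch assumes the two-sample test with equal variances; the function name is illustrative.

```python
import math


def pooled_t_statistic(mean1, var1, n1, mean2, var2, n2):
    """t statistic and degrees of freedom for a two-sample, equal-variance t-test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


# Summary values from the report (mean, variance, observations).
t_stat, df = pooled_t_statistic(15.35804, 65.36252, 25,
                                15.35804, 65.36252, 25)
print("t Stat:", t_stat, "df:", df)
```

Because the two means are identical, the numerator is zero and so is the t statistic, which is why the null hypothesis cannot be rejected at any usual significance level.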
1) Left tail test:
Here the rejection value comes out to be 2.059. Hence, sugar
consumption values below this fall in the rejection region.
2) Right tail test:
Here the rejection value comes out to be 28.66. Hence, sugar
consumption values above this fall in the rejection region.
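The report does not show how the two rejection values were derived. They are consistent with mean ± z × standard deviation at a one-tailed alpha of 0.05, so the sketch below assumes that formula, using the mean and standard deviation reported earlier.

```python
from statistics import NormalDist

mean = 15.35804  # sample mean from the report
std_dev = 8.085  # sample standard deviation from the report
alpha = 0.05     # ASSUMED one-tailed significance level

z = NormalDist().inv_cdf(1 - alpha)   # about 1.645
left_cutoff = mean - z * std_dev      # left tail rejection value
right_cutoff = mean + z * std_dev     # right tail rejection value

print("left tail cutoff:", round(left_cutoff, 3))
print("right tail cutoff:", round(right_cutoff, 2))
```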
D) Regression Analysis
Regression analysis is a reliable method of identifying which variables have an impact on a topic of
interest. Performing a regression helps you determine which factors matter most, which factors
can be ignored, and how these factors influence each other.
In order to understand regression analysis fully, it’s essential to comprehend the following terms:
Dependent Variable: This is the main factor that you’re trying to understand or predict.
Independent Variables: These are the factors that you hypothesize have an impact on your
dependent variable.
In our problem statement, the independent variable is the population while the dependent variable
is the annual consumption of sugar. The dependency of the sugar consumption on the population
growth can be seen.
Regression analysis was done in Excel using the Data Analysis tool, and the SUMMARY OUTPUT reproduced in this section was obtained.
Multiple R: It is the correlation coefficient, which measures the strength of the linear relationship between two
variables. Our Multiple R is 0.9868, which is close to 1, so there is a strong linear relationship between our
variables.
R Square: It is the coefficient of determination, which is used as an indicator of goodness of fit. Our R Square
value is 0.9738, meaning that roughly 97.38% of the variation in sugar consumption is explained by population.
Adjusted R Square: It is the R Square adjusted for the number of independent variables in the model. We do
not rely on it here, as we have only one independent variable, so its value of 0.9727 adds little information.
Standard Error: It shows the precision of the regression analysis. The smaller the number, the more certain
we can be about our regression equation. Our standard error is 1.34, which is small relative to the mean
consumption of 15.36.
Observations: It is simply the number of observations in our model. Our model has 25 observations.
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.986835972
R Square             0.973845236
Adjusted R Square    0.972708072
Standard Error       1.335615711
Observations         25
ANOVA
              df    SS          MS          F              Significance F
Regression    1     1527.672    1527.672    856.3808983    1.05556E-19
Residual      23    41.02899    1.783869
Total         24    1568.701
ANOVA
The second part of the output is Analysis of Variance (ANOVA) which basically splits the sum of squares
into individual components that give information about the levels of variability within our regression
model:
df is the number of degrees of freedom associated with each source of variance. df for regression is
k = 1, for residual it is (n-k-1) = 23, and for total it is (n-1) = 24, where n = 25 and k is the number of independent variables.
SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the better our model fits
the data. SSR = 1527.672 is the variation of the fitted values around the overall mean 15.358, i.e. the variation
explained by the regression. SSE = 41.02899 is the variation of the observed responses around the fitted line.
SST = SSR + SSE = 1568.7 is the total variation of all n observations.
MS is the mean square. MSR = SSR/k = 1527.672 is the explained variation per regression degree of freedom, and
MSE = SSE/(n-k-1) = 1.784 estimates the variance of the errors around the fitted line.
F is the F statistic, or F-test for the null hypothesis, computed as MSR/MSE. It is used to test the overall significance of the model.
Our model gives F = 856.38 with Significance F = 1.06E-19, which is far below 0.05, so the model is significant.
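The relationships between the quantities in the summary output can be checked with a few lines of arithmetic. The inputs below are the SSR, SSE, n and k values from the report; the variable names are illustrative.

```python
import math

n, k = 25, 1                # observations and independent variables
ss_regression = 1527.672    # SSR from the ANOVA table
ss_residual = 41.02899      # SSE from the ANOVA table

ss_total = ss_regression + ss_residual
df_residual = n - k - 1

ms_regression = ss_regression / k
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)  # magnitude of R for a single predictor
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
standard_error = math.sqrt(ms_residual)

print("F:", round(f_statistic, 2))
print("R Square:", round(r_square, 4))
print("Multiple R:", round(multiple_r, 4))
print("Adjusted R Square:", round(adjusted_r_square, 4))
print("Standard Error:", round(standard_error, 4))
```

The results agree with the Excel output up to the rounding of the sums of squares.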
From the obtained result, the fitted line takes the form Y = mx + c, where x is the population, Y is the
annual sugar consumption, m is the slope and c is the intercept.
Here, the future values of Y can be obtained by varying the value of x in the equation. Hence, by varying
the population value, the sugar consumption can be obtained.
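A minimal sketch of fitting such a line by least squares and using it for prediction is given below. The population and sugar figures are hypothetical placeholders with made-up units, not the data used in this assignment, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x):
    """Predicted y for a given x on the fitted line."""
    return m * x + c


# HYPOTHETICAL data: population and annual sugar consumption.
population = [400.0, 550.0, 700.0, 850.0, 1000.0]
sugar = [4.0, 7.0, 10.0, 13.0, 16.0]

m, c = fit_line(population, sugar)
future_population = 1100.0  # HYPOTHETICAL projected population
print("slope:", m, "intercept:", c)
print("predicted consumption:", predict(m, c, future_population))
```

With the real data, `m` and `c` should match the coefficients reported by Excel, and predictions for the coming years follow from projected population values.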
Conclusion:
1) Different data analysis techniques were studied and applied to the given problem statement.
2) Power analysis was done to find an appropriate sample size.
3) The chi-square test was carried out, and the data were found to be consistent with a normal distribution.
4) Hypothesis testing was done using a t-test, and the hypothesis could not be rejected.
5) Regression analysis was done to find the equation of the line fitted to the data points.
6) Sugar consumption for the next 5 years was predicted using this line.