
MODELLING AND SIMULATION LABORATORY

Assignment No. 3

Aniket A. Pawar (MIS: 111613048)
Piyush C. Pawar (MIS: 111613049)

Aim: To predict the annual sugar consumption in India in the coming years using power analysis, the
chi-square test, hypothesis testing, and regression analysis.

Problem statement: Determine a suitable sample size using power analysis, check whether the data follow
a normal distribution using the chi-square test, and predict the sugar consumption with respect to the
population in the coming years.

Objectives:

1) To learn different techniques for predicting the future trend of any data set.
2) To use power analysis to find the suitable sample size for the prediction.
3) To use the chi-square test to determine whether the data follow a normal distribution.
4) To use hypothesis testing to determine whether the hypothesis considered can be rejected or not.
5) To find the future trend of sugar consumption with respect to the population growth.

Data obtained: For the selected problem statement, population data from the year 1955 onwards were taken
along with the annual consumption of sugar.

Methodology: First, the required sample size was found using power analysis in Minitab. Then the
chi-square test was performed to check whether the data follow a normal distribution. Next, hypothesis
testing was done using a t-test in Excel, which indicates whether the hypothesis can be rejected or not.
Finally, regression analysis was done to predict the future sugar consumption with respect to the
population growth.

Approach towards the solution:

A) Power Analysis
Power analysis is an important aspect of experimental design. It allows us to determine the sample
size required to detect an effect of a given size with a given degree of confidence. Conversely, it
allows us to determine the probability of detecting an effect of a given size with a given level of
confidence, under sample size constraints. If the probability is unacceptably low, we would be wise
to alter or abandon the experiment.
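As a rough illustration of the sample-size calculation described above, the sketch below uses the normal approximation for a two-sided one-sample test, n = ((z_(1-alpha/2) + z_power) / d)^2. The effect size d, alpha and power shown here are illustrative assumptions, not the settings entered in Minitab, and a t-based calculation such as Minitab's may return a slightly larger n.

```python
import math
from statistics import NormalDist


def sample_size(effect_size, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample test (normal approximation).

    effect_size is the standardized difference (difference / standard deviation).
    The default alpha and power are illustrative assumptions.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the chosen alpha
    z_power = z.inv_cdf(power)            # quantile for the target power
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect size of 0.5 standard deviations.
    print(sample_size(0.5))
```

A smaller effect size or a higher target power increases the required sample size, which is the trade-off the paragraph above describes.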
[Figure: screenshot of the Minitab window showing the output of the calculated sample size.]

Based on these calculations, a sample size of 25 was selected. The data used for the study are
given below.

YEAR POPULATION (x) SUGAR CONSUMPTION (y, million tons)
1955 409269055 1.703
1960 449480608 2.113
1965 497702365 2.81
1970 553578513 4.025
1975 621301720 3.689
1980 696783517 4.98
1985 781666671 8.353
1990 870133480 10.715
1995 960482795 13.172
2000 1014004000 16.2
2001 1045845000 16.781
2002 1049700000 18.384
2003 1053050912 17.285
2004 1065071000 18.5
2005 1144118674 18.5
2006 1156230654 19.9
2007 1170684298 21.9
2008 1185546782 22.912
2009 1215684658 21.328
2010 1230980691 20.769
2011 1232556498 22
2012 1249564548 23
2013 1270698578 24.427
2014 1286214569 25.655
2015 1309053980 24.85
B) Chi-Square test
In the standard applications of this test, the observations are classified into mutually exclusive
classes, and there is some theory, or say null hypothesis, which gives the probability that any
observation falls into the corresponding class. The purpose of the test is to evaluate how likely the
observations that are made would be, assuming the null hypothesis is true.
Chi-squared tests are often constructed from a sum of squared errors, or through the sample
variance. Test statistics that follow a chi-squared distribution arise from an assumption of
independent normally distributed data, which is valid in many cases due to the central limit
theorem. A chi-squared test can be used to attempt rejection of the null hypothesis that the data
are independent.
The chi-square test was done on the y-variable data as follows:

class      observed freq  observed pdf  observed cdf  normal cdf  normal pdf  expected freq  chi-square
1 to 5     6              0.24          0.24          0.100029    0.100029    2.500729643    4.796528
6 to 10    2              0.08          0.32          0.253679    0.15365     3.841253806    0.88258047
11 to 15   1              0.04          0.36          0.482242    0.228563    5.714071661    3.189078
16 to 20   8              0.32          0.68          0.716983    0.234741    5.868520833    0.774165
21 to 25   8              0.32          1             0.883434    0.166451    4.161282851    3.54115543

Calculated chi-square: 13.1835069
Chi-square from table: 13.277 for alpha = 0.01 (4 degrees of freedom)
Mean: 15.36
Std dev: 8.085

Here, since the calculated value of chi-square is less than the value obtained from the chi-square
table, we fail to reject the null hypothesis and conclude that the selected data are consistent with a
normal distribution.
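The expected (normal) columns of the chi-square table can be reproduced with a short sketch like the one below. It assumes the reported mean 15.36 and standard deviation 8.085, treats 5, 10, 15, 20 and 25 as the upper class limits, and lets the first class absorb the whole lower tail, which is what the tabulated values suggest. The function names are illustrative.

```python
import math

MEAN, STD_DEV, N = 15.36, 8.085, 25   # values reported with the table
UPPER_BOUNDS = [5, 10, 15, 20, 25]    # assumed upper limit of each class


def normal_cdf(x, mean, std_dev):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))


def expected_frequencies(bounds, mean, std_dev, n):
    """Expected count per class: n times the difference of successive CDF values."""
    cdfs = [normal_cdf(b, mean, std_dev) for b in bounds]
    # The first class takes everything below its upper limit (assumption).
    probs = [cdfs[0]] + [hi - lo for lo, hi in zip(cdfs, cdfs[1:])]
    return [n * p for p in probs]


if __name__ == "__main__":
    freqs = expected_frequencies(UPPER_BOUNDS, MEAN, STD_DEV, N)
    for bound, freq in zip(UPPER_BOUNDS, freqs):
        print(f"up to {bound}: expected frequency {freq:.4f}")
```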

C) Hypothesis testing (t-testing)


A t-test is a type of inferential statistic used to determine if there is a significant difference between
the means of two groups, which may be related in certain features. It is mostly used when the data
sets, like the data set recorded as the outcome from flipping a coin 100 times, would follow a
normal distribution and may have unknown variances. A t-test is used as a hypothesis testing tool,
which allows testing of an assumption applicable to a population.

Null hypothesis : The variable y depends upon the variable x.


Alternate hypothesis: The variable y does not depend upon the variable x.
t-Test: Two-Sample Assuming Unequal Variances

                              Variable 1    Variable 2
Mean                          15.35804      15.35804
Variance                      65.3625255    65.36252546
Observations                  25            25
Hypothesized Mean Difference  0
Df                            48
t Stat                        0
P(T<=t) one-tail              0.5
t Critical one-tail           1.6772242
P(T<=t) two-tail              1
t Critical two-tail           2.01063472

Since the t Stat value (0) is smaller than both the one-tail and two-tail critical values of t, we
fail to reject our hypothesis.
Equivalently, since the one-tail and two-tail P values (0.5 and 1) are greater than alpha (0.05), we
fail to reject the hypothesis.
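The t Stat and Df entries of the output can be checked from the summary statistics alone. The sketch below applies the usual Welch formulas for two samples with unequal variances; that Excel's tool uses exactly these formulas is an assumption, and the variance is rounded to 65.3625 for both variables.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    se1, se2 = var1 / n1, var2 / n2   # squared standard error of each mean
    t_stat = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


if __name__ == "__main__":
    # Mean, variance (rounded) and number of observations for both variables.
    t_stat, df = welch_t(15.35804, 65.3625, 25, 15.35804, 65.3625, 25)
    print(t_stat, round(df))
```

Because both variables have the same mean, the numerator is zero, so t Stat is 0 regardless of the variances.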

1) Left tail test:

Here the rejection value comes out to be 2.059. Hence sugar
consumption below this value will be rejected.
2) Right tail test:

Here the rejection value comes out to be 28.66. Hence sugar
consumption above this value will be rejected.

3) Two tail test:

Here, sugar consumption between -0.4883 and 31.2 will be accepted.
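The three cut-off values are consistent with mean ± z × standard deviation at alpha = 0.05, using the sample mean and variance of y. That the values were obtained in exactly this way is an assumption; the sketch below only shows that the formula gives figures close to those reported.

```python
from statistics import NormalDist

MEAN = 15.35804             # sample mean of y
STD_DEV = 65.3625 ** 0.5    # square root of the sample variance of y
ALPHA = 0.05                # assumed significance level

z = NormalDist()                      # standard normal distribution
z_one = z.inv_cdf(1 - ALPHA)          # one-tail critical value, about 1.645
z_two = z.inv_cdf(1 - ALPHA / 2)      # two-tail critical value, about 1.960

left_cutoff = MEAN - z_one * STD_DEV              # left tail test
right_cutoff = MEAN + z_one * STD_DEV             # right tail test
two_tail = (MEAN - z_two * STD_DEV,               # two tail test, lower limit
            MEAN + z_two * STD_DEV)               # two tail test, upper limit

print(f"left tail cut-off:  {left_cutoff:.3f}")
print(f"right tail cut-off: {right_cutoff:.3f}")
print(f"two tail limits:    {two_tail[0]:.3f} to {two_tail[1]:.3f}")
```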


D) Regression analysis

Regression analysis is a reliable method of identifying which variables have impact on a topic of
interest. The process of performing a regression allows you to confidently determine which factors
matter most, which factors can be ignored, and how these factors influence each other.

In order to understand regression analysis fully, it’s essential to comprehend the following terms:

• Dependent Variable: This is the main factor that you’re trying to understand or predict.
• Independent Variables: These are the factors that you hypothesize have an impact on your
dependent variable.

In our problem statement, the independent variable is the population while the dependent variable
is the annual consumption of sugar. The dependency of the sugar consumption on the population
growth can be seen.
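As a cross-check outside Excel, the least-squares line of sugar consumption on population can be fitted directly from the 25 data points of section A. This is a minimal sketch of ordinary least squares with one independent variable; the function name fit_line is illustrative, and the result should come out close to the coefficients reported in the regression output.

```python
# Population (x) and sugar consumption in million tons (y), 1955 to 2015.
X = [409269055, 449480608, 497702365, 553578513, 621301720,
     696783517, 781666671, 870133480, 960482795, 1014004000,
     1045845000, 1049700000, 1053050912, 1065071000, 1144118674,
     1156230654, 1170684298, 1185546782, 1215684658, 1230980691,
     1232556498, 1249564548, 1270698578, 1286214569, 1309053980]
Y = [1.703, 2.113, 2.81, 4.025, 3.689,
     4.98, 8.353, 10.715, 13.172, 16.2,
     16.781, 18.384, 17.285, 18.5, 18.5,
     19.9, 21.9, 22.912, 21.328, 20.769,
     22, 23, 24.427, 25.655, 24.85]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean   # the line passes through the means
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = fit_line(X, Y)
    print(f"slope = {slope:.5e}, intercept = {intercept:.4f}")
```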

Regression analysis was done in Excel using the Data Analysis tool, and the output shown below was obtained.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.986835972
R Square            0.973845236
Adjusted R Square   0.972708072
Standard Error      1.335615711
Observations        25

ANOVA
             df   SS         MS         F             Significance F
Regression   1    1527.672   1527.672   856.3808983   1.05556E-19
Residual     23   41.02899   1.783869
Total        24   1568.701

               Coefficients   Standard Error   t Stat     P-value       Lower 95%      Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      -11.8618252    0.967746         -12.2572   1.44588E-11   -13.86375973   -9.85989    -13.8638      -9.85989
X Variable 1   2.77534E-08    9.48E-10         29.26399   1.05556E-19   2.57915E-08    2.97E-08    2.58E-08      2.97E-08

Multiple R: It is the correlation coefficient, which measures the strength of the linear relationship between
two variables. Our Multiple R is 0.9868, which is close to 1, so there is a strong linear relationship between
our variables.
R Square: It is the coefficient of determination, which is used as an indicator of the goodness of fit. Our
R Square value is 0.9738, i.e. the fitted line explains roughly 97.38% of the variation in our y values.
Adjusted R Square: It is the R Square adjusted for the number of independent variables in the model. We do
not use it, as we have only one independent variable, so its value of 0.9727 adds little here.
Standard Error: It shows the precision of the regression analysis. The smaller the number, the more certain
we can be about our regression equation. Our standard error is 1.34, which is low.
Observations: It is simply the number of observations in our model. Our Observations are 25.
ANOVA
The second part of the output is Analysis of Variance (ANOVA) which basically splits the sum of squares
into individual components that give information about the levels of variability within our regression
model:

df is the number of degrees of freedom associated with each source of variance. df for regression is
k = 1, for residual it is (n-k-1) = 23, and for total it is (n-1) = 24, where n = 25 and k is the number of
independent variables.

SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the better our model fits
the data. SSR = 1527.672 is the variation of the fitted values around our overall mean 15.358. SSE = 41.02899
is the variation of the observed responses around the fitted values. SST = SSR + SSE = 1568.701 is the total
variation of all n observations around the overall mean.
MS is the mean square. MSR = SSR/1 = 1527.672 estimates the variance explained by the regression, and
MSE = SSE/(n-k-1) = 1.784 estimates the variance of the errors around the fitted line.
F is the F statistic, or F-test for the null hypothesis. It is used to test the overall significance of the model.
Our model gives F as 856.38 which is significant for us.
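The entries of the regression output are tied together by simple arithmetic, which can be verified from the two sums of squares alone. The sketch below recomputes MS, F, R Square, Adjusted R Square and the Standard Error using the standard definitions; the variable names are illustrative.

```python
import math

N, K = 25, 1                     # observations and independent variables
SSR, SSE = 1527.672, 41.02899    # regression and residual sums of squares

sst = SSR + SSE                  # total sum of squares
msr = SSR / K                    # mean square for regression
mse = SSE / (N - K - 1)          # mean square for residual
f_stat = msr / mse               # F statistic
r_squared = SSR / sst            # coefficient of determination
adj_r_squared = 1 - (1 - r_squared) * (N - 1) / (N - K - 1)
std_error = math.sqrt(mse)       # standard error of the regression

print(f"SST = {sst:.3f}, MSE = {mse:.6f}, F = {f_stat:.2f}")
print(f"R Square = {r_squared:.6f}, Adjusted R Square = {adj_r_squared:.6f}")
print(f"Standard Error = {std_error:.6f}")
```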

Significance F is the P-value of F.


The Significance F value gives an idea of how reliable (statistically significant) our results are. If
Significance F is less than 0.05 (5%), the model is acceptable. If it is greater than 0.05, we would probably
have to choose another independent variable. Our Significance F is very low, nearly 0, so we reject the null
hypothesis of the F-test, namely that y has no linear dependence on x.

From the obtained result, the equation of the line in the form Y = mX + c is given as:

Y = (2.77534 x 10^(-8)) X + (-11.8618252)

Here, the future values of y can be obtained by varying the value of x in equation. Hence, by varying
the population value, the sugar consumption can be obtained.

The predicted sugar consumption for the next five years is as follows:

YEAR POPULATION (x) PREDICTED SUGAR CONSUMPTION (y, million tons)
2016 1324171354 24.88
2017 1339180127 25.304
2018 1354051854 25.72
2019 1368737513 26.125
2020 1384658756 26.567
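The predictions follow from substituting each population value into the fitted line. A minimal sketch, using the slope and intercept from the regression output and the population figures listed for 2016 to 2020 (the function name predict_consumption is illustrative):

```python
SLOPE = 2.77534e-8        # coefficient of X Variable 1
INTERCEPT = -11.8618252   # intercept of the fitted line

# Population values listed for the next five years.
POPULATION = {
    2016: 1324171354,
    2017: 1339180127,
    2018: 1354051854,
    2019: 1368737513,
    2020: 1384658756,
}


def predict_consumption(population):
    """Predicted sugar consumption in million tons for a given population."""
    return SLOPE * population + INTERCEPT


if __name__ == "__main__":
    for year, population in POPULATION.items():
        print(year, population, round(predict_consumption(population), 3))
```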

Conclusion:
1) Different data analysis techniques were studied and implemented on the given problem statement.
2) Power analysis was done to find an appropriate sample size.
3) The chi-square test was carried out to conclude that the data follow a normal distribution.
4) Hypothesis testing was done using a t-test, and it was found that the hypothesis could not be rejected.
5) Regression analysis was done to find the equation of the line fitted to the data points.
6) Sugar consumption for the next 5 years was predicted using this line.
