
Roll No. CA654221    Registration No. 20KBR00360

ASSIGNMENT
Educational Statistics
Course Code: 8614

Department of Early Childhood Education and


Elementary Teacher Education
Faculty of Education
Allama Iqbal Open University, Islamabad


In the Name of Allah, the Most Gracious, the Most Merciful

SUBMITTED BY ABDUL SABOOR


ROLL NO. CA654221
REGISTRATION NO. 20KBR00360
PROGRAM B.Ed(1.5)
SEMESTER Spring 2021
SUBMITTED TO Sir Khalil Hussain Qamar


ASSIGNMENT No.2

Question#01
What do you know about an independent samples t-test and a paired samples t-test?

A t-test is a hypothesis test of the mean of one or two normally distributed populations. Several types of t-tests exist for different situations, but they all use a test statistic that follows a t-distribution under the null hypothesis:

• 1-Sample t — tests whether the mean of a single population is equal to a target value. Example: Is the mean height of female college students greater than 5.5 feet?
• 2-Sample t — tests whether the difference between the means of two independent populations is equal to a target value. Example: Does the mean height of female college students significantly differ from the mean height of male college students?
• Paired t — tests whether the mean of the differences between dependent (paired) observations is equal to a target value. Example: If you measure the weight of male college students before and after each subject takes a weight-loss pill, is the mean weight loss significant enough to conclude that the pill works?
• t-test in regression output — tests whether the values of coefficients in the regression equation differ significantly from zero. Example: Are high school SAT test scores significant predictors of college GPA?

An important property of the t-test is its robustness against violations of the normality assumption. In other words, with large samples t-tests are often valid even when the assumption of normality is violated. This property makes them one of the most useful procedures for making inferences about population means.
The t score is a ratio of the difference between two groups to the variability within the groups. The larger the t score, the greater the difference between the groups; the smaller the t score, the more similar the groups are. A t score of 3 means that the difference between the groups is three times the size of the variability within the groups. When you run a t-test, the bigger the t-value, the more likely it is that the results are repeatable.
• A large t-score tells you that the groups are different.
• A small t-score tells you that the groups are similar.


T-Values and P-values

How big is “big enough”? Every t-value has a p-value to go with it. A p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. P-values range from 0 to 1 (0% to 100%) and are usually written as a decimal; for example, a p-value of 5% is 0.05. Low p-values indicate that the observed results are unlikely to have occurred by chance alone. For example, a p-value of 0.01 means there is only a 1% probability of obtaining such results if the null hypothesis is true. In most cases, a result with a p-value below 0.05 (5%) is considered statistically significant.
There are three main types of t-test:
• An Independent Samples t-test compares the means for two groups.
• A Paired sample t-test compares means from the same group at different times (say, one year apart).
• A One sample t-test tests the mean of a single group against a known mean.
You probably don’t want to calculate the test by hand (the math can get very messy); statistical software will do the computation for you.
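As a sketch of how these two statistics are computed, the following uses only the Python standard library. The scores are made up for illustration, since the text gives no data set:

```python
from statistics import mean, variance  # variance() is the sample variance

# Hypothetical scores for two independent groups (illustrative only).
group_a = [72, 75, 78, 80, 69, 74, 77, 73]
group_b = [65, 70, 68, 72, 66, 71, 69, 67]

# Independent-samples t with pooled variance (equal-variance assumption).
na, nb = len(group_a), len(group_b)
sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
t_ind = (mean(group_a) - mean(group_b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Paired t: the same subjects measured twice (e.g. before/after);
# it is a one-sample t-test on the differences.
before = [82, 78, 90, 85, 88, 76, 81, 79]
after = [80, 74, 86, 84, 85, 75, 78, 77]
diffs = [b - a for b, a in zip(before, after)]
t_paired = mean(diffs) / (variance(diffs) / len(diffs)) ** 0.5
```

Each t-value would then be compared with the t-distribution (df = na + nb − 2 for the independent test, n − 1 for the paired test) to obtain a p-value.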

Reference:
Ans No.1 is written from text book (8614) Unit No.6 Page
No.65,68,69

END Q#01


Question#02
Why do we use regression analysis? Write down the types of
regression.

Regression and correlation analysis:

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the
relationship is hypothesized, and estimates of the parameter values are used to
develop an estimated regression equation. Various tests are then employed to
determine if the model is satisfactory. If the model is deemed satisfactory, the
estimated regression equation can be used to predict the value of the dependent
variable given values for the independent variables.

Regression model.
In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = a0 + a1x + e. Here a0 and a1 are referred to as the model parameters, and e is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y.

Least squares method.


Either a simple or multiple regression model is initially posed as a
hypothesis concerning the relationship among the dependent and independent
variables. The least squares method is the most widely used procedure for
developing estimates of the model parameters.
As an illustration of regression analysis and the least squares method,
suppose a university medical centre is investigating the relationship between
stress and blood pressure. Assume that both a stress test score and a blood
pressure reading have been recorded for a sample of 20 patients. The data are
shown graphically in the figure below, called a scatter diagram. Values of the
independent variable, stress test score, are given on the horizontal axis, and
values of the dependent variable, blood pressure, are shown on the vertical axis.
The line passing through the data points is the graph of the estimated
regression equation: y = 42.3 + 0.49x. The parameter estimates, b0 = 42.3 and
b1 = 0.49, were obtained using the least squares method.
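The least squares estimates can be computed directly from the formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. The 20-patient data set is not reproduced in the text, so the sketch below uses a small made-up (stress score, blood pressure) sample:

```python
from statistics import mean

# Hypothetical (stress score, blood pressure) pairs -- illustrative only.
x = [20, 35, 50, 60, 80, 95]
y = [55, 62, 66, 73, 82, 89]

xbar, ybar = mean(x), mean(y)
# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

def predict(stress):
    """Estimated regression equation: y-hat = b0 + b1 * x."""
    return b0 + b1 * stress
```

With the text's estimates b0 = 42.3 and b1 = 0.49, the same `predict` function would give the fitted blood pressure for any stress score.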

Correlation.
Correlation and regression analysis are related in the sense that both deal
with relationships among variables. The correlation coefficient is a measure of
linear association between two variables. Values of the correlation coefficient
are always between -1 and +1. A correlation coefficient of +1 indicates that two
variables are perfectly related in a positive linear sense, a correlation coefficient
of -1 indicates that two variables are perfectly related in a negative linear sense,
and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. For simple linear regression, the sample correlation
coefficient is the square root of the coefficient of determination, with the sign of
the correlation coefficient being the same as the sign of b1, the coefficient of x1
in the estimated regression equation.
Neither regression nor correlation analyses can be interpreted as
establishing cause-and-effect relationships. They can indicate only how or to
what extent variables are associated with each other. The correlation coefficient
measures only the degree of linear association between two variables. Any
conclusions about a cause-and-effect relationship must be based on the
judgment of the analyst.
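The definition of the sample correlation coefficient, and its relation to the coefficient of determination and the slope sign described above, can be sketched as follows (the paired data are made up for illustration):

```python
from math import sqrt
from statistics import mean

# Illustrative paired data (hypothetical, not from the text).
x = [20, 35, 50, 60, 80, 95]
y = [55, 62, 66, 73, 82, 89]

xbar, ybar = mean(x), mean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

r = sxy / sqrt(sxx * syy)   # sample correlation coefficient, -1 <= r <= 1
r_squared = r ** 2          # coefficient of determination
b1 = sxy / sxx              # slope of the least squares line

# For simple linear regression: |r| = sqrt(R^2), and r has the sign of b1.
assert abs(abs(r) - sqrt(r_squared)) < 1e-12
assert (r > 0) == (b1 > 0)
```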

Reference:
Ans No.2 is written from text book (8614) Unit No.7 Page
No.72,76-78

END Q#02


Question#03
Write a short note on one-way ANOVA. Write down the main assumptions underlying one-way ANOVA.

In some decision-making situations, the sample data may be divided into various groups, i.e. the sample may be supposed to consist of k sub-samples. Interest lies in examining whether the total sample can be considered homogeneous or whether there is some indication that the sub-samples have been drawn from different populations. In these situations, we have to compare the mean values of the various groups with respect to one or more criteria.
The total variation present in a set of data may be partitioned into a number of non-overlapping components as per the nature of the classification. The systematic procedure to achieve this is called Analysis of Variance (ANOVA). With the help of such a partitioning, some testing of hypotheses may be performed.
Initially, Analysis of Variance (ANOVA) was employed only for experimental data from Randomized Designs, but it has since been used for analyzing survey and secondary data from Descriptive Research.
Analysis of Variance may also be visualized as a technique to examine a
dependence relationship where the response (dependence) variable is metric
(measured on interval or ratio scale) and the factors (independent variables) are
categorical in nature with a number of categories more than two.


Example of ANOVA
Ventura is an FMCG company selling a range of products. Its outlets are spread over the entire state. For administrative and planning purposes, Ventura has sub-divided the state into four geographical regions (Northern, Eastern, Western and Southern). Random sample data on sales have been collected from different outlets spread over the four geographical regions.
Variation, being a fundamental characteristic of data, would always be present. Here, the total variation in sales may be measured by the sum of squared deviations from the mean sales. If we analyze the sources of variation in sales in this case, we may identify two sources:
• Sales within a region would differ and this would be true for all four
regions (within-group variations)
• There might be impact of the regions and mean-sales of the four regions
would not be all the same i.e. there might be variation among regions
(between-group variations).
So, total variation present in the sample data may be partitioned into two
components: between-regions and within-regions and their magnitudes may be
compared to decide whether there is a substantial difference in the sales with
respect to regions. If the two variations are in close agreement, then there is no reason to believe that sales are not the same in all four regions; if not, it may be concluded that there exists a substantial difference between some or all of the regions.
Here, it should be kept in mind that ANOVA partitions variation according to assignable causes and a random component; through this partitioning, the ANOVA technique may be used as a method for testing the significance of differences among means (more than two).

Types of Analysis of Variance (ANOVA)


If the values of the response variable have been affected by only one factor (different categories of a single factor), there will be only one assignable cause by which the data are sub-divided, and the corresponding analysis is known as One-Way Analysis of Variance. The example (Ventura Sales) comes in this category. Other examples may be: examining the difference in analytical aptitude among students of various subject streams (like engineering graduates, management graduates, statistics graduates); the impact of different modes of advertisement on brand acceptance of consumer durables, etc.
On the other hand, if we consider the effect of more than one assignable cause (different categories of multiple factors) on the response variable, then the corresponding analysis is known as N-Way ANOVA (N ≥ 2). In particular, if the impact of two factors (each having multiple categories) is considered on the dependent (response) variable, that is known as Two-Way ANOVA. For example: in the Ventura Sales case, if along with geographical regions (Northern, Eastern, Western and Southern) one more factor, ‘type of outlet’ (Rural and Urban), is considered, then the corresponding analysis will be Two-Way ANOVA.
More examples: examining the difference in analytical-aptitude among
students of various subject-streams and geographical locations; the impact of
different modes of advertisements and occupations on brand-acceptance of
consumer durables etc.
Two-Way ANOVA may be further classified into two categories:


Two-Way ANOVA with one observation per cell:


There will be only one observation in each cell (combination). Suppose we have two factors, A (having m categories) and B (having n categories). There will be N = m*n total observations, with one observation (data point) in each (Ai, Bj) cell (combination), i = 1, 2, ..., m and j = 1, 2, ..., n. Here, the effects of the two factors may be examined.
Two-Way ANOVA with multiple observations per cell:
There will be multiple observations in each cell (combination). Here, along with the effects of the two factors, their interaction effect may also be examined. An interaction effect occurs when the impact of one factor (assignable cause) depends on the category of the other factor. For examining the interaction effect, each cell (combination) must have more than one observation, so this is not possible in Two-Way ANOVA with one observation per cell.

Conceptual Background
The fundamental concept behind the Analysis of Variance is the “Linear Model”. Let X1, X2, ..., Xn be observable quantities. All the values can be expressed as:
Xi = µi + ei
where µi is the true value, which is due to assignable causes, and ei is the error term, which is due to random causes. It is assumed that all error terms ei are independently distributed normal variates with mean zero and common variance σe².
Further, the true value µi can be assumed to consist of a linear function of t1, t2, ..., tk, known as “effects”.


If, in a linear model, all effects tj are unknown constants (parameters), then that linear model is known as a “fixed-effect model”. Otherwise, if the effects tj are random variables, the model is known as a “random-effect model”.

One-Way Analysis of Variance


We have n observations Xij, divided into k groups A1, A2, ..., Ak, with the ith group having ni observations.
Here, the proposed fixed-effect linear model is:
Xij = µi + eij
where µi is the mean of the ith group.
General effect (grand mean): µ = Σi (ni µi)/n
and the additional effect of the ith group over the general effect: αi = µi − µ.
So, the linear model becomes:
Xij = µ + αi + eij
with Σi (ni αi) = 0.
The least squares estimates of µ and αi may be determined by minimizing the error sum of squares Σi Σj eij² = Σi Σj (Xij − µ − αi)², giving:
X.. (combined mean of the sample) and Xi. (mean of the ith group in the sample).
So, the estimated linear model becomes:
Xij = X.. + (Xi. − X..) + (Xij − Xi.)
Squaring and summing gives:
Σi Σj (Xij − X..)² = Σi ni (Xi. − X..)² + Σi Σj (Xij − Xi.)²
Total Sum of Squares = Sum of squares due to group effect + Sum of squares due to error
or


Total Sum of Squares = Between-Group Sum of Squares + Within-Group Sum of Squares
TSS = SSB + SSE
Further, the Mean Sums of Squares are given as:
MSB = SSB/(k−1) and MSE = SSE/(n−k),
where (k−1) is the degrees of freedom (df) for SSB and (n−k) is the df for SSE.
Here, it should be noted that SSB and SSE add up to TSS, and the corresponding df’s (k−1) and (n−k) add up to the total df (n−1), but MSB and MSE do not add up to a total mean square.
Thus, by partitioning TSS and the total df into two components, we are able to test the hypothesis:
H0: µ1 = µ2 = ... = µk
H1: Not all µ’s are the same, i.e. at least one µ is different from the others,
or alternatively:
H0: α1 = α2 = ... = αk = 0
H1: Not all α’s are zero, i.e. at least one α is different from zero.
MSE is always an unbiased estimate of σe², and if H0 is true then MSB is also an unbiased estimate of σe².
Further, SSB/σe² follows a Chi-square (χ²) distribution with (k−1) df and SSE/σe² follows a Chi-square (χ²) distribution with (n−k) df. These two χ² variates are independent, so the ratio of the corresponding mean squares, F = MSB/MSE, follows the variance-ratio distribution (F distribution) with (k−1, n−k) df.
The F test is right-tailed (one-tailed). Accordingly, a p-value may be computed to decide whether or not to reject the null hypothesis H0.
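The partition TSS = SSB + SSE and the F statistic can be computed directly; a minimal sketch with made-up sales figures for the four regions (the text provides no actual data):

```python
from statistics import mean

# Hypothetical sales figures for four regions (illustrative only).
groups = {
    "North": [23, 25, 28, 22, 26],
    "East":  [30, 32, 29, 35, 31],
    "West":  [24, 27, 25, 26, 23],
    "South": [33, 36, 34, 31, 35],
}

all_values = [v for g in groups.values() for v in g]
n, k = len(all_values), len(groups)
grand_mean = mean(all_values)

# Partition the total sum of squares: TSS = SSB + SSE
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
tss = sum((v - grand_mean) ** 2 for v in all_values)

msb = ssb / (k - 1)   # between-group mean square, df = k - 1
mse = sse / (n - k)   # within-group mean square,  df = n - k
f_stat = msb / mse    # compare with the F(k-1, n-k) critical value

assert abs(tss - (ssb + sse)) < 1e-9  # the partition identity holds
```

A large F (here well above typical F(3, 16) critical values) would lead to rejecting H0 that all region means are equal.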


If H0 is rejected, i.e. not all µ’s are the same, rejecting the null hypothesis does not tell us which group means differ from the others. So, Post-Hoc Analysis is performed to identify which group means are significantly different. A Post-Hoc test takes the form of multiple comparisons, testing the equality of two group means at a time, i.e. H0: µp = µq, by using a two-group independent samples test, or by comparing the difference between sample means (two at a time) with the least significant difference (LSD), also called the critical difference (CD):
LSD = t(α/2, error df) × [MSE (1/np + 1/nq)]^1/2
If the observed difference between two means is greater than the LSD/CD, the corresponding null hypothesis is rejected at the α level of significance.
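A minimal LSD comparison might look as follows. The numbers are illustrative, not from the text: suppose a one-way ANOVA on k = 4 groups of 5 observations gave MSE = 4.3 with n − k = 16 error df, and two of the group means were 24.8 and 31.4 (2.120 is the tabled t value for α/2 = 0.025 with 16 df):

```python
from math import sqrt

# Illustrative post-hoc inputs (hypothetical ANOVA results).
mse, n_p, n_q = 4.3, 5, 5
t_crit = 2.120  # tabled t for alpha/2 = 0.025, 16 df

lsd = t_crit * sqrt(mse * (1 / n_p + 1 / n_q))
observed_diff = abs(24.8 - 31.4)

# Reject H0: mu_p = mu_q when the observed difference exceeds the LSD.
significant = observed_diff > lsd
```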

Assumptions for ANOVA


Though these have been discussed in the conceptual part, to reiterate, it should be ensured that the following assumptions are fulfilled:
➢ The populations from which the samples have been drawn should follow a normal distribution.
➢ The samples have been selected randomly and independently.
➢ Each group should have a common variance, i.e. should be homoscedastic: the variability in the dependent-variable values within different groups is equal.
It should be noted that the Linear Model used in ANOVA is not affected by
minor deviations in the assumptions especially if the sample is large.
The Normality Assumption may be checked using Tests of Normality:
Shapiro-Wilk Test and Kolmogorov-Smirnov Test with Lilliefors Significance
Correction. Here, Normal Probability Plots (P-P Plots and Q-Q Plots) may also


be used for checking the normality assumption. The assumption of equality of variances (homoscedasticity) may be checked using various tests of homogeneity of variances (Levene’s test, Bartlett’s test, Brown–Forsythe test, etc.).

ANOVA vs T-test
We employ the two-independent-samples t-test to examine whether there exists a significant difference in the means of two categories, i.e. whether the two samples have come from the same or different populations. As an extension, one might perform multiple t-tests (taking two groups at a time) to examine the significance of differences in the means of k samples in place of ANOVA. If this is attempted, the errors involved in the testing of hypotheses (type I and type II errors) cannot be controlled correctly, and the type I error will be much larger than α (the significance level). So, in this situation, ANOVA is always preferred over multiple independent-samples t-tests.
For example, we have four categories of regions: Northern (N), Eastern (E), Western (W) and Southern (S). If we want to compare the population means using the two-independent-samples t-test, i.e. by taking two categories (groups) at a time, we have to make 4C2 = 6 comparisons, i.e. six independent-samples tests have to be performed (comparing N with S, N with E, N with W, E with W, E with S, and W with S). If we use a 5% level of significance for each of the six individual t-tests, the overall type I error will be 1 − (0.95)^6 = 1 − 0.735 = 0.265, i.e. 26.5%.
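The inflation of the type I error rate described above is a one-line calculation:

```python
# Familywise type I error from six separate t-tests at alpha = 0.05 each,
# assuming the six tests are independent (a simplifying assumption).
alpha = 0.05
n_tests = 6  # 4C2 pairwise comparisons among four regions

familywise_error = 1 - (1 - alpha) ** n_tests
print(round(familywise_error, 3))  # 0.265
```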

Reference:
Ans No.3 is written from text book (8614) Unit No.8 Page No.82,86,87

END Q#03

Question#04
What do you know about the chi-square (χ²) goodness of fit test? Write down the procedure for the goodness of fit test.

Set up the hypothesis for Chi-Square goodness of fit test:

A. Null hypothesis:
In Chi-Square goodness of fit test, the null hypothesis assumes that there is
no significant difference between the observed and the expected value.


B. Alternative hypothesis:
In Chi-Square goodness of fit test, the alternative hypothesis assumes that
there is a significant difference between the observed and the expected value.
Compute the value of the Chi-Square goodness of fit test statistic using the following formula:
χ² = Σ (O − E)²/E
where χ² = Chi-Square goodness of fit test statistic, O = observed value, and E = expected value.

Degree of freedom:
In Chi-Square goodness of fit test, the degree of freedom depends on the
distribution of the sample. The following table shows the distribution and an
associated degree of freedom:

Type of distribution    No. of constraints    Degrees of freedom
Binomial distribution           1                   n − 1
Poisson distribution            2                   n − 2
Normal distribution             3                   n − 3

Hypothesis testing:
Hypothesis testing in the Chi-Square goodness of fit test is the same as in other tests, like the t-test, ANOVA, etc. The calculated value of the Chi-Square statistic is compared with the table value. If the calculated value is greater than the table value, we reject the null hypothesis and conclude that there is a significant difference between the observed and the expected frequencies. If the calculated value is less than the table value, we fail to reject the null hypothesis and conclude that there is no significant difference between the observed and expected values.
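The whole procedure can be sketched with a classic example, testing made-up counts from 60 die rolls against a fair die (11.070 is the tabled χ² critical value for α = 0.05 with 5 df):

```python
# Goodness of fit sketch: are 60 die rolls consistent with a fair die?
observed = [5, 8, 9, 8, 10, 20]   # hypothetical counts for faces 1-6
expected = [60 / 6] * 6           # fair die: 10 rolls expected per face

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1 = 5 degrees of freedom

critical_value = 11.070           # tabled chi-square, alpha = 0.05, df = 5
reject_h0 = chi_sq > critical_value
```

Here chi_sq = 13.4 exceeds the table value, so the null hypothesis (the die is fair) would be rejected.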

Reference:
Ans No.4 is written from text book (8614) Unit No.9 Page No.92-94

END Q#04

Question#05
What is the chi-square (χ²) independence test? Explain in detail.

The Chi-Square goodness of fit test is a non-parametric test used to determine whether the observed values of a given phenomenon differ significantly from the expected values. The term “goodness of fit” refers to comparing the observed sample distribution with the expected probability distribution.
The Chi-Square goodness of fit test determines how well a theoretical distribution (such as normal, binomial, or Poisson) fits the empirical distribution. The sample data are divided into intervals, and the number of points that fall into each interval is compared with the expected number of points in that interval.
We use the chi-square test to test the validity of a distribution assumed for a random phenomenon. The test evaluates the null hypothesis H0 (that the data are governed by the assumed distribution) against the alternative (that the data are not drawn from the assumed distribution).

Let p1, p2, ..., pk denote the probabilities hypothesized for k possible outcomes. In n independent trials, let Y1, Y2, ..., Yk denote the observed counts of each outcome, which are to be compared to the expected counts np1, np2, ..., npk. The chi-square test statistic is:
χ² = (Y1 − np1)²/np1 + (Y2 − np2)²/np2 + ... + (Yk − npk)²/npk
Reject H0 if this value exceeds the upper critical value of the χ² distribution with (k − 1) degrees of freedom at the desired level of significance α.

Reference:
Ans No.5 is written from text book (8614) Unit No.9 Page No.94

END Q#05
