SAMPLING METHODS
Sampling is concerned with the selection of a subset of
individuals from within a statistical population to estimate
characteristics of the whole population. Two advantages of
sampling are that the cost is lower and data collection is
faster than measuring the entire population.
The sampling process comprises several stages:
Defining the population of concern
Determining the sample size
Implementing the sampling plan
Probability methods
This is the best overall group of methods to use, because
the most powerful statistical analyses can subsequently be
applied to the results.
Nonprobability sampling
Convenience methods
Good sampling is time-consuming and expensive. Not all
experimenters have the time or funds to use more
accurate methods. There is a price, of course, in the
potentially limited validity of the results.
Descriptive and Inferential Statistics
When analysing data, such as the marks achieved by 100
students for a piece of coursework, it is possible to use both
descriptive and inferential statistics in your analysis of their
marks. Typically, in most research conducted on groups of
people, you will use both descriptive and inferential
statistics to analyse your results and draw conclusions.
Descriptive Statistics
Descriptive statistics is the term given to the analysis of data
that helps describe, show or summarize data in a
meaningful way such that, for example, patterns might
emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we
have analysed or reach conclusions regarding any
hypotheses we might have made. They are simply a way to
describe our data.
Descriptive statistics are very important because if we
simply presented our raw data it would be hard to visualize
what the data was showing, especially if there was a lot of
it. Descriptive statistics therefore enable us to present the
data in a more meaningful way, which allows simpler
interpretation of the data. For example, if we had the results
of 100 pieces of students' coursework, we may be
interested in the overall performance of those students. We
would also be interested in the distribution or spread of the
marks. Descriptive statistics allow us to do this. How to
properly describe data through statistics and graphs is an
important topic and is discussed in other Laerd Statistics
guides. Typically, there are two general types of statistic
that are used to describe data:
o Measures of central tendency: these are ways of
describing the central position of a frequency
distribution for a group of data. In this case, the
frequency distribution is simply the distribution and
pattern of marks scored by the 100 students from the
lowest to the highest. We can describe this central
position using a number of statistics, including the
mode, median, and mean.
o Measures of spread: these describe how spread out the
marks are, for example the standard deviation.
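As a minimal sketch of these descriptive measures, the snippet below uses Python's standard `statistics` module. The marks are hypothetical values chosen for illustration; they do not come from the coursework example in the text.

```python
import statistics

# Hypothetical coursework marks (illustration only, not data from the text)
marks = [52, 61, 61, 67, 70, 74, 78, 85]

# Measures of central tendency
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Measure of spread: sample standard deviation (divides by n - 1)
spread = statistics.stdev(marks)

print("mean:", mean_mark)
print("median:", median_mark)
print("mode:", mode_mark)
print("standard deviation:", round(spread, 2))
```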
INFERENTIAL STATISTICS
Parametric Statistics
Parametric tests are tests applied to data
that are normally distributed and measured
on an interval or ratio scale.
Examples:
t-test for Independent Samples
t-test for Correlated Samples
Z-test for Two Sample Means
F-test (ANOVA)
r (Pearson Product Moment Coefficient
of Correlation)
Multiple Regression Analysis
When to use parametric tests:
1. When the distribution is normal
The normal distribution is a symmetric distribution
with no skew. The tails are exactly the same. A left-
skewed distribution has a long left tail. Left-skewed distributions
are also called negatively-skewed distributions.
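Formula (t-test for independent samples):
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left(\dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$\bar{X}_1$ = mean of group 1
$\bar{X}_2$ = mean of group 2
$SS_1$ = sum of squares of group 1
$SS_2$ = sum of squares of group 2
$n_1$ = number of observations in group 1
$n_2$ = number of observations in group 2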
Ex. Problem: Is there a significant difference between
the performance of the male and the
female students in spelling?
Hypotheses:
H0: There is no significant difference between the
performance of the male and female
students in spelling.
H0: $\bar{X}_1 = \bar{X}_2$
H1: $\bar{X}_1 \neq \bar{X}_2$
Male Female
14 12
18 9
17 11
16 5
4 10
14 3
12 7
10 2
9 6
17 13
Solution:
Male Female
X1 X1^2 X2 X2^2
14 196 12 144
18 324 9 81
17 289 11 121
16 256 5 25
4 16 10 100
14 196 3 9
12 144 7 49
10 100 2 4
9 81 6 36
17 289 13 169
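∑x1 = 131   ∑x1^2 = 1891   ∑x2 = 78   ∑x2^2 = 738
n1 = 10   n2 = 10
Mean of group 1 = 13.1   Mean of group 2 = 7.8
SS1 = 1891 - (131)^2/10 = 174.9   SS2 = 738 - (78)^2/10 = 129.6
$$t = \frac{13.1 - 7.8}{\sqrt{\left(\dfrac{174.9 + 129.6}{10 + 10 - 2}\right)\left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} = 2.88$$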
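The hand computation above can be checked with a short sketch. It uses only the male and female spelling scores given in the problem; the helper names `sum_of_squares` and `independent_t` are illustrative and do not come from the source.

```python
import math

# Spelling scores from the worked problem
male = [14, 18, 17, 16, 4, 14, 12, 10, 9, 17]
female = [12, 9, 11, 5, 10, 3, 7, 2, 6, 13]


def sum_of_squares(xs):
    """SS = sum(x^2) - (sum(x))^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


def independent_t(a, b):
    """t statistic for two independent samples using the pooled SS formula."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    pooled = (sum_of_squares(a) + sum_of_squares(b)) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))


ss_male = sum_of_squares(male)      # about 174.9
ss_female = sum_of_squares(female)  # about 129.6
t_value = independent_t(male, female)
print(round(t_value, 2))  # 2.88
```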
How to solve t-test for independent samples using
Microsoft Office Excel
X1(1.0mg) X2 (1.5mg)
9.8 12
13.2 7.4
11.2 9.8
9.5 11.5
13 13
12.1 12.5
9.8 9.8
12.3 10.5
7.9 13.5
10.2
9.7
t-test for correlated samples
Formula:
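Z-test for Two Sample Means
Formula:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
$\bar{X}_1$ = mean of sample 1
$\bar{X}_2$ = mean of sample 2
$S_1^2$ = variance of sample 1
$S_2^2$ = variance of sample 2
$n_1$ = size of sample 1
$n_2$ = size of sample 2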
Example:
An admission test was administered to incoming
freshmen in the Colleges of Nursing and Veterinary
Medicine, with 50 students randomly selected
from each college.
X1 X2
90 90 85 83
87 89 87 85
88 90 85 88
86 87 85 86
89 88 85 84
90 90 85 88
89 87 84 86
87 86 86 84
90 87 85 83
90 85 87 85
90 86 88 86
89 90 85 85
88 90 85 85
87 86 88 85
90 89 88 84
90 89 87 85
87 90 85 85
88 89 86 87
89 87 84 85
90 85 84 87
87 87 86 85
88 90 86 84
90 85 85 86
86 89 85 87
88 90 86 84
Table: Critical values of Z
Test          Level of significance
              .01        .05
One-tailed    ±2.33      ±1.645
Two-tailed    ±2.575     ±1.96
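A rough sketch of the Z computation for the admission-test data follows. It assumes that the first two columns of the table hold the 50 scores of sample 1 and the last two columns hold the 50 scores of sample 2, and it uses the sample variance (dividing by n - 1); neither choice is stated in the source.

```python
import math
import statistics

# Rows copied from the table; ASSUMPTION: columns 1-2 are X1, columns 3-4 are X2
rows = [
    (90, 90, 85, 83), (87, 89, 87, 85), (88, 90, 85, 88), (86, 87, 85, 86),
    (89, 88, 85, 84), (90, 90, 85, 88), (89, 87, 84, 86), (87, 86, 86, 84),
    (90, 87, 85, 83), (90, 85, 87, 85), (90, 86, 88, 86), (89, 90, 85, 85),
    (88, 90, 85, 85), (87, 86, 88, 85), (90, 89, 88, 84), (90, 89, 87, 85),
    (87, 90, 85, 85), (88, 89, 86, 87), (89, 87, 84, 85), (90, 85, 84, 87),
    (87, 87, 86, 85), (88, 90, 86, 84), (90, 85, 85, 86), (86, 89, 85, 87),
    (88, 90, 86, 84),
]
x1 = [r[0] for r in rows] + [r[1] for r in rows]
x2 = [r[2] for r in rows] + [r[3] for r in rows]


def z_two_means(a, b):
    """Z = (mean1 - mean2) / sqrt(S1^2/n1 + S2^2/n2), with sample variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


z = z_two_means(x1, x2)
critical = 1.96  # two-tailed critical value at the .05 level, from the table
print("Z =", round(z, 2))
print("significant at .05 (two-tailed):", abs(z) > critical)
```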
CORRELATION COEFFICIENT R
In statistics, the correlation coefficient r measures the strength and
direction of a linear relationship between two variables on
a scatterplot. The value of r is always between +1 and –1. To
interpret its value, see which of the following values your
correlation r is closest to:
0. No linear relationship
Formula:
Step one: Make a chart with your data for two variables, labeling the
variables (x) and (y), and add three more columns labeled (xy), (x^2),
and (y^2).
Step two: Complete the chart using basic multiplication of the variable
values.
Step three: After you have multiplied all the values to complete the
chart, add up all of the columns from top to bottom.
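The three steps can be sketched in code as follows. The column sums are combined with the usual computational form of Pearson's r, r = [nΣxy - ΣxΣy] / sqrt([nΣx^2 - (Σx)^2][nΣy^2 - (Σy)^2]), which is assumed here because the formula itself is missing from the text. The x and y values are hypothetical.

```python
import math

# Hypothetical paired data (illustration only, not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Steps one and two: build the xy, x^2 and y^2 columns
xy = [a * b for a, b in zip(x, y)]
x_sq = [a * a for a in x]
y_sq = [b * b for b in y]

# Step three: add up every column
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy, sum_x_sq, sum_y_sq = sum(xy), sum(x_sq), sum(y_sq)

# Combine the sums using the computational form of Pearson's r
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
r = numerator / denominator
print(round(r, 4))
```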
F-test
The F-test is another parametric test, used to compare the
means of two or more independent groups. It is also known as
Analysis of Variance (ANOVA). The data to be analyzed should
be normally distributed and measured on an interval or ratio
scale.
Kinds of ANOVA:
One-way analysis of variance: only one variable is involved.
Two-way analysis of variance: two variables are
involved (column and row variables). It is used to
determine whether there are significant differences
between and among columns and rows, and to look at
the interaction effect between the variables being
analyzed.
Steps in computing F-test
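1. Compute the correction factor (CF), where GT is the grand
total and N is the total number of observations:
$$CF = \frac{(GT)^2}{N}$$
2. TSS: total sum of squares minus the CF, or the correction
factor
$$TSS = \sum x^2 - CF = \sum x_1^2 + \sum x_2^2 + \sum x_3^2 + \sum x_4^2 + \dots - CF$$
3. BSS: between sum of squares minus the CF or correction
factor
$$BSS = \frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3} + \frac{(\sum x_4)^2}{n_4} + \dots - CF$$
4. WSS: within sum of squares
$$WSS = TSS - BSS$$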
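The four steps can be sketched for a one-way layout as follows. The three groups are hypothetical values chosen so the arithmetic is easy to follow, and the function name `one_way_anova` is illustrative only.

```python
def one_way_anova(groups):
    """Compute CF, TSS, BSS, WSS and the F ratio for a one-way layout."""
    values = [v for g in groups for v in g]
    n_total = len(values)
    grand_total = sum(values)

    cf = grand_total ** 2 / n_total                    # step 1
    tss = sum(v * v for v in values) - cf              # step 2
    bss = sum(sum(g) ** 2 / len(g) for g in groups) - cf  # step 3
    wss = tss - bss                                    # step 4

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (bss / df_between) / (wss / df_within)
    return {"CF": cf, "TSS": tss, "BSS": bss, "WSS": wss, "F": f_ratio}


# Hypothetical scores for three groups (illustration only)
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

The computed F ratio is then compared with the tabular F value at the chosen level of significance, using the between and within degrees of freedom.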
                         39    48    38    125
                         38    45    38    121
Total                   198   241   197    ∑ = 636
Method of Teaching 2     40    45    50    135
                         41    42    46    129
                         39    42    43    124
                         38    41    43    122
                         38    40    42    120
Total                   196   210   224    ∑ = 630
Method of Teaching 3     40    40    40    120
                         43    45    41    129
                         41    44    41    126
                         39    44    39    122
                         38    43    38    119
Total                   201   216   199    ∑ = 616
Grand Total             595   667   620    = 1,882
$$CF = \frac{(1882)^2}{45} = 78709.42$$
$$SS_t = 40^2 + 41^2 + \dots + 39^2 + 38^2 - CF = 79218 - 78709.42 = 508.58$$
$$SS_r = \frac{(130)^2 + (132)^2 + (128)^2 + \dots + (122)^2 + (119)^2}{3} - CF = 78820.67 - 78709.42 = 111.25$$
$$SS_c = \frac{(595)^2 + (667)^2 + (620)^2}{15} - CF = 78887.6 - 78709.42 = 178.18$$
$$SS_w = SS_t - (SS_r + SS_c) = 508.58 - (111.25 + 178.18) = 219.15$$
df_t = 45 - 1 = 44
df_r = 15 - 1 = 14
df_c = 3 - 1 = 2
df_w = 44 - (14 + 2) = 28
Total 508.58 44
F-value computed:
Columns: F = MSc / MSw = 89.09 / 3.59 = 24.82
Row: F = MSr / MSw = 7.02 / 3.59 = 1.96
Interaction: F = MSI / MSw = 46.79 / 3.59 = 13.03
F-tabular at .05:
Columns (df = 2, 36): 3.26
Row (df = 2, 36): 3.26
Interaction (df = 4, 36): 2.63
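The arithmetic above can be re-checked with a short sketch. It recomputes the correction factor and the column sum of squares from the column totals in the table, then forms each F ratio from the mean squares quoted in the text and compares it with the tabular value. The "significant / not significant" wording is the usual decision rule (computed F greater than tabular F) and is not quoted from the source.

```python
# Totals taken from the worked example
column_totals = [595, 667, 620]
n_per_column = 15
grand_total = 1882
n_total = 45

cf = grand_total ** 2 / n_total
ss_columns = sum(t ** 2 for t in column_totals) / n_per_column - cf
ms_columns = ss_columns / (len(column_totals) - 1)

# Mean squares and tabular F values as quoted in the text
ms_within = 3.59
mean_squares = {"Columns": 89.09, "Row": 7.02, "Interaction": 46.79}
f_tabular = {"Columns": 3.26, "Row": 3.26, "Interaction": 2.63}

f_computed = {}
decisions = {}
for source, ms in mean_squares.items():
    f_computed[source] = round(ms / ms_within, 2)
    # Usual decision rule: significant when computed F exceeds tabular F
    decisions[source] = f_computed[source] > f_tabular[source]
    label = "significant" if decisions[source] else "not significant"
    print(source, f_computed[source], label)
```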
NONPARAMETRIC TEST
Tests that do not require a normal distribution
They utilize both nominal and ordinal data
Nominal data come from categories such as male and
female; yes or no response; political affiliations, religious
groupings and other categories.
Ordinal data are data expressed in rankings showing an
order, such as rank 1, rank 2, etc., or rating codes such as
O, VS, S, F, P and SA, A, D, SD.
Commonly used nonparametric tests:
Chi-square
This approach consists of four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4)
interpret results.
O = observed frequencies
E = expected frequencies
Problem
Grades Observed
1.25 14
1.50 18
1.75 32
2.0 20
2.25 16
uniform
Alternative hypothesis: At least one of the
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
$$\chi^2 = \frac{(14 - 20)^2}{20} + \frac{(18 - 20)^2}{20} + \frac{(32 - 20)^2}{20} + \frac{(20 - 20)^2}{20} + \frac{(16 - 20)^2}{20}$$
$$\chi^2 = \frac{36}{20} + \frac{4}{20} + \frac{144}{20} + 0 + \frac{16}{20} = 1.8 + 0.2 + 7.2 + 0 + 0.8 = 10.0$$
Interpret results. The critical Χ2 value (df = 4; α = .05) is
9.488. Since the computed value (10.0) is greater than the
critical value, reject the null hypothesis.
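The same computation can be sketched in a few lines. The observed counts and the critical value 9.488 come from the problem above; the expected count of 20 per grade follows from spreading the 100 observations evenly over the five grades, as the worked calculation does.

```python
# Observed frequencies for grades 1.25, 1.50, 1.75, 2.0 and 2.25
observed = [14, 18, 32, 20, 16]

# Expected frequency per grade if the 100 observations were spread evenly
expected = sum(observed) / len(observed)

chi_square = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1
critical_value = 9.488  # df = 4, alpha = .05, as given in the text

reject_null = chi_square > critical_value
print("chi-square =", round(chi_square, 2), "df =", df)
print("reject the null hypothesis:", reject_null)
```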
Chi-Square Test of Homogeneity
The test procedure described in this lesson is appropriate when the following
conditions are met:
Every hypothesis test requires the analyst to state a null hypothesis and
an alternative hypothesis. The hypotheses are stated in such a way that
they are mutually exclusive. That is, if one is true, the other must be false;
and vice versa.
The alternative hypothesis (Ha) is that at least one of the null hypothesis
statements is false.
Using sample data from the contingency tables, find the degrees of
freedom, expected frequency counts, test statistic, and the critical value.
DF = (r - 1) * (c - 1)
where r is the number of populations, and c is the number of levels for the
categorical variable.
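Expected frequencies. The expected frequency counts are computed separately
for each population at each level of the categorical variable:
$$E_{r,c} = \frac{n_r \, n_c}{n}$$
where Er,c is the expected frequency count for population r at level c of the
categorical variable, nr is the total number of observations from population r, nc is
the total number of observations at treatment level c, and n is the total sample
size.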
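Test statistic. The test statistic is a chi-square random variable (Χ2)
defined by the following equation.
$$\chi^2 = \sum \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$$
where Or,c is the observed frequency count for population r at level c of the
categorical variable.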
Problem
                    Viewing Preferences
         Lone Ranger   Sesame Street   The Simpsons   Row total
Boys         50             30              20           100
Girls        50             80              70           200
Do the boys' preferences for these TV programs differ significantly from the
girls' preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret
results. We work through those steps below:
State the hypotheses. The first step is to state the null hypothesis
and an alternative hypothesis.
Null hypothesis: The null hypothesis states that the proportion
of boys who prefer the Lone Ranger is identical to the
proportion of girls. Similarly, for the other programs. Thus,
H0: P(boys who prefer Lone Ranger) = P(girls who prefer Lone Ranger)
H0: P(boys who prefer Sesame Street) = P(girls who prefer Sesame Street)
H0: P(boys who prefer The Simpsons) = P(girls who prefer The Simpsons)
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
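The remaining analysis steps for this table can be sketched as follows: expected counts are computed as (row total × column total) / grand total, and the test statistic sums (O - E)^2 / E over all cells. The critical value 5.991 (df = 2, α = .05) is the standard chi-square table value and is not given in the text for this problem.

```python
# Observed counts: Lone Ranger, Sesame Street, The Simpsons
observed = {
    "Boys": [50, 30, 20],
    "Girls": [50, 80, 70],
}

row_totals = {name: sum(counts) for name, counts in observed.items()}
column_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

chi_square = 0.0
for name, counts in observed.items():
    for c, o in enumerate(counts):
        # Expected count for this cell: (row total * column total) / n
        e = row_totals[name] * column_totals[c] / n
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(column_totals) - 1)
critical_value = 5.991  # standard table value for df = 2, alpha = .05 (not from the text)

print("chi-square =", round(chi_square, 2), "df =", df)
print("reject the null hypothesis:", chi_square > critical_value)
```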
Chi-square test for independence is the test applied when you have
two categorical variables from a single population. It is used to determine
whether there is a significant association between the two variables.
This approach consists of four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret
results.
The alternative hypothesis is that knowing the level of Variable A can help
you predict the level of Variable B, which suggests that the variables are
related.
Using sample data, find the degrees of freedom, expected frequencies, test
statistic, and the P-value associated with the test statistic.
DF = (r - 1) * (c - 1)
Interpret Results
Problem
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret
results. We work through those steps below:
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
Interpret results. Since the computed value of 16.20 is greater than the
critical value for df = 2 (5.991 at the .05 level of significance; 9.210 at
the .01 level), reject the null hypothesis. Thus, we conclude that there is
a relationship between gender and voting preference.
It is used when:
Steps:
1. Rank the observations from lowest value to the highest value of both
groups
2. After ranking, assign the rank to the respective observation
3. Add the ranks of group 1, W1
4. Add the ranks of group 2, W2
5. Determine the number of observations in group 1 and group 2, that is,
n1 and n2 respectively
6. Use the formula
U1 = W1 – n1(n1 + 1)/2
U2 = W2 – n2(n2+1)/2
Problem: Of the 18 selected patients who had an advanced stage of leukemia,
ten were treated with a new serum and eight were not. The survival time, in
years, was reckoned from the time the experiment was conducted.
Treatment 2.9 3.1 5.3 4.2 4.5 3.9 2.0 3.7 4.1 4.0
No Treatment 1.9 0.50 9.0 2.2 3.1 2.0 1.7 2.5
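The ranking steps and the U1 and U2 formulas can be sketched for this data as follows. Tied observations are given the average of the ranks they occupy, a common convention that is assumed here because the steps above do not say how ties are handled. The helper name `average_ranks` is illustrative.

```python
# Survival times (years) from the problem
treatment = [2.9, 3.1, 5.3, 4.2, 4.5, 3.9, 2.0, 3.7, 4.1, 4.0]
no_treatment = [1.9, 0.50, 9.0, 2.2, 3.1, 2.0, 1.7, 2.5]


def average_ranks(values):
    """Rank values from lowest to highest; ties share the average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


n1, n2 = len(treatment), len(no_treatment)
ranks = average_ranks(treatment + no_treatment)  # steps 1 and 2

w1 = sum(ranks[:n1])  # step 3: sum of ranks of group 1
w2 = sum(ranks[n1:])  # step 4: sum of ranks of group 2

u1 = w1 - n1 * (n1 + 1) / 2  # step 6
u2 = w2 - n2 * (n2 + 1) / 2
print("W1 =", w1, "W2 =", w2)
print("U1 =", u1, "U2 =", u2)
```

A quick consistency check is that U1 + U2 equals n1 × n2.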
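$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d is the difference between the two ranks of each pair and n is the number of pairs.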