
The description of quantitative data

Review: some key Definitions

Definition 1: Frequency table of quantitative data
Tabulations and Frequency Distributions
One of the simplest ways to summarize data is by tabulation.

Table 1 Distribution of Married Women of Reproductive Age According to Present Number of Children in Rural Areas of China

Number of children    Number of women
0                      13751
1                      25171
2                      30426
3                      28560
4                      21719
5                      13695
6                       7255
7                       3268
8                       1151
9                        373
10 & above               156
Total                 145525
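As a minimal sketch of tabulation, the frequencies of Table 1 can be stored in a dictionary and summarized. The variable names and the key "10+" (standing for the open-ended class "10 & above") are illustrative choices, not part of the original notes.

```python
# Frequency table from Table 1: number of children -> number of women.
# The key "10+" is an illustrative label for the class "10 & above".
frequency = {
    "0": 13751, "1": 25171, "2": 30426, "3": 28560, "4": 21719,
    "5": 13695, "6": 7255, "7": 3268, "8": 1151, "9": 373, "10+": 156,
}

# Total number of women in the table.
total = sum(frequency.values())

# Relative frequency (percent) of each class.
relative = {k: 100 * f / total for k, f in frequency.items()}

# Class with the highest frequency (the peak of the distribution).
peak = max(frequency, key=frequency.get)

for k, f in frequency.items():
    print(f"{k:>3}  {f:>6}  {relative[k]:5.1f}%")
print("total", total, "| peak class:", peak)
```

The relative frequencies are often easier to compare across tables than the raw counts, because they do not depend on the total number of observations.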


Definition 2: Frequency distribution
One of the most common ways of describing a sample pictorially is to
plot on one axis values of the variable and on another axis the frequency
of occurrence of a value or a measure related to it.

[Graph 1 Distribution of Married Women. Horizontal axis: Number of Children (0 to 10 & above); vertical axis: Number of Women (0 to 35000).]


The type of curve

Definition 3: Normal distribution: The peak of the curve is in the middle,
and the curve is bilaterally symmetric about the mean.




Definition 4 Skewed distribution: The peak of the curve is not in the
middle.












Measures of central tendency
An entire distribution can be characterized by one typical value that
represents all the observations: a measure of central tendency.
The common averages are the mean, the geometric mean and the median.
(1) The mean is calculated by the following formula:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\sum X}{n}$$
(for raw data)

For example, the weights (kg) of ten 7-year-old boys are 17.3, 18.0, 19.4,
20.6, 21.2, 21.8, 22.5, 23.2, 24.0, 25.5. Find their average weight.

$$\bar{X} = \frac{17.3 + 18.0 + \cdots + 25.5}{10} = \frac{213.5}{10} = 21.35\ (\text{kg})$$
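The raw-data mean can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Weights (kg) of the ten 7-year-old boys from the example.
weights_kg = [17.3, 18.0, 19.4, 20.6, 21.2, 21.8, 22.5, 23.2, 24.0, 25.5]

# Mean = sum of the observations divided by the number of observations.
mean_weight = sum(weights_kg) / len(weights_kg)

print(f"mean weight = {mean_weight:.2f} kg")
```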

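$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_k X_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i X_i}{n} = \frac{\sum fX}{n}$$
(for a frequency table, where k is the number of groups, $X_i$ is the mid-point of group i, $f_i$ is its frequency, and $n = \sum f$ is the total number of observations)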
Table 1 Mean and SD of Height (cm) of 14-year-old female children

Height (cm)   f_i    x_i    f_i x_i    f_i x_i^2
124~            2    126       252       31752
128~            3    130       390       50700
132~           11    134      1474      197516
136~           22    138      3036      418968
140~           39    142      5538      786396
144~           27    146      3942      575532
148~           16    150      2400      360000
152~            5    154       770      118580
156~            3    158       474       74892
160~164         2    162       324       52488
Total         130            18600     2666824


$$\bar{X} = \frac{2 \times 126 + 3 \times 130 + \cdots + 2 \times 162}{2 + 3 + \cdots + 2} = \frac{18600}{130} = 143.08\ (\text{cm})$$
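The frequency-table mean can be sketched the same way, weighting each class mid-point by its frequency. The lists below copy the mid-points and frequencies from the height table; the names are illustrative.

```python
# Class mid-points (cm) and frequencies from the height table.
midpoints = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]
freqs = [2, 3, 11, 22, 39, 27, 16, 5, 3, 2]

n = sum(freqs)                                        # total number of children
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # sum of f * x

# Weighted mean: sum(f * x) / sum(f).
mean_height = sum_fx / n

print(f"n = {n}, sum(fx) = {sum_fx}, mean = {mean_height:.2f} cm")
```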

(2) The geometric mean is calculated by the same formula as the mean,
except that the values are first transformed into logarithms and the
result is transformed back by taking the antilogarithm.
$$G = \lg^{-1}\left(\frac{\lg X_1 + \lg X_2 + \cdots + \lg X_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg X}{n}\right)$$
(for raw data)

$$G = \lg^{-1}\left(\frac{f_1 \lg X_1 + f_2 \lg X_2 + \cdots + f_k \lg X_k}{\sum f}\right) = \lg^{-1}\left(\frac{\sum f \lg X}{\sum f}\right)$$
(for frequency table data, where k is the number of groups)
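Example: The serum titres of five persons are 1:2, 1:4, 1:8, 1:16 and 1:32.
Find the average titre. The calculation uses the reciprocals of the titres
(2, 4, 8, 16, 32), so the result G = 8 corresponds to an average titre of 1:8.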
$$G = \lg^{-1}\left(\frac{\lg 2 + \lg 4 + \lg 8 + \lg 16 + \lg 32}{5}\right) = \lg^{-1}\left(\frac{4.5154}{5}\right) = \lg^{-1}(0.9031) = 8$$
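A minimal sketch of the geometric mean, covering both the raw-data and the frequency-table form. The function name `geometric_mean` and its optional `freqs` argument are illustrative assumptions, not something defined in the notes.

```python
import math

def geometric_mean(values, freqs=None):
    """Geometric mean via base-10 logarithms.

    values: positive observations (e.g. reciprocals of titres).
    freqs:  optional frequencies; if omitted, every value counts once.
    """
    if freqs is None:
        freqs = [1] * len(values)
    total_f = sum(freqs)
    # Mean of the logarithms, weighted by frequency.
    mean_log = sum(f * math.log10(x) for f, x in zip(freqs, values)) / total_f
    # Antilogarithm transforms the result back to the original scale.
    return 10 ** mean_log

# Reciprocals of the five serum titres 1:2, 1:4, 1:8, 1:16, 1:32.
titre_reciprocals = [2, 4, 8, 16, 32]
g = geometric_mean(titre_reciprocals)
print(f"G = {g:.1f}, i.e. an average titre of 1:{g:.0f}")
```

Because the titres double at each step, their logarithms are evenly spaced, which is why the geometric mean lands exactly on the middle value.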


(3) The median is the value of the observation located in the middle of the
sorted sequence of observations. It divides the frequency distribution in
half when all the values are listed in order. When the number of values is
even, the median is the mean of the two middle values.
For example:
120, 123, 125, 127, 128, 130, 132 (7 values)
M = 127.

118, 120, 123, 125, 127, 128, 130, 132 (8 values)
M = (125 + 127)/2 = 126.
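The two cases (odd and even number of values) can be sketched as a small function. The name `median` is an illustrative choice; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Median of raw data: middle value of the sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

m_odd = median([120, 123, 125, 127, 128, 130, 132])
m_even = median([118, 120, 123, 125, 127, 128, 130, 132])
print(m_odd, m_even)
```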

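For data from a frequency table, we do not know the exact value of the
median, so we estimate it by the following formula:
Median or percentile
$$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$$
where x is the percentile; L is the lower limit of the group in which the
percentile is located; i is the interval (width) of that group; $f_x$ is the
frequency of that group; n is the total number of observations; and
$\sum f_L$ is the cumulative frequency of the groups below L.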

Table 2 Latent period of an infective disease

Latent days (group)   Frequency f   Cumulative frequency Σf   Cumulative percent/%
4~                        26                 26                     24.07
8~                        48                 74                     68.52
12~                       25                 99                     91.67
16~                        6                105                     97.22
20~                        3                108                    100.00

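Which group is the median located in? The cumulative percent first exceeds
50% in the group 8~, so L = 8, i = 4, $f_{50}$ = 48 and $\sum f_L$ = 26:
$$P_{50} = 8 + \frac{4}{48}\left(108 \times 50\% - 26\right) = 10.33\ (\text{days})$$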
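The grouped-data percentile formula can be sketched as a function that first locates the group containing the percentile and then interpolates within it. The function name and argument names are illustrative assumptions, and equal class widths are assumed, as in Table 2.

```python
def grouped_percentile(lower_limits, freqs, width, x):
    """x-th percentile from a frequency table with equal class widths.

    Implements P_x = L + (i / f_x) * (n * x% - cumulative frequency below L).
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    n = sum(freqs)
    target = n * x / 100          # position of the percentile, n * x%
    cum_below = 0                 # cumulative frequency of the groups below
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cum_below + f >= target:
            # The percentile lies in this group: interpolate linearly.
            return lower + width / f * (target - cum_below)
        cum_below += f
    raise ValueError("frequency table is empty")

# Table 2: latent period (days) of an infective disease.
lower_limits = [4, 8, 12, 16, 20]
freqs = [26, 48, 25, 6, 3]
p50 = grouped_percentile(lower_limits, freqs, width=4, x=50)
print(f"median (P50) = {p50:.2f} days")
```

The linear interpolation assumes that observations are spread evenly within the group, which is the same assumption the formula makes.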

Percentile
The xth percentile of a data set is a value such that at least x percent of
the items take on this value or less and at least (100-x) percent of the
items take on this value or more.
The median is the fiftieth percentile: 50% of the items are less than the
median and 50% are larger than it.

(4) The mode is the value that occurs with the greatest frequency.
For example, in the data set 2, 2, 3, 3, 3, 3, 4, 5, 6 the mode is 3.
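A short sketch of finding the mode by counting occurrences. It returns a list because a data set can have more than one mode; that design choice is an assumption of this sketch, not something stated in the notes.

```python
from collections import Counter

data = [2, 2, 3, 3, 3, 3, 4, 5, 6]

# Count how often each value occurs.
counts = Counter(data)
highest = max(counts.values())

# All values that reach the highest count (there may be ties).
modes = sorted(value for value, count in counts.items() if count == highest)
print("mode(s):", modes)
```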

Let’s summarize the application of the averages:
1. The mean is suitable for data that follow a normal distribution, or at
least a symmetric distribution.
2. The geometric mean is suitable for data that follow a positively skewed
distribution or a log-normal distribution.
3. The median is suitable for all kinds of data, including skewed
distributions, but it is less useful than the mean for further analysis.




Measures of dispersion
(1) Range
The range is the difference between the largest and smallest values in the
data set.

Disadvantage: it reflects only the two extreme values, the largest and
the smallest.

(2) Quartile interval
The quartile interval, denoted Q, can be regarded as the range of the
middle half of the observed values.
The interquartile range is the difference between the third quartile (75th
percentile) and the first quartile (25th percentile).

Table 2 Latent period of an infective disease

Latent days   f     Σf     %
4~            26     26    24.07
8~            48     74    68.52
12~           25     99    91.67
16~            6    105    97.22
20~            3    108   100.00
Calculate the interquartile range.

① Find $P_{75}$ and $P_{25}$ using
$$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$$

$P_{75}$: L = 12, i = 4, $f_{75}$ = 25, $\sum f_L$ = 74
$$P_{75} = 12 + \frac{4}{25}\left(108 \times 75\% - 74\right) = 13.12$$

$P_{25}$: L = 8, i = 4, $f_{25}$ = 48, $\sum f_L$ = 26
$$P_{25} = 8 + \frac{4}{48}\left(108 \times 25\% - 26\right) = 8.08$$

② Calculate the interquartile range
$$Q = P_{75} - P_{25} = 13.12 - 8.08 = 5.04$$
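The two steps can be traced in code by plugging the values read from Table 2 into the percentile formula. The helper name `percentile_in_group` and its keyword arguments are illustrative; the group for each percentile is taken as given, exactly as in step ①.

```python
def percentile_in_group(L, i, f_x, cum_below, n, x):
    """P_x = L + (i / f_x) * (n * x% - cumulative frequency below L)."""
    return L + i / f_x * (n * x / 100 - cum_below)

n = 108  # total frequency in Table 2

# Step 1: P75 lies in the group 12~, P25 in the group 8~.
p75 = percentile_in_group(L=12, i=4, f_x=25, cum_below=74, n=n, x=75)
p25 = percentile_in_group(L=8, i=4, f_x=48, cum_below=26, n=n, x=25)

# Step 2: interquartile range.
q = p75 - p25
print(f"P75 = {p75:.2f}, P25 = {p25:.2f}, Q = {q:.2f}")
```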

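(3) Variance
A key step in computing the variance is computing the difference between
each data value and the mean of the data set (the deviation about the mean).
$$\sum (x - \bar{x}) = 0$$
(The positive and negative deviations cancel each other, so the sum of the
deviations about the mean is 0.)
$$\sum (x - \bar{x})^2 \neq 0$$
(Squaring removes the cancellation; the sum is zero only if all values are
identical.)
The sum of the squared deviations depends on the number of values as well
as on their variability.
The average squared deviation is called the variance, denoted $\sigma^2$ for
a population and $s^2$ for a sample.
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Here N is the population size and n is the sample size. The sample formula
divides by n - 1 (the degrees of freedom) rather than n, so that $s^2$ is an
unbiased estimate of $\sigma^2$.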

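(4) Standard Deviation, SD
The standard deviation of a data set is defined as the positive square
root of the variance.
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Here N is the population size and n is the sample size; the sample formula
divides by n - 1, the degrees of freedom.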


Application of standard deviation:
The standard deviation shows the degree of dispersion of a variable's
distribution. A large standard deviation indicates a large degree of
variation: the values are more dispersed (farther away from the mean), and
the mean represents them poorly. Conversely, a small standard deviation
indicates that the values are concentrated around the mean, so the mean
represents each value better.
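As a sketch, the variance and standard deviation of the height data (class mid-points 126 to 162 cm with frequencies 2, 3, 11, 22, 39, 27, 16, 5, 3, 2) can be computed from the deviations about the mean. This example assumes the usual sample convention of dividing by n - 1; dividing by n instead gives the population form. Variable names are illustrative.

```python
import math

# Class mid-points (cm) and frequencies of the 14-year-old girls' heights.
midpoints = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]
freqs = [2, 3, 11, 22, 39, 27, 16, 5, 3, 2]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Deviations about the mean cancel out: their (weighted) sum is zero.
sum_dev = sum(f * (x - mean) for f, x in zip(freqs, midpoints))

# Squared deviations do not cancel.
sum_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))

population_variance = sum_sq_dev / n        # divide by n
sample_variance = sum_sq_dev / (n - 1)      # divide by n - 1 (assumed here)
sample_sd = math.sqrt(sample_variance)

print(f"sum of deviations = {sum_dev:.6f}")
print(f"s^2 = {sample_variance:.2f}, s = {sample_sd:.2f} cm")
```

With n = 130 the two denominators give nearly the same result; the difference matters mainly for small samples.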

(5) Coefficient of Variation
The coefficient of variation, also called the coefficient of dispersion and
denoted CV, is the ratio of the standard deviation S to the mean
$\bar{X}$, expressed as a percentage. The formula is
$$CV = \frac{S}{\bar{X}} \times 100\%$$


Ranges, quartile intervals and standard deviations all have measurement
units, which are the same as the units of the observed values. The
coefficient of variation, in contrast, is a relative number and has no
measurement unit, which makes it more suitable for comparing the variation
of different data sets.
The coefficient of variation is often used for:
1) Comparing the degree of variation of several data sets whose means
differ greatly, for example when one mean is twice another or more.
2) Comparing the degree of variation of several data sets with different
measurement units.

For example:
For 100 20-year-old men in one place, the mean height is 166.06 cm with a
standard deviation of 4.95 cm, and the mean weight is 53.72 kg with a
standard deviation of 4.96 kg. To determine which measurement varies more,
we should compare the coefficients of variation rather than the standard
deviations, because the measurement units differ (cm and kg):


$$CV_{height} = \frac{4.95}{166.06} \times 100\% = 2.98\%$$
$$CV_{weight} = \frac{4.96}{53.72} \times 100\% = 9.23\%$$
The variation degree of weight is larger than that of height.
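The comparison can be reproduced with a one-line function. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """CV = SD / mean, expressed as a percentage (unit-free)."""
    return sd / mean * 100

# Height (cm) and weight (kg) of the 100 20-year-old men.
cv_height = coefficient_of_variation(sd=4.95, mean=166.06)
cv_weight = coefficient_of_variation(sd=4.96, mean=53.72)

print(f"CV height = {cv_height:.2f}%, CV weight = {cv_weight:.2f}%")
```

The two standard deviations are almost equal numerically (4.95 versus 4.96), yet the coefficients of variation differ by a factor of about three, which is why the raw standard deviations cannot be compared directly here.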

Statistical inference of measurement data

There are two purposes of statistical analysis:
(1) Statistical description: to describe and summarize the important
features of the data with a few statistics.
(2) Statistical inference: to generalize from the sample to the
population, including the estimation of population parameters and
hypothesis testing.


1. The sampling error
When samples are selected from a population, the sample means differ from
each other and from the population mean. This difference is the sampling
error.
Why? The sampling error arises from the variation among the observations
in the population.
2. Standard error, SE
The standard error is the measure of sampling error.
Formula:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
(using the population standard deviation $\sigma$)
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
(using the sample standard deviation s)
The sampling error is also related to the sample size n.
If the sample size equals the population size, there is no sampling error. If
the sample size equals 1, the standard error equals the SD.
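A brief sketch of how the standard error shrinks as the sample size grows. It reuses the weight data quoted earlier in these notes (SD 4.96 kg, n = 100) purely as an illustration; the function name is an assumption.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd_weight = 4.96  # sample SD of weight (kg) from the CV example

# With a single observation the standard error equals the SD itself.
se_1 = standard_error(sd_weight, 1)

# With 100 observations it is ten times smaller.
se_100 = standard_error(sd_weight, 100)

print(f"SE (n=1) = {se_1:.3f} kg, SE (n=100) = {se_100:.3f} kg")
```

Because n enters through its square root, quadrupling the sample size only halves the standard error.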

3. Hypothesis testing of population mean
Suppose there is a sample whose mean is $\bar{X}_1$. For example, the
mean hemoglobin of 280 healthy male adults is 136.0 g/L, with an SD of
6.0 g/L, and the population mean is 140.0 g/L. The sample mean differs
from the population mean. There are two possibilities:
a. The difference is due to sampling error. This means $\mu_1 = \mu_0 = 140$.
b. The difference is substantial, because the sample comes from a different
population. This means $\mu_1 \neq \mu_0 = 140$.
Now we need to judge which possibility is true.

(1) The steps of hypothesis testing
① Setting the hypotheses
$H_0$: null hypothesis; $H_1$: alternative hypothesis
α = 0.05 (the significance level used to decide whether or not to reject $H_0$)
② Calculating the value of the test statistic (t)
③ Determining the value of P and making a judgment:
When P ≤ α, reject $H_0$ and accept $H_1$: the difference is statistically
significant.
When P > α, do not reject $H_0$: the difference is not statistically
significant.
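The three steps can be sketched for the hemoglobin example (n = 280, sample mean 136.0 g/L, SD 6.0 g/L, population mean 140.0 g/L). The one-sample statistic t = (sample mean - μ0) / (s / √n) is the standard form and is assumed here, since the notes do not write it out. The critical value of about 1.97 (two-sided, α = 0.05, 279 degrees of freedom) is also an assumption taken from a standard t table; comparing |t| with it is equivalent to comparing P with α.

```python
import math

# Step 1: hypotheses. H0: mu = 140 (difference due to sampling error),
#         H1: mu != 140, alpha = 0.05 (two-sided).
mu0 = 140.0
alpha = 0.05

# Sample summary from the hemoglobin example.
n = 280
sample_mean = 136.0
sample_sd = 6.0

# Step 2: test statistic.
se = sample_sd / math.sqrt(n)      # standard error of the mean
t = (sample_mean - mu0) / se       # one-sample t statistic (assumed form)
df = n - 1                         # degrees of freedom

# Step 3: judgment. Assumed two-sided critical value for df = 279.
critical_t = 1.97
reject_h0 = abs(t) > critical_t    # equivalent to P <= alpha

print(f"t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions |t| is far above the critical value, so the sketch rejects H0, which corresponds to possibility b.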

4. There are three kinds of design patterns:
● Comparing the sample with the population: one-sample t test
● Comparing in a matched-pair way: paired-sample t test
● Comparing two independent samples: two independent-sample t test

Homework

1. Calculate the mean and SD, and the median and Q.
Table Mean and SD of Height (cm) of 14-year-old female children

Height (cm)   Mid-point X   f_i   Cumulative frequency   Cumulative percent/%
124~             126          2            2                    1.5
128~             130          3            5                    3.8
132~             134         11           16                   12.3
136~             138         22           38                   29.2
140~             142         39           77                   59.2
144~             146         27          104                   80.0
148~             150         16          120                   92.3
152~             154          5          125                   96.2
156~             158          3          128                   98.5
160~164          162          2          130                  100.0
Total                       130

2. Fifty measles-susceptible children were vaccinated one month ago. Their
antibody titres are shown in the table below. Find the average titre.
Calculation of the average titre

Antibody titre   Number of children f   Reciprocal of titre x   lg x   f·lg x
(1)              (2)                    (3)                     (4)    (5)=(2)×(4)
1:4                1                      4
1:8                5                      8
1:16               6                     16
1:32               7                     32
1:64               8                     64
1:128             10                    128
1:256              8                    256
1:512              5                    512
Total             50

3. The average sleep time is supposed to be 8 hours a day (μ). We think
college students sleep a different amount, maybe more, maybe less. We
survey 15 students to see how much they sleep. The data are as follows
(each value represents a student): 5, 4, 6, 4, 8, 6, 5, 4, 3, 7, 5, 5, 5, 6, 6
(hours).
One-sample t test
① Setting the hypotheses
H0:          H1:          α = 0.05
② Calculating the value of the statistic (t value for the t test)

③ Making a judgment:

4. In a small clinical trial to assess the value of a new tranquillizer on
psychoneurotic patients, each patient was given a week’s treatment with
the drug and a week’s treatment with a placebo, the order in which the
two sets of treatments were given being determined at random. At the end
of each week the patient had to complete a questionnaire, on the basis of
which he was given an ‘anxiety score’ (with possible values from 0 to 30),
high scores corresponding to states of anxiety. The results are shown in
the table.
Table 1 Anxiety scores recorded for 10 patients receiving a new drug and
placebo in random order

           Anxiety score
Patient   Drug (1)   Placebo (2)   Difference d_i (3)=(1)-(2)
1           19          22            -3
2           11          18            -7
3           14          17            -3
4           17          19            -2
5           23          22             1
6           11          12            -1
7           15          14             1
8           19          11             8
9           11          19            -8
10           8           7             1
Paired-sample t test
① Setting the hypotheses
H0:          H1:          α = 0.05
② Calculating the value of the statistic (t value for the t test)
$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$

③ Making a judgment:

5. Cardiovascular disease, hypertension. Suppose a sample of 20
35-39-year-old non-pregnant, premenopausal OC (oral contraceptive) users
who work in a company is identified, with a mean systolic blood pressure of
132.86 mmHg and a sample standard deviation of 15.34 mmHg. A sample of 21
35-39-year-old non-pregnant, premenopausal non-OC users is similarly
identified, with a mean systolic blood pressure of 127.44 mmHg and a
sample standard deviation of 18.23 mmHg. What can be said about the
underlying mean difference in blood pressure between the two
populations?

Group           n    Mean (mmHg)   SD (mmHg)
OC users        20     132.86        15.34
Non-OC users    21     127.44        18.23

Two independent-sample t test
① Setting the hypotheses
$H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$, α = 0.05


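② Calculating the value of the statistic (t value for the t test)
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$$\nu = n_1 + n_2 - 2$$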
③Making a judgment: