
Statistics – Introduction

We will discuss the following questions:


1. What is Statistics?
2. Where does it come from?
3. Why should ‘we’ study it?

So, what is Statistics? It is the collection of data, and the conversion of this data into information.
What is data, and what is information? Data are recorded attributes of subjects, and there are two main types,
Quantitative and Qualitative data. Information is data organized for a specific purpose. In statistics,
data is organized for two main purposes: to describe the data (Descriptive), and to use the information
to make decisions (Inferential). So statistical study can be broken into three parts: Statistical Data
Collection, Descriptive Statistics, and Inferential Statistics. We shall later expand on each of them.

So, where does Statistics come from? On a large scale, rulers have always been interested in
knowing the composition of their subjects, especially the number of young men fit to be sent out to
kill, or be killed, for the glory of the ruler. Rulers have therefore been very fond of the census; one of
the best known, according to the Bible, led to Jesus Christ being born in a manger. On a smaller
scale, statistics is a major part of most of our personal decisions, because it forms the basis of
what we call experience: “I shop at No Frills because prices there are cheaper (most times) than
at Loblaws.” It is intertwined with chance or probability, so it arises in most of our daily routines.

So, why should ‘we’ study it? Variation (according to Derek Stephens of Sick Kids) is what makes
the study of statistics necessary. For example, if all the people in a country were alike in every
conceivable characteristic, then what satisfies one would satisfy all, and there would be no need for a census.
We therefore study statistics primarily to make sense of the variation in a group. The scientific
method, which underlies experimental science, technological research and development, and
research in the social sciences, relies essentially on statistical methods, and for this alone statistics is
worth studying. Knowledge of statistics makes us ‘better’ consumers of advertisements and
propaganda. Statistics is also invaluable in planning, especially for large groups such as the people of a country.

We will now learn a bit more about the three parts of Statistics: Statistical Data Collection,
Descriptive Statistics, and Inferential Statistics.
Statistics Text - http://www.cimt.plymouth.ac.uk/projects/mepres/allgcse/pbtxt.pdf
Statistical Data Collection:
http://www.cimt.plymouth.ac.uk/projects/mepres/alevel/stats_ch2.pdf
Types of Data: 1. Quantitative Data 2. Qualitative (Category) Data.

Quantitative Data are numerical data. There are two types: a) Discrete and b) Continuous.
Discrete Numerical Data: The possible numerical outcomes can be counted. Examples: the number of
students in a class is 0, 1, 2, 3, …, that is, a non-negative integer; shoe sizes, since there
are a finite number of them (note: some are fractions); the bank balances of Canadians.
Continuous Numerical Data: The possible numerical outcomes cannot be counted. Examples: the size of
feet measured in any unit of length, e.g. centimetres; the weight of any three oranges measured in
any unit of weight, e.g. grams.

Qualitative (Category) Data: A ‘measure’ that puts subjects in non-quantifiable groups. There are
two types: a) Nominal: category by name; b) Rank: category by rank. Colour (e.g. of cars), place of
birth, and SIN that starts with 405, 905, etc. are examples of Nominal data, and 1st, 2nd, 3rd year is an
example of Rank data.

Statistical Population: All the subjects under a statistical study. TYP students become the
population if the study is restricted to TYP students, for example, finding the heights of TYP students.
A census is a statistical study that includes each member of the adult population of a country.

If each member of a population is included in the study, then the problem of data collection is
reduced to how the ‘information’ is collected from the subjects. For most statistical studies, the size of
the population, the cost of the study, or the nature of the study makes it impractical to
collect data from each member of the population. In such situations, data is collected from a
representative group, or sample, chosen from the population. The data from this sample is then
assumed to be applicable to the whole population. So to the problem of how the ‘information’ is
collected is added the problem of how the sample is chosen.
How is ‘information’ collected? Data is collected from subjects either passively, by not interacting
with the subjects, or actively, by interacting with the subjects. Passive data collection is mostly by
observation and counting. An example is data on the number of people passing through a given
place in a given time interval. Active data collection involves measuring, usually with an instrument,
and most often by questionnaire. Problems with measuring by instrument are problems
associated with the instruments, which most times are resolved technically. Data collection by
questionnaire, on the other hand, has many ‘hidden’ problems, from the way the questions are framed
to whether subjects respond verbally or in writing. So the problems we will look at in
data collection reduce to whether the whole population is studied and, if not, how a sample is chosen
(Sampling); and, if questionnaires are used, how they are prepared and used.

Sampling: http://www.cimt.plymouth.ac.uk/projects/mepres/book9/bk9_18.pdf
The criterion for choosing a sample to represent a population in a statistical study is that each member
of the population must have an equal chance of being chosen. This is similar to Lotto 649, in which
each of the 49 numbers has an equal chance of being one of the 6 numbers chosen for the jackpot.
The best method to achieve this is Random Sampling. To choose a sample randomly is to
give each member of the population the same chance of being chosen. There are many methods of
Random Sampling, and one of the most used relies on Random Numbers, for example Lotto 649
numbers.
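
As an illustration only (it is not part of the original notes), random sampling with random numbers can be sketched in Python. The population of 500 ID numbers, the sample size of 20 and the seed are all assumptions made up for the sketch.

```python
import random

# Hypothetical population: ID numbers 1..500 standing in for the members.
population = list(range(1, 501))

# A fixed seed is used only so that the sketch gives the same sample each run.
rng = random.Random(649)

# random.sample draws without replacement, so every member of the
# population has the same chance of ending up in the sample.
sample = rng.sample(population, k=20)

print(sorted(sample))
```

Drawing without replacement mirrors a lottery draw: once a number is chosen it cannot be chosen again.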

Questionnaire: http://www.cimt.plymouth.ac.uk/projects/mepres/book8/bk8_20.pdf
See the link above for the criteria that a good questionnaire meets.

Descriptive Statistics: http://www.cimt.plymouth.ac.uk/projects/mepres/alevel/stats_ch3.pdf


This involves Sorting and Grouping, Graphical Illustration, and Calculation of Summary Statistics.
Sorting and Grouping: This brings some order to the data. Numerical data may be
arranged in increasing or decreasing order. It may also be sorted into a Stem and Leaf diagram. Another
method of sorting is to put the data in a Frequency Table. The two types of Frequency Table are
the Ungroup Frequency Table and the Group Frequency Table. The Ungroup Frequency Table lists each
score and the frequency of the score in the data. The Group Frequency Table lists groups of scores
and, for each group, the sum of the frequencies of the scores in the group. (The frequency of a score is the
number of times the score is in the data.)
Graphical Illustration: Some of the illustrations are: 1) Line Graph; 2) Pie Chart; 3) Bar Chart;
4) Histogram; 5) Cumulative Frequency Diagram; etc.

Summary Statistics: These are values derived from the data to give a short description of the data.
They are the Measures of Central Tendency and the Measures of Dispersion. Besides describing
the data, these measures are convenient for comparing two sets of data.
Measures of Central Tendency: These are the Mode, the Median, and the Mean. They are also known
as the averages. The Mode is the score with the highest frequency. The Median is the ‘score’ with
the same number of scores greater than it as are less than it. The Mean is the sum of all the scores in
the data, divided by the number of scores in the data.

We shall illustrate all these by examples using the following set of Data:
Test 1:
56 40 7 70 31 17 56 71 70 36 71 71
63 56 46 91 46 73 60 97 67 53 86 64
44 92 46 75 77 93 70 97 53 71 79 57

Test 2:
65 60 30 70 63 65 45 56 70 58 48 37
92 40 86 62 60 40 53 47 45 60 31 91
31 35 61 61 72 80 50 60 73 28 58 38

3. The colours of cars in a car park were:


Blue, White, Blue, Blue, Black, Blue, Black, Silver, Silver, Blue, Silver, Green, Black, White,
Silver, Silver, Black, Blue, Green, Blue, Green, Red, Black, Red.

Comments: To paraphrase Derek (Sick Kids Biostatistician), ‘Statistics is necessary because of
variation in the scores of a data’. That is, if there is no variation, then there is no statistical problem.
For example, if the people in a country did not change in any of their attributes, then there would be no need
for a census, or at most only one census for all time. The data may be uniformly distributed, that is,
the scores all have the same frequency; symmetrically distributed (bell curve); skewed to the ‘right’; or
skewed to the ‘left’ of a score. For a uniform distribution the mean and median are the same, whilst
every score is a mode, so the mode cannot be used as ‘the’ measure of central tendency. For a
symmetrical (bell curve) distribution, the mode, median and mean are the same value, so each could be used as
a measure of central tendency. For a skewed distribution, the median or mode may be a better
measure of central tendency. However, the mean is most often used as the measure of central
tendency for quantitative data, because it is influenced by each of the scores, and also because statistical
decision theory is more developed for the mean.
Stem and Leaf:
The Stem is the digit in the highest position of the numbers in the data, and the Leaf is the remaining
digit(s) of the number. For example, if the greatest number in the data is a two-digit number, then the
stem is the digit in the tens position, and the leaf is the digit in the units position.
Diagram:
List all the possible stems, that is, the digits from 0 (sometimes) to 9 for the stem position. For each
stem, attach its leaves by listing them on the right, and then arrange them in order of magnitude.
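Stem and Leaf Diagram for Test 1

First list the leaves of each stem as they occur in the data (left), then arrange the leaves of each
stem in order of magnitude (right).

Stem | Leaf (as listed)                   Stem | Leaf (in order of magnitude)
  0  | 7                                    0  | 7
  1  | 7                                    1  | 7
  2  |                                      2  |
  3  | 1, 6                                 3  | 1, 6
  4  | 0, 6, 6, 4, 6                        4  | 0, 4, 6, 6, 6
  5  | 6, 6, 6, 3, 3, 7                     5  | 3, 3, 6, 6, 6, 7
  6  | 3, 0, 7, 4                           6  | 0, 3, 4, 7
  7  | 0, 1, 0, 1, 1, 3, 5, 7, 0, 1, 9      7  | 0, 0, 0, 1, 1, 1, 1, 3, 5, 7, 9
  8  | 6                                    8  | 6
  9  | 1, 7, 2, 3, 7                        9  | 1, 2, 3, 7, 7

Final Diagram: the right-hand diagram above.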
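
The same sorting can be sketched in Python. This is only an illustration, and it assumes whole-number scores below 100, so that the stem is the tens digit and the leaf is the units digit; the function name `stem_and_leaf` is made up for the sketch.

```python
from collections import defaultdict

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

def stem_and_leaf(scores):
    """Map each stem (tens digit) to its leaves (units digits) in order."""
    leaves = defaultdict(list)
    for score in sorted(scores):          # sorting first keeps leaves ordered
        stem, leaf = divmod(score, 10)    # e.g. 71 -> stem 7, leaf 1
        leaves[stem].append(leaf)
    # Include stems that have no leaves (such as stem 2 for Test 1).
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

diagram = stem_and_leaf(test1)
for stem, leaf_list in diagram.items():
    print(stem, "|", ", ".join(str(leaf) for leaf in leaf_list))
```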
Measures of Central Tendency or Averages:
Mode - From the Stem and Leaf diagram it is not difficult to see that the mode is 71. It occurs more
often than any other score.

Median - To find the median, the scores are arranged in order of magnitude, and the score in the
middle position is the median. There are 36 scores in the data, so the middle position
lies between the 18th and the 19th positions. From the Stem and Leaf diagram, counting from the least
to the greatest score, the score in the 18th position is 64, and the score in the 19th position is 67. The
median is the sum of these two numbers divided by 2. The median is 65.5.
For a data with N scores, the Median Position = (N + 1) / 2. Arrange the data in order of magnitude.
Then the Median is the score in this position. It is found by counting. If the position falls between
two scores, as above, then the mean of the two scores is the median.
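
As a sketch only, the median position rule and the mode can be checked in Python for Test 1. The variable names are made up for the illustration; `statistics.mode` is the standard-library function for the most frequent value.

```python
import statistics

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

ordered = sorted(test1)            # arrange the data in order of magnitude
n = len(ordered)
position = (n + 1) / 2             # median position, 18.5 for N = 36

if position.is_integer():
    # Odd N: the median is the score in the middle position.
    median = ordered[int(position) - 1]
else:
    # Even N: the position falls between two scores, so take their mean.
    lower = ordered[int(position) - 1]   # score in the 18th position
    upper = ordered[int(position)]       # score in the 19th position
    median = (lower + upper) / 2

mode = statistics.mode(test1)      # the score with the highest frequency

print("median position:", position, "median:", median, "mode:", mode)
```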

Arithmetic Mean:
This is the sum of all the scores divided by the number of scores. There are 36 scores, and the sum
can be found from the original data or from the Stem and Leaf diagram.
Sum of the scores = 2252. Then the Arithmetic Mean is 2252 divided by 36, which is 62.56 (2 decimal
places). Arithmetic Mean = 63 (nearest whole number).
Some Properties of Arithmetic Mean:
The product of the Arithmetic Mean and the number of scores gives the sum of the scores. This is
useful in finding the mark required to make a certain grade.
Example 1: Akua’s mean mark for her first 3 tests is 78.
i. Akua wants her mean mark for the course to be at least 80. What is the
minimum mark she needs on the 4th (and last) test to get a mean of 80?
ii. What is the highest possible mean mark Akua can get in the course?
Solution:
i. For Akua to get a mean mark of 80 for 4 tests, the sum of her marks for the 4 tests must be equal
to the product of 80 (mean of the tests) and 4 (the number of tests). This is 320. The sum of Akua’s
marks for the first three tests is the product of 78 (mean of the 3 tests) and 3 (number of tests). This
is 234. The difference between the sum for the 4 tests and the sum for the 3 tests is Akua’s mark for the 4th test.
That is the difference of 320 and 234, which is 86. So Akua must get 86 on the 4th test for her
mean for the course to be 80.

ii. The highest possible mark Akua can get on the 4th test is 100. The sum of the first three tests is
234. So the sum of the 4 tests cannot be more than the sum of 234 and 100. This is 334. Akua’s
maximum mean mark is 334 divided by 4. This is 83.5. So the highest possible mean mark Akua
can get in the course is 83.5.
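
The arithmetic in this example can be sketched as two small Python functions. The function names are made up for the illustration, and the maximum mark of 100 is taken from part ii of the solution.

```python
def required_mark(current_mean, tests_taken, target_mean):
    """Mark needed on the next test so that the overall mean hits the target."""
    needed_total = target_mean * (tests_taken + 1)   # mean x number of tests
    current_total = current_mean * tests_taken
    return needed_total - current_total

def best_possible_mean(current_mean, tests_taken, max_mark=100):
    """Highest overall mean if the next test earns the maximum mark."""
    return (current_mean * tests_taken + max_mark) / (tests_taken + 1)

print(required_mark(78, 3, 80))      # part i
print(best_possible_mean(78, 3))     # part ii
```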

Some Measures of Dispersion - Range

The Range is the difference between the greatest number and the least number in the data.
Example: For Test 1, from the Stem and Leaf diagram, the greatest number is 97, and the least
number is 7. So the Range for Test 1 is the difference of 97 and 7, which is 90.
Another Statistical Diagram - Frequency Table: i. Ungroup and ii. Group
How often a score (number) appears in a data is called the frequency of that score. From the Stem
and Leaf diagram for Test 1, the score 46 appears 3 times, so the frequency of 46 is 3.
A Frequency Table is a table of the scores (numbers) in a data and how often they appear in the data.
The Stem and Leaf diagram of a data makes it ‘easier’ to make the Frequency Table of the data.
Quite often, however, Frequency Tables are made without first making the Stem and Leaf diagram.
For an Ungroup Frequency Table, the scores in the data are listed and then tallied by going through the
data and putting a tally mark opposite a score whenever it appears. The number of tally marks is
how often the score appears, and is put under Frequency opposite the score.

Ungroup Frequency Table for Test 1

Scores  Frequency      Scores  Frequency      Scores  Frequency
   7        1            56        3            73        1
  17        1            57        1            75        1
  31        1            60        1            77        1
  36        1            63        1            79        1
  40        1            64        1            86        1
  44        1            67        1            91        1
  46        3            70        3            92        1
  53        2            71        4            93        1
                                                97        2
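
Tallying can also be left to a computer. The following is a minimal Python sketch, using the standard-library `collections.Counter` to do the counting; it is an illustration and not the method of the notes, which tally by hand.

```python
from collections import Counter

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

frequency = Counter(test1)            # score -> number of times it appears
table = sorted(frequency.items())     # rows of (score, frequency) by score

print("Score  Frequency")
for score, f in table:
    print(f"{score:>5}  {f:>9}")

# The sum of the frequencies equals the number of scores.
total = sum(frequency.values())
print("Sum of frequencies:", total)
```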

Note: The sum of the frequencies in a frequency table is equal to the number of scores. The
frequency table lends itself to many uses in finding statistical measures: the measures of central
tendency and the measures of dispersion. For example, to find the Arithmetic Mean from a frequency
table: (i) multiply each score by its corresponding frequency; (ii) find the sum of the products in (i);
(iii) divide the sum in (ii) by the sum of the frequencies.

Exercise: From the frequency table for Test 1, find the Mode, Median, Arithmetic Mean, and
Range. Compare your answers to the answers obtained by using the Stem and Leaf diagram.

Notation: x (or y, or z) represents a score or number in a data, and f the frequency of a score.
N is the number of scores, and is equal to the sum of the frequencies of the scores.
Symbol: ∑ (sigma) is the symbol for summation or addition. For example, ∑x means sum all (of)
the x (or scores), and ∑f means sum all (of) the f (or the frequencies).

Formula for Arithmetic Mean x̄ (or µ)

For Data not in a frequency table: x̄ = ∑x / N. (Sum of all scores, divided by the number of scores.)

For Data in a frequency table: x̄ = ∑xf / ∑f. (Sum of all products of each score and its
corresponding frequency, divided by the sum of the frequencies, which is the number of scores.)
Example: Find the Arithmetic Mean for Test 1, in the Ungroup Frequency Table.
Ungroup Frequency Table for Test 1

  x    f    xf        x    f    xf        x    f    xf
  7    1     7       56    3   168       73    1    73
 17    1    17       57    1    57       75    1    75
 31    1    31       60    1    60       77    1    77
 36    1    36       63    1    63       79    1    79
 40    1    40       64    1    64       86    1    86
 44    1    44       67    1    67       91    1    91
 46    3   138       70    3   210       92    1    92
 53    2   106       71    4   284       93    1    93
                                         97    2   194
From the table, ∑xf = 2252 and ∑f = 36. By the formula, the arithmetic mean is
x̄ = 2252 / 36 = 62.56 ≈ 63 (nearest whole number). So the mean mark for Test 1 is 63 (as obtained earlier).

Note: ∑xf is the sum of all the numbers under the column 'xf'; similarly, ∑f is the sum of all the
numbers under the column 'f'.
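
As a sketch, the three steps (multiply each score by its frequency, sum the products, divide by the sum of the frequencies) look like this in Python. The variable names `sum_xf` and `sum_f` are chosen here to mirror ∑xf and ∑f.

```python
from collections import Counter

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

frequency = Counter(test1)                           # ungroup frequency table

sum_xf = sum(x * f for x, f in frequency.items())    # steps (i) and (ii)
sum_f = sum(frequency.values())                      # number of scores
mean = sum_xf / sum_f                                # step (iii)

print("sum xf =", sum_xf, " sum f =", sum_f, " mean =", round(mean, 2))
```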

Exercise: (i) Organize Test 2 in an Ungroup Frequency Table. (ii) Find the Arithmetic Mean of Test
2 (using the above procedure).
Measures of Dispersion: Variance and Standard Deviation

Notation: σ² (or s²) represents Variance. σ (or s) represents Standard Deviation.

Note: The square of the Standard Deviation is the Variance; or Standard Deviation = √Variance.

Formula for Variance and Standard Deviation

For Data not in a frequency Table

Variance: σ² = ∑(x − x̄)² / N
(Sum of the squares of the difference of each score and the mean, divided by the number of scores.)
It simplifies to: σ² = ∑x² / N − (x̄)²

Standard Deviation: σ = √( ∑(x − x̄)² / N )
(Square root of the sum of the squares of the difference of each score and the mean, divided by the
number of scores.)
It simplifies to: σ = √( ∑x² / N − (x̄)² )

For Data in a frequency Table

Variance: σ² = ∑(x − x̄)²f / ∑f
It simplifies to: σ² = ∑x²f / ∑f − (x̄)²

Standard Deviation: σ = √( ∑(x − x̄)²f / ∑f )
It simplifies to: σ = √( ∑x²f / ∑f − (x̄)² )

Comments: The Standard Deviation and the Variance, as the formulas show, give a measure of the
spread of the scores of a data about or around the Arithmetic Mean (a measure of central tendency).
Unlike the Range, which depends only on the two extreme scores, the lowest and the highest, the
Standard Deviation and the Variance depend on all the scores of a data. They are the most
widely used measures of dispersion, especially in Inferential Statistics.

Data with a large number of scores are most often given in a frequency table, or first organized in a
frequency table before any further analysis. So as an example, the Variance and Standard Deviation
will be calculated for Test 1 from the frequency table of Test 1.
Example: Find the Variance and Standard Deviation for Test 1 (in the Ungroup Frequency Table).
Solution:

σ² = ∑x²f / ∑f − (x̄)² is the Variance, and σ = √( ∑x²f / ∑f − (x̄)² ) is the Standard Deviation, where x̄ = ∑xf / ∑f.
So for each score 'x' and corresponding frequency 'f', the following must be found: xf and x²f.
Notation: x is a score; f is its frequency; xf is the product of a score and its frequency; and
x²f = x(xf) is the product of the score and xf.

  x    f    xf    x²f = x(xf)
  7    1     7        49
 17    1    17       289
 31    1    31       961
 36    1    36      1296
 40    1    40      1600
 44    1    44      1936
 46    3   138      6348
 53    2   106      5618
 56    3   168      9408
 57    1    57      3249
 60    1    60      3600
 63    1    63      3969
 64    1    64      4096
 67    1    67      4489
 70    3   210     14700
 71    4   284     20164
 73    1    73      5329
 75    1    75      5625
 77    1    77      5929
 79    1    79      6241
 86    1    86      7396
 91    1    91      8281
 92    1    92      8464
 93    1    93      8649
 97    2   194     18818
From the table: ∑f = 36; ∑xf = 2252; ∑x²f = 156 504

So Variance σ² = 156 504 / 36 − (2252 / 36)² = 4347.33 − 3913.20 = 434.14 (2 decimal places)
∴ σ² ≈ 434 (nearest whole number)

Standard Deviation σ = √( 156 504 / 36 − (2252 / 36)² ) = √434.14 = 20.84 (2 decimal places)
∴ σ ≈ 21 (nearest whole number)
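
As a check on the arithmetic, here is a minimal Python sketch of the frequency-table formula σ² = ∑x²f / ∑f − (x̄)². It is an illustration only; the variable names mirror the symbols, and the last lines compare the result with the defining formula ∑(x − x̄)²f / ∑f and with the standard-library `statistics.pvariance`. Evaluated without intermediate rounding, it gives a variance of about 434.14 and a standard deviation of about 20.84.

```python
import math
import statistics
from collections import Counter

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

frequency = Counter(test1)

sum_f = sum(frequency.values())                          # sum of f
sum_xf = sum(x * f for x, f in frequency.items())        # sum of xf
sum_x2f = sum(x * x * f for x, f in frequency.items())   # sum of x^2 f

mean = sum_xf / sum_f
variance = sum_x2f / sum_f - mean ** 2       # simplified formula
sd = math.sqrt(variance)                     # standard deviation

# Cross-checks: the defining formula and the library's population variance.
variance_by_definition = sum((x - mean) ** 2 * f
                             for x, f in frequency.items()) / sum_f
variance_by_library = statistics.pvariance(test1)

print(sum_f, sum_xf, sum_x2f)
print(round(variance, 2), round(sd, 2))
```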

Exercise: Find the variance and standard deviation of Test 2.

Comment: The standard deviation acts as a unit of the scale of measurement of the scores, in the
sense that a score can be described by the number of standard deviations it lies from the arithmetic mean.
Example: For Test 1, find:
i. The percentage of the scores that are within one standard deviation of the mean.
ii. The percentage of the scores that are within two standard deviations of the mean.
iii. How many standard deviations 95 is from the mean.
iv. How many standard deviations 7 is from the mean.

Solution:
The mean x̄ = 2252 / 36 = 62.56 and the Standard Deviation σ = √( 156 504 / 36 − (2252 / 36)² ) = 20.84.
i. A score within one standard deviation of the mean is greater than or equal to x̄ − σ
and less than or equal to x̄ + σ. So: x̄ − σ ≤ a score within one σ of x̄ ≤ x̄ + σ.
Substituting, x̄ − σ = 62.56 − 20.84 = 41.72 and x̄ + σ = 62.56 + 20.84 = 83.40.
From the Frequency Table or the Stem and Leaf diagram, the number of scores greater than or equal to 41.72
and less than or equal to 83.40 is 25. This is the number of scores from 44 to 79. The number
of all the scores is 36. Therefore the percentage of the scores that lie within one
standard deviation of the mean = (25 / 36) × 100% = 69.44%.

ii. Similarly: x̄ − 2σ ≤ a score within two σ of x̄ ≤ x̄ + 2σ.
Substituting, x̄ − 2σ = 62.56 − 2 × 20.84 = 20.88 and x̄ + 2σ = 62.56 + 2 × 20.84 = 104.24.
From the Frequency Table or the Stem and Leaf diagram, the number of scores greater than or equal to 20.88 and
less than or equal to 104.24 is 34. This is the number of scores from 31 to 97. The number of all
the scores is 36. Therefore the percentage of the scores that lie within two standard
deviations of the mean = (34 / 36) × 100% = 94.44%.

iii. For any number 'x': z = (x − x̄) / σ is the number of standard deviations of 'x' from the mean.
So for x = 95: z = (95 − 62.56) / 20.84 = 1.56. ∴ 95 is 1.56 standard deviations from the mean.

iv. From (iii), for x = 7: z = (7 − 62.56) / 20.84 = −2.67. ∴ the number of standard deviations 7
is from the mean is −2.67, that is, 7 is 2.67 standard deviations below the mean.
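
The counting in parts i and ii, and the z values in parts iii and iv, can be sketched in Python. This is an illustration only: the function names `percent_within` and `z_score` are made up, and `statistics.pstdev` (the population standard deviation, dividing by N) is used because that is the formula for σ in these notes.

```python
import statistics

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

mean = statistics.mean(test1)      # arithmetic mean
sd = statistics.pstdev(test1)      # population standard deviation (divide by N)

def percent_within(scores, k):
    """Percentage of scores with mean - k*sd <= score <= mean + k*sd."""
    low, high = mean - k * sd, mean + k * sd
    count = sum(1 for x in scores if low <= x <= high)
    return 100 * count / len(scores)

def z_score(x):
    """Number of standard deviations of x from the mean (negative if below)."""
    return (x - mean) / sd

print(round(percent_within(test1, 1), 2), round(percent_within(test1, 2), 2))
print(round(z_score(95), 2), round(z_score(7), 2))
```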

Exercise: For Test 2, find: i. the percentage of the scores that are within one standard deviation of
the mean; ii. the percentage of the scores that are within two standard deviations of the mean;
iii. how many standard deviations 37 is from the mean; iv. how many σ's 91 is from the mean.
Group Frequency Table: A table of groups of scores and the sum of the frequencies of the
individual scores in each group. A group of scores is called a class. Each score belongs to a class,
and can belong to only one class. So classes do not overlap. The other aspects of a class are: (i)
Class Limits (Lower and Upper), (ii) Class Size, (iii) Class Boundary (Lower and Upper), (iv)
Class Interval, and (v) Class Mark. These will be discussed at the appropriate points. Whilst
there is only one Ungroup Frequency Table for a given data, there is more than one Group
Frequency Table for the same data. The distinguishing features are the Class Size, which is the
number of different scores in a class, and the Lowest or Greatest Class Limit.
Group Frequency Table 1 for Test 1

Scores (Test Marks    Frequency (Number      Scores (Test Marks    Frequency (Number
of Students)          of Students)           of Students)          of Students)
   7 - 11                  1                    57 - 61                 2
  12 - 16                  0                    62 - 66                 2
  17 - 21                  1                    67 - 71                 8
  22 - 26                  0                    72 - 76                 2
  27 - 31                  1                    77 - 81                 2
  32 - 36                  1                    82 - 86                 1
  37 - 41                  1                    87 - 91                 1
  42 - 46                  4                    92 - 96                 2
  47 - 51                  0                    97 - 101                2
  52 - 56                  5
Comment: Each class has the same size, 5 (different scores). The lower limit of the fourth class is
22, and the upper limit of the first class is 11. In general the class sizes need not be equal.
Group Frequency Table 2 for Test 1

Scores     Frequency      Scores      Frequency
 4 - 13        1          54 - 63         6
14 - 23        1          64 - 73        10
24 - 33        1          74 - 83         3
34 - 43        2          84 - 93         4
44 - 53        6          94 - 103        2
Comment: Each class size is 10. The lowest limit is a score of 4 and the greatest limit is 103. Neither of
these is a score in the data.
Large data is often given in a Group Frequency Table. This summarizes the data at the expense of
detail. The larger the class size, the shorter the summary and the more detail that is lost. It is
therefore necessary to balance brevity of summary against loss of detail. This is comparable to
the assignment of grades to course marks. As a rule of thumb, or by convention, the number of
classes should not be less than 5, and should not be more than 25.
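
To see how the choice of lowest class limit and class size produces different Group Frequency Tables from the same data, here is a small Python sketch. The function name `group_frequencies` is made up for the illustration, and it assumes whole-number scores and equal class sizes.

```python
test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

def group_frequencies(scores, lowest_limit, class_size):
    """Return rows (lower limit, upper limit, frequency) for equal classes."""
    rows = []
    lower = lowest_limit
    while lower <= max(scores):
        upper = lower + class_size - 1      # classes do not overlap
        f = sum(1 for x in scores if lower <= x <= upper)
        rows.append((lower, upper, f))
        lower += class_size
    return rows

table1 = group_frequencies(test1, lowest_limit=7, class_size=5)
table2 = group_frequencies(test1, lowest_limit=4, class_size=10)

for lower, upper, f in table2:
    print(f"{lower:>3} - {upper:<3}  {f}")
```

With a lowest limit of 7 and class size 5 this gives 19 classes, and with a lowest limit of 4 and class size 10 it gives 10 classes, both within the rule of thumb above.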

Mean, Variance, and Standard Deviation from Group Frequency Table: Each class is represented
by a Class Mark, which is then given the frequency of the class. This ‘reduces’ the Group Frequency
Table to an Ungroup Frequency Table with the Class Marks as the scores, and with the frequencies of the
corresponding classes.
Class Mark, x, of a Class: The mean of the Lower and Upper Class Limits of the class. That is,
(Lower Class Limit + Upper Class Limit) ÷ 2.

Example: Find the Mean, Variance and Standard Deviation for Test 1 Group Frequency Table 2.
Solution: The following table is set up for the formulas to be used:

Test Scores    # of students: f    Class Mark: x       xf      x²f = x(xf)
  4 - 13              1                 8.5             8.5         72.25
 14 - 23              1                18.5            18.5        342.25
 24 - 33              1                28.5            28.5        812.25
 34 - 43              2                38.5            77         2964.5
 44 - 53              6                48.5           291        14113.5
 54 - 63              6                58.5           351        20533.5
 64 - 73             10                68.5           685        46922.5
 74 - 83              3                78.5           235.5      18486.75
 84 - 93              4                88.5           354        31329
 94 - 103             2                98.5           197        19404.5
Sum Σ                36                              2246       154981
Mean x̄ = 2246 / 36 = 62.388… ∴ Mean x̄ = 62 (nearest whole number)
Variance σ² = 154981 / 36 − (2246 / 36)² = 412.654321 ∴ Variance σ² = 413 (nearest whole number)
and Standard Deviation σ = 20.31 (2 decimal places) and σ = 20 (nearest whole number)
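
The class-mark calculation can be sketched in Python as follows. The list `classes` simply restates Group Frequency Table 2 as (lower limit, upper limit, frequency); the other names are made up for the illustration.

```python
import math

# Group Frequency Table 2 for Test 1: (lower limit, upper limit, frequency)
classes = [
    (4, 13, 1), (14, 23, 1), (24, 33, 1), (34, 43, 2), (44, 53, 6),
    (54, 63, 6), (64, 73, 10), (74, 83, 3), (84, 93, 4), (94, 103, 2),
]

# Class mark x = (lower class limit + upper class limit) / 2
marks = [((lower + upper) / 2, f) for lower, upper, f in classes]

sum_f = sum(f for _, f in marks)
sum_xf = sum(x * f for x, f in marks)
sum_x2f = sum(x * x * f for x, f in marks)

mean = sum_xf / sum_f
variance = sum_x2f / sum_f - mean ** 2
sd = math.sqrt(variance)

print(sum_f, sum_xf, sum_x2f)
print(round(mean, 2), round(variance, 2), round(sd, 2))
```

Because every score in a class is replaced by the class mark, these values only approximate the values found from the individual scores.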
Comment: Compare these values to the corresponding values for the Ungroup Frequency Table.
Frequency Table for Category Data: A table of the ‘non-numerical’ attributes of the Category Data with
their corresponding frequencies.
Example: The Frequency Table of the following Category Data of Colour of Cars in a Car Park:
Blue, White, Blue, Blue, Black, Blue, Black, Silver, Silver, Blue, Silver, Green, Black, White,
Silver, Silver, Black, Blue, Green, Blue, Green, Red, Black, Red.

Frequency Table of Colour of Cars in Car Park

Score (Colour of Car)    Frequency (Number of Cars)
Blue                          7
White                         2
Black                         5
Silver                        5
Green                         3
Red                           2

Mode is Blue. That is, there are more Blue cars than cars of any other colour.
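
For category data the same tallying works with names instead of numbers. A minimal Python sketch, again using the standard-library `collections.Counter`, is:

```python
from collections import Counter

colours = [
    "Blue", "White", "Blue", "Blue", "Black", "Blue", "Black", "Silver",
    "Silver", "Blue", "Silver", "Green", "Black", "White", "Silver",
    "Silver", "Black", "Blue", "Green", "Blue", "Green", "Red", "Black", "Red",
]

counts = Counter(colours)                  # colour -> number of cars

# most_common(1) gives the category with the highest frequency: the mode.
mode_colour, mode_count = counts.most_common(1)[0]

for colour, f in counts.items():
    print(f"{colour:<8}{f}")
print("Mode:", mode_colour)
```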
Comment:
Numerical Data was organized by (i) Stem and Leaf, and (ii) Frequency Table, both ungroup
(single scores and frequencies) and group (classes of scores and frequencies). From these we calculated
(i) the Measures of Central Tendency or the Averages: Mode, Median, and Mean; and (ii) some of the
Measures of Dispersion: Range, Variance, and (from the Variance) the Standard Deviation.

Category Data was organized in a Frequency Table of category attributes and frequencies, and from it we
calculated the Mode, a Measure of Central Tendency or Average. The mode is the only measure of
central tendency that makes sense for category data. No measure of dispersion was calculated, because
none of those above makes sense for category data.

PICTORIAL REPRESENTATION OF DATA by STATISTICAL GRAPHS


A Statistical Graph is the pictorial representation of the relation between statistical variables, for example,
scores and frequencies. The Pie, Line, and Bar graphs and the Histogram are examples of pictorial
representation of the relation between scores and frequencies. These graphs are among the most
common statistical graphs.
Numerical data can be represented by any one of the four graphs. Categorical data can be
represented by the Pie and the Bar graphs, but cannot be represented by the Line graph or the
Histogram. Which graph to use depends on the type of data and the purpose of the graph.

Pie Graph or Chart:
The pie chart is a circle divided into sectors, to represent the proportion of the frequency of a
‘Score’, ‘Class’ or ‘Category’ to the number of scores (sum of frequencies of the scores).

Example: Pie Graph for the Colour of Cars in a Car Park

Colour of Cars    Number of Cars
Blue                   7
White                  2
Black                  5
Silver                 5
Green                  3
Red                    2

Making a Pie Graph:

i. Find the number of scores or sum of frequencies. For the example: N = 24.
ii. Find the ratio, as a fraction, of the frequency of a category to the sum of frequencies.
Find the product of the fraction and 360°. At the centre of a circle, measure an angle
equal to the product. Draw the sector of the circle subtended by this angle. This
sector represents the frequency of the category. For example, for the category Blue, the
fraction is 7/24. The product of the fraction and 360° is 105°. The sector subtended by
105° represents the proportion or percentage of Blue cars out of the number of cars, or the
number of Blue cars.
iii. Repeat for each category. Indicate which sector is for which category by a legend or by writing in
the sectors.
Or use computer software, for example Excel, SPSS, MATLAB, etc.
Comment: Pie charts are good for comparing the relative frequencies of the categories with one
another and with the sum.
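
The sector angles for all the categories can be worked out with a few lines of Python. This is only a sketch of steps i to iii; it computes the angles and does not draw the chart.

```python
counts = {"Blue": 7, "White": 2, "Black": 5, "Silver": 5, "Green": 3, "Red": 2}

total = sum(counts.values())       # step i: N = sum of frequencies

# Step ii, repeated for each category (step iii):
# angle = (frequency / N) x 360 degrees
angles = {colour: f * 360 / total for colour, f in counts.items()}

for colour, angle in angles.items():
    print(f"{colour:<8}{angle:>6.1f} degrees")

# The sector angles together make up the full circle.
print("Sum of angles:", sum(angles.values()))
```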
Bar Graph or Chart:
The Bar chart is a graph of rectangular bars (or blocks). The ‘width’ of a bar represents a ‘Score’,
‘Class’ or ‘Category’, and the Area of the bar is equal to the frequency of the score. The ‘length’ is
therefore equal to the frequency of the score divided by the width. If the widths are all equal then
the lengths are taken to be the corresponding frequencies.
Bar graphs are mainly used for pictorial comparison of the frequencies of scores. This includes how
the scores are distributed around the measures of central tendency.
Example: Draw the Bar graph for the data below:

Frequency Table of Colour of Cars in Car Park

Score (Category): Colour of Car    Frequency: Number of Cars
Blue                                    7
White                                   2
Black                                   5
Silver                                  5
Green                                   3
Red                                     2

Solution: From Microsoft Excel:

[Bar graph: Colour of Cars]

The line graph has the score as the independent variable and the frequency as the value of a
function. The bar graph has ‘category’ on the horizontal axis as the base of a rectangle (same
width) and the frequency as the height. If it is group numerical data, the class defined by the lower
and upper class limit is the ‘category’ and then, the base of the rectangle is proportional to the class
size and the height is the frequency.
Histogram:
The Histogram is the graph formed by rectangles representing the classes of a group frequency table
of numerical data. The area of the rectangle for a class on a histogram is equal to the frequency of
the class. The base of the rectangle runs from the lower to the upper class boundary, and the height of
the rectangle is the frequency of the class divided by the class interval (width). For equal class
intervals, the height of a rectangle is taken to be the ‘frequency’ of the class. There are no gaps
between the rectangles of a histogram. (There can be gaps between the rectangles of a Bar graph.)
One use of the histogram is to find the ratio or fraction of the number of scores that lie within a given
number of standard deviations of the mean (for example, within one standard deviation of the mean)
to the total number of scores. This is the ratio or fraction of the area of the rectangles in the region
(of interest) to the total area of the histogram. These ratios, interpreted as the probability of a score
lying in the region, are used in statistical decision-making.
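
The relation between area, height and class interval can be sketched in Python for Group Frequency Table 2 of Test 1. Two things here are assumptions made for the illustration and are not stated in the notes: the class boundaries are taken to be half a mark below the lower limit and half a mark above the upper limit (reasonable for whole-number scores), and the region of interest is taken to be the classes from 44 to 83.

```python
# Group Frequency Table 2 for Test 1: (lower limit, upper limit, frequency)
classes = [
    (4, 13, 1), (14, 23, 1), (24, 33, 1), (34, 43, 2), (44, 53, 6),
    (54, 63, 6), (64, 73, 10), (74, 83, 3), (84, 93, 4), (94, 103, 2),
]

def rectangle(lower, upper, f):
    """Return (lower boundary, upper boundary, height) for one class."""
    lb, ub = lower - 0.5, upper + 0.5      # assumed class boundaries
    width = ub - lb                        # class interval
    return lb, ub, f / width               # height = frequency / width

bars = [rectangle(*c) for c in classes]

# Area of each rectangle = width x height = frequency of the class.
total_area = sum((ub - lb) * h for lb, ub, h in bars)

# Fraction of the total area lying in the region from 43.5 to 83.5.
region_area = sum((ub - lb) * h for lb, ub, h in bars
                  if lb >= 43.5 and ub <= 83.5)
fraction = region_area / total_area

print(round(total_area, 2), round(fraction, 4))
```

Because the rectangles have no gaps and each area equals a class frequency, the fraction of area in a region is the fraction of scores in the classes of that region.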

Histogram for Group Frequency Table 2 for Test 1.

Test 1 Scores (Class)    # of students: f
  4 - 13                      1
 14 - 23                      1
 24 - 33                      1
 34 - 43                      2
 44 - 53                      6
 54 - 63                      6
 64 - 73                     10
 74 - 83                      3
 84 - 93                      4
 94 - 103                     2


Graphical Representation of Data

Statistical Graphs: References
http://www.cimt.plymouth.ac.uk/projects/mepres/book9/bk9_8.pdf
http://www.cimt.plymouth.ac.uk/projects/mepres/book8/bk8_5.pdf
http://www.cimt.plymouth.ac.uk/projects/mepres/allgcse/pbtxt.pdf

Percentage of (Number of) Scores between Standard Deviations of the Mean: