STATISTICS IN PRACTICE

Part I
The use of statistics in business process improvement

All rights reserved. Nothing from this publication may be copied, stored in an authorised data
file, or made public in any form or any manner whether electronic, mechanical, photocopying,
photography or any other means, without the prior written consent of the author.

Contents

1      Background to statistics
1.1    What is statistics?
1.2    Research terminology

2      Descriptive statistics
2.1    Types of tables and graphs
2.2    Measuring levels
2.3    Measuring trends
2.3.1  Centre measures
2.3.2  Distribution measures
2.4    Trend measures per measure level
2.5    Normal distribution (1)
2.6    Questions and assignments

3      Calculating risk
3.1    Questions and assignments

4      Inductive statistics
4.1    The normal distribution (2)
4.2    Z-scores

Appendices:
Table with z-values

1 Background to statistics

Imagine you have a computer shop and have signed a contract with IBMS. All students can
purchase a notebook at a discount from your shop. You are a good businessman and would like
to have an estimate of the number of purchasers this will create for you, if only because you have
to ensure you have adequate stock. It's obvious an investigation is needed. But...
- How can you analyse the results from the questionnaire?
- How many students do you have to question to get a good picture?
- How can you present the results meaningfully and understandably?
- How great is the chance that the result from your survey is a good prediction of the final sales
figures?
To answer these questions, you need statistics. We understand statistics to mean turning data
into information, in order to be able to analyse better. The consistent methods used make an
analysis clearer, simpler and more efficient.

1.1 What is statistics?

It is usually impossible, and also unnecessary, to involve in an investigation all the people or
objects about which you want to know something. If the businessman from the example above
wants to have an indication of the number of computers he will sell to the IBMS students as a
result of his discount offer, he doesn't need to ask all IBMS students. Taking a sample is
sufficient. If his sample is sufficiently large and it forms a good reflection of the student group
as a whole, then the result will, in terms of percentage, be a good match to what he would get if
he had asked all the students. On the basis of the sample's results, he can make a sound estimate
of how many computers he must have in stock.
Therefore, market research is about processing data at different levels. You are involved with the
results of surveys and also in the generalisation of the data to a larger group.
The analysis and description of sample (survey) data is called descriptive statistics. This is the
simplest form: you are only concerned with the data from the sample itself. You construct tables,
you calculate key figures such as means, or you display graphically what you have discovered.
However, we rarely stop there. With most surveys, we want to make statements that go beyond
the sample. For example, you would like to know how great the chance is that your results from
the sample are valid for the complete group from which the sample was taken, or you want to
compare your results with those of an earlier investigation. For these types of questions, you
need more complex calculations and considerably more knowledge. All calculations that go
beyond the sample results belong to inductive statistics. An example of this is when you try to
forecast how many IBMS students will buy one of your laptops, based on a sample of only 75
IBMS students.

1.2 Research terminology

Statistics is a discipline that is full of specialised terminology. Using the correct concepts in
the correct context is very important. By using uniform terms, you are in a position to confer
with colleagues in such a way that you understand each other, and you can explain simply to
your customer what the results mean. But you also must have a command of the research
terminology in order to be able to understand statistics. Statistics is not about learning formulae
in isolation but about gaining fundamental insight into basic processes, so no misunderstandings
can exist about the meaning of the terms used. You must strictly adhere to this. To start with, we
distinguish between research units and research properties.
Research units and properties
We will stay with the example given earlier, a survey into students at IBMS, with the goal of
making predictions about all the students in the college. The IBMS students concerned are
therefore the research units of this investigation.
All the research units together we call the population of the investigation. You will make
predictions about the population based on the results of the survey after the completion of the
market investigation.
Research units are often people, but that is not always the case. If you want to investigate the
safety of pedestrian crossings, then the pedestrian crossings are your research units and if you
want to do an investigation into the percentage of accounts of Dutch companies that contain
errors, then the accounts of Dutch companies are your research units.
The part of the population about which you actually collect information is the sample. If you
question or observe a group of people who are part of the population, that group makes up your
sample. But if all Dutch pedestrian crossings or accounts make up your population, you can also
draw a sample from them.
A research unit in a sample is also called a record. If the research units are people that have
answered questions, then we usually speak of them as respondents.
Therefore, both your population and your random sample contain research units and one
research unit in your sample is called a respondent or a record.
The raw counts you collect from a sample are, as a rule, not what matters for the end result of
your research. Why would our shopkeeper want to know that, for example, one hundred students
questioned in the sample acknowledged that they want to buy a computer? The only important
thing is approximately how many students he can expect as customers in the future, and thus
which fraction of the population. We call this fraction a proportion of the population.
What we are doing in the research is determining which percentage of those questioned in the
sample are interested, and we use this percentage as an indicator for the situation for all IBMS
students. What characterises market research, therefore, is the collection of information from
research units that are part of the sample, with the aim of making predictions about a
population. This is an important insight.
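The step from a sample percentage to a prediction about the population can be sketched in a few lines of Python. The figures used here are hypothetical assumptions chosen only for illustration (a sample of 400 students, 100 of whom are interested, and a population of 2000 IBMS students); none of them come from the investigation described in the text.

```python
def estimate_buyers(interested_in_sample, sample_size, population_size):
    """Use the sample proportion as an indicator for the whole population."""
    proportion = interested_in_sample / sample_size
    expected_buyers = proportion * population_size
    return proportion, expected_buyers


# Hypothetical figures, assumed for illustration only.
proportion, expected_buyers = estimate_buyers(
    interested_in_sample=100, sample_size=400, population_size=2000
)
print(f"Proportion interested in the sample: {proportion:.0%}")
print(f"Expected number of buyers in the population: {expected_buyers:.0f}")
```

The sketch only scales the percentage up. How far the real figure may deviate from this estimate is a question for inductive statistics.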
Obviously, IBMS students differ from each other. For example, some students are male, others
are female. Some live in Eindhoven, others live somewhere else. Additionally, there are students
who are satisfied with their studies, and others who get little out of them. In other words, the
research units have properties. In research terminology, we call these properties or
characteristics of the research units variables. Pedestrian crossings also have properties: some
have traffic lights and others do not. Such differences can be responsible for variations in the
research results because the number of accidents, for example, can vary according to the
presence of traffic lights.

Study town, study course, gender, satisfaction and the presence of traffic lights are examples of
the characteristics of research units and thereby of variables in your investigation.
In our notebook example, a variable such as the income of students is important because it
could lead to differences in computer purchases. A student with a high income might be more
inclined to buy a computer than a student with less income.
We distinguish two types of properties. If the property of the research unit cannot change within
the proposed setting (or: conceptual model, see below), then we talk about an independent
variable.
Our example is directed to the question of whether the independent variable has a relationship
with the purchase or not of a computer. The latter, the purchase behaviour, we call a dependent
variable, because it is hypothesized to be dependent on the level of the independent variable(s).
Sometimes, the distinction between dependent and independent variables is difficult to draw
because both appear to be independent. In this type of situation, you can ask which of the two
precedes the other in time. The independent variable always precedes the dependent variable.
For an investigation into the relationship between intelligence and income, for example, it
appears that both characteristics are independent but a high IQ precedes a high income.
Back to the example investigation in which we wish to verify if there is a relationship between the
income of students (the independent variable) and the purchase behaviour (the dependent
variable). Reversing these is not possible because you will not get more income just by
buying a computer. The purchasing behaviour is something that can change; therefore we call this
type of variable dependent.
It is normal to represent the relationship between an independent and dependent variable in a
diagram called the conceptual model. A conceptual model in its basic form looks like this:

study course              --->    buying behaviour
(independent variable)            (dependent variable)

If you write a market research proposal or a report, a conceptual model makes it clear at a glance
what the focus of the research is to be. Therefore, it should seldom be absent.
Examples of independent variables are gender, education, shop outlets, hair colour, IQ, type of
car, the presence or absence of traffic lights, or departments in companies.
Examples of dependent variables are satisfaction, an opinion about a certain subject, an intention
to purchase and the number of accidents.
Representation
The core of a lot of market research is, as mentioned, that we collect data from a sample in order
to make predictions about the population. Naturally, this is only possible if the properties of the
research units of the sample have the same composition as the population. If 40% of the
population in the previous example consists of students with a high income, then this has to be
the case in the sample, otherwise we will get a distorted impression. The sample must,
as it is called, be representative. Only if the sample is representative can we safely generalise
the results of the sample to the population.

Sample size
Whenever we research a sample, the results will generally not totally agree with the situation in
the population. For example, if we find that 15% of those questioned in our sample say they will
buy a computer soon, then we hope that this 15% gives a good indication as a prediction of the
purchasing behaviour of all the students, but we do not have that certainty. It is certain that the
more students we question (that is, the greater the number of research units in our sample), the
greater the certainty that our 15% reflects the situation in the population.
In order to calculate how close our sample results are to the situation in the population, inductive
statistics become involved. The accuracy that we require and the extent to which we will accept
deviations are central to inductive statistics and are heavily involved with the theory of probability.
For example, we accept that the prediction will not deviate by more than 5% (called the margin)
with a probability of 90% (the accuracy). We will delve much deeper into this later. Let's start
with descriptive statistics.
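The effect of sample size can be made visible with a small simulation, given here as a rough sketch rather than as the method of inductive statistics that is treated later. It assumes a population in which exactly 15% of the students will buy a computer, draws many samples of a given size, and counts how often the sample percentage lands within a margin of 5 percentage points of that 15%. The sample sizes, the number of repetitions and the seed are arbitrary assumptions.

```python
import random


def share_within_margin(sample_size, true_share=0.15, margin=0.05,
                        repetitions=2000, seed=1):
    """Fraction of simulated samples whose result lies within the margin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repetitions):
        # Each questioned student is a buyer with probability true_share.
        buyers = sum(rng.random() < true_share for _ in range(sample_size))
        # Small tolerance so rounding noise does not exclude the boundary.
        if abs(buyers / sample_size - true_share) <= margin + 1e-9:
            hits += 1
    return hits / repetitions


for n in (25, 100, 400):
    print(f"n = {n:3d}: {share_within_margin(n):.1%} of samples within the margin")
```

The larger the sample, the larger the share of samples that stay within the margin, which is the pattern described above.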

2 Descriptive statistics

2.1 Types of tables and graphs

Two types of tables you will use in market research are frequency tables and crosstabulations
(also called cross-relationship tables). A frequency table is concerned with one variable. A
cross-relationship table is meant to display the relationship between two or more variables.
Frequency tables
Here is an example of the SPSS (Statistical Package for Social Sciences) output of a frequency
table, also often called straight counting. It is the simplest and most common form of presentation.

Study course

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Technical    230         57,2      57,2            57,2
        Nursing      172         42,8      42,8            100,0
        Total        402         100,0     100,0

Of course, in a report for the customer you always tidy up crude computer output first. It is not
much trouble to rework the table above into the result below.

Table 1: Study courses

             Number   Percent
Technical    230      57,2
Nursing      172      42,8
Total        402      100,0

From the table you can see that 402 students were questioned for this investigation. These are all
the respondents, which sets the total at 100%. Of the 402 interviewed, 230 are following a
technical course (approximately 57%) and 172 a nursing course (43%).
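The percentages in a frequency table follow directly from the counts. The sketch below recomputes the Percent and Cumulative Percent columns from the two counts in the table above; the function name `frequency_table` is a hypothetical helper, not part of SPSS.

```python
def frequency_table(counts):
    """Return rows of (category, count, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.items():
        running += count
        rows.append((
            category,
            count,
            round(100 * count / total, 1),    # share of all respondents
            round(100 * running / total, 1),  # share up to and including this row
        ))
    return rows


counts = {"Technical": 230, "Nursing": 172}  # counts taken from the table above
for category, count, percent, cumulative in frequency_table(counts):
    print(f"{category:<10} {count:>4} {percent:>6} {cumulative:>6}")
```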

Crosstabulations
We select a cross-relationship table if we want to know whether a part of the sample with a
particular property scores differently on a question from those with other properties. For example,
if you want to know from the sample whether pedestrian crossings with traffic lights are safer than
those without, or whether customers of a particular chain store are more satisfied with one store
than another, or whether women spend more money than men, then a cross-relationship table is
the correct tool.
The construction and interpretation of crosstabulations appear simple, but they definitely are not.
In day-to-day use, many errors are made in building them and in reading their output! To avoid
mistakes, it is strongly recommended to keep to a fixed procedure. Consistent adherence to the
procedure is essential, and it is important to keep the steps in mind when you construct or
interpret a cross-relationship table!
Basic rules for building a cross-relationship table
- The independent variable is always at the top.
- Percentages are always used in the cells (absolute numbers are optional).
- Calculating percentages is only done in the columns.
- Interpreting is only done by looking at the percentages.

For our example investigation, a cross-relationship table is constructed to see if participating in
the notebook project varies per study course. To keep it simple, we assume that a university only
has technical and nursing students. The table is based on computer output to which several
cosmetic improvements have been made. The same is expected from you when you construct
this type of table.
Cross-relationship table: Participation in notebook project by study course.

                         Study course
Participate in
notebook project?        Technical        Nursing          Total
Yes                      174    77,0%     116    67,4%     290    72,9%
No                        52    23,0%      56    32,6%     108    27,1%
Total                    226   100,0%     172   100,0%     398   100,0%

In this case, the study course is the independent variable. This characteristic belongs at the top in
the table heading. The question in the investigation (Will you or will you not participate in the
notebook project?) is the dependent variable and therefore, by definition, is placed on the left of
the table.
In each of the cells, you can see the absolute numbers and the percentages. Only the
percentages are actually important! You ignore the absolute numbers in your reporting.
On the bottom row, you see 100% three times. This shows you that, according to the rule, the
percentages are calculated in columns because each column has a total of 100%. The number of
technical students questioned is 100% in total just as the number of nursing students and the
total number of respondents.
Only a table constructed in this way allows correct interpretation. In this case, the interpretation
is the following: 73% of all the students questioned indicate that they will participate in the
notebook project. In the survey, there is a difference according to the course. A higher
percentage of technical students will take part in comparison with those doing the nursing course.
The percentages are 77% and 67% respectively.
Whether or not we can conclude that the difference in the sample is also a difference in the
population is an important question which we cannot yet answer. For predictions about the
population we need inductive statistics. This will be covered later.
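The rule that percentages are calculated only in the columns can be checked with a short sketch. It takes the absolute numbers from the cross-relationship table above, adds the total column, and divides every cell by its own column total. The data layout (a dictionary per column) and the function name are assumptions made for this illustration.

```python
def column_percentages(table):
    """table maps column name -> {row name: count}; percentages per column."""
    result = {}
    for column, cells in table.items():
        column_total = sum(cells.values())  # each column adds up to 100%
        result[column] = {
            row: round(100 * count / column_total, 1)
            for row, count in cells.items()
        }
    return result


# Absolute numbers from the table above; the independent variable
# (study course) forms the columns.
counts = {
    "Technical": {"Yes": 174, "No": 52},
    "Nursing": {"Yes": 116, "No": 56},
}
counts["Total"] = {
    answer: counts["Technical"][answer] + counts["Nursing"][answer]
    for answer in ("Yes", "No")
}

percentages = column_percentages(counts)
for column, cells in percentages.items():
    print(column, cells)
```

Dividing by the row totals instead would answer a different question (which share of the participants is technical), which is why the fixed procedure matters.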

Figures

The readability of a report benefits considerably when the data from the tables is turned into
figures. The two most common figures are histograms (or bar charts) and pie charts. To
construct bar and pie charts, most researchers use Excel in view of its ease of use and the large
number of layout possibilities.

2.2 Measuring levels

In a questionnaire, questions are included that can be roughly split into four categories. The
difference is in the way the answers can be processed statistically.
Look at the first four questions in the text box below and try to figure out what the differences
for processing the results are.

Questions with answers of various measuring levels

1  Please mark with a cross whether you are a man or a woman.
   Man
   Woman

2  Do you consider yourself as a satisfied or dissatisfied customer of this store?
   Very dissatisfied
   Dissatisfied
   Satisfied
   Very satisfied

3  What time is it?
   ....... o'clock

4  How old are you?
   ................. years

------------------------------------------------------------------------------------------------------------------------

   How old are you?
   19 or younger
   20 to 29
   30 to 39
   et cetera

The possible answers to the questions are each of a different order; each has its own measuring
level. It is important to know what the measuring level of a score is, because it determines
which statistical processing is possible. A researcher must realise this before developing a
questionnaire.
In the first question about gender, the respondent can select one of two options, in this case
without there being any mention of a value difference between either option. You are a man or a
woman. There is nothing between them and there is also no value difference. The only thing a
researcher can do is to determine how many people belong in the one category and how many
people belong in the other category. There is not a lot to calculate. We call this data at the
nominal level. All yes/no questions are also nominal data.
It is different with question 2 which, likewise, is a closed question but with a value differentiation
appearing. For each answer there is some indication of more or less, or better or worse. The
answers vary from high to low or from much to little. Therefore this is data at the ordinal level.
In question 3, respondents can fill in the time. This is an example of data on interval level. The
interval between 12.00h and 14.00h is just as long as the interval between 16.00h and 18.00h.
However, we cannot say that 14.00h is two times as late as 07.00h. Other examples of data on
interval level include people's IQ and the outside temperature.
Finally, in question number 4, numbers are filled in for the age. Age, at least if it is recorded as
a separate number, is ratio data. Numbers allow the most calculation possibilities. Someone who
is four years old has lived twice as long as someone who is only two years old. Other examples of
ratio scale data include income and weight.

2.3 Measuring trends

Every market researcher will be curious to know whether particular trends in the results are true.
The two most important ways to represent numerical results are by centre and distribution. Or, in
other words, where is the highest concentration of numerical observations and how far do
these observations vary?
Mean, median and mode show which values the data is grouped around. These are centre
measures. The range, the variance and the standard deviation are necessary to determine how
widely or narrowly the responses are spread. Percentile scores say something about the position
of individual scores compared with the other scores.

Difference between centre and distribution measurements

Centre:        where is the highest concentration of observations?
Distribution:  how far do the observations vary?

Centre measurements are explained in section 2.3.1 and distribution measurements in section
2.3.2.

2.3.1 Centre measures

The most important centre measures are the arithmetical mean, the median and the mode.
Arithmetical mean
When five people questioned are 19, 25, 25, 21 and 35 years old, the mean age is 25 years
((19+25+25+21+35) / 5 = 25).
In calculating the arithmetic mean, you will meet two different symbols in the formulae. Population
and sample means have their own designations. They are as follows:

$\mu$ = Population mean
$\bar{x}$ = Sample mean
$N$ = The number of observations in the population
$n$ = The number of observations in the sample
$x_i$ = Individual score ($i = 1$: first score, $i = 2$: second score, et cetera)
$\sum$ = Sum of the observations
(The subscript $i$ runs from the first observation up to and including the last, or 1 to $n$ inclusive)

The calculation of the arithmetic mean in a formula:

\[
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]
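As a minimal sketch, the same calculation in Python, applied to the five ages from the example above (the function name is chosen for this illustration):

```python
def arithmetic_mean(observations):
    """Sum of the observations divided by the number of observations (n)."""
    return sum(observations) / len(observations)


ages = [19, 25, 25, 21, 35]   # the five ages from the example
print(arithmetic_mean(ages))  # 125 / 5 = 25.0
```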

Mode
The mode is the most frequent observation. Look at the following list:
19
25
25
21
35
There is one score occurring twice, 25. That is the mode.
Tip: You have most certainly heard the term modal income. Many people think that this means
the average income. This is not correct: it is the most common income.
We work with classes in many investigations. For example, respondents can indicate if their age
falls in a certain category such as 10-19 years or 20-29. In this case, the most prevalent class is
the modal class.
Numbers grouped in classes also have a mode. The mode is the middle of the modal class.
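Both variants of the mode can be sketched as follows. The list of ages is the one from the text; the class frequencies are hypothetical figures assumed for illustration, and the middle of a class is taken here simply as the average of its stated lower and upper limit.

```python
from collections import Counter


def mode(observations):
    """Most frequent observation (the first one found if there is a tie)."""
    return Counter(observations).most_common(1)[0][0]


def mode_of_classes(class_counts):
    """Middle of the modal class; class_counts maps (low, high) -> frequency."""
    low, high = max(class_counts, key=class_counts.get)
    return (low + high) / 2


ages = [19, 25, 25, 21, 35]
print(mode(ages))  # 25 occurs twice, all other ages once

# Hypothetical class frequencies, assumed for illustration only.
age_classes = {(10, 19): 4, (20, 29): 9, (30, 39): 5}
print(mode_of_classes(age_classes))  # middle of the class 20-29
```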

Median
The median is the middle observation after all the observations have been sorted from low to
high. In other words, the median is the observation where 50% of the observations are above and
50% below. In order to determine the median, you first sort the observations from low to high (19,
21, 25, 25, 35). Subsequently, we look at which one has the middle score. In this case, it is 25
because there are two numbers below and two above.
Whenever the number of observations is even and therefore you have two middle observations,
the median is the arithmetical mean of these two scores. Imagine the list is 19, 21, 25, 27, 29, 35,
then there are two numbers in the middle namely 25 and 27. The median is 26.
You can use the following formula to determine the location of the median in the set of
observations:
Lm = (n + 1)/2
In which,
Lm = the location of the median
n = the number of observations
Using the five observations above (19, 21, 25, 25, 35), we determine Lm = (5 + 1)/2 = 3. Hence,
the median is located at the third observation. Be careful though, because 3 is not the median!
The median is equal to the value of the third observation, which is 25. Using the six observations
given above (19, 21, 25, 27, 29, 35), we determine Lm = (6 + 1)/2 = 3.5. The median in this set of
observations is the value of the "3.5th" observation, which lies exactly in the middle of the third
and fourth observations (25 and 27 respectively). Therefore, we take the average of these two
observations: (25 + 27)/2 = 26.
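The location formula translates into code almost literally. The sketch below sorts the observations, computes Lm, and either returns the observation at that location or averages the two observations around it; it is one possible way to write this down, not the only one.

```python
def median(observations):
    """Median via the location formula Lm = (n + 1) / 2."""
    ordered = sorted(observations)
    location = (len(ordered) + 1) / 2    # Lm, counted from 1
    lower = ordered[int(location) - 1]   # observation at the whole part of Lm
    if location.is_integer():
        return lower                     # odd n: one middle observation
    upper = ordered[int(location)]       # even n: the next observation
    return (lower + upper) / 2


print(median([19, 25, 25, 21, 35]))      # Lm = 3, third observation
print(median([19, 21, 25, 27, 29, 35]))  # Lm = 3.5, average of 25 and 27
```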
Examples of centre measures
Below you will find the frequency distribution according to the SPSS output, for which the mean,
median and mode have been calculated.
As you can see, the mean score of these 13 observations is 27614 (rounded). The median is
25666. You can see for yourself where the median is if you look under 'Cumulative Percent': it
is the value at which the cumulative percentage first passes 50% (53,8). Six observations are
smaller than 25666 and six are larger. The values in the first column are automatically sorted
from low to high. It must be the 7th observation (middle) because there are 13.
Under 'Frequency', or number, you can see how often an observation appears. There is one
number that appears more often than the rest, 28950, and therefore that must be the mode.

Valid      13
Missing    0
Mean       27613,92
Median     25666,00
Mode       28950

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   22222    1           7,7       7,7             7,7
        22333    1           7,7       7,7             15,4
        23434    1           7,7       7,7             23,1
        25050    2           15,4      15,4            38,5
        25555    1           7,7       7,7             46,2
        25666    1           7,7       7,7             53,8
        28950    3           23,1      23,1            76,9
        34244    2           15,4      15,4            92,3
        34333    1           7,7       7,7             100,0
        Total    13          100,0     100,0
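The three centre measures in this output can be reproduced with Python's standard `statistics` module. The sketch below is only a cross-check of the table, not a description of how SPSS works internally; it expands the frequency column into the 13 separate observations first.

```python
import statistics

# Value -> frequency, copied from the frequency table above.
frequencies = {
    22222: 1, 22333: 1, 23434: 1, 25050: 2, 25555: 1,
    25666: 1, 28950: 3, 34244: 2, 34333: 1,
}

# Expand the table into the individual observations.
observations = [value for value, count in frequencies.items()
                for _ in range(count)]

print(len(observations))                        # number of valid observations
print(round(statistics.mean(observations), 2))  # mean
print(statistics.median(observations))          # median (7th of 13)
print(statistics.mode(observations))            # mode (appears three times)
```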

2.6 Questions and assignments

In an investigation for the C-1000 supermarket chain, researchers were looking into whether
there was a difference in satisfaction between two stores.
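Store A:  In this shop, 80 of the 100 people interviewed were satisfied with the shop.
Store B:  In this shop, 100 of the 150 people interviewed were satisfied with the shop.

1.  What is the population of the investigation?
    ...........................................................................................................................

2.  What is the independent variable in this investigation?
    ...........................................................................................................................

3.  Fill in the absolute numbers in the table below:

                    ..........    ..........    total
    ..........      ..........    ..........    ..........
    ..........      ..........    ..........    ..........
    total           ..........    ..........    250

4.  Fill in the percentages in the table below:

                    ..........    ..........    total
    ..........      ..........    ..........    ..........
    ..........      ..........    ..........    ..........
    total           ..........    ..........    100%

5.  Interpret the results of this investigation
    ...........................................................................................................................
    ...........................................................................................................................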

Another investigation was performed by Nike. The goal of this research was to find out whether
there was a difference between young and older people in how they judge the 'coolness' of the
brand Nike. This trend-sensitive manufacturer of fashionable brand-name articles has few of its
own sales outlets. Therefore, customers are difficult to find for research. From a previous survey
it was found that there was a high degree of correlation between the customers of Foot Locker
shops and the purchasers of Nike products, so with the permission of Foot Locker, people were
interviewed outside the shops with questions about Nike.
(This example is fictional.)
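6.  What is the population of the investigation?
    ...........................................................................................................................

Nike feared that people over twenty found the brand cooler than youngsters under twenty, which
would be a disaster for Nike. From the investigation, the following appeared:

< 20:         Of the 200 youngsters in the sample, 80 found Nike to be a cool brand.
20 or older:  From this group of 150, 100 found Nike to be a cool brand.

7.  What is the independent variable in this investigation?
    ...........................................................................................................................

8.  Fill in the absolute numbers in the table below:

                    ..........    ..........    total
    ..........      ..........    ..........    ..........
    ..........      ..........    ..........    ..........
    total           ..........    ..........    350

9.  Fill in the percentages in the table below:

                    ..........    ..........    total
    ..........      ..........    ..........    ..........
    ..........      ..........    ..........    ..........
    total           ..........    ..........    100%

10. Interpret the results
    ...........................................................................................................................
    ...........................................................................................................................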

Given the following investigation result (n = number of people interviewed):

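            Visits a café weekly
Age         Number                  n =
15 to 24    60                      200
25 to 34    55                      200
35 to 44    52                      200
45 to 54    50                      200
55+         75                      200

11. How great is the chance that a 23-year-old will visit a café weekly?
    ...........................................................................................................................

12. What is the median of the numbers in the column 'Visits a café weekly'?
    ...........................................................................................................................

13. What is the arithmetic mean for the numbers in the column 'Visits a café weekly'?
    ...........................................................................................................................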

14. We can make a distinction between four different measuring levels. Explain for all
    of the following data what measuring level is used.
    a) Data about the weight of packages of coffee

    b) Data about the country of origin of customers

    c) Data about the intelligence quotient (IQ) of students

    d) Data about the level of satisfaction on a five-point scale

15. Calculate the mean, mode and median for the following scores of the statistics
    exam from last year:

    7        5,5
    4        2,5
    1        6
    8        7
    6,5      5
    6,5      4,5
    5        5
    3,5      8
    4        9
    10       2

    Mean:    ....................

    Mode:    ....................

    Median:  ....................
