
TOPIC 1

INTRODUCTION
Topic outline
1.1 What is Statistics
• Key statistical concepts
• Practical applications
1.2 Types of data
• Methods of collecting data
1.3 Sampling
• Sampling plans
• Sampling and non-sampling errors

1.1 What is Statistics


In today’s world we have access to more data than ever.
For example, data are collected for business applications
from:
• Direct observation or measurement
• Customer surveys
• Political polls
• Economic surveys
• Marketing surveys
• Scanner data

This topic introduces various statistical concepts. In this topic, we
also introduce various methods of data collection.
In today’s world…
How can we make use of the collected data to help make informed
business decisions?

By learning statistics, which is a collection of various techniques
and tools that can help make such decisions.

What is Statistics?

‘Statistics is a body of principles and methods concerned with
extracting useful information from a set of data to help people make
informed business decisions.’
What is Statistics?
‘Statistics is a way to get information
from data to make informed decisions.’

[Diagram: Data → Statistics → Information]

Data: Mostly numerical facts collected from direct observations,
measurements or surveys.

Information: Knowledge communicated concerning some particular fact,
which can be used for decision making.

Example 1: Stats anxiety…


(Example 1.1, page 2)

A student enrolled in a business program is attending his first
lecture of the compulsory business statistics course. The student is
somewhat apprehensive because he believes the myth that the course is
difficult. To alleviate his anxiety, the student asks the lecturer
about last year’s exam marks of the business statistics course. The
lecturer obliges and provides a list of the final marks. The marks are
composed of all the within-semester assessment items plus the
end-of-semester final exam.

What information can the student obtain from the list?


Example 1: Stats anxiety…
List of data provided by the lecturer to the student.

Example 1: Stats anxiety…

[Diagram: Data → Statistics → Information]

Data: List of last year’s statistics marks (65, 71, 66, 79, 65, 82, …).

Information: Summary information derived about the statistics class,
e.g. class average, proportion of class receiving F’s, most frequent
mark, highest and lowest marks, spread of the marks, grade
(A, B, C, D, F) distribution, etc.
Example 1: Stats anxiety…

‘Typical mark’
Mean (average mark)
Median (mark such that 50% above & 50% below)

Mean = 72.67
Median = 72
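
As a rough illustration of how a ‘typical mark’ is computed, the sketch below uses Python’s standard `statistics` module. The `marks` list is an assumption: it holds only the six marks visible in the lecturer’s excerpt, so its results will not match the full-class figures (mean 72.67, median 72).

```python
import statistics

# Assumption: only the six marks visible in the excerpt are used here;
# the full class list is not reproduced in these notes.
marks = [65, 71, 66, 79, 65, 82]

mean_mark = statistics.mean(marks)      # sum of the marks divided by their count
median_mark = statistics.median(marks)  # middle value of the sorted marks

print(f"mean = {mean_mark:.2f}, median = {median_mark}")
```

With an even number of marks, `statistics.median` averages the two middle values.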

Is this enough information?

Example 1: Stats anxiety…

Are most of the marks clustered around the mean or are they more
spread out?

Range = Maximum – minimum = 92 – 53 = 39


Variance
Standard deviation
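
The three measures of spread named above can be sketched in a few lines. This is only an illustration under stated assumptions: `marks` again holds just the six marks visible in the excerpt (so the range differs from the 39 quoted for the full class), and the sample versions of the variance and standard deviation (dividing by n − 1) are used, which the slide does not specify.

```python
import statistics

# Assumption: six visible marks only, not the full class list.
marks = [65, 71, 66, 79, 65, 82]

mark_range = max(marks) - min(marks)   # maximum minus minimum
variance = statistics.variance(marks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)      # square root of the sample variance

print(f"range = {mark_range}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```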
Example 1: Stats anxiety…

Are there many marks below 60 or above 80?

What proportion are A, B, C, D and F grades (distribution of grades)?

A graphical technique, the histogram, can provide us with this and
other information.

EXAMPLE 1: STATS ANXIETY

[Figure: Histogram of marks. Horizontal axis: Marks (50 to 100);
vertical axis: Frequency (0 to 30).]

A majority of students received marks between 60 and 90.


No student received marks below 50.
A significant number of students received marks above 80.
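
A histogram is built by counting how many observations fall into each class interval. The sketch below shows one way this counting might be done; the function name `bin_counts`, the 10-mark class width and the six-mark `marks` list are all assumptions for illustration, not the class data behind the figure.

```python
from collections import Counter

def bin_counts(marks, start=50, width=10, stop=100):
    """Count marks in classes [start, start+width), ...; the last class includes `stop`."""
    counts = Counter()
    for m in marks:
        if not start <= m <= stop:
            continue  # ignore marks outside the plotted range
        # A mark equal to `stop` goes into the last class instead of opening a new one.
        lower = min(start + width * ((m - start) // width), stop - width)
        counts[lower] += 1
    return {lo: counts[lo] for lo in range(start, stop, width)}

# Assumption: six visible marks only, not the full class list.
marks = [65, 71, 66, 79, 65, 82]
for lo, n in bin_counts(marks).items():
    print(f"{lo}-{lo + 10}: {'#' * n}")
```

Each row of `#` characters plays the role of one bar of the histogram.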
Two major branches of Statistics

1. Descriptive Statistics

2. Inferential Statistics

Descriptive Statistics
Descriptive statistics deals with methods of organising,
summarising, and presenting data in a convenient and
informative way.

One form of descriptive statistics uses graphical techniques, which
allow statistics practitioners to present data in ways that make it
easy for the reader to extract useful information.

Chapters 3 and 4 introduce several graphical methods.


Descriptive Statistics
Another form of descriptive statistics uses numerical
measures to summarise data.
The mean and median are popular numerical measures
to describe the location of the data.
The range, variance and standard deviation measure
the variability of the data.

Chapter 5 introduces several numerical statistical measures that
describe different features of the data.


Inferential Statistics
Descriptive statistics describes the data set that is
being analysed, but it does not provide any tools for
drawing conclusions or making inferences about the
population from which the data came. Hence we need
another branch of statistics: inferential statistics.
Inferential statistics is also a set of methods, but it is
used to draw conclusions or inferences about
characteristics of populations based on sample
statistics calculated from sample data.
Key Statistical concepts
Population
A population is the group of all items (data) of interest.

A population is frequently very large; sometimes infinite.

E.g. 1. All current members (a million or so) of an automobile club
(Example 1.3).
2. All goats available on Eid at the ‘Bakramandi’ in Islamabad.

Key statistical concepts


Sample
A sample is a set of items (data) drawn from the
population of interest.

A sample can be very large, but it is much smaller than the
population.
E.g. 1. A sample of 500 members of the automobile club selected.
2. A sample of 1000 goats selected from different sections of
the ‘Bakramandi’.
Key statistical concepts

Parameter
A descriptive measure of a population.

Statistic
A descriptive measure of a sample.

Key statistical concepts

[Diagram: the sample is a subset of the population; a parameter
describes the population and a statistic describes the sample.]

A descriptive measure of a population is called a parameter
(e.g. population mean).
A descriptive measure of a sample is called a statistic
(e.g. sample mean).
Statistical Inference
Statistical inference is the process of making an
estimate, prediction, or decision about a population
based on a sample.
[Diagram: a statistic computed from the sample is used to make an
inference about the corresponding population parameter.]

What can we infer about a population’s parameter based on a sample’s
statistic?

Statistical Inference

We use sample statistics to make inferences about population
parameters.
Therefore, we can produce an estimate, prediction,
or decision about a population based on sample data.
Thus, we can apply what we know about a sample to
the larger population from which it was drawn!
Statistical inference
Rationale:
• Large populations make investigating each member
impractical and expensive.
• Easier and cheaper to take a sample and make
estimates about the population from the sample.
However:
• Such conclusions and estimates are not always going
to be correct.
• For this reason, we build into the statistical inference
‘measures of reliability’, namely confidence level
and significance level.

Confidence and Significance Levels


When the purpose of the statistical inference is to
draw a conclusion about a population, the
significance level measures how frequently the
conclusion will be wrong in the long run.

For example, a 5% significance level means that, in the long run,
this type of conclusion will be wrong 5% of the time.
Confidence and Significance Levels

The confidence level is the proportion of times that an estimating
procedure will be correct.

For example, a confidence level of 95% means that estimates based on
this form of statistical inference will be correct 95% of the time.

Confidence and Significance Levels


Consider a statement from polling data you may hear
about in the news:

‘This poll is considered accurate within 3.4 percentage points,
19 times out of 20.’

In this case, our confidence level is 95% (19/20 = 0.95), while our
significance level is 5%.
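
The ‘19 times out of 20’ idea can be checked by simulation. The sketch below is an illustration only: it repeatedly draws samples from a hypothetical normal population, builds an approximate 95% interval around each sample mean (sample mean ± 1.96 standard errors), and reports how often the interval contains the true mean. The population values, the sample size and the function name `ci_coverage` are all assumptions.

```python
import random
import statistics

def ci_coverage(true_mean=70.0, true_sd=10.0, n=50, trials=2000, z=1.96, seed=1):
    """Fraction of simulated samples whose approximate 95% interval covers true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        std_err = statistics.stdev(sample) / n ** 0.5
        if mean - z * std_err <= true_mean <= mean + z * std_err:
            hits += 1
    return hits / trials

print(f"intervals covering the true mean: {ci_coverage():.1%}")
```

The reported share should be close to 95%, and the remaining share (about 5%) corresponds to the significance level.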
Practical applications
Example 2: Pepsi’s Exclusivity Agreement
A large university with a total enrolment of about
50,000 students has offered Pepsi-Cola an exclusivity
agreement that would give Pepsi exclusive rights to
sell its products at all university facilities for the next
year with an option for future years. In return, the
university would receive 35% of the on-campus
revenues and an additional lump sum of $200,000
per year.

Pepsi has been given 2 weeks to respond.

Example 2:
Pepsi’s Exclusivity Agreement…
The market for soft drinks is measured in terms of 375ml
cans.
Pepsi currently sells an average of 10,000 cans per week
(over the 30 weeks of the year during two teaching
semesters that the university operates).
The cans sell for an average of $2.00 each. The costs
include a labour amount of 50 cents per can.
Pepsi is unsure of its market share but suspects it is
considerably less than 50%.
Example 2:
Pepsi’s Exclusivity Agreement
A quick analysis reveals that if its current market
share were 25%, then, with an exclusivity agreement,
Pepsi would sell 40,000 (10,000 is 25% of 40,000)
cans per week or 1,200,000 cans per year.
The profit or loss can be calculated.
The only problem is that we do not know how many
soft drinks (all types including Pepsi) are sold weekly at
the university.
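
The profit calculation mentioned above can be sketched as follows. Several assumptions are made and should be read as such: the 25% market share is the hypothetical value used in the text, labour (50 cents per can) is the only cost counted, and current sales are assumed to involve no payment to the university. Amounts are kept in whole cents to avoid rounding issues; the function name `annual_profit_cents` is illustrative.

```python
def annual_profit_cents(cans_per_week, weeks=30, price_cents=200, labour_cents=50,
                        university_share_pct=0, lump_sum_cents=0):
    """Annual profit in cents: revenue minus university share, lump sum and labour."""
    cans = cans_per_week * weeks
    revenue = cans * price_cents
    university_cut = revenue * university_share_pct // 100
    labour = cans * labour_cents
    return revenue - university_cut - lump_sum_cents - labour

# Current position: 10,000 cans per week, no payments to the university (assumption).
current = annual_profit_cents(10_000)

# With exclusivity, IF the current market share is 25%: 40,000 cans per week,
# 35% of revenue plus a $200,000 lump sum paid to the university.
exclusive = annual_profit_cents(40_000, university_share_pct=35,
                                lump_sum_cents=200_000 * 100)

print(f"current:   ${current / 100:,.0f}")
print(f"exclusive: ${exclusive / 100:,.0f}")
```

The answer depends entirely on the assumed market share, which is why the total weekly soft drink sales at the university need to be estimated.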

Example 2: Solution
Pepsi’s Exclusivity Agreement
The population in Example 2 is the soft drink
consumption of the university’s 50,000 students. The
cost of interviewing each student would be prohibitive
and extremely time consuming. Statistical techniques
make such endeavours unnecessary. Instead, we can
sample a much smaller number of students (the sample
size is 500) and infer from the sample data the number
of soft drinks consumed by all 50,000 students. We can
then estimate annual profits for Pepsi.
Example 2: Solution
Pepsi’s Exclusivity Agreement
Pepsi assigned a recent university graduate to survey
the university’s students to supply the required
information.
Accordingly, she organises a survey that asks 500
students to keep track of the number of soft drinks by
type of drink (Pepsi, Coke, Lemonade etc.) they
purchase during the next 7 days.

Example 2: Solution
Pepsi’s Exclusivity Agreement
The information we would like to acquire in Example 2 is an
estimate of annual profits from the exclusivity agreement.
The sample data to be used for this purpose are the number
of cans of the various types of soft drinks consumed during
the 7-day survey period by the 500 students in the sample.
To summarize the data collected from the 500 sampled
students, we could use the graphical descriptive statistics
methods (to show the distribution of purchase by drink type)
and numerical descriptive measures (to calculate the mean
number of soft drinks purchased per day by the students).
Example 2: Solution
Pepsi’s Exclusivity Agreement
To make an informed decision about signing up to the
exclusivity agreement, we want to estimate the mean
number of the various soft drinks consumed by all
50,000 students on campus.
To accomplish this goal we use another branch of
statistics – inferential statistics, which is a collection
of techniques used to make inferences about the
population using sample data.

Example 3: Exit polls


When an election for political office takes place, the
television networks cancel regular programming and
instead provide election coverage.
The television networks often compete on the evening
of an election day to be the first to correctly identify
the winner of the election.
One commonly used technique is through exit polls,
wherein a random sample of voters who exit the
polling booths is asked for whom they voted.
Example 3: Exit polls
Suppose that in the Brisbane electorate, 500 voters
from various booths were asked for whom they voted.
From the data, the sample proportion of voters
supporting the candidates is computed.
A statistical technique is applied to determine whether
there is enough evidence to infer that the incumbent
Labor party candidate will garner enough votes to win.

Example 3: Exit polls

Suppose that the results were coded on a two-party preferred basis as
1 = Liberal/National candidate and 2 = Labor candidate.
The network analysts would like to know whether they can conclude
that the incumbent Labor party candidate will win.

Voter   Response
1       1
2       2
3       2
4       1
5       2
.       .
.       .
495     2
496     1
497     1
498     1
499     2
500     1
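
One possible way to analyse the coded responses is sketched below. The slides do not name the statistical technique, so the one-sample z statistic for a proportion used here is an assumption, as are the function name `labor_share_z` and the data (270 responses coded 2 and 230 coded 1), which are hypothetical and not the actual poll results.

```python
import math

def labor_share_z(responses, p0=0.5):
    """Return (sample proportion coded 2, z statistic for testing proportion > p0)."""
    n = len(responses)
    p_hat = sum(1 for r in responses if r == 2) / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return p_hat, z

# Hypothetical data: 270 Labor (2) and 230 Liberal/National (1) responses.
responses = [2] * 270 + [1] * 230
p_hat, z = labor_share_z(responses)
print(f"sample proportion for Labor = {p_hat:.2f}, z = {z:.2f}")
```

A larger z value indicates stronger evidence that more than 50% of the electorate voted Labor; how large it must be depends on the significance level chosen.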
Example 3: Exit polls
This example describes a very common application of
statistical inference.
The population the television networks wanted to
make inferences about is the approximately 87,000
people who voted in the electorate of Brisbane.
The sample consisted of the 500 people randomly
selected by the polling company who voted for either
of the two main candidates.

Example 3: Exit polls


The characteristic of the population that we would like
to know is the proportion of the total electorate that
voted for Labor after preferences (on a two-party
preferred basis).
Specifically, we would like to know whether more than
50% of the electorate voted for Labor (after
preferences) in the electorate of Brisbane.
Example 3: Exit polls
Because we will not ask every one of the 87,000
actual voters for whom they voted, we cannot predict
the outcome with 100% certainty.
A sample that is only a small fraction of the size of the
population can lead to correct inferences only a
certain percentage of the time.
You will find that statistics practitioners can control
that percentage and usually set it between 90% and
99%.

Practice Questions
• 1.3
• 1.4
• 1.6
• 1.7
1.2 Types of data
Definitions
A variable is some characteristic of a population or sample.
E.g. student marks.

A variable is typically denoted with a capital letter: X, Y, Z…

The values of the variable are the range of possible values
for a variable.
E.g. student marks (0,…,100)

Data are the observed values of a variable.
E.g. student marks: {67, 74, 71, 83, 93, 55, 48}

Types of data…
Data (at least for purposes of Statistics) fall into three
main groups:

Numerical data
Nominal Data
Ordinal Data
Numerical data
Numerical data
The values of numerical data are real numbers.
E.g. heights, weights, prices, waiting time at a medical
practice, etc.

Arithmetic operations can be performed on numerical data, thus it’s
meaningful to talk about 2*Height, or Price + $1, and so on.

Numerical data are also called quantitative or interval.

Nominal data
Nominal Data
The values of nominal data are categories.
E.g. Responses to questions about marital status are categories,
coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4

These data are categorical in nature; arithmetic operations don’t
make any sense (e.g. does Married ÷ 2 = Divorced?!)

Nominal data are also called qualitative or categorical.


Ordinal data
Ordinal Data
Ordinal data appear to be categorical in nature, but their
values have an order; a ranking to them:
E.g. University course evaluation system:
Poor = 1, fair = 2, good = 3, very good = 4, excellent = 5

While it’s still not meaningful to do arithmetic on this data
(e.g. does 2*fair = very good?!), we can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what numeric values
are assigned to each category.
Ordinal data are also called ranked.

Types of data – Examples


Numerical data
  age   income
  55    75 000
  42    68 000
  .     .

  weight gain
  +10
  +5
  .

Nominal data
  person   married
  1        yes
  2        no
  3        no
  .        .

  computer   brand
  1          IBM
  2          Dell
  3          Compaq
  4          IBM
  .          .

With nominal data, all we can calculate is the proportion of data
that falls into each category.

  IBM   Dell   Compaq   other   total
  25    11     8        6       50
  50%   22%    16%      12%

Ordinal data
  exam grade: A, B, C, D, F
  Food quality: Excellent, Good, Satisfactory, Poor

With ordinal data, all we can use is computations involving the
ordering process.
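
The proportion calculation for nominal data can be sketched as below, using the computer-brand counts from the table (25 IBM, 11 Dell, 8 Compaq, 6 other). The list `brands` simply expands those counts into individual observations for illustration.

```python
from collections import Counter

# One entry per observed computer, expanded from the counts in the table.
brands = ["IBM"] * 25 + ["Dell"] * 11 + ["Compaq"] * 8 + ["other"] * 6

counts = Counter(brands)      # frequency of each category
total = sum(counts.values())  # number of observations
proportions = {brand: count / total for brand, count in counts.items()}

for brand, share in proportions.items():
    print(f"{brand}: {counts[brand]} ({share:.0%})")
```

Counting and dividing by the total are the only operations used; the brand names themselves are never added or averaged.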
Calculations for types of data
As mentioned above,
• All calculations are permitted on numerical data.
• No calculations are allowed for nominal data, except
counting the number of observations in each category
and calculating their proportions.
• Only calculations involving a ranking process are
allowed for ordinal data.

This lends itself to the following ‘hierarchy of data’…

Hierarchy of data
Numerical
• Values are real numbers.
• All calculations are valid.
• Data may be treated as ordinal or nominal.

Nominal
• Values are the arbitrary numbers that represent categories.
• Only calculations based on the frequencies of occurrence are valid.
• Data may not be treated as ordinal or numerical.

Ordinal
• Values must represent the ranked order of the data.
• Calculations based on an ordering process are valid.
• Data may be treated as nominal but not as numerical.
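
As an illustration of the hierarchy, the sketch below applies an order-based calculation (the median) and a frequency-based calculation (the mode) to ordinal data, using the course evaluation scale from earlier in this topic. The `responses` list is hypothetical, and `median_low` is chosen so that the result is always one of the actual categories.

```python
import statistics

# Rating scale in rank order, from the course evaluation example.
RATING_ORDER = ["poor", "fair", "good", "very good", "excellent"]

# Hypothetical survey responses.
responses = ["good", "excellent", "fair", "good", "very good", "poor", "good"]

# Order-based calculation: convert to ranks, take the median rank, map back.
ranks = sorted(RATING_ORDER.index(r) for r in responses)
median_rating = RATING_ORDER[statistics.median_low(ranks)]

# Frequency-based calculation: ordinal data may also be treated as nominal.
mode_rating = statistics.mode(responses)

print(f"median rating = {median_rating}, most frequent rating = {mode_rating}")
```

A mean of the numeric codes is deliberately not computed, since the codes only carry order.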
Other forms of data
Cross-sectional data is collected at a certain point in time
across a number of units of interest
• marketing survey (observe preferences by gender, age)
• students’ marks in a statistics course exam
• starting salaries of graduates of an MBA program in a particular
year.

Time-series data is collected over successive points in time
• weekly closing price of gold
• monthly tourist arrivals in Australia.

Methods of collecting data


Recall,
Statistics is a tool for converting data into useful information:

Statistics

Data Information

But where then does data come from? How is it gathered? How
do we ensure its accuracy? Is the data reliable? Is it
representative of the population from which it was drawn?
Now we explore some of these issues.
Data quality
The reliability and accuracy of the data affect the
validity of the results of a statistical analysis.

The reliability and accuracy of the data depend on the method of data
collection.

There are many methods used to collect or obtain data for statistical
analysis.

Sources of data
Four of the most popular sources of statistical data are:
• Published data
• Data collected from observational studies (Observational data)
• Data collected from experimental studies (Experimental data)
• Data collected from surveys
Published data
This is often a preferred source of data due to low cost and
convenience.
Published data is found as printed material, tapes, disks, and
on the Internet.
Types of published data
• Primary data
• Secondary data.

Published data…
Primary data
Data published by the organisation that has collected it is
called primary data.
E.g. Data published by the Australian Bureau of Statistics (ABS).

Secondary data
Data published by an organisation different from the one that
originally collected and published it is called secondary data.
E.g. 1. The Yearbook of National Accounts Statistics (United Nations,
New York), compiles data from primary sources of various
country departments of statistics, like ABS in Australia;
2. Compustat sells a variety of financial data tapes compiled
from several primary sources.
Observational and experimental data
When published data is unavailable, one needs to conduct a
study to generate the data.
• An observational study is one in which measurements representing a
variable of interest are observed and recorded, without
controlling any factor that might influence their values
– e.g. measuring the height of a tree in the rainforest over time.
• An experimental study is one in which measurements representing a
variable of interest are observed and recorded, while controlling
factors that might influence their values
– e.g. measuring the yield of different types of rice using a
certain amount of fertilizer (control factor).

Surveys
A survey solicits information from survey participants;
e.g. Gallup polls; pre-election polls; marketing surveys.

The response rate (i.e. the proportion of selected participants who
completed the survey) is a key survey parameter.
Surveys may be administered in a variety of ways,
e.g.
• Personal interview
• Telephone interview
• Self-administered questionnaire.



Practice Questions
• 2.2
• 2.3
• 2.4
• 2.10
• 2.15

1.3 Sampling
If the data are collected from the whole population, it is called a
census.
For example, ABS conducts a census every 5 years in Australia.

Recall that statistical inference permits us to draw conclusions about
a population based on a sample.

Sampling (i.e. selecting a sub-set of a whole population) is often
done instead of a census for a number of reasons including
• cost
For example, it’s less expensive to sample 1,000 television
viewers than 20 million TV viewers
• practicality
For example, performing a crash test on every automobile
produced is impractical.
Sampling…
Target population
The population about which we want to draw inferences.

Sampled population
The actual population from which the sample has been
drawn.

In any case, the sampled population and the target population should
be similar to one another. Otherwise, the sample may be self-selected
and may not represent the target population.

Sampling…
Example:
A survey of opinion on a radio talk-back show topic
Target population: All radio listeners who listen to the
talk-back show.
Sample selected: Those listeners who are interested in
the topic and managed to contact the radio station.
Sampled population: Those listeners who are interested
in the topic.
Sampling plans
A sampling plan is just a method or procedure for
specifying how a sample will be taken from a population.
The most commonly used sampling plans are:
• Simple random sampling
• Stratified random sampling
• Cluster sampling.

Simple random sampling


A simple random sample is a sample selected in such a
way that every possible sample of the same size is
equally likely to be chosen.

For example, drawing three names from a hat containing all the names
of the students in a class of 200 is an example of a simple random
sample: any group of three names is as likely to be picked as any
other group of three names.
Simple random sampling…
To conduct simple random sampling…
• assign a number to each element of the chosen
population (or use already given numbers),
e.g. Medicare card number of each Australian resident
• randomly select the sample numbers (members)
using a random numbers table, or a software
package.

Example 1

A government income-tax auditor is responsible for 1,000 tax returns.
The auditor will randomly select 30 returns to audit. Use Excel’s
random number generator to select the returns.

Solution:
We generate 50 numbers between 1 and 1,000 (we need
only 30 numbers, but the extra numbers might be used if
duplicate numbers are generated.)
Example 1 - Solution
Use Excel to generate 50 random whole numbers, uniformly distributed
between 1 and 1,000.

Extra numbers may be used if duplicate random numbers are generated.
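
The slides use Excel for this step. As an alternative sketch, not the method shown in the slides, Python’s `random.sample` draws without replacement, so exactly 30 distinct return numbers are obtained and no extra numbers are needed to cover duplicates. The function name `select_returns` and the seed are assumptions for illustration.

```python
import random

def select_returns(population_size=1000, sample_size=30, seed=None):
    """Simple random sample of return numbers 1..population_size, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# A seed is passed only to make this illustration repeatable.
selected = select_returns(seed=42)
print(selected)
```

Because every subset of 30 returns is equally likely to be drawn, this matches the definition of a simple random sample given earlier.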

Stratified random sampling


A stratified random sample is obtained by separating the
population into mutually exclusive sets (or strata), and
then drawing simple random samples from each
stratum.
Population 1: Occupation
• Professional
• Clerical
• Blue-collar
• Other

Population 2: Age
• Under 20
• 20–30
• 31–40
• 41–50
• 51–60
• > 60

Population 3: Gender
• Male
• Female
Stratified random sampling…
With this procedure we can acquire information or make
inferences about
• the whole population
• each stratum
• the relationships among strata.

Stratified random sampling…


There are several ways to build a stratified random
sample. For example, one could keep the proportion of
each stratum in the population to decide how many units
from each stratum are to be taken.

A sample of size 1 000 is to be drawn

Stratum  Income            Population proportion    Stratum sample size
                           (from previous census)
1        under $25,000     25%                      0.25 x 1000 = 250
2        $25,000-39,999    40%                      0.40 x 1000 = 400
3        $40,000-60,000    30%                      0.30 x 1000 = 300
4        over $60,000       5%                      0.05 x 1000 =  50
Total                                                           1 000
Stratified random sampling …
After the population has been stratified, we can use
simple random sampling to generate the complete
sample:

If we only have sufficient resources to sample 400 people in total,
we would draw 20 of them from the high income group…

…if we are sampling 1 000 people, we’d draw 50 of them from the high
income group.
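
The proportional allocation described above amounts to multiplying the total sample size by each stratum’s population proportion. A minimal sketch follows, using the income proportions from the table; the function name `proportional_allocation` and the short stratum labels are illustrative.

```python
def proportional_allocation(shares_pct, sample_size):
    """Number of units to draw from each stratum, proportional to its population share.

    Integer division is used; if a share does not divide evenly, the leftover
    units would need to be assigned by some additional rule.
    """
    return {stratum: sample_size * pct // 100 for stratum, pct in shares_pct.items()}

# Population proportions (in percent) from the previous census.
income_shares = {"under 25k": 25, "25k-39,999": 40, "40k-60k": 30, "over 60k": 5}

print(proportional_allocation(income_shares, 1000))
print(proportional_allocation(income_shares, 400))
```

Within each stratum, the allocated number of units would then be drawn by simple random sampling.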

Cluster sampling
A cluster sample is a simple random sample of groups or
clusters of elements (whereas a simple random sample
consists of individual objects).

This procedure is useful when
• it is difficult and costly to develop a complete list of
the population members (making it difficult to develop
a simple random sampling procedure).
• the population members are widely dispersed
geographically.
Cluster sampling…
Cluster sampling may increase sampling error, because
of probable similarities among cluster members.

For example, to draw a cluster sample of residents in the Brisbane
city area, first select a number of streets in the Brisbane city area
using a simple random sampling method and then include all residents
in those selected streets to form the cluster sample.
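
The two steps of the street example can be sketched as below: randomly select whole clusters, then include every member of each selected cluster. The street names, resident labels and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters at random, then include all members of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # simple random sample of clusters
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical streets (clusters) and their residents.
streets = {
    "Street A": ["a1", "a2", "a3"],
    "Street B": ["b1", "b2"],
    "Street C": ["c1", "c2", "c3", "c4"],
    "Street D": ["d1"],
}

chosen, members = cluster_sample(streets, 2, seed=0)
print(chosen, members)
```

Only a list of streets is needed to draw the sample, not a complete list of residents, which is what makes the procedure cheaper.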

Sample size
Numerical techniques for determining sample sizes will
be described later, but it is sufficient to say that the
larger the sample size, the more accurate we can
expect the sample estimates to be.
Sampling and non-sampling errors
Two major types of errors can arise when a sampling
procedure is performed.
• Sampling error
• Non-sampling error

Sampling errors
Sampling error refers to differences between the
sample and the population, because of the specific
observations that happen to be selected.
Sampling error is expected to occur when making a
statement about the population based on the sample
taken.
• For example, when estimating a population mean ($\mu$) using a
sample mean ($\bar{X}$),
sampling error = $\bar{X} - \mu$

Increasing the sample size will reduce the sampling


error.
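
The effect of sample size on sampling error can be illustrated by simulation. In the sketch below, the population of incomes is hypothetical (generated from a normal distribution), and the function name `mean_abs_sampling_error` is an assumption; it measures the average absolute gap between the sample mean and the population mean over many repeated samples.

```python
import random
import statistics

def mean_abs_sampling_error(n, population, trials=500, seed=0):
    """Average |sample mean - population mean| over repeated simple random samples."""
    rng = random.Random(seed)
    mu = statistics.mean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, n)  # drawn without replacement
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)

# Hypothetical population of 10,000 incomes.
pop_rng = random.Random(1)
population = [pop_rng.gauss(50_000, 15_000) for _ in range(10_000)]

small = mean_abs_sampling_error(25, population)
large = mean_abs_sampling_error(400, population)
print(f"n = 25:  typical sampling error about {small:,.0f}")
print(f"n = 400: typical sampling error about {large:,.0f}")
```

Any single larger sample can still be further from the population mean than a smaller one; the reduction holds on average.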
Sampling errors…

[Figure: Population income distribution, marking the population mean
$\mu$ and the sample mean $\bar{X}$; the gap between them is the
sampling error. The sample mean falls where it does only because
certain randomly selected observations were included in the sample.]

Non-sampling errors
Non-sampling errors occur due to:

• Mistakes made along the process of data acquisition


• Sample observations being selected improperly.
Three types of non-sampling errors
There are three types of non-sampling errors:
• Errors in data acquisition,
• Non-response errors,
• Selection bias.

Increasing the sample size will not reduce this type of error.

Errors in data acquisition…


Errors in data acquisition arise from the recording of
incorrect responses, due to:
• incorrect measurements being taken because of faulty
equipment,
• mistakes made during transcription from primary sources,
• inaccurate recording of data due to misinterpretation of
terms,
• inaccurate responses to questions concerning sensitive issues,
or
• clerical mistakes when transferring/recording data.
Non-response error
• Non-response error refers to error (or bias)
introduced when responses are not obtained from
some members of the sample because they refuse
to respond; as a result, the sample observations
that are collected may not be representative of
the target population.
• As mentioned earlier, the response rate (i.e. the
proportion of all people selected who complete the
survey) is a key survey parameter; it helps in
understanding the validity of the survey and the
sources of non-response error.

Selection bias
Selection bias occurs when the sampling plan is such
that some members of the target population cannot
possibly be selected for inclusion in the sample.
For example, selecting a sample of households in New
South Wales (NSW) using telephone numbers listed in the
NSW White Pages introduces selection bias, as not every
NSW household telephone number is listed in the White Pages.
Non-sampling errors…
Data acquisition error
[Figure: Population and sample. If an observation drawn from the
population is wrongly recorded in the sample, then the sample mean is
affected (sampling error + data acquisition error).]

Non-sampling errors…
Non-response error
[Figure: Population and sample. No response from some sampled members
may lead to biased results in the sample.]

Non-sampling errors…
Selection bias
[Figure: Population and sample. When parts of the population cannot
be selected, the sample cannot represent the whole population.]

Practice Questions
• 2.18
• 2.19
• 2.22
• 2.26
