INTRODUCTION
Topic outline
1.1 What is Statistics
• Key statistical concepts
• Practical applications
1.2 Types of data
• Methods of collecting data
1.3 Sampling
• Sampling plans
• Sampling and non-sampling errors
What is Statistics?
[Diagram: Data → Statistics → Information]
Data
List of last year’s statistics marks:
65, 71, 66, 79, 65, 82, …
Information
Summary information derived about the statistics class.
E.g. Class average, proportion of class receiving F’s, most frequent mark,
highest and lowest marks, spread of the marks, grade (A, B, C, D, F)
distribution, etc.
Example 1: Stats anxiety…
‘Typical mark’
Mean (average mark)
Median (mark such that 50% above & 50% below)
Mean = 72.67
Median = 72
[Figure: histogram of the marks; horizontal axis: Marks, from 50 to 100]
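As a minimal sketch of these "typical mark" measures, the snippet below uses Python's standard `statistics` module on the six marks that are legible in the list of last year's marks. The full class list is not shown, so the results differ from the mean of 72.67 and median of 72 quoted for the whole class.

```python
import statistics

# Only the six marks legible in the source; the full class list is not shown.
marks = [65, 71, 66, 79, 65, 82]

mean_mark = statistics.mean(marks)      # arithmetic average
median_mark = statistics.median(marks)  # middle value of the sorted marks
mode_mark = statistics.mode(marks)      # most frequent mark
lowest, highest = min(marks), max(marks)

print(f"mean={mean_mark:.2f}, median={median_mark}, mode={mode_mark}")
print(f"lowest={lowest}, highest={highest}, range={highest - lowest}")
```

With an even number of marks, the median is the average of the two middle values.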
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
Descriptive statistics deals with methods of organising,
summarising, and presenting data in a convenient and
informative way.
Inferential Statistics
Descriptive statistics describes the data set that is
being analysed, but it does not provide any tools for
drawing conclusions or making inferences about the
data. Hence we need another branch of statistics:
inferential statistics.
Inferential statistics is also a set of methods, but it is
used to draw conclusions or inferences about
characteristics of populations based on sample
statistics calculated from sample data.
Key Statistical concepts
Population
A population is the group of all items (data) of interest.
Parameter
A descriptive measure of a population is called a parameter
(e.g. population mean).
Statistic
A descriptive measure of a sample is called a statistic
(e.g. sample mean).
[Diagram: the sample is a subset of the population; a statistic describes
the sample and a parameter describes the population]
Statistical Inference
Statistical inference is the process of making an
estimate, prediction, or decision about a population
based on a sample.
[Diagram: a sample is drawn from the population; inference runs from the
sample statistic back to the population parameter]
Example 2:
Pepsi’s Exclusivity Agreement…
The market for soft drinks is measured in terms of 375ml
cans.
Pepsi currently sells an average of 10,000 cans per week
(over the 30 weeks of the year during two teaching
semesters that the university operates).
The cans sell for an average of $2.00 each. The costs
include a labour amount of 50 cents per can.
Pepsi is unsure of its market share but suspects it is
considerably less than 50%.
A quick analysis reveals that if its current market
share were 25%, then, with an exclusivity agreement,
Pepsi would sell 40,000 (10,000 is 25% of 40,000)
cans per week or 1,200,000 cans per year.
The profit or loss can be calculated.
The only problem is that we do not know how many
soft drinks (all types including Pepsi) are sold weekly at
the university.
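The arithmetic on this slide can be sketched as follows. The 25% market share is the slide's own "what if" figure, not a known value, and the final amount is only revenue less labour cost: any other costs, such as a fee for the agreement, are not given here and are left out.

```python
# Figures quoted in the example.
pepsi_cans_per_week = 10_000
weeks_per_year = 30
price_per_can = 2.00
labour_cost_per_can = 0.50

# Assumption for illustration only: the true market share is unknown.
assumed_market_share = 0.25

# If Pepsi's 10,000 cans are 25% of the market, the whole market is 40,000.
total_market_per_week = pepsi_cans_per_week / assumed_market_share
cans_per_year = total_market_per_week * weeks_per_year

# Revenue less labour cost only; other costs are not given in the example.
margin_per_year = cans_per_year * (price_per_can - labour_cost_per_can)

print(total_market_per_week, cans_per_year, margin_per_year)
```

Changing `assumed_market_share` shows how sensitive the result is to the one quantity Pepsi does not know, which is why the survey is needed.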
Example 2: Solution
Pepsi’s Exclusivity Agreement
The population in Example 2 is the soft drink
consumption of the university’s 50,000 students. The
cost of interviewing each student would be prohibitive
and extremely time consuming. Statistical techniques
make such endeavours unnecessary. Instead, we can
sample a much smaller number of students (the sample
size is 500) and infer from the sample data the number
of soft drinks consumed by all 50,000 students. We can
then estimate annual profits for Pepsi.
Pepsi assigned a recent university graduate to survey
the university’s students to supply the required
information.
Accordingly, she organises a survey that asks 500
students to keep track of the number of soft drinks by
type of drink (Pepsi, Coke, Lemonade etc.) they
purchase during the next 7 days.
The information we would like to acquire in Example 2 is an
estimate of annual profits from the exclusivity agreement.
The sample data to be used for this purpose are the number
of cans of the various types of soft drinks consumed during
the 7-day survey period by the 500 students in the sample.
To summarise the data collected from the 500 sampled
students, we could use graphical descriptive methods
(to show the distribution of purchases by drink type)
and numerical descriptive measures (to calculate the mean
number of soft drinks purchased per day by the students).
To make an informed decision about signing up for the
exclusivity agreement, we want to estimate the mean
number of the various soft drinks consumed by all
50,000 students on campus.
To accomplish this goal we use another branch of
statistics – inferential statistics, which is a collection
of techniques used to make inferences about the
population using sample data.
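One simple way to make such an inference is to scale the sample mean up to the population. The sketch below assumes a handful of made-up weekly can counts (the survey results are not given) and a hypothetical helper name, `estimate_weekly_total`; the population size of 50,000 is the figure from the example.

```python
import statistics

POPULATION_SIZE = 50_000  # students on campus, from the example


def estimate_weekly_total(sample_counts, population_size=POPULATION_SIZE):
    """Point estimate: sample mean per student, scaled up to the population."""
    return statistics.mean(sample_counts) * population_size


# Hypothetical weekly can counts for a few surveyed students (not real data).
hypothetical_sample = [0, 3, 1, 5, 2, 0, 4, 1]
estimate = estimate_weekly_total(hypothetical_sample)

print(f"Estimated cans consumed per week by all students: {estimate}")
```

This gives a single point estimate only; it says nothing yet about how far the estimate might be from the true population value.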
Example 3: Exit polls
Suppose that the results were coded on a two-party preferred basis as
1 = Liberal/National candidate and 2 = Labor candidate.
The network analysts would like to know whether they can conclude that the
incumbent Labor party candidate will win.
Voter   Response
1       1
2       2
3       2
4       1
5       2
.       .
.       .
495     2
496     1
497     1
498     1
499     2
500     1
This example describes a very common application of
statistical inference.
The population the television networks wanted to
make inferences about is the approximately 87,000
people who voted in the electorate of Brisbane.
The sample consisted of the 500 people randomly
selected by the polling company who voted for either
of the two main candidates.
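The sample statistic of interest here is the proportion of sampled voters coded 2 (Labor). The sketch below computes it from only the eleven responses printed in the example (voters 1 to 5 and 495 to 500); the remaining 489 responses are not shown, so the value is an illustration of the calculation, not the poll result. The function name `labor_share` is hypothetical.

```python
LABOR = 2  # coding from the example: 1 = Liberal/National, 2 = Labor


def labor_share(responses):
    """Proportion of sampled voters coded as Labor."""
    return sum(1 for r in responses if r == LABOR) / len(responses)


# Only the eleven responses printed in the example; the rest are elided
# in the source, so this is not the actual poll result.
visible = [1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1]
share = labor_share(visible)

print(f"Labor share among the visible responses: {share:.3f}")
```

Whether a sample proportion is far enough above 0.5 to conclude that a candidate will win is a question for inferential statistics, not for this calculation alone.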
Practice Questions
• 1.3
• 1.4
• 1.6
• 1.7
1.2 Types of data
Definitions
A variable is some characteristic of a population or sample.
E.g. student marks.
Types of data…
Data (at least for purposes of Statistics) fall into three
main groups:
• Numerical data
• Nominal data
• Ordinal data
Numerical data
The values of numerical data are real numbers.
E.g. heights, weights, prices, waiting time at a medical
practice, etc.
Nominal data
The values of nominal data are categories.
E.g. Responses to questions about marital status are categories,
coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4
Hierarchy of data
Numerical
• Values are real numbers.
• All calculations are valid.
• Data may be treated as ordinal or nominal.
Ordinal
• Values must represent the ranked order of the data.
• Calculations based on an ordering process are valid.
• Data may be treated as nominal but not as numerical.
Nominal
• Values are the arbitrary numbers that represent categories.
• Only calculations based on the frequencies of occurrence are valid.
• Data may not be treated as ordinal or numerical.
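The hierarchy can be illustrated with one valid calculation per data type. In this sketch the waiting times and the ratings are made-up values; the marital status codes are the ones used in the nominal data example.

```python
import statistics
from collections import Counter

# Numerical: every calculation is valid, including the mean.
waiting_minutes = [12.5, 30.0, 7.5, 18.0]  # hypothetical values
mean_wait = statistics.mean(waiting_minutes)

# Ordinal (hypothetical ratings, 1 = poor ... 4 = excellent): measures based
# on order, such as the median, are valid; the mean is not.
ratings = [1, 3, 4, 2, 3]
median_rating = statistics.median(ratings)

# Nominal: the codes are arbitrary, so only frequencies are valid.
labels = {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"}
marital_status = [1, 2, 2, 4, 3, 2, 1]  # hypothetical responses
frequencies = Counter(labels[code] for code in marital_status)

print(mean_wait, median_rating, frequencies.most_common(1))
```

Averaging the marital status codes would run without error, but the result would be meaningless, because a different arbitrary coding would give a different answer.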
Other forms of data
Cross-sectional data is collected at a certain point in time
across a number of units of interest
• marketing survey (observe preferences by gender, age)
• students’ marks in a statistics course exam
• starting salaries of graduates of an MBA program in a particular
year.
But where then does data come from? How is it gathered? How
do we ensure its accuracy? Is the data reliable? Is it
representative of the population from which it was drawn?
Now we explore some of these issues.
Data quality
The reliability and accuracy of the data affect the
validity of the results of a statistical analysis.
Sources of data
Four of the most popular sources of statistical data are:
• Published data
• Data collected from observational studies (Observational data)
• Data collected from experimental studies (Experimental data)
• Data collected from surveys
Published data
This is often a preferred source of data due to low cost and
convenience.
Published data is found as printed material, tapes, disks, and
on the Internet.
Types of published data
Primary data
Secondary data.
Published data…
Primary data
Data published by the organisation that has collected it is
called primary data.
E.g. Data published by the Australian Bureau of Statistics (ABS).
Secondary data
Data published by an organisation different from the one that
originally collected and published it is called secondary data.
E.g. 1. The Yearbook of National Accounts Statistics (United Nations,
New York), compiles data from primary sources of various
country departments of statistics, like ABS in Australia;
2. Compustat sells a variety of financial data tapes compiled
from several primary sources.
Observational and experimental data
When published data is unavailable, one needs to conduct a
study to generate the data.
• An observational study is one in which measurements representing a
variable of interest are observed and recorded, without
controlling any factor that might influence their values
– e.g. measuring the height of a tree in the rainforest over time.
• An experimental study is one in which measurements representing a
variable of interest are observed and recorded, while controlling
factors that might influence their values
– e.g. measuring the yield of different types of rice using a
certain amount of fertilizer (control factor).
Surveys
A survey solicits information from survey participants;
e.g. Gallup polls; pre-election polls; marketing surveys.
1.3 Sampling
If data are collected from the whole population, the collection is
called a census.
For example, ABS conducts a census every 5 years in Australia.
Sampled population
The actual population from which the sample has been
drawn.
Sampling…
Example:
A survey of opinion on a radio talk-back show topic
Target population: All radio listeners who listen to the
talk-back show.
Sample selected: Those listeners who are interested in
the topic and managed to contact the radio station.
Sampled population: Those listeners who are interested
in the topic.
Sampling plans
A sampling plan is just a method or procedure for
specifying how a sample will be taken from a population.
The most commonly used sampling plans are:
• Simple random sampling
• Stratified random sampling
• Cluster sampling.
Example 1
Solution:
We generate 50 numbers between 1 and 1,000 (we need
only 30 numbers, but the extra numbers might be used if
duplicate numbers are generated.)
Example 1 - Solution
Use Excel to generate 50 random numbers between 1
and 1000.
[Screenshot: Excel output of 50 uniformly distributed random whole numbers
between 1 and 1000]
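The same procedure can be sketched in Python instead of Excel: draw 50 whole numbers between 1 and 1000, drop duplicates, and keep the first 30. The function name `draw_sample_ids` and its parameters are hypothetical, chosen only for this illustration.

```python
import random


def draw_sample_ids(needed=30, draws=50, population_size=1000, seed=None):
    """Draw `draws` random whole numbers and keep the first `needed` unique ones."""
    rng = random.Random(seed)
    candidates = [rng.randint(1, population_size) for _ in range(draws)]
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep draw order
    if len(unique) < needed:
        raise ValueError("too many duplicates; draw more numbers")
    return unique[:needed]


sample_ids = draw_sample_ids(seed=1)
print(sorted(sample_ids))
```

In Python, `random.sample(range(1, 1001), 30)` would return 30 distinct numbers directly; the version above is written to mirror the draw-extra-and-discard-duplicates approach described in the solution.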
Cluster sampling
A cluster sample is a simple random sample of groups or
clusters of elements (whereas a simple random sample
consists of individual objects).
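A minimal sketch of the idea, assuming made-up clusters (city blocks and the households in each) and a hypothetical function name: whole clusters are selected at random, and every element of a selected cluster enters the sample.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters; every element of a chosen cluster is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical clusters: city blocks and the households in each block.
blocks = {
    "block A": ["A1", "A2", "A3"],
    "block B": ["B1", "B2"],
    "block C": ["C1", "C2", "C3", "C4"],
    "block D": ["D1"],
}
sampled = cluster_sample(blocks, n_clusters=2, seed=0)
print(sampled)
```

The random draw is made over the cluster names, not over individual households, which is the difference from simple random sampling.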
Sample size
Numerical techniques for determining sample sizes will
be described later, but it is sufficient to say that the
larger the sample size, the more accurate we can
expect the sample estimates to be.
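This claim can be checked informally by simulation. The sketch below assumes a made-up population (the whole numbers 1 to 1000) and repeatedly draws simple random samples of two sizes, then compares how much the sample means vary from sample to sample. The function name and the number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics


def spread_of_sample_means(population, n, repetitions=2000, seed=0):
    """Standard deviation of the sample mean across repeated random samples."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, n)) for _ in range(repetitions)
    ]
    return statistics.stdev(means)


population = list(range(1, 1001))  # hypothetical population
spread_small = spread_of_sample_means(population, n=10)
spread_large = spread_of_sample_means(population, n=100)

print(f"n=10: {spread_small:.1f}   n=100: {spread_large:.1f}")
```

A smaller spread for the larger sample size means that individual sample means tend to fall closer to the population mean.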
Sampling and non-sampling errors
Two major types of errors can arise when a sampling
procedure is performed.
• Sampling error
• Non-sampling error
Sampling errors
Sampling error refers to differences between the
sample and the population, because of the specific
observations that happen to be selected.
Sampling error is expected to occur when making a
statement about the population based on the sample
taken.
• For example, when estimating a population mean ($\mu$) using a
sample mean ($\bar{X}$),
sampling error = $\bar{X} - \mu$
[Figure: the population mean ($\mu$) and the sample mean ($\bar{x}$) marked
on an axis; the gap between them is the sampling error. The sample mean falls
where it does only because certain randomly selected observations were
included in the sample.]
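As a small worked instance of the sampling error, the sketch below treats six marks as if they were an entire population and picks three of them as a sample. Both choices are assumptions made purely for illustration.

```python
import statistics

# Assumption for illustration: these six marks are the whole population.
population = [65, 71, 66, 79, 65, 82]
mu = statistics.mean(population)  # population mean (parameter)

# A hypothetical sample of three observations from that population.
sample = [71, 79, 82]
x_bar = statistics.mean(sample)   # sample mean (statistic)

sampling_error = x_bar - mu
print(f"x_bar={x_bar:.2f}, mu={mu:.2f}, sampling error={sampling_error:.2f}")
```

A different sample of three marks would give a different sample mean and hence a different sampling error, even though nothing was measured incorrectly.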
Non-sampling errors
Non-sampling errors occur due to:
Selection bias
Selection bias occurs when the sampling plan is such
that some members of the target population cannot
possibly be selected for inclusion in the sample.
For example, selecting a sample of households in New
South Wales (NSW) from the telephone numbers listed in
the NSW White Pages introduces selection bias, because
not every NSW household telephone number is listed there.
Non-sampling errors…
[Figures: population-and-sample diagrams illustrating three types of
non-sampling error: data acquisition error (which adds to the sampling
error), non-response error, and selection bias]
Practice Questions
• 2.18
• 2.19
• 2.22
• 2.26