Lecture 3
2021
Statistical surveys
1. Expense
2. Speed of Response
3. The Large Population
4. Destructive Sampling
5. Accuracy
1. The most important reason: it is usually less expensive to obtain a sample than to survey the
whole population.
2. Nowadays information is often needed very quickly; it must be up to date to be useful.
3. Information concerning the characteristics of an infinite population can only be obtained from a
sample.
Populations are frequently described as infinite when the observations result from a continuing or recurring process, e.g. selling or producing
something.
Observing every single element is impossible because new observations are continually being created.
4. Very often testing a product (tasting chocolate or other sweets) requires destroying the
product.
5. Sometimes a sample survey can be more accurate than a census.
Stages of a statistical survey
Next step: analysis
Presentation of Data: Frequency Distributions and Graphs
QUALITATIVE DATA - group the data into classes according to the categories
QUANTITATIVE DATA - useful way to simplify a set of data -> frequency distribution
Division of all units into nonoverlapping subsets called class intervals (classes)
Frequency Distribution - table that shows how many observations fall in each class.
class frequency (fi or ni) - the number of observations falling in a particular class (class i)
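• the appropriate number of classes (depends on the aim of the analysis);
the number of classes should be relatively small:
$k \le 5 \log N$ (1)
$k = 1 + 3.322 \log N$ (2)
where N is the number of observations and log is the base-10 logarithm
• the size of the classes (depends on the homogeneity of the distribution):
$i = \frac{X_{\max} - X_{\min}}{k}$ (3)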
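Formulas (1)-(3) can be tried out with a short sketch. The function names, the rounding choices and the sample size N = 100 are assumptions made for illustration; the notes do not say how to round k.

```python
import math

def number_of_classes(n):
    """Return (upper bound from formula 1, suggestion from formula 2)."""
    # Formula (1): k <= 5 log N, rounded down here (rounding is an assumption).
    upper_bound = math.floor(5 * math.log10(n))
    # Formula (2): k = 1 + 3.322 log N, rounded to the nearest integer (assumption).
    suggested = round(1 + 3.322 * math.log10(n))
    return upper_bound, suggested

def class_width(x_min, x_max, k):
    """Formula (3): i = (X_max - X_min) / k."""
    return (x_max - x_min) / k

# Hypothetical data set: N = 100 observations ranging from 0.0 to 15.0.
upper_bound, suggested = number_of_classes(100)
width = class_width(0.0, 15.0, 3)
print(upper_bound, suggested, width)
```

In practice the computed width is usually rounded to a convenient value, and a few choices of k are compared before settling on one.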
a.             b.             c.
0.0 - 5.0      0.0 - 4.9      0.1 - 5.0
5.0 - 10.0     5.0 - 9.9      5.1 - 10.0
10.0 - 15.0    10.0 - 14.9    10.1 - 15.0
The choice of the class boundaries depends on the problem being studied.
The best way to find the most useful frequency distribution - test a few different choices of classes in
order to find the most useful solution.
Despite the arbitrariness of the choice of classes, it is worth remembering that:
• the classes must be nonoverlapping and must contain all observations
- each observation must fall into exactly one class
• the number of classes usually ranges between 5 and 20, with fewer classes used for smaller data
sets
• if the number of classes is too small, too much detail is lost
• if the number of classes is too large, detecting the major clustering of the observations is difficult
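The rule that each observation must fall into exactly one class can be sketched as follows. This sketch assumes classes closed on the left and open on the right, with the largest value placed in the last class; that is one possible convention, not necessarily the one used in the lecture. The wage values are hypothetical.

```python
def frequency_distribution(data, lower, width, k):
    """Count observations in k classes [lower + j*width, lower + (j+1)*width)."""
    counts = [0] * k
    upper = lower + k * width
    for x in data:
        j = int((x - lower) // width)
        if x == upper:
            j = k - 1  # put the overall maximum into the last class
        if not 0 <= j < k:
            raise ValueError(f"observation {x} falls outside all classes")
        counts[j] += 1
    return counts

# Hypothetical observations; classes 0-5, 5-10, 10-15.
wages = [0.5, 4.9, 5.0, 7.2, 9.9, 10.0, 14.9, 15.0]
counts = frequency_distribution(wages, lower=0.0, width=5.0, k=3)
print(counts)

# Every observation was counted exactly once.
print(sum(counts) == len(wages))
```

Raising an error for values outside the classes makes a violation of the "must contain all observations" rule visible instead of silently dropping data.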
In some cases it is useful to construct open classes (classes that have either no upper or no lower limit).
We use phrases such as "greater than", "or more", "less than", "or less".
Open classes should be used when some observations are extremely large (or small) in comparison with the others.
E.g. for wages, the class defined as "1000 or more" can contain units with a wage equal to 1100 PLN as well as units with a wage of 1 mln PLN.
This is a symptom of heterogeneity of the distribution.
When possible, however, open classes should be avoided (-> EXCEL).
The numbers a1 and b1 are called the class boundaries or class limits: a1 is the lower class limit and b1 is the upper class limit.
The difference b1 - a1 is called the width of the class and is designated by the symbol i or c (i = 1, 2, ..., k).
The number (a1 + b1)/2 is called the midpoint or class mark and is designated by the symbol Xi' or xi'.
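A minimal sketch of the width and midpoint calculations, using the class limits 0-5, 5-10 and 10-15 from variant a. of the boundary examples. The variable names are chosen for illustration only.

```python
# (lower limit, upper limit) for each class
classes = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0)]

# Width of a class: upper limit minus lower limit.
widths = [upper - lower for lower, upper in classes]

# Midpoint (class mark): the average of the two limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), w, m in zip(classes, widths, midpoints):
    print(f"{lower} - {upper}: width {w}, midpoint {m}")
```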
Looking only at the absolute frequencies, one might conclude that unemployed people in Poznań are in a much worse position than those in Oborniki:
many more of them have been seeking work for a long time, while in Oborniki only a few have.
However, although the absolute frequencies are different, the relative frequencies are similar and indicate similarities in the labour market situation in both poviats.
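The difference between absolute and relative frequencies can be illustrated with a small sketch. The counts below are hypothetical and are not the actual figures for Poznań and Oborniki; they only show how two groups of very different size can have the same relative distribution.

```python
def relative_frequencies(counts):
    """Divide each class frequency by the total number of observations."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical numbers of unemployed people by duration of job search
# (four classes, shortest to longest); NOT real data for any poviat.
large_poviat = [4000, 3000, 2000, 1000]
small_poviat = [400, 300, 200, 100]

rel_large = relative_frequencies(large_poviat)
rel_small = relative_frequencies(small_poviat)
print(rel_large)
print(rel_small)
```

The absolute frequencies differ by a factor of ten, but the relative frequencies are identical, so the structure of unemployment is the same in both hypothetical groups.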
UNIMODAL DISTRIBUTION
When the population being studied is homogeneous, the distribution has only one peak
(one maximum in the frequency).
BIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution has two peaks
(two maxima in the frequency), corresponding to 2 nonhomogeneous sectors,
e.g.
heights and weights of individuals have bimodal distributions: one peak refers to
males and one to females.
MULTIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution has several peaks
(several maxima in the frequency).
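As a rough illustration, the number of peaks can be counted directly from the class frequencies. This is a simple sketch that treats every local maximum as a peak; with real data small random bumps would also be counted, so the result should be checked against a graph. The function names and the frequencies are hypothetical.

```python
def count_peaks(freqs):
    """Count local maxima in a list of class frequencies."""
    # Collapse runs of equal neighbours so that a flat top counts as one peak.
    collapsed = [f for i, f in enumerate(freqs) if i == 0 or f != freqs[i - 1]]
    # Pad with zeros so that a maximum in the first or last class is detected.
    padded = [0] + collapsed + [0]
    return sum(
        1
        for left, mid, right in zip(padded, padded[1:], padded[2:])
        if mid > left and mid > right
    )

def modality(freqs):
    """Name the shape according to the number of peaks."""
    peaks = count_peaks(freqs)
    if peaks == 1:
        return "unimodal"
    if peaks == 2:
        return "bimodal"
    return "multimodal" if peaks > 2 else "no peak"

print(modality([2, 5, 9, 5, 2]))         # one peak
print(modality([2, 8, 3, 7, 1]))         # two peaks
print(modality([1, 4, 2, 5, 2, 6, 1]))   # three peaks
```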
Descriptive Statistics - Frequency Distributions and Graphs
SYMMETRIC DISTRIBUTION
Both sides of this distribution are identical (the halves are mirror images).
There is a value (a place on the graph) such that the portion of the distribution to the left of
that value is the mirror image of the portion of the distribution to the right of it.
A SKEWED DISTRIBUTION - a distribution is skewed if one tail is longer than the other, i.e. if most of the population is
located towards one side of the distribution.
A RIGHT-SKEWED DISTRIBUTION (POSITIVELY SKEWED)
The right tail is longer than the left one, i.e.
most of the population is located towards the left side of the distribution.
- Income: most people have low incomes; many other economic (costs, revenues of enterprises) and
demographic variables behave similarly.
A LEFT-SKEWED DISTRIBUTION (NEGATIVELY SKEWED)
The left tail is longer than the right one, i.e. most of the population is
located towards the right side of the distribution.
- Age at death: most people die at ages over 60, and a long tail extends to the left between
ages 0 and 55.
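The direction of skewness can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment); this measure is not given in the notes and is used here only as one common choice. A positive value indicates a longer right tail, a negative value a longer left tail. The data are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (a swapped-in measure)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical values: most observations are low, one is very high.
right_skewed = [1, 1, 1, 2, 2, 3, 10]
# Mirror image of the same data: most observations are high, one is very low.
left_skewed = [11 - x for x in right_skewed]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)   # longer right tail
print(skewness(left_skewed) < 0)    # longer left tail
print(skewness(symmetric))          # zero for symmetric data
```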
J-SHAPED DISTRIBUTION - there is no tail on the side of the class with the highest frequency.
DISTRIBUTION WHICH IS EXTREMELY SKEWED TO THE RIGHT
peaks at or near the origin of the data and then tails off to the right,
e.g. age of unemployed people.
Most unemployed people are young, up to the age of 24, but a long tail extends to the right between ages 24 and 64.
DISTRIBUTION WHICH IS EXTREMELY SKEWED TO THE LEFT
peaks at or near the maximum value of the data and then tails off to the left,
e.g. duration of unemployment.
Most unemployed people seek work for over 12 months, but a long tail extends to the left between 1 and 12 months.
NORMAL DISTRIBUTION
The bell-shaped symmetric curve is called the normal distribution.
This is the most important distribution in statistical theory because numerous
variables conform to it.
It is a symmetrical distribution that is densely scattered about the mean and
becomes sparse at the extremes.
4. ANALYSIS OF CONCENTRATION
Distribution of the total value between the elementary units:
whether the total value of the variable is uniformly distributed
between the elementary units (graphs 1 and 5) or not (the remaining graphs)
ANALYSIS OF KURTOSIS (PEAKEDNESS)
Whether the distribution is mesokurtic, leptokurtic or platykurtic:
concentration of elementary units near the mean value

Wage xi (in thous.)
           1.    2.    3.    4.    5.    6.
           0     1     1     1     0     1
           2     1     1     2     2     1
           4     6     1     3     4     6
           6     6     9     3     6     6
           8     6     9     11    8     6
           10    11    10    11    10    11
           12    11    11    11    12    11
total =    42    42    42    42    42    42
mean =     6     6     6     6     6     6
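The six columns of wage values in the table can be checked with a short sketch. The numbering of the columns from 1 to 6 is an assumption, and the moment coefficient of kurtosis (fourth central moment divided by the square of the second) is a measure not given in the notes; it is used here only as one common way to quantify peakedness. For the normal distribution this coefficient equals 3.

```python
# Wage values (in thous.) copied column by column from the table;
# the labels 1-6 are an assumed numbering of the columns.
distributions = {
    1: [0, 2, 4, 6, 8, 10, 12],
    2: [1, 1, 6, 6, 6, 11, 11],
    3: [1, 1, 1, 9, 9, 10, 11],
    4: [1, 2, 3, 3, 11, 11, 11],
    5: [0, 2, 4, 6, 8, 10, 12],
    6: [1, 1, 6, 6, 6, 11, 11],
}

def moment_kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2**2 (a swapped-in measure)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

# For each column: (total, mean, kurtosis).
summary = {
    label: (sum(v), sum(v) / len(v), moment_kurtosis(v))
    for label, v in distributions.items()
}

for label, (total, mean, kurt) in summary.items():
    print(f"{label}. total = {total}, mean = {mean}, kurtosis = {kurt:.3f}")
```

All six columns share the same total and mean, so any differences between them come only from how the values are spread around the mean.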