2022 Statistics Fin 3

Statistics
Lecture 3
2021
Statistical surveys
Reasons for Sampling
1. Expense
2. Speed of Response
3. The Large Population
4. Destructive Sampling
5. Accuracy
1. The most important reason: it is usually less expensive to obtain a sample than to survey the
whole population.
2. Information is often needed very quickly, and it must be up to date to be useful.
3. Information concerning the characteristics of an infinite population can only be obtained from a
sample.
Populations are frequently described as infinite when the observations result from a continuing or recurring process, e.g. selling something or producing
something.
Observing every single element is then impossible because new observations are continually being created.
4. Very often the testing of a product (e.g. tasting chocolate or other sweets) requires the destruction of the
product.
5. Sometimes a sample survey can be more accurate than a census.
Stages of a statistical survey
Next step (where we are now): analysis
This requires: breaking the data down, applying methods, interpreting, presenting,
and trying to answer the survey objectives.
Presentation of Data: Frequency Distributions and Graphs
Tabular Presentation of Data - frequency distribution
QUALITATIVE DATA - group the data into classes according to the categories
QUANTITATIVE DATA - a useful way to simplify a set of data is a frequency distribution:
the division of all units into nonoverlapping subsets called class intervals (classes)
Frequency Distribution - a table that shows how many observations fall in each class (columns: xi, ni).
Class frequency (fi or ni) - the number of observations falling in a particular class (class i)
We can create one class for every value or one class for a group of values.
When constructing a frequency table it is important to specify:
• the appropriate number of classes k (depends on the aim of the analysis);
the number of classes should be relatively small:
$k \le 5 \log N$ (1)
$k = 1 + 3.322 \log N$ (2)
• the size of the classes (depends on the homogeneity of the distribution):
$i = \frac{X_{\max} - X_{\min}}{k}$ (3)
• the boundaries (limits) of the classes.

https://www.mathsisfun.com/data/frequency-distribution.html
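The rules for the number and size of classes can be tried out on a small example. The sketch below is only an illustration: it assumes that log is the base-10 logarithm (the constant 3.322 is approximately 1/log10 2), it reads formula (1) as an upper bound on k, and it uses hypothetical values N = 30, Xmin = 2, Xmax = 20. The function names are not from the lecture.

```python
import math


def max_classes(n):
    """Formula (1) read as an upper bound: largest integer k with k <= 5 * log10(N)."""
    return math.floor(5 * math.log10(n))


def sturges_classes(n):
    """Formula (2): k = 1 + 3.322 * log10(N), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def class_width(x_max, x_min, k):
    """Formula (3): i = (X_max - X_min) / k."""
    return (x_max - x_min) / k


# Hypothetical data set: 30 observations ranging from 2 to 20.
n_obs = 30
k_upper = max_classes(n_obs)
k_sturges = sturges_classes(n_obs)
width = class_width(20, 2, k_sturges)

print(k_upper, k_sturges, width)
```

Both formulas give only a starting point; the final number of classes is still chosen by testing a few alternatives.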
• the boundaries (limits) of the classes, e.g.:
a.             b.             c.
0.0 - 5.0      0.0 - 4.9      0.1 - 5.0
5.0 - 10.0     5.0 - 9.9      5.1 - 10.0
10.0 - 15.0    10.0 - 14.9    10.1 - 15.0
The choice of the class boundaries depends on the problem being studied.
We should remember that:

− the minimum should be included in the first class (and the maximum in the last class)
− the range of every class should be equal
− we try to get a SYMMETRIC distribution with only one peak
− the number of units per class should first increase and then decrease
− we look for a shape close to the NORMAL DISTRIBUTION by testing different solutions

The best way to find the most useful frequency distribution - test a few different choices of classes in
order to find the most useful solution.
Despite the arbitrariness of the choice of classes, it is worth remembering that:
• the classes must be nonoverlapping and must contain all observations
- each observation must fall into exactly one class
• the number of classes usually ranges between 5 and 20, with fewer classes used for smaller data
sets
• if the number of classes is too small -> too much detail is lost;
• if the number of classes is too large -> detecting the major clustering of the observations is difficult
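The requirement that each observation falls into exactly one class can be sketched in code. The example below is a minimal illustration, assuming equal-width classes that are closed on the left and open on the right; the function name and the call durations are hypothetical and not taken from the lecture.

```python
def frequency_table(values, lower, width, k):
    """Count observations in k equal classes [lower + j*width, lower + (j+1)*width)."""
    counts = [0] * k
    for v in values:
        j = int((v - lower) // width)  # index of the only class that can contain v
        if not 0 <= j < k:
            raise ValueError(f"{v} falls outside all classes")
        counts[j] += 1
    return counts


# Hypothetical call durations in minutes.
durations = [2.5, 3.0, 7.9, 8.0, 9.4, 13.9, 14.0, 19.5]
counts = frequency_table(durations, lower=2, width=6, k=3)

print(counts)
```

Because every value is mapped to a single index, the classes cannot overlap, and a value outside all classes raises an error instead of being silently dropped.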
Recommended size of classes: equal width, BUT...

in some cases unequal class widths are necessary to avoid classes with very low relative frequencies or
a class with a relatively high frequency.
In some cases it is useful to construct open classes (classes that have either no upper or no lower limit).
We use phrases such as: “greater than”, “or more”, “less than”, “or less”.
Open classes should be used when some observations are extremely large (or small) in comparison with the others.
E.g. for wages, the class defined as “1000 or more” can contain units with a wage equal to 1100 as well as units with a wage of 1 mln PLN.
This is a symptom of heterogeneity of the distribution.
When possible, however, open classes should be avoided!! -> EXCEL
The frequency distribution table consists of two columns:

• classes
• frequencies
Important!
Frequency distribution of telephone-call duration

Telephone-call duration    Number of calls    Midpoint
(in minutes)               (frequency)        (class mark)
Xi                         ni or fi           xi’
2 – 7.9                    9                  5
8 – 13.9                   15                 11
14 – 19.9                  6                  17
Total                      30

Construction of Class Intervals
In the first class, a1 = 2 is the lower class limit and a2 = 7.9 is the upper class limit; b1 = 8 is the lower class limit of the next class.
Let a1 and a2 be two real numbers such that a1 is less than a2.
A class interval consists of all real numbers that are greater than or equal to a1 and less than b1.
The numbers a1 and a2 are called the class boundaries or class limits: a1 is called the lower class limit, and a2 is called the upper class limit.
The difference b1 – a1 is called the width of the class, designated by the symbol i or c (i=1,2, ..., k): b1 – a1 = 8 – 2 = 6 (not 5.9).
The number (a1 + b1)/2 is called the midpoint or class mark, designated by the symbol Xi’ or xi’: (a1 + b1)/2 = (2 + 8)/2 = 5 (not 4.95).
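The width and midpoint rules can be checked against the telephone-call table. This is a minimal sketch: the lower limits 2, 8 and 14 come from the table, while the value 20 for the lower limit of a hypothetical next class is an assumption inferred from the listed midpoint 17.

```python
# Lower class limits from the table; 20 is an assumed limit of a next class.
lower_limits = [2, 8, 14]
next_lower = 20

edges = lower_limits + [next_lower]

# Width: lower limit of the next class minus lower limit of this class.
widths = [b - a for a, b in zip(edges, edges[1:])]

# Midpoint (class mark): average of the same two limits.
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]

print(widths)
print(midpoints)
```

Using the printed upper limit 7.9 instead of 8 would give a width of 5.9 and a midpoint of 4.95, which is the mistake the notes warn against.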
Relative frequency distribution

• shows what proportion of the observations fall in each class interval
• more useful than absolute frequencies when making comparisons between groups of different size
the class frequency: ni; the class relative frequency: wi = ni / N
Frequency distribution of duration of unemployment in two selected poviats of Wielkopolska voivodship
(note the open classes and the different widths of the classes)

Duration of        Poznań poviat                   Oborniki poviat
unemployment       Absolute        Relative        Absolute        Relative
(in months)        frequency ni    frequency wi    frequency ni    frequency wi
Less than 1        1362            0.13            204             0.08
1 – 3              2727            0.26            411             0.17
3 – 6              2667            0.25            558             0.23
6 – 12             2580            0.24            1006            0.42
12 – 24            845             0.08            153             0.06
24 and over        400             0.04            91              0.04
Total - N          10581           1               2423            1
Judging by the absolute frequencies alone, one might conclude that unemployed people in Poznań are in a worse position than in Oborniki,
because many more of them have been seeking work for a long time.
The relative frequencies, however, are similar and indicate similarities in the labour market situation in both poviats.
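The relative frequencies in the table follow directly from wi = ni / N. A minimal sketch using the absolute frequencies of both poviats (the function name is illustrative):

```python
def relative_frequencies(counts):
    """Return wi = ni / N for every class."""
    total = sum(counts)
    return [n / total for n in counts]


# Absolute frequencies from the table, in class order.
poznan = [1362, 2727, 2667, 2580, 845, 400]
oborniki = [204, 411, 558, 1006, 153, 91]

w_poznan = [round(w, 2) for w in relative_frequencies(poznan)]
w_oborniki = [round(w, 2) for w in relative_frequencies(oborniki)]

print(w_poznan)
print(w_oborniki)
```

Dividing by N removes the effect of group size, which is why the two poviats can be compared even though one has more than four times as many unemployed people.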
Cumulative frequency distribution

shows how many observations fall below each of the class boundaries
Cumulative relative frequency distribution

shows what proportion of observations fall below each class boundary
Unemployed in Poland in May 2018 by the duration of the unemployment spell

Duration of       Absolute     Relative     Absolute      Relative
unemployment      frequency    frequency    cumulative    cumulative
(in months)                                 frequency     frequency
Xi                fi           wi           cfi           cwi
Less than 1       173          0.0808       173           0.0808
2 – 3             330          0.1541       503           0.2349
4 – 6             414          0.1934       917           0.4283
7 – 12            479          0.2237       1396          0.6520
13 – 24           498          0.2326       1894          0.8846
24 and over       247          0.1154       2141          1.0000
Total             2141         1.0000
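The cumulative columns are running totals of the absolute frequencies. A minimal sketch that reproduces them from the fi column of the table above, using itertools.accumulate from the Python standard library:

```python
from itertools import accumulate

# Absolute frequencies fi from the table, in class order.
counts = [173, 330, 414, 479, 498, 247]
total = sum(counts)

# Cumulative absolute frequencies: running sum of fi.
cum_counts = list(accumulate(counts))

# Cumulative relative frequencies: each running sum divided by N.
cum_rel = [round(c / total, 4) for c in cum_counts]

print(cum_counts)
print(cum_rel)
```

The last cumulative absolute frequency always equals N, and the last cumulative relative frequency always equals 1.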
Descriptive Statistics - Frequency Distributions and Graphs
Shapes of frequency Distributions and Histograms
UNIMODAL DISTRIBUTION
When the population being studied is homogeneous, the distribution has only one peak (one
maximum in the frequency).
BIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution may have two peaks (two
maximums in the frequency), which correspond to 2 nonhomogeneous sectors,
e.g.
heights and weights of individuals have bimodal distributions: one peak refers to
males and one to females.
MULTIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution may have several peaks
(several maximums in the frequency).
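The number of peaks can be read off a list of class frequencies. The sketch below is a crude illustration rather than a formal test of modality: it counts classes whose frequency is strictly higher than that of both neighbours, so ties between neighbouring classes are not recognised as peaks. The function name and the second frequency list are hypothetical.

```python
def count_peaks(freqs):
    """Count classes whose frequency is strictly higher than both neighbours."""
    padded = [0] + list(freqs) + [0]  # lets the first and last class qualify
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


telephone_calls = [9, 15, 6]           # frequencies from the telephone-call table
two_groups = [3, 8, 4, 2, 7, 9, 3]     # hypothetical mixed population

print(count_peaks(telephone_calls))
print(count_peaks(two_groups))
```

In practice small random bumps also produce local maxima, so the class widths should be chosen first and the result checked against a histogram.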
Shapes of frequency Distributions and Histograms - SYMMETRY
SYMMETRIC DISTRIBUTION
Both sides of this distribution are identical (the halves are mirror images).
There is a value (a place on the graph) such that the portion of the distribution to the left of
the value is the mirror image of the portion of the distribution to the right of the value.
A SKEWED DISTRIBUTION - a distribution is skewed if one tail is longer than the other, i.e. if most of the population is
located towards one side of the distribution.
A RIGHT-SKEWED DISTRIBUTION (POSITIVELY SKEWED)
if the right tail is longer than the left one, i.e.
if most of the population is located towards the left side of the distribution
- e.g. income (most people have low incomes) and many economic (costs, revenues of enterprises) and
demographic variables
A LEFT-SKEWED DISTRIBUTION (NEGATIVELY SKEWED)
if the left tail is longer than the right one, i.e. if most of the population is
located towards the right side of the distribution
- e.g. age at death: most people die at ages over 60, and a long tail extends to the left between
ages 0 and 55
J-shaped distributions - there is no tail on the side of the class with the highest frequency.
DISTRIBUTION WHICH IS EXTREMELY SKEWED TO THE RIGHT
peaks at or near the origin of the data and then tails off to the right
e.g. age of unemployed people
Most unemployed people are young, up to the age of 24, but a long tail extends to the right between ages 24 and 64.
DISTRIBUTION WHICH IS EXTREMELY SKEWED TO THE LEFT
peaks at or near the maximum value of the data and then tails off to the left
e.g. duration of unemployment
Most unemployed people seek work for over 12 months, but a long tail extends to the left between 1 and 12 months.
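The direction of skewness can also be checked numerically. The lecture gives no formula at this point, so the sketch below assumes one common measure, the moment coefficient of skewness m3 / m2^1.5, which is positive for a longer right tail, negative for a longer left tail and zero for a symmetric distribution. The data sets are hypothetical.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 1, 2, 2, 3, 10]   # most values low, long right tail
left_skewed = [10, 10, 9, 9, 8, 1]   # most values high, long left tail
symmetric = [1, 2, 3, 4, 5]

for data in (right_skewed, left_skewed, symmetric):
    print(round(moment_skewness(data), 3))
```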
Shapes of frequency Distributions and Histograms - CONCENTRATION
NORMAL DISTRIBUTION
The bell-shaped symmetric curve is called the normal distribution.
This is the most important distribution in statistical theory because numerous
variables conform to it.
A symmetrical distribution is densely scattered about the mean and
becomes sparse at the extremes.
MESOKURTIC (normal) DISTRIBUTION - distributions with the same
concentration about the mean value as the normal distribution.
LEPTOKURTIC DISTRIBUTION - distributions with a higher (greater)
concentration about the mean value than the normal distribution;
they are more peaked than the normal one.
PLATYKURTIC DISTRIBUTION - distributions with a lower (smaller)
concentration about the mean value than the normal distribution;
they are flatter than the normal one.
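The three types can be told apart numerically. The lecture does not state a formula here, so the sketch below assumes one common measure, the excess kurtosis m4 / m2^2 - 3, which is 0 for the normal distribution, positive for more peaked distributions and negative for flatter ones. Both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


peaked = [1, 5, 5, 5, 5, 5, 5, 5, 9]   # most units sit exactly at the mean
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # units spread evenly

print(round(excess_kurtosis(peaked), 2))
print(round(excess_kurtosis(flat), 2))
```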
THE UNIFORM DISTRIBUTION (rectangular)
Every value appears with equal frequency (no particular interval of values has higher
relative frequency than any other interval of equal width).
e.g. the ages of pupils at an elementary school (6-14):
each age has approximately the same relative frequency
(assuming that each grade has about the same number of pupils),
or the relative frequency of appearance of each digit from 0 to 9 in a random number table,
which is approximately the same.
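The random-digit example can be imitated with a pseudo-random generator. The sketch below is an illustration only: it draws 10,000 digits with Python's random module (a fixed seed is used so the run is repeatable) instead of reading a printed random number table.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed, chosen arbitrarily
digits = [rng.randrange(10) for _ in range(10_000)]

counts = Counter(digits)
rel_freq = {d: counts[d] / len(digits) for d in range(10)}

for d in range(10):
    print(d, round(rel_freq[d], 3))
```

Each relative frequency should come out close to 0.1, with small deviations that shrink as more digits are drawn.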
Descriptive Statistics
1. ANALYSIS OF LOCATION (CENTRAL TENDENCY)

where along the scale of all possible values our particular distribution
happens to be centered (mean, median, mode)
The purpose: to describe the population with one figure, a representative value of a mass of data.
2. ANALYSIS OF DISPERSION (VARIATION)
how the data vary (it should always complement measures of location
- the mean, when it appears alone, can be very misleading)
3. ANALYSIS OF SKEWNESS (SYMMETRY)

whether the distribution is symmetric or skewed
- i.e. whether both sides of the distribution are identical
4. ANALYSIS OF CONCENTRATION
Distribution of the total value between the elementary units:
whether the total value of the variable is uniformly distributed
between the elementary units (graphs 1 and 5) or not (the remaining graphs)
ANALYSIS OF KURTOSIS (PEAKEDNESS)
whether the distribution is mesokurtic, leptokurtic or platykurtic
Concentration of elementary units near the mean value

Wage xi (in thous.) in six example distributions:
xi         xi    xi    xi    xi    xi
0          1     1     1     0     1
2          1     1     2     2     1
4          6     1     3     4     6
6          6     9     3     6     6
8          6     9     11    8     6
10         11    10    11    10    11
12         11    11    11    12    11
total = 42 42    42    42    42    42
mean = 6   6     6     6     6     6
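The wage table shows why the mean should not appear alone: all six columns have the same total and the same mean. A minimal sketch that checks this and adds one measure of dispersion; the population standard deviation is used here as an assumption, since the notes do not name a specific measure at this point.

```python
from statistics import mean, pstdev

# The six columns of the wage table (in thousands), read top to bottom.
columns = [
    [0, 2, 4, 6, 8, 10, 12],
    [1, 1, 6, 6, 6, 11, 11],
    [1, 1, 1, 9, 9, 10, 11],
    [1, 2, 3, 3, 11, 11, 11],
    [0, 2, 4, 6, 8, 10, 12],
    [1, 1, 6, 6, 6, 11, 11],
]

totals = [sum(col) for col in columns]
means = [mean(col) for col in columns]
spreads = [round(pstdev(col), 2) for col in columns]

print(totals)
print(means)
print(spreads)
```

Identical means with different standard deviations indicate that the distributions differ in shape even though their location is the same.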
