
Statistics

Lecture 3

2021
Statistical surveys

Reasons for Sampling

1. Expense
2. Speed of Response
3. The Large Population
4. Destructive Sampling
5. Accuracy

1. The most important reason: it is usually less expensive to obtain a sample than to survey the
whole population.
2. Nowadays information is often needed very quickly; it must be up to date to be useful.
3. Information concerning characteristics of an infinite population can only be obtained from a
sample.
Populations are frequently described as infinite when the observations result from a continuing or recurring process, e.g. selling or producing
something.
Observing every single element is impossible because new observations are continually created.
4. Very often the testing of a product (e.g. tasting chocolate or other sweets) requires the destruction of the
product.
5. Sometimes a sample survey can be more accurate than a census.
Stages of a statistical survey

Next step: analysis.
This requires breaking the data down, applying methods, interpreting, presenting,
and trying to answer the survey objectives.
Presentation of Data: Frequency Distributions and Graphs

Tabular Presentation of Data - frequency distribution

QUALITATIVE DATA - group the data into classes according to the categories.

QUANTITATIVE DATA - a useful way to simplify a set of data is a frequency distribution:
all units are divided into nonoverlapping subsets called class intervals (classes).

Frequency distribution - a table that shows how many observations fall in each class.
Class frequency (fi or ni) - the number of observations falling in a particular class (class i).

The table has two columns (xi, ni). We can create one class for every value
or one class for a group of values.

When constructing a frequency table it is important to specify:

• the appropriate number of classes k (the aim of the analysis);
  the number of classes should be relatively small:
      k ≤ 5 log N            (1)
      k = 1 + 3.322 log N    (2)
  where N is the number of observations and log is the base-10 logarithm,
• the size (width) of the classes i (homogeneity of the distribution):
      i = (Xmax − Xmin) / k  (3)
• the boundaries (limits) of the classes.
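
A minimal sketch of formulas (1)-(3) in Python. It assumes that log is the base-10 logarithm, that formula (1) is read as an upper bound on k, and that k from formula (2) is rounded to the nearest whole number. The function name and the example values (N = 100, range 0 to 70) are made up for illustration.

```python
import math

def suggest_classes(n_obs, x_min, x_max):
    """Return (upper bound on k, suggested k, class width)."""
    k_upper = 5 * math.log10(n_obs)            # formula (1), read as k <= 5 log N
    k = round(1 + 3.322 * math.log10(n_obs))   # formula (2), rounded (assumption)
    width = (x_max - x_min) / k                # formula (3)
    return k_upper, k, width

# Hypothetical example: 100 observations ranging from 0 to 70.
k_upper, k, width = suggest_classes(n_obs=100, x_min=0.0, x_max=70.0)
print(f"k <= {k_upper:.1f}, suggested k = {k}, class width = {width}")
```

In practice the computed width is usually rounded to a convenient value, and a few alternatives are tried.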


https://www.mathsisfun.com/data/frequency-distribution.html

• the boundaries (limits) of the classes.

a.             b.             c.
 0.0 -  5.0     0.0 -  4.9     0.1 -  5.0
 5.0 - 10.0     5.0 -  9.9     5.1 - 10.0
10.0 - 15.0    10.0 - 14.9    10.1 - 15.0

The choice of the class boundaries depends on the problem being studied.

We should remember that:

− the minimum value should be included in the first class (and the maximum in the last one)
− the range (width) of every class should be equal
− we try to get a SYMMETRIC distribution with only one peak
− the number of units per class should first increase and then decrease

We look for a shape close to the NORMAL DISTRIBUTION by testing different solutions.



The best way to find the most useful frequency distribution is to test a few different choices of classes.
Despite the arbitrariness of the choice of the classes, it is worth remembering that:
• the classes must be nonoverlapping and must contain all observations
- each observation must fall into exactly one class
• the number of classes usually ranges between 5 and 20, with lower numbers of classes used for smaller data
sets
• when the number of classes is too small -> too much detail is lost;
• when the number of classes is too large -> detecting the major clustering of the observations is difficult
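
The rule that each observation must fall into exactly one class can be checked mechanically. The sketch below assumes classes that include their lower limit and exclude their upper limit. The function names and the observations are made up for illustration.

```python
def classes_containing(value, classes):
    """Return all classes [lower, upper) that contain the value."""
    return [(lower, upper) for lower, upper in classes if lower <= value < upper]

def is_valid_classification(data, classes):
    """True when every observation falls into exactly one class."""
    return all(len(classes_containing(x, classes)) == 1 for x in data)

observations = [0.0, 4.9, 5.0, 9.99, 12.5]       # hypothetical data

good = [(0, 5), (5, 10), (10, 15)]               # nonoverlapping, covers all values
with_gap = [(0, 5), (6, 10), (10, 15)]           # 5.0 falls into no class
overlapping = [(0, 6), (5, 10), (10, 15)]        # 5.0 falls into two classes

print(is_valid_classification(observations, good))
print(is_valid_classification(observations, with_gap))
print(is_valid_classification(observations, overlapping))
```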

Recommended size of classes: equal width, BUT...


in some cases unequal class widths are necessary to avoid classes with very low relative frequencies, or
a class with a relatively high frequency.

In some cases it is useful to construct open classes (classes that have either no upper or no lower limit).
We use a phrase such as: “greater than”, “or more”, “less than”, “or less”.
Open classes should be used when some observations are extremely large (or small) in comparison with the others.
E.g. for wages, the class defined as “1000 or more” can contain units with a wage equal to 1100 PLN as well as units with a wage of 1 million PLN.
This is a symptom of heterogeneity of the distribution.
When possible, however, open classes should be avoided! -> EXCEL

The frequency distribution table consists of two columns:

• classes
• frequencies

Frequency distribution of telephone-call duration

Telephone-call duration    Number of calls    Midpoint
(in minutes)               (frequency)        (class mark)
Xi                         ni or fi           xi'
 2 –  7.9                   9                  5
 8 – 13.9                  15                 11
14 – 19.9                   6                 17
Total                      30

Important! For the first class: a1 = 2 is the lower class limit, a2 = 7.9 is the upper class limit,
and b1 = 8 is the lower limit of the next class.
Width of the class: b1 − a1 = 8 − 2 = 6 (not 5.9).
Midpoint (class mark): xi' = (a1 + b1)/2 = (2 + 8)/2 = 5 (not 4.95).

Construction of Class Intervals
Let a1 and a2 be two real numbers such that a1 is less than a2, and let b1 be the lower limit of the next class.
A class interval consists of all real numbers that are greater than or equal to a1 and less than b1.

The numbers a1 and a2 are called the class boundaries or class limits: a1 is the lower class limit and a2 is the upper class limit.
The difference b1 – a1 is called the width of the class, designated by the symbol i or c.
The number (a1 + b1)/2 is called the midpoint or class mark, designated by the symbol Xi' or xi'.
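
The width and midpoint rules can be reproduced for the telephone-call table with a short sketch. The class limits and frequencies come from the table. The value 20, used as the lower limit of a class that would follow 14 – 19.9, is an assumption needed to close the last class.

```python
# Lower class limits and frequencies from the telephone-call table.
lower_limits = [2, 8, 14]
next_lower_limit = 20      # assumption: where a further class would start
frequencies = [9, 15, 6]

edges = lower_limits + [next_lower_limit]

# Width is b - a and midpoint is (a + b) / 2, where b is the next lower limit.
widths = [b - a for a, b in zip(edges, edges[1:])]
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]

for a, b, n, w, m in zip(edges, edges[1:], frequencies, widths, midpoints):
    print(f"[{a}, {b})  frequency={n}  width={w}  midpoint={m}")
```

Using the next lower limit (8) instead of the printed upper limit (7.9) is what gives a width of 6 rather than 5.9.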

Relative frequency distribution


• shows what proportion of the observations fall in each class interval
• more useful than absolute frequencies when making comparisons between groups of different size

class frequency: ni          class relative frequency: wi = ni / N

Frequency distribution of duration of unemployment in two selected
poviats of Wielkopolska voivodship

Duration of       Poznań poviat                  Oborniki poviat
unemployment      Absolute       Relative        Absolute       Relative
(in months)       frequency ni   frequency wi    frequency ni   frequency wi
Less than 1        1362           0.13             204           0.08
1 – 3              2727           0.26             411           0.17
3 – 6              2667           0.25             558           0.23
6 – 12             2580           0.24            1006           0.42
12 – 24             845           0.08             153           0.06
24 and over         400           0.04              91           0.04
Total (N)         10581           1.00            2423           1.00

Note: the table uses open classes and classes of different width.

Although the absolute frequencies are different, the relative frequencies are similar:
- judging by the absolute frequencies alone, one might conclude that far more unemployed people in Poznań are in a worse position than in Oborniki
  (many more of them have been seeking work for a long time, while in Oborniki only a few have);
- the relative frequencies, however, indicate similarities in the labour market situation in both poviats.
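
The relative frequencies wi = ni / N in the table can be recomputed from the absolute frequencies. This is a minimal sketch using the counts from the table; the function and variable names are illustrative.

```python
labels = ["Less than 1", "1-3", "3-6", "6-12", "12-24", "24 and over"]
poznan = [1362, 2727, 2667, 2580, 845, 400]
oborniki = [204, 411, 558, 1006, 153, 91]

def relative_frequencies(counts):
    """Divide every class frequency ni by the total N."""
    total = sum(counts)
    return [n / total for n in counts]

w_poznan = relative_frequencies(poznan)
w_oborniki = relative_frequencies(oborniki)

for label, wp, wo in zip(labels, w_poznan, w_oborniki):
    print(f"{label:<12} Poznań: {wp:.2f}   Oborniki: {wo:.2f}")
```

Because each column is divided by its own total, the two poviats can be compared even though N differs by a factor of more than four.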

Cumulative frequency distribution

shows how many observations fall below each of the class boundaries

Cumulative relative frequency distribution

shows what proportion of observations fall below each class boundary

Unemployed in Poland in May 2018 by the duration of unemployment spell

Duration of      Absolute     Relative     Absolute      Relative
unemployment     frequency    frequency    cumulative    cumulative
(in months)                                frequency     frequency
Xi               fi           wi           cfi           cwi
Less than 1       173         0.0808        173          0.0808
2 – 3             330         0.1541        503          0.2349
4 – 6             414         0.1934        917          0.4283
7 – 12            479         0.2237       1396          0.6520
13 – 24           498         0.2326       1894          0.8846
24 and over       247         0.1154       2141          1.0000
Total            2141         1.0000
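
The cumulative columns are running totals of the absolute and relative frequencies. A minimal sketch using the absolute frequencies from the table above (variable names are illustrative):

```python
from itertools import accumulate

frequencies = [173, 330, 414, 479, 498, 247]   # fi from the table
total = sum(frequencies)

# Running total of the absolute frequencies (cfi).
cumulative = list(accumulate(frequencies))

# Cumulative relative frequencies (cwi): each running total divided by N.
cumulative_relative = [cf / total for cf in cumulative]

for cf, cw in zip(cumulative, cumulative_relative):
    print(f"{cf:>5}  {cw:.4f}")
```

The last cumulative frequency always equals N and the last cumulative relative frequency equals 1.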
Descriptive Statistics - Frequency Distributions and Graphs

Shapes of frequency Distributions and Histograms

UNIMODAL DISTRIBUTION
When the population being studied is homogeneous, the distribution has only one peak
(one maximum in the frequency).

BIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution has two peaks
(two maxima in the frequency), which correspond to 2 nonhomogeneous sectors,
e.g.
heights and weights of individuals have bimodal distributions: one peak refers to
males and one to females.

MULTIMODAL DISTRIBUTION
When the population being studied is nonhomogeneous, the distribution has several peaks
(several maxima in the frequency).
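
One simple way to tell these shapes apart is to count the peaks in a list of class frequencies. The sketch below is an illustration only, not a method given in the lecture: it treats a class as a peak when its frequency is strictly higher than both neighbours, so flat plateaus are not counted. The frequencies are made up.

```python
def count_peaks(frequencies):
    """Count classes whose frequency is strictly above both neighbours."""
    padded = [0] + list(frequencies) + [0]   # lets the end classes be peaks too
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

unimodal = [2, 5, 9, 6, 3]            # hypothetical frequencies, one peak
bimodal = [3, 8, 4, 2, 7, 9, 5]       # hypothetical frequencies, two peaks

print(count_peaks(unimodal))
print(count_peaks(bimodal))
```

With real data, small random bumps can appear as extra peaks, so the class width affects the result.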

Shapes of frequency Distributions and Histograms - SYMMETRY

SYMMETRIC DISTRIBUTION
Both sides of this distribution are identical (the halves are mirror images).
There is a value (a place on the graph) such that the portion of the distribution to the left of
the value is the mirror image of the portion of the distribution to the right of the value.
A SKEWED DISTRIBUTION - a distribution is skewed if one tail is longer than the other, i.e. if most of the population is
located towards one side of the distribution.

A RIGHT-SKEWED DISTRIBUTION (POSITIVELY SKEWED)
if the right tail is longer than the left one, i.e.
if most of the population is located towards the left side of the distribution
- e.g. income (most people have low incomes) and many economic (cost, revenue of enterprises) and
demographic variables

A LEFT-SKEWED DISTRIBUTION (NEGATIVELY SKEWED)
if the left tail is longer than the right one, i.e. if most of the population is
located towards the right side of the distribution
- e.g. age at death (most people die at ages over 60, and a long tail extends to the left between
ages 0 and 55)

J-SHAPED DISTRIBUTION - there is no tail on the side of the class with the highest frequency.

A distribution which is extremely skewed to the right
peaks at or near the origin of the data and then tails off to the right,
e.g. age of unemployed people:
most unemployed people are young, up to the age of 24,
but a long tail extends to the right between ages 24 and 64.

A distribution which is extremely skewed to the left
peaks at or near the maximum value of the data and then tails off to the left,
e.g. duration of unemployment:
most unemployed people seek work for over 12 months,
but a long tail extends to the left between 1 and 12 months.
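
The direction of skewness can be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the standard deviation cubed), which is an assumption here because the slides give no formula. A positive value indicates a longer right tail and a negative value a longer left tail. All data are made up.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 9]                 # long tail to the right
left_skewed = [10 - x for x in right_skewed]      # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
print(skewness(symmetric))      # zero
```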

Shapes of frequency Distributions and Histograms - CONCENTRATION

NORMAL DISTRIBUTION
The bell-shaped symmetric curve is called the normal distribution.
This is the most important distribution in statistical theory because numerous
variables conform to it.
It is a symmetrical distribution that is densely scattered about the mean and
becomes sparse at the extremes.

MESOKURTIC (normal) DISTRIBUTION - distributions with the same
concentration about the mean value as the normal distribution.

LEPTOKURTIC DISTRIBUTION - distributions with a higher (greater)
concentration about the mean value than the normal distribution,
i.e. distributions which are more peaked than the normal one.

PLATYKURTIC DISTRIBUTION - distributions with a lower (smaller)
concentration about the mean value than the normal distribution,
i.e. distributions which are flatter than the normal one.

THE UNIFORM DISTRIBUTION (rectangular)
Every value appears with equal frequency (no particular interval of values has a higher
relative frequency than any other interval of equal width),
e.g. ages of pupils at an elementary school (6-14):
each age has approximately the same relative frequency
(assuming that each grade has about the same number of pupils),
or the relative frequency of appearance of each digit from 0 to 9 in a random number table,
which is approximately the same.
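
Concentration about the mean can be measured with the excess kurtosis, m4 / m2^2 − 3. This formula is an assumption here, since the slides do not give one. Under the usual convention a value near 0 is mesokurtic, a positive value is leptokurtic and a negative value is platykurtic. The peaked sample is made up; the uniform sample is the digits 0 to 9, each appearing once.

```python
def excess_kurtosis(data):
    """Excess kurtosis: m4 / m2**2 - 3 (equals 0 for the normal distribution)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

uniform_digits = list(range(10))          # every digit with equal frequency
peaked = [0] + [5] * 8 + [10]             # hypothetical: most values at the mean

print(excess_kurtosis(uniform_digits))    # negative: flatter than normal
print(excess_kurtosis(peaked))            # positive: more peaked than normal
```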
Descriptive Statistics

1. ANALYSIS OF LOCATION (CENTRAL TENDENCY)
where along the scale of all possible values our particular distribution
happens to be centered (mean, median, mode).
The purpose is to describe the population with one figure: a representative value of a mass of data.

2. ANALYSIS OF DISPERSION (VARIATION)
how the data vary (it should always complement measures of location:
the mean, when it appears alone, can be very misleading).

3. ANALYSIS OF SKEWNESS (SYMMETRY)
whether the distribution is symmetric or skewed,
i.e. whether both sides of the distribution are identical.

4. ANALYSIS OF CONCENTRATION
Distribution of the total value between the elementary units:
whether the total value of the variable is uniformly distributed
between the elementary units (graphs 1 and 5) or not (the remaining graphs).

[Figure: six example distributions, panels 1-6]
ANALYSIS OF KURTOSIS (PEAKEDNESS)
whether the distribution is mesokurtic, leptokurtic or platykurtic
Concentration of elementary units near the mean value

Wage xi (in thous.), six example data sets

           xi    xi    xi    xi    xi    xi
            0     1     1     1     0     1
            2     1     1     2     2     1
            4     6     1     3     4     6
            6     6     9     3     6     6
            8     6     9    11     8     6
           10    11    10    11    10    11
           12    11    11    11    12    11
total =    42    42    42    42    42    42
mean =      6     6     6     6     6     6
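
The table illustrates why the mean alone can be misleading: all six data sets share the same total and mean, yet the values are spread differently. The sketch below recomputes the totals and means and adds the population standard deviation as one possible measure of dispersion. Treating the seven values as a whole population is an assumption.

```python
import statistics

# Rows of the wage table, one list per table row (six data sets side by side).
rows = [
    [0, 1, 1, 1, 0, 1],
    [2, 1, 1, 2, 2, 1],
    [4, 6, 1, 3, 4, 6],
    [6, 6, 9, 3, 6, 6],
    [8, 6, 9, 11, 8, 6],
    [10, 11, 10, 11, 10, 11],
    [12, 11, 11, 11, 12, 11],
]

# Transpose so that each inner list holds one data set (one table column).
data_sets = [list(column) for column in zip(*rows)]

totals = [sum(values) for values in data_sets]
means = [statistics.mean(values) for values in data_sets]
std_devs = [statistics.pstdev(values) for values in data_sets]

for position, (t, m, s) in enumerate(zip(totals, means, std_devs), start=1):
    print(f"column {position}: total={t}  mean={m}  std dev={s:.2f}")
```

The totals and means are identical across columns, while the standard deviations differ.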
