UNIT 1
FREQUENCY DISTRIBUTION
Variables are characteristics of a population that are not constant across the
elements of that population.
Population
Example:
1. undergraduate students
2. BMW/Chanel sales

Sample
Example:
1. UCAM students, UCLA students, etc.
2. Spanish BMW/Chanel sales
1. Unidimensional frequency distributions
In order to ensure that the sample we work with is a representative sample of the whole
population, we take a random sample.
In a random sample all elements of the population are equally likely to be selected for
inclusion in the sample, i.e., the elements are drawn at random.
For example:
We are interested in conducting a study on the amount of physical exercise undertaken
by the general public. If we survey only people entering and leaving a gymnasium, we
obtain a biased sample of the population, and the results cannot be generalised to
the population at large. Hence, this sample is not a representative sample.
• A set of data on its own is very hard to interpret. There is a lot of information contained
in the data, but it is hard to see. We need ways of identifying the important features of
the data and of summarising it in a meaningful way.
• The use of graphs and summary statistics for understanding data is very useful and is the
first step we take in the statistical analysis.
We start with the frequency distribution of the variable, which provides an ordered
presentation of the set of its observations.
Qualitative data. Example (absolute frequencies $n_i$ of each observed colour, N = 10):
𝑥𝑖 𝑛𝑖
Green 1
Blue 4
Black 3
Brown 2
The relative frequency of the class i, denoted by fi, is its absolute frequency
divided by the total number of the elements in the sample.
$f_i = \dfrac{n_i}{N}, \qquad \sum_{i=1}^{n} f_i = 1$
Qualitative data
𝑥𝑖 𝑛𝑖 𝑓𝑖 𝑓𝑖 (%)
Green 1 1/10 10
Blue 4 4/10 40
Black 3 3/10 30
Brown 2 2/10 20
The cumulative absolute frequency $N_i = n_1 + \dots + n_i$ is the number of observations
accumulated up to class i, and $F_i = N_i / N$ is the cumulative relative frequency.
𝑥𝑖 𝑛𝑖 𝑓𝑖 𝑓𝑖 (%) 𝑁𝑖 𝐹𝑖 (%)
Green 1 1/10 10 1 10
Blue 4 4/10 40 5 50
Black 3 3/10 30 8 80
Brown 2 2/10 20 10 100
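The table above can be reproduced with a short Python sketch (the raw list of colour observations is reconstructed from the counts; names are illustrative):

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw sample of 10 colours matching the counts in the table
data = ["Green", "Blue", "Blue", "Blue", "Blue",
        "Black", "Black", "Black", "Brown", "Brown"]

counts = Counter(data)                      # absolute frequencies n_i
N = len(data)
order = ["Green", "Blue", "Black", "Brown"]

n = [counts[c] for c in order]              # n_i
f = [ni / N for ni in n]                    # relative frequencies f_i
Ni = list(accumulate(n))                    # cumulative absolute frequencies N_i
Fi = [v / N for v in Ni]                    # cumulative relative frequencies F_i

for row in zip(order, n, f, Ni, Fi):
    print(row)
```

Note that the last cumulative relative frequency is always 1, mirroring $\sum f_i = 1$.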
For the pie chart (sector graph), each class is assigned an angle proportional to its
relative frequency:

$w_i = 360° \cdot f_i$

x_i     n_i   f_i    f_i(%)   N_i   F_i(%)   w_i = 360°·f_i
Green   1     1/10   10       1     10       36°
Blue    4     4/10   40       5     50       144°
Black   3     3/10   30       8     80       108°
Brown   2     2/10   20       10    100      72°

[Figures: bar chart of the frequency distribution and pie chart / sector graph]
Quantitative data. DISCRETE

[Figures: bar chart of the frequency distribution, cumulative frequency distribution
(step function), and pie chart for a discrete variable taking the values 0–4]
Quantitative data. CONTINUOUS
Example: X= weight
Xi= {80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91}
N=10
How to group the data?
• Number of intervals: $\sqrt{N} = \sqrt{10} = 3.16 \approx 3$ intervals (round the
number toward the lowest integer)
• Amplitude: $a_i = \dfrac{\max x_i - \min x_i}{\text{nº of intervals}} = \dfrac{100 - 70}{3} = 10$
• Class mark (also denoted by $c_i$): $x_i = L_i + \dfrac{L_{i+1} - L_i}{2}$
• Density: $d_i = \dfrac{n_i}{a_i}$ (or, in relative terms, $\dfrac{f_i(\%)}{a_i}$)
Interval $L_i - L_{i+1}$   x_i (c_i)   a_i   n_i   f_i   f_i(%)   w_i    N_i   F_i   F_i(%)   d_i
[70, 80)                   75          10    1     0.1   10       36°    1     0.1   10       0.1
[80, 90)                   85          10    6     0.6   60       216°   7     0.7   70       0.6
[90, 100]                  95          10    3     0.3   30       108°   10    1     100      0.3
Total                                        10    1     100      360°

[Figures: histogram of densities $d_i$ and cumulative frequency polygon $F_i$ over the
intervals 70–100]
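The grouping steps above (√N rule, common amplitude, class marks and densities) can be sketched in Python; names are illustrative:

```python
import math

# Weight sample from the example
x = [80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91]
N = len(x)

k = int(math.sqrt(N))                              # number of intervals, rounded down
a = (max(x) - min(x)) / k                          # common amplitude (100 - 70) / 3 = 10
edges = [min(x) + i * a for i in range(k + 1)]     # interval limits [70, 80, 90, 100]

# Count observations per interval; the last interval is closed on the right
n = [0] * k
for v in x:
    i = min(int((v - edges[0]) // a), k - 1)
    n[i] += 1

marks = [(edges[i] + edges[i + 1]) / 2 for i in range(k)]   # class marks c_i
density = [ni / a for ni in n]                              # densities d_i = n_i / a_i
print(n, marks, density)
```

The densities, not the raw counts, are the heights to plot in the histogram when intervals share the same amplitude or differ in width.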
• Mean
• Median
• Mode
• Mean, $\bar{x}$:
The Mean of a quantitative dataset is the sum of the observations in the dataset
divided by the number of observations in the dataset.
• Median, 𝑀𝑑 :
The Median of a quantitative dataset is the number in the middle of the
observations arranged in ascending order.
• Mode, 𝑀𝑜 :
The Mode of a dataset is the observation that occurs most frequently in the
dataset.
2. Measures of location/position
Measures of central location
Mean:
There are two means, the population mean, μ, and the sample mean, $\bar{x}$.
The calculation of both is the same, except that μ is calculated for the
entire population and $\bar{x}$ is calculated for a sample taken from that population.
Mean:

$\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$,
or, for data grouped into classes with class marks $c_i$,
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} c_i n_i$
Example:
Compute the mean for the following sample: x = {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60}
The mean is calculated by adding all the values and dividing the sum by the number
of observations: 623/11 ≈ 56.6 years.
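The same computation in a one-line Python sketch:

```python
# Sample from the example
x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean = sum(x) / len(x)    # 623 / 11
print(round(mean, 1))     # 56.6
```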
Properties:….
Median:
The median is the middle value of the distribution of all observations in the
sample arranged in ascending or descending order.
The median divides the distribution in half (there are 50% of observations on
either side of the median value).
For continuous variables, we look for the value which accumulates 50% of the
cumulative relative frequency ($F_i$). Since the data is grouped into intervals,
we may need to apply linear interpolation.
Median:
• the number of observations is odd: {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the middle value, which is 57 years
• the number of observations is even: {52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the mean of the two middle values, which equals 56.5 years
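Both cases can be handled with one small Python function:

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    # odd count: middle value; even count: mean of the two middle values
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

odd  = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
even = [52] + odd
print(median(odd), median(even))   # 57 56.5
```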
Mode:
In a given relative frequency distribution with class intervals, the mode is the
mid-point of the class interval which has the highest relative frequency. The
class interval of the highest relative frequency is called the Modal Class.
The mode measures data concentration and can therefore be used to locate the
region in a large dataset where much of the data is concentrated.
Mode:
In some distributions, the mode may not exactly represent the centre of the
distribution.
For example, in the sample {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60} the centre of
the distribution is 57 years, but the mode is lower, at 54 years.
It is also possible to have more than one mode for the same distribution of data
(bi-modal or multi-modal). The presence of more than one mode can limit the
ability of the mode to describe the centre or typical value of the distribution,
because a single value describing the centre of the distribution cannot be
identified.
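The standard library exposes exactly this behaviour: `statistics.multimode` returns every most-frequent value, so a multi-modal sample yields a list with several entries.

```python
from statistics import multimode

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(multimode(x))   # [54] — unimodal sample

# A bimodal sample: multimode returns all most-frequent values
y = [1, 1, 2, 2, 3]
print(multimode(y))   # [1, 2]
```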
[Figures: Plot A and Plot B]
2. Measures of location/position
Measures of non-central location
• Quartiles
• Percentiles
Percentiles
are those values that divide the frequency distribution into 100 equal parts.
The p-th percentile is a value such that at least p percent of the observations in
the frequency distribution are less than or equal to this value and at least
(100 − p) percent of the observations are greater than or equal to it.
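A minimal sketch of this definition in Python. Note that several percentile conventions exist (NumPy, for instance, interpolates between observations); this function follows one common textbook convention:

```python
import math

def percentile(values, p):
    """p-th percentile: smallest value such that at least p% of the
    observations are <= it (one common textbook convention)."""
    s = sorted(values)
    n = len(s)
    k = p / 100 * n
    if k == int(k):                    # boundary falls between two observations
        i = int(k)
        return (s[i - 1] + s[i]) / 2
    return s[math.ceil(k) - 1]

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(percentile(x, 50))   # 57, which coincides with the median
print(percentile(x, 25))   # lower quartile
```

The quartiles are simply the 25th, 50th and 75th percentiles.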
3. Measures of dispersion
Measures of variability
Terms like variability, spread, and dispersion are synonyms. They refer to
how the distribution is spread out, that is we will look for measures of the
variability of a distribution.
• Range
• Inter-Quartile Range
• Standard deviation and Variance
Range R:
The range is the most obvious measure of dispersion and is the difference
between the lowest/minimum and the highest/maximum value in a dataset:
R = maximum − minimum
R is useful to evaluate the whole dataset, to show the spread within a dataset
and to compare the spread between similar datasets.
Since the range is based solely on the two most extreme values within the
dataset, if one of these is either extremely high or low (sometimes referred to
as outlier) it will result in a range that is not typical for the variability within the
dataset.
In the same way that the median divides a dataset into two halves, it can be
further divided into quarters by identifying the upper and lower quartiles.
The lower quartile is found one quarter of the way along a dataset and the
upper quartile is found three quarters along the dataset.
The IQR is the difference between upper and lower quartiles and is not
affected by extreme values. It is thus a resistant measure of variability.
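A Python sketch of the quartiles and the IQR. There are several quartile conventions; this one computes the medians of the lower and upper halves (the median itself is excluded when the count is odd):

```python
def median(t):
    m = len(t) // 2
    return t[m] if len(t) % 2 else (t[m - 1] + t[m]) / 2

def quartiles(values):
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]            # observations below the median position
    upper = s[(n + 1) // 2:]       # observations above the median position
    return median(lower), median(upper)

x = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
q1, q3 = quartiles(x)
iqr = q3 - q1                      # resistant measure of spread
print(q1, q3, iqr)                 # 54.0 58.0 4.0
```

Replacing 52 by an extreme value such as 20 leaves the IQR unchanged, illustrating its resistance to outliers.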
Standard deviation S:
is a measure that summarises the amount by which every value within a dataset
varies from the mean.
Effectively it indicates how tightly the values in the dataset are bunched around
the mean value.
It is the most comprehensive and widely used measure of dispersion since, unlike the
range and the inter-quartile range, it takes every observation in the dataset into account.
When the values in a dataset are pretty tightly bunched together the standard
deviation is small. When the values are spread apart, the standard deviation will
be relatively large. The standard deviation is usually presented in conjunction with
the mean and is measured in the same units.
Standard deviation S:
Two vending machines A and B drop candies when a quarter is inserted. The number
of pieces of candy one gets is random. The following data are recorded for six trials at
each vending machine:
[Figure: dotplots of the pieces of candy from vending machine A and vending machine B]
They have the same values of the center measures, but what about the spread of the
distribution? One way to look at the spread is to compute the standard deviations.
Standard deviation S,
• Sample:

$S_x = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$
The difference between each score and the mean is squared and then added
together. This sum is then divided by the number of scores minus one. Finally,
compute the square root of the expression.
• Population:

$\sigma_x = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Variance, 𝑺𝟐 :
Variability can also be defined in terms of how close the scores in the distribution are to
the middle of the distribution. Using the mean as the measure of the middle of the
distribution, the variance is defined as the average squared difference of the scores from
the mean.
• Sample variance: $S_x^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

• Population variance: $\sigma_x^2 = \dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
Since the variance represents squared differences, the standard deviation, which is
expressed in the original units of the data, is easier to interpret and much more commonly
used.
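Both versions of the formulas can be sketched side by side in Python (using the sample from the mean/median examples; only the divisor changes):

```python
import math

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
n = len(x)
mean = sum(x) / n

ss = sum((v - mean) ** 2 for v in x)   # sum of squared deviations

s2 = ss / (n - 1)          # sample variance S^2 (divisor n - 1)
s = math.sqrt(s2)          # sample standard deviation S
sigma2 = ss / n            # population variance sigma^2 (divisor N)
sigma = math.sqrt(sigma2)

print(round(s, 2), round(sigma, 2))   # 2.25 2.14
```

The sample version is always slightly larger than the population version because of the smaller divisor.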
Relative measures of variability are measures of variability adjusted by a measure of
the central location/position, e.g. the relative inter-quartile range

$RIQR_x = \dfrac{IQR}{M_d}$

The relative measure of variability is expressed in units of the central position
measure.
Standardization:
This method consists in subtracting the mean from each and every observation in
the dataset and dividing by the standard deviation.
$z = \dfrac{x - \bar{x}}{S}$
The resulting dataset has a mean equal to 0 and a standard deviation equal to 1.
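That property is easy to verify numerically; a minimal sketch:

```python
import math

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))

z = [(v - mean) / s for v in x]        # standardized observations

z_mean = sum(z) / n
z_var = sum((v - z_mean) ** 2 for v in z) / (n - 1)
print(abs(round(z_mean, 10)), round(z_var, 10))   # 0.0 1.0
```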
In absolute terms, Juan got the highest grade in Literature. However, if we want
to know in which subject Juan scored highest relative to the other students, we
have to compute standardized grades.
Then, with respect to the other students, Juan got the highest grade in Mathematics.
4. Measures of Shape: Skewness and Kurtosis
The first thing you usually notice about a distribution’s shape is whether it has one
mode (peak) or more than one. If it’s unimodal (has just one peak), like most data
sets, the next thing you notice is whether it’s symmetric or skewed to one side.
Skewness is the tendency for the values to be more frequent around high or low
ends of the x-axis.
Distributions that have the same shape on both sides of the centre are called
symmetric.
(The normal distribution is a familiar example of a symmetric distribution with a single peak.)
4. Measures of Shape: Skewness and Kurtosis
The moment coefficient of skewness is

$g_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$
If the bulk of the data is at the left and the right tail is longer, we say that the
distribution is skewed right or positively skewed;
if the bulk of the data is at the right and the left tail is longer, we say that the
distribution is skewed left or negatively skewed.
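A direct translation of $g_1$ into Python, checked on small right-skewed and left-skewed samples (the data are illustrative):

```python
def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 1, 2, 2, 3, 4, 8]    # long right tail -> g1 > 0
left_skewed = [-v for v in right_skewed]   # mirrored sample -> g1 < 0
print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```

For a perfectly symmetric sample $g_1 = 0$.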
4. Measures of Shape: Skewness and Kurtosis
Kurtosis refers to how scores are concentrated in the centre of the distribution, the upper
and lower tails (ends), and the shoulders (between the centre and tails) of a distribution.
The moment coefficient of kurtosis of a data set is $g_2 = \dfrac{m_4}{m_2^2}$,
where $m_k$ is the k-th sample central moment, hence,

$g_2 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}}$
Platykurtic - Starting with a mesokurtic distribution and moving scores from both
centre and tails into the shoulders, the distribution flattens out and is referred to as
platykurtic.
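The coefficient $g_2$ translates directly into Python; the samples below are illustrative, and the value 3 (the kurtosis of a normal distribution) serves as the mesokurtic reference:

```python
def kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5, 6]                       # uniform-looking, platykurtic
peaked = [3, 3, 3, 3, 3, 3, 3, 3, -10, 16]      # heavy tails, leptokurtic
print(kurtosis(flat))     # below 3
print(kurtosis(peaked))   # above 3
```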
[Worked table: columns $X_i$, $X_i - \bar{X}$, $(X_i - \bar{X})^2$, $(X_i - \bar{X})^3$, $(X_i - \bar{X})^4$]

Overview
The average tells us about the central value of the distribution, and the measures of
dispersion tell us about the concentration of the items around a central value.
These measures do not reveal whether the dispersal of values on either side of an
average is symmetric or not.
Measures of Skewness and Kurtosis, like measures of central tendency and dispersion,
study the characteristics of a frequency distribution.
Thus, Skewness is a measure that studies the degree and direction of departure from
symmetry.
Overview (cont.)
In a symmetric distribution, the values of the mean, median and mode coincide; in an
asymmetric distribution, they do not.
When two or more symmetric distributions are compared, the difference between them
is studied by means of Kurtosis. When two or more asymmetric distributions are
compared, they will show different degrees of Skewness.
The two measures describe different aspects of shape: Skewness captures asymmetry,
while Kurtosis captures how scores are concentrated in the centre and tails; a
distribution can exhibit both at once.
5. Measures of concentration
The analysis of the concentration takes into account the degree of the
inequality in the distribution of the variable. It is often used in economic
series and studies of the income/wealth inequality analysis.
$I_g = \dfrac{\sum_{i=1}^{n-1}(p_i - q_i)}{\sum_{i=1}^{n-1} p_i}$,

where $p_i = F_i \cdot 100$ and $q_i = \dfrac{\sum_{j=1}^{i} x_j n_j}{\sum_{j=1}^{n} x_j n_j} \cdot 100$
(the cumulative percentage of the total value of the variable accumulated by the
first i classes).
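A sketch of this index in Python, with $p_i$ the cumulative percentage of elements and $q_i$ the cumulative percentage of the total value; the class values and frequencies below are illustrative:

```python
def gini(x, n):
    """Gini index I_g = sum(p_i - q_i) / sum(p_i) over i = 1..k-1,
    with p_i cumulative % of elements and q_i cumulative % of total x*n."""
    N = sum(n)
    total = sum(xi * ni for xi, ni in zip(x, n))
    p, q, n_cum, t_cum = [], [], 0, 0
    for xi, ni in zip(x, n):
        n_cum += ni
        t_cum += xi * ni
        p.append(100 * n_cum / N)
        q.append(100 * t_cum / total)
    return sum(pi - qi for pi, qi in zip(p[:-1], q[:-1])) / sum(p[:-1])

# Perfect equality: every class earns the same -> index 0
print(gini([1000, 1000], [5, 5]))         # 0.0
# Extreme concentration: one small class holds almost all income -> index near 1
print(gini([0.0001, 10_000], [99, 1]))
```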
Gini index = 0 represents a distribution where the Lorenz curve coincides with the
'Line of Equality': income is perfectly equally distributed and the concentration of
income is minimal.

Gini index = 1 represents maximal inequality and the maximum concentration of income
(one person has all the income and all others receive no income).
[Worked table: columns $x_i$, $n_i$, $N_i$, $p_i$, $x_i n_i$, $q_i$]
The value of the index is close to zero, therefore, the income is equally
distributed and there is a minimum concentration of the income.
6. Bidimensional frequency distribution
Contingency and correlation tables
We often wish to measure the degree to which one variable affects the value of
another or we want to study the association between two or more features of
the variable for each element of the population (bivariate data).
We use contingency tables (for qualitative variables) and correlation tables (for
quantitative variables).
Example 1:
We want to study the relation between the colour of eyes and the colour of hair.
Example 2:
We want to study the relation between the type of drug patients take and their
neural problems.
• Variable X indicates the type of drug, such that $x_1$ is the old drug and $x_2$
is the new one.
• Variable Y indicates the neural problem, such that 𝑦1 is a strong problem and
𝑦2 is a light one.
Then, we count and classify patients according to their problems and the type of
drug they are taking:
In general:

              Strong   Light
Old           a        b
New           c        d

With the observed counts:

X \ Y   $y_1$   $y_2$
$x_1$   10      4
$x_2$   5       11
These 4 values are called joint absolute frequencies ($n_{ij}$), which tell us how
many subjects there are with each specific pair of values of the variables.
How many of them have light neuronal problems and take new drugs (that is,
what is the absolute frequency of individuals with light neuronal problems
taking new drugs )?
𝑛22 = 11
Now, to find relative frequencies (f), which tell us the number of subjects there are, with
specific values of the variables, out of the total amount of responders, divide the value of
each cell by the total number of patients. If we multiply them by 100, we obtain the
values in percentage terms.
What is the percentage of patients with strong problems taking new drugs?
0,167·100=16,7 %
We add a column in the right hand side and add a row in the bottom of the
table and call them “Total”.
The column “Total” is the marginal distribution of X and the row is the
marginal distribution of Y.
         $y_1$   $y_2$   Total
$x_1$    10      4       14
$x_2$    5       11      16
Total    15      15      30
Covariance
• It’s similar to variance. However, the variance is a measure of the variation of one
variable, while the covariance is a measure of the variation of two variables.
• Covariance indicates the existence of linear relationship between variables.
$Cov(x, y) = S_{xy} = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Example. Compute the covariance of the two-dimensional distribution given by the
following joint frequency table (cell entries are the frequencies $n_{ij}$):
x\y 1 2 3
2 1 4 1
3 2 4 2
4 1 2 1
Solution. We first compute the average values of x and y from the marginal totals
and then use the formula for the covariance:

$\bar{x} = \dfrac{2 \cdot 6 + 3 \cdot 8 + 4 \cdot 4}{18} = \dfrac{52}{18} \approx 2.89, \qquad \bar{y} = \dfrac{1 \cdot 4 + 2 \cdot 10 + 3 \cdot 4}{18} = 2$

$S_{xy} = \dfrac{2 \cdot 1 \cdot 1 + 2 \cdot 2 \cdot 4 + 2 \cdot 3 \cdot 1 + 3 \cdot 1 \cdot 2 + 3 \cdot 2 \cdot 4 + 3 \cdot 3 \cdot 2 + 4 \cdot 1 \cdot 1 + 4 \cdot 2 \cdot 2 + 4 \cdot 3 \cdot 1}{18} - \dfrac{52}{18} \cdot 2 = \dfrac{104}{18} - \dfrac{104}{18} = 0$

The covariance is zero, so this table shows no linear association between x and y.
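The same computation can be sketched in Python, reading the joint frequencies straight from the table:

```python
xs = [2, 3, 4]          # row values of x
ys = [1, 2, 3]          # column values of y
n = [[1, 4, 1],         # joint absolute frequencies n_ij
     [2, 4, 2],
     [1, 2, 1]]

N = sum(map(sum, n))                                          # 18
x_mean = sum(xs[i] * sum(n[i]) for i in range(3)) / N         # 52/18
col = [sum(n[i][j] for i in range(3)) for j in range(3)]      # column totals
y_mean = sum(ys[j] * col[j] for j in range(3)) / N            # 36/18 = 2
xy_mean = sum(xs[i] * ys[j] * n[i][j]
              for i in range(3) for j in range(3)) / N        # 104/18

cov = xy_mean - x_mean * y_mean
print(cov)
```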
Properties of the variance
Unidimensional case
𝑉𝑎𝑟 𝑎𝑋 = 𝑎2 · 𝑉𝑎𝑟 𝑋
Bidimensional case,
𝑉𝑎𝑟 𝑎𝑋 + 𝑏𝑌 = 𝑎2 · 𝑉𝑎𝑟 𝑋 + 𝑏2 · 𝑉𝑎𝑟 𝑌 + 2 · 𝑎 · 𝑏 · 𝐶𝑜𝑣 𝑋, 𝑌
𝑉𝑎𝑟 𝑎𝑋 − 𝑏𝑌 = 𝑎2 · 𝑉𝑎𝑟 𝑋 + (−𝑏)2 · 𝑉𝑎𝑟 𝑌 + 2 · 𝑎 · −𝑏 · 𝐶𝑜𝑣 𝑋, 𝑌
If 𝑿 𝒂𝒏𝒅 𝒀 𝒂𝒓𝒆 𝒊𝒏𝒅𝒆𝒑𝒆𝒏𝒅𝒆𝒏𝒕:
𝑉𝑎𝑟 𝑎𝑋 + 𝑏𝑌 = 𝑎2 𝑉𝑎𝑟 𝑋 + 𝑏2 𝑉𝑎𝑟 𝑌
𝑉𝑎𝑟 𝑎𝑋 − 𝑏𝑌 = 𝑎2 𝑉𝑎𝑟 𝑋 +(−𝑏)2 𝑉𝑎𝑟 𝑌
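These identities are easy to verify numerically; a small Python check on arbitrary paired data (population versions of variance and covariance, divisor n):

```python
def pvar(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def pcov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

x = [1, 2, 4, 7]        # illustrative paired data
y = [3, 1, 5, 2]
a, b = 3, 2

lhs_plus = pvar([a * xi + b * yi for xi, yi in zip(x, y)])
rhs_plus = a**2 * pvar(x) + b**2 * pvar(y) + 2 * a * b * pcov(x, y)

lhs_minus = pvar([a * xi - b * yi for xi, yi in zip(x, y)])
rhs_minus = a**2 * pvar(x) + b**2 * pvar(y) - 2 * a * b * pcov(x, y)

print(abs(lhs_plus - rhs_plus) < 1e-9, abs(lhs_minus - rhs_minus) < 1e-9)
```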
Examples
Unidimensional case:
$Var(3X) = 3^2 \cdot Var(X) = 9 S_x^2$
$Var(-4X) = (-4)^2 \cdot Var(X) = 16 S_x^2$
Bidimensional case,
$Var(3X + 2Y) = 3^2 \cdot Var(X) + 2^2 \cdot Var(Y) + 2 \cdot 3 \cdot 2 \cdot Cov(X, Y) = 9 S_x^2 + 4 S_y^2 + 12 S_{xy}$