
PRINCIPLES OF STATISTICS

UNIT 1
FREQUENCY DISTRIBUTION

Prof. Vita Zhukova


CONTENT
Unidimensional frequency distributions
Measures of position/location
Measures of dispersion
Measures of shape
Measures of concentration
Bidimensional frequency distributions
1. Unidimensional frequency distributions

Statistics is the science that deals with the collection, classification, analysis, and interpretation of facts or data.

It imposes order and/or regularity on aggregates of elements by means of mathematical theories, e.g. probability theory.

To carry out a statistical analysis, we look at the information from the variable of interest.

This information comes from the whole population or from a part of it, called a sample.
1. Unidimensional frequency distributions

Population is the total set of elements that we want to know/learn about.

Sample is the subset of the population that we actually analyse (drawn at random).

Variables describe characteristics of the population that vary across the elements of that population.

Population (examples): 1. undergraduate students; 2. BMW/Chanel sales
Sample (examples): 1. UCAM students, UCLA students, etc.; 2. Spanish BMW/Chanel sales
1. Unidimensional frequency distributions

Variable

X is the characteristic of the population (example: weight, nº of employees)

Xi is an observation of X (example: Xi = 63, 85, 130, 48)

We are not able to observe the whole population. Instead, we can observe a subset of the population, the sample.
1. Unidimensional frequency distributions

Variable
In order to ensure that the sample we work with is a representative sample of the whole
population, we take a random sample.

In a random sample, all elements of the population are equally likely to be selected for inclusion in the sample, i.e., the elements are randomly drawn.

For example:
We are interested in conducting a study on the amount of physical exercise undertaken by the general public. If we survey people entering and leaving a gymnasium, the survey would provide a biased sample of the population, and the results obtained could not be generalised to the population at large. Hence, this sample is not a representative sample.
1. Unidimensional frequency distributions

Variables are either qualitative or quantitative

Qualitative variables have non-numeric outcomes.

Nominal - X records eye colour. Example: Xi = grey, blue, black
Ordinal - X ranks course achievement. Example: Xi = passed, failed

Quantitative variables have numeric outcomes.

Discrete - take a countable number of possible values (integer numbers).
X records the number of children in a family. Example: Xi = 2, 0, 3
Continuous - take any value on a continuous scale (real numbers).
X records height. Example: Xi = 1.77, 1.534 m
1. Unidimensional frequency distributions

• A set of data on its own is very hard to interpret. It contains a lot of information that is hard to see. We need ways of understanding the important features of the data and of summarising it in a meaningful way.

• The use of graphs and summary statistics for understanding data is very useful and is the first step we take in a statistical analysis.

We start with the frequency distribution of the variable, which provides an ordered exposition of the set of its observations.
1. Unidimensional frequency distributions

Given a sample of N elements, the absolute frequency of class i, denoted by n_i, is the number of times that the event of class i occurs, and Σ_{i=1}^{n} n_i = N
1. Unidimensional frequency distributions
Qualitative data

Example: X= colour of eyes


Xi= {Green, Blue, Blue, Black, Brown, Black, Blue, Brown, Black, Blue}

N=10 (total number of observations)


Class = {Green, Blue, Black, Brown}

x_i     n_i
Green   1
Blue    4
Black   3
Brown   2
1. Unidimensional frequency distributions

Given a sample of N elements, the absolute frequency of class i, denoted by n_i, is the number of times that the event of class i occurs, and Σ_{i=1}^{n} n_i = N

The relative frequency of class i, denoted by f_i, is its absolute frequency divided by the total number of elements in the sample:
f_i = n_i / N,  Σ_{i=1}^{n} f_i = 1
1. Unidimensional frequency distributions
Qualitative data

Example: X= colour of eyes


Xi= Green, Blue, Blue, Black, Brown, Black, Blue, Brown, Black, Blue

N=10 (total number of observations)

x_i     n_i   f_i    f_i (%)
Green   1     1/10   10
Blue    4     4/10   40
Black   3     3/10   30
Brown   2     2/10   20
1. Unidimensional frequency distributions

The cumulative absolute frequency is denoted by N_i. The cumulative absolute frequency of the first class is the absolute frequency of that class. The cumulative absolute frequency of the second class is equal to the cumulative absolute frequency of the first class plus the absolute frequency of the second class, and so on.

The cumulative relative frequency is denoted by F_i and is defined analogously. The cumulative relative frequency of the first class is the relative frequency of that class. The cumulative relative frequency of the second class is equal to the cumulative relative frequency of the first class plus the relative frequency of the second class, and so on.
1. Unidimensional frequency distributions

In general, the cumulative absolute frequency of a given class is the cumulative absolute frequency of the previous class plus the absolute frequency of the given class:
N_i = N_{i-1} + n_i
Then, N_2 = N_1 + n_2 and N_n = N

The cumulative relative frequency of a given class, denoted by F_i, is the cumulative relative frequency of the previous class plus the relative frequency of the given class:
F_i = F_{i-1} + f_i
Then, F_2 = F_1 + f_2 and F_n = 1 (or 100%)
1. Unidimensional frequency distributions
Qualitative data

Example: X= colour of eyes


Xi= {Green, Blue, Blue, Black, Brown, Black, Blue, Brown, Black, Blue }

N=10 (total number of observations)

x_i     n_i   f_i    f_i (%)   N_i   F_i (%)
Green   1     1/10   10        1     10
Blue    4     4/10   40        5     50
Black   3     3/10   30        8     80
Brown   2     2/10   20        10    100
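The frequency table above can be reproduced with a short script (a sketch using only the Python standard library; the class order is taken from the slide):

```python
from collections import Counter

# Eye-colour sample from the slide (qualitative, nominal data)
observations = ["Green", "Blue", "Blue", "Black", "Brown",
                "Black", "Blue", "Brown", "Black", "Blue"]
N = len(observations)

counts = Counter(observations)               # absolute frequencies n_i
order = ["Green", "Blue", "Black", "Brown"]  # class order used on the slide

cum_n = 0
for cls in order:
    n_i = counts[cls]       # absolute frequency
    f_i = n_i / N           # relative frequency
    cum_n += n_i            # cumulative absolute frequency N_i
    F_i = cum_n / N         # cumulative relative frequency F_i
    print(f"{cls:<6} n_i={n_i}  f_i={f_i:.1f}  N_i={cum_n}  F_i={F_i:.0%}")
```

The last cumulative values always close the table: N_n equals the sample size and F_n equals 100%.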
1. Unidimensional frequency distributions
Qualitative data
x_i     n_i   f_i    f_i (%)   N_i   F_i (%)   w_i = f_i · 360°
Green   1     1/10   10        1     10        36°
Blue    4     4/10   40        5     50        144°
Black   3     3/10   30        8     80        108°
Brown   2     2/10   20        10    100       72°

Σ_{i=1}^{n} w_i = 360°

[Figures: bar chart of the frequency distribution (n_i against x_i) and pie chart / sector graph (sectors of angle w_i).]
1. Unidimensional frequency distributions
Quantitative data. DISCRETE

Example: X= number of passed subjects/courses/exams


Xi= {3,4,2,0,1,3,3,4,0,1}

N=10 (total number of observations)


1. Unidimensional frequency distributions
Quantitative data. DISCRETE
Example: X = number of passed subjects/exams
Xi = {3, 4, 2, 0, 1, 3, 3, 4, 0, 1}
N = 10

x_i   n_i   f_i   f_i (%)   w_i = f_i · 360°   N_i   F_i
0     2     0,2   20        72°                2     0,2
1     2     0,2   20        72°                4     0,4
2     1     0,1   10        36°                5     0,5
3     3     0,3   30        108°               8     0,8
4     2     0,2   20        72°                10    1

[Figures: bar chart of n_i, cumulative frequency distribution of F_i, and pie chart.]
1. Unidimensional frequency distributions
Quantitative data. CONTINUOUS

To obtain the frequency distribution of quantitative continuous data, it is convenient to first group the observations into intervals.

When grouping observations, each interval is called a "class interval", its length is the "interval amplitude", and the middle point of the interval is the "class mark". The end points of the interval are the "inferior/superior frontiers".
1. Unidimensional frequency distributions
Quantitative data. CONTINUOUS

Example: X = weight
Xi = {80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91}
N = 10
How to group the data?
• Nº of intervals = √N = √10 = 3,16 ≈ 3 intervals (round toward the lowest integer)
• Amplitude a_i = (Max x_i − Min x_i) / nº intervals = (100 − 70)/3 = 10
• Class mark x_i (also denoted by c_i) = L_i + (L_{i+1} − L_i)/2
• Density d_i = n_i / a_i
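The grouping rules can be sketched in code (a sketch assuming the slide's conventions: √N intervals rounded down, equal amplitudes, and a closed last interval so the maximum is included):

```python
import math

weights = [80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91]
N = len(weights)

k = math.floor(math.sqrt(N))                   # number of intervals: floor(sqrt(10)) = 3
amplitude = (max(weights) - min(weights)) / k  # (100 - 70) / 3 = 10

freqs = []
lo = min(weights)
for i in range(k):
    hi = lo + amplitude
    if i == k - 1:
        n_i = sum(1 for w in weights if lo <= w <= hi)  # last interval is closed
    else:
        n_i = sum(1 for w in weights if lo <= w < hi)
    mark = lo + amplitude / 2   # class mark c_i
    density = n_i / amplitude   # d_i = n_i / a_i
    freqs.append(n_i)
    print(f"[{lo:.0f}, {hi:.0f})  c_i={mark:.0f}  n_i={n_i}  d_i={density:.1f}")
    lo = hi
```

The counts match the grouped table on the next slide: 1, 6 and 3 observations per interval.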
1. Unidimensional frequency distributions
Quantitative data. CONTINUOUS
Interval
L_i − L_{i+1}   x_i (c_i)   a_i   n_i   f_i   f_i (%)   w_i    N_i   F_i   F_i (%)   d_i
[70 - 80)       75          10    1     0,1   10        36°    1     0,1   10        0,1
[80 - 90)       85          10    6     0,6   60        216°   7     0,7   70        0,6
[90 - 100]      95          10    3     0,3   30        108°   10    1     100       0,3
Total                             10    1     100       360°

[Figures: histogram of d_i, cumulative frequency polygon of F_i, and pie chart.]


2. Measures of location/position
Measures of central location

A measure of the centre or central location is a summary measure that describes the whole dataset with a single value representing the middle or centre of its distribution.

There are three main measures of central location:

• Mean
• Median
• Mode

Each of these measures describes a different indication of the typical or central


value in the distribution.
2. Measures of location/position
Measures of central location

• Mean, x̄:
The mean of a quantitative dataset is the sum of the observations in the dataset divided by the number of observations in the dataset.

• Median, 𝑀𝑑 :
The Median of a quantitative dataset is the number in the middle of the
observations arranged in ascending order.

• Mode, 𝑀𝑜 :
The Mode of a dataset is the observation that occurs most frequently in the
dataset.
2. Measures of location/position
Measures of central location

Mean:

There are two means: the population mean, μ, and the sample mean, x̄.

The calculation of both is the same, except that μ is calculated for the entire population and x̄ is calculated for a sample taken from that population.

We will work with x̄, as in practice we never calculate μ.

Estimating μ is one of the main concerns of inferential statistics.


2. Measures of location/position
Measures of central location

Mean:

Given the dataset x_1, x_2, x_3, x_4, …, x_n, the sample mean is

x̄ = (x_1 + x_2 + x_3 + ⋯ + x_{n−1} + x_n) / n = Σ_{i=1}^{n} x_i / n   or, for grouped data,   x̄ = Σ_{i=1}^{n} c_i n_i / n

Example:
Compute the mean for the following sample: x = {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60}
The mean is calculated by adding all the values and dividing the sum by the number of observations, which equals 56.6 years.
Properties: …
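The three central-location measures can be checked with Python's standard `statistics` module (a small sketch using the sample from the example above):

```python
import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean = statistics.mean(ages)      # sum of the values divided by n
median = statistics.median(ages)  # middle value of the sorted data (n = 11, odd)
mode = statistics.mode(ages)      # most frequent value

print(round(mean, 1), median, mode)   # 56.6 57 54
```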
2. Measures of location/position
Measures of central location

Median:

The median is the middle value of the distribution of all observations in the sample arranged in ascending or descending order.

The median divides the distribution in half (there are 50% of observations on either side of the median value).

To calculate the median we need to order the n observations. Then,

• if n is odd, the median is M_d = x_{(n+1)/2},
• if n is even, the median is M_d = (x_{n/2} + x_{n/2+1}) / 2

For continuous variables, we look for the value that accumulates 50% of the relative frequency (F_i). Since the data are grouped into intervals, we might need to apply the linear interpolation technique.
2. Measures of location/position
Measures of central location

Median:

Example of samples where

• the number of observations is odd: {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the middle value, which is 57 years

• the number of observations is even: {52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the mean of the two middle values, which equals 56.5 years
2. Measures of location/position
Measures of central location

Mode:

The mode corresponds to the value of the highest relative frequency in a


sample.

In a given relative frequency distribution with class intervals, the mode is the
mid-point of the class interval which has the highest relative frequency. The
class interval of the highest relative frequency is called the Modal Class.

The mode measures data concentration and can thus be used to locate the region in a large dataset where much of the data is concentrated.
2. Measures of location/position
Measures of central location

Mode:

There are some limitations to using the mode.

In some distributions, the mode may not exactly represent the centre of the distribution.

For example, in the sample {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60} the centre of the distribution is 57 years, but the mode is lower, at 54 years.

It is also possible to have more than one mode for the same distribution of data (bi-modal, or multi-modal). The presence of more than one mode can limit the ability of the mode to describe the centre or typical value of the distribution, because a single value describing the centre cannot be identified.
2. Measures of location/position
Measures of central location

2. Measures of location/position
Measures of non-central location

Measures of non-central location are used when one is interested in values in a distribution associated with positions other than the centre.

The main measures of non-central location are:

• Quartiles

• Percentiles

Example: in a distribution of students' grades obtained in a course, only the highest 10 percent of the grade distribution will be granted a scholarship. Then we need to know the 90th percentile grade.
2. Measures of location/position
Measures of non-central location
Quartiles
are those values that divide the frequency distribution into four equal parts.

𝑄1 - 25% of observations are ≤ 𝑄1 , 75% are ≥ 𝑄1


𝑄2 - 50% of observations are ≤ 𝑄2 , 50% are ≥ 𝑄2
𝑄3 - 75% of observations are ≤ 𝑄3 , 25% are ≥ 𝑄3
𝑄4 - 100% of observations are ≤ 𝑄4 , 0% are ≥ 𝑄4

Percentiles
are those values that divide the frequency distribution into 100 equal parts.

The 𝑝𝑡ℎ percentile is a value, such that at least p percent of the observations in
the frequency distribution are ≤ than this value and at least (100- p) percent of
the observations are ≥ than this value.
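As a sketch, quartiles and percentiles can be computed with Python's `statistics.quantiles` (3.8+). Note that several interpolation conventions exist, so results may differ slightly from hand calculations; the "inclusive" method is used here:

```python
import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Cut points dividing the sorted data into four equal parts: Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print(q1, q2, q3)   # Q2 coincides with the median, 57
iqr = q3 - q1       # inter-quartile range, used in the dispersion unit below
```

Percentiles follow the same call with `n=100`.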
3. Measures of dispersion
Measures of variability

Terms like variability, spread, and dispersion are synonyms. They refer to
how the distribution is spread out, that is we will look for measures of the
variability of a distribution.

The main measures of variability are:

• Range
• Inter-Quartile Range
• Standard deviation and Variance
3. Measures of dispersion
Measures of variability

Range R:
The range is the most obvious measure of dispersion and is the difference
between the lowest/minimum and the highest/maximum value in a dataset:
R=maximum-minimum
R is useful to evaluate the whole dataset, to show the spread within a dataset
and to compare the spread between similar datasets.

Since the range is based solely on the two most extreme values within the dataset, if one of these is extremely high or low (sometimes referred to as an outlier), the resulting range will not be typical of the variability within the dataset.
3. Measures of dispersion
Measures of variability

Inter Quartile Range IQR:


is a measure that indicates the extent to which the central 50% of values of the distribution are spread out. It is based upon, and is related to, the median.

In the same way that the median divides a dataset into two halves, it can be
further divided into quarters by identifying the upper and lower quartiles.
The lower quartile is found one quarter of the way along a dataset and the
upper quartile is found three quarters along the dataset.

IQR = Q3 - Q1 =75th percentile - 25th percentile


3. Measures of dispersion
Measures of variability
Inter Quartile Range IQR:

IQR = Q3 - Q1 =75th percentile - 25th percentile

The IQR is the difference between upper and lower quartiles and is not
affected by extreme values. It is thus a resistant measure of variability.
3. Measures of dispersion
Measures of variability

Standard deviation S:

is a measure that summarises the amount by which every value within a dataset
varies from the mean.

Effectively it indicates how tightly the values in the dataset are bunched around
the mean value.

It is the most widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every observation in the dataset.

When the values in a dataset are pretty tightly bunched together the standard
deviation is small. When the values are spread apart, the standard deviation will
be relatively large. The standard deviation is usually presented in conjunction with
the mean and is measured in the same units.
3. Measures of dispersion
Measures of variability
Standard deviation S:

Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded for six trials at each vending machine:

Pieces of candy from vending machine A: x_A = {2, 1, 3, 3, 5, 4}; mean = median = mode = 3

Pieces of candy from vending machine B: x_B = {2, 3, 3, 3, 3, 4}; mean = median = mode = 3

[Figure: dotplots of the pieces of candy from vending machines A and B.]

They have the same values of the centre measures, but what about the spread of the distribution? One way to look at the spread is to compute the standard deviations.
3. Measures of dispersion
Measures of variability

Standard deviation S,

• Sample:

S_x = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) ) = √( (Σ_{i=1}^{n} x_i² − n x̄²) / (n − 1) )

The difference between each score and the mean is squared and then added together. This sum is then divided by the number of scores minus one. Finally, compute the square root of the expression.

• Population:

σ_x = √( Σ_{i=1}^{N} (x_i − μ)² / N )

Why do we divide by n − 1 instead of by n? Since μ is unknown and estimated by x̄, the x_i's tend to be closer to x̄ than to μ. To compensate, we divide by a smaller number, n − 1.
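A sketch comparing the two candy samples from the example above: `statistics.stdev` divides by n − 1 (sample formula), while `statistics.pstdev` divides by N (population formula):

```python
import statistics

candy_a = [2, 1, 3, 3, 5, 4]  # vending machine A
candy_b = [2, 3, 3, 3, 3, 4]  # vending machine B

# Both samples share the same centre...
assert statistics.mean(candy_a) == statistics.mean(candy_b) == 3

# ...but not the same spread: stdev uses the sample formula (divide by n - 1)
s_a = statistics.stdev(candy_a)   # sqrt(10 / 5) ≈ 1.414
s_b = statistics.stdev(candy_b)   # sqrt(2 / 5)  ≈ 0.632

print(round(s_a, 3), round(s_b, 3))   # machine A is clearly more variable
```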
3. Measures of dispersion
Measures of variability

Variance, S²:

Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean.

• Sample variance: Var(x) = S_x² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)

• Population variance: Var(x) = σ_x² = Σ_{i=1}^{N} (x_i − μ)² / N

Since the variance represents squared differences, the standard deviation represents the true differences and is, therefore, easier to interpret and much more commonly used.

Properties: Var(2x) = 2² · Var(x) = 4 · S_x²;  Var(−3x) = (−3)² · Var(x) = 9 · S_x²


3. Measures of dispersion
Relative measures of variability

Relative measures of variability are measures of variability adjusted by a measure of central location/position.

• Coefficient of variation (Pearson coefficient), CV_x:

CV_x = S / x̄

When CV is close to zero, the distribution presents low dispersion in the data.
When CV is close to one, the distribution presents high dispersion in the data and the mean is less representative.
When CV is bigger than one, the mean is not representative of the dataset.

• Relative Inter-Quartile Range, RIQR_x:

RIQR_x = IQR / M_d

A relative measure of variability is expressed in units of the central position measure.
3. Measures of dispersion

To be able to compare distributions of different datasets, we standardize or normalize the data.

This method consists in subtracting the average from each and every observation in the dataset and dividing by the standard deviation:

z = (x − x̄) / S

The obtained dataset has average equal to 0 and standard deviation equal to 1.
3. Measures of dispersion

Example: The following table provides students’ grades in different subjects

Subject\Student   Ana   Pedro   Maria   Juan   Mean   Standard deviation
Mathematics       1     1       2       4      2      1,4142
Biology           0     3       1       4      2      1,8257
Literature        8     6       5,5     8,5    7      1,4720

In absolute terms, Juan got the highest grade in Literature. However, if we want
to know in what subject did Juan get the highest grade with respect to others, we
have to compute standardized grades.
3. Measures of dispersion

Example: The following table provides students’ grades in different subjects

Subject\Student   Ana   Pedro   Maria   Juan   Mean   Standard deviation
Mathematics       1     1       2       4      2      1,4142
Biology           0     3       1       4      2      1,8257
Literature        8     6       5,5     8,5    7      1,4720

Standardized grade of Juan in Mathematics is (4 − 2)/1,4142 = 1,41
Standardized grade of Juan in Biology is (4 − 2)/1,8257 = 1,10
Standardized grade of Juan in Literature is (8,5 − 7)/1,4720 = 1,02

Then, with respect to other students, Juan got the highest grade in Mathematics
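The comparison can be sketched as a small helper (the subject means and sample standard deviations from the table are hard-coded as inputs; the dictionary layout is my own):

```python
def z_score(x, mean, s):
    """Number of standard deviations by which x lies above the mean."""
    return (x - mean) / s

# Juan's grade, class mean and standard deviation per subject (from the table)
subjects = {
    "Mathematics": (4, 2, 1.4142),
    "Biology":     (4, 2, 1.8257),
    "Literature":  (8.5, 7, 1.4720),
}

scores = {name: round(z_score(*args), 2) for name, args in subjects.items()}
print(scores)
best = max(scores, key=scores.get)   # subject with the highest standardized grade
```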
4. Measures of Shape: Skewness and Kurtosis

The first thing you usually notice about a distribution’s shape is whether it has one
mode (peak) or more than one. If it’s unimodal (has just one peak), like most data
sets, the next thing you notice is whether it’s symmetric or skewed to one side.

Skewness is the tendency for the values to be more frequent around high or low
ends of the x-axis.

Distributions that have the same shape on both sides of the centre are called symmetric.
(A symmetric, unimodal, bell-shaped distribution is the characteristic shape of the normal distribution, although not every symmetric unimodal distribution is normal.)
4. Measures of Shape: Skewness and Kurtosis

The moment coefficient of skewness of a dataset is:

g_1 = m_3 / m_2^{3/2},

where m_k is the k-th sample central moment, m_k = Σ_{i=1}^{n} (x_i − x̄)^k / n. Hence,

g_1 = [Σ_{i=1}^{n} (x_i − x̄)³ / n] / [Σ_{i=1}^{n} (x_i − x̄)² / n]^{3/2}

• g_1 > 0: the distribution is positively skewed
• g_1 < 0: the distribution is negatively skewed
• g_1 = 0: the distribution is symmetric
4. Measures of Shape: Skewness and Kurtosis

If the bulk of the data is at the left and the right tail is longer, we say that the distribution is skewed right or positively skewed; if the bulk of the data is at the right and the left tail is longer, we say that the distribution is skewed left or negatively skewed.
4. Measures of Shape: Skewness and Kurtosis

Kurtosis refers to how scores are concentrated in the centre of the distribution, the upper and lower tails (ends), and the shoulders (between the centre and tails) of a distribution.

The moment coefficient of kurtosis of a dataset (in excess form, so that a normal distribution scores 0) is:

g_2 = m_4 / m_2² − 3,

where m_k is the k-th sample central moment. Hence,

g_2 = [Σ_{i=1}^{n} (x_i − x̄)⁴ / n] / [Σ_{i=1}^{n} (x_i − x̄)² / n]² − 3

• g_2 > 0: the distribution is leptokurtic
• g_2 < 0: the distribution is platykurtic
• g_2 = 0: the distribution is mesokurtic
4. Measures of Shape: Skewness and Kurtosis

Mesokurtic - A normal distribution is called mesokurtic. The tails of a mesokurtic distribution are neither too thin nor too thick, and there are neither too many nor too few scores in the centre of the distribution.

Platykurtic - Starting with a mesokurtic distribution and moving scores from both
centre and tails into the shoulders, the distribution flattens out and is referred to as
platykurtic.

Leptokurtic - If you move scores from


shoulders of a mesokurtic distribution
into the centre and tails of a distribution,
the result is a peaked distribution with
thick tails. This shape is referred to as
leptokurtic.
4. Measures of Shape: Skewness and Kurtosis

X_i   X_i − X̄   (X_i − X̄)²   (X_i − X̄)³    (X_i − X̄)⁴
23    0,2       0,04         0,008         0,0016
17    −5,8      33,64        −195,112      1131,6496
22    −0,8      0,64         −0,512        0,4096
21    −1,8      3,24         −5,832        10,4976
16    −6,8      46,24        −314,432      2138,1376
22    −0,8      0,64         −0,512        0,4096
23    0,2       0,04         0,008         0,0016
22    −0,8      0,64         −0,512        0,4096
40    17,2      295,84       5088,448      87521,3056
22    −0,8      0,64         −0,512        0,4096

x̄ = 22,8;  Σ_{i=1}^{10}(x_i − x̄) = 0;  m_2 = 381,6/10;  m_3 = 4571,04/10;  m_4 = 90803,232/10

Compute the moment coefficient of skewness and kurtosis.
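A sketch verifying the table: the central moments and the two shape coefficients for this dataset (kurtosis in excess form, so 0 corresponds to a normal distribution):

```python
data = [23, 17, 22, 21, 16, 22, 23, 22, 40, 22]
n = len(data)
mean = sum(data) / n                          # 22.8

# k-th sample central moments m_k = sum((x - mean)**k) / n
m2 = sum((x - mean) ** 2 for x in data) / n   # 381.6    / 10
m3 = sum((x - mean) ** 3 for x in data) / n   # 4571.04  / 10
m4 = sum((x - mean) ** 4 for x in data) / n   # 90803.232 / 10

g1 = m3 / m2 ** 1.5     # moment coefficient of skewness
g2 = m4 / m2 ** 2 - 3   # excess kurtosis

# Positive skew: the outlier 40 stretches the right tail
print(round(g1, 3), round(g2, 3))
```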


Overview

The average tells us about the central value of the distribution, and the measures of dispersion tell us about the concentration of the items around a central value.

These measures do not reveal whether the dispersal of values on either side of an average is symmetric or not.

If observations are arranged in a symmetrical manner around a measure of central tendency, we get a symmetrical distribution; otherwise, they may be arranged in an asymmetrical order, which gives an asymmetrical distribution.

Measures of skewness and kurtosis, like measures of central tendency and dispersion, study the characteristics of a frequency distribution.

Thus, skewness is a measure that studies the degree and direction of departure from symmetry.
Overview (cont.)

A symmetrical distribution gives a 'symmetrical curve', where the values of mean, median and mode are exactly equal.

On the other hand, in an asymmetric distribution, the values of mean, median and mode are not equal.

When two or more symmetric distributions are compared, the difference between them is studied by means of kurtosis; when asymmetric distributions are compared, they will show different degrees of skewness.

These measures capture different aspects of shape: skewness describes the asymmetry of a distribution, while kurtosis describes its peakedness and tail weight.
5. Measures of concentration

The analysis of concentration considers the degree of inequality in the distribution of the variable. It is often used for economic series and in studies of income/wealth inequality.

The maximum concentration occurs when one element possesses the whole variable, and the minimum concentration occurs when the variable is equally distributed among the elements (equidistribution).
5. Measures of concentration

The Gini coefficient is a measure of the income distribution of a population and is based on the 'Lorenz curve', which shows the income distribution in a population where income is not equally distributed.

The Gini index takes values from 0 to 1.

I_g = Σ_{i=1}^{n−1} (p_i − q_i) / Σ_{i=1}^{n−1} p_i,

where p_i = F_i · 100 and q_i = (Σ_{j=1}^{i} x_j n_j / Σ_{j=1}^{n} x_j n_j) · 100 is the cumulative share of the total.
5. Measures of concentration

Gini index =0
represents a distribution
where the Lorenz curve is just
the ‘Line of Equality’ and
income is perfectly equally
distributed and there is a
minimum concentration of the
income.

Gini index =1
maximal inequality and the
maximum concentration of
the income (one person has all
income and all others receive
no income).
5. Measures of concentration

Example. There are 4 professional categories in a firm. Each category has a different income level per month. The frequency distribution of income levels (x_i) and the number of people per category (n_i) is shown as follows:

x_i    n_i   N_i   p_i    x_i · n_i   q_i
1000   25    25    62,5   25000       40,98
2000   10    35    87,5   20000       73,77
3000   4     39    97,5   12000       93,44
4000   1     40    100    4000        100
                          Σ x_i n_i = 61000

The value of the index (I_g ≈ 0,16) is close to zero; therefore, the income is fairly equally distributed and there is a low concentration of the income.
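A sketch computing the Gini index for this table, following the formula from the previous slide (with q_i as the cumulative income share):

```python
x = [1000, 2000, 3000, 4000]  # income level per category
n = [25, 10, 4, 1]            # number of people per category

N = sum(n)                                           # 40 people
total_income = sum(xi * ni for xi, ni in zip(x, n))  # 61000

p, q = [], []
cum_people, cum_income = 0, 0
for xi, ni in zip(x, n):
    cum_people += ni
    cum_income += xi * ni
    p.append(100 * cum_people / N)             # p_i = F_i * 100
    q.append(100 * cum_income / total_income)  # cumulative income share q_i

# The last class (p_n = q_n = 100) drops out of both sums
gini = sum(pi - qi for pi, qi in zip(p[:-1], q[:-1])) / sum(p[:-1])
print(round(gini, 3))
```

The larger the gap between the p_i and q_i columns, the further the Lorenz curve bends away from the line of equality and the larger the index.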
6. Bidimensional frequency distribution
Contingency and correlation tables

We often wish to measure the degree to which one variable affects the value of
another or we want to study the association between two or more features of
the variable for each element of the population (bivariate data).

We use:

• Contingency tables to summarize the relationship between several variables.


A contingency table is a special type of frequency distribution table, where
two variables are shown simultaneously.

• Correlation to quantify the strength of the relationship among variables


6. Bidimensional frequency distribution

Example 1:

We want to study the relation between the colour of eyes and the colour of hair.

We denote the variables as follows.


• Variable X indicates eye colour, such that x_1 denotes a light-eyed and x_2 a dark-eyed individual.
• Variable Y indicates hair colour, such that y_1 denotes a light-haired and y_2 a dark-haired individual.
6. Bidimensional frequency distribution

Example 2:

In a study involving 30 patients from a psychiatric hospital with two types of


neural problems (strong and light), we want to compare the effects of a new
drug with an old one on the patients.

We observe the data structured as follows:

Individuals type a – individuals with strong problem and old treatment = 10


Individuals type b – individuals with light problem and old treatment = 4
Individuals type c – individuals with strong problem and new treatment = 5
Individuals type d – individuals with light problem and new treatment = 11

How can we represent this situation?


How can we see if the treatment of new drug is preferable to the previous
treatment?
6. Bidimensional frequency distribution

Example 2 (cont.)

We denote variables as follows:

• Variable X indicates the type of the treatment, such that 𝑥1 is an old


treatment and 𝑥2 is a new one.

• Variable Y indicates the neural problem, such that 𝑦1 is a strong problem and
𝑦2 is a light one.

Then, we count and classify patients according to their problems and the type of
drug they are taking:
6. Bidimensional frequency distribution

Treatment   Neuronal problems
            Strong   Light
Old         a        b
New         c        d

X      y_1   y_2
x_1    10    4
x_2    5     11

These 4 values are called double absolute frequencies (n_ij), which tell us the number of subjects with each specific combination of values of the variables.
6. Bidimensional frequency distribution

Treatment (X)   Neuronal problems (Y)
                Strong (y_1)     Light (y_2)
Old (x_1)       a = 10 (n_11)    b = 4 (n_12)
New (x_2)       c = 5 (n_21)     d = 11 (n_22)

How many of them have light neuronal problems and take new drugs (that is,
what is the absolute frequency of individuals with light neuronal problems
taking new drugs )?

𝑛22 = 11
6. Bidimensional frequency distribution

Now, to find the relative frequencies (f_ij), which tell us the number of subjects with specific values of the variables out of the total number of respondents, divide the value of each cell by the total number of patients. If we multiply them by 100, we obtain the values in percentage terms.
What is the percentage of patients with strong problems taking new drugs?
0,167 · 100 = 16,7 %

Treatment (X)   Neuronal problems (Y)
                Strong (y_1)               Light (y_2)
Old (x_1)       a = 10/30 = 0,333 (f_11)   b = 4/30 = 0,133 (f_12)
New (x_2)       c = 5/30 = 0,167 (f_21)    d = 11/30 = 0,367 (f_22)


6. Bidimensional frequency distribution
Marginal frequency and distribution

In the table from above we make some changes: we add a column on the right-hand side and a row at the bottom of the table and call them "Total".

The column "Total" gives the marginal distribution of X and the row gives the marginal distribution of Y.

Treatment (X)   Strong (y_1)   Light (y_2)   Total
Old (x_1)       a = 10         b = 4         14
New (x_2)       c = 5          d = 11        16
Total           15             15            30
6. Bidimensional frequency distribution

Treatment (X)   Strong (y_1)     Light (y_2)      Total
Old (x_1)       a = 10 (n_11)    b = 4 (n_12)     14 (n_x1)
New (x_2)       c = 5 (n_21)     d = 11 (n_22)    16 (n_x2)
Total           15 (n_y1)        15 (n_y2)        30 (n)

Now we look at the conditional distributions of x_1, x_2, y_1, and y_2, which we compute by dividing the absolute frequency by the total number of observations for a specific category.

A conditional distribution lists the relative frequency of each category of a variable, given a specific value of the other variable in the contingency table.
6. Bidimensional frequency distribution

Treatment (X)   Strong (y_1)   Light (y_2)   Total
New (x_2)       c = 5          d = 11        16

Conditional distributions are:

Treatment (X)   Strong (y_1)          Light (y_2)           Total
New (x_2)       c = 5/16 = 0,3125     d = 11/16 = 0,6875
                h(n_21/n_x2)          h(n_22/n_x2)          1

31,25% of patients taking new drugs have strong neuronal problems;
68,75% of patients taking new drugs have light neuronal problems.
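The conditional distribution can be derived from the contingency table in a few lines (the dictionary layout is my own encoding of the table):

```python
# Contingency table: (treatment, problem) -> absolute frequency n_ij
table = {("old", "strong"): 10, ("old", "light"): 4,
         ("new", "strong"): 5,  ("new", "light"): 11}

# Marginal (row) total for the "new" treatment: n_x2
n_new = sum(v for (t, _), v in table.items() if t == "new")   # 16

# Conditional distribution of Y (problem severity) given X = "new"
cond = {p: table[("new", p)] / n_new for p in ("strong", "light")}
print(cond)
```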
6. Bidimensional frequency distribution
Covariance

Covariance is a measure of how two random variables vary together.

• It is similar to variance. However, the variance is a measure of the variation of one variable, while the covariance is a measure of the joint variation of two variables.
• Covariance indicates the existence of a linear relationship between the variables.

Cov(x, y) = S_xy = (1/n) Σ_{i=1}^{n} x_i y_i − x̄ȳ = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)

If S_xy > 0, the relation between the variables is linear and positive
If S_xy < 0, the relation between the variables is linear and negative
If S_xy = 0, there is no linear relation between the variables

6. Bidimensional frequency distribution
Covariance

Example.
x\y   1   2   3
2     1   4   1
3     2   4   2
4     1   2   1

Compute the covariance for the given dataset.

Solution. We first compute the average values of x and y and then use the formula of the covariance. With n = 18, x̄ = (2·6 + 3·8 + 4·4)/18 = 52/18 and ȳ = (1·4 + 2·10 + 3·4)/18 = 2:

S_xy = (1/n) Σ x_i y_j n_ij − x̄ȳ
     = (2·1·1 + 2·2·4 + 2·3·1 + 3·1·2 + 3·2·4 + 3·3·2 + 4·1·1 + 4·2·2 + 4·3·1)/18 − (52/18)·2
     = 104/18 − 104/18 = 0,

so there is no linear relation between x and y.
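Recomputing from the joint frequency table confirms the result (the dictionary is my own encoding of the table; note that zero covariance rules out a linear relation but not dependence):

```python
# Joint absolute frequencies n_ij: rows x in {2,3,4}, columns y in {1,2,3}
freq = {(2, 1): 1, (2, 2): 4, (2, 3): 1,
        (3, 1): 2, (3, 2): 4, (3, 3): 2,
        (4, 1): 1, (4, 2): 2, (4, 3): 1}
n = sum(freq.values())                                 # 18 observations

mean_x = sum(x * c for (x, _), c in freq.items()) / n  # 52/18
mean_y = sum(y * c for (_, y), c in freq.items()) / n  # 36/18 = 2

# S_xy = (1/n) * sum(x * y * n_ij) - mean_x * mean_y
cov = sum(x * y * c for (x, y), c in freq.items()) / n - mean_x * mean_y
print(cov)   # ~0: no linear relation between x and y
```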
6. Bidimensional frequency distribution
Covariance

Example.
x\y     1   2    3   Total
2       1   4    1   6
3       2   4    2   8
4       1   2    1   4
Total   4   10   4   18

Are X and Y independent?

Solution. If we want to know whether two variables are independent, we have to check whether the following expression is satisfied:

n_ij / n = (n_xi / n) · (n_yj / n)  ∀ i, j

For n_21: n_21/n = 2/18, while (n_x2/n) · (n_y1/n) = (8/18) · (4/18); since 2/18 ≠ (8/18) · (4/18),

the variables X and Y are dependent.

6. Bidimensional frequency distribution
Variance and Covariance

Unidimensional case:
Var(aX) = a² · Var(X)

Bidimensional case:
Var(aX + bY) = a² · Var(X) + b² · Var(Y) + 2·a·b·Cov(X, Y)
Var(aX − bY) = a² · Var(X) + (−b)² · Var(Y) + 2·a·(−b)·Cov(X, Y)

If X and Y are independent:
Var(aX + bY) = a² Var(X) + b² Var(Y)
Var(aX − bY) = a² Var(X) + (−b)² Var(Y)
6. Bidimensional frequency distribution
Examples

Unidimensional case:
Var(3X) = 3² · Var(X) = 9 · S_x²
Var(−4X) = (−4)² · Var(X) = 16 · S_x²

Bidimensional case:
Var(3X + 2Y) = 3² · Var(X) + 2² · Var(Y) + 2·3·2·Cov(X, Y)
             = 9 · Var(X) + 4 · Var(Y) + 12 · Cov(X, Y) = 9 · S_x² + 4 · S_y² + 12 · S_xy

Var(2X − 3Y) = 2² · Var(X) + (−3)² · Var(Y) + 2·2·(−3)·Cov(X, Y)
             = 4 · Var(X) + 9 · Var(Y) − 12 · Cov(X, Y) = 4 · S_x² + 9 · S_y² − 12 · S_xy

Var(X − Y) = 1² · Var(X) + (−1)² · Var(Y) + 2·1·(−1)·Cov(X, Y)
           = Var(X) + Var(Y) − 2 · Cov(X, Y) = S_x² + S_y² − 2 · S_xy
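These identities can be verified numerically on any small dataset (the sample values below are arbitrary; population formulas with division by n are used so the identities hold exactly):

```python
def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)   # population variance

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

# Var(3X + 2Y) = 9 Var(X) + 4 Var(Y) + 12 Cov(X, Y)
lhs = var([3 * a + 2 * b for a, b in zip(x, y)])
rhs = 9 * var(x) + 4 * var(y) + 12 * cov(x, y)

# Var(2X - 3Y) = 4 Var(X) + 9 Var(Y) - 12 Cov(X, Y)
lhs2 = var([2 * a - 3 * b for a, b in zip(x, y)])
rhs2 = 4 * var(x) + 9 * var(y) - 12 * cov(x, y)
```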
