UNIT 1
FREQUENCY DISTRIBUTION
Variables are characteristics of a population that are not constant across the
elements of that population.
Population
Example:
1. undergraduate students
2. BMW/Chanel sales

Sample
Example:
1. UCAM students, UCLA students, etc.
2. Spanish BMW/Chanel sales
1. Unidimensional frequency distributions
In order to ensure that the sample we work with is a representative sample of the whole
population, we take a random sample.
In a random sample all elements of the population are equally likely to be selected for
inclusion in the sample, i.e., the elements are drawn at random.
For example:
We are interested in conducting a study on the amount of physical exercise undertaken
by the general public. If we survey only people entering and leaving a gymnasium, we
obtain a biased sample of the population, and the results cannot be generalised to
the population at large. Hence, this sample is not a representative sample.
• A set of data on its own is very hard to interpret. There is a lot of information contained
in the data, but it is hard to see. We need ways of identifying the important features of
the data and of summarising it in a meaningful way.
• The use of graphs and summary statistics for understanding data is very useful and is the
first step we take in the statistical analysis.
We start with the frequency distribution of the variable, which provides an ordered
presentation of the set of its observations.
Qualitative data. Example (absolute frequencies $n_i$ of each observed colour, N = 10):
𝑥𝑖 𝑛𝑖
Green 1
Blue 4
Black 3
Brown 2
The relative frequency of the class i, denoted by fi, is its absolute frequency
divided by the total number of the elements in the sample.
$f_i = \dfrac{n_i}{N}, \qquad \sum_{i=1}^{n} f_i = 1$
Qualitative data
𝑥𝑖 𝑛𝑖 𝑓𝑖 𝑓𝑖 (%)
Green 1 1/10 10
Blue 4 4/10 40
Black 3 3/10 30
Brown 2 2/10 20
The cumulative absolute frequency $N_i = n_1 + \dots + n_i$ is the number of observations
accumulated up to class i, and $F_i = N_i / N$ is the cumulative relative frequency.
𝑥𝑖 𝑛𝑖 𝑓𝑖 𝑓𝑖 (%) 𝑁𝑖 𝐹𝑖 (%)
Green 1 1/10 10 1 10
Blue 4 4/10 40 5 50
Black 3 3/10 30 8 80
Brown 2 2/10 20 10 100
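The table above can be reproduced with a short Python sketch (the raw list of colour observations is reconstructed from the counts; names are illustrative):

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw sample of 10 colours matching the counts in the table
data = ["Green", "Blue", "Blue", "Blue", "Blue",
        "Black", "Black", "Black", "Brown", "Brown"]

counts = Counter(data)                      # absolute frequencies n_i
N = len(data)
order = ["Green", "Blue", "Black", "Brown"]

n = [counts[c] for c in order]              # n_i
f = [ni / N for ni in n]                    # relative frequencies f_i
Ni = list(accumulate(n))                    # cumulative absolute frequencies N_i
Fi = [v / N for v in Ni]                    # cumulative relative frequencies F_i

for row in zip(order, n, f, Ni, Fi):
    print(row)
```

Note that the last cumulative relative frequency is always 1, mirroring $\sum f_i = 1$.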
For the pie chart (sector graph), each class is assigned an angle proportional to its
relative frequency:

$w_i = 360° \cdot f_i$

x_i     n_i   f_i    f_i(%)   N_i   F_i(%)   w_i = 360°·f_i
Green   1     1/10   10       1     10       36°
Blue    4     4/10   40       5     50       144°
Black   3     3/10   30       8     80       108°
Brown   2     2/10   20       10    100      72°

[Figures: bar chart of the frequency distribution and pie chart / sector graph]
Quantitative data. DISCRETE

[Figures: bar chart of the frequency distribution, cumulative frequency distribution
(step function), and pie chart for a discrete variable taking the values 0–4]
Quantitative data. CONTINUOUS
Example: X= weight
Xi= {80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91}
N=10
How to group the data?
• Number of intervals: $\sqrt{N} = \sqrt{10} = 3.16 \approx 3$ intervals (round the
number toward the lowest integer)
• Amplitude: $a_i = \dfrac{\max x_i - \min x_i}{\text{nº of intervals}} = \dfrac{100 - 70}{3} = 10$
• Class mark (also denoted by $c_i$): $x_i = L_i + \dfrac{L_{i+1} - L_i}{2}$
• Density: $d_i = \dfrac{n_i}{a_i}$ (or, in relative terms, $\dfrac{f_i(\%)}{a_i}$)
Interval $L_i - L_{i+1}$   x_i (c_i)   a_i   n_i   f_i   f_i(%)   w_i    N_i   F_i   F_i(%)   d_i
[70, 80)                   75          10    1     0.1   10       36°    1     0.1   10       0.1
[80, 90)                   85          10    6     0.6   60       216°   7     0.7   70       0.6
[90, 100]                  95          10    3     0.3   30       108°   10    1     100      0.3
Total                                        10    1     100      360°

[Figures: histogram of densities $d_i$ and cumulative frequency polygon $F_i$ over the
intervals 70–100]
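The grouping steps above (√N rule, common amplitude, class marks and densities) can be sketched in Python; names are illustrative:

```python
import math

# Weight sample from the example
x = [80, 81, 85.3, 70, 89, 93, 100, 82, 85.5, 91]
N = len(x)

k = int(math.sqrt(N))                              # number of intervals, rounded down
a = (max(x) - min(x)) / k                          # common amplitude (100 - 70) / 3 = 10
edges = [min(x) + i * a for i in range(k + 1)]     # interval limits [70, 80, 90, 100]

# Count observations per interval; the last interval is closed on the right
n = [0] * k
for v in x:
    i = min(int((v - edges[0]) // a), k - 1)
    n[i] += 1

marks = [(edges[i] + edges[i + 1]) / 2 for i in range(k)]   # class marks c_i
density = [ni / a for ni in n]                              # densities d_i = n_i / a_i
print(n, marks, density)
```

The densities, not the raw counts, are the heights to plot in the histogram when intervals share the same amplitude or differ in width.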
• Mean
• Median
• Mode
• Mean, $\bar{x}$:
The Mean of a quantitative dataset is the sum of the observations in the dataset
divided by the number of observations in the dataset.
• Median, 𝑀𝑑 :
The Median of a quantitative dataset is the number in the middle of the
observations arranged in ascending order.
• Mode, 𝑀𝑜 :
The Mode of a dataset is the observation that occurs most frequently in the
dataset.
2. Measures of location/position
Measures of central location
Mean:
There are two means, the population mean, μ, and the sample mean, $\bar{x}$.
The calculation of both is the same, except that μ is calculated for the
entire population and $\bar{x}$ is calculated for a sample taken from that population.
Mean:

$\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$,
or, for data grouped into classes with class marks $c_i$,
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} c_i n_i$
Example:
Compute the mean for the following sample: x = {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60}
The mean is calculated by adding all the values and dividing the sum by the number
of observations: 623/11 ≈ 56.6 years.
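The same computation in a one-line Python sketch:

```python
# Sample from the example
x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean = sum(x) / len(x)    # 623 / 11
print(round(mean, 1))     # 56.6
```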
Properties:….
Median:
The median is the middle value of the distribution of all observations in the
sample arranged in ascending or descending order.
The median divides the distribution in half (there are 50% of observations on
either side of the median value).
For continuous variables, we look for the value which accumulates 50% of the
cumulative relative frequency ($F_i$). Since the data is grouped into intervals,
we may need to apply linear interpolation.
Median:
• the number of observations is odd: {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the middle value, which is 57 years
• the number of observations is even: {52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60},
the median is the mean of the two middle values, which equals 56.5 years
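Both cases can be handled with one small Python function:

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    # odd count: middle value; even count: mean of the two middle values
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

odd  = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
even = [52] + odd
print(median(odd), median(even))   # 57 56.5
```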
Mode:
In a given relative frequency distribution with class intervals, the mode is the
mid-point of the class interval which has the highest relative frequency. The
class interval of the highest relative frequency is called the Modal Class.
The mode measures data concentration and can therefore be used to locate the
region in a large dataset where much of the data is concentrated.
Mode:
In some distributions, the mode may not exactly represent the centre of the
distribution.
For example, in the sample {54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60} the centre of
the distribution is 57 years, but the mode is lower, at 54 years.
It is also possible to have more than one mode for the same distribution of data
(bi-modal or multi-modal). The presence of more than one mode can limit the
ability of the mode to describe the centre or typical value of the distribution,
because a single value describing the centre of the distribution cannot be
identified.
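The standard library exposes exactly this behaviour: `statistics.multimode` returns every most-frequent value, so a multi-modal sample yields a list with several entries.

```python
from statistics import multimode

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(multimode(x))   # [54] — unimodal sample

# A bimodal sample: multimode returns all most-frequent values
y = [1, 1, 2, 2, 3]
print(multimode(y))   # [1, 2]
```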
[Figures: Plot A and Plot B]
2. Measures of location/position
Measures of non-central location
• Quartiles
• Percentiles
Percentiles
are those values that divide the frequency distribution into 100 equal parts.
The p-th percentile is a value such that at least p percent of the observations in
the frequency distribution are less than or equal to this value and at least
(100 − p) percent of the observations are greater than or equal to it.
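A minimal sketch of this definition in Python. Note that several percentile conventions exist (NumPy, for instance, interpolates between observations); this function follows one common textbook convention:

```python
import math

def percentile(values, p):
    """p-th percentile: smallest value such that at least p% of the
    observations are <= it (one common textbook convention)."""
    s = sorted(values)
    n = len(s)
    k = p / 100 * n
    if k == int(k):                    # boundary falls between two observations
        i = int(k)
        return (s[i - 1] + s[i]) / 2
    return s[math.ceil(k) - 1]

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(percentile(x, 50))   # 57, which coincides with the median
print(percentile(x, 25))   # lower quartile
```

The quartiles are simply the 25th, 50th and 75th percentiles.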
3. Measures of dispersion
Measures of variability
Terms like variability, spread, and dispersion are synonyms. They refer to
how the distribution is spread out, that is we will look for measures of the
variability of a distribution.
• Range
• Inter-Quartile Range
• Standard deviation and Variance
Range R:
The range is the most obvious measure of dispersion and is the difference
between the lowest/minimum and the highest/maximum value in a dataset:
R = maximum − minimum
R is useful to evaluate the whole dataset, to show the spread within a dataset
and to compare the spread between similar datasets.
Since the range is based solely on the two most extreme values within the
dataset, if one of these is either extremely high or low (sometimes referred to
as outlier) it will result in a range that is not typical for the variability within the
dataset.
In the same way that the median divides a dataset into two halves, it can be
further divided into quarters by identifying the upper and lower quartiles.
The lower quartile is found one quarter of the way along a dataset and the
upper quartile is found three quarters along the dataset.
The IQR is the difference between upper and lower quartiles and is not
affected by extreme values. It is thus a resistant measure of variability.
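A Python sketch of the quartiles and the IQR. There are several quartile conventions; this one computes the medians of the lower and upper halves (the median itself is excluded when the count is odd):

```python
def median(t):
    m = len(t) // 2
    return t[m] if len(t) % 2 else (t[m - 1] + t[m]) / 2

def quartiles(values):
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]            # observations below the median position
    upper = s[(n + 1) // 2:]       # observations above the median position
    return median(lower), median(upper)

x = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
q1, q3 = quartiles(x)
iqr = q3 - q1                      # resistant measure of spread
print(q1, q3, iqr)                 # 54.0 58.0 4.0
```

Replacing 52 by an extreme value such as 20 leaves the IQR unchanged, illustrating its resistance to outliers.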
Standard deviation S:
is a measure that summarises the amount by which every value within a dataset
varies from the mean.
Effectively it indicates how tightly the values in the dataset are bunched around
the mean value.
It is the most comprehensive and widely used measure of dispersion since, unlike the
range and the inter-quartile range, it takes every observation in the dataset into account.
When the values in a dataset are pretty tightly bunched together the standard
deviation is small. When the values are spread apart, the standard deviation will
be relatively large. The standard deviation is usually presented in conjunction with
the mean and is measured in the same units.
Standard deviation S:
Two vending machines A and B drop candies when a quarter is inserted. The number
of pieces of candy one gets is random. The following data are recorded for six trials at
each vending machine:
[Figure: dotplots of the pieces of candy from vending machine A and vending machine B]
They have the same values of the center measures, but what about the spread of the
distribution? One way to look at the spread is to compute the standard deviations.
Standard deviation S,
• Sample:

$S_x = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$
The difference between each score and the mean is squared and then added
together. This sum is then divided by the number of scores minus one. Finally,
compute the square root of the expression.
• Population:

$\sigma_x = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Variance, 𝑺𝟐 :
Variability can also be defined in terms of how close the scores in the distribution are to
the middle of the distribution. Using the mean as the measure of the middle of the
distribution, the variance is defined as the average squared difference of the scores from
the mean.
• Sample variance: $S_x^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

• Population variance: $\sigma_x^2 = \dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
Since the variance represents squared differences, the standard deviation, which is
expressed in the original units of the data, is easier to interpret and much more commonly
used.
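Both versions of the formulas can be sketched side by side in Python (using the sample from the mean/median examples; only the divisor changes):

```python
import math

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
n = len(x)
mean = sum(x) / n

ss = sum((v - mean) ** 2 for v in x)   # sum of squared deviations

s2 = ss / (n - 1)          # sample variance S^2 (divisor n - 1)
s = math.sqrt(s2)          # sample standard deviation S
sigma2 = ss / n            # population variance sigma^2 (divisor N)
sigma = math.sqrt(sigma2)

print(round(s, 2), round(sigma, 2))   # 2.25 2.14
```

The sample version is always slightly larger than the population version because of the smaller divisor.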
Relative measures of variability are measures of variability adjusted by a measure of
the central location/position, e.g. the relative inter-quartile range

$RIQR_x = \dfrac{IQR}{M_d}$

The relative measure of variability is expressed in units of the central position
measure.
Standardization:
This method consists in subtracting the mean from each and every observation in
the dataset and dividing by the standard deviation.
$z = \dfrac{x - \bar{x}}{S}$
The resulting dataset has a mean equal to 0 and a standard deviation equal to 1.
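That property is easy to verify numerically; a minimal sketch:

```python
import math

x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))

z = [(v - mean) / s for v in x]        # standardized observations

z_mean = sum(z) / n
z_var = sum((v - z_mean) ** 2 for v in z) / (n - 1)
print(abs(round(z_mean, 10)), round(z_var, 10))   # 0.0 1.0
```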
In absolute terms, Juan got the highest grade in Literature. However, if we want
to know in which subject Juan scored highest relative to the other students, we
have to compute standardized grades.
Then, with respect to the other students, Juan got the highest grade in Mathematics.
4. Measures of Shape: Skewness and Kurtosis
The first thing you usually notice about a distribution’s shape is whether it has one
mode (peak) or more than one. If it’s unimodal (has just one peak), like most data
sets, the next thing you notice is whether it’s symmetric or skewed to one side.
Skewness is the tendency for the values to be more frequent around high or low
ends of the x-axis.
Distributions that have the same shape on both sides of the centre are called
symmetric.
(The normal distribution is a familiar example of a symmetric distribution with a single peak.)
4. Measures of Shape: Skewness and Kurtosis
The moment coefficient of skewness is

$g_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$
If the bulk of the data is at the left and the right tail is longer, we say that the
distribution is skewed right or positively skewed;
if the bulk of the data is at the right and the left tail is longer, we say that the
distribution is skewed left or negatively skewed.
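A direct translation of $g_1$ into Python, checked on small right-skewed and left-skewed samples (the data are illustrative):

```python
def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 1, 2, 2, 3, 4, 8]    # long right tail -> g1 > 0
left_skewed = [-v for v in right_skewed]   # mirrored sample -> g1 < 0
print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```

For a perfectly symmetric sample $g_1 = 0$.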
4. Measures of Shape: Skewness and Kurtosis
Kurtosis refers to how scores are concentrated in the centre of the distribution, the upper
and lower tails (ends), and the shoulders (between the centre and tails) of a distribution.
The moment coefficient of kurtosis of a data set is $g_2 = \dfrac{m_4}{m_2^2}$,
where $m_k$ is the k-th sample central moment, hence,

$g_2 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}}$
Platykurtic - Starting with a mesokurtic distribution and moving scores from both
centre and tails into the shoulders, the distribution flattens out and is referred to as
platykurtic.
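The coefficient $g_2$ translates directly into Python; the samples below are illustrative, and the value 3 (the kurtosis of a normal distribution) serves as the mesokurtic reference:

```python
def kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5, 6]                       # uniform-looking, platykurtic
peaked = [3, 3, 3, 3, 3, 3, 3, 3, -10, 16]      # heavy tails, leptokurtic
print(kurtosis(flat))     # below 3
print(kurtosis(peaked))   # above 3
```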
[Worked table: columns $X_i$, $X_i - \bar{X}$, $(X_i - \bar{X})^2$, $(X_i - \bar{X})^3$, $(X_i - \bar{X})^4$]

Overview
The average tells us about the central value of the distribution, and the measures of
dispersion tell us about the concentration of the items around a central value.
These measures do not reveal whether the dispersal of values on either side of an
average is symmetric or not.
Measures of Skewness and Kurtosis, like measures of central tendency and dispersion,
study the characteristics of a frequency distribution.
Thus, Skewness is a measure that studies the degree and direction of departure from
symmetry.
Overview (cont.)
In a symmetric distribution, the values of the mean, median and mode coincide; in an
asymmetric distribution, they do not.
When two or more symmetric distributions are compared, the difference between them
is studied by means of Kurtosis. When two or more asymmetric distributions are
compared, they will show different degrees of Skewness.
The two measures describe different aspects of shape: Skewness captures asymmetry,
while Kurtosis captures how scores are concentrated in the centre and tails; a
distribution can exhibit both at once.
5. Measures of concentration
The analysis of the concentration takes into account the degree of the
inequality in the distribution of the variable. It is often used in economic
series and studies of the income/wealth inequality analysis.
$I_g = \dfrac{\sum_{i=1}^{n-1}(p_i - q_i)}{\sum_{i=1}^{n-1} p_i}$,

where $p_i = F_i \cdot 100$ and $q_i = \dfrac{\sum_{j=1}^{i} x_j n_j}{\sum_{j=1}^{n} x_j n_j} \cdot 100$
(the cumulative percentage of the total value of the variable accumulated by the
first i classes).
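A sketch of this index in Python, with $p_i$ the cumulative percentage of elements and $q_i$ the cumulative percentage of the total value; the class values and frequencies below are illustrative:

```python
def gini(x, n):
    """Gini index I_g = sum(p_i - q_i) / sum(p_i) over i = 1..k-1,
    with p_i cumulative % of elements and q_i cumulative % of total x*n."""
    N = sum(n)
    total = sum(xi * ni for xi, ni in zip(x, n))
    p, q, n_cum, t_cum = [], [], 0, 0
    for xi, ni in zip(x, n):
        n_cum += ni
        t_cum += xi * ni
        p.append(100 * n_cum / N)
        q.append(100 * t_cum / total)
    return sum(pi - qi for pi, qi in zip(p[:-1], q[:-1])) / sum(p[:-1])

# Perfect equality: every class earns the same -> index 0
print(gini([1000, 1000], [5, 5]))         # 0.0
# Extreme concentration: one small class holds almost all income -> index near 1
print(gini([0.0001, 10_000], [99, 1]))
```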
Gini index = 0 represents a distribution where the Lorenz curve coincides with the
'Line of Equality': income is perfectly equally distributed and the concentration of
income is minimal.

Gini index = 1 represents maximal inequality and the maximum concentration of income
(one person has all the income and all others receive no income).
[Worked table: columns $x_i$, $n_i$, $N_i$, $p_i$, $x_i n_i$, $q_i$]
The value of the index is close to zero, therefore, the income is equally
distributed and there is a minimum concentration of the income.
6. Bidimensional frequency distribution
Contingency and correlation tables
We often wish to measure the degree to which one variable affects the value of
another or we want to study the association between two or more features of
the variable for each element of the population (bivariate data).
We use contingency tables (for qualitative variables) and correlation tables (for
quantitative variables).
Example 1:
We want to study the relation between the colour of eyes and the colour of hair.
Example 2:
We want to study the relation between the type of drug patients take and their
neural problems.
• Variable X indicates the type of drug, such that $x_1$ is the old drug and $x_2$
is the new one.
• Variable Y indicates the neural problem, such that 𝑦1 is a strong problem and
𝑦2 is a light one.
Then, we count and classify patients according to their problems and the type of
drug they are taking:
In general:

              Strong   Light
Old           a        b
New           c        d

With the observed counts:

X \ Y   $y_1$   $y_2$
$x_1$   10      4
$x_2$   5       11
These 4 values are called joint absolute frequencies ($n_{ij}$), which tell us how
many subjects there are with each specific pair of values of the variables.
How many of them have light neuronal problems and take new drugs (that is,
what is the absolute frequency of individuals with light neuronal problems
taking new drugs )?
𝑛22 = 11
Now, to find relative frequencies (f), which tell us the number of subjects there are, with
specific values of the variables, out of the total amount of responders, divide the value of
each cell by the total number of patients. If we multiply them by 100, we obtain the
values in percentage terms.
What is the percentage of patients with strong problems taking new drugs?
0,167·100=16,7 %
We add a column in the right hand side and add a row in the bottom of the
table and call them “Total”.
The column “Total” is the marginal distribution of X and the row is the
marginal distribution of Y.
         $y_1$   $y_2$   Total
$x_1$    10      4       14
$x_2$    5       11      16
Total    15      15      30
Covariance
• It’s similar to variance. However, the variance is a measure of the variation of one
variable, while the covariance is a measure of the variation of two variables.
• Covariance indicates the existence of linear relationship between variables.
$Cov(x, y) = S_{xy} = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Example. Compute the covariance of the two-dimensional distribution given by the
following joint frequency table (cell entries are the frequencies $n_{ij}$):
x\y 1 2 3
2 1 4 1
3 2 4 2
4 1 2 1
Solution. We first compute the average values of x and y from the marginal totals
and then use the formula for the covariance:

$\bar{x} = \dfrac{2 \cdot 6 + 3 \cdot 8 + 4 \cdot 4}{18} = \dfrac{52}{18} \approx 2.89, \qquad \bar{y} = \dfrac{1 \cdot 4 + 2 \cdot 10 + 3 \cdot 4}{18} = 2$

$S_{xy} = \dfrac{2 \cdot 1 \cdot 1 + 2 \cdot 2 \cdot 4 + 2 \cdot 3 \cdot 1 + 3 \cdot 1 \cdot 2 + 3 \cdot 2 \cdot 4 + 3 \cdot 3 \cdot 2 + 4 \cdot 1 \cdot 1 + 4 \cdot 2 \cdot 2 + 4 \cdot 3 \cdot 1}{18} - \dfrac{52}{18} \cdot 2 = \dfrac{104}{18} - \dfrac{104}{18} = 0$

The covariance is zero, so this table shows no linear association between x and y.
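The same computation can be sketched in Python, reading the joint frequencies straight from the table:

```python
xs = [2, 3, 4]          # row values of x
ys = [1, 2, 3]          # column values of y
n = [[1, 4, 1],         # joint absolute frequencies n_ij
     [2, 4, 2],
     [1, 2, 1]]

N = sum(map(sum, n))                                          # 18
x_mean = sum(xs[i] * sum(n[i]) for i in range(3)) / N         # 52/18
col = [sum(n[i][j] for i in range(3)) for j in range(3)]      # column totals
y_mean = sum(ys[j] * col[j] for j in range(3)) / N            # 36/18 = 2
xy_mean = sum(xs[i] * ys[j] * n[i][j]
              for i in range(3) for j in range(3)) / N        # 104/18

cov = xy_mean - x_mean * y_mean
print(cov)
```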
Properties of the variance
Unidimensional case
𝑉𝑎𝑟 𝑎𝑋 = 𝑎2 · 𝑉𝑎𝑟 𝑋
Bidimensional case,
𝑉𝑎𝑟 𝑎𝑋 + 𝑏𝑌 = 𝑎2 · 𝑉𝑎𝑟 𝑋 + 𝑏2 · 𝑉𝑎𝑟 𝑌 + 2 · 𝑎 · 𝑏 · 𝐶𝑜𝑣 𝑋, 𝑌
𝑉𝑎𝑟 𝑎𝑋 − 𝑏𝑌 = 𝑎2 · 𝑉𝑎𝑟 𝑋 + (−𝑏)2 · 𝑉𝑎𝑟 𝑌 + 2 · 𝑎 · −𝑏 · 𝐶𝑜𝑣 𝑋, 𝑌
If 𝑿 𝒂𝒏𝒅 𝒀 𝒂𝒓𝒆 𝒊𝒏𝒅𝒆𝒑𝒆𝒏𝒅𝒆𝒏𝒕:
𝑉𝑎𝑟 𝑎𝑋 + 𝑏𝑌 = 𝑎2 𝑉𝑎𝑟 𝑋 + 𝑏2 𝑉𝑎𝑟 𝑌
𝑉𝑎𝑟 𝑎𝑋 − 𝑏𝑌 = 𝑎2 𝑉𝑎𝑟 𝑋 +(−𝑏)2 𝑉𝑎𝑟 𝑌
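These identities are easy to verify numerically; a small Python check on arbitrary paired data (population versions of variance and covariance, divisor n):

```python
def pvar(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def pcov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

x = [1, 2, 4, 7]        # illustrative paired data
y = [3, 1, 5, 2]
a, b = 3, 2

lhs_plus = pvar([a * xi + b * yi for xi, yi in zip(x, y)])
rhs_plus = a**2 * pvar(x) + b**2 * pvar(y) + 2 * a * b * pcov(x, y)

lhs_minus = pvar([a * xi - b * yi for xi, yi in zip(x, y)])
rhs_minus = a**2 * pvar(x) + b**2 * pvar(y) - 2 * a * b * pcov(x, y)

print(abs(lhs_plus - rhs_plus) < 1e-9, abs(lhs_minus - rhs_minus) < 1e-9)
```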
Examples
Unidimensional case:
$Var(3X) = 3^2 \cdot Var(X) = 9 S_x^2$
$Var(-4X) = (-4)^2 \cdot Var(X) = 16 S_x^2$
Bidimensional case,
$Var(3X + 2Y) = 3^2 \cdot Var(X) + 2^2 \cdot Var(Y) + 2 \cdot 3 \cdot 2 \cdot Cov(X, Y) = 9 S_x^2 + 4 S_y^2 + 12 S_{xy}$