LU0 Descriptive Statistics
Distributions
◼ Learning Units
▪ LU0: Descriptive statistics
▪ LU1: Introduction to probability theory
▪ LU2: Probability axioms
▪ LU3: Random variables and distribution functions
▪ LU4: Mathematical expectation and moments
▪ LU5: Specific probability distributions
▪ LU6: Joint distributions
Math I
• Solving equation systems
• Real functions
• Derivatives
• Integral calculation
Math II
• Partial derivatives
• Integral calculation in IR2
Statistics and Probability Distributions
• Descriptive statistics
• Probability theory
Statistical Inference
• Point estimation
• Confidence intervals
• Hypothesis tests (parametric)
ALL courses, and whenever it is necessary to analyse data!
◼ Introduction to Statistics
◼ Organization of the information
◼ Frequency distributions
◼ Descriptive measures
◼ Introduction to Probability Theory
LU0: Descriptive statistics
◼ Topics
▪ Introduction to Statistics
❑ Concepts
❑ Data types and classification of statistical variables
❑ Measurement scales
▪ Frequency distributions
❑ Discrete variables
❑ Continuous variables
▪ Descriptive measures
❑ Location
❑ Dispersion
❑ Association
❑ Outlier analysis (self-study)
▪ Newbold, P., Carlson, W. L., Thorne, B. (2013). Statistics for Business and Economics. 8th Edition, Boston: Pearson, chapters 1 and 2. (requires VPN connection)
▪ Jon Acampora (2015). Interactive Histogram Chart That Uncovers The Details. https://www.excelcampus.com/charts/interactive-histogram-with-group-details/ (access: Jan 2022)
◼ Population
▪ Entire set of elements having one or more common characteristics
Example: Portuguese population, employees, all cars in circulation, Portuguese
SMEs
◼ Statistical Unit
▪ Individual element of the population
▪ Each unit may have one or more characteristics
◼ Sample
▪ Subset of elements from a population for which certain characteristics
are studied
◼ Variable
▪ Each characteristic of a statistical unit corresponds to a variable
❑ A variable is an attribute that can be used to describe a person, place, or
thing
▪ The values that a characteristic can assume are the values that the variable can
take
❑ Different types of variables are analysed using different tools and statistical
techniques
❑ Notation: X, Y, Z
▪ Variables are characteristics of those elements or the attributes that you are
measuring
▪ Data are sometimes stored in a table, where the rows are statistical units and
the columns are variables
❑ What variables might you associate with a person?
❑ What variables might you associate with a city?
❑ What variables might you associate with a stock?
❑ What variables might you associate with a car?
DATA TYPES
QUALITATIVE
▪ Nominal type: values are only identified by a name, label or code that designates a category or modality, which cannot be sorted in a logical fashion. Hence, the categories cannot be assigned a numerical value.
❑ Classification of individuals by gender (female, male)
❑ Classification of regions (urban, suburban and rural)
❑ Type of vegetation; class of soil
▪ Ordinal type: differs from the nominal type by the possibility of sorting the categories and assigning numerical values.
❑ Grades from an elementary school test (insufficient, sufficient, good)
❑ Teacher’s evaluation (excellent, good, average, poor)
QUANTITATIVE
▪ Discrete type (discrete variables): variables take only a finite, or a countably infinite, number of values. Typically, values are [...]
❑ Number of accidents per hour
❑ Number of workers in a company
❑ Fire frequency
❑ Number of buildings
Profit margin
▪ Equal intervals: scale units along the scale are equal to one another. This
means, for example, that the difference between 1 and 2 would be equal to
the difference between 19 and 20.
Ana Cristina Costa. Adapted from: StatTrek.com (2017) Scales of Measurement in Statistics.
◼ Measurement scales
▪ Nominal: only satisfies the identity property of measurement. Values assigned
to variables represent a descriptive category, but have no inherent numerical
value with respect to magnitude.
▪ Ordinal: has the property of both identity and magnitude. Each value on the
ordinal scale has a unique meaning, and it has an ordered relationship to every
other value on the scale.
◼ Measurement scales
▪ Nominal scale
❑ Numbers are used as labels (names or categories) to identify the measured
objects
❑ The assignment of the numbers to the measured objects is agreed - they do
not reflect the quantity of the observed characteristic but rather their
quality
❑ Examples: car registrations, zip codes, marital status, sex, eye colour,
article code
◼ Measurement scales
▪ Ordinal scale
❑ Maintains the characteristics of the nominal scale, but has the ability to
sort the data
❑ Any series of numbers can be used as long as it preserves the order of
relationships between measured objects (numeric values are irrelevant)
❑ Examples: social level, salary tier, scales used to measure opinions (Likert
scales)
◼ Measurement scales
▪ Interval scale
❑ To the characteristics of the previous scales, it adds the possibility of
determining the distance between the different points of the scale, but
with an arbitrary zero point
❑ The location of 0 is agreed (0 does not mean "absence of")
❑ The numbers used by the interval scales do not allow establishing
proportionality relationships
➢ Sums and differences can be made with the values of these measurement
scales (mean, standard deviation, etc.)
◼ Measurement scales
▪ Ratio scale
❑ It has the same properties as the interval scale, but includes an absolute 0
(0 means "absence of")
❑ Allows proportionality relationships to be established
➢ It is possible to do all the arithmetic operations ("all" statistical procedures
are allowed)
➢ Allows the conversion of units of measure (e.g., from km to miles)
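The difference between the interval and ratio scales can be illustrated with a small sketch. Temperature is used here as an assumed example (it does not appear in the slides): degrees Celsius have an agreed zero, whereas kelvin has an absolute zero.

```python
# Temperatures in degrees Celsius (interval scale: the zero point is agreed)
celsius_a, celsius_b = 10.0, 20.0

# The same temperatures in kelvin (ratio scale: absolute zero)
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

# Differences are meaningful on both scales and agree with each other
diff_celsius = celsius_b - celsius_a
diff_kelvin = kelvin_b - kelvin_a

# Ratios are only meaningful on the ratio scale:
# 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius
ratio_celsius = celsius_b / celsius_a
ratio_kelvin = kelvin_b / kelvin_a

print("difference:", diff_celsius, round(diff_kelvin, 2))
print("ratio:", ratio_celsius, round(ratio_kelvin, 3))
```

The differences coincide on both scales, while the ratio changes with the choice of zero, which is why proportionality statements require a ratio scale.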
[Diagram: Statistics splits into Descriptive Statistics and Inferential Statistics (Statistical Inference about the POPULATION)]
◼ Tables
▪ Etc.
◼ Simple table
Men      2699.6
Women    1995.4
TOTAL    4695
◼ Example 1
Using the data in the Students sheet of the LU0_Examples Excel file and
PivotTables,
▪ Produce a simple table showing how many students are enrolled in each
subject
➢ How many students are enrolled in Math?
▪ Produce a simple table showing how many students are enrolled in each high
school
➢ How many students are enrolled in Columbus East?
▪ Title indicating in a precise and synthetic way the subject of the information
▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information
◼ Graphs
▪ Graphs are used to illustrate in a simple and intuitive way the distribution of
information
◼ Graphs
▪ Graphs can only be compared directly when they use the same scale!
▪ It is common to find 3D graphs in which depth does not describe any variable. Because volume is the hardest visual property to perceive accurately, 3D graphs should be avoided!
◼ Scatterplot
▪ The data are displayed as a collection of points (or another symbol): the value of one variable (X) determines the position on the horizontal axis and the value of the other variable (Y) determines the position on the vertical axis
◼ Example 2
▪ Using the data in the NationalAccounts sheet of the LU0_Examples Excel file,
produce a scatterplot of the consumption and income data, and use the Add
Trend Line option
➢ What can you conclude from this chart?
◼ Line chart
▪ Displays information as a series of data points connected by straight line
segments
! A dashed line is visually less important than a solid line
▪ Show trends and changes of one variable by another continuous variable that
is represented on the horizontal axis
▪ Time series graph: allows the analysis of trends and changes of one or more statistical variables over a period of time
◼ Line chart
▪ No more than three lines per chart should be included, otherwise they make
the chart difficult to read
❑ A different line style should be used for each line, varying color, shape, size, or value
◼ Example 3
▪ Using the data in the ActivePop sheet of the LU0_Examples Excel file, produce
a line chart of the employment and unemployment data
➢ Is this an appropriate chart for these data?
◼ Example 3
▪ The two variables have very different magnitudes
▪ Both variables have very low relative variation rates, so variations over time are hard to perceive
✓ Simple solution: produce two separate charts
◼ Example 3
✓ Elaborate solution: include two vertical axes in a single chart
◼ Bar chart
▪ The values of the variable are represented by bars whose height (or
length), represents the numeric value of the variable(s)
▪ Neither the area nor the width of the bars is important (they have no
relation to the values of the variable)
▪ In order not to mislead and / or make it difficult to read the graph, the
bars must be all the same width
▪ The gap between the bars should be approximately equal to the width
of the bars
◼ Bar chart
▪ Should be used to represent discrete or qualitative data in absolute or
relative terms, or to compare categories of quantitative variables
▪ Bar charts can replace time series graphs in cases where the data
series is very short
❑ They are also recommended when importance is given to the value of the
variable in each period and we mostly want to compare individual
quantities
❑ For more than one data series, line charts are clearly preferable
▪ Simple bar chart: represents only one variable and each bar is
associated with a value
◼ Example 4
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Using a double entry PivotTable, produce a bar chart of the number of
students enrolled in each subject, grouped by high school. Suggestion:
use the PivotChart capability.
◼ Example 5
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Using a double entry PivotTable, produce stacked bar charts of the number of
students enrolled in each subject by high school. Suggestion: use the PivotChart
capability.
◼ Pie chart
▪ Circular graph in which the circle represents the total value of the
aggregate, and each section represents a component
◼ Example 6
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Produce pie charts of the distribution of students by high school and by subject
▪ Legend when the graph shows data from more than one variable
▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information
◼ Frequency distribution
▪ Set of all values, or modalities, of a variable and the number of
corresponding occurrences
▪ Set of sorted values of the variable and cumulative sum of the previous
frequencies
◼ Frequency table
xi      ni    fi    Ni        Fi
x1      n1    f1    N1        F1
…       …     …     …         …
xk      nk    fk    Nk = n    Fk = 1
Total   n     1
◼ Notation
x1, x2, …, xk → values that the variable X assumes
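A minimal sketch of how such a frequency table can be built for a discrete variable. The function name `frequency_table` and the data are hypothetical; the columns follow the notation of the table: value, absolute frequency (ni), relative frequency (fi), cumulative absolute frequency (Ni) and cumulative relative frequency (Fi).

```python
from collections import Counter

def frequency_table(values):
    """Return rows (x_i, n_i, f_i, N_i, F_i) sorted by x_i."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for x in sorted(counts):
        n_i = counts[x]
        cumulative += n_i          # N_i: running total of absolute frequencies
        rows.append((x, n_i, n_i / n, cumulative, cumulative / n))
    return rows

# Hypothetical observations of a discrete variable
data = [0, 0, 1, 2, 2, 3]
table = frequency_table(data)

for x, n_i, f_i, N_i, F_i in table:
    print(f"{x:>3} {n_i:>3} {f_i:6.3f} {N_i:>3} {F_i:6.3f}")
```

The last row always ends with Nk = n and Fk = 1, which is a quick check that no observation was lost.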
◼ Bar diagram
▪ Graph in which the values of the variable are shown on the X axis and the respective simple frequencies (absolute or relative) on the Y axis
▪ In practice, it is common to
use bar charts
◼ Ladder diagram
▪ Ladder-shaped chart
representing the distribution
of cumulative frequencies
(absolute or relative)
▪ The classes should be exhaustive, i.e., they must cover the entire range of the
data
▪ The first class should contain the lowest value, just as the last class should
contain the highest value
▪ The number of classes and the width of each class should neither be too small
nor too large
▪ The classes should, preferably, be of equal width, and no class should have
zero frequency
▪ Open-end classes: some values in the data set may be extremely small, and others extremely large, compared with the rest of the data. In that case the lower limit of the first class and the upper limit of the last class are left unspecified. Such classes are called open-end classes.
◼ Construction of classes
Sturges’ rule (logarithm)
Nr. of observations (n)    Nr. of classes (k)
1                          1
2                          2
3 – 5                      3
6 – 11                     4
12 – 23                    5
24 – 46                    6
47 – 93                    7
94 – 187                   8
188 – 376                  9
377 – 756                  10
Common sense (rule of thumb)
Nr. of observations (n)    Nr. of classes (k)
Less than 50               5
50 – 100                   6 – 8
100 – 200                  8 – 10
200 – 300                  10 – 12
300 – 500                  12 – 15
500 – 1000                 15 – 20
More than 1000             20
❖ Alternatives: https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
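The Sturges column can be reproduced with a short sketch. Sturges' rule is usually written k = 1 + log2(n), which is approximately 1 + 3.322 log10(n). The slides do not state the formula, so the constant 3.3 and the rounding to the nearest integer used below are assumptions, chosen because they match every class boundary in the table.

```python
import math

def sturges_classes(n):
    """Number of classes k = 1 + 3.3 * log10(n), rounded to the nearest integer.

    The constant 3.3 (instead of 3.322) is an assumption that reproduces
    the boundaries of the table in the slides.
    """
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    return math.floor(1 + 3.3 * math.log10(n) + 0.5)

for n in (5, 23, 24, 100, 756):
    print(n, sturges_classes(n))
```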
◼ Construction of classes
▪ k = Number of classes:
$$k = \begin{cases} 5 & \text{if } n \le 25 \\ \approx \sqrt{n} & \text{if } n > 25 \end{cases}$$
◼ Histogram
▪ Graphical representation of the frequency distribution of a continuous
variable by means of rectangles whose widths represent class intervals
and whose areas are proportional to the corresponding frequencies
❑ The height of the rectangles is equal to the (absolute or relative)
frequency divided by the class interval’s width
▪ The area of the rectangle of each class is equal to its frequency, and the
sum of the areas is equal to n for absolute frequencies or to 1 for
relative frequencies
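A minimal sketch of the height rule, using hypothetical classes of unequal width. Each height is the relative frequency divided by the class width, so the areas add up to 1.

```python
# Hypothetical classes: (lower limit, upper limit, absolute frequency)
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]
n = sum(freq for _, _, freq in classes)

heights = []
areas = []
for lower, upper, freq in classes:
    width = upper - lower
    rel_freq = freq / n
    height = rel_freq / width      # height = relative frequency / class width
    heights.append(height)
    areas.append(height * width)   # area recovers the relative frequency

total_area = sum(areas)
print(heights, total_area)
```

With unequal widths, plotting the raw frequencies as heights would exaggerate the widest class; dividing by the width keeps the areas proportional to the frequencies.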
◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Skewness: distributions are skewed to the side of the long tail
◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Modality: prominent peaks determine modality
◼ Histogram
▪ May depict extreme values and possible outliers
[Figure: histograms showing moderate extremes and heavy extremes. Atypical values: outliers? errors?]
◼ Histogram
▪ How does the number of classes (thus, the width) affect the
histogram?
➢ https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/histograms-2-of-4/
➢ http://www.shodor.org/interactivate/activities/Histogram/
✓ Histogram vs. bar chart: the major difference is that a histogram is only used to plot the
frequency of a continuous data set that has been divided into classes (sometimes
called bins). Bar charts, on the other hand, can be used for many other
types of variables, including ordinal and nominal data.
◼ Frequency Polygon
▪ Graph resulting from successively joining, by line segments, the midpoints of
the upper sides of the rectangles of the histogram
▪ The first point is on the x-axis and is
placed in the middle of the interval
which precedes the first bar of the
histogram (frequency=0). The last point
is located on the x-axis in the middle of
the interval immediately following the
last bar of the histogram (frequency=0).
◼ Frequency Polygon
▪ Emphasizes the overall
pattern in the data
Source: European Centre for Disease Prevention and Control (n.a.) FEM Wiki:
Frequency polygons. (https://wiki.ecdc.europa.eu/fem/w/wiki/frequency-polygons;
accessed: 20 Nov 2017)
◼ Frequency Curve
▪ A smooth curve which corresponds to the limiting case of a histogram
computed for a frequency distribution of a continuous variable as the
number of observations becomes very large
❑ It is roughly a smoothed Frequency Polygon
Source: Weisstein, Eric W. "Frequency Curve." From MathWorld--A Wolfram Web Resource.
(http://mathworld.wolfram.com/FrequencyCurve.html; accessed: 18 Nov 2017)
◼ Descriptive statistics
▪ Synthesize important information characteristics through a single
number
◼ Location measures
▪ Describe where the data is located (on the x-axis)
❑ Central tendency measures attempt to describe the typical or central value
that best describes the data. The focus is on where the data is centred or
clustered.
◼ Symmetry measures
▪ Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same
to the left and right of the center point.
◼ Kurtosis measure
▪ Kurtosis refers to how peaked a distribution is or conversely how flat it
is relative to a Normal distribution, which is symmetrical and bell-
shaped
❑ A positive value tells you that you have heavy-tails (i.e., a lot of data in your tails)
❑ A negative value means that you have light-tails (i.e., little data in your tails)
◼ Association measures
▪ Describe the degree of association between two variables
▪ Mean, Median, Mode
▪ Quartiles, Deciles, Percentiles
◼ Mean
▪ Can be thought of as the centre of mass of the values of the
observations, i.e. the point of equilibrium once the observations
are placed on a ruler
▪ Extreme values may “push away” the mean from the most typical values
▪ As the data becomes skewed the mean loses its ability to provide the best central
location for the data because the skewed data is dragging it away from the typical values
10 11 15 12 17 14
15 12 16 15 12 14
11 17 16 14 15 15
16 10 17 13 13 13
16 13 11 16 18 14
15 16 13 16 14 16
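◼ Mean
▪ Raw data (individual observations, discrete or continuous)
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
▪ Aggregated data, discrete variable (relative frequencies $f_i$ of the $k$ distinct values)
$$\bar{X} = \sum_{i=1}^{k} f_i X_i$$
▪ Aggregated data, continuous variable ($C_i$ is the midpoint of class $i$)
$$\bar{X} = \sum_{i=1}^{k} f_i C_i$$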
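The three ways of computing the mean can be compared with a short sketch. It uses the 36 observations listed on the Mean slide; the class limits in the grouped calculation are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The 36 observations listed on the Mean slide
data = [10, 11, 15, 12, 17, 14,
        15, 12, 16, 15, 12, 14,
        11, 17, 16, 14, 15, 15,
        16, 10, 17, 13, 13, 13,
        16, 13, 11, 16, 18, 14,
        15, 16, 13, 16, 14, 16]
n = len(data)

# Raw data: sum of the observations divided by n
mean_raw = sum(data) / n

# Aggregated discrete data: sum of f_i * X_i over the distinct values
counts = Counter(data)
mean_freq = sum((n_i / n) * x for x, n_i in counts.items())

# Grouped data: sum of f_i * C_i, with C_i the class midpoint
# (hypothetical classes [lower, upper) of width 2)
classes = [(10, 12), (12, 14), (14, 16), (16, 18), (18, 20)]
mean_grouped = 0.0
for lower, upper in classes:
    f_i = sum(lower <= x < upper for x in data) / n
    c_i = (lower + upper) / 2
    mean_grouped += f_i * c_i

print(round(mean_raw, 3), round(mean_freq, 3), round(mean_grouped, 3))
```

The raw and frequency-based means coincide, while the grouped mean differs because every observation in a class is replaced by the class midpoint.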
◼ Mode
▪ It is easily affected by small changes in frequency of discrete data, or
class construction in the continuous case
▪ It is possible for a set of data values to have none or more than one
mode
◼ Median
▪ Middle value of the data set when it has been arranged in ascending
order
❑ If the sample has an odd size, it coincides with the central observation
❑ If the sample has even size, the median takes the value of the average of
the two most central observations
❑ Raw data: 0 0 1 2 2 3, n = 6 (even)
❑ Cumulative relative frequencies:
xi    Fi
0     0.44
1     0.56   ← Median = 1
2     0.78
3     0.89
4     1
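Both median examples can be checked with a minimal sketch. The function names are hypothetical. For the cumulative table, the convention assumed here is that the median is the first value whose cumulative relative frequency exceeds 0.5 (and the average of two consecutive values when it equals 0.5 exactly).

```python
def median_raw(values):
    """Median of individual observations."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                        # odd n: central observation
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: average of the two central ones

def median_from_cumulative(table):
    """Median from rows (x_i, F_i) sorted by x_i, with F_i the cumulative relative frequency."""
    for index, (x, cum_freq) in enumerate(table):
        if cum_freq > 0.5:
            return x
        if cum_freq == 0.5:
            return (x + table[index + 1][0]) / 2
    raise ValueError("cumulative frequencies never reach 0.5")

raw_median = median_raw([0, 0, 1, 2, 2, 3])
table_median = median_from_cumulative(
    [(0, 0.44), (1, 0.56), (2, 0.78), (3, 0.89), (4, 1.0)]
)
print(raw_median, table_median)
```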
▪ Percentiles → values that split a sorted data set into 100 equal parts
❑ The first quartile (Q1) is the 25th percentile: 25% of the observations are below Q1
❑ The second quartile (Q2 = Median) is the 50th percentile
❑ The third quartile (Q3) is the 75th percentile: 75% of the observations are below Q3
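[Figure: sorted data from Xmin to Xmax split by Q1, Q2 (median) and Q3. 25% of the observations lie below Q1 and 75% above it; 50% lie on each side of Q2; 75% lie below Q3 and 25% above it]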
! Different software packages use slightly different formulas to compute these values
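The effect of different quartile formulas can be seen with Python's standard library, used here only as an illustration (the slides work in Excel). `statistics.quantiles` offers two interpolation methods, "exclusive" and "inclusive"; the data are the six observations from the median example.

```python
import statistics

data = [0, 0, 1, 2, 2, 3]

# Two common interpolation conventions for the same data
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

Both methods agree on the median but give different values for Q1 and Q3, so the method should be reported whenever quartiles are compared across tools.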
▪ A symmetric distribution is one where the left- and right-hand sides of the
distribution are roughly equally balanced around the mean
Source: https://www.siyavula.com/read/maths/grade-11/statistics/11-statistics-05 (accessed: 17 Jun 2020)
[Figure: two frequency curves with the mode, median and mean marked]
▪ Positive skew: Mean > Median
▪ Negative skew: Mean < Median
▪ Median
→ Does not use all data, but it is robust
→ The quartiles, including the median, are robust location measures because
they are not affected by extreme values
▪ Mode
→ Easily affected by small changes in frequency
➢ Do not use the MODE Excel function with continuous data
(1) Median, or
Adapted from: What to Report When There is an Outlier by Robert G. Kelley, www.miracosta.edu/home/rkelley (accessed 2018)
◼ Dispersion measures
◼ Range
▪ Difference between the highest and the lowest value
❑ It is measured in the same units as the data
❑ It is most useful in representing the dispersion of small data sets
R = Xmax – Xmin
◼ Interquartile Range
▪ Difference between the 3rd quartile and the 1st quartile (encompasses
50% of the central observations)
IQ = Q3 – Q1
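[Formula residue: dispersion formulas for aggregated data, written as sums over $i = 1, \dots$ of $f_i$ times the deviations $X_i - \bar{X}$ (discrete case) and $C_i - \bar{X}$ (classes)]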
$$CV = \frac{S}{\bar{X}} \times 100$$
▪ It can only be calculated when the variable takes values of a single sign, i.e.
the values are all positive or all negative
▪ CV > 50% indicates that the mean is poorly representative, so the median
should [also] be used to characterise the typical values
❑ The lower the CV, the more representative the mean
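A minimal sketch of the dispersion measures, applied to the 36 observations listed on the Mean slide. Two assumptions are made: the standard deviation S uses the n - 1 divisor (consistent with the covariance formula in this unit), and the quartiles use the "inclusive" method of Python's standard library, which may differ slightly from Excel.

```python
import statistics

data = [10, 11, 15, 12, 17, 14,
        15, 12, 16, 15, 12, 14,
        11, 17, 16, 14, 15, 15,
        16, 10, 17, 13, 13, 13,
        16, 13, 11, 16, 18, 14,
        15, 16, 13, 16, 14, 16]

# Range: R = Xmax - Xmin
data_range = max(data) - min(data)

# Interquartile range: IQ = Q3 - Q1 (quartile method is an assumption)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iq = q3 - q1

# Coefficient of variation: CV = S / mean * 100
# Only defined when all values share the same sign
assert all(x > 0 for x in data) or all(x < 0 for x in data)
mean = statistics.mean(data)
s = statistics.stdev(data)      # sample standard deviation (n - 1 divisor)
cv = s / mean * 100

print(data_range, iq, round(cv, 2))
```

Here the CV is well below 50%, so the mean can be taken as representative of these data.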
▪ In the continuous case, the statistics take different values depending on
whether they are calculated using raw data (i.e., individual
observations) or aggregated data (i.e., frequencies).
➢ In the calculation of the statistics, the midpoint of each class (Ci) is used to
approximate all observations from that class
◼ Association measures
Covariance
[Pearson’s] Correlation coefficient
Spearman’s correlation coefficient
◼ Covariance
▪ Measures the magnitude of the linear association between two
variables by measuring the joint variation of X and Y around their
means
❑ Raw data (individual observations, discrete or continuous)
$$S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
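A minimal sketch of the covariance for raw data, using hypothetical values. It also computes Pearson's correlation coefficient with its standard definition, r = S_XY / (S_X S_Y), which is assumed here because the formula is not shown in this part of the slides.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(covariance(x, x))   # covariance of a variable with itself is its variance
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical data with an exact linear relationship (y = 2x)
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

s_xy = covariance(x, y)
r = pearson(x, y)
print(round(s_xy, 4), round(r, 4))
```

The covariance depends on the units of X and Y, whereas the correlation coefficient is unit-free and lies between -1 and 1.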
◼ Example 7
Consider the data in the Stocks sheet of the LU0_Examples Excel file.
▪ Are the monthly returns of Microsoft, GE, Intel, GM and CISCO correlated?
◼ Exercise
Play the game: guess the value of the correlation coefficient at
https://en.wikipedia.org/wiki/Guess_the_Correlation
▪ The rank assigned to each set of duplicates is the average of the ranks
that those tied values would have if they were different from each
other
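▪ General formula (also valid with ties):
$$r' = \frac{\sum_{i=1}^{n} R(x_i)R(y_i) - n\left(\frac{n+1}{2}\right)^2}{\sqrt{\sum_{i=1}^{n} R(x_i)^2 - n\left(\frac{n+1}{2}\right)^2}\;\sqrt{\sum_{i=1}^{n} R(y_i)^2 - n\left(\frac{n+1}{2}\right)^2}}$$
▪ Simplified formula (no ties or few ties), with $d_i = R(x_i) - R(y_i)$:
$$r' = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$$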
◼ Example 8
Data are in the Example8 sheet of the LU0_Examples Excel file
❑ Compute the Spearman’s rho (considering few ties)
Xi Yi
550 80
620 60
580 10
580 20
540 30
◼ Example 8 - solution
$$r' = 1 - \frac{6(22.5)}{5\left(5^2 - 1\right)} = -0.125$$
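The Example 8 solution can be checked with a short sketch. Tied values receive the average of the ranks they would occupy, as described for duplicates. The helper name `average_ranks` is hypothetical, and the general rank formula (the version that remains valid with ties) is included for comparison.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    ordered = sorted(values)
    ranks = []
    for value in values:
        first = ordered.index(value) + 1     # rank of the first tied position
        count = ordered.count(value)         # number of tied values
        ranks.append(first + (count - 1) / 2)
    return ranks

# Data from Example 8
x = [550, 620, 580, 580, 540]
y = [80, 60, 10, 20, 30]
n = len(x)

rank_x = average_ranks(x)
rank_y = average_ranks(y)

# Simplified formula: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
rho_simplified = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# General formula based on the ranks themselves
correction = n * ((n + 1) / 2) ** 2
numerator = sum(a * b for a, b in zip(rank_x, rank_y)) - correction
denominator = (math.sqrt(sum(a * a for a in rank_x) - correction)
               * math.sqrt(sum(b * b for b in rank_y) - correction))
rho_general = numerator / denominator

print(rank_x, rank_y, sum_d2)
print(rho_simplified, round(rho_general, 4))
```

The simplified formula gives -0.125, while the general formula gives a slightly different value because of the tie at 580; with few ties the two results stay close.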
◼ Outlier analysis
▪ Outlier: discordant or extreme value
◼ Boxplot
[Figure: three boxplots. Q3 − Q2 > Q2 − Q1 (positive skew); Q2 − Q1 = Q3 − Q2 (symmetric); Q2 − Q1 > Q3 − Q2 (negative skew)]
◼ Boxplot
▪ Useful for comparing different distributions in the same graph
◼ Outlier analysis
▪ “The simple elimination of a potential outlier should be done with caution and
the most advisable is to carry out the analysis with and without the presence
of that observation. If the conclusions are discordant one should at least be
aware that the outlier significantly affects the conclusions, and so it is best to
report this fact, leaving to the third party the possibility to choose their own
path.”
◼ Example 9
The Example9 sheet of the LU0_Examples Excel file has data on the
duration of extracorporeal circulation (in minutes) of 94 patients
undergoing a heart intervention, between May 1980 and December 1988
at the Hospital de Santa Cruz (Source: Murteira, 1993, pp. 97-98)
❑ Use the graphics facility in Excel 2016 (or later) to produce box plots
1. Select your data – either a single data series, or multiple data series
◼ Example 9
Outlier analysis using fences
❑ In this case, the lower fences, inner (BII) and outer (BEI), are irrelevant
because they are less than zero and the variable is positive
✓ Mean = 139.72
✓ Min = 30
✓ Q1 = 95.5
✓ Median = Q2 = 120
✓ Q3 = 167.5
✓ Max = 403
✓ IQ = 72
✓ BIS (upper inner fence) = 167.5 + 1.5 × 72 = 275.5
✓ BES (upper outer fence) = 167.5 + 3 × 72 = 383.5
OUTLIERS
▪ Moderate: 295 minutes, 300 minutes
▪ Severe: 402 minutes, 403 minutes
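The fence calculation of Example 9 can be sketched as follows, starting from the quartiles reported in the solution. The function name `classify` and the label strings are hypothetical.

```python
# Quartiles reported in the Example 9 solution (minutes)
q1, q3 = 95.5, 167.5
iq = q3 - q1

# Inner fences at 1.5 * IQ and outer fences at 3 * IQ from the quartiles
lower_inner = q1 - 1.5 * iq
lower_outer = q1 - 3 * iq
upper_inner = q3 + 1.5 * iq
upper_outer = q3 + 3 * iq

def classify(value):
    """Label a value as a moderate outlier, a severe outlier, or neither."""
    if value < lower_outer or value > upper_outer:
        return "severe outlier"
    if value < lower_inner or value > upper_inner:
        return "moderate outlier"
    return "not an outlier"

for minutes in (120, 295, 300, 402, 403):
    print(minutes, classify(minutes))
```

Both lower fences are negative here, which is why only the upper fences matter for a duration that cannot be below zero.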
◼ Example 10
Consider the data in the Grades sheet of the LU0_Examples Excel file
▪ Use the Analysis Toolpak add-in to compute descriptive statistics of the grades
from each school
▪ Use the graphics facility in Excel 2016 (or later) to produce a histogram of the
grades from Columbus East
▪ Use the graphics facility in Excel 2016 (or later) to produce a box plot of the
grades by school and subject
Probabilistic model
$$f(x) = \frac{1}{24\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - 830}{24}\right)^2}$$
–– Frequency curve
✓ 23.4% of the cigarettes in the sample weigh between 840 and 860 mg
✓ The probability that any cigarette weighs between 840 and 860 mg is 23.16%
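The model probability can be checked with a short sketch, assuming the probabilistic model is a Normal distribution with mean 830 mg and standard deviation 24 mg, as the density suggests. The cumulative distribution function is written with `math.erf`.

```python
import math

MU, SIGMA = 830.0, 24.0   # assumed parameters of the Normal model (mg)

def standard_normal_cdf(z):
    """P(Z <= z) for a standard Normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return standard_normal_cdf((x - mu) / sigma)

# P(840 < X < 860) computed without rounding
p_exact = normal_cdf(860) - normal_cdf(840)

# Same probability with the standardized values rounded to two decimals,
# as when reading a printed Normal table (z = 1.25 and z = 0.42)
p_table = standard_normal_cdf(1.25) - standard_normal_cdf(0.42)

print(round(p_exact, 4), round(p_table, 4))
```

The unrounded result is about 23.3%, and rounding the standardized values to two decimals gives about 23.16%. That the slide's figure was obtained from a table in this way is an assumption, but it is consistent with the numbers.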
[Diagram: INFERENCE from the Sample (sample mean: $\bar{X}$) to the POPULATION (population mean)]