LU0 Descriptive Statistics

Statistics and Probability Distributions
Ana Cristina Costa

ccosta@novaims.unl.pt
Bachelor’s degree in Information Management
Bachelor’s degree in Information Systems
Bachelor’s degree in Data Science
February 2022
Syllabus
◼ Learning Units
▪ LU0: Descriptive statistics
▪ LU1: Introduction to probability theory
▪ LU2: Probability axioms
▪ LU3: Random variables and distribution functions
▪ LU4: Mathematical expectation and moments
▪ LU5: Specific probability distributions
▪ LU6: Joint distributions

Framework of the Curricular Unit
◼ Math I
▪ Solving equation systems
▪ Real functions
▪ Derivatives
▪ Integral calculation
◼ Math II
▪ Partial derivatives
▪ Integral calculation in ℝ²
◼ Statistics and Probability Distributions
▪ Descriptive statistics
▪ Probability theory
◼ Statistical Inference
▪ Point estimation
▪ Confidence intervals
▪ Hypothesis tests (parametric)
◼ ALL courses and whenever it is necessary to analyse data!

LU0: Descriptive statistics
◼ Introduction to Statistics
◼ Organization of the information
◼ Frequency distributions
◼ Descriptive measures
◼ Introduction to Probability Theory
◼ Topics
▪ Introduction to Statistics
❑ Concepts
❑ Data types and classification of statistical variables
❑ Measurement scales
▪ Organization of the information
❑ Tables
❑ Graphs
▪ Frequency distributions
❑ Discrete variables
❑ Continuous variables
▪ Descriptive measures
❑ Location
❑ Dispersion
❑ Association
❑ Outlier analysis (self-study)
▪ Introduction to Probability Theory

◼ At the end of this learning unit students should be able to

▪ Distinguish the various types of data and their scales of measurement
▪ Organize information on charts and graphs
▪ Build and interpret frequency tables, histograms and boxplots
▪ Calculate and interpret measures of descriptive statistics (location,
dispersion and association)
▪ Identify potential outliers using fences (Tukey boxplots)
▪ Explore and summarize data from different application areas
depending on its different characteristics

◼ Resources on the Internet
▪ Newbold, P., Carlson, W. L., Thorne, B. (2013). Statistics for Business and
Economics. 8th Edition, Boston: Pearson, chapters 1 and 2. (requires VPN connection)
▪ Excel Easy: http://www.excel-easy.com/
▪ Jon Acampora (2015) Interactive Histogram Chart That Uncovers The Details.
https://www.excelcampus.com/charts/interactive-histogram-with-group-details/ (access: Jan 2022)

Introduction to Statistics
◼ Statistics: discipline whose main purpose is the collection,

compilation, analysis and interpretation of data
◼ Statistics help decision-makers to create order and simplicity from the

complexity and chaos of large volumes of data at a time when the amount
of information increases so rapidly

◼ Population
▪ Entire set of elements having one or more common characteristics
Example: Portuguese population, employees, all cars in circulation, Portuguese
SMEs
◼ Statistical Unit
▪ Individual element of the population
▪ Each unit may have one or more characteristics
◼ Sample
▪ Subset of elements from a population for which certain characteristics
are studied

◼ Variable
▪ Each characteristic of a statistical unit corresponds to a variable
❑ A variable is an attribute that can be used to describe a person, place, or
thing
▪ The values that a characteristic can assume are the values that the variable can
take
❑ Different types of variables are analysed using different tools and statistical
techniques
▪ Statistical variables are those that only assume numeric values
❑ Notation: X, Y, Z

◼ Statistical units and variables

▪ Statistical units, or population elements, are the objects of interest in your
study, in other words, what you are collecting your information or data on
▪ Variables are characteristics of those elements or the attributes that you are
measuring
▪ Data are sometimes stored in a table, where the rows are statistical units and
the columns are variables
❑ What variables might you associate with a person?
❑ What variables might you associate with a city?
❑ What variables might you associate with a stock?
❑ What variables might you associate with a car?

◼ Data types and classification of statistical variables
DATA TYPES
▪ QUALITATIVE: Nominal, Ordinal
▪ QUANTITATIVE: Discrete, Continuous

▪ Nominal type: values are identified only by a name, label or code that designates a category or modality, and the categories cannot be sorted in a logical fashion. Hence, the categories cannot be assigned a numerical value.
❑ Examples: classification of individuals by gender (female, male); classification of regions (urban, suburban and rural); type of vegetation; class of soil
▪ Ordinal type: differs from the nominal type in that the categories can be sorted and assigned numerical values.
❑ Examples: grades from an elementary school test (insufficient, sufficient, good); teacher’s evaluation (excellent, good, average, poor); classification of workers (unskilled, specialised, very specialised)
▪ Discrete type (discrete variables): variables take only a finite, or countably infinite, number of values. Typically, values are obtained by counting.
❑ Examples: number of accidents per hour; number of workers in a company; number of children; fire frequency; number of buildings
▪ Continuous type (continuous variables): variables can take an uncountably infinite number of values. Typically, values are obtained by measuring, and may take any value within a range.
❑ Examples: weight and height; time spent on the phone; market share; profit margin

◼ Properties of measurement scales

▪ Identity: each value on the measurement scale has a unique meaning.
▪ Magnitude: values on the measurement scale have an ordered relationship to

one another. That is, some values are larger and some are smaller.
▪ Equal intervals: scale units along the scale are equal to one another. This
means, for example, that the difference between 1 and 2 would be equal to
the difference between 19 and 20.
▪ A minimum value of zero: the zero of the scale corresponds to a meaningful

(unique and non-arbitrary) zero value.
❑ The zero of the scale corresponds to the absence of the characteristic that we are
measuring. This means, the scale has a true zero point, below which no values exist.
Adapted from: StatTrek.com (2017) Scales of Measurement in Statistics.
◼ Measurement scales
▪ Nominal: only satisfies the identity property of measurement. Values assigned
to variables represent a descriptive category, but have no inherent numerical
value with respect to magnitude.
▪ Ordinal: has the property of both identity and magnitude. Each value on the
ordinal scale has a unique meaning, and it has an ordered relationship to every
other value on the scale.
▪ Interval: has the properties of identity, magnitude, and equal intervals.
▪ Ratio: satisfies the properties of identity, magnitude, equal intervals, and a

minimum value of zero.
Adapted from: StatTrek.com (2017) Scales of Measurement in Statistics.
▪ Nominal scale
❑ Numbers are used as labels (names or categories) to identify the measured
objects
❑ The assignment of the numbers to the measured objects is agreed - they do
not reflect the quantity of the observed characteristic but rather their
quality
❑ Variables expressed on the nominal scale can be only "equal" or "different"

from each other
➢ The only mathematical operation allowed is counting (frequencies and
mode statistic)
❑ Examples: car registrations, zip codes, marital status, sex, eye colour,
article code

▪ Ordinal scale
❑ Maintains the characteristics of the nominal scale, but has the ability to
sort the data
❑ Any series of numbers can be used as long as it preserves the order of
relationships between measured objects (numeric values are irrelevant)
➢ In addition to the counting operation, it is possible to identify "positions"

(maximum, minimum, median, etc.)
❑ Examples: social level, salary tier, scales used to measure opinions (Likert
scales)

▪ Interval scale
❑ To the characteristics of the previous scales, it adds the possibility of
determining the distance between the different points of the scale, but
with an arbitrary zero point
❑ The location of 0 is agreed (0 does not mean "absence of")
❑ The numbers used by the interval scales do not allow establishing
proportionality relationships
➢ Sums and differences can be made with the values of these measurement
scales (mean, standard deviation, etc.)
❑ Examples: Celsius and Fahrenheit temperature scales (water freezes at 0°C,

but temperatures get colder than that; ratios are not meaningful since 20°C
cannot be said to be "twice as hot" as 10°C); dates when measured from an
arbitrary period (such as CE – Common Era or Christian Era)

▪ Ratio scale
❑ It has the same properties as the interval scale, but includes an absolute 0
(0 means "absence of")
❑ Allows proportionality relationships to be established
➢ It is possible to do all the arithmetic operations ("all" statistical procedures
are allowed)
➢ Allows the conversion of units of measure (e.g., from km to miles)
❑ Examples: age, salary, price, sales volume, distances

❑ If individual A is 80 kg and individual B is 40 kg it can be stated that
individual A weighs twice as much as individual B, since this ratio remains
constant if we change the unit of measure (or scale) to grams. The same
individual A weighs in the new scale 80000g and individual B 40000g, thus A
continues to weigh twice as much as B.
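The contrast between the interval and ratio scales can be checked numerically. The sketch below reuses the 80 kg / 40 kg and 20°C / 10°C figures from the slides; the variable names and the Fahrenheit conversion helper are illustrative additions, not part of the course material.

```python
# Ratio scale: the ratio of two weights survives a change of unit.
weight_a_kg, weight_b_kg = 80.0, 40.0
ratio_kg = weight_a_kg / weight_b_kg
ratio_g = (weight_a_kg * 1000) / (weight_b_kg * 1000)


# Interval scale: the "ratio" of two temperatures depends on the unit,
# because the zero point of each temperature scale is arbitrary.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


temp_a_c, temp_b_c = 20.0, 10.0
ratio_c = temp_a_c / temp_b_c
ratio_f = celsius_to_fahrenheit(temp_a_c) / celsius_to_fahrenheit(temp_b_c)

print(ratio_kg, ratio_g)  # 2.0 2.0
print(ratio_c, ratio_f)   # 2.0 1.36
```

The weight ratio is 2 in both units, while the temperature "ratio" changes from 2 to 1.36 when moving from Celsius to Fahrenheit, which is why proportionality statements are only meaningful on a ratio scale.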

◼ All data types can be transformed into statistical variables
▪ It is always possible to move from a richer scale to a less sophisticated

scale
▪ But using statistical methodologies in which data has been assumed on

a ratio scale when the data is merely ordinal, for example, is a source
of much nonsense!

Statistics
◼ Descriptive
▪ It consists of the collection, presentation, analysis and interpretation of data through the creation of appropriate instruments: tables, graphs and numerical indicators
◼ Inferential
▪ It allows conclusions to be drawn about a given population from the collected data, particularly from a sample

◼ Statistical inference process
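[Diagram: a Sample is drawn from the POPULATION; Descriptive Statistics summarizes the sample with tables, graphs and numerical indicators; Statistical Inference generalizes from the sample to the population]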

Organization of the information
◼ The presentation of statistical information must be clear and

rigorous
▪ Before being organized and analysed, statistical information is

designated as raw information to mean that it has not yet been
processed by statistical methods
▪ Success in using statistical data depends on how they are presented.

The methods of presentation and description of the data are
fundamental so that the users of the statistical information can
understand it easily and quickly.

◼ Tables
▪ Simple table: represents information relating to only one attribute
▪ Double entry table: represents information relating to two attributes
▪ Etc.

◼ Simple table
Table 1 – Civil employment in Portugal in 1990 by sector of activity
Sector of activity Nr. individuals (in thousands)
Primary Sector 845.1

Secondary Sector 1624.5
Tertiary Sector 2225.4
TOTAL 4695
Source: INE; Inquérito ao Emprego, cited in INE, Portugal Social, p. 41

◼ Simple table
Table 2 – Civil employment in Portugal in 1990 by gender
Gender Nr. individuals (in thousands)
Men 2699.6
Women 1995.4
TOTAL 4695
Source: INE; Inquérito ao Emprego
• How many men work in the secondary sector?

◼ Double entry table
Table 3 – Civil employment in Portugal in 1990 by sector of activity and gender
Sector of activity Men Women TOTAL
Primary Sector 427.2 417.9 845.1

Secondary Sector 1108.0 516.5 1624.5
Tertiary Sector 1164.4 1061.0 2225.4
TOTAL 2699.6 1995.4 4695
Source: INE; Inquérito ao Emprego
• Is it possible to build this table from the previous ones?
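A short check of how the three tables relate, as a sketch using the figures of Table 3 (the dictionary layout and variable names are illustrative choices). The row and column totals of the double entry table reproduce Tables 1 and 2, but the reverse does not hold: the totals alone do not determine the individual cells, so Table 3 cannot be rebuilt from the two simple tables.

```python
# Table 3: civil employment in Portugal in 1990 (thousands), by sector and gender.
table3 = {
    "Primary Sector":   {"Men": 427.2,  "Women": 417.9},
    "Secondary Sector": {"Men": 1108.0, "Women": 516.5},
    "Tertiary Sector":  {"Men": 1164.4, "Women": 1061.0},
}

# Row totals reproduce Table 1 (employment by sector of activity).
by_sector = {
    sector: round(sum(cells.values()), 1) for sector, cells in table3.items()
}

# Column totals reproduce Table 2 (employment by gender).
by_gender = {
    gender: round(sum(cells[gender] for cells in table3.values()), 1)
    for gender in ("Men", "Women")
}

grand_total = round(sum(by_sector.values()), 1)

print(by_sector)    # sector totals: 845.1, 1624.5, 2225.4
print(by_gender)    # gender totals: 2699.6, 1995.4
print(grand_total)  # 4695.0
```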

◼ Example 1
Using the data in the Students sheet of the LU0_Examples Excel file and
PivotTables,
▪ Produce a simple table showing how many students are enrolled in each
subject
➢ How many students are enrolled in Math?
▪ Produce a simple table showing how many students are enrolled in each high
school
➢ How many students are enrolled in Columbus East?
➢ How many students are enrolled in Math at Columbus East?

! Produce a double entry table to answer this question

◼ Principles of tables construction
▪ Title indicating in a precise and synthetic way the subject of the information
▪ Unit of measurement and the period to which the information relates
▪ Designation for rows and columns
▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information

◼ Graphs
▪ Graphs are used to illustrate in a simple and intuitive way the distribution of
information
▪ Initial issues to consider

❑ Is a graph really the best option?
❑ Who is the target audience?
❑ What is the purpose of the graph?
❑ What type of graph should be used?
❑ How should the graph be displayed?
❑ What should be the size of the graph?
❑ Should only one graph be used?

◼ Graphs
▪ Direct comparison between graphs can only be done as long as the scale is the same!
▪ It is common to find graphs in 3D where depth does not describe any variable. Because volume is what causes the greatest difficulties in terms of perception, 3D graphs should be avoided!
▪ Each type of graph is designed for different situations!

❑ Scatterplot
❑ Line chart
❑ Bar chart
❑ Pie chart

◼ Scatterplot
▪ The data are displayed as a collection of points (or another symbol), each
having the value of one variable (X) determining the position on the horizontal
axis and the value of the other variable (Y) determining the position on the
vertical axis
▪ Purpose: analyse the relationship between two quantitative variables

❑ The strength of the relationship depends on the spread of points around the
imagined line. The type of relationship depends on the shape of the imagined line.

◼ Example 2
▪ Using the data in the NationalAccounts sheet of the LU0_Examples Excel file,
produce a scatterplot of the consumption and income data, and use the Add
Trend Line option
➢ What can you conclude from this chart?
Source: INE | BP, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017

◼ Line chart
▪ Displays information as a series of data points connected by straight line
segments
! A dashed line is visually less important than a solid line
▪ Shows trends and changes of one variable as a function of another continuous variable that is represented on the horizontal axis
▪ Time series graph: allows the analysis of trends and changes of one or more statistical variables over a period of time

◼ Line chart
▪ No more than three lines per chart should be included, otherwise the chart becomes difficult to read
❑ It is preferable to split the data across several charts
❑ A different line style should be used for each line, using color, shape, size, or value
❑ A dashed line is visually less important than a solid line!

◼ Example 3
▪ Using the data in the ActivePop sheet of the LU0_Examples Excel file, produce
a line chart of the employment and unemployment data
➢ Is this an appropriate chart for these data?
Source: INE, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017

◼ Example 3
▪ The magnitude of the values of the variables is very different from each other
▪ Both variables have very low relative variation rates, so variations over time
are not very perceptible
✓ Simple solution: produce two separate charts

◼ Example 3
✓ Elaborate solution: include two vertical axes in a single chart
✓ Decide which solution is more appropriate based on the target audience!

◼ Bar chart
▪ The values of the variable are represented by bars whose height (or
length), represents the numeric value of the variable(s)
▪ Neither the area nor the width of the bars are important (they have no
relation to the values of the variable)
▪ In order not to mislead and / or make it difficult to read the graph, the
bars must be all the same width
▪ The gap between the bars should be approximately equal to the width
of the bars

◼ Bar chart
▪ Should be used to represent discrete or qualitative data in absolute or
relative terms, or to compare categories of quantitative variables
▪ Bar charts can replace time series graphs in cases where the data
series is very short
❑ They are also recommended when importance is given to the value of the
variable in each period and we mostly want to compare individual
quantities
❑ For more than one data series, line charts are clearly preferable

◼ Bar chart variants
▪ Simple bar chart: represents only one variable and each bar is
associated with a value
▪ Grouped [multiple or overlapping] bar chart: used to represent values

of different variables in the same category / period
▪ Stacked bar chart / bar chart with components: shows the

decomposition of an aggregate in its components or categories
▪ Stacked bar chart in percentage / bar chart with components in

percentage: places emphasis on change in the structure of the
aggregate and not on the absolute values of the components

◼ Simple bar charts

▪ Categories should have the same order across all charts in a report
❑ Organize categories in ascending or descending order of values
❑ Sort alphabetically (or geographically) the names of the categories
▪ The horizontal bar chart is recommended for variables whose

categories have long designations
▪ The representation of negative values is discouraged in horizontal bar

charts
❑ Conventionally, the negative values are associated with a bar in a
downward position

◼ Grouped bar charts
▪ Used to describe simultaneously two or more categories for a given

variable, and when you want to highlight the value of categories
instead of the total value of the variables or the contribution of
categories to the total
▪ In cases where there are several variables composed of various

categories, it is preferable to build different graphs instead of
accumulating the information in a single graph

◼ Example 4
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Using a double entry PivotTable, produce a bar chart of the number of
students enrolled in each subject, grouped by high school. Suggestion:
use the PivotChart capability.

◼ Stacked bar chart [in percentage]

▪ Stacked bar charts are used in situations analogous to grouped bar
charts, that is, when the data set contains two or more categories
▪ The graph with absolute values is more appropriate when we want to

draw attention to the total value of the variables
▪ The graph with percentage values is more appropriate when we want

to draw attention to the weight of the categories within the
aggregated value

◼ Example 5
▪ Using a double entry PivotTable, produce stacked bar charts of the number of
students enrolled in each subject by high school. Suggestion: use the PivotChart
capability.

◼ Pie chart
▪ Circular graph in which the circle represents the total value of the
aggregate, and each section represents a component
▪ It should only be used to represent nominal data, because it does not

assume any ordering of values or modalities
▪ Too many slices or narrow slices are difficult to interpret, and it is

therefore necessary to supplement the graph with the corresponding
values. Alternatively, a bar chart can be used.

◼ Example 6
▪ Produce pie charts of the distribution of students by high school and by subject

◼ Principles of graphs construction

▪ Title indicating in a precise and synthetic way the subject of the information
▪ Unit of measurement and the period to which the information relates
▪ Designation for the axes
▪ Legend when the graph shows data from more than one variable
▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information

◼ Principles of graphs construction

▪ A chart can correctly represent variables, contain all the necessary
elements and be neither attractive nor easy to read
▪ Final issues to consider

❑ Is the graphic easy to read?
❑ Can the graphic be misinterpreted?
❑ Is the graphic the right size and shape?
❑ Does the graphic benefit from being in colour?
❑ Has understanding of the graphic been tested with anyone?

Frequency distributions
◼ Frequency distribution
▪ Set of all values, or modalities, of a variable and the number of
corresponding occurrences
❑ Can be computed for qualitative or quantitative data
◼ Cumulative frequency distribution
▪ Set of sorted values of the variable and cumulative sum of the previous
frequencies
❑ Can only be computed for ordinal or quantitative data

◼ Frequency table
         Simple frequencies             Cumulative frequencies
Value    Absolute (ni)  Relative (fi)   Absolute (Ni)  Relative (Fi)
x1       n1             f1              N1             F1
…        …              …               …              …
xk       nk             fk              Nk = n         Fk = 1
Σ        n              1

◼ Notation
x1, x2, …, xk → distinct values that the variable X assumes
n → total number of elements in the data collection
ni → (Simple) Absolute frequency – number of observations equal to xi
fi = ni / n → (Simple) Relative frequency – proportion of observations equal to xi
Ni = n1 + ... + ni → Cumulative absolute frequency – number of observations less than or equal to xi
Fi = f1 + ... + fi → Cumulative relative frequency – proportion of observations less than or equal to xi
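As a minimal sketch of this notation, the code below builds a frequency table for a small discrete sample. The sample itself (nine observations taking values 0 to 4) and the variable names are illustrative.

```python
from collections import Counter

data = [0, 0, 0, 0, 1, 2, 2, 3, 4]  # illustrative sample, n = 9
n = len(data)
counts = Counter(data)               # absolute frequencies n_i

rows = []
cum_abs = 0
for x in sorted(counts):
    n_i = counts[x]
    cum_abs += n_i                   # cumulative absolute frequency N_i
    # (x_i, n_i, f_i, N_i, F_i)
    rows.append((x, n_i, n_i / n, cum_abs, cum_abs / n))

print("xi  ni  fi    Ni  Fi")
for x, n_i, f_i, N_i, F_i in rows:
    print(f"{x:<3} {n_i:<3} {f_i:<5.2f} {N_i:<3} {F_i:.2f}")
```

The last row always ends with Nk = n and Fk = 1, which is a quick sanity check on any hand-built table.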

◼ Bar diagram
▪ Graph in which the values of the variable are indicated on the X axis and the respective simple frequencies [absolute or relative] on the Y axis
▪ Frequency values are represented by points joined to the X axis by vertical lines
▪ In practice, it is common to use bar charts

◼ Ladder diagram
▪ Ladder-shaped chart
representing the distribution
of cumulative frequencies
(absolute or relative)

◼ Frequency distribution of a continuous variable
▪ Because it can take an uncountably infinite number of values, a continuous variable forces us to create class intervals that become the modalities of the characteristic under study
❑ The whole range of variable values is divided into groups in the form of intervals
▪ There is no rule that is scientifically grounded and universally accepted

for the construction of classes

◼ Some guidelines for dividing continuous data into classes

▪ The classes should be mutually exclusive, i.e., non-overlapping. No two classes
should contain the same interval of values of the variable.
▪ The classes should be exhaustive, i.e., they must cover the entire range of the
data
▪ The first class should contain the lowest value, just as the last class should
contain the highest value
▪ The number of classes and the width of each class should neither be too small
nor too large
▪ The classes should, preferably, be of equal width, and no class should have
zero frequency

◼ Some guidelines for dividing continuous data into classes
▪ Method of left inclusion: if the value of an observation is equal to the

upper limit of one class and to the lower limit of the following class, it
should be included in the latter. Mathematically, it corresponds to the
interval [Llow , Lupp[
▪ Open-end classes: some values in the data set may be extremely small, and others extremely large, compared to the rest. In that case, the lower limit of the first class and the upper limit of the last class are left unspecified. Such classes are called open-end classes.

◼ Construction of classes
▪ Sturges’ rule (logarithm)
Nr. of observations (n)    Nr. of classes (k)
1                          1
2                          2
3 – 5                      3
6 – 11                     4
12 – 23                    5
24 – 46                    6
47 – 93                    7
94 – 187                   8
188 – 376                  9
377 – 756                  10
▪ Common sense (rule of thumb)
Nr. of observations (n)    Nr. of classes (k)
Less than 50               5
50 – 100                   6 – 8
100 – 200                  8 – 10
200 – 300                  10 – 12
300 – 500                  12 – 15
500 – 1000                 15 – 20
More than 1000             20
❖ Alternatives: https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
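The slides give the Sturges table without the formula behind it. Sturges’ rule is commonly written as k = 1 + log2(n), or approximately k = 1 + 3.3·log10(n). The sketch below assumes the second form, rounded to the nearest integer, because that variant reproduces the class boundaries listed in the table; the function name is hypothetical, and other roundings (for example, rounding up) give slightly different results.

```python
import math


def sturges_classes(n):
    """Number of classes k = 1 + 3.3 * log10(n), rounded to the nearest integer.

    The constant 3.3 and the rounding convention are assumptions chosen
    to match the table in the slides.
    """
    return round(1 + 3.3 * math.log10(n))


for n in (1, 5, 6, 23, 24, 36, 100, 756):
    print(n, sturges_classes(n))
```

For example, a sample of 36 observations gives 6 classes, in line with the 24 – 46 row of the table.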

◼ Construction of classes
▪ k = Number of classes:
$$k = 5 \ \text{ if } n \le 25, \qquad k \approx \sqrt{n} \ \text{ if } n > 25$$
▪ w = Class width: length of the interval (a.k.a. class size)
$$w = \frac{X_{\max} - X_{\min}}{k}$$
❑ Xmax – maximum observed value
❑ Xmin – minimum observed value
▪ C = Class mark: midpoint of a class interval, which is the representative value of the entire class
$$C = \frac{L_{low} + L_{upp}}{2}$$
❑ Llow – lower class limit
❑ Lupp – upper class limit
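The construction of classes can be sketched in a few lines. The data are the 36 study-time values (in hours) used in the mean examples of this unit. The code assumes the rule reads k = 5 for n ≤ 25 and k ≈ √n otherwise; rounding the class width up to a whole number is a convenience choice, not a rule stated in the slides.

```python
import math

hours = [
    40.50, 54.50, 60.00, 75.00, 86.00, 92.25,
    45.25, 52.75, 68.00, 75.75, 88.50, 90.00,
    45.00, 55.00, 70.50, 78.00, 90.25, 94.00,
    47.00, 60.75, 70.00, 80.00, 90.75, 93.25,
    40.75, 58.00, 70.25, 75.50, 95.00, 95.00,
    52.00, 61.25, 72.50, 82.00, 88.50, 95.00,
]

n = len(hours)
k = 5 if n <= 25 else round(math.sqrt(n))   # assumed rule: sqrt(n) for n > 25
w_exact = (max(hours) - min(hours)) / k     # (Xmax - Xmin) / k, about 9.08
w = math.ceil(w_exact)                      # rounded up for readable limits

lower = min(hours)
limits = [(lower + i * w, lower + (i + 1) * w) for i in range(k)]
marks = [(low + upp) / 2 for low, upp in limits]

# Left inclusion: each class is the interval [Llow, Lupp[
counts = [0] * k
for x in hours:
    counts[int((x - lower) // w)] += 1

for (low, upp), c, n_i in zip(limits, marks, counts):
    print(f"[{low}, {upp}[  mark = {c}  frequency = {n_i}")
```

With these choices the classes run from 40.5 to 100.5 in steps of 10, and a value such as 70.50 falls in [70.5, 80.5[ rather than in the class that ends at 70.5.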

◼ Histogram
▪ Graphical representation of the frequency distribution of a continuous
variable by means of rectangles whose widths represent class intervals
and whose areas are proportional to the corresponding frequencies
❑ The height of the rectangles is equal to the (absolute or relative) frequency divided by the class interval’s width
▪ The area of the rectangle of each class is equal to its frequency, and the sum of the areas is equal to n or 1 if it represents absolute frequencies or relative frequencies, respectively
▪ In the case of distributions with classes of equal width, the height of the rectangles can simply be set equal to the frequency, since the shape of the histogram is unchanged
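To see why height must be frequency divided by width when classes are unequal, consider the following sketch. The three classes and their counts are hypothetical, chosen only so that the last class is three times wider than the others.

```python
# Hypothetical classes with unequal widths: ((lower, upper), absolute frequency)
classes = [((0, 10), 10), ((10, 20), 20), ((20, 50), 30)]
n = sum(count for _, count in classes)

heights = []
areas = []
for (low, upp), count in classes:
    width = upp - low
    rel_freq = count / n
    height = rel_freq / width        # height = relative frequency / class width
    heights.append(height)
    areas.append(height * width)     # area equals the relative frequency

print(heights)
print(sum(areas))                    # the areas add up to 1 (up to rounding)
```

The widest class has the largest frequency (30 out of 60), yet its rectangle is no taller than that of the first class: plotting raw frequencies as heights would exaggerate it.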

◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Skewness: distributions are skewed to the side of the long tail
Source: AI hubs (2015) Visualizing Numerical Data. (http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html; accessed: 18 Nov 2017)

◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Modality: prominent peaks determine modality
Source: AI hubs (2015) Visualizing Numerical Data. (http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html; accessed: 18 Nov 2017)

◼ Histogram
▪ May depict extreme values and possible outliers
[Figure: histogram annotated with moderate extremes, heavy extremes and atypical values (outliers? errors?)]

◼ Histogram
▪ How does the number of classes (thus, the width) affect the
histogram?
➢ https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/histograms-2-of-4/
➢ http://www.shodor.org/interactivate/activities/Histogram/
▪ What is the difference between a bar chart and a histogram?
✓ The major difference is that a histogram is only used to plot the frequency
of a continuous data set that has been divided into classes (sometimes
named bins). Bar charts, on the other hand, can be used for a great deal of
other types of variables including ordinal and nominal data sets.

◼ Frequency Polygon
▪ Graph resulting from successively joining, by line segments, the midpoints of
the upper sides of the rectangles of the histogram
▪ The first point is on the x-axis and is
placed in the middle of the interval
which precedes the first bar of the
histogram (frequency=0). The last point
is located on the x-axis in the middle of
the interval immediately following the
last bar of the histogram (frequency=0).
▪ By joining the mid point of each bar and

the x-axis at each end, the surface under
the frequency polygon is exactly the
same as the surface of the histogram.
Therefore, the principle of the histogram
is respected.

◼ Frequency Polygon
▪ Emphasizes the overall
pattern in the data
▪ Especially useful for

comparing sets of data,
and displaying
cumulative frequency
distributions
Source: European Centre for Disease Prevention and Control (n.a.) FEM Wiki:
Frequency polygons. (https://wiki.ecdc.europa.eu/fem/w/wiki/frequency-polygons;
accessed: 20 Nov 2017)

◼ Frequency Curve
▪ A smooth curve which corresponds to the limiting case of a histogram
computed for a frequency distribution of a continuous variable as the
number of observations becomes very large
❑ It is roughly a smoothed Frequency Polygon
Source: Weisstein, Eric W. "Frequency Curve." From MathWorld--A Wolfram Web Resource.
(http://mathworld.wolfram.com/FrequencyCurve.html; accessed: 18 Nov 2017)


◼ Cumulative frequency function

▪ Graph that results from successively joining, by straight segments, the
right upper sides of the rectangles of the “histogram" of accumulated
frequencies

Descriptive measures
◼ Descriptive statistics
▪ Synthesize important information characteristics through a single
number
▪ These measures are classified as

❑ Location
❑ Dispersion
❑ Symmetry
❑ Kurtosis
❑ Association

◼ Location measures
▪ Describe where the data is located (on the x-axis)
❑ Central tendency measures attempt to describe the typical or central value
that best describes the data. The focus is on where the data is centred or
clustered.

◼ Dispersion measures / spread measures

▪ Describe the variability of data in relation to the central value of the
distribution (i.e., how much values differ from the central value of the distribution)
❑ Describe how spread out the data values are around the central tendency

◼ Symmetry measures
▪ Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same
to the left and right of the center point.

◼ Kurtosis measure
▪ Kurtosis refers to how peaked a distribution is or conversely how flat it
is relative to a Normal distribution, which is symmetrical and bell-
shaped
❑ A positive value tells you that you have heavy-tails (i.e., a lot of data in your tails)
❑ A negative value means that you have light-tails (i.e., little data in your tails)

◼ Association measures
▪ Describe the degree of association between two variables
[Figure: scatterplot panels showing negative linear correlation, positive linear correlation, no correlation, and non-linear relations (quadratic model? exponential or power model?)]

◼ Some location measures
▪ Central tendency: Mean, Median, Mode
▪ Non-central tendency: Quartiles, Deciles, Percentiles

◼ Mean
▪ Can be thought of as the centre of mass of the values of the observations, i.e. the point of equilibrium when the observations are placed on a ruler
▪ Extreme values may “push away” the mean from the most typical values
▪ As the data becomes skewed the mean loses its ability to provide the best central
location for the data because the skewed data is dragging it away from the typical values

◼ Mean – Example 1/4

▪ Suppose you asked a graduate if she was a good student in the degree.
She answers with the final grades of each course:
10 11 15 12 17 14
15 12 16 15 12 14
11 17 16 14 15 15
16 10 17 13 13 13
16 13 11 16 18 14
15 16 13 16 14 16
▪ What is the average grade?

✓ The student obtained an average grade of approximately 14 (511/36 ≈ 14.19)


▪ Suppose she answers with the frequency distribution of the grades:
Grades Nr. of courses
10 2
11 3
12 3
13 5
14 5
15 6
16 8
17 3
18 1
▪ What is the average grade?

✓ The student obtained an average grade of approximately 14 (511/36 ≈ 14.19)


▪ Suppose she answers with the time she studied for each course (in hours):
40.50 54.50 60.00 75.00 86.00 92.25

45.25 52.75 68.00 75.75 88.50 90.00
45.00 55.00 70.50 78.00 90.25 94.00
47.00 60.75 70.00 80.00 90.75 93.25
40.75 58.00 70.25 75.50 95.00 95.00
52.00 61.25 72.50 82.00 88.50 95.00
▪ What is the average number of hours of study?

✓ The student studied on average 71.91 hours per course


▪ Suppose she answers with the frequency distribution of the time she
studied for each course (in hours):
Hours of study Nr. of courses

40.5 − 50.5 5
50.5 − 60.5 6
60.5 − 70.5 5
70.5 − 80.5 7
80.5 − 90.5 6
90.5 − 100.5 7
▪ What is the average number of hours of study?

✓ The student studied on average approximately 72.17 hours per course
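The four averages in this example can be reproduced with a short script. It is a sketch: the variable names are illustrative, and the grouped mean uses the class marks (midpoints) as stand-ins for the individual observations, which is why it differs slightly from the mean of the raw hours.

```python
grades = [
    10, 11, 15, 12, 17, 14,
    15, 12, 16, 15, 12, 14,
    11, 17, 16, 14, 15, 15,
    16, 10, 17, 13, 13, 13,
    16, 13, 11, 16, 18, 14,
    15, 16, 13, 16, 14, 16,
]
# 1) Raw data: sum of the observations divided by n
mean_raw = sum(grades) / len(grades)

# 2) Frequency data (discrete): sum of relative frequency times value
grade_counts = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
n = sum(grade_counts.values())
mean_freq = sum((n_i / n) * x for x, n_i in grade_counts.items())

hours = [
    40.50, 54.50, 60.00, 75.00, 86.00, 92.25,
    45.25, 52.75, 68.00, 75.75, 88.50, 90.00,
    45.00, 55.00, 70.50, 78.00, 90.25, 94.00,
    47.00, 60.75, 70.00, 80.00, 90.75, 93.25,
    40.75, 58.00, 70.25, 75.50, 95.00, 95.00,
    52.00, 61.25, 72.50, 82.00, 88.50, 95.00,
]
# 3) Raw data (continuous)
mean_hours = sum(hours) / len(hours)

# 4) Frequency data (continuous): relative frequency times class mark
hour_classes = {
    (40.5, 50.5): 5, (50.5, 60.5): 6, (60.5, 70.5): 5,
    (70.5, 80.5): 7, (80.5, 90.5): 6, (90.5, 100.5): 7,
}
n_classes = sum(hour_classes.values())
mean_grouped = sum(
    (n_i / n_classes) * (low + upp) / 2 for (low, upp), n_i in hour_classes.items()
)

print(round(mean_raw, 2), round(mean_freq, 2))       # 14.19 14.19
print(round(mean_hours, 2), round(mean_grouped, 2))  # 71.91 72.17
```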

◼ Mean
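▪ Raw data (individual observations, discrete or continuous)
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
▪ Frequency data (discrete case), where f_i is the relative frequency of the value X_i
$$\bar{X} = \sum_{i=1}^{k} f_i X_i$$
▪ Frequency data (continuous case), where C_i is the class mark of class i
$$\bar{X} = \sum_{i=1}^{k} f_i C_i$$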

◼ Mode – discrete case

▪ The mode, or modal value, is the most frequent value of the data set
❑ It is the only central location measure that can be used for nominal data
❑ It may have no meaning in discrete data with few repeated observations

◼ Mode – continuous case

▪ It represents a value in the highest bar on a histogram. It can be
determined by a formula, but only if the classes are of equal width.
▪ King’s formula
1. Identify the Modal Class (i.e., interval with the highest frequency)
2. Calculate the following expression
f ∗∗
Mode = Llow + ∗ 𝐰
f + f ∗∗ i
Llow → lower limit of the modal class
wi → width of the modal class
f* → frequency of the class before the modal class
f** → frequency of the class after the modal class
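As an illustration, the two steps of King's formula can be sketched in Python (an assumed language; the function name `king_mode` is hypothetical). The class frequencies are those of the cigarette-weight table in the Introduction to Probability Theory part of this unit, whose classes all have width 20.

```python
def king_mode(lower_limit, width, freq_before, freq_after):
    """King's formula: L_low + f** / (f* + f**) * w."""
    return lower_limit + freq_after / (freq_before + freq_after) * width

# (lower limit, upper limit, frequency) for equal-width classes
classes = [(760, 780, 4), (780, 800, 43), (800, 820, 118), (820, 840, 168),
           (840, 860, 117), (860, 880, 39), (880, 900, 11)]

# Step 1: modal class = class with the highest frequency
m = max(range(len(classes)), key=lambda i: classes[i][2])
low, high, _ = classes[m]

# Step 2: apply the formula (this sketch assumes the modal class
# is neither the first nor the last class)
mode = king_mode(low, high - low, classes[m - 1][2], classes[m + 1][2])
print(round(mode, 2))  # 829.96
```

The result is close to 830 mg, in line with the summary values reported for that table.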

◼ Mode
▪ It is easily affected by small changes in frequency of discrete data, or
class construction in the continuous case
▪ It is possible for a set of data values to have none or more than one
mode
Fig. source: AI hubs (2015) Visualizing Numerical Data (http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html; accessed: 18 Nov 2017)
◼ Median – Raw data (discrete or continuous case)
▪ Middle value of the data set when it has been arranged in ascending
order
❑ If the sample has an odd size, it coincides with the central observation
❑ If the sample has even size, the median takes the value of the average of
the two most central observations
▪ At least 50% of the observations are less than or equal to the median

◼ Median – Raw data

▪ Examples
1.2 1.7 2.1 2.2 2.4
❑ n = 5 (odd)
❑ Median = observation located in position (n+1)/2
✓ Median = 2.1
0 0 1 2 2 3
❑ n = 6 (even)
❑ Median = average of the values in positions n/2 and n/2+1
✓ Median = (1+2)/2 = 1.5

◼ Median – Frequency of discrete data

▪ Example
❑ At least 50% of the observations are less than or equal to the median: here it is the first value xi whose cumulative relative frequency Fi exceeds 0.5
xi Fi
0 0.44
1 0.56 → Median = 1
2 0.78
3 0.89
4 1

◼ Median – Frequency of discrete data

▪ Examples
❑ n odd
xi ni Fi
0 4 0.44
1 1 0.56
2 2 0.78
3 1 0.89
4 1 1
Total 9
✓ Median = 1 (value in position (n+1)/2 = 5)
❑ n even
xi ni Fi
0 4 0.4
1 1 0.5
2 2 0.7
3 1 0.8
4 2 1
Total 10
✓ Median = average of the values in positions n/2 (Fi = 50%) and n/2+1: Median = (1+2)/2 = 1.5
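The position rule can be sketched directly on a frequency table. This is only an illustration in Python (an assumed language); `median_from_counts` is a hypothetical helper, and the result is cross-checked against the standard library's `statistics.median` on the expanded raw data.

```python
from statistics import median

def median_from_counts(counts):
    """Median of discrete data given as {value: absolute frequency}."""
    n = sum(counts.values())
    # 1-based positions of the central observation(s) in the sorted data
    positions = [(n + 1) // 2] if n % 2 else [n // 2, n // 2 + 1]
    found, cumulative = [], 0
    for value in sorted(counts):
        start = cumulative + 1
        cumulative += counts[value]      # cumulative absolute frequency
        found += [value for p in positions if start <= p <= cumulative]
    return sum(found) / len(found)

odd = {0: 4, 1: 1, 2: 2, 3: 1, 4: 1}     # n = 9
even = {0: 4, 1: 1, 2: 2, 3: 1, 4: 2}    # n = 10
print(median_from_counts(odd), median_from_counts(even))  # 1.0 1.5

# Cross-check: expand the table back into individual observations
raw_even = [v for v, c in even.items() for _ in range(c)]
print(median(raw_even))  # 1.5
```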

◼ Quartiles and percentiles

▪ Quartiles → values that split a sorted data set into 4 equal parts
▪ Percentiles → values that split a sorted data set into 100 equal parts
❑ The first quartile (Q1) is the 25th percentile: 25% of the observations are below Q1
❑ The second quartile (Q2 = Median) is the 50th percentile
❑ The third quartile (Q3) is the 75th percentile: 75% of the observations are below Q3
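[Figure: sorted data from Xmin to Xmax with Q1, Q2 (median) and Q3 marked; 25% of the observations lie below Q1 and 75% above it, 50% lie on each side of Q2, and 75% lie below Q3 and 25% above it]
! Different software packages use slightly different formulations to compute these values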
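To make the remark about differing formulations concrete, here is a small sketch using Python's standard library (an assumed tool; the slides rely on Excel). `statistics.quantiles` offers two interpolation rules, `"exclusive"` and `"inclusive"`; on the six-value sample from the median example they agree on Q2 but not necessarily on Q1 and Q3.

```python
from statistics import quantiles

data = [0, 0, 1, 2, 2, 3]

# Two interpolation rules available in the standard library
q_exclusive = quantiles(data, n=4, method="exclusive")
q_inclusive = quantiles(data, n=4, method="inclusive")

# Each list holds [Q1, Q2, Q3]
print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

Neither rule is "the" correct one; when reporting quartiles it helps to state which software and method produced them.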
◼ Central tendency measures
▪ A symmetric distribution is one where the left- and right-hand sides of the
distribution are roughly equally balanced around the mean
Source: https://www.siyavula.com/read/maths/grade-11/statistics/11-statistics-05 (accessed: 17 Jun 2020)

▪ Positive skew: Mean > Median
▪ Negative skew: Mean < Median
[Figure: two skewed frequency curves with the positions of the mode, median and mean marked]
Source: Frost, J. (2015) S1: Chapter 4 Representation of Data. www.drfrostmaths.com, 20th September 2015
▪ These rules on comparing the mean and median values can fail in multimodal distributions, or in distributions where one tail is long but the other is heavy

▪ Mean
→ Uses all data, but sensitive to extreme values
▪ Median
→ Does not use all data, but it is robust
→ The quartiles, including the median, are robust location measures because
they are not affected by extreme values
▪ Mode
→ Easily affected by small changes in frequency
➢ Do not use the MODE Excel function with continuous data


▪ When there is a very extreme value (outlier), report:
(1) the median, or
(2) the median and the mean
▪ If you think the outlier does not belong in the data set (i.e., it was an error), then consider also reporting the mean without the outlier.
Adapted from: What to Report When There is an Outlier by Robert G. Kelley, www.miracosta.edu/home/rkelley (accessed 2018)
◼ Dispersion measures
▪ Some dispersion measures
❑ Range
❑ Interquartile range
❑ Mean absolute deviation
❑ Variance and standard deviation
❑ Coefficient of variation

◼ Range
▪ Difference between the highest and the lowest value
❑ It is measured in the same units as the data
❑ It is most useful in representing the dispersion of small data sets
R = Xmax – Xmin
◼ Interquartile Range
▪ Difference between the 3rd quartile and the 1st quartile (encompasses
50% of the central observations)
IQ = Q3 – Q1

◼ Mean Absolute Deviation

▪ Measures the degree of dispersion of the values around the mean
▪ To prevent positive deviations from cancelling negative deviations, it is computed as the average of the absolute deviations:
$$\frac{1}{n}\sum_{i=1}^{n} \left|X_i - \bar{X}\right|$$

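◼ Mean Absolute Deviation

▪ Raw data (individual observations, discrete or continuous)
$$\frac{1}{n}\sum_{i=1}^{n} \left|X_i - \bar{X}\right|$$
▪ Frequency data (discrete case)
$$\sum_{i=1}^{k} f_i \left|X_i - \bar{X}\right|$$
▪ Frequency data (continuous case)
$$\sum_{i=1}^{k} f_i \left|C_i - \bar{X}\right|$$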
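A short sketch of the raw-data and discrete-frequency versions, using the grades of the student example. Python and the variable names are assumptions made for illustration; f_i is taken to be the relative frequency n_i/n.

```python
# Frequency table of the 36 grades, expanded back into raw observations
grade_counts = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
grades = [x for x, n_i in grade_counts.items() for _ in range(n_i)]
n = len(grades)
mean = sum(grades) / n

# Raw data: (1/n) * sum of |X_i - mean|
mad_raw = sum(abs(x - mean) for x in grades) / n

# Frequency data (discrete case): sum of f_i * |X_i - mean|
mad_freq = sum((n_i / n) * abs(x - mean) for x, n_i in grade_counts.items())

print(round(mad_raw, 4), round(mad_freq, 4))
```

Both versions give the same value, as expected for discrete data.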

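◼ Sample Variance (S²)
▪ Raw data (individual observations, discrete or continuous)
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$
▪ Frequency data (discrete case)
$$S^2 = \frac{n}{n-1}\sum_{i=1}^{k} f_i \left(X_i - \bar{X}\right)^2 = \frac{n}{n-1}\left(\sum_{i=1}^{k} f_i X_i^2 - \bar{X}^2\right)$$
▪ Frequency data (continuous case)
$$S^2 = \frac{n}{n-1}\sum_{i=1}^{k} f_i \left(C_i - \bar{X}\right)^2 = \frac{n}{n-1}\left(\sum_{i=1}^{k} f_i C_i^2 - \bar{X}^2\right)$$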
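The equivalence between the definition and the shortcut form, and between raw and frequency data in the discrete case, can be checked numerically. The sketch below is an illustration in Python (an assumed language) on the grades of the student example; `statistics.variance` is the standard library's sample variance (divisor n − 1).

```python
from statistics import variance

grade_counts = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
grades = [x for x, n_i in grade_counts.items() for _ in range(n_i)]
n = len(grades)
mean = sum(grades) / n

# Raw data: definition and shortcut form
s2_def = sum((x - mean) ** 2 for x in grades) / (n - 1)
s2_short = (sum(x ** 2 for x in grades) - n * mean ** 2) / (n - 1)

# Frequency data (discrete case), with f_i = n_i / n
f = {x: n_i / n for x, n_i in grade_counts.items()}
s2_freq = n / (n - 1) * sum(f_i * (x - mean) ** 2 for x, f_i in f.items())
s2_freq_short = n / (n - 1) * (sum(f_i * x ** 2 for x, f_i in f.items()) - mean ** 2)

print(round(s2_def, 4), round(s2_short, 4),
      round(s2_freq, 4), round(s2_freq_short, 4),
      round(variance(grades), 4))
```

All five numbers should agree up to floating-point rounding.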

◼ Sample Standard Deviation (S)

▪ It is equal to the square root of the variance, thus it also measures the
degree of dispersion of the values around the mean
▪ The variance has the disadvantage of being expressed in the square of the units in which the variable is measured
▪ The standard deviation is expressed in the same units as the variable
▪ Like the mean, the standard deviation and the variance can be strongly affected by extreme values

◼ Coefficient of Variation (a.k.a. relative standard deviation)

▪ Measure of relative variability. It allows comparing the degree of dispersion, around the mean, of different distributions.
$$CV = \frac{S}{\bar{X}} \times 100$$
▪ It can only be calculated when the variable takes values of a single sign, i.e. the values are all positive or all negative
❑ It may not have any meaning for data on an interval scale
▪ CV > 50% indicates that the mean has low representativeness, thus the median should [also] be used to characterise the typical values
❑ The lower the CV value, the more representative the mean
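Because the CV is unit-free, it can compare the dispersion of the grades with that of the study hours from the student example. The following is a minimal sketch in Python (an assumed language); `coefficient_of_variation` is a hypothetical helper that also enforces the single-sign condition stated above.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = S / mean * 100, for data whose values all share one sign."""
    if not (all(v > 0 for v in values) or all(v < 0 for v in values)):
        raise ValueError("CV requires all-positive or all-negative values")
    return stdev(values) / mean(values) * 100

grades = [10, 11, 15, 12, 17, 14, 15, 12, 16, 15, 12, 14,
          11, 17, 16, 14, 15, 15, 16, 10, 17, 13, 13, 13,
          16, 13, 11, 16, 18, 14, 15, 16, 13, 16, 14, 16]
hours = [40.50, 54.50, 60.00, 75.00, 86.00, 92.25,
         45.25, 52.75, 68.00, 75.75, 88.50, 90.00,
         45.00, 55.00, 70.50, 78.00, 90.25, 94.00,
         47.00, 60.75, 70.00, 80.00, 90.75, 93.25,
         40.75, 58.00, 70.25, 75.50, 95.00, 95.00,
         52.00, 61.25, 72.50, 82.00, 88.50, 95.00]

cv_grades = coefficient_of_variation(grades)
cv_hours = coefficient_of_variation(hours)
print(round(cv_grades, 1), round(cv_hours, 1))
```

Both values are below 50%, so under the rule above the mean is a reasonable summary for either variable; the study hours are relatively more dispersed than the grades.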

◼ Statistics computed from raw data vs frequencies

▪ In the discrete case, the same values are obtained for the statistics
when they are calculated using raw data (i.e., individual observations)
or aggregated data (i.e., frequencies).
▪ In the continuous case, different values are generally obtained for the statistics when they are calculated using raw data (i.e., individual observations) or aggregated data (i.e., frequencies).
➢ In the calculation of the statistics, the midpoint of each class (Ci) is used to approximate all observations from that class
➢ Different frequency distributions → different midpoints → different approximations to the (true) values of the descriptive statistics
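The effect of class construction can be seen on the study hours of the student example. The sketch below (Python, an assumed language; `grouped_mean` is a hypothetical helper) groups the same raw data into left-closed classes of width 10 and of width 20, both starting at 40.5, and compares the resulting means with the raw-data mean.

```python
hours = [40.50, 54.50, 60.00, 75.00, 86.00, 92.25,
         45.25, 52.75, 68.00, 75.75, 88.50, 90.00,
         45.00, 55.00, 70.50, 78.00, 90.25, 94.00,
         47.00, 60.75, 70.00, 80.00, 90.75, 93.25,
         40.75, 58.00, 70.25, 75.50, 95.00, 95.00,
         52.00, 61.25, 72.50, 82.00, 88.50, 95.00]

def grouped_mean(values, start, width):
    """Mean approximated from classes [start + j*width, start + (j+1)*width)."""
    counts = {}
    for v in values:
        j = int((v - start) // width)        # index of the class containing v
        counts[j] = counts.get(j, 0) + 1
    n = len(values)
    # Every observation is replaced by the midpoint C_j of its class
    return sum(n_j / n * (start + (j + 0.5) * width) for j, n_j in counts.items())

raw_mean = sum(hours) / len(hours)
mean_w10 = grouped_mean(hours, 40.5, 10)
mean_w20 = grouped_mean(hours, 40.5, 20)
print(round(raw_mean, 2), round(mean_w10, 2), round(mean_w20, 2))
```

The width-10 grouping reproduces the frequency table of the example; the width-20 grouping is an arbitrary alternative chosen only to show that the approximation changes.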

◼ Association measures
Some association measures
Covariance
[Pearson’s] Correlation coefficient
Spearman’s correlation coefficient

◼ Covariance
▪ Measures the magnitude of the linear association between two
variables by measuring the joint variation of X and Y around their
means
❑ Raw data (individual observations, discrete or continuous)
$$S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
❑ Depends on the units in which the variables X and Y are measured
❑ If the variables X and Y are independent, then the covariance is zero. However, the converse is not true in general.
❑ The sign indicates whether the relationship is positive or negative
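A small sketch of the raw-data formula in Python (an assumed language; `sample_covariance` and the data are hypothetical). The second data set illustrates why a zero covariance does not imply independence: Y is completely determined by X, but the relationship is not linear.

```python
def sample_covariance(xs, ys):
    """S_XY = 1/(n-1) * sum of (X_i - mean_X) * (Y_i - mean_Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, chosen only for illustration
x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]    # increasing linear relationship
y_square = [v ** 2 for v in x]       # Y depends on X, but not linearly

print(sample_covariance(x, y_linear))  # 5.0
print(sample_covariance(x, y_square))  # 0.0
```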

◼ [Pearson’s] Correlation coefficient

▪ Measures the strength of the linear association between two continuous variables X and Y
$$r = \frac{S_{XY}}{S_X S_Y}, \quad -1 \le r \le 1$$
❑ When the variables X and Y are independent, the correlation coefficient is zero. However, the converse is not true in general.
❑ The sign indicates whether the relationship is positive or negative
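Unlike the covariance, r does not depend on the measurement units. The sketch below (Python, an assumed language; `pearson_r` and the height/weight values are hypothetical) computes r for the same data with heights in centimetres and in metres.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = S_XY / (S_X * S_Y), computed from raw paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical paired observations
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 91]

r_cm = pearson_r(height_cm, weight_kg)
r_m = pearson_r([h / 100 for h in height_cm], weight_kg)  # heights in metres
print(round(r_cm, 4), round(r_m, 4))
```

Changing the unit rescales S_XY and S_X by the same factor, so the ratio is unchanged.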

◼ Example 7
Consider the data in the Stocks sheet of the LU0_Examples Excel file.
▪ Are the monthly returns of Microsoft, GE, Intel, GM and CISCO correlated?
           Microsoft  GE      Intel   GM      CISCO
Microsoft  1
GE         0.4450     1
Intel      0.5165     0.3235  1
GM         0.0688     0.3796  0.3174  1
CISCO      0.5128     0.3755  0.4885  0.1593  1
◼ Exercise
Play the game: guess the value of the correlation coefficient at
https://en.wikipedia.org/wiki/Guess_the_Correlation
◼ Spearman’s correlation coefficient

▪ Measures the strength of a monotonic relationship between two
variables X and Y (discrete or continuous)
▪ The measurement scale of the variables X and Y must be at least ordinal
▪ It is a measure of association (not correlation), but its interpretation is similar to that of Pearson’s
❑ −1 ≤ r’ ≤ 1
❑ When the variables X and Y are independent, the coefficient is zero. However, the converse is not true in general.
▪ It is also known as Spearman’s rho

◼ Spearman’s correlation coefficient

▪ Monotonic function
❑ A monotonic function is a function which is either entirely nonincreasing or
nondecreasing

◼ Spearman’s correlation coefficient – calculation

▪ Paired sample : (x1, y1), (x2, y2) , …, (xn, yn)
▪ Sort the n observations of each variable, separately, and associate them with their corresponding rank
❑ R(xi) → rank of the value xi (corrected for ties)
❑ R(yi) → rank of the value yi (corrected for ties)
▪ The rank assigned to each set of duplicates is the average of the ranks
that those tied values would have if they were different from each
other

◼ Spearman’s correlation coefficient – calculation

▪ The Spearman’s rank correlation is the Pearson’s correlation coefficient
on the ranks of the data; i.e., the formulation is applied to the paired
ranks [R(xi), R(yi)]
$$r' = \frac{\sum_{i=1}^{n} R(x_i)R(y_i) - n\left(\frac{n+1}{2}\right)^2}{\sqrt{\left[\sum_{i=1}^{n} R(x_i)^2 - n\left(\frac{n+1}{2}\right)^2\right]\left[\sum_{i=1}^{n} R(y_i)^2 - n\left(\frac{n+1}{2}\right)^2\right]}}$$
▪ Samples without ties (or with few ties)

$$r' = 1 - \frac{6\sum_{i=1}^{n}\left[R(x_i) - R(y_i)\right]^2}{n\left(n^2 - 1\right)}$$
◼ Example 8
Data are in the Example8 sheet of the LU0_Examples Excel file
❑ Compute the Spearman’s rho (considering few ties)
Xi Yi
550 80
620 60
580 10
580 20
540 30

◼ Example 8 - solution
xi Rank of xi yi Rank of yi Square of the difference between ranks (di²)
550 2 80 5 (2 – 5)²
620 5 60 4 (5 – 4)²
580 3.5 10 1 (3.5 – 1)²
580 3.5 20 2 (3.5 – 2)²
540 1 30 3 (1 – 3)²
Sum = 22.5
$$r' = 1 - \frac{6(22.5)}{5\left(5^2 - 1\right)} = -0.125$$
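The solution can be reproduced with a short sketch (Python, an assumed language; `average_ranks` and `pearson_r` are hypothetical helpers). It computes tie-corrected ranks, the shortcut formula used above, and, for comparison, Pearson's coefficient applied to the ranks, which is the general definition when ties are present.

```python
from math import sqrt

def average_ranks(values):
    """Ascending ranks 1..n; tied values share the average of their ranks."""
    order = sorted(values)
    return [sum(i + 1 for i, v in enumerate(order) if v == x) / order.count(x)
            for x in values]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [550, 620, 580, 580, 540]
y = [80, 60, 10, 20, 30]
rx, ry = average_ranks(x), average_ranks(y)
n = len(x)

# Shortcut formula (samples without ties or with few ties)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho_shortcut = 1 - 6 * d2 / (n * (n ** 2 - 1))

# Pearson's coefficient on the paired ranks
rho_ranks = pearson_r(rx, ry)

print(rx, d2, rho_shortcut)  # [2.0, 5.0, 3.5, 3.5, 1.0] 22.5 -0.125
print(round(rho_ranks, 4))   # -0.1539
```

With one tie in only five observations the two versions differ noticeably (−0.125 vs about −0.154); the gap shrinks as the share of tied values becomes small.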
◼ Outlier analysis
▪ Outlier: discordant or extreme value
▪ We have to explain this value by further analysis of its cause or origin

❑ If it was due to human/sensor errors when measuring or recording data,
then it should be corrected or removed
❑ If extreme values are a characteristic of the attribute (high skewness),
some authors do not consider them as outliers
❑ Outliers may indicate data points that belong to a different population
than the rest of the sample set
▪ In cases of extreme observations, the typical values must be analysed using robust statistics (median and IQ) instead of the usual ones (mean and standard deviation)
◼ Boxplot (a.k.a. box and whisker diagram)

▪ Graphical representation of location measures
▪ Useful for describing dispersion and skewness using robust statistics
◼ Boxplot
▪ Positive skewness: Q3 − Q2 > Q2 − Q1
▪ Symmetry: Q2 − Q1 = Q3 − Q2
▪ Negative skewness: Q2 − Q1 > Q3 − Q2
◼ Boxplot
▪ Useful for comparing different distributions in the same graph
▪ Potential outliers may be plotted as individual points (Tukey boxplot)

◼ Outliers detection using fences

▪ Lower inner fence: BII = Q1 – 1.5 IQ
▪ Upper inner fence: BIS = Q3 + 1.5 IQ
▪ Lower outer fence: BEI = Q1 – 3 IQ
▪ Upper outer fence: BES = Q3 + 3 IQ
❑ v1 → lower adjacent value to Q1; v2 → upper adjacent value to Q3
❑ Moderate outlier: observation between an inner fence and the corresponding outer fence
❑ Severe outlier: observation beyond an outer fence

◼ Outlier analysis
▪ “The simple elimination of a potential outlier should be done with caution and
the most advisable is to carry out the analysis with and without the presence
of that observation. If the conclusions are discordant one should at least be
aware that the outlier significantly affects the conclusions, and so it is best to
report this fact, leaving to the third party the possibility to choose their own
path.”
▪ “The elimination of a potential outlier is inappropriate when the observed variable has a distribution with heavy tails, in the framework of which the outliers are natural. For some authors, almost certain identification of outliers is generally only possible for samples with 500 or more observations; thus, when working with small samples, the most prudent policy is to single out some values and give them, or ask that they be given, special attention.”
(Murteira 1993, p. 100)
◼ Example 9
The Example9 sheet of the LU0_Examples Excel file has data on the
duration of extracorporeal circulation (in minutes) of 94 patients
undergoing a heart intervention, between May 1980 and December 1988
at the Hospital de Santa Cruz (Source: Murteira, 1993, pp. 97-98)
❑ Investigate the existence of outliers using fences
❑ Use the graphics facility in Excel 2016 (or later) to produce box plots
1. Select your data – either a single data series, or multiple data series
2. Click Insert > Insert Statistic Chart > Box and Whisker
Murteira, B., Ribeiro, C.S., Silva, J.A., Pimenta, C. (2010). Introdução à Estatística. Lisboa: Escolar Editora
◼ Example 9
Outlier analysis using fences
❑ In this case, the lower fences, inner (BII) and outer (BEI), are irrelevant
because they are less than zero and the variable is positive
✓ Mean = 139.72
✓ Min = 30
✓ Q1 = 95.5
✓ Median = Q2 = 120
✓ Q3 = 167.5
✓ Max = 403
✓ IQ = 72
✓ BIS = 167.5 + 1.5 × 72 = 275.5
✓ BES = 167.5 + 3 × 72 = 383.5
OUTLIERS
❑ Moderate: 295 minutes, 300 minutes
❑ Severe: 402 minutes, 403 minutes
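The fence rules can be sketched as follows, using the quartiles reported above (Python is an assumed language; `tukey_fences` and `classify` are hypothetical helpers, and only the five observations named on this slide are checked because the full data set lives in the Excel file).

```python
def tukey_fences(q1, q3):
    """Inner and outer fences from the quartiles (IQ = Q3 - Q1)."""
    iq = q3 - q1
    return {"BII": q1 - 1.5 * iq, "BIS": q3 + 1.5 * iq,
            "BEI": q1 - 3 * iq, "BES": q3 + 3 * iq}

def classify(value, fences):
    """Label a value as a severe outlier, a moderate outlier, or neither."""
    if value < fences["BEI"] or value > fences["BES"]:
        return "severe"
    if value < fences["BII"] or value > fences["BIS"]:
        return "moderate"
    return "not an outlier"

fences = tukey_fences(q1=95.5, q3=167.5)
print(fences)
for minutes in (30, 295, 300, 402, 403):
    print(minutes, classify(minutes, fences))
```

Both lower fences come out negative, which is why they play no role for a duration variable.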

◼ Example 10
Consider the data in the Grades sheet of the LU0_Examples Excel file
▪ Use the Analysis Toolpak add-in to compute descriptive statistics of the grades
from each school
▪ Use the graphics facility in Excel 2016 (or later) to produce a histogram of the
grades from Columbus East
▪ Use the graphics facility in Excel 2016 (or later) to produce a box plot of the
grades by school and subject
Source: Microsoft (2017) Create a box and whisker chart. https://support.office.com/en-us/article/Create-a-box-and-whisker-chart-62f4219f-db4b-4754-aca8-4743f6190f0d (accessed: 20 Nov 2017)
Introduction to Probability Theory
◼ Empirical distributions and probability distributions

▪ Distribution of the weight of 500 cigarettes "SG Filter"
❑ Mean ≈ Mode ≈ Median ≈ 830 mg; Standard deviation ≈ 23.63 mg
Weight (mg)  Nr. of cigarettes  Proportion of cigarettes
760 – 780    4                  0.008
780 – 800    43                 0.086
800 – 820    118                0.236
820 – 840    168                0.336
840 – 860    117                0.234
860 – 880    39                 0.078
880 – 900    11                 0.022
Total        500                1
Source: Murteira, B. (1993). Análise Exploratória de Dados – Estatística Descritiva. Portugal: McGraw-Hill, p. 34.

▪ From the frequency distribution, can we conclude that
❑ There are no cigarettes weighing less than 760 mg?
❑ The average weight of cigarettes on the market is equal to 830 mg?
❑ If another sample of 500 cigarettes was taken, the average weight of cigarettes would still be exactly 830 mg?
▪ Solution: obtain a mathematical model of the frequency distribution
❑ It is an algebraic expression that describes the relative frequency (height of the frequency curve) for all possible values of the variable
❑ It is called a probabilistic model or probability distribution


▪ If we assume that the true mean weight of cigarettes is μ = 830 mg and the standard deviation is σ = 24 mg, then we can formulate the following probabilistic model: Normal distribution with parameters μ = 830 and σ = 24
$$f(x) = \frac{1}{24\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-830}{24}\right)^{2}}$$
[Figure: frequency curve of the probabilistic model; horizontal axis: weight (mg), from 770 to 890]

▪ The random variable X represents the weight of "SG Filter" cigarettes. We may assume that the probability distribution of X is the Normal distribution with parameters μ = 830 and σ = 24.
❑ The parameters of probability distributions, also called population parameters, are generally represented by Greek letters (e.g., μ, σ)
❑ If we change the values of the parameters of a distribution, the appearance of the model’s graph changes. Parameters allow the same distribution to be used to describe a vast set of real phenomena.
✓ 23.4% of the cigarettes in the sample weigh between 840 and 860 mg
✓ The probability that any cigarette weighs between 840 and 860 mg is 23.16%
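The comparison between sample proportions and model probabilities can be sketched with Python's standard library (an assumed tool; `statistics.NormalDist` is available from Python 3.8). The model parameters are the ones assumed above, 830 and 24.

```python
from statistics import NormalDist

# Probabilistic model: Normal distribution with mean 830 and std. deviation 24
model = NormalDist(mu=830, sigma=24)

# Probability of a weight between 840 and 860 mg under the model
p_model = model.cdf(860) - model.cdf(840)
print(round(p_model, 4))

# Sample proportion vs model probability for every class of the table
classes = [(760, 780, 4), (780, 800, 43), (800, 820, 118), (820, 840, 168),
           (840, 860, 117), (860, 880, 39), (880, 900, 11)]
n = sum(count for _, _, count in classes)
for low, high, count in classes:
    p_class = model.cdf(high) - model.cdf(low)
    print(f"{low}-{high}: sample {count / n:.3f}, model {p_class:.3f}")
```

Computed without rounding, the model probability for 840 to 860 mg comes out at about 0.233. The 23.16% quoted above appears consistent with a table lookup in which the standardised values are rounded to two decimals (0.42 and 1.25).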


▪ The frequency distribution is an empirical concept that, in most cases,
concerns a sample. Hence, it is also named the empirical distribution of
X.
▪ The probability distribution is a theoretical concept, regarding the population, and should be considered a mathematical model of reality.
❑ The probability of an event can be understood as the relative frequency of this event in a theoretical population model
❑ We use a lowercase letter x to designate a specific value of the population X

◼ Statistical inference process
▪ Parameter – a number that describes the population (e.g., mean: μ)
▪ Statistic – a number that describes the sample (e.g., sample mean: X̄)
[Diagram: POPULATION (mean: μ) and Sample (sample mean: X̄), linked by INFERENCE]
