
Statistics and Probability

Distributions

Ana Cristina Costa


ccosta@novaims.unl.pt

Bachelor’s degree in Information Management
Bachelor’s degree in Information Systems
Bachelor’s degree in Data Science

February 2022

Syllabus

◼ Learning Units
▪ LU0: Descriptive statistics
▪ LU1: Introduction to probability theory
▪ LU2: Probability axioms
▪ LU3: Random variables and distribution functions
▪ LU4: Mathematical expectation and moments
▪ LU5: Specific probability distributions
▪ LU6: Joint distributions



Framework of the Curricular Unit

Math I
• Solving equation systems
• Real functions
• Derivatives
• Integral calculation

Math II
• Partial derivatives
• Integral calculation in ℝ²

Statistics and Probability Distributions
• Descriptive statistics
• Probability theory

Statistical Inference
• Point estimation
• Confidence intervals
• Hypothesis tests (parametric)

ALL courses and whenever it is necessary to analyse data!



LU0: Descriptive statistics

◼ Introduction to Statistics
◼ Organization of the information
◼ Frequency distributions
◼ Descriptive measures
◼ Introduction to Probability Theory
LU0: Descriptive statistics

◼ Topics
▪ Introduction to Statistics
❑ Concepts
❑ Data types and classification of statistical variables
❑ Measurement scales

▪ Organization of the information
❑ Tables
❑ Graphs

▪ Frequency distributions
❑ Discrete variables
❑ Continuous variables

▪ Descriptive measures
❑ Location
❑ Dispersion
❑ Association
❑ Outlier analysis (self-study)

▪ Introduction to Probability Theory



LU0: Descriptive statistics

◼ At the end of this learning unit students should be able to


▪ Distinguish the various types of data and their scales of measurement
▪ Organize information on charts and graphs
▪ Build and interpret frequency tables, histograms and boxplots
▪ Calculate and interpret measures of descriptive statistics (location,
dispersion and association)
▪ Identify potential outliers using fences (Tukey boxplots)
▪ Explore and summarize data from different application areas
depending on its different characteristics



LU0: Descriptive statistics

◼ Resources on the Internet

▪ Newbold, P., Carlson, W. L., Thorne, B. (2013). Statistics for Business and
Economics. 8th Edition, Boston: Pearson, chapters 1 and 2. (requires VPN connection)

▪ Excel Easy: http://www.excel-easy.com/

▪ Jon Acampora (2015) Interactive Histogram Chart That Uncovers The Details.
https://www.excelcampus.com/charts/interactive-histogram-with-group-details/
(access: Jan 2022)



Introduction to Statistics

◼ Statistics: discipline whose main purpose is the collection,
compilation, analysis and interpretation of data

◼ Statistics help decision-makers to create order and simplicity from the
complexity and chaos of large volumes of data at a time when the amount
of information increases so rapidly



Introduction to Statistics

◼ Population
▪ Entire set of elements having one or more common characteristics
Example: Portuguese population, employees, all cars in circulation, Portuguese
SMEs

◼ Statistical Unit
▪ Individual element of the population
▪ Each unit may have one or more characteristics

◼ Sample
▪ Subset of elements from a population for which certain characteristics
are studied



Introduction to Statistics

◼ Variable
▪ Each characteristic of a statistical unit corresponds to a variable
❑ A variable is an attribute that can be used to describe a person, place, or
thing

▪ The values that a characteristic can assume are the values that the variable can
take
❑ Different types of variables are analysed using different tools and statistical
techniques

▪ Statistical variables are those that only assume numeric values

❑ Notation: X, Y, Z



Introduction to Statistics

◼ Statistical units and variables


▪ Statistical units, or population elements, are the objects of interest in your
study, in other words, what you are collecting your information or data on

▪ Variables are characteristics of those elements or the attributes that you are
measuring

▪ Data are sometimes stored in a table, where the rows are statistical units and
the columns are variables
❑ What variables might you associate with a person?
❑ What variables might you associate with a city?
❑ What variables might you associate with a stock?
❑ What variables might you associate with a car?



Introduction to Statistics

◼ Data types and classification of statistical variables

DATA TYPES
▪ QUALITATIVE
❑ Nominal
❑ Ordinal
▪ QUANTITATIVE
❑ Discrete
❑ Continuous



Introduction to Statistics

◼ Nominal type
▪ When values are only identified by a name, label or code that designates a
category or modality, which cannot be sorted in a logical fashion. Hence, the
categories cannot be assigned a numerical value.
▪ Examples
❑ Classification of individuals by gender (female, male)
❑ Classification of regions (urban, suburban and rural)
❑ Type of vegetation; class of soil

◼ Ordinal type
▪ Differ from the nominal values by the possibility of sorting the
categories, and assigning numerical values.
▪ Examples
❑ Grades from an elementary school test (insufficient, sufficient, good)
❑ Teacher’s evaluation (excellent, good, average, poor)
❑ Classification of workers (unskilled, specialised, very specialised)



Introduction to Statistics

◼ Discrete type: discrete variables
▪ When variables take only a finite, or a countably infinite, number of
values. Typically, values are obtained by counting.
▪ Examples
❑ Number of accidents per hour
❑ Number of workers in a company
❑ Number of children
❑ Fire frequency
❑ Number of buildings

◼ Continuous type: continuous variables
▪ When variables can take an infinite non-countable number of values.
Typically, values are obtained by measuring, and may take any value
within a range.
▪ Examples
❑ Weight and height
❑ Time spent on the phone
❑ Market share
❑ Profit margin



Introduction to Statistics

◼ Properties of measurement scales


▪ Identity: each value on the measurement scale has a unique meaning.

▪ Magnitude: values on the measurement scale have an ordered relationship to
one another. That is, some values are larger and some are smaller.

▪ Equal intervals: scale units along the scale are equal to one another. This
means, for example, that the difference between 1 and 2 would be equal to
the difference between 19 and 20.

▪ A minimum value of zero: the zero of the scale corresponds to a meaningful
(unique and non-arbitrary) zero value.
❑ The zero of the scale corresponds to the absence of the characteristic that we are
measuring. This means, the scale has a true zero point, below which no values exist.

Adapted from: StatTrek.com (2017) Scales of Measurement in Statistics.

Introduction to Statistics

◼ Measurement scales
▪ Nominal: only satisfies the identity property of measurement. Values assigned
to variables represent a descriptive category, but have no inherent numerical
value with respect to magnitude.

▪ Ordinal: has the property of both identity and magnitude. Each value on the
ordinal scale has a unique meaning, and it has an ordered relationship to every
other value on the scale.

▪ Interval: has the properties of identity, magnitude, and equal intervals.

▪ Ratio: satisfies the properties of identity, magnitude, equal intervals, and a
minimum value of zero.

Adapted from: StatTrek.com (2017) Scales of Measurement in Statistics.

Introduction to Statistics

◼ Measurement scales
▪ Nominal scale
❑ Numbers are used as labels (names or categories) to identify the measured
objects
❑ The assignment of the numbers to the measured objects is a matter of
convention: they do not reflect the quantity of the observed characteristic
but rather its quality

❑ Variables expressed on the nominal scale can be only "equal" or "different"
from each other
➢ The only mathematical operation allowed is counting (frequencies and
mode statistic)

❑ Examples: car registrations, zip codes, marital status, sex, eye colour,
article code



Introduction to Statistics

◼ Measurement scales
▪ Ordinal scale
❑ Maintains the characteristics of the nominal scale, but has the ability to
sort the data
❑ Any series of numbers can be used as long as it preserves the order of
relationships between measured objects (numeric values are irrelevant)

➢ In addition to the counting operation, it is possible to identify "positions"
(maximum, minimum, median, etc.)

❑ Examples: social level, salary tier, scales used to measure opinions (Likert
scales)



Introduction to Statistics

◼ Measurement scales
▪ Interval scale
❑ To the characteristics of the previous scales, it adds the possibility of
determining the distance between the different points of the scale, but
with an arbitrary zero point
❑ The location of 0 is a convention (0 does not mean "absence of")
❑ The numbers used by interval scales do not allow proportionality
relationships to be established
➢ Sums and differences can be made with the values of these measurement
scales (mean, standard deviation, etc.)

❑ Examples: Celsius and Fahrenheit temperature scales (water freezes at 0°C,
but temperatures get colder than that; ratios are not meaningful since 20°C
cannot be said to be "twice as hot" as 10°C); dates when measured from an
arbitrary epoch (such as CE – Common Era or Christian Era)



Introduction to Statistics

◼ Measurement scales
▪ Ratio scale
❑ It has the same properties as the interval scale, but includes an absolute 0
(0 means "absence of")
❑ Allows proportionality relationships to be established
➢ It is possible to do all the arithmetic operations ("all" statistical procedures
are allowed)
➢ Allows the conversion of units of measure (e.g., from km to miles)

❑ Examples: age, salary, price, sales volume, distances


❑ If individual A is 80 kg and individual B is 40 kg it can be stated that
individual A weighs twice as much as individual B, since this ratio remains
constant if we change the unit of measure (or scale) to grams. The same
individual A weighs in the new scale 80000g and individual B 40000g, thus A
continues to weigh twice as much as B.
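
The contrast between ratio and interval scales can be checked numerically. The sketch below is only an illustration: it reuses the 80 kg / 40 kg and 20°C / 10°C values from the slides, and the function names are hypothetical. A ratio survives a change of unit on a ratio scale (kg to g) but not on an interval scale (°C to °F), whereas differences on the interval scale remain proportional.

```python
def kg_to_g(kg: float) -> float:
    """Change of unit on a ratio scale: pure multiplication, zero stays zero."""
    return kg * 1000.0


def celsius_to_fahrenheit(c: float) -> float:
    """Change of unit on an interval scale: multiplication plus a shift of the zero."""
    return c * 9.0 / 5.0 + 32.0


# Ratio scale: A weighs twice as much as B in either unit
weight_ratio_kg = 80 / 40
weight_ratio_g = kg_to_g(80) / kg_to_g(40)

# Interval scale: the "ratio" depends on the unit, so it has no meaning
temp_ratio_c = 20 / 10
temp_ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Differences, however, are preserved up to the constant factor 9/5
diff_c = 20 - 10
diff_f = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)

print(weight_ratio_kg, weight_ratio_g)  # identical ratios
print(temp_ratio_c, temp_ratio_f)       # different ratios
print(diff_c, diff_f)
```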



Introduction to Statistics

◼ All data types can be transformed into statistical variables

▪ It is always possible to move from a richer scale to a less sophisticated
scale

▪ But using statistical methodologies that assume data on a ratio scale
when the data is merely ordinal, for example, is a source of much nonsense!



Introduction to Statistics

Statistics

▪ Descriptive: consists of the collection, presentation, analysis and
interpretation of data through the creation of appropriate instruments:
tables, graphs and numerical indicators

▪ Inferential: allows conclusions to be drawn about a given population
from the collected data, particularly from a sample



Introduction to Statistics

◼ Statistical inference process

[Diagram with the labels: POPULATION; Sample; Descriptive Statistics (tables,
graphs, numerical indicators); Statistical Inference]



Organization of the information

◼ The presentation of statistical information must be clear and rigorous

▪ Before being organized and analysed, statistical information is
designated as raw information to mean that it has not yet been
processed by statistical methods

▪ Success in using statistical data depends on how they are presented.
The methods of presentation and description of the data are
fundamental so that the users of the statistical information can
understand it easily and quickly.



Organization of the information

◼ Tables

▪ Simple table: represents information relating to only one attribute

▪ Double entry table: represents information relating to two attributes

▪ Etc.



Organization of the information

◼ Simple table

Table 1 – Civil employment in Portugal in 1990 by sector of activity

Sector of activity     Nr. individuals (in thousands)
Primary Sector                    845.1
Secondary Sector                 1624.5
Tertiary Sector                  2225.4
TOTAL                            4695

Source: INE; Inquérito ao Emprego, cited in INE, Portugal Social, p. 41



Organization of the information

◼ Simple table

Table 2 – Civil employment in Portugal in 1990 by gender

Gender     Nr. individuals (in thousands)
Men                  2699.6
Women                1995.4
TOTAL                4695

Source: INE; Inquérito ao Emprego

• How many men work in the secondary sector?



Organization of the information

◼ Double entry table

Table 3 – Civil employment in Portugal in 1990 by sector of activity and gender
(thousands of individuals)

Sector of activity        Men      Women      TOTAL
Primary Sector          427.2      417.9      845.1
Secondary Sector       1108.0      516.5     1624.5
Tertiary Sector        1164.4     1061.0     2225.4
TOTAL                  2699.6     1995.4     4695

Source: INE; Inquérito ao Emprego

• Is it possible to build this table from the previous ones?
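
One way to explore this question is to compute the margins of the double entry table. The sketch below copies the values of Table 3 and shows that its row and column totals reproduce the two simple tables. It then builds a second, deliberately invented table (`alt`, not real data) with the same margins, which suggests that the simple tables alone do not determine the double entry table. The function and variable names are hypothetical.

```python
def margins(table):
    """Row (sector) and column (gender) totals of a double entry table."""
    by_sector, by_gender = {}, {}
    for (sector, gender), value in table.items():
        by_sector[sector] = by_sector.get(sector, 0.0) + value
        by_gender[gender] = by_gender.get(gender, 0.0) + value
    # Round to one decimal to hide floating-point noise
    return (
        {k: round(v, 1) for k, v in by_sector.items()},
        {k: round(v, 1) for k, v in by_gender.items()},
    )


# Values of Table 3, in thousands of individuals
table3 = {
    ("Primary", "Men"): 427.2, ("Primary", "Women"): 417.9,
    ("Secondary", "Men"): 1108.0, ("Secondary", "Women"): 516.5,
    ("Tertiary", "Men"): 1164.4, ("Tertiary", "Women"): 1061.0,
}

# Invented alternative: move 10 thousand between cells without changing any total
alt = dict(table3)
alt[("Primary", "Men")] += 10
alt[("Primary", "Women")] -= 10
alt[("Secondary", "Men")] -= 10
alt[("Secondary", "Women")] += 10

print(margins(table3))                  # totals by sector and by gender
print(margins(alt) == margins(table3))  # same margins, different cells
```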



Organization of the information

◼ Example 1
Using the data in the Students sheet of the LU0_Examples Excel file and
PivotTables,
▪ Produce a simple table showing how many students are enrolled in each
subject
➢ How many students are enrolled in Math?

▪ Produce a simple table showing how many students are enrolled in each high
school
➢ How many students are enrolled in Columbus East?

➢ How many students are enrolled in Math at Columbus East?


! Produce a double entry table to answer this question
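
Outside Excel, the same kind of counting can be sketched in a few lines of Python. The records below are invented for illustration: only "Math" and "Columbus East" come from the exercise, the other subject and school names are hypothetical, and the counts are not those of the LU0_Examples file.

```python
from collections import Counter

# Hypothetical (subject, high school) records; NOT the real Students sheet
students = [
    ("Math", "Columbus East"),
    ("Math", "Columbus East"),
    ("Math", "Columbus North"),
    ("Science", "Columbus East"),
    ("Science", "Columbus North"),
    ("English", "Columbus North"),
]

# Simple tables: one attribute at a time
by_subject = Counter(subject for subject, _ in students)
by_school = Counter(school for _, school in students)

# Double entry table: both attributes at once
by_subject_and_school = Counter(students)

print(by_subject["Math"])                                # 3
print(by_school["Columbus East"])                        # 3
print(by_subject_and_school[("Math", "Columbus East")])  # 2
```

The last count cannot be read off the two simple tables, which is why the exercise asks for a double entry table.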



Organization of the information

◼ Principles of tables construction

▪ Title indicating in a precise and synthetic way the subject of the information

▪ Unit of measurement and the period to which the information relates

▪ Designation for rows and columns

▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information



Organization of the information

◼ Graphs
▪ Graphs are used to illustrate in a simple and intuitive way the distribution of
information

▪ Initial issues to consider


❑ Is a graph really the best option?
❑ Who is the target audience?
❑ What is the purpose of the graph?
❑ What type of graph should be used?
❑ How should the graph be displayed?
❑ What should be the size of the graph?
❑ Should only one graph be used?



Organization of the information

◼ Graphs
▪ Direct comparison between graphs can only be done as long as the
scale is the same!

▪ It is common to find 3D graphs in which depth does not describe any variable.
Because volume is the visual property that is hardest to perceive accurately,
3D graphs should be avoided!

▪ Each type of graph is designed for different situations!


❑ Scatterplot
❑ Line chart
❑ Bar chart
❑ Pie chart



Organization of the information

◼ Scatterplot
▪ The data are displayed as a collection of points (or other symbols), each
having the value of one variable (X) determining the position on the horizontal
axis and the value of the other variable (Y) determining the position on the
vertical axis

▪ Purpose: analyse the relationship between two quantitative variables


❑ The strength of the relationship depends on the spread of points around the
imagined line. The type of relationship depends on the shape of the imagined line.



Organization of the information

◼ Example 2
▪ Using the data in the NationalAccounts sheet of the LU0_Examples Excel file,
produce a scatterplot of the consumption and income data, and use the Add
Trend Line option
➢ What can you conclude from this chart?

Source: INE | BP, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017



Organization of the information

◼ Line chart
▪ Displays information as a series of data points connected by straight line
segments
! A dashed line is visually less important than a solid line

▪ Shows trends and changes in one variable as a function of another, continuous
variable that is represented on the horizontal axis

▪ Time series graph: allows the analysis of trends and changes in one or more
statistical variables over a period of time



Organization of the information

◼ Line chart
▪ No more than three lines should be included per chart, otherwise the
chart becomes difficult to read

❑ It is preferable to split the data across several charts

❑ A different line style should be used for each line, using color, shape, size,
or value

❑ A dashed line is visually less important than a solid line!



Organization of the information

◼ Example 3
▪ Using the data in the ActivePop sheet of the LU0_Examples Excel file, produce
a line chart of the employment and unemployment data
➢ Is this an appropriate chart for these data?

Source: INE, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017



Organization of the information

◼ Example 3
▪ The magnitude of the values of the variables is very different from each other
▪ Both variables have very low relative variation rates, so variations over time
are not very perceptible
✓ Simple solution: produce two separate charts

Source: INE, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017



Organization of the information

◼ Example 3
✓ Elaborate solution: include two vertical axes in a single chart

✓ Decide which solution is more appropriate based on the target audience!

Source: INE, PORDATA (www.pordata.pt). Accessed: 16 Nov 2017



Organization of the information

◼ Bar chart
▪ The values of the variable are represented by bars whose height (or
length), represents the numeric value of the variable(s)

▪ Neither the area nor the width of the bars is important (they have no
relation to the values of the variable)

▪ In order not to mislead and / or make it difficult to read the graph, the
bars must be all the same width

▪ The gap between the bars should be approximately equal to the width
of the bars



Organization of the information

◼ Bar chart
▪ Should be used to represent discrete or qualitative data in absolute or
relative terms, or to compare categories of quantitative variables

▪ Bar charts can replace time series graphs in cases where the data
series is very short

❑ They are also recommended when importance is given to the value of the
variable in each period and we mostly want to compare individual
quantities

❑ For more than one data series, line charts are clearly preferable



Organization of the information

◼ Bar chart variants

▪ Simple bar chart: represents only one variable and each bar is
associated with a value

▪ Grouped [multiple or overlapping] bar chart: used to represent values
of different variables in the same category / period

▪ Stacked bar chart / bar chart with components: shows the
decomposition of an aggregate into its components or categories

▪ Stacked bar chart in percentage / bar chart with components in
percentage: places emphasis on change in the structure of the
aggregate and not on the absolute values of the components



Organization of the information

◼ Simple bar charts


▪ Categories should have the same order across all charts in a report
❑ Organize categories in ascending or descending order of values

❑ Sort alphabetically (or geographically) the names of the categories

▪ The horizontal bar chart is recommended for variables whose
categories have long designations

▪ The representation of negative values is discouraged in horizontal bar
charts
❑ Conventionally, the negative values are associated with a bar in a
downward position



Organization of the information

◼ Grouped bar charts

▪ Used to describe simultaneously two or more categories for a given
variable, and when you want to highlight the value of categories
instead of the total value of the variables or the contribution of
categories to the total

▪ In cases where there are several variables composed of various
categories, it is preferable to build different graphs instead of
accumulating the information in a single graph



Organization of the information

◼ Example 4
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Using a double entry PivotTable, produce a bar chart of the number of
students enrolled in each subject, grouped by high school. Suggestion:
use the PivotChart capability.



Organization of the information

◼ Stacked bar chart [in percentage]


▪ Stacked bar charts are used in situations analogous to grouped bar
charts, that is, when the data set contains two or more categories

▪ The graph with absolute values is more appropriate when we want to
draw attention to the total value of the variables

▪ The graph with percentage values is more appropriate when we want
to draw attention to the weight of the categories within the
aggregated value



Organization of the information

◼ Example 5
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Using a double entry PivotTable, produce stacked bar charts of the number of
students enrolled in each subject by high school. Suggestion: use the PivotChart
capability.



Organization of the information

◼ Pie chart
▪ Circular graph in which the circle represents the total value of the
aggregate, and each section represents a component

▪ It should only be used to represent nominal data, because it does not
assume any ordering of values or modalities

▪ Too many slices or narrow slices are difficult to interpret, and it is
therefore necessary to supplement the graph with the corresponding
values. Alternatively, a bar chart can be used.



Organization of the information

◼ Example 6
Consider the data in the Students sheet of the LU0_Examples Excel file
▪ Produce pie charts of the distribution of students by high school and by subject



Organization of the information

◼ Principles of graphs construction


▪ Title indicating in a precise and synthetic way the subject of the information

▪ Unit of measurement and the period to which the information relates

▪ Designation for axes

▪ Legend when the graph shows data from more than one variable

▪ Source of information
❑ To give credit to the author(s)
❑ To enable the reader to control the reliability of the information
❑ To enable the reader to know where he/she can get additional information



Organization of the information

◼ Principles of graphs construction


▪ A chart can correctly represent variables and contain all the necessary
elements, and still be neither attractive nor easy to read

▪ Final issues to consider


❑ Is the graphic easy to read?
❑ Can the graphic be misinterpreted?
❑ Is the graphic the right size and shape?
❑ Does the graphic benefit from being in colour?
❑ Has understanding of the graphic been tested with anyone?



Frequency distributions

◼ Frequency distribution
▪ Set of all values, or modalities, of a variable and the number of
corresponding occurrences

❑ Can be computed for qualitative or quantitative data

◼ Cumulative frequency distribution

▪ Set of sorted values of the variable and cumulative sum of the previous
frequencies

❑ Can only be computed for ordinal or quantitative data



Frequency distributions

◼ Frequency table

         Simple frequencies                 Cumulative frequencies
         Absolute (ni)    Relative (fi)     Absolute (Ni)    Relative (Fi)
x1       n1               f1                N1               F1
…        …                …                 …                …
xk       nk               fk                Nk = n           Fk = 1
Σ        n                1



Frequency distributions

◼ Notation
x1, x2, …, xk → distinct values that the variable X assumes

n → total number of elements in the data collection

ni → (Simple) Absolute frequency – number of observations equal to xi

fi = ni / n → (Simple) Relative frequency – proportion of observations
equal to xi

Ni = n1 + ... + ni → Cumulative absolute frequency – sum of the counts of
values equal to or less than xi

Fi = f1 + ... + fi → Cumulative relative frequency – sum of the proportions of
values equal to or less than xi
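
The notation above translates directly into a short computation. The following is a minimal sketch with an invented data set (a hypothetical "number of children" variable for 10 households); each row of `table` holds (xi, ni, fi, Ni, Fi).

```python
from collections import Counter

# Hypothetical discrete data: number of children in 10 households
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]

n = len(data)           # total number of observations
counts = Counter(data)  # absolute frequency n_i of each distinct value x_i

table = []
cumulative = 0
for x_i in sorted(counts):
    n_i = counts[x_i]
    cumulative += n_i                 # N_i = n_1 + ... + n_i
    table.append((x_i, n_i, n_i / n,  # f_i = n_i / n
                  cumulative, cumulative / n))  # F_i = N_i / n

for row in table:
    print(row)
```

The last row should always end with Ni = n and Fi = 1, which is a quick check that no observation was lost.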



Frequency distributions

◼ Bar diagram
▪ Graph in which the X axis shows the values of the variable and the Y axis
the respective simple frequencies [absolute or relative]

▪ Frequency values are represented by points, which are joined by lines
to the X axis

▪ In practice, it is common to use bar charts



Frequency distributions

◼ Ladder diagram

▪ Ladder-shaped chart representing the distribution of cumulative
frequencies (absolute or relative)



Frequency distributions

◼ Frequency distribution of a continuous variable

▪ Because it can take an uncountably infinite number of values, a
continuous variable forces us to create class intervals that become the
modalities of the characteristic under study
❑ The whole range of variable values is divided into groups in the form
of intervals

▪ There is no rule that is scientifically grounded and universally accepted
for the construction of classes



Frequency distributions

◼ Some guidelines for dividing continuous data into classes


▪ The classes should be mutually exclusive, i.e., non-overlapping. No two classes
should contain the same interval of values of the variable.

▪ The classes should be exhaustive, i.e., they must cover the entire range of the
data

▪ The first class should contain the lowest value, just as the last class should
contain the highest value

▪ The number of classes and the width of each class should neither be too small
nor too large

▪ The classes should, preferably, be of equal width, and no class should have
zero frequency



Frequency distributions

◼ Some guidelines for dividing continuous data into classes

▪ Method of left inclusion: if the value of an observation is equal to the
upper limit of one class and to the lower limit of the following class, it
should be included in the latter. Mathematically, each class corresponds to
the interval [Llow, Lupp[

▪ Open-end classes: some values in the data set may be extremely small, or
extremely large, compared with the other values. In that case, the lower
limit of the first class and the upper limit of the last class are left
unspecified. Such classes are called open-end classes.



Frequency distributions

◼ Construction of classes
Sturges’ rule (logarithm)
Nr. of observations (n)     Nr. of classes (k)
1                           1
2                           2
3 – 5                       3
6 – 11                      4
12 – 23                     5
24 – 46                     6
47 – 93                     7
94 – 187                    8
188 – 376                   9
377 – 756                   10

Common sense (rule of thumb)
Nr. of observations (n)     Nr. of classes (k)
Less than 50                5
50 – 100                    6 – 8
100 – 200                   8 – 10
200 – 300                   10 – 12
300 – 500                   12 – 15
500 – 1000                  15 – 20
More than 1000              20

❖ Alternatives: https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
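
The Sturges table can also be generated rather than looked up. Sturges' rule is usually written as k = 1 + log2(n), which is approximately 1 + 3.322·log10(n). The ranges tabulated here appear to match rounding 1 + 3.3·log10(n) to the nearest integer; that coefficient is inferred from the table and is not stated on the slide, so the sketch below should be read as an assumption. The function name is hypothetical.

```python
import math


def sturges_classes(n: int) -> int:
    """Number of classes k for n observations.

    Assumption: k = 1 + 3.3 * log10(n), rounded to the nearest integer,
    which reproduces the tabulated ranges.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.floor(1 + 3.3 * math.log10(n) + 0.5)


# Upper end of each tabulated range of n
for n in (1, 2, 5, 11, 23, 46, 93, 187, 376, 756):
    print(n, sturges_classes(n))
```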



Frequency distributions

◼ Construction of classes
▪ k = Number of classes:

$$k = \begin{cases} 5 & \text{if } n \le 25 \\ \approx \sqrt{n} & \text{if } n > 25 \end{cases}$$

▪ w = Class width: length of the interval (a.k.a. class size)

$$w = \frac{X_{\max} - X_{\min}}{k}$$

❑ Xmax – maximum observed value
❑ Xmin – minimum observed value

▪ C = Class mark: midpoint of a class interval, which is the representative value of
the entire class

$$C = \frac{L_{low} + L_{upp}}{2}$$

❑ Llow – lower class limit
❑ Lupp – upper class limit
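
The three quantities above (k, w and C) can be computed together. This is a minimal sketch under stated assumptions: the data are invented heights in cm, the function name is hypothetical, and for n > 25 the square root is rounded to the nearest integer, which is one possible reading of "approximately".

```python
import math


def build_classes(values, k=None):
    """Return (k, w, classes); each class is a tuple (L_low, L_upp, class mark)."""
    n = len(values)
    if k is None:
        # k = 5 if n <= 25, otherwise roughly sqrt(n) (rounding is an assumption)
        k = 5 if n <= 25 else round(math.sqrt(n))
    x_min, x_max = min(values), max(values)
    w = (x_max - x_min) / k  # class width
    classes = []
    for i in range(k):
        l_low = x_min + i * w
        l_upp = x_min + (i + 1) * w
        classes.append((l_low, l_upp, (l_low + l_upp) / 2))  # mark = midpoint
    return k, w, classes


# Hypothetical heights in cm (n = 20)
heights = [150, 152, 155, 158, 160, 161, 163, 165, 166, 168,
           170, 171, 173, 175, 176, 178, 180, 182, 185, 190]

k, w, classes = build_classes(heights)
print(k, w)
for l_low, l_upp, mark in classes:
    print(f"[{l_low}, {l_upp}[  mark = {mark}")
```

With the method of left inclusion the last class would need to be closed on the right (or slightly widened) so that the maximum observed value is not left out.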



Frequency distributions

◼ Histogram
▪ Graphical representation of the frequency distribution of a continuous
variable by means of rectangles whose widths represent class intervals
and whose areas are proportional to the corresponding frequencies
❑ The height of the rectangles is equal to the (absolute or relative)
frequency divided by the class interval’s width

▪ The area of the rectangle of each class is equal to its frequency and the
sum of the areas is equal to n or 1 if it represents absolute frequencies
or relative frequencies, respectively

▪ In the case of distributions with classes of equal width, the height of the
rectangles can simply be set equal to the frequency
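
The role of the class width is easiest to see with classes of unequal width. The sketch below uses invented classes and frequencies: heights are obtained as frequency divided by width, and the areas add up to n (absolute frequencies) or to 1 (relative frequencies).

```python
# Hypothetical classes [L_low, L_upp[ of unequal width, with absolute frequencies
classes = [(0, 10, 4), (10, 20, 10), (20, 40, 12), (40, 80, 4)]
n = sum(n_i for _, _, n_i in classes)

rows = []
for l_low, l_upp, n_i in classes:
    width = l_upp - l_low
    rows.append({
        "class": (l_low, l_upp),
        "width": width,
        "height_abs": n_i / width,        # absolute frequency / width
        "height_rel": (n_i / n) / width,  # relative frequency / width
    })

area_abs = sum(r["height_abs"] * r["width"] for r in rows)  # should equal n
area_rel = sum(r["height_rel"] * r["width"] for r in rows)  # should equal 1

tallest = max(rows, key=lambda r: r["height_abs"])["class"]
print(area_abs, area_rel, tallest)
```

In this example the class [20, 40[ has the largest frequency, yet [10, 20[ has the tallest rectangle, because it packs its observations into a narrower interval.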



Frequency distributions

◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Skewness: distributions are skewed to the side of the long tail

Source: AI hubs (2015) Visualizing Numerical Data.
(http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html;
accessed: 18 Nov 2017)



Frequency distributions

◼ Histogram
▪ Especially useful for describing the shape of the distribution
❑ Modality: prominent peaks determine modality

Source: AI hubs (2015) Visualizing Numerical Data.
(http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html;
accessed: 18 Nov 2017)



Frequency distributions

◼ Histogram
▪ May depict extreme values and possible outliers

[Figure: histogram annotated with moderate extremes, heavy extremes and
atypical values (outliers? errors?)]



Frequency distributions

◼ Histogram

▪ How does the number of classes (thus, the width) affect the
histogram?
➢ https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/histograms-2-of-4/
➢ http://www.shodor.org/interactivate/activities/Histogram/

▪ What is the difference between a bar chart and a histogram?

✓ The major difference is that a histogram is only used to plot the frequency
of a continuous data set that has been divided into classes (sometimes
named bins). Bar charts, on the other hand, can be used for a great deal of
other types of variables including ordinal and nominal data sets.



Frequency distributions

◼ Frequency Polygon
▪ Graph resulting from successively joining, by line segments, the midpoints of
the upper sides of the rectangles of the histogram

▪ The first point is on the x-axis and is placed in the middle of the interval
which precedes the first bar of the histogram (frequency=0). The last point
is located on the x-axis in the middle of the interval immediately following
the last bar of the histogram (frequency=0).

▪ By joining the midpoint of each bar and the x-axis at each end, the area under
the frequency polygon is exactly the same as the area of the histogram
(for classes of equal width). Therefore, the principle of the histogram
is respected.
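
The equality of areas can be verified numerically for classes of equal width. The sketch below is an illustration with invented frequencies and hypothetical function names: it builds the polygon vertices, including the two closing points on the x-axis, and compares the trapezoid area under the polygon with the area of the histogram.

```python
def polygon_points(first_lower_limit, width, heights):
    """Vertices of the frequency polygon for equal-width classes.

    The first and last vertices lie on the x-axis, at the midpoints of the
    empty intervals just before the first bar and just after the last bar.
    """
    xs = [first_lower_limit - width / 2 + i * width
          for i in range(len(heights) + 2)]
    ys = [0.0] + [float(h) for h in heights] + [0.0]
    return list(zip(xs, ys))


def trapezoid_area(points):
    """Area under a piecewise-linear curve given by consecutive vertices."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


heights = [2, 5, 9, 6, 3]  # hypothetical absolute frequencies
width = 10.0               # equal class width; first class starts at 150

points = polygon_points(150.0, width, heights)
histogram_area = width * sum(heights)
polygon_area = trapezoid_area(points)

print(points[0], points[-1])
print(histogram_area, polygon_area)
```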



Frequency distributions

◼ Frequency Polygon
▪ Emphasizes the overall pattern in the data

▪ Especially useful for comparing sets of data, and displaying cumulative
frequency distributions

Source: European Centre for Disease Prevention and Control (n.a.) FEM Wiki:
Frequency polygons. (https://wiki.ecdc.europa.eu/fem/w/wiki/frequency-polygons;
accessed: 20 Nov 2017)



Frequency distributions

◼ Frequency Curve
▪ A smooth curve which corresponds to the limiting case of a histogram
computed for a frequency distribution of a continuous variable as the
number of observations becomes very large
❑ It is roughly a smoothed Frequency Polygon

Source: Weisstein, Eric W. "Frequency Curve." From MathWorld--A Wolfram Web Resource.
(http://mathworld.wolfram.com/FrequencyCurve.html; accessed: 18 Nov 2017)



Frequency distributions

◼ Frequency Curve



Frequency distributions

◼ Cumulative frequency function


▪ Graph that results from successively joining, by straight segments, the
upper right corners of the rectangles of the "histogram" of accumulated
frequencies
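
Because the graph is made of straight segments, the cumulative frequency at any value x can be read by linear interpolation between class limits. The sketch below is an assumption-laden illustration: the class limits and cumulative relative frequencies are invented, the function name is hypothetical, and the interpolation implicitly treats observations as evenly spread within each class.

```python
def cumulative_frequency(x, limits, cum_freqs):
    """Cumulative frequency at x by linear interpolation.

    limits:    class limits L0 < L1 < ... < Lk
    cum_freqs: cumulative frequencies at L1, ..., Lk (it is 0 at L0)
    """
    if x <= limits[0]:
        return 0.0
    if x >= limits[-1]:
        return float(cum_freqs[-1])
    points = list(zip(limits, [0.0] + list(cum_freqs)))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical classes [150,160[, [160,170[, [170,180[, [180,190]
limits = [150, 160, 170, 180, 190]
cum_rel = [0.2, 0.6, 0.9, 1.0]

print(cumulative_frequency(165, limits, cum_rel))  # halfway through the 2nd class
print(cumulative_frequency(200, limits, cum_rel))
```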



Descriptive measures

◼ Descriptive statistics
▪ Summarise important characteristics of the data through a single
number

▪ These measures are classified as


❑ Location
❑ Dispersion
❑ Symmetry
❑ Kurtosis
❑ Association



Descriptive measures

◼ Location measures
▪ Describe where the data is located (on the x-axis)
❑ Central tendency measures attempt to describe the typical or central value
that best describes the data. The focus is on where the data is centred or
clustered.



Descriptive measures

◼ Dispersion measures / spread measures


▪ Describe the variability of data in relation to the central value of the
distribution (i.e., how much values differ from the central value of the distribution)
❑ Describe how spread out the data values are around the central tendency



Descriptive measures

◼ Symmetry measures
▪ Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same
to the left and right of the center point.



Descriptive measures

◼ Kurtosis measure
▪ Kurtosis refers to how peaked or, conversely, how flat a distribution is
relative to a Normal distribution, which is symmetrical and bell-shaped
❑ A positive value (relative to the Normal distribution) indicates heavy tails (i.e., a lot of data in the tails)
❑ A negative value (relative to the Normal distribution) indicates light tails (i.e., little data in the tails)
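The slides describe skewness and kurtosis without giving formulas. One common choice, assumed here and not taken from the slides, is the moment-based coefficients: skewness m3 / m2^(3/2) and excess kurtosis m4 / m2² − 3, where m_r is the r-th central moment. Statistical software often applies small-sample corrections, so it may return slightly different values. A minimal Python sketch with hypothetical data:

```python
def central_moment(data, r):
    """r-th central moment, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtracting 3 makes the Normal distribution the zero reference
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data

print(skewness(symmetric))         # zero for perfectly symmetric data
print(skewness(right_skewed))      # positive: long right tail
print(excess_kurtosis(symmetric))  # negative: lighter tails than the Normal
```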



Descriptive measures

◼ Association measures
▪ Describe the degree of association between two variables

[Figure: six scatter plots. Top row: negative linear correlation; positive linear correlation; no correlation. Bottom row: no correlation; non-linear relation (quadratic model?); non-linear relation (exponential or power model?)]



Descriptive measures

◼ Some location measures

Central tendency Non-central tendency

Mean Quartiles
Median Deciles
Mode Percentiles



Descriptive measures

◼ Mean
▪ Can be thought of as the centre of mass of the values of the
observations, i.e. the point of equilibrium after we place the
observations on a ruler

▪ Extreme values may “push away” the mean from the most typical values

▪ As the data becomes skewed the mean loses its ability to provide the best central
location for the data because the skewed data is dragging it away from the typical values



Descriptive measures

◼ Mean – Example 1/4


▪ Suppose you asked a graduate if she was a good student in the degree.
She answers with the final grades of each course:

10 11 15 12 17 14
15 12 16 15 12 14
11 17 16 14 15 15
16 10 17 13 13 13
16 13 11 16 18 14
15 16 13 16 14 16

▪ What is the average grade?


✓ The student obtained an average grade of approximately 14 (511/36 ≈ 14.19)



Descriptive measures

◼ Mean – Example 2/4


▪ Suppose she answers with the frequency distribution of the grades:
Grades Nr. of courses
10 2
11 3
12 3
13 5
14 5
15 6
16 8
17 3
18 1

▪ What is the average grade?


✓ The student obtained an average grade of approximately 14 (511/36 ≈ 14.19)



Descriptive measures

◼ Mean – Example 3/4


▪ Suppose she answers with the time she studied for each course (in hours):

40.50 54.50 60.00 75.00 86.00 92.25


45.25 52.75 68.00 75.75 88.50 90.00
45.00 55.00 70.50 78.00 90.25 94.00
47.00 60.75 70.00 80.00 90.75 93.25
40.75 58.00 70.25 75.50 95.00 95.00
52.00 61.25 72.50 82.00 88.50 95.00

▪ What is the average number of hours of study?


✓ The student studied on average 71.91 hours per course



Descriptive measures

◼ Mean – Example 4/4


▪ Suppose she answers with the frequency distribution of the time she
studied for each course (in hours):

Hours of study Nr. of courses


40.5 − 50.5 5
50.5 − 60.5 6
60.5 − 70.5 5
70.5 − 80.5 7
80.5 − 90.5 6
90.5 − 100.5 7

▪ What is the average number of hours of study?


✓ The student studied on average approximately 72.17 hours per course



Descriptive measures

◼ Mean
▪ Raw data (individual observations, discrete or continuous)
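\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

▪ Frequency data (discrete case), where f_i is the relative frequency of the value X_i

\[ \bar{X} = \sum_{i=1}^{k} f_i X_i \]

▪ Frequency data (continuous case), where C_i is the midpoint of class i

\[ \bar{X} = \sum_{i=1}^{k} f_i C_i \]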
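As an illustration, the raw-data and discrete frequency formulas can be applied to the grades of Mean – Example 1/4. This is a minimal Python sketch; the variable names are illustrative, and f_i is taken to be the relative frequency n_i / n.

```python
from collections import Counter

# Final grades from Mean - Example 1/4
grades = [10, 11, 15, 12, 17, 14, 15, 12, 16, 15, 12, 14,
          11, 17, 16, 14, 15, 15, 16, 10, 17, 13, 13, 13,
          16, 13, 11, 16, 18, 14, 15, 16, 13, 16, 14, 16]

n = len(grades)
mean_raw = sum(grades) / n                            # (1/n) * sum of X_i

counts = Counter(grades)                              # absolute frequencies n_i
rel_freq = {x: c / n for x, c in counts.items()}      # relative frequencies f_i
mean_freq = sum(f * x for x, f in rel_freq.items())   # sum of f_i * X_i

print(round(mean_raw, 2), round(mean_freq, 2))        # 14.19 14.19
```

In the discrete case the two computations agree, because the frequency table loses no information about the individual values.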



Descriptive measures

◼ Mode – discrete case


▪ The mode, or modal value, is the most frequent value of the data set
❑ It is the only central location measure that can be used for nominal data

❑ It may have no meaning in discrete data with few repeated observations



Descriptive measures

◼ Mode – continuous case


▪ It represents a value in the highest bar on a histogram. It can be
determined by a formula, but only if the classes are of equal width.
▪ King’s formula
1. Identify the Modal Class (i.e., interval with the highest frequency)
2. Calculate the following expression
\[ \text{Mode} = L_{low} + \frac{f^{**}}{f^{*} + f^{**}} \times w_i \]
L_low → lower limit of the modal class
w_i → width of the modal class
f* → frequency of the class before the modal class
f** → frequency of the class after the modal class
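The two steps of King's formula can be sketched in Python as follows. The data are the cigarette weights from the frequency table in the Introduction to Probability Theory part of this unit. The function name is illustrative, and the sketch assumes a single modal class that is neither the first nor the last class.

```python
def king_mode(lower_limits, width, counts):
    """Mode of grouped data by King's formula (equal class widths).
    Assumes one modal class with a neighbour on each side."""
    m = counts.index(max(counts))     # step 1: index of the modal class
    f_before = counts[m - 1]          # f*
    f_after = counts[m + 1]           # f**
    # step 2: L_low + f** / (f* + f**) * w
    return lower_limits[m] + f_after / (f_before + f_after) * width

lower_limits = [760, 780, 800, 820, 840, 860, 880]   # weight classes (mg)
counts = [4, 43, 118, 168, 117, 39, 11]              # nr. of cigarettes

mode = king_mode(lower_limits, 20, counts)
print(round(mode, 2))
```

The result, about 829.96 mg, agrees with the mode of roughly 830 mg reported for that sample.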



Descriptive measures

◼ Mode
▪ It is easily affected by small changes in frequency of discrete data, or
class construction in the continuous case
▪ It is possible for a set of data values to have no mode or more than one
mode

Fig. source: AI hubs (2015) Visualizing Numerical Data.
(http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/visualizing-numerical-data.html; accessed: 18 Nov 2017)
Descriptive measures

◼ Median – Raw data (discrete or continuous case)

▪ Middle value of the data set when it has been arranged in ascending
order

❑ If the sample has an odd size, it coincides with the central observation

❑ If the sample has even size, the median takes the value of the average of
the two most central observations

▪ 50% of the observations are less than or equal to the median



Descriptive measures

◼ Median – Raw data


▪ Examples
❑ n = 5 (odd): 1.2 1.7 2.1 2.2 2.4
Median = observation located in position (n+1)/2
Median = 2.1

❑ n = 6 (even): 0 0 1 2 2 3
Median = average of the values in positions n/2 and n/2+1
Median = (1+2)/2 = 1.5
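The position rules for raw data translate directly into code. A minimal Python sketch (the function name is illustrative; Python's built-in `statistics.median` follows the same rules):

```python
def median(values):
    """Median of raw data using the (n+1)/2 and n/2, n/2+1 position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                # position (n+1)/2, 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions n/2 and n/2+1

print(median([1.2, 1.7, 2.1, 2.2, 2.4]))   # 2.1
print(median([0, 0, 1, 2, 2, 3]))          # 1.5
```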



Descriptive measures

◼ Median – Frequency of discrete data


▪ Example
❑ 50% of the observations are less than or equal to the median

xi    Fi
0     0.44
1     0.56
2     0.78
3     0.89
4     1

Median = 1 (Fi first exceeds 0.5 at xi = 1)



Descriptive measures

◼ Median – Frequency of discrete data


▪ Examples
n odd (n = 9)
xi      ni    Fi
0       4     0.44
1       1     0.56
2       2     0.78
3       1     0.89
4       1     1
Total   9

Median = 1

n even (n = 10)
xi      ni    Fi
0       4     0.4
1       1     0.5
2       2     0.7
3       1     0.8
4       2     1
Total   10

Median = (1+2)/2 = 1.5

✓ n even: Median = average of the values in positions n/2 (50%) and n/2+1
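The same medians can be read off the cumulative frequencies without expanding the table into raw data. A minimal Python sketch, assuming the values are sorted in ascending order and every listed value has a positive frequency (the function name is illustrative):

```python
def median_from_counts(values, counts):
    """Median of discrete data given as values and absolute frequencies."""
    n = sum(counts)
    cumulative = 0
    for i, (x, c) in enumerate(zip(values, counts)):
        cumulative += c
        if 2 * cumulative > n:               # F_i exceeds 0.5: median is x
            return x
        if 2 * cumulative == n:              # F_i is exactly 0.5 (n even)
            return (x + values[i + 1]) / 2   # average with the next value
    raise ValueError("empty frequency table")

values = [0, 1, 2, 3, 4]
print(median_from_counts(values, [4, 1, 2, 1, 1]))   # n = 9  -> 1
print(median_from_counts(values, [4, 1, 2, 1, 2]))   # n = 10 -> 1.5
```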



Descriptive measures

◼ Quartiles and percentiles


▪ Quartiles → values that split a sorted data set into 4 equal parts

▪ Percentiles → values that split a sorted data set into 100 equal parts
❑ The first quartile (Q1) is the 25th percentile: 25% of the observations are below Q1
❑ The second quartile (Q2 = Median) is the 50th percentile
❑ The third quartile (Q3) is the 75th percentile: 75% of the observations are below Q3

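[Figure: number line from Xmin to Xmax with Q1, Q2 (median) and Q3 marked. 25% of the observations lie below Q1 and 75% above it; 50% lie on each side of Q2; 75% lie below Q3 and 25% above it]

! Different software packages use slightly different formulas to compute these values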
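To illustrate how the choice of formula matters, Python's standard library offers two interpolation conventions in `statistics.quantiles` (available from Python 3.8). The sketch below applies both to the small data set from the median example; it is only an illustration and does not claim that either convention matches the one used by Excel or by the course.

```python
import statistics

data = [0, 0, 1, 2, 2, 3]

# Each call returns [Q1, Q2, Q3]
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)
print(inclusive)
```

For this sample the two methods agree on the median (1.5) but give different values for Q1 and Q3, so the interquartile range also depends on the method.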
Descriptive measures

◼ Central tendency measures

▪ A symmetric distribution is one where the left- and right-hand sides of the
distribution are roughly equally balanced around the mean

Source: https://www.siyavula.com/read/maths/grade-11/statistics/11-statistics-05 ; accessed: 17 Jun 2020
Descriptive measures

◼ Central tendency measures


▪ These rules on comparing the mean and median values can fail in multimodal
distributions, or in distributions where one tail is long but the other is heavy

[Figure: two skewed frequency curves with the mode, median and mean marked]
Positive skew: Mean > Median
Negative skew: Mean < Median

Source: Frost, J. (2015) S1: Chapter 4 Representation of Data. www.drfrostmaths.com, 20th September 2015
Descriptive measures

◼ Central tendency measures


▪ Mean
→ Uses all data, but sensitive to extreme values

▪ Median
→ Does not use all data, but it is robust
→ The quartiles, including the median, are robust location measures because
they are not affected by extreme values

▪ Mode
→ Easily affected by small changes in frequency
➢ Do not use the MODE Excel function with continuous data



Descriptive measures

◼ Central tendency measures


When there is a very extreme value (outlier), report:

(1) the Median, or

(2) the Median and the Mean

If you think the outlier does not belong in the data set (i.e., it was
an error), then consider also reporting the mean without the outlier.

Adapted from: What to Report When There is an Outlier by Robert
G. Kelley, www.miracosta.edu/home/rkelley (accessed 2018)
Descriptive measures

◼ Dispersion measures

Some dispersion measures

Range
Interquartile range
Mean absolute deviation
Variance and Standard deviation
Coefficient of variation



Descriptive measures

◼ Range
▪ Difference between the highest and the lowest value
❑ It is measured in the same units as the data
❑ It is most useful in representing the dispersion of small data sets

R = Xmax – Xmin

◼ Interquartile Range
▪ Difference between the 3rd quartile and the 1st quartile (encompasses
the central 50% of the observations)

IQ = Q3 – Q1



Descriptive measures

◼ Mean Absolute Deviation


▪ Measures the degree of dispersion of the values around the mean

▪ To prevent positive deviations from cancelling negative deviations, it is
computed as the average of the absolute deviations: \( \frac{1}{n} \sum_{i=1}^{n} \left| X_i - \bar{X} \right| \)



Descriptive measures

◼ Mean Absolute Deviation


▪ Raw data (individual observations, discrete or continuous)
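\[ \frac{1}{n} \sum_{i=1}^{n} \left| X_i - \bar{X} \right| \]

▪ Frequency data (discrete case)

\[ \sum_{i=1}^{k} f_i \left| X_i - \bar{X} \right| \]

▪ Frequency data (continuous case)

\[ \sum_{i=1}^{k} f_i \left| C_i - \bar{X} \right| \]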
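As a worked illustration, the raw-data formula can be applied to the grades of the Mean examples, rebuilt here from their frequency table. This is a minimal Python sketch with illustrative names.

```python
def mean_absolute_deviation(data):
    """Average absolute deviation of the values from their mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

# Grades and number of courses from Mean - Example 2/4
table = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
grades = [x for x, count in table.items() for _ in range(count)]

mad = mean_absolute_deviation(grades)
print(round(mad, 2))   # 1.75
```

On average, a grade lies 1.75 points away from the mean of about 14.19.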



Descriptive measures
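◼ Sample Variance (S²)
▪ Raw data (individual observations, discrete or continuous)

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) \]

▪ Frequency data (discrete case)

\[ S^2 = \frac{n}{n-1} \sum_{i=1}^{k} f_i \left( X_i - \bar{X} \right)^2 = \frac{n}{n-1} \left( \sum_{i=1}^{k} f_i X_i^2 - \bar{X}^2 \right) \]

▪ Frequency data (continuous case)

\[ S^2 = \frac{n}{n-1} \sum_{i=1}^{k} f_i \left( C_i - \bar{X} \right)^2 = \frac{n}{n-1} \left( \sum_{i=1}^{k} f_i C_i^2 - \bar{X}^2 \right) \]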
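The equivalence of these expressions can be checked on the grades of the Mean examples. The sketch below is a minimal Python illustration with illustrative names; f_i is taken to be the relative frequency n_i / n.

```python
import math

# Grades and number of courses from Mean - Example 2/4
table = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
grades = [x for x, count in table.items() for _ in range(count)]

n = len(grades)
mean = sum(grades) / n

# Raw data: definition and computational shortcut
s2_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)
s2_shortcut = (sum(x * x for x in grades) - n * mean ** 2) / (n - 1)

# Frequency data (discrete case) with relative frequencies f_i = n_i / n
f = {x: count / n for x, count in table.items()}
s2_frequency = n / (n - 1) * (sum(fi * x * x for x, fi in f.items()) - mean ** 2)

print(round(s2_definition, 3), round(s2_shortcut, 3), round(s2_frequency, 3))
print(round(math.sqrt(s2_definition), 3))   # sample standard deviation S
```

All three expressions give S² ≈ 4.447, so S ≈ 2.109 grade points.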



Descriptive measures

◼ Sample Standard Deviation (S)


▪ It is equal to the square root of the variance; thus it also measures the
degree of dispersion of the values around the mean

▪ The variance has the disadvantage of being expressed in the square of the
units in which the variable is measured

▪ The standard deviation is expressed in the same units as the variable

▪ Like the mean, the standard deviation and the variance can be strongly
affected by extreme values



Descriptive measures

◼ Coefficient of Variation (a.k.a. relative standard deviation)


▪ Measure of relative variability. It allows comparing the degree of dispersion,
around the mean, of different distributions.

\[ CV = \frac{S}{\bar{X}} \times 100 \]

▪ It can only be calculated when the variable takes values of a single sign, i.e.
the values are all positive or are all negative

❑ It may not have any meaning for data on an interval scale

▪ CV > 50% indicates that the mean is poorly representative of the data, thus the median
should [also] be used to characterise the typical values
❑ The lower the CV value, the more representative the mean
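A minimal Python sketch of the coefficient of variation, including the single-sign restriction, applied to the grades of the Mean examples. The function name is illustrative; `statistics.stdev` is the sample standard deviation with divisor n − 1.

```python
import statistics

def coefficient_of_variation(data):
    """CV in percent; defined here only when all values share one sign."""
    if not (all(x > 0 for x in data) or all(x < 0 for x in data)):
        raise ValueError("values must be all positive or all negative")
    return statistics.stdev(data) / statistics.mean(data) * 100

# Grades and number of courses from Mean - Example 2/4
table = {10: 2, 11: 3, 12: 3, 13: 5, 14: 5, 15: 6, 16: 8, 17: 3, 18: 1}
grades = [x for x, count in table.items() for _ in range(count)]

cv = coefficient_of_variation(grades)
print(round(cv, 1))
```

The result is roughly 15%, well below the 50% threshold, so the mean of about 14.19 can be taken as representative of the grades.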



Descriptive measures

◼ Statistics computed from raw data vs frequencies


▪ In the discrete case, the same values are obtained for the statistics
when they are calculated using raw data (i.e., individual observations)
or aggregated data (i.e., frequencies).

▪ In the continuous case, different values are generally obtained for the
statistics when they are calculated using raw data (i.e., individual
observations) or aggregated data (i.e., frequencies).
➢ In the calculation of the statistics, the midpoint of each class (Ci) is used to
approximate all observations from that class

➢ Different frequency distributions → different midpoints → different
approximations to the (true) values of the descriptive statistics
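The hours-of-study data from Mean – Examples 3/4 and 4/4 show this effect. In the minimal Python sketch below, the classes are assumed to be closed on the left, [lower, upper), which reproduces the class counts of Example 4/4; the variable names are illustrative.

```python
# Hours of study per course, from Mean - Example 3/4
hours = [40.50, 54.50, 60.00, 75.00, 86.00, 92.25,
         45.25, 52.75, 68.00, 75.75, 88.50, 90.00,
         45.00, 55.00, 70.50, 78.00, 90.25, 94.00,
         47.00, 60.75, 70.00, 80.00, 90.75, 93.25,
         40.75, 58.00, 70.25, 75.50, 95.00, 95.00,
         52.00, 61.25, 72.50, 82.00, 88.50, 95.00]

n = len(hours)
mean_raw = sum(hours) / n

# Classes 40.5-50.5, ..., 90.5-100.5 (assumed closed on the left)
lower_limits = [40.5, 50.5, 60.5, 70.5, 80.5, 90.5]
width = 10.0
counts = [sum(low <= h < low + width for h in hours) for low in lower_limits]
midpoints = [low + width / 2 for low in lower_limits]           # C_i
mean_grouped = sum(c / n * m for c, m in zip(counts, midpoints))

print(counts)                                        # [5, 6, 5, 7, 6, 7]
print(round(mean_raw, 2), round(mean_grouped, 2))    # 71.91 72.17
```

The grouped mean differs from the raw mean because every observation in a class is replaced by the class midpoint; a different choice of classes would give yet another approximation.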



Descriptive measures

◼ Association measures

Some association measures

Covariance
[Pearson’s] Correlation coefficient
Spearman’s correlation coefficient



Descriptive measures

◼ Covariance
▪ Measures the magnitude of the linear association between two
variables by measuring the joint variation of X and Y around their
means
❑ Raw data (individual observations, discrete or continuous)
\[ S_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right) \]

❑ Depends on the units in which the variables X and Y are measured

❑ If the variables X and Y are independent, then the covariance is zero.
However, the converse is not true in general.

❑ The sign indicates whether the relationship is positive or negative



Descriptive measures

◼ [Pearson’s] Correlation coefficient


▪ Measures the strength of the linear association between two
continuous variables X and Y
\[ r = \frac{S_{XY}}{S_X S_Y}, \quad -1 \le r \le 1 \]

❑ When the variables X and Y are independent, the correlation coefficient
is zero. However, the converse is not true in general.

❑ The sign indicates whether the relationship is positive or negative
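The following minimal Python sketch computes the sample covariance and Pearson's r from their definitions. All data are hypothetical and chosen only to illustrate three properties stated above: covariance depends on the units, r does not, and a zero coefficient does not imply independence.

```python
import math

def covariance(x, y):
    """Sample covariance S_XY with divisor n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """r = S_XY / (S_X * S_Y)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: y is roughly linear in x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_rescaled = [100 * v for v in y]        # same data in different units

print(covariance(x, y), covariance(x, y_rescaled))   # changes with the units
print(pearson_r(x, y), pearson_r(x, y_rescaled))     # unchanged, close to +1

# Hypothetical data: y is fully determined by x, yet r is zero
u = [-2, -1, 0, 1, 2]
v = [a ** 2 for a in u]
print(pearson_r(u, v))
```

The last example shows why r only measures linear association: the quadratic relation between u and v is perfect, but it is not linear.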



Descriptive measures

◼ Example 7
Consider the data in the Stocks sheet of the LU0_Examples Excel file.
▪ Are the monthly returns of Microsoft, GE, Intel, GM and CISCO correlated?

Microsoft GE Intel GM CISCO


Microsoft 1
GE 0.4450 1
Intel 0.5165 0.3235 1
GM 0.0688 0.3796 0.3174 1
CISCO 0.5128 0.3755 0.4885 0.1593 1

◼ Exercise
Play the game: guess the value of the correlation coefficient at
https://en.wikipedia.org/wiki/Guess_the_Correlation
Descriptive measures

◼ Spearman’s correlation coefficient


▪ Measures the strength of a monotonic relationship between two
variables X and Y (discrete or continuous)

▪ The measurement scale of the variables X and Y must be at least ordinal

▪ It is a measure of association (not correlation), but its interpretation is
similar to that of Pearson’s
❑ −1 ≤ r’ ≤ 1
❑ When the variables X and Y are independent, the coefficient is zero.
However, the converse is not true in general.

▪ It is also known as Spearman’s rho



Descriptive measures

◼ Spearman’s correlation coefficient


▪ Monotonic function
❑ A monotonic function is a function which is either entirely nonincreasing or
nondecreasing



Descriptive measures

◼ Spearman’s correlation coefficient – calculation


▪ Paired sample : (x1, y1), (x2, y2) , …, (xn, yn)

▪ Sort the n observations of each variable, separately, and associate
them with their corresponding rank

❑ R(xi) → rank of the value xi (corrected for ties)

❑ R(yi) → rank of the value yi (corrected for ties)

▪ The rank assigned to each set of duplicates is the average of the ranks
that those tied values would have if they were different from each
other



Descriptive measures

◼ Spearman’s correlation coefficient – calculation


▪ The Spearman’s rank correlation is the Pearson’s correlation coefficient
on the ranks of the data; i.e., the formulation is applied to the paired
ranks [R(xi), R(yi)]
\[ r' = \frac{\sum_{i=1}^{n} R(x_i) R(y_i) - n \left( \frac{n+1}{2} \right)^2}{\sqrt{\left[ \sum_{i=1}^{n} R(x_i)^2 - n \left( \frac{n+1}{2} \right)^2 \right] \left[ \sum_{i=1}^{n} R(y_i)^2 - n \left( \frac{n+1}{2} \right)^2 \right]}} \]

▪ Samples without ties (or with few ties)

\[ r' = 1 - \frac{6 \sum_{i=1}^{n} \left[ R(x_i) - R(y_i) \right]^2}{n \left( n^2 - 1 \right)} \]
Descriptive measures

◼ Example 8
Data are in the Example8 sheet of the LU0_Examples Excel file
❑ Compute the Spearman’s rho (considering few ties)

Xi Yi

550 80
620 60
580 10
580 20
540 30



Descriptive measures

◼ Example 8 - solution

xi     Rank of xi    yi    Rank of yi    Square of the difference between ranks (di²)
550    2             80    5             (2 – 5)² = 9
620    5             60    4             (5 – 4)² = 1
580    3.5           10    1             (3.5 – 1)² = 6.25
580    3.5           20    2             (3.5 – 2)² = 2.25
540    1             30    3             (1 – 3)² = 4
                                         Sum = 22.5

\[ r' = 1 - \frac{6 (22.5)}{5 \left( 5^2 - 1 \right)} = -0.125 \]
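The calculation of Example 8 can be reproduced with a short Python sketch. The helper names are illustrative. The sketch computes both the shortcut formula used in the solution and Pearson's formula applied to the ranks, which is the version that remains exact when ties are present.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_short(x, y):
    """Shortcut formula, exact only when there are no ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def spearman_full(x, y):
    """Pearson's formula applied to the paired ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    c = n * ((n + 1) / 2) ** 2
    numerator = sum(a * b for a, b in zip(rx, ry)) - c
    denominator = math.sqrt((sum(a * a for a in rx) - c)
                            * (sum(b * b for b in ry) - c))
    return numerator / denominator

x = [550, 620, 580, 580, 540]
y = [80, 60, 10, 20, 30]

print(average_ranks(x))                 # [2.0, 5.0, 3.5, 3.5, 1.0]
print(spearman_short(x, y))             # -0.125
print(round(spearman_full(x, y), 4))    # -0.1539
```

With one tie among five observations the two formulas already differ (−0.125 versus about −0.154), which illustrates why the shortcut is recommended only for samples with no ties or few ties.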
Descriptive measures

◼ Outlier analysis
▪ Outlier: discordant or extreme value

▪ We have to explain this value by further analysis of its cause or origin


❑ If it was due to human/sensor errors when measuring or recording data,
then it should be corrected or removed
❑ If extreme values are a characteristic of the attribute (high skewness),
some authors do not consider them as outliers
❑ Outliers may indicate data points that belong to a different population
than the rest of the sample set

▪ In cases of extreme observations, the typical values must be analysed


using robust statistics (median and IQ) instead of the usual ones (mean
and standard deviation)
Descriptive measures

◼ Boxplot (a.k.a. box and whisker diagram)


▪ Graphical representation of location measures
▪ Useful for describing dispersion and skewness using robust statistics
Descriptive measures

◼ Boxplot

Positive skewness: Q3 − Q2 > Q2 − Q1
Symmetry: Q2 − Q1 = Q3 − Q2
Negative skewness: Q2 − Q1 > Q3 − Q2
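These comparisons are easy to automate. The minimal Python sketch below (illustrative function name) applies them to the quartiles reported in Example 9 of this unit, Q1 = 95.5, Q2 = 120 and Q3 = 167.5.

```python
def quartile_skew(q1, q2, q3):
    """Direction of skewness from the position of the median inside the box."""
    upper, lower = q3 - q2, q2 - q1
    if upper > lower:
        return "positive skewness"
    if upper < lower:
        return "negative skewness"
    return "symmetry"

# Quartiles of Example 9 (duration of extracorporeal circulation, minutes)
print(quartile_skew(95.5, 120, 167.5))   # positive skewness
```

Here Q3 − Q2 = 47.5 exceeds Q2 − Q1 = 24.5, so the box plot suggests a longer right tail.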
Descriptive measures

◼ Boxplot
▪ Useful for comparing different distributions in the same graph

▪ Potential outliers may be plotted as individual points (Tukey boxplot)



Descriptive measures

◼ Outliers detection using fences


▪ Lower inner fence: BII = Q1 – 1.5 IQ

▪ Upper inner fence: BIS = Q3 + 1.5 IQ

▪ Lower outer fence: BEI = Q1 – 3 IQ

▪ Upper outer fence: BES = Q3 + 3 IQ

▪ v1 → lower adjacent value to Q1; v2 → upper adjacent value to Q3

▪ Moderate outlier: observation between an inner fence and the corresponding outer fence

▪ Severe outlier: observation beyond an outer fence



Descriptive measures

◼ Outlier analysis
▪ “The simple elimination of a potential outlier should be done with caution and
the most advisable is to carry out the analysis with and without the presence
of that observation. If the conclusions are discordant one should at least be
aware that the outlier significantly affects the conclusions, and so it is best to
report this fact, leaving to the third party the possibility to choose their own
path.”

▪ “The elimination of a potential outlier is inappropriate when the observed
variable has a distribution with heavy tails, in the framework of which the
outliers are natural. For some authors, almost certain identification of outliers
is generally only possible for samples with 500 or more observations; thus,
when working with small samples, the most prudent policy is to isolate some
values and to pay, or ask others to pay, special attention to them.”
(Murteira 1993, p. 100)
Descriptive measures

◼ Example 9
The Example9 sheet of the LU0_Examples Excel file has data on the
duration of extracorporeal circulation (in minutes) of 94 patients
undergoing a heart intervention, between May 1980 and December 1988
at the Hospital de Santa Cruz (Source: Murteira, 1993, pp. 97-98)

❑ Investigate the existence of outliers using fences

❑ Use the graphics facility in Excel 2016 (or later) to produce box plots
1. Select your data – either a single data series, or multiple data series

2. Click Insert > Insert Statistic Chart > Box and Whisker

Murteira, B., Ribeiro, C.S., Silva, J.A., Pimenta, C. (2010).
Introdução à Estatística. Lisboa: Escolar Editora
Descriptive measures

◼ Example 9
Outlier analysis using fences
❑ In this case, the lower fences, inner (BII) and outer (BEI), are irrelevant
because they are less than zero and the variable is positive
✓ Mean = 139.72
✓ Min = 30
✓ Q1 = 95.5
✓ Median = Q2 = 120
✓ Q3 = 167.5
✓ Max = 403
✓ IQ = 72
✓ BIS = 167.5 + 1.5 × 72 = 275.5
✓ BES = 167.5 + 3 × 72 = 383.5

OUTLIERS
Moderate: 295 minutes, 300 minutes
Severe: 402 minutes, 403 minutes
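The classification in this example can be reproduced with a minimal Python sketch. It assumes the usual convention that a moderate outlier lies between an inner and an outer fence and a severe outlier lies beyond an outer fence; the function name is illustrative, and the quartiles are the ones reported above rather than recomputed from the raw data.

```python
def classify(value, q1, q3):
    """Label a value using the inner (1.5 IQ) and outer (3 IQ) fences."""
    iq = q3 - q1
    inner = (q1 - 1.5 * iq, q3 + 1.5 * iq)   # BII, BIS
    outer = (q1 - 3 * iq, q3 + 3 * iq)       # BEI, BES
    if value < outer[0] or value > outer[1]:
        return "severe outlier"
    if value < inner[0] or value > inner[1]:
        return "moderate outlier"
    return "not an outlier"

q1, q3 = 95.5, 167.5
for minutes in (30, 295, 300, 402, 403):
    print(minutes, classify(minutes, q1, q3))
```

The minimum (30 minutes) is not flagged because the lower inner fence is negative, while 295 and 300 fall between BIS = 275.5 and BES = 383.5, and 402 and 403 lie beyond BES.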



Descriptive measures

◼ Example 10
Consider the data in the Grades sheet of the LU0_Examples Excel file
▪ Use the Analysis Toolpak add-in to compute descriptive statistics of the grades
from each school

▪ Use the graphics facility in Excel 2016 (or later) to produce a histogram of the
grades from Columbus East

▪ Use the graphics facility in Excel 2016 (or later) to produce a box plot of the
grades by school and subject

Source: Microsoft (2017) Create a box and whisker chart.
https://support.office.com/en-us/article/Create-a-box-and-whisker-chart-62f4219f-db4b-4754-aca8-4743f6190f0d (accessed: 20 Nov 2017)
Introduction to Probability Theory

◼ Empirical distributions and probability distributions


▪ Distribution of the weight of 500 cigarettes "SG Filter"
❑ Mean ≈ Mode ≈ Median ≈ 830 mg; Standard deviation ≈ 23.63 mg

Weight (mg)    Nr. of cigarettes    Proportion of cigarettes
760 – 780      4                    0.008
780 – 800      43                   0.086
800 – 820      118                  0.236
820 – 840      168                  0.336
840 – 860      117                  0.234
860 – 880      39                   0.078
880 – 900      11                   0.022
Total          500                  1

Source: Murteira, B. (1993). Análise Exploratória de Dados
– Estatística Descritiva. Portugal: McGraw-Hill, p. 34.
Introduction to Probability Theory

◼ Empirical distributions and probability distributions


▪ From the frequency distribution, can we conclude that
❑ There are no cigarettes weighing less than 760 mg?

❑ The average weight of cigarettes on the market is equal to 830 mg?

❑ If another sample of 500 cigarettes were taken, the average weight of
cigarettes would still be exactly 830 mg?

▪ Solution: obtain a mathematical model of the frequency distribution

❑ It is an algebraic expression that describes the relative frequency (height of
the frequency curve) for all possible values of the variable

❑ It is called a probabilistic model or probability distribution



Introduction to Probability Theory

◼ Empirical distributions and probability distributions


▪ If we assume that the true mean weight of cigarettes is μ = 830 mg and the
standard deviation is σ = 24 mg, then we can formulate the following
probabilistic model: Normal distribution with parameters μ = 830 and σ = 24

Probabilistic model

\[ f(x) = \frac{1}{24 \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - 830}{24} \right)^2} \]

[Figure: frequency curve of the weight (mg), x-axis from 770 to 890, shown together with the probabilistic model]


Introduction to Probability Theory

◼ Empirical distributions and probability distributions


▪ The random variable X represents the weight of "SG Filter" cigarettes. We may
assume that the probability distribution of X is the Normal distribution with
parameters μ = 830 and σ = 24.
❑ The parameters of probability distributions, also called population parameters, are
generally represented by Greek letters (e.g., μ, σ)

❑ If we change the values of the parameters of a distribution, the appearance of the
model’s graph changes. The parameters allow the same distribution to be used to describe a
vast set of real phenomena.

✓ 23.4% of the cigarettes in the sample weigh between 840 and 860 mg

✓ Under the model, the probability that a cigarette weighs between 840 and 860 mg is 23.16%
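The comparison between the empirical proportion and the model probability can be sketched in Python using only the standard library. The Normal cumulative distribution function is written with `math.erf`, and the function name is illustrative.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma) variable, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 830, 24
p_model = normal_cdf(860, mu, sigma) - normal_cdf(840, mu, sigma)
p_sample = 117 / 500        # relative frequency of the class 840 – 860

print(round(p_sample, 3))   # 0.234
print(round(p_model, 4))
```

With unrounded z-scores the model probability is about 0.233. The figure of 23.16% quoted above is what a standard Normal table gives when the z-scores are rounded to two decimals (0.42 and 1.25), which is presumably how it was obtained. Either way, the model probability is close to, but not identical with, the sample proportion.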



Introduction to Probability Theory

◼ Empirical distributions and probability distributions


▪ The frequency distribution is an empirical concept that, in most cases,
concerns a sample. Hence, it is also named the empirical distribution of
X.

▪ The probability distribution is a theoretical concept, regarding the
population, and should be considered a mathematical model of reality.

❑ The probability of an event can be understood as the relative frequency of
this event in a theoretical population model

❑ We use a lowercase letter x to designate a specific value of the
population variable X



Introduction to Probability Theory

◼ Statistical inference process

Parameter – a number that describes the population

POPULATION (Mean: μ)
INFERENCE: Sample → Population
Sample (Sample mean: X̄)

Statistic – a number that describes the sample