
STATISTICS
TABLE OF CONTENTS
Statistics
    The statistical process
        Study design
Introduction to data
    Data
        Variables
    Data collection principles
        Population, samples and censuses
    Observational studies
    Sampling methods
    Experiments
    Examining numerical data
        Scatterplots
        Dot plots
        Histograms
        Variance and the standard deviation
        Box plots, quartiles and the median
        Outliers
        Transforming data
    Examining categorical data
        Contingency tables and bar plots
Probability
    Definition of probability
        Probability
        Sample space
        Disjoint or mutually exclusive outcomes
        Complementary events
        Independent events
    Conditional probability
        Marginal and joint probability
        Defining conditional probability
        Tree diagrams
    Random variables
        Types of random variables
        Expectation
        Variability in random variables
        Linear combination of random variables
    Continuous distribution
Distributions of random variables
    Normal distribution
        Standardizing with Z scores
        The normal probability table
    Evaluating the normal approximation
    Geometric distribution
        Bernoulli distribution
        Geometric distribution
    Binomial distribution
        The choose function
        The normal approximation of the binomial distribution
    More discrete distributions
        The negative binomial distribution
        Poisson distribution
STATISTICS
Scientists aim to answer questions using rigorous methods and careful observation; in statistics, these observations are called data.
Statistics is the art of collecting and interpreting data.

THE STATISTICAL PROCESS


It is useful to consider statistics as an investigative process, whose steps are the following:
1. Identify the question, since the question itself determines the type of data to be collected.
2. Collect relevant data.
3. Analyse the collected data.
4. Draw conclusions.

STUDY DESIGN

Objective: evaluate the effectiveness of cognitive-behaviour therapy for chronic fatigue syndrome.
Participant pool: 142 patients recruited from referrals by primary care physicians and consultants to a hospital clinic specializing in chronic fatigue syndrome.
Actual participants: only 60 of the 142 referred patients entered the study. Some were excluded because they did not meet the diagnostic criteria, some had other health issues, and some refused to take part in the study.
Patients were randomly assigned to treatment and control groups, 30 patients in each group:
• Treatment: cognitive behaviour therapy – collaborative, educative, and with a behavioural emphasis. Patients were shown how activity could be increased steadily and safely without exacerbating symptoms.
• Control: relaxation – no advice was given about how activity could be increased. Instead, progressive muscle relaxation, visualization, and rapid relaxation skills were taught.
The table below shows the distribution of patients with good outcomes at the 6-month follow-up. Note that 7 patients dropped out of the study: 3 from the treatment group and 4 from the control group.

                     Treatment   Control
  Good outcome           19          5
  No good outcome         8         21
  Total                  27         26
ANALYSIS OF RESULTS
Given these data we can compute a summary statistic from the table; a summary statistic is a single number summarizing a large amount of data. Here we consider the proportion of good outcomes in each group.
Proportion with good outcomes in the treatment group: 19/27 ≈ 0.70 → 70%
Proportion with good outcomes in the control group: 5/26 ≈ 0.19 → 19%
GENERALIZATION OF RESULTS
Do the data show a real difference between the groups, or is the difference due to natural variability?
Suppose you flip a coin 100 times. While the chance the coin lands heads on any given flip is 50%, we probably won't observe exactly 50 heads. This type of fluctuation is part of almost any data-generating process and is called natural variability.
The observed difference between the two groups (70% − 19% = 51 percentage points) may be real or may be due to natural variation. Since the difference is quite large, it is more believable that it is real; however, these results cannot be generalized to the whole population of chronic fatigue patients, because the participant pool was not randomly selected: only a randomly selected pool ensures representativeness, and only then can the statistic be generalized.
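To get a feel for natural variability, here is a minimal simulation sketch in Python (the flips are simulated for illustration, not data from the study): even with a fair coin, the head count in 100 flips wanders around 50.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Five rounds of 100 fair-coin flips: the head counts fluctuate around 50.
for round_number in range(1, 6):
    heads = sum(random.random() < 0.5 for _ in range(100))
    print(f"round {round_number}: {heads} heads out of 100")
```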
INTRODUCTION TO DATA
DATA
We can arrange the observations in a data matrix, a table that organizes all the data gathered for an investigation. Each row of such a table represents a statistical unit – also called a unit of observation or case – while each column represents a variable.
VARIABLES
Variables can be classified in different ways.

A numerical variable is a quantitative one:
◦ if it can only take non-negative integer values, it is considered a discrete variable;
◦ otherwise, it is called a continuous variable.
A categorical variable is a qualitative one:
◦ it is a regular (nominal) categorical variable when there is no specific order to its levels;
◦ it is an ordinal variable when its levels have a natural order.
RELATIONSHIP BETWEEN VARIABLES
Furthermore, we can classify variables by their relationship with one another: if two variables are in some way linked, they are considered associated – the association can be positive or negative; if there is no link between them, they are considered independent variables.
Two variables can only be one of the two at a time.

DATA COLLECTION PRINCIPLES


The first step in conducting research is to identify the topics to be investigated: a clear research question is useful for identifying which subjects or cases should be studied and which variables are important.
But how do we collect data?
POPULATION, SAMPLES AND CENSUSES
Each research question refers to a target population, while a sample is a fraction of that population; if we were to study the whole population, we would be conducting a census.
Most of the time censuses are not viable, for several reasons:
• complexity;
• the size of the population;
• economic costs;
• populations rarely stand still:
◦ they change constantly, making it impossible to obtain a perfect measure;
◦ there may be hard-to-locate individuals with peculiar characteristics that distinguish them from the rest of the population → making the statistic less representative.
SAMPLING ERRORS
Thus, what we do instead of a census is sampling, which comes naturally to us. However, if not done appropriately, sampling can introduce a series of errors; for instance:
• Anecdotal evidence: evidence selected haphazardly, which may be true, but only for exceptional cases.
• Non-response bias: when people choose not to respond to an investigation, it is usually for a very specific reason (e.g. fear of retaliation by the authorities among illegal immigrants in a census), which makes it unclear whether the results are representative of the whole population.
• Convenience sampling: individuals who are easier to access are more likely to be included.
• Volunteering: people who volunteer to answer tend to have very strong opinions about the issue (e.g. people reviewing a product or service are usually either totally dissatisfied or completely satisfied with it).
◦ People are also more prone to answer if their response is "Yes"; this problem easily arises when drafting a referendum → questions are phrased in a way that makes voters want to answer "Yes".
EXPLANATORY AND RESPONSE VARIABLES
When we are dealing with two variables and we suspect that one is affecting the other, we call the first an explanatory variable and the second a response variable; however, this labelling does not guarantee a causal relationship: it only reflects our suspicion.
EXPLORATORY ANALYSIS TO INFERENCE
After a study has been conducted, the next step is an exploratory analysis, that is, a description of the results. From it, we draw an inference: a generalization of the results from the sample to the whole population of interest.
We do not need an inference when taking a census, as the exploratory analysis already describes the whole population.

OBSERVATIONAL STUDIES
We speak of observational studies when researchers do not interfere with how the data arise, but merely collect them; this type of study is common in medical research: imagine trying to provoke cancer or another disease in patients in order to conduct an experiment! Such studies can take the form of prospective studies, which collect data as events unfold, or retrospective studies, which record data after the events have taken place.
SPURIOUS ASSOCIATION
It may occur that two variables are linked by a so-called spurious association, that is: both depend on a third variable, known as a confounding variable.

SAMPLING METHODS
There exist four methods for sampling:
1. Simple random sampling, which works roughly like a lottery: each case in the population has an equal chance of being drawn into the sample.
2. Stratified sampling: the population is divided into groups, each called a stratum; each stratum contains similar cases so that, when simple random sampling is applied within each stratum, representativeness is ensured.
3. Cluster sampling: similar to stratified sampling, but the population is divided into clusters; some clusters are then randomly picked and followed for the whole experiment, making them the sample.
4. Multistage sampling: same as cluster sampling, but a random sample within each selected cluster is picked.

EXPERIMENTS
Studies where the researchers assign treatment to cases are called experiments; when this assignment includes
randomization, they are called randomized experiments.
PRINCIPLES OF EXPERIMENTAL DESIGN
Randomized experiments are based on four principles:
1. Controlling: researchers try to control for any differences between the groups.
2. Randomization: researchers randomly assign patients to the treatment groups to account for variables that cannot be controlled.
3. Replication: the more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response.
◦ It is possible to replicate within a single study by collecting a large enough sample.
◦ It is also possible to replicate an entire study.
4. Blocking: whenever researchers suspect that a variable other than the treatment influences the response, they may divide individuals into blocks and then randomize cases within each block to the treatment groups.
DOUBLE-BLIND RANDOMIZED EXPERIMENTS
Reducing bias in human experiments: within a human experiment an additional, emotional effect may arise, which can hamper the experiment's validity.
In order to avoid such a bias, researchers may keep patients unaware of their treatment by administering a placebo to those in the control group; this is called a blind experiment.
Sometimes it is also useful to keep the doctors, or even the researchers themselves, unaware of the patients' group assignment; these are called double-blind experiments.

EXAMINING NUMERICAL DATA


There exist various techniques to examine numerical data and summarize their variables.
SCATTERPLOTS
A scatterplot provides a case-by-case view of the data for two numerical variables; sometimes, however, only one variable is of interest, in which case we use a dot plot.
DOT PLOTS
Especially when dealing with dot plots, the mean is a common way to measure the centre of a distribution of data.
MEAN
SAMPLE MEAN
The sample mean is often denoted $\bar{x}$ and is computed as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The sample mean is a sample statistic and serves as a good guess (point estimate) of the population mean. This estimate
may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
POPULATION MEAN
The population mean is denoted μ, where N represents the number of units in the population. It is computed as:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
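A minimal sketch in Python (the sample values are invented for illustration): the sample mean is just the sum of the observations divided by their count.

```python
# Hypothetical sample of n = 5 measurements (illustrative values only).
data = [4.2, 5.1, 3.8, 6.0, 4.9]

n = len(data)
x_bar = sum(data) / n  # sample mean: (sum of x_i) / n
print(f"n = {n}, sample mean = {x_bar:.2f}")  # n = 5, sample mean = 4.80
```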

HISTOGRAMS
Even if dot plots and stacked dot plots are useful, as they show the exact value of each observation, they are only feasible for small data sets. Whenever we work with bigger data sets, we prefer to think of each value as belonging to a bin; these binned counts are plotted as bars in a graph known as a histogram.
SHAPE OF THE DISTRIBUTION
Histograms are particularly useful for understanding data density and describing the shape of a distribution, as their area is proportional to the actual data.
When we speak of a distribution, we must take into account that it can be described in terms of:
• Relative frequency, where the sum over all values is 1; the relative frequency $f_i$ is
$$f_i = \frac{n_i}{n}, \quad i = 1, \dots, k$$
• Percentage: a relative frequency multiplied by 100.
• Number of data points (counts).
BIN WIDTH AND HEIGHT
If the variable x is continuous, we would end up with infinitely many distinct values $x_i$, resulting in an intractable distribution. For this reason, values are grouped into bins (classes) $(x_{i-1}, x_i]$, and the mean is approximated as:
$$\bar{x} \approx \frac{\sum_{i=1}^{k} c_i n_i}{n} = \sum_{i=1}^{k} c_i \frac{n_i}{n} = \sum_{i=1}^{k} c_i f_i$$
where $c_i$ is a value in the bin $(x_{i-1}, x_i]$, typically the midpoint $c_i = \frac{x_{i-1} + x_i}{2}$.
The height of each bar is proportional to the frequency of the corresponding bin: if the count of the bin $(x_{j-1}, x_j]$ is $n_j$, then the corresponding bar can be fixed to have height $\frac{n_j}{x_j - x_{j-1}}$. In this way, since the bar is a rectangle whose area equals its height times its width, we have:
$$(x_j - x_{j-1}) \cdot \frac{n_j}{x_j - x_{j-1}} = n_j$$
If relative frequencies $f_j$ are used in place of the counts $n_j$, we obtain a density histogram, a histogram whose total area is equal to 1.
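A small Python sketch (bin edges and counts are invented) of the two computations above: the binned approximation of the mean using midpoints and relative frequencies, and density-histogram heights whose areas sum to 1.

```python
# Invented bins (x_{i-1}, x_i] and their counts n_i.
edges = [0, 10, 20, 40]   # three bins: (0,10], (10,20], (20,40]
counts = [5, 12, 3]

n = sum(counts)
mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]   # c_i (midpoints)
freqs = [c / n for c in counts]                              # f_i = n_i / n

# Approximate mean: sum of c_i * f_i.
x_bar = sum(c * f for c, f in zip(mids, freqs))
print(f"approximate mean = {x_bar:.2f}")

# Density-histogram heights: f_i / bin width, so the total area is 1.
heights = [f / (hi - lo) for f, (lo, hi) in zip(freqs, zip(edges, edges[1:]))]
area = sum(h * (hi - lo) for h, (lo, hi) in zip(heights, zip(edges, edges[1:])))
print(f"total area = {area:.1f}")  # 1.0
```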
MODALITY
Modality describes the number of peaks of a histogram: a histogram with a single peak is known as unimodal; one with two peaks is bimodal; one with more than two peaks is called multimodal.
SYMMETRY
When describing this type of graph we also take its symmetry into account: a symmetric histogram shows balance between the data on either side of its centre, while an asymmetric one is unbalanced on one side of the mean and is called skewed.
Another important factor in the shape of a histogram is the presence of unusual observations, or potential outliers.
VARIANCE AND THE STANDARD DEVIATION
VARIABILITY
Variability refers to the extent to which values differ from one another; it can also be thought of as the values' distance from their mean, that is, how spread out a distribution is.
We can measure the spread of a distribution by looking at the distances of the values from the mean; this is done by computing the squared distances from the mean (it makes no sense to use the simple mean of the raw deviations, since their sum is always equal to zero). Squared distances are useful since they:
• eliminate all negative values (a square is never negative);
• highlight differences by making large ones bigger and small ones smaller ($10^2 = 100$ while $0.1^2 = 0.01$).
$$\text{Variability} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
where $(x_i - \bar{x})$ is the deviation of a value from the mean.
VARIANCE
Variance is the mean of the squared deviations from the mean. It can refer to a sample or to the population and can be computed for bins or distinct values.
Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$, where $\bar{x}$ is the sample mean.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$, where $\mu$ is the population mean and N the population size.
The unbiased sample variance is roughly the average squared deviation from the mean:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
STANDARD DEVIATION
The sample standard deviation is the square root of the unbiased sample variance and has the same units as the data:
$$s = \sqrt{s^2}$$
It describes how far away the typical observation is from the mean; we call this distance a deviation. It is mostly useful for considering how close the data are to the mean.
SCOPE OF VARIANCE AND STANDARD DEVIATION
These two measures are a means to an end, the end being the ability to accurately estimate the uncertainty associated with sample statistics.
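A minimal sketch in Python (invented values): the biased and unbiased sample variance and the standard deviation.

```python
# Hypothetical sample (illustrative values only).
data = [4.2, 5.1, 3.8, 6.0, 4.9]
n = len(data)
x_bar = sum(data) / n

sq_dev = [(x - x_bar) ** 2 for x in data]    # squared deviations
var_biased = sum(sq_dev) / n                 # divides by n
var_unbiased = sum(sq_dev) / (n - 1)         # divides by n - 1
s = var_unbiased ** 0.5                      # same units as the data
# (for reference, the standard library's statistics.variance also divides by n - 1)

print(f"biased s^2 = {var_biased:.3f}, unbiased s^2 = {var_unbiased:.3f}, s = {s:.3f}")
```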
BOX PLOTS, QUARTILES AND THE MEDIAN
MEDIAN
The median is the value that splits the data in half when they are ordered in ascending order; since it is the midpoint of the data, it is also known as the 50th percentile.
If the number of observations is odd, we simply take the middle observation; if it is even, the median becomes the average of the two middle numbers.
Suppose we are considering the set A of observations:
$$A = \{0, 1, 2, 3, 4, 5\}$$
If we take the first n = 5 observations, the median is 2; if we take all n = 6, the median is:
$$\text{median} = \frac{2 + 3}{2} = 2.5$$
QUARTILES
Similarly to the median, we also define another three percentiles in our data:
• the 25th percentile, or Q1;
• the 75th percentile, or Q3;
• the 100th percentile (the maximum).
BOX PLOT
In particular, by means of Q1 and Q3 we are able to build another graph, known as the box plot; the box plot summarizes a data set using five statistics while also plotting unusual observations. In order to draw a box plot we:
1. Mark the median with a dark line, splitting the data into two halves;
2. Draw a box that captures the middle 50% of the data, i.e. all data between Q1 and Q3, whose length is the interquartile range:
$$IQR = Q_3 - Q_1$$
3. Attempt to capture the data outside the box by means of the so-called whiskers, whose maximum reach is given by:
$$\text{upper whisker} = Q_3 + 1.5 \cdot IQR$$
$$\text{lower whisker} = Q_1 - 1.5 \cdot IQR$$
4. Label any observation beyond the whiskers with a dot; such an observation is considered an outlier, that is, an observation that appears unusually distant from the rest of the data.
A code sketch of these statistics follows the list.
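A minimal Python sketch (invented data; Q1 and Q3 are computed here with one common convention, the medians of the lower and upper halves, though conventions differ across textbooks and software):

```python
# Hypothetical, already sorted sample (illustrative values only).
data = sorted([1, 3, 5, 6, 7, 8, 9, 10, 12, 30])

def median(xs):
    m, odd = divmod(len(xs), 2)
    return xs[m] if odd else (xs[m - 1] + xs[m]) / 2

# Q1/Q3 as the medians of the lower/upper halves of the data.
half = len(data) // 2
q1 = median(data[:half])
q2 = median(data)
q3 = median(data[half + len(data) % 2:])

iqr = q3 - q1
upper = q3 + 1.5 * iqr   # maximum reach of the upper whisker
lower = q1 - 1.5 * iqr   # maximum reach of the lower whisker
outliers = [x for x in data if x < lower or x > upper]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"whisker reach: [{lower}, {upper}], outliers: {outliers}")  # outliers: [30]
```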
OUTLIERS
Outliers are a useful tool in statistics, since they:
• identify strong skew in a distribution;
• identify errors in data collection or data entry;
• provide insight into interesting properties of the data.
ROBUST STATISTICS
Even if interesting, we must be aware that the presence of outliers can be dangerous to our analysis: we must distinguish among the techniques explained so far, since some are more sensitive than others to the presence of outliers.
In particular, we consider the median and the IQR less sensitive tools, thus we call them robust estimates and use them in the presence of outliers, as opposed to the mean and standard deviation, which are known as non-robust estimates.
SYMMETRIC AND ASYMMETRIC DISTRIBUTIONS
We can see graphically how these characteristics affect a distribution and its measures. Distributions are usually classified as symmetric or asymmetric.
ASYMMETRIC DISTRIBUTION
When we deal with an asymmetric distribution, we further divide it into two categories:
1. A right-skewed distribution, also known as positively skewed, is affected by outliers in the upper percentiles, which tend to pull the mean upwards with respect to the median, so that
$$\text{mean} > \text{median}$$
2. A left-skewed (negatively skewed) distribution, on the other hand, is usually affected by outliers in the lower percentiles, which pull the mean downwards with respect to the median, so that
$$\text{mean} < \text{median}$$
SYMMETRIC DISTRIBUTION

With a symmetric distribution, on the other hand, the mean and the median are roughly equal to each other.
Usually a symmetric distribution is represented with a bell shape, even if it can assume other shapes. When we are dealing with a symmetric bell-shaped distribution, we should check how the data are spread; if, in the bell-shaped distribution:
• approximately 68% of the data values are within one standard deviation of the mean;
• approximately 95% of the data values are within two standard deviations of the mean;
• almost all of the data values are within three standard deviations of the mean;
then we are dealing with a normal distribution.
TRANSFORMING DATA
When data are extremely skewed, transforming them might make modelling easier; we consider a transformation to be any rescaling of the data by means of a function. However, even if modelling is easier with transformed data, we should be aware that this practice makes interpretation more difficult: if we consider the log, a common transformation, what does the log of a variable tell us about the variable itself?
MEAN OF A TRANSFORMATION
What happens to the mean under a transformation?
Let X be a variable measured on n sampling units, so that $X = x_1, x_2, \dots, x_n$.
Consider a transformation $g(\cdot)$ of X, i.e. $Y = g(X)$, so that:
$$y_1 = g(x_1),\; y_2 = g(x_2),\; \dots,\; y_n = g(x_n)$$
Thus, the mean of Y can be computed as:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{\sum_{i=1}^{n} g(x_i)}{n}$$
If, for instance, $Y = \log(X)$, then:
$$\bar{y} = \frac{\sum_{i=1}^{n} \log(x_i)}{n} \neq \log(\bar{x})$$
Thus the mean of a transformation is not the transformation of the mean.
LINEAR TRANSFORMATION
Another important example of a transformation is the linear transformation. Consider $X = x_1, x_2, \dots, x_n$. A transformation $g(\cdot)$ of X is called linear if $Y = aX + b$, with a and b real constants. The transformed values are then:
$$y_i = a x_i + b, \quad i = 1, \dots, n$$
In this case the mean does transform accordingly: $\bar{y} = a\bar{x} + b$.
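A small Python sketch (invented values) contrasting the two cases: the mean of a log transformation differs from the log of the mean, while a linear transformation carries the mean along exactly.

```python
import math

# Hypothetical skewed sample (illustrative values only).
x = [1, 10, 100, 1000]
n = len(x)

mean_x = sum(x) / n
mean_log = sum(math.log(v) for v in x) / n   # mean of the transformation
log_mean = math.log(mean_x)                  # transformation of the mean
print(f"mean(log x) = {mean_log:.3f}  vs  log(mean x) = {log_mean:.3f}")  # they differ

# For a LINEAR transformation y = a*x + b the mean transforms directly.
a, b = 2, 5
y = [a * v + b for v in x]
print(sum(y) / n == a * mean_x + b)  # True
```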

EXAMINING CATEGORICAL DATA


Like numerical data, categorical data can also be organized and analysed; to do so we may use various graphs, which are based on the so-called contingency table.
CONTINGENCY TABLES AND BAR PLOTS
CONTINGENCY TABLE
A contingency table is a table that summarizes data for two categorical variables, usually by recording the number of occurrences of each particular combination of variable outcomes. If we are dealing with a single variable, the table is instead called a frequency table; both kinds of table can also be relative, if we replace the absolute counts with relative ones.
BAR PLOTS
A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies
are shown is called a relative frequency bar plot.
PLOTS FOR MORE THAN ONE VARIABLE
However, if we are dealing with more than one categorical variable, we should use other tools, such as:
1. the segmented bar plot, which shows the proportions of the data and can be expressed in absolute or relative values;
2. or the mosaic plot, which is similar to the bar plot but uses box areas to represent the number of observations each combination has.
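A minimal sketch of building a contingency table in Python with only the standard library (the observations below are invented):

```python
from collections import Counter

# Hypothetical observations for two categorical variables.
smoker = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
outcome = ["bad", "good", "good", "bad", "good", "bad", "good", "good"]

# Count the occurrences of each combination of levels.
table = Counter(zip(smoker, outcome))

rows, cols = ["yes", "no"], ["good", "bad"]
print("        " + "  ".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:>6}  " + "  ".join(f"{table[(r, c)]:>5}" for c in cols))
```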
PROBABILITY

DEFINITION OF PROBABILITY
PROBABILITY
We use probability to deal with uncertainty, i.e. to quantify how likely an event is to occur. An event (outcome) is a proposition that can be either true or false – tertium non datur; its probability is the proportion of times the outcome would be observed if the random process were repeated an infinite number of times.
INTERPRETATION OF PROBABILITY
There exist two interpretations of probability:
• Frequentist interpretation: the probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.
• Bayesian interpretation: a Bayesian interprets probability as a subjective degree of belief: for the same event, two separate people could have different viewpoints and so assign different probabilities. This view has been largely popularized by revolutionary advances in computational technology and methods during the last twenty years.
The choice between these two interpretations is a philosophical matter.
LAW OF LARGE NUMBERS (LLN)
The frequentist definition is best used to describe situations such as the toss of a coin: the outcome is random; however, we can witness some regularity as the number of tosses increases. This tendency to stabilize around a certain value is described by the so-called law of large numbers, which states: as more observations are collected, the proportion of occurrences of a particular outcome, $p_n$ say, converges to the probability of that outcome, $\pi$ say.
A common misunderstanding of the LLN is that random processes are supposed to compensate for whatever happened in the past; this is simply not true and is known as the gambler's fallacy (or law of averages).
SAMPLE SPACE
The set of all possible outcomes of a random process is called the sample space S, while the list of all possible outcomes together with their probabilities is known as the probability distribution.
PROBABILITY DISTRIBUTION
Rules for probability distributions:
1. The events listed must be disjoint
2. Each probability must be between 0 and 1
3. The probabilities must total 1
DISJOINT OR MUTUALLY EXCLUSIVE OUTCOMES
Two events are said to be disjoint when they are mutually exclusive, that is: if one happens, the other cannot. The probability that one of two disjoint outcomes occurs is the sum of their individual probabilities.
ADDITION RULE
ADDITION RULE – DISJOINT EVENTS
The accuracy of this approach is guaranteed by the addition rule, which states: if A1 and A2 represent two disjoint outcomes, then the probability that one of them occurs is given by
$$P(A_1 \cup A_2) = P(A_1) + P(A_2)$$
(either A1 or A2 happens).
GENERAL ADDITION RULE
On the other hand, we speak of non-disjoint events when two outcomes are not mutually exclusive, which leads us to the general addition rule: if A and B are any two events, disjoint or not, the probability that at least one of them will occur is
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
where $P(A \cap B)$ is the probability that both events occur at the same time (A and B happen); note that for disjoint events $P(A \cap B) = 0$, so the equation simplifies to $P(A \cup B) = P(A) + P(B)$.
COMPLEMENTARY EVENTS
An event A is a subset of the sample space, and its complement $\bar{A}$ is constructed according to two important properties:
1. Every possible outcome not in A is in $\bar{A}$: $P(A \text{ or } \bar{A}) = 1$
2. A and $\bar{A}$ are disjoint: $P(A) + P(\bar{A}) = P(A \text{ or } \bar{A})$
The combination of these equations yields:
$$P(A) = 1 - P(\bar{A})$$
INDEPENDENT EVENTS
Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other; e.g. knowing that the coin landed heads on the first toss does not provide any useful information for determining what the coin will land on in the second toss → the outcomes of two tosses of a coin are independent.
PRODUCT RULE
When dealing with independent events, we use the product rule:
$$P(A_1 \text{ and } \dots \text{ and } A_k) = P(A_1) \cdot \dots \cdot P(A_k)$$
CHECKING FOR INDEPENDENCE
Consider events A and B. We are interested in the probability of A given that B is true.
Saying that B is true means restricting our attention to B: to compute probabilities, we no longer consider S as the collection of all possible outcomes, but only B. Therefore, when all outcomes are equally likely:
$$P(A|B) = \frac{\#(A \cap B)}{\#B}$$
If P(A occurs, given that B is true) = P(A|B) = P(A), then A and B are independent.
CONDITIONAL PROBABILITY
The conditional probability of the outcome of interest A given condition B is calculated as:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
MARGINAL AND JOINT PROBABILITY
When more variables are at stake, we must distinguish between marginal and joint probability.
MARGINAL PROBABILITY
We speak of marginal probability when an outcome depends solely on one variable.
E.g. given a deck of 52 cards, what is the probability of drawing a red card (event A)?
$$P(A) = \frac{26}{52} = 0.5$$
JOINT PROBABILITY
We speak of joint probability, instead, whenever an outcome depends on more than one process or variable occurring at the same time.
DEFINING CONDITIONAL PROBABILITY
Conditional probability builds on joint probability: we speak of conditional probability whenever there is a condition on the outcome. A conditional probability has two parts, the outcome of interest and its condition, and is expressed as:
P(outcome of interest | condition)
which is computed as:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
GENERAL MULTIPLICATION RULE
Given this definition, we are able to derive the general multiplication rule, which is used for events that might not be independent:
$$P(A \cap B) = P(A|B) \cdot P(B)$$
In general, if P(A|B) = P(A), then the events A and B are said to be independent.
INDEPENDENCE CONSIDERATIONS
If two events are truly independent, then knowing the outcome of one doesn't provide any information about the other:
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B)}{P(B)} = P(A)$$
TREE DIAGRAMS
The general multiplication rule is also useful for determining the probability of the branches in a tree diagram.
Such diagrams are particularly useful when we have a sequence of processes whose outcomes are conditioned on their predecessors.
INVERTING PROBABILITIES
In particular, by means of these tools we can invert probabilities, that is: if we are provided with the probability of event
A given the condition B, we can actually compute the probability of B using A as a condition.
BAYES' THEOREM
Such a process is formalized by Bayes' theorem, which states:
$$P(A_1|B) = \frac{P(B|A_1)\,P(A_1)}{P(B|A_1)\,P(A_1) + P(B|A_2)\,P(A_2) + \dots + P(B|A_k)\,P(A_k)}$$
where $A_2, \dots, A_k$ represent all the other possible outcomes of the first variable.
To apply it correctly there are two preparatory steps (a small numerical sketch follows the list):
1. identify the marginal probability of each possible outcome: $P(A_1), P(A_2), \dots, P(A_k)$;
2. identify the probability of outcome B conditioned on each possible scenario for the first variable: $P(B|A_1), P(B|A_2), \dots, P(B|A_k)$.
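A minimal Python sketch of Bayes' theorem with invented numbers: A1 is a rare condition (prior 1%), A2 its complement, and B a positive result from a hypothetical test.

```python
# Step 1: marginal probabilities P(A1), P(A2) (invented for illustration).
p_a = [0.01, 0.99]
# Step 2: conditional probabilities P(B|A1), P(B|A2) (also invented).
p_b_given_a = [0.95, 0.05]

# Denominator: total probability of B across all scenarios.
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))

# Inverted probability P(A1|B) via Bayes' theorem.
p_a1_given_b = p_b_given_a[0] * p_a[0] / p_b
print(f"P(A1|B) = {p_a1_given_b:.3f}")  # about 0.161
```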

RANDOM VARIABLES
A random variable is a numeric quantity whose value depends on the outcome of a random event.
• We use a capital letter, like X, to denote a random variable.
• The values of a random variable are denoted with a lowercase letter, e.g. x, as in P(X = x).
TYPES OF RANDOM VARIABLES
There are two types of random variables:
CONTINUOUS RANDOM VARIABLES
Continuous random variables take real (decimal) values.
DISCRETE RANDOM VARIABLES
A discrete random variable X takes only integer values; it is supported by the set of all values it can assume, denoted $S_X$, and is described by a probability mass function.
PROBABILITY MASS FUNCTION
The probability mass function of a discrete random variable X is a function defined on $S_X$ that gives the probability that X is exactly equal to $x \in S_X$; formally:
$$p(x) = P(X = x) \quad \text{for every } x \in S_X$$
Its properties are:
1. $p(x) \geq 0$;
2. $\sum_{x \in S_X} p(x) = 1$.
Any function that takes values on $S_X$ and fulfils the two properties above is a probability mass function for X.
CUMULATIVE DISTRIBUTION FUNCTION
For every real value $x \in \mathbb{R}$ the cumulative distribution function (cdf) of X is defined as:
$$F(x) = P(X \leq x) = \sum_{y \in S_X;\; y \leq x} p(y)$$
Properties of the distribution function:
1. F(x) is (not necessarily strictly) non-decreasing;
2. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$;
3. F(x) is right-continuous, that is $\lim_{x \to x_0^+} F(x) = F(x_0)$.
Every function that satisfies the three properties above is a distribution function. F(x) represents the probability that X takes a value less than or equal to x.
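A minimal sketch of a pmf and its cdf in Python, using a fair six-sided die as the discrete random variable:

```python
from fractions import Fraction

# pmf of a fair six-sided die: p(x) = 1/6 on S_X = {1, ..., 6}.
support = range(1, 7)
p = {x: Fraction(1, 6) for x in support}

# Check the two pmf properties: non-negativity and total mass 1.
assert all(px >= 0 for px in p.values()) and sum(p.values()) == 1

# cdf: F(x) = P(X <= x) = sum of p(y) over y in S_X with y <= x.
def F(x):
    return sum(py for y, py in p.items() if y <= x)

print(F(0), F(3.5), F(6))  # 0, 1/2, 1 (non-decreasing, from 0 to 1)
```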
EXPECTATION
We are often interested in the average outcome of a random variable; this value is called the expectation and is computed as the weighted average of all the possible outcomes:
$$\mu = E(X) = \sum_{x \in S_X} x \cdot p(x)$$
VARIABILITY IN RANDOM VARIABLES


The variability of a random variable can be described by means of its variance and standard deviation.
VARIANCE
Variance measures the dispersion of the probability distribution around the expected value; it is computed as:
$$\sigma^2 = Var(X) = \sum_{x \in S_X} [x - E(X)]^2 \cdot p(x)$$
or, alternatively, as:
$$Var(X) = E(X^2) - E(X)^2 = \sum_{x \in S_X} x^2 \cdot p(x) - E(X)^2$$
STANDARD DEVIATION
The standard deviation is simply the square root of the variance and it describes, like the variance, the dispersion of the probability distribution around the expected value; unlike the variance, however, it is expressed in the same units as the data. In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions; in that use it is known as the standard error.
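A minimal Python sketch of expectation and variance for a discrete random variable (a fair die again), checking that both variance formulas agree:

```python
# pmf of a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())                      # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())         # Var(X)
var_alt = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2  # E(X^2) - E(X)^2

print(f"E(X) = {mu:.3f}, Var(X) = {var:.3f}, sd = {var ** 0.5:.3f}")
print(f"alternative formula agrees: {abs(var - var_alt) < 1e-12}")
```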
LINEAR COMBINATION OF RANDOM VARIABLES
In the context of random variables, we have seen how:
1. a final value can sometimes be described as the sum of its parts in an equation;
2. intuition suggests that plugging the individual average values into this equation gives the average value we would expect in total.
This second point, however, is only guaranteed to be true for linear combinations of random variables.
LINEAR COMBINATION OF RANDOM VARIABLES
A linear combination of two random variables is defined as:
$$aX + bY$$
where X and Y are the random variables and a and b some fixed, known quantities. Its mean is:
$$\mu = E(aX + bY) = a \cdot E(X) + b \cdot E(Y), \quad \text{with } E(X) = \mu_X$$
In turn, as long as the random variables are independent, we can compute the variance as:
$$Var(aX + bY) = \sigma^2 = a^2\,Var(X) + b^2\,Var(Y)$$
and the standard deviation σ as the square root of the variance.
MULTIPLE RANDOM VARIABLES
Consider the sequence of random variables $X_1, X_2, \dots, X_n$. These n random variables are mutually independent if and only if for every $x_1, \dots, x_n$ it holds that:
$$P(X_1 \leq x_1 \cap X_2 \leq x_2 \cap \dots \cap X_n \leq x_n) = P(X_1 \leq x_1) \cdot P(X_2 \leq x_2) \cdot \dots \cdot P(X_n \leq x_n)$$
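A simulation sketch of the linear-combination rules (the distributions and constants below are invented): for independent X and Y, the simulated mean and variance of aX + bY should approach the theoretical values.

```python
import random

random.seed(7)

a, b, N = 2, -3, 100_000
xs = [random.gauss(10, 2) for _ in range(N)]  # X ~ N(10, 2^2)
ys = [random.gauss(5, 1) for _ in range(N)]   # Y ~ N(5, 1^2), independent of X

zs = [a * x + b * y for x, y in zip(xs, ys)]
mean_z = sum(zs) / N
var_z = sum((z - mean_z) ** 2 for z in zs) / (N - 1)

print(f"simulated E   = {mean_z:.2f}  (theory: {a * 10 + b * 5})")       # ~5
print(f"simulated Var = {var_z:.2f}  (theory: {a**2 * 4 + b**2 * 1})")   # ~25
```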
CONTINUOUS DISTRIBUTION
When dealing with continuous numerical variables, the probability density function is a smooth curve with total area equal to 1; but, since continuous probabilities are computed as areas under the curve, the probability of any exact value is, by definition, 0.

As a consequence, it is not possible to describe the probabilistic structure of a continuous random variable by means of a probability mass function; thus, we need an effective alternative: the cumulative distribution function.

THE CUMULATIVE DISTRIBUTION FUNCTION


The cumulative distribution function assigns probabilities to intervals of the kind $(-\infty, x]$, for $x \in \mathbb{R}$:
$$F(x) = P(X \in (-\infty, x]) = P(X \leq x), \quad x \in \mathbb{R}$$

PROBABILITY OF AN INTERVAL
Given an interval (a, b], the probability that the random variable falls in the interval, $P(a < X \leq b)$, is computed as:
$$F(b) = P(X \leq b) = P(X \in (-\infty, a] \cup (a, b]) = P(X \leq a) + P(a < X \leq b) = F(a) + P(a < X \leq b)$$
so that:
$$P(a < X \leq b) = F(b) - F(a)$$
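A tiny sketch of the F(b) − F(a) rule in Python, using a continuous distribution whose cdf is easy to write down explicitly (the uniform distribution on (0, 10) is chosen just for illustration):

```python
# cdf of the uniform distribution on (0, 10): F(x) = x/10 on that interval.
def F(x):
    return min(max(x / 10, 0.0), 1.0)  # clamp to 0 below and 1 above

a, b = 2, 7
print(F(b) - F(a))  # 0.5: half of the probability mass lies in (2, 7]
```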


DISTRIBUTIONS OF RANDOM VARIABLES


In the context of data analysis and statistical inference, different types of distributions arise.

NORMAL DISTRIBUTION
The most common distribution of a random variable is the normal distribution; it is a symmetric, unimodal, bell-shaped curve, denoted $X \sim N(\mu, \sigma^2)$ – read as "X is normally distributed with mean μ and variance σ²". Because the mean and the standard deviation describe the normal distribution exactly, they are considered its parameters.

Examples of variables that are normally distributed are biometric ones.

THE NORMAL FAMILY OF DISTRIBUTIONS
Strictly speaking, talking about "the" normal distribution is incorrect, since there are many normal distributions, each differing in its mean and standard deviation. In particular, a normal distribution is defined by the probability density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
with $\mu = E(X)$ and $\sigma^2 = Var(X)$.

STANDARDIZING WITH Z SCORES


The process of putting different variables on the same scale in order to compare them is known as standardization. Its products are standardized scores – or Z-scores – which represent the number of standard deviations above or below the mean that a specific observation falls; usually, observations more than 2 standard deviations from the mean – $|Z| > 2$ – are considered unusual.

The standardization is computed as:
$$Z = \frac{x - \mu}{\sigma}$$
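A minimal sketch of using Z-scores to compare observations measured on different scales (the exam means and standard deviations below are invented):

```python
# Z-score: number of standard deviations an observation lies from the mean.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z_anna = z_score(88, mu=75, sigma=10)  # exam graded out of 100
z_bob = z_score(27, mu=21, sigma=4)    # exam graded out of 36

print(f"Anna: Z = {z_anna:.2f}, Bob: Z = {z_bob:.2f}")  # 1.30 vs 1.50
print("unusual (|Z| > 2)?", abs(z_anna) > 2, abs(z_bob) > 2)
```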

THE STANDARD NORMAL DISTRIBUTION
A random variable Z has a standard normal distribution, $Z \sim N(0, 1)$, if its density function is:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$$
It possesses the following characteristics:

1. $E(Z) = 0$ and $\sigma = 1$;

2. it is symmetric around the mean and unimodal;

3. its probability density function is positive for every $z \in \mathbb{R}$, but the area under the curve outside the interval (−4, 4) is approximately 0.

THE 68-95-99.7 RULE – RELEVANT INTERVALS IN THE NORMAL DISTRIBUTION


In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie around the mean in a normal distribution within a width of two, four and six standard deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively.

In mathematical notation, these facts can be expressed as follows:

$$P(\mu - 1\sigma \leq X \leq \mu + 1\sigma) \approx 0.6827$$
$$P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545$$
$$P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973$$
$$P(\mu - 1.96\sigma \leq X \leq \mu + 1.96\sigma) \approx 0.9500$$
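These percentages can be checked directly in Python with the standard library's normal distribution:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

# P(mu - k*sigma <= X <= mu + k*sigma) = F(k) - F(-k) for the standard normal.
for k in (1, 2, 3, 1.96):
    print(f"within {k} sd: {Z.cdf(k) - Z.cdf(-k):.4f}")
# 0.6827, 0.9545, 0.9973, 0.9500
```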

THE NORMAL PROBABILITY TABLE


When we are dealing with standardized scores of a normal distribution, we can compute:

1. probabilities for a random variable;

2. the percentile in which an observation falls, where by percentile we mean the percentage of observations falling below a given data point – graphically speaking, the percentile is the area below the probability density curve to the left of that observation.

We do this by means of the normal probability table, which lists Z-scores and their corresponding percentiles. In such a table, Z-scores are expressed up to their first decimal on the rows, with the second decimal shown on the columns: the intersection of the row and column corresponding to the Z-score we are examining gives the percentile of interest.
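In software the table lookup is replaced by the cdf; a minimal sketch with Python's standard library:

```python
from statistics import NormalDist

Z = NormalDist()

# Percentile for a given Z-score (what the table row/column lookup returns).
print(f"P(Z <= 1.23) = {Z.cdf(1.23):.4f}")  # ~0.8907

# The inverse lookup: which Z-score marks the 95th percentile?
print(f"95th percentile: Z = {Z.inv_cdf(0.95):.4f}")  # ~1.6449
```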

EVALUATING THE NORMAL APPROXIMATION


Since normality is always an approximation, testing the appropriateness of the normal assumption is a key step in many data analyses. There are two visual methods for checking normality:

1. representing the data with a histogram over which the best-fitting normal curve is overlaid; such a curve is defined by the normal parameters – μ and σ – so that the closer the fit, the more reasonable the normal assumption;

2. examining the normal probability plot, in which the observations are plotted against the corresponding normal quantiles: the more closely they resemble a straight line, the more confident we can be in the assumption of normality.

In particular, it is worth noting that, as the sample size increases, histograms look more normal and the normal probability plot becomes straighter and more stable.

GEOMETRIC DISTRIBUTION
A geometric distribution is a discrete probability distribution based on the concept of Bernoulli trials.

BERNOULLI DISTRIBUTION
The Bernoulli distribution – named after the Swiss mathematician Jacob Bernoulli – describes the probability distribution of a random variable that can take exactly two possible outcomes; Stanley Milgram's obedience experiment is a classic example of repeated trials of this kind.

BERNOULLI TRIALS
Each statistical unit in a Bernoulli experiment is called a "trial", and each trial can either succeed or fail – the definition of success is given in the study – and is assigned, respectively, the numerical value 1 or 0.

BERNOULLI RANDOM VARIABLES


If X is a random variable with support {0, 1} – i.e. it can only take values within this set, thus a Bernoulli random variable – its outcome is x and the mathematical notation is:
$$X \sim Bin(1, \pi)$$
where π is the probability that x = 1 – the probability of success of the trial – and 1 − π the probability of failure (x = 0).
The probability mass function is therefore $p(x) = \pi^x (1 - \pi)^{1-x}$, the expected value of X is $\mu = \pi$ and the variance is $\sigma^2 = \pi(1 - \pi)$, so that $\sigma = \sqrt{\pi(1 - \pi)}$.

GEOMETRIC DISTRIBUTION
The geometric distribution describes the number of attempts necessary to achieve the first success in a sequence of independent, identically distributed (iid) Bernoulli random variables with probability of success π. More formally:

if the probability of success in one trial is π and the probability of failure is 1 − π, then the probability of finding the first success on the nth trial is given by:
$$P(X = n) = (1 - \pi)^{n-1}\, \pi$$

The mean, the variance and the standard deviation are given by:
$$\mu = \frac{1}{\pi} \qquad \sigma^2 = \frac{1 - \pi}{\pi^2} \qquad \sigma = \sqrt{\frac{1 - \pi}{\pi^2}}$$
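A minimal Python sketch of the geometric pmf (the success probability is invented), with a numerical check that the mean approaches 1/π:

```python
pi = 0.2  # hypothetical success probability

# P(first success on trial n) = (1 - pi)^(n - 1) * pi
def geom_pmf(n, pi):
    return (1 - pi) ** (n - 1) * pi

print([round(geom_pmf(n, pi), 4) for n in (1, 2, 3)])  # [0.2, 0.16, 0.128]

# Mean check: sum of n * p(n) over a long horizon approaches 1/pi = 5.
mean = sum(n * geom_pmf(n, pi) for n in range(1, 10_000))
print(round(mean, 4))  # ~5.0
```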

BINOMIAL DISTRIBUTION
The binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with probability of success π.

THE CHOOSE FUNCTION


The probability of observing exactly k successes in n independent trials is given by:
$$\binom{n}{k}\, \pi^k (1 - \pi)^{n-k} = \frac{n!}{k!\,(n - k)!}\, \pi^k (1 - \pi)^{n-k}$$
where $\binom{n}{k}$ is read as the "choose function" (n choose k).
Additionally, the mean, variance and standard deviation of the number of observed successes are:
$$\mu = n\pi \qquad \sigma^2 = n\pi(1 - \pi) \qquad \sigma = \sqrt{n\pi(1 - \pi)}$$
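A minimal Python sketch of the binomial pmf using the standard library's choose function (the parameters are invented):

```python
from math import comb, sqrt

# Probability of exactly k successes in n independent trials.
def binom_pmf(k, n, pi):
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 10, 0.3  # hypothetical values
print(f"P(X = 3) = {binom_pmf(3, n, pi):.4f}")  # ~0.2668
print(f"mu = {n * pi}, sigma = {sqrt(n * pi * (1 - pi)):.3f}")
```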

THE NORMAL APPROXIMATION OF THE BINOMIAL DISTRIBUTION


Whenever the sample size n is sufficiently large – so that both $n\pi$ and $n(1 - \pi)$ are at least 10 – the binomial distribution can be approximated by a normal distribution whose mean and standard deviation parameters correspond to those of the binomial distribution.

NORMAL APPROXIMATION FOR SMALL INTERVALS

However, when we are dealing with small intervals, the normal approximation may fail; for this reason it is useful to slightly modify the cut-off values so as to improve the approximation: namely, we reduce the lower end's cut-off value by 0.5 and increase the upper end's cut-off value by 0.5 (the continuity correction).
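A sketch comparing the exact binomial probability of a small interval with the plain and corrected normal approximations (the parameters are invented):

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi = 40, 0.35  # hypothetical binomial parameters
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))
approx = NormalDist(mu, sigma)

# Exact P(12 <= X <= 15) from the binomial pmf.
exact = sum(comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(12, 16))

plain = approx.cdf(15) - approx.cdf(12)          # no correction
corrected = approx.cdf(15.5) - approx.cdf(11.5)  # cut-offs moved by 0.5

print(f"exact = {exact:.4f}, plain = {plain:.4f}, corrected = {corrected:.4f}")
```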

MORE DISCRETE DISTRIBUTIONS


THE NEGATIVE BINOMIAL DISTRIBUTION


The so-called negative binomial distribution describes the probability of observing the kth success on the nth trial; for a distribution to be negative binomial, four conditions must hold:

1. The trials are independent;

2. Each trial outcome can be classified as a success or failure;

3. The probability of success 𝜋 is the same for each trial;

4. The last trial must be a success.

Its probability function is defined as:

$$P(\text{the } k\text{th success on the } n\text{th trial}) = \binom{n - 1}{k - 1}\, \pi^k (1 - \pi)^{n-k}$$

BINOMIAL VS. NEGATIVE BINOMIAL

In the binomial case, we typically have a fixed number of trials and consider the number of successes; in the negative binomial case we examine how many trials it takes to observe a fixed number of successes and require that the last observation be a success.
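A minimal Python sketch of the negative binomial pmf (the success probability is invented):

```python
from math import comb

# Probability that the k-th success occurs on the n-th trial.
def neg_binom_pmf(n, k, pi):
    return comb(n - 1, k - 1) * pi ** k * (1 - pi) ** (n - k)

pi = 0.5  # hypothetical success probability
# P(3rd success on the 5th trial): C(4,2) * 0.5^3 * 0.5^2
print(f"{neg_binom_pmf(5, 3, pi):.4f}")  # 0.1875
```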

POISSON DISTRIBUTION
With the Poisson distribution we estimate the number of events in a large population over a unit of time – e.g. the number of people getting infected by COVID-19 in Italy each day.

This distribution has a parameter λ, which represents the rate – namely, how many events we expect to observe per unit of time.

Suppose we are watching for events and the number of observed events follows a Poisson distribution with rate λ. Then:

$$P(\text{observe } k \text{ events}) = \frac{\lambda^k\, e^{-\lambda}}{k!}$$

where k may take the values 0, 1, 2, and so on, and e is the base of the natural logarithm. The mean and the standard deviation are:

$$\mu = \lambda \qquad \sigma = \sqrt{\lambda}$$
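A minimal Python sketch of the Poisson pmf (the rate is invented):

```python
from math import exp, factorial, sqrt

# P(observe k events) = lambda^k * e^(-lambda) / k!
def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

lam = 4.5  # hypothetical rate, e.g. expected daily event count
print(f"P(k = 3) = {poisson_pmf(3, lam):.4f}")  # ~0.1687
print(f"mu = {lam}, sigma = {sqrt(lam):.3f}")
```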
