Class 1

Introduction & Refresher on (some) statistical concepts
A.Y. 2022/23 – First Quarter

Statistical analysis I – Group 2 (advanced)
Giampiero Passaretta
giampiero.passaretta@upf.edu
Part 1
Logistics
When: September 27 – November 28 (no class on November 1)
Classroom: 40.245 (Roger de Llúria)
Materials: Aula Global (slides + lab sessions)
Evaluation: Mid-term + final exams (take-home)
Contact: giampiero.passaretta@upf.edu
Office hours: by appointment

Class structure
Two parts
1. Lecture (~1.30–1.45 h): Giampiero
2. Lab session (~1.15–1.30 h): Luca Giangregorio (except today!)
• Hands-on data
• Stata!
Group 2 – Advanced
Some pre-requisites (reviewed later today)
• Population vs sample
• Probability
• Probability distribution
• Normal distribution
• Standard normal
• Sampling and sampling distribution
• Inferential statistics
What is this course about?
LINEAR REGRESSION TECHNIQUES (Y = a + bX)
• Bivariate linear regression
• Multiple linear regression
• Categorical independent variables
• Specification errors & inefficiencies
• Generalized linear models
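To make Y = a + bX concrete, here is a minimal sketch in Python (the labs use Stata; Python is used here only for illustration). The data are made-up toy numbers and `fit_line` is a hypothetical helper name, not part of the course materials. It computes the least-squares intercept a and slope b by hand.

```python
# Toy data (hypothetical, for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(x, y):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f} X")
```

Here b is the expected change in Y for a one-unit increase in X, and a is the expected value of Y when X = 0.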
Textbook & Readings
Selected chapters & sections
• Agresti & Finlay. Statistical Methods for the Social Sciences, 4th Edition, Prentice Hall, 2009.
• Kennedy. A Guide to Econometrics, 6th Edition, MIT Press, 2008.
• Baum. An Introduction to Modern Econometrics Using Stata, Stata Press, 2006.
Course evaluation
Two take-home exams (~1.30–2h)
(1) Mid-term exam (week 5): 30%
(2) Final exam: 70%
Visible on Aula Global
• Problem sets (check problems in Agresti & Finlay)
• Interpretation of regression output
Weekly assignments
To be completed before next week’s class
• Small practical task in Stata (including interpretation of results)
• Or problems from Agresti & Finlay
Do not contribute to the final mark…
… no need to hand them in…
…BUT be prepared to discuss them in class/lab!

Why regression techniques? The use of regression analysis in the social sciences
LINEAR REGRESSION TECHNIQUES (Y = a + bX)
• Bivariate linear regression
• Multiple linear regression
• Categorical independent variables
• Generalized linear models
Part 2
The use of regression-type analysis in the social sciences
Linear regression is a
statistical tool
Three fundamental uses
Prediction: We may be interested in predicting a certain behavior
Association: We may be interested in the association between a certain behavior and other characteristics
Causation: We may be interested in the causes of a certain behavior
Example: Far-right voting
Dress and appearance can predict, and may be associated with, far-right voting, but likely do not cause voting behaviour
Real but spurious
• Does eating margarine make people divorce?
• Space launches and PhDs in Sociology
Take-home message
Association ≠ Causation
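A small simulation can show how a real but spurious association arises. This is a sketch under stated assumptions: both series are simulated (made-up numbers, not the margarine or space-launch data), and `correlation` is a hypothetical helper. The two series never influence each other; they only share an upward time trend.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

years = list(range(20))
# Two simulated series: each is a time trend plus its own independent noise.
series_a = [1.0 * t + random.gauss(0, 1) for t in years]
series_b = [2.0 * t + random.gauss(0, 3) for t in years]


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


r = correlation(series_a, series_b)
print(f"Correlation between the two unrelated series: {r:.2f}")
```

The correlation is strong, yet by construction neither series causes the other: the shared trend (time) drives both.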
Three fundamental uses
Prediction
Association: Description of reality
Causation: Searching for the causal link (cause-effect)
Most social research questions!
Research questions (examples)
Association (description)
• Who votes for far-right parties?
• Are higher-educated women more likely to be childless?
• Do children from higher socio-economic backgrounds perform better in school?
Causation (cause-effect)
• Does education impact party choice?
• Does higher education decrease the likelihood of having children?
• Does parental money make children smarter?
OUR FOCUS
Linear regression as a statistical tool…
…used in the context of a research design
The nature of a research finding – descriptive or causal – rests on the research design and not the statistical tool employed
Part 3
Refreshing some statistical concepts
Regression requires data!
Unit of analysis: INDIVIDUALS, COUNTRIES, POLITICAL PARTIES, FIRMS
POPULATION: dream data
SAMPLE: reality data
The inferential problem
POPULATION: what we are interested in
SAMPLE: what we (usually) work with
(If random…) Probability theory (inference on the population from the sample)
Probability: definition
What is the probability of getting heads when tossing a coin?
The probability of an outcome is the proportion of times that the outcome would occur in a very long sequence of trials
(frequentist view)
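The frequentist definition can be illustrated with a simulated fair coin. This is a sketch only: the coin is a pseudo-random number generator and `proportion_heads` is a hypothetical helper name.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible


def proportion_heads(n_tosses):
    """Toss a simulated fair coin n_tosses times; return the share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The proportion of heads settles near 0.5 as the sequence of trials grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```

Short sequences can stray far from 0.5; very long ones stay close to it, which is what "probability = long-run proportion" means.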
Probability distribution
Summarizes all possible outcomes of a variable and their probabilities
• Dichotomous variables: 2 values; 2 probabilities (example: coin toss)
• Categorical variable with 3 categories: 3 values; 3 probabilities
• Continuous variables: many values; probability assigned to intervals
Continuous variables
Graph: a smooth curve
Probability is the area under the curve
Full interval → P = 1 (100%)
Interval A [20+]: P(A) = .39 (39%)
Interval B [19–]: P(B) = 1 – P(A) = .61 (61%)
Normal distribution
Symmetric (bell-shaped)
Defined by mean and variance: X ∼ N(μ, σ²)
Below mean: P = .50 (50%); above mean: P = .50 (50%)
σ (standard deviation): variability around the mean
Normal distribution: Facts
X ∼ N(μ, σ²)
«Empirical rule»: 68–95–99.7
For any normal distribution, the probability of falling within z standard deviations of the mean IS THE SAME (regardless of the standard deviation)

Normal distribution
μ = 20, σ = 2
68% between 18 and 22 (±1σ)
95% between 16 and 24 (±2σ)
99.7% between 14 and 26 (±3σ)
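The empirical rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard library; `prob_within` is a hypothetical helper name. It compares the slide's distribution (μ = 20, σ = 2) with the standard normal.

```python
from statistics import NormalDist


def prob_within(z, mu, sigma):
    """Probability that a N(mu, sigma^2) variable falls within z SDs of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + z * sigma) - dist.cdf(mu - z * sigma)


# Same probabilities for both distributions, whatever the mean and SD.
for z in (1, 2, 3):
    print(z, round(prob_within(z, 20, 2), 4), round(prob_within(z, 0, 1), 4))
```

The exact values are about 0.683, 0.954 and 0.997, which is where the rounded 68–95–99.7 figures come from.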

Normal tail probabilities
“The probability of falling within z standard deviations of the mean is EQUAL for every normal distribution”
Tail probabilities for any z are provided in any Stats textbook
Standard Normal distribution
Values are standardized: z = (x − μ) / σ
Normal with mean 0 and SD 1: Z ∼ N(0, 1)
μ = 0, σ = 1
68% between −1 and +1
95% between −2 and +2
99.7% between −3 and +3
For all standardized distributions
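Standardizing means expressing a value as its distance from the mean in standard-deviation units, z = (x − μ) / σ. A minimal sketch, reusing the earlier slide's μ = 20 and σ = 2 (`standardize` is a hypothetical helper name):

```python
def standardize(x, mu, sigma):
    """Return the z-score: distance of x from the mean in SD units."""
    return (x - mu) / sigma


# Values from a N(20, 2^2) distribution mapped onto the standard normal scale.
for value in (14, 16, 18, 20, 22, 24, 26):
    print(value, standardize(value, mu=20, sigma=2))
```

The cut-offs 14, 16, 18, 22, 24, 26 become −3, −2, −1, +1, +2, +3, so one table of standard normal probabilities serves every normal distribution.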
(Standard) Normal distribution
Why is it important?
(1) It approximates many real-world distributions
(2) It is crucial for statistical inference (many sampling distributions are normal)
POPULATION: what we are interested in
SAMPLE: what we (usually) work with
(If random…) Probability theory (inference on the population from the sample)
Population
Sample
Population → PARAMETER of interest
Sample → ESTIMATOR of the population parameter
• We are interested in a characteristic of the population → Parameter
• We use a sample statistic to estimate the population parameter of interest → Estimator
Example: Mean age at UPF
Population of interest: UPF students
One random sample (sample size n = 100): we ask 100 UPF students their age
Population parameter of interest: mean age
Sample statistic/estimator (note: many options): mean age in the sample
Value of the sample statistic/estimator: mean age in the sample = 22 years (point estimate); standard deviation = 1 year
Is the mean age in the population exactly 22 years?
(Depends on how representative the sample is)
→ Even if representative, we estimate the population mean with error
Thought experiment
Imagine we could draw many random samples (n = 100) of UPF students…
We repeat the sampling! (constant size n)
[Figure: the population (mean age) and repeated samples, each giving a sample mean in years; together these form the distribution of sample means]
If we drew many random samples (equal size), we would get a distribution of sample means
Sampling distribution of the estimator (variability of estimates across samples of equal size)
Thought experiment
The sampling distribution of sample means is normal! (Central Limit Theorem)
[Figure labels: means from likely samples; means from unlikely samples]
The average of sample means approaches the true population mean (μ) → UNBIASED ESTIMATOR
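The thought experiment can be run as a simulation. This is a sketch under stated assumptions: the population of ages is made up (and deliberately skewed, so not normal), and `sample_mean` is a hypothetical helper. It draws 2,000 random samples of size 100 and looks at the resulting sample means.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 50,000 ages, skewed to the right (mean around 22).
population = [18 + random.expovariate(1 / 4) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_mean(n):
    """Draw one random sample of size n and return its mean."""
    return statistics.fmean(random.sample(population, n))


# Repeat the sampling many times, always with the same sample size.
sample_means = [sample_mean(100) for _ in range(2_000)]

print("Population mean:          ", round(pop_mean, 2))
print("Average of sample means:  ", round(statistics.fmean(sample_means), 2))
print("SD of sample means:       ", round(statistics.stdev(sample_means), 3))
print("Population SD / sqrt(100):", round(pop_sd / math.sqrt(100), 3))
```

The average of the sample means sits very close to the population mean, and their spread is close to the population standard deviation divided by √n, even though the population itself is not normal.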
Thought experiment
The sampling distribution of sample means is normal!
Something we do not know:
→ How much does the sampling distribution vary?
Standard error of the mean = σ / √n
σ: standard deviation in the population (UNKNOWN)
n: sample size (KNOWN)
In practice
Standard deviation in the population (σ) estimated by the standard deviation in the sample (s): se = s / √n
Remember: We have only drawn one sample (n = 100)
• Point estimate = 22 years (sample mean)
• Standard deviation in the sample = 1 year
We can now estimate the variability (standard error) of the sampling distribution of sample means: se = s / √n = 1 / √100 = 0.1 years
BUT WHY DO WE CARE?!?!?
Is the mean age in the population exactly 22 years?
• We estimate the population mean with error
• The standard error of the sampling distribution allows us to quantify uncertainty
We can estimate an interval of values around the point estimate (likely including the true value of the population parameter)
Interval estimation
Confidence interval = point estimate ± margin of error
• The confidence interval for a parameter is an interval of numbers within which the parameter is believed to fall
• The probability that the interval contains the parameter is called the confidence level (usually set to 95%)
«The confidence interval has a 95% probability of including the true parameter value»
«The confidence interval has a 5% probability of NOT including the true parameter value»
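The 95% refers to the procedure over repeated samples: if we built an interval this way for many random samples, about 95% of those intervals would include the true value. A simulation sketch, assuming a hypothetical normal population with true mean 22 and SD 1 (the function and constant names are illustrative):

```python
import math
import random
import statistics

random.seed(123)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD, N, REPS = 22.0, 1.0, 100, 2_000


def interval_covers_truth():
    """Draw one sample, build a 95% interval, report whether it covers TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)  # estimated standard error
    return mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se


coverage = sum(interval_covers_truth() for _ in range(REPS)) / REPS
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

In real research we draw only one sample, so we never know whether our particular interval is one of the roughly 5% that miss.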
How to define the margin of error?
• We know the sampling distribution of the point estimator is normal
• Hence, we know the probability that the estimator will fall within z standard deviations of the mean («68–95–99.7% rule»)
• Margin of error = standard error × the z-value that leaves a total tail probability of 1 – confidence level
WHAT?!
In practice
Which z-value?
For a 95% confidence level, the error we accept is 5%
This means we look up the z-value that leaves a total tail probability of 5% (we already know by now it is ~2!)
Normal tail probabilities
Look up the z-value for a 5% tail probability (2.5% left tail; 2.5% right tail)
Right tail: 0.025
Left tail: 0.025
5% overall
Z-value = 1.96
(95% of the values fall within 1.96 SD of the mean)
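Instead of a printed table, the z-value can be obtained from the inverse of the standard normal CDF. A minimal sketch using Python's standard library (`z_for_confidence` is a hypothetical helper name):

```python
from statistics import NormalDist


def z_for_confidence(level):
    """z-value leaving (1 - level) / 2 in each tail of the standard normal."""
    tail = (1 - level) / 2
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

For the 95% level this gives approximately 1.96, matching the table lookup; higher confidence levels give larger z-values.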
In practice
95% confidence interval: 22 ± 1.96 × (1/√100) ≈ 21.8–22.2
There is a 95% probability that the interval ~21.8–22.2 includes the true value of mean age in the population
Note: higher confidence level → wider interval; lower confidence level → narrower interval
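The whole calculation for the UPF example (sample mean 22, sample standard deviation 1, n = 100) can be sketched in a few lines. `confidence_interval` is a hypothetical helper name, and the normal approximation is assumed throughout.

```python
import math
from statistics import NormalDist


def confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    se = sd / math.sqrt(n)                         # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-tailed z-value
    margin = z * se                                # margin of error
    return mean - margin, mean + margin


low, high = confidence_interval(mean=22, sd=1, n=100)
print(f"95% CI: {low:.1f} to {high:.1f}")

# A higher confidence level gives a wider interval.
low99, high99 = confidence_interval(mean=22, sd=1, n=100, level=0.99)
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

The 99% interval is wider than the 95% one because a larger z-value multiplies the same standard error.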
But again…
What does all this have to do with linear regression?
Linear regression is no different
Population
Sample
Probability theory (inference on the population from the sample)
Analyse relationships in the sample…
… and try to generalize to the population

Assignments for next week
Knowing the standard error of the sampling distribution also allows for hypothesis testing!!
(1) Refresh «hypothesis testing»: Agresti & Finlay: 6.1, 6.2, 6.4, 6.5, 6.6
(We will see how it works in the context of linear regression in Week 3)
(2) Check the syntax of «Lab1.do» (Aula Global) and make sure you understand what it does
Other important distributions
(for statistical inference)
(1) Student’s t-distribution
(2) F-distribution
Student’s t-distribution
Approximates a standard normal when n > 30

F-distribution
