
Class 1

Introduction &
Refresh of (some) statistical
concepts

A.Y. 2022/23 – First Quarter


Statistical analysis I – Group 2 (advanced)

Giampiero Passaretta
giampiero.passaretta@upf.edu
Part 1

Logistics
When: September 27 – November 28 (No class: November 1)

Classroom: 40.245 (Roger de Llúria)

Materials: Aula Global (slides + lab sessions)

Evaluation: Mid-term + final exams (take home)

Contact: giampiero.passaretta@upf.edu

Office hours: by appointment


Class structure

Two parts
1. Lecture (~1.30–1.45 h)
   Giampiero

2. Lab session (~1.15–1.30 h)
   Luca Giangregorio (except today!)
   • Hands-on data
   • Stata!
Group 2 – Advanced

Some pre-requisites (reviewed later today)

• Population vs sample
• Probability
• Probability distribution
• Normal distribution
• Standard normal
• Sampling and sampling distribution
• Inferential statistics
What is this course about?

LINEAR REGRESSION TECHNIQUES (Y = a + bX)

Bivariate linear
regression

Multiple linear
regression

Categorical independent
variables

Specification errors
& Inefficiencies

Generalized linear
models
Textbook & Readings

Selected chapters & sections

• Agresti & Finlay. Statistical Methods for the Social Sciences, 4th Edition, Prentice Hall, 2009.
• Kennedy. A Guide to Econometrics, 6th Edition, MIT Press, 2008.
• Baum. An Introduction to Modern Econometrics Using Stata, Stata Press, 2006.
Course evaluation

Two take-home exams (~1.30–2h)

(1) Mid-term exam (week 5): 30%

(2) Final exam: 70%

Visible on Aula Global

• Problem sets (check problems in Agresti & Finlay)
• Interpretation of regression output
Weekly assignments

To be completed before next week's class

• Small practical task in Stata (including interpretation of results)
• Or problems from Agresti & Finlay

They do not contribute to the final mark…

… no need to hand them in…

… BUT be prepared to discuss them in class/lab!


What is this course about?
Why regression techniques? The use of regression analysis in the social sciences

Bivariate linear
regression

Multiple linear
regression

Categorical independent
variables

Specification errors

Generalized linear
models
Part 2

The use of regression-type analysis in the social sciences
Linear regression is a
statistical tool
Three fundamental uses

Prediction: We may be interested in predicting a certain behavior

Association: We may be interested in the association between a certain behavior and other characteristics

Causation: We may be interested in the causes of a certain behavior

Example: Far-right voting


Dress and appearance can predict and may be associated with far-right voting, but likely do not cause voting behaviour
Real but spurious

Does eating margarine make people divorce?


Real but spurious

Space launches and PhDs in Sociology


Take-home message

Association ≠ Causation
Three fundamental uses

Prediction
Association: Description of reality
Causation: Searching the causal link (cause-effect)

Most social research questions!
Research questions (examples)

Association (description)
• Who votes for far-right parties?
• Are higher-educated women more likely to be childless?
• Do children from higher socio-economic backgrounds perform better in school?

Causation (cause-effect)
• Does education impact party choice?
• Does higher education decrease the likelihood of having children?
• Does parental money make children smarter?
OUR FOCUS

Linear regression as a statistical tool…

…used in the context of a research design

The nature of a research finding – descriptive or causal – rests on the research design and not the statistical tool employed
Part 3

Refreshing some statistical concepts
Regression requires data!

Unit of analysis: INDIVIDUALS, COUNTRIES, POLITICAL PARTIES, FIRMS

POPULATION: dream data
SAMPLE: reality data
The inferential problem

POPULATION: What we are interested in
SAMPLE: What we (usually) work with

(If random…)

Probability theory
(inference on the population from the sample)
Probability: definition

What is the probability of getting heads when tossing a coin?

The probability of an outcome is the proportion of times that the outcome would occur in a very long sequence of trials
(frequentist view)
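The frequentist definition can be illustrated with a small simulation. The sketch below is written in Python rather than the course's Stata, and the function name `proportion_of_heads`, the seed, and the numbers of tosses are illustrative assumptions. It tosses a fair virtual coin and reports the share of heads as the sequence of trials gets longer.

```python
import random

def proportion_of_heads(n_tosses, rng):
    """Share of heads in n_tosses simulated tosses of a fair coin."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses

rng = random.Random(0)  # fixed seed (arbitrary choice) so the run is reproducible

# The proportion of heads settles near 0.5 as the number of trials grows.
for n in (10, 100, 10_000, 100_000):
    print(n, round(proportion_of_heads(n, rng), 3))

long_run = proportion_of_heads(100_000, rng)
```

With few tosses the proportion can be far from 0.5; with a very long sequence it stays close to it, which is what the frequentist view calls the probability.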
Probability distribution

Summarizes all possible outcomes of a variable and their probabilities

• Dichotomous variables
  2 values; 2 probabilities (example: coin toss)

• Categorical variables (example: 3 categories)
  3 values; 3 probabilities

• Continuous variables
  Many values; probability assigned to intervals
Probability distribution
Continuous variables

Graph: a smooth curve

Probability is the area under the curve

Full interval → P = 1 (100%)

Interval A [20+]
P(A) = .39 (39%)

Interval B [19–]
P(B) = 1 – P(A) = .61 (61%)
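The "area under the curve" idea can be written compactly. In the sketch below, $f$ denotes the density curve of the variable $X$ (a symbol assumed here for illustration, not one used on the slides), and A and B are the two intervals from the example.

```latex
\begin{align*}
P(a \le X \le b) &= \int_a^b f(x)\,dx, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1 \\
P(B) &= 1 - P(A) = 1 - 0.39 = 0.61
\end{align*}
```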
Normal distribution
Symmetric (bell-shaped)

Defined by mean and variance: X ∼ N(μ, σ²)

μ: mean
σ²: variance
σ: standard deviation (variability around the mean)

Below mean: P = .50 (50%)
Above mean: P = .50 (50%)
Normal distribution: Facts
X ∼ N(μ, σ²)

«Empirical rule»
68–95–99.7

For any normal distribution, the probability of falling within z standard deviations of the mean IS THE SAME

(regardless of the mean and the standard deviation)


Normal distribution
Symmetric (bell-shaped)

Defined by mean and variance: X ∼ N(μ, σ²)

Example: μ = 20, σ = 2

68% between 18 and 22 (μ ± 1σ)
95% between 16 and 24 (μ ± 2σ)
99.7% between 14 and 26 (μ ± 3σ)
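The percentages for the example with μ = 20 and σ = 2 can be checked numerically. This is a minimal sketch in Python (an assumption; the course itself uses Stata), and the helper names `normal_cdf` and `prob_between` are made up for illustration. It uses the error function from the standard library to compute areas under the normal curve.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution with mean mu and SD sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(lo, hi, mu, sigma):
    """Area under the normal curve between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

mu, sigma = 20.0, 2.0
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    print(f"within {k} SD ({lo:.0f} to {hi:.0f}): {prob_between(lo, hi, mu, sigma):.4f}")
```

Changing `mu` and `sigma` moves the interval endpoints but leaves the three probabilities unchanged, which is the point of the empirical rule.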


Normal tail probabilities
“The probability of falling within z standard deviations of the mean is the SAME for every normal distribution”

Tail probabilities for any z are tabulated in any statistics textbook
Standard Normal distribution
Values are standardized: z = (x − μ) / σ

Normal with mean 0 and SD 1: Z ∼ N(0, 1)

μ = 0, σ = 1

68% between −1 and +1
95% between −2 and +2
99.7% between −3 and +3

(For all standardized distributions)
(Standard) Normal distribution

Why is it important?

(1) Approximates many real-world distributions

(2) Crucial for statistical inference
(many sampling distributions are normal)
The inferential problem

Population: PARAMETER of interest
Sample: ESTIMATOR of the population parameter

• We are interested in a characteristic of the population → Parameter

• We use a sample statistic to estimate the population parameter of interest → Estimator
Example: Mean age at UPF

Population of interest
UPF students

One random sample (sample size n = 100)
We ask 100 UPF students their age

Population parameter of interest
Mean age

Sample statistic/estimator (note: many options)
Mean age in the sample

Value of the sample statistic/estimator
Mean age in the sample = 22 years (point estimate)
Standard deviation in the sample = 1 year
Example: Mean age at UPF

Is the mean age in the population exactly 22 years?
(Depends on how representative the sample is)
→ Even if representative, we estimate the population mean with error

Thought experiment
Imagine we could draw many random samples (n = 100) of UPF students…
Thought experiment
We repeat the sampling! (constant size n)

[Figure: repeated random samples drawn from the population, each giving its own sample mean in years]

Population: mean age (μ)
Samples: distribution of sample means

If we drew many random samples (equal size), we would get a distribution of sample means

Sampling distribution of the estimator
(variability of estimates across samples of equal size)
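The thought experiment can be run on a computer. The sketch below is in Python (an assumption; the labs use Stata) and works on a made-up population of 10,000 ages, so every number in it is hypothetical. It draws many random samples of constant size n = 100 and collects the sample means, giving an empirical sampling distribution.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (arbitrary) for reproducibility

# Hypothetical population: 10,000 student ages (invented for illustration)
population = [rng.gauss(22, 3) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 100            # constant sample size
n_samples = 2_000  # number of repeated random samples

# One sample mean per random sample: the empirical sampling distribution
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("population mean:       ", round(pop_mean, 3))
print("mean of sample means:  ", round(mean_of_means, 3))
print("SD of sample means:    ", round(sd_of_means, 3))
print("population SD / sqrt(n):", round(pop_sd / n ** 0.5, 3))
```

The mean of the sample means lands close to the population mean, and their spread is close to the population standard deviation divided by the square root of n.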
Thought experiment

The sampling distribution of sample means is normal! → Central Limit Theorem

Means from likely samples (near the center)
Means from unlikely samples (in the tails)

The average of the sample means approaches the true population mean (μ) → UNBIASED ESTIMATOR
Thought experiment

The sampling distribution of sample means is normal!

Something we do not know:
→ How much does the sampling distribution vary?

Standard error of the mean: $\mathrm{SE} = \sigma / \sqrt{n}$
  σ: standard deviation in the population (UNKNOWN)
  n: sample size (KNOWN)

In practice
The standard deviation in the population (σ) is estimated by the standard deviation in the sample (s): $\mathrm{SE} \approx s / \sqrt{n}$
Example: Mean age at UPF

Remember: We have only drawn one sample (n = 100)
→ Point estimate = 22 years (sample mean)
→ Standard deviation in the sample = 1 year

We can now estimate the variability (standard error) of the sampling distribution of sample means:

$\mathrm{SE} \approx s / \sqrt{n} = 1 / \sqrt{100} = 0.1$ years

BUT WHY DO WE CARE?!?!?


Example: Mean age at UPF

Is the mean age in the population exactly 22 years?
→ We estimate the population mean with error
→ The standard error of the sampling distribution allows quantifying uncertainty

We can estimate an interval of values around the point estimate

(likely including the true value of the population parameter)
Interval estimation

Confidence interval = point estimate ± margin of error

• The confidence interval for a parameter is an interval of numbers within which the parameter is believed to fall

• The probability that the interval contains the parameter (across repeated samples) is called the confidence level (usually set to 95%)

«The confidence interval has a 95% probability of including the true parameter value»

«The confidence interval has a 5% probability of NOT including the true parameter value»
Example: Mean age at UPF

How to define the margin of error?

→ We know the sampling distribution of the point estimator is normal

→ Hence, we know the probability that the estimator will fall within z standard errors of the population mean («68–95–99.7% rule»)

→ Margin of error = z × standard error, where z is the value that leaves a total tail probability of 1 – confidence level

WHAT?!
Example: Mean age at UPF

In practice

Confidence interval = 22 ± z × (1 / √100) = 22 ± z × 0.1

Which z-value?
For a 95% confidence level, the error we accept is 5%

This means we look up the z-value that leaves a total tail probability of 5% (we already know by now it is ~2!)
Normal tail probabilities

Look up the z-value for a 5% total tail probability
(2.5% left tail; 2.5% right tail)

Right tail: 0.025
Left tail: 0.025
5% overall

z-value = 1.96
(95% of the values fall within 1.96 SD of the mean)
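Instead of a printed table, the z-value can be obtained from the inverse of the standard normal distribution function. The sketch below uses Python's standard-library `statistics.NormalDist` (the course software is Stata, so Python is an assumption here), and the helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z-value that leaves (1 - level) / 2 in each tail of the standard normal."""
    tail = (1 - level) / 2            # e.g. 0.025 for a 95% confidence level
    return NormalDist().inv_cdf(1 - tail)

z95 = z_for_confidence(0.95)
print(round(z95, 2))  # 1.96
```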
Example: Mean age at UPF

In practice

95% confidence interval = 22 ± 1.96 × 0.1 = 22 ± 0.196

There is a 95% probability that the interval ~21.8–22.2 includes the true value of mean age in the population

Note: higher confidence level → wider interval
      lower confidence level → narrower interval
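The whole calculation for the UPF example (sample mean 22, sample standard deviation 1, n = 100) fits in a few lines. This is a minimal sketch in Python rather than Stata, with a hypothetical helper `mean_confidence_interval`, and it assumes the normal approximation used on the slides.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    se = sample_sd / math.sqrt(n)                   # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z for the chosen level
    margin = z * se                                 # margin of error
    return sample_mean - margin, sample_mean + margin

low95, high95 = mean_confidence_interval(22, 1, 100)
low99, high99 = mean_confidence_interval(22, 1, 100, level=0.99)

print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

Comparing the two intervals shows the note in action: raising the confidence level from 95% to 99% makes the interval wider.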
But again…

What does all this have to do with linear regression?
Linear regression is no different

Population
Sample

Probability theory
(inference on the population from the
sample)

Analyse relationships in the sample…

… and try to generalize to the population
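The same logic can be sketched for Y = a + bX. The Python example below (Python and all names and numbers in it are assumptions for illustration, not course material) simulates a population relationship with a known intercept and slope, draws one random sample of n = 100, and estimates a and b by least squares from that sample alone.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed (arbitrary) for reproducibility

TRUE_A, TRUE_B = 2.0, 0.5  # hypothetical population intercept and slope

def draw_sample(n):
    """Draw n observations from Y = a + bX + noise (invented population)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_A + TRUE_B * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def ols(xs, ys):
    """Least-squares intercept and slope for a bivariate regression."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

xs, ys = draw_sample(100)
a_hat, b_hat = ols(xs, ys)
print(f"estimated a = {a_hat:.2f}, estimated b = {b_hat:.2f}")
```

The estimates are close to, but not exactly equal to, the population values, just as the sample mean age estimates the population mean with error.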


Assignments for next week

Knowing the standard error of the sampling distribution also allows for hypothesis testing!!

(1) Refresh «hypothesis testing»:
Agresti & Finlay: 6.1, 6.2, 6.4, 6.5, 6.6
(We will see how it works in the context of linear regression in Week 3)

(2) Check the syntax in «Lab1.do» (Aula Global) and make sure you understand what it does
Other important distributions

(for statistical inference)

(1) Student's t-distribution

(2) F-distribution
Student’s t-distribution

Approximates a standard normal when n > 30


F-distribution
