
LECTURE

RESEARCH DESIGN & BACHELOR THESIS

MEASUREMENT AND THE DATA GENERATING PROCESS: DATA AND VARIABLES

PROF. MATTHEW LOVELESS

BAES
UNIVERSITY OF BOLOGNA
2023-2024
Outline
Data
Data Quality:
◦ The Data Generating Process
◦ Sampling and Randomization
Types of Data
◦ Units and Levels of Analysis
Questions
Research Design
A Research Design is a specific plan for conducting
research that provides a set of instructions for gathering
and interpreting evidence.
◦ Think of this as the transmissible and replicable part
(Essentially) 3 parts:
◦ Positioning One’s Question
◦ Validating Investigative Plausibility
◦ Method of Discovery
What are data?
•Data:
• Codified observations on relevant elements of a research question to
help determine if one explanation is better than another
Data quality is very important; data are valid and informative if and only if they:
◦ Are representative of the population they are meant to refer to
◦ Measure what they are supposed to measure
◦ Both conceptually and operationally
What are data?
•Data:
◦ Come from a reliable origin/source (i.e. how collected)
◦ There is enough of it
◦ e.g. The “I know a guy…” argument
◦ Anecdotes are not evidence
◦ There is enough information on all variables that are needed to
answer the question.
◦ Theory testing
◦ Model specification
What are data?
Datum: A single piece of data.
Data: A collection of codified observations on a topic we are interested in.
Variable: The choice of indicator that draws data together conceptually
and operationally.
Unit of Analysis: The subject of your study. What or who you are trying to
learn about. e.g.: countries, individuals, parties, parliaments, ethnic
etc...
Case: An included observation of the unit of analysis.
Level of Analysis: The nature of the data used to examine the units of
analysis.
Data Generating Process
◦ Five Considerations of Data Generation
◦ Validity
◦ Population & Sample
◦ Cost and availability
◦ Reactivity of data
◦ Ethics
The Data Generating Process is the true, underlying
phenomenon that is creating the data. The model is the (often
imperfect) attempt to approximate the phenomenon of interest.
Data Generating Process
What does that mean? It refers to three things:
1. The actual means by which data are created
2. The statistical approximation of this process
3. The probability model for any data creation procedure
Data Generating Process
Multiple Regression simultaneously tests several variables in
order to control for the impact of all included independent
variables on the dependent variable.
A basic multiple regression model:

y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon

◦ The \beta x terms: the statistical approximation
◦ The error term \varepsilon: the probability model – and everything else in the world
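To make the distinction concrete, here is a minimal sketch (Python is an assumed choice for this course; numpy and statsmodels must be installed) that simulates data from a known model and recovers the βs by OLS:

```python
# Minimal sketch: simulate y = b1*x1 + b2*x2 + e and estimate it by OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The (invented) "true" DGP: known betas plus random noise (epsilon).
y = 0.5 * x1 - 1.2 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # add an intercept column
fit = sm.OLS(y, X).fit()
print(fit.params)  # estimates should land near (0.0, 0.5, -1.2)
```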
Data Generating Process
Think of simply rolling dice. What factors might influence the
outcome?
◦ Symmetry of the dice; starting orientation; direction of throw; force of throw;
shape of the surface, spin; coefficient of friction between dice and surface; age of
dice (shape of edges and corners); air movement; temperature, etc…

We don’t – and in complicated reality, probably cannot – measure Every. Single. Thing.
Therefore: statistical approximation and probability
model of the true Data Generating Process
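A hedged sketch of the same point in code (all numbers invented): the outcome below depends on several dice-like factors, but our model measures only one of them; everything unmeasured is absorbed by the error term:

```python
# Sketch: a DGP with many influences, of which we observe only one.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
force = rng.normal(size=n)      # the one factor we measure
spin = rng.normal(size=n)       # unmeasured
friction = rng.normal(size=n)   # unmeasured
outcome = 2.0 * force + 0.8 * spin - 0.5 * friction + rng.normal(size=n)

# Statistical approximation: regress outcome on `force` alone.
beta_hat = np.cov(force, outcome)[0, 1] / np.var(force, ddof=1)
residuals = outcome - beta_hat * force
print(beta_hat)          # ~2.0 here, since force is independent of the rest
print(residuals.var())   # the probability model: "everything else in the world"
```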
Data Generating Process: Example
EU Support: statistical approximation
National versus European identities
Subjective evaluations of national and EU institutional performance
Social location
Citizens’ instrumental self-interest

What else (probability model)? [hint]: everything else in the world


◦ This is not – and cannot be – directly analyzed: it refers to an unobservable stochastic
process behind the actual numbers that will be analyzed (i.e. it is indirectly modeled)
Together, this is the true DATA GENERATING PROCESS; namely, everything
that possibly shapes EU support
Data Generating Process
Statistical approximation & Probability model
◦ We have to make simplifying assumptions
◦ e.g. linearity
◦ The model does not have all the information
◦ Our measurement may have some error
◦ The (true) data generating process may change
“All models are wrong but some are useful”
Science is based on the premise that Data Generating Processes underpin reality
and through careful thought and analysis we can uncover that underlying reality.
Sampling and Randomization
In order to make strong inferential claims, we must construct a
representative sample of the population for which we want to gain
some knowledge.
◦ Often don't know population parameters (but want them!)
◦ Parameter: number that describes a population
◦ Statistic: number calculated from data to estimate a parameter

Definitions
◦ Population: the entire set of your unit of analysis that you wish to draw
conclusions about.
◦ Sample: a subset of units in the population of interest.
◦ Sampling Frame: population from which sample is actually drawn
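A small illustration of parameter vs. statistic (a Python sketch; the population is invented, since in practice we never observe it whole):

```python
# Sketch: the parameter describes the population; the statistic estimates it.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=10, size=1_000_000)  # hypothetical
mu = population.mean()       # parameter: in real research, unknown

sample = rng.choice(population, size=500, replace=False)   # simple random sample
x_bar = sample.mean()        # statistic: computed from the data we have
print(mu, x_bar)             # x_bar serves as our estimate of mu
```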
Sampling and Randomization
How might a sampling frame differ from a population?
◦ EX: with phone survey, this would be all citizens living in households
with a phone
◦ How would this change if you used the internet instead?
◦ What about asking people on the street?

Inference involves generalizing from a sample to a (statistical) universe.
◦ It does so by estimating the probability that a sample result could be due to chance
◦ That is, our “statistic” serves as an estimate of the “parameter” (however well)
◦ Only possible with random samples
Sampling and Randomization
Key to avoiding major problems in empirical analyses is proper
sampling. And the key to proper sampling is randomization.
Simply, sampling requires randomization, without which we cannot
(reliably) make inferences.
◦ Do not systematically exclude any group
◦ Think: “sample frame”
◦ Equal opportunity to be selected
◦ Sampling procedures
◦ Expected value = ‘true’ value
◦ Given our techniques (our Methodology)
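The sketch below (Python; all values invented) illustrates why: with random sampling the expected value of our estimate equals the ‘true’ value, while a sampling frame that systematically excludes a group – think of a phone survey missing the phoneless – stays biased no matter how many samples we draw:

```python
# Sketch: random sampling vs. systematically excluding a group.
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
has_phone = rng.integers(0, 2, size=N) == 0   # ~half the population owns a phone
# Hypothetical trait that differs between the two groups:
trait = np.where(has_phone, rng.normal(60, 10, N), rng.normal(40, 10, N))
true_mean = trait.mean()

phone_frame = np.flatnonzero(has_phone)       # frame that excludes the phoneless
random_means, biased_means = [], []
for _ in range(1_000):
    random_means.append(trait[rng.choice(N, 200, replace=False)].mean())
    biased_means.append(trait[rng.choice(phone_frame, 200, replace=False)].mean())

print(true_mean)              # the 'true' value (about 50)
print(np.mean(random_means))  # ~true value: expected value = 'true' value
print(np.mean(biased_means))  # ~60: systematically wrong, however often we sample
```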
Sampling and Randomization
When we talk about a sample, there is inherently some
uncertainty about the population
◦ Therefore we use probability theory (and other things) to determine
the likelihood that our estimates are accurate
What we really want to know is: can any observed relationship
(i.e. a pattern in the sample) be inferred to apply to the population?
◦ That is, is the pattern in your data strong enough to conclude that the
apparent relationship is ‘real’ and not just chance?
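One standard way to answer “real or chance?” is a permutation test, sketched here (Python; the groups and effect size are invented): shuffling the group labels destroys any real relationship, so we can see how often chance alone matches the observed difference:

```python
# Sketch: permutation test of a difference between two groups.
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=100)   # e.g. EU support in group A
b = rng.normal(0.4, 1.0, size=100)   # group B: truly higher on average
observed = b.mean() - a.mean()

pooled = np.concatenate([a, b])
hits = 0
for _ in range(10_000):
    rng.shuffle(pooled)              # break any real group structure
    if pooled[100:].mean() - pooled[:100].mean() >= observed:
        hits += 1
print(observed, hits / 10_000)       # a small p-value: unlikely to be chance alone
```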
Data Availability: Units and Levels of Analysis
Macro-level data
What are ‘macro’ units of analysis?
What kinds of questions can we examine?
What kinds of questions can we not examine?

What are ‘Macro-processes’?
◦ International events
◦ Within-country events
◦ Cross-temporal events
◦ IR
◦ Patterns of behavior, but not individual behavior
Data Availability: Units and Levels of Analysis
Micro-level data
What are ‘micro’ units of analysis?
What kinds of questions can we examine?
What kinds of questions can we not examine?

What are ‘Micro-processes’?
◦ Cross-national events
◦ Cross-temporal events
◦ Panel data
◦ Comparative Politics
◦ Behavior and patterns of behavior
Data Availability: Existing data
Benefits of using existing data?
◦ Low cost
◦ Available
◦ Often reliable
◦ Common in the literature (transmissible)
Problems of using existing data?
◦ Low design capability
◦ Over-analyzed (common in the literature)
◦ Very often, not exactly what, where, and when you want
Data Availability: Making data
Benefits of making data?
◦ High design capability
◦ Novel/original conceptualization or operationalization
◦ Exactly what, where, and when you want
Problems of making data?
◦ High cost ($)
◦ Sufficient sample size and randomness for inference
◦ Design is not easy
◦ Needs to conform somewhat to existing data for comparability
Data Collection: ESS Example
◦ Polls, interviews, surveys: ‘Easy, ask people’
Asking people to provide answers to the questions we want answered
can be … problematic:
◦ Design of data collection:
◦ Are we able to capture what people can tell us?
◦ Observability:
◦ Are people able to articulate their values?
◦ Data collection process is very personal:
◦ Asking people to reveal a lot about themselves
Data Collection: ESS Example
◦ Polls, interviews, surveys:
The following are from Section B in the European Social
Survey 2020
◦ This section asks respondents about their political interest,
trust, electoral and other forms of participation, party
allegiance, socio-political orientations, immigration.
◦ That is, their political values, attitudes, and behaviors.
◦ Answer the (18) questions on your own, then we can discuss
them.
Data Collection: ESS Example
[Slides: screenshots of the ESS 2020 Section B questions]
◦ Polls, interviews, surveys: Quality of the questions?
◦ Are the data what we want to measure?
European Social Survey:
◦ B1 – political interest
◦ B13 – self-reported voting
◦ B43 – Foreigners make things better/worse
Data Collection: Example
[Figure – ESS9: Political Interest. Bar chart of percent (0–100) of respondents per country (DE, FR, CH, GB, IE, AT, NO, NL, FI, BE, CY, EE, SI, PL, IT, RS, BG, HU, CZ) across the response categories: Not at all interested, Hardly interested, Quite interested, Very interested]
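A chart like this can be rebuilt from the ESS microdata. The sketch below uses pandas and assumes a CSV extract with the ESS country (cntry) and political-interest (polintr) columns; the file name and the 1–4 coding are assumptions to verify against the ESS codebook:

```python
# Sketch: percent of respondents per country in each political-interest category.
import pandas as pd

ess = pd.read_csv("ESS9.csv")  # hypothetical extract of ESS round 9 data
labels = {1: "Very interested", 2: "Quite interested",
          3: "Hardly interested", 4: "Not at all interested"}  # assumed coding
ess["polintr"] = ess["polintr"].map(labels)

pct = (ess.groupby("cntry")["polintr"]
          .value_counts(normalize=True)
          .mul(100)
          .rename("percent"))
print(pct.head(8))  # e.g. each country's shares, summing to 100
```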
Data Collection: ESS Example
[Slides: Operationalization: Data Collection]
Data Collection: ESS Example
Equivalence?
◦ Cross-national/cultural congruency of concepts
◦ When/where can generalizations be drawn?
◦ We may impose our own understandings, i.e. cultural values,
when we ask our questions
Levels of analysis: fallacies
◦ Who are the units of analysis?
Data Collection: Aggregate Data
◦ [Aggregate] Observation
◦ Political Participation
◦ GDELT monitors print, broadcast, and
web news media in over 100 languages
from across every country in the world to
keep continually updated on breaking
developments anywhere on the planet.
◦ Electoral data for turnout
◦ What does this tell us about individuals?
◦ Ecological Fallacy
Fallacies: Consistency in the data
Ecological Fallacy:
◦ EX1: A county is 60% purple and 40% green. The mayor wins
with 60% of the vote….
◦ What can we conclude about individual purples or greens?
◦ EX2: If you have regional level education data and regional
level election turnout data, which of these hypotheses can
you use?
◦ Turnout is higher in regions with more educated citizens.
◦ More educated citizens are more likely to vote.
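A small simulation (Python; all numbers invented) of why only the first hypothesis is testable with regional data: a hidden regional factor drives both education rates and turnout, producing a strong regional correlation even though an individual’s own education has no effect on whether they vote:

```python
# Sketch: the ecological fallacy in miniature.
import numpy as np

rng = np.random.default_rng(4)
n_regions, n_per = 50, 2_000
wealth = rng.normal(size=n_regions)   # hidden regional factor

educ, voted = [], []
for w in wealth:
    # Wealthier regions have more educated citizens...
    educ.append(rng.random(n_per) < 0.5 + 0.1 * w)
    # ...and higher turnout, but turnout ignores each person's OWN education.
    voted.append(rng.random(n_per) < 0.6 + 0.1 * w)

reg_corr = np.corrcoef([e.mean() for e in educ], [v.mean() for v in voted])[0, 1]
ind_corr = np.corrcoef(np.concatenate(educ), np.concatenate(voted))[0, 1]
print(reg_corr)  # strong: educated regions do turn out more
print(ind_corr)  # near zero: says little about educated individuals
```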
Fallacies
Deductive fallacy: Individual conclusions based on group
properties
• Liberals support this issue, so Joe, a liberal, supports this issue.
Inductive fallacy: Group conclusions based on individual
evidence.
◦ Individualistic fallacy: “Joe, a liberal, likes it, so Liberals like it”
◦ In other words: one observation as the basis for generalization: “This swan is
white, so all swans must be white.”
Data Collection: Big Data Revolution (?)
Explosion of data availability, largely due to the growth of the internet.
Big Data: Huge amount of available data/very large datasets
◦ Billions of data points on thousands of variables and units of analysis.
◦ Google answers 100 billion search queries each month (Sullivan 2012).

Because of size and structure, using Big Data requires:
◦ Specific software, powerful computers, and specific tools developed by
corporations – although often open source – to manage and analyze the data:
◦ e.g. Google File System, which supports files so large that they have to be
distributed across hundreds of computers.
◦ Statistical techniques: machine learning methods
Data Collection: Big Data Revolution (?)
Machine learning: the use and development of computer systems that are able to
learn and adapt by using algorithms and statistical models to analyse and draw
inferences from patterns in data.
◦ Sometimes called Data Science, machine learning refers to the acquisition of
data, model building, and creation of an exploration scenario with different
validation settings.
Unsupervised ML: methods for finding patterns in the data, e.g.: groups of similar
items.
Supervised ML: methods that focus primarily on prediction.
Goal: build a statistical model that maximizes predictive power while avoiding
over-fitting and, at the same time, producing the best out-of-sample predictions.
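A compact sketch of both flavors (Python with scikit-learn, an assumed toolchain; the data are synthetic): KMeans finds groups without using the labels, while the supervised model is judged on held-out data – the out-of-sample test described above:

```python
# Sketch: unsupervised (clustering) vs. supervised (prediction) ML.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Unsupervised: find groups of similar items; the labels y are never shown.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: predict y, holding out 30% of cases to measure
# out-of-sample accuracy and guard against over-fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print(clf.score(X_tr, y_tr))  # in-sample accuracy (flatters the model)
print(clf.score(X_te, y_te))  # out-of-sample accuracy: the number that matters
```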
Data Collection: Big Data Revolution (?)
So, what is all the hubbub about?
1. Available in real time
2. Available at larger scale: Tens of millions of observations and huge
number of covariates. Thus, statistical power is not a big concern.
3. Novel types of variables: Ex. social network, email data.
4. Unstructured: Unstructured data do not come in the classic N×K
rectangular shape. Unlike survey data, they were not meant for
data analysis; they have to be reorganized (see the sketch below).
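Point 4 in practice – a hedged sketch (Python/pandas; the records are invented) of reorganizing raw, nested records into the rectangular N×K shape that analysis expects:

```python
# Sketch: from unstructured records to an N-by-K rectangle.
import pandas as pd

# Hypothetical scraped posts: nested, not meant for data analysis.
raw = [
    {"user": "a", "text": "Vote tomorrow!", "meta": {"likes": 3, "lang": "en"}},
    {"user": "b", "text": "No more politics", "meta": {"likes": 7, "lang": "en"}},
]

df = pd.json_normalize(raw)                        # flatten nested fields
df["n_words"] = df["text"].str.split().str.len()   # engineer a variable
print(df[["user", "meta.likes", "meta.lang", "n_words"]])
```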
Data Collection: Big Data Revolution (?)
Strengths:
- There’s a lot of it
- It’s always on
◦ ‘pre-‘/’post-‘ studies!
- Non-reactive
- Captures social relationships

Weaknesses:
- What is public and what is private data?
◦ Aggregating creates new issues of sensitivity and privacy
- Non-representative
◦ Twitter is not reality? Approx. 80% internet access in the US; 85% in the EU
- Lack of custom-made data
◦ “data” in that it is something, but is it really what we want?
◦ Drifting (across platforms) & algorithmic confounding
◦ Positivity bias: first adopters become gatekeepers
- How is data collected? APIs (a real black box)
Data: A Conclusion
Data are the building blocks for us to ‘know the world’ so that we
can arrange our understanding around that knowledge
◦ See gravity

Data must be transformed into meaningful representations of what we


intend to study, managed and handled to serve our analytical approach
with the ultimate goal of satisfying the demands of our research design.
◦ Data must be transformed into variables.
◦ Simply, a variable is data grouped to represent a concept.
