
Advanced Research Methods:

Data

Markus Lampe
Carlos Santiago-Caballero

Presentations
• Carlos and I are aggregating your choices into a program for the
workshop part of the course, and hope to send it to you this
afternoon (we are still missing some picks!)
• Many of you chose natural resource-related topics; here we will
most likely form groups based on your specifications (this might
mean that you present two or three papers together, trying to
highlight similarities and differences in design, methods and data)
• Some of you might end up with your “second choice” to avoid
overcrowded groups or too much of the same.
What you should do
• Tell us about the general design
• What is the hypothesis? Which theory is tested? Is that
interesting? (Why?)
• What is the method (formal econometrics? cross-section,
time-series, panel, something else?)
• Where does the data come from?
• Are the results convincing?
• Would you be able to replicate this? (so, could you get the
data, is the method feasible/outdated/too complex for you?)
• Do not discuss the results and what they mean for our
understanding of the world, but the approach and what it
means for how we (you!) can best investigate similar or related
issues.
The research process
Sampling: Units and level of analysis
[Figure: levels of analysis, ranging from “The world” (international macroeconomics) to “Individuals” (microeconomics)]
Research Design in Empirical Economics
• You have a topic/question you are interested in answering (last
session), e.g. why somebody is engaging in criminal activities
• This is important to determine what your “units of study” are
• You formulate an economic model, either mathematically derived
from utility maximization under constraints (see Becker 1968) or
based on your intuition (and literature review).
• This might – functional form not specified – look like (Wooldridge
2006, p. 3)
Hours criminal = f(“wage” per criminal hour, legal employment
hourly wage, additional income not from crime or
employment, probability of getting caught, probability of being
convicted if caught, expected sentence if convicted, age)
Further steps
• From this, you need to build an econometric model, the easiest being
linear (but it should match theory and reality), to find out whether it
matches real cases systematically
Hours criminal = β_0 + β_1 wage_criminal + β_2 wage_legal + β_3 otherincome
+ β_4 freqarrested + β_5 freqconvicted + β_6 avgsentence + β_7 age + u
• The model is likely not perfect, since many other factors can affect
criminal activity for the individual (family background, etc.)
• Things from the model often cannot be measured (like the “wage per
criminal hour”, or the original outcome, “hours spent on criminal activity”,
for which we would have to substitute some “crime frequency proxy”, or
the “probability of being arrested” for the individual)
• It might be difficult to find criminals that have not been caught or
convicted, so there might be a sample selection bias
• Everything we do not measure or proxy well or cannot measure at all
ends up in “u” and affects the goodness of fit and validity of our empirical
strategy. The choice of estimator depends on this (next week)
• Today we deal with how to find the data to empirically assess (“test”) the
validity of the model
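As an illustration of what estimating such a linear model involves, here is a minimal sketch in Python. It is not taken from Wooldridge or Becker: the data are simulated, the model is cut down to two regressors (wage_legal and freqarrested), and the "true" coefficients in TRUE_BETA are made-up values chosen only so that we can check whether ordinary least squares recovers them.

```python
import random


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows without the constant; a column of ones is added here.
    Returns the coefficients [b0, b1, ..., bk].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated cross-section of individuals (all numbers are assumptions).
random.seed(0)
TRUE_BETA = [10.0, -0.5, -8.0]  # intercept, wage_legal, freqarrested
X, y = [], []
for _ in range(2000):
    wage_legal = random.uniform(5, 25)
    freqarrested = random.uniform(0, 1)
    u = random.gauss(0, 1)  # everything we do not measure ends up here
    X.append((wage_legal, freqarrested))
    y.append(TRUE_BETA[0] + TRUE_BETA[1] * wage_legal
             + TRUE_BETA[2] * freqarrested + u)

beta_hat = ols(X, y)
print([round(b, 2) for b in beta_hat])
```

In this sketch u is pure noise, so the estimates land close to TRUE_BETA. If u contained an omitted factor correlated with the regressors, or if the sample only included individuals who had been caught, the estimates would no longer be trustworthy, which is the concern raised in the bullets above.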
Generalization:
Measurement and operationalization
• Empirical (quantitative) research is about measurement of
theoretical components of models
• Many concepts and issues that we research are not directly
measurable
• We need to devise indirect concrete measures and indicators for
abstract concepts we are interested in (e.g. inequality, well-being,
democracy/political stability)
• Transforming abstract unobservable concepts into concrete
observable indicators is called ‘operationalization’
• This is important on all levels, and we should never forget how
our “proxies” might affect the results, either by “measuring slightly
different things” or by their statistical properties (nominal, ordinal,
categorical, continuous, interval, ratio, etc.)
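To make operationalization concrete, the sketch below turns the abstract concept "inequality" into one possible observable indicator, the Gini coefficient of a list of incomes. The choice of the Gini coefficient is an assumption for illustration (the slides only name inequality as a concept), and the income figures are made up.

```python
def gini(incomes):
    """Gini coefficient of non-negative incomes: 0 = perfect equality.

    Uses the rank-based formula on the sorted data:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n.
    """
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Two made-up societies with the same mean income but different distributions.
equal_society = [25, 25, 25, 25]
unequal_society = [0, 0, 0, 100]
print(gini(equal_society), gini(unequal_society))
```

Another indicator (for example a top income share) would be an equally defensible proxy for the same concept and could rank societies differently, which is why the choice of proxy can affect the results.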
Economic models and data I: Microeconomic
• Becker’s model is a classical microeconomic model,
dealing with the behaviour of individuals; to test it we
therefore need data on individuals
• It is, furthermore, a static model, in the sense that it
looks at one decision (dedicate hours to crime) based on
benefits and costs
• We would therefore look for data on criminal activity at
one point in time across individuals. This is a “cross-
section”.
• Potential data sources: surveys/records of relevant
population (and “control group”), own interviews, official
government statistics (censuses), etc.
Economic models and data II: Macroeconomic
• You might be interested in why some countries (or
federal states, etc.) have higher crime rates than others
• If we suppose that individuals are “the same”
everywhere (so Becker’s theory is “universal”, not
“Chicago-specific”), we would look for differences
between countries in characteristics affecting crime rates
• Some factors (like the probability of being caught)
might be easier to measure at this level than at the level
of “individual crime choice”
• But we are no longer measuring the probability
of an individual’s criminality, but the frequency of crimes
in one society compared to another
• Sources: national, international statistics, newspaper
surveys, etc.
Economic models and data III: Changes
• Many questions/models, both micro- and macroeconomic, refer to the
effects of “changing” something (policy measures, “natural
experiments”), e.g. “Trade liberalization will boost growth rates”,
Malthusian arguments about income and population growth, or, in
Freakonomics, the claim that abortion legalization (and more police, more
prisoners and the waning of the crack epidemic) reduced crime rates
(Levitt 2004; Donohue/Levitt 1999, 2004)
• Some of these models can be tested indirectly through cross-sections,
but often time-series analysis or time-series cross-section (panel) is
appropriate (e.g. difference-in-differences to find a “treatment effect”)
• Econometric specifications aside, for this kind of analysis you need a
reliable number of observations over time (taking into account “ceteris
paribus”; e.g. Malthusian theory in England before industrialization) – or
a reasonable number of cases where things change in comparison to
cases where no change is observed
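The difference-in-differences idea can be sketched in a few lines. This is only the simplest two-group, two-period version computed from group means, not a full econometric specification, and the crime-rate figures are invented for illustration.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up crime rates for regions with and without a policy change.
treated_pre = [10, 12, 11]    # before the change, regions that later change
treated_post = [9, 10, 8]     # same regions after the change
control_pre = [8, 9, 10]      # regions where nothing changes, first period
control_post = [9, 10, 11]    # same regions, second period

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

The control group supplies the "ceteris paribus" comparison: its change over time stands in for what would have happened to the treated group without the change, which is only credible if both groups would otherwise have followed similar trends.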
Common fallacies
• Ecological fallacy or spurious individual-level inference (problems
arising if we move between individual characteristics and average
characteristics of a population of which individuals are members;
conclusions about individuals drawn from aggregate data might be
wrong)
• Individualistic fallacy (inferring aggregate relationships from
individual-level observations; supposing that the characteristics of a
society are the same as those of the sum of its individuals, and
assuming that these can be explained at the individual level only)
• Universal fallacy (assumption that patterns observed in a selection of
individuals would hold for its population, also called “hasty
generalization” – cf. “statistical inference” and assumptions about
random sampling, etc.)
• Selective fallacy (the use of carefully chosen cases to prove a point;
a case for “sample selection bias”; is the research material
representative?)
• Cross-sectional fallacy (assumption that what is observed at one
point in time would apply to other times - or to a subgroup or individual
over time)
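The ecological fallacy can be shown with a tiny constructed example. The data below are made up so that the relationship between two hypothetical variables x and y is negative inside every group, while the relationship between the group averages is positive.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical (x, y) observations for individuals in three populations.
groups = {
    "A": [(1, 5), (2, 4), (3, 3)],
    "B": [(4, 8), (5, 7), (6, 6)],
    "C": [(7, 11), (8, 10), (9, 9)],
}

# Individual-level correlation inside each group.
within = {
    name: pearson([x for x, _ in obs], [y for _, y in obs])
    for name, obs in groups.items()
}

# Aggregate-level correlation between the group averages.
mean_x = [sum(x for x, _ in obs) / len(obs) for obs in groups.values()]
mean_y = [sum(y for _, y in obs) / len(obs) for obs in groups.values()]
between = pearson(mean_x, mean_y)

print(within, between)
```

Someone who only sees the group averages would conclude that higher x goes with higher y, yet for individuals within each group the opposite holds. Reading the example in the other direction (from the within-group pattern to the aggregate) illustrates the individualistic fallacy.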
Going practical
• If you write an empirical thesis, assembling your database (or
understanding and formatting the databases you use) is likely one of
your biggest and most time-consuming tasks – and the most
important “bottleneck”
• Since analysis and results depend on the data, you should start early, at
least by checking whether suitable data is available
• Some say that the famous 80:20 (or even 90:10) rule also applies to
data collection: often you can easily get most of the data you need from
handy datasets and replication data offered by others, but your new
data/concepts will take most of your time in operationalization
• Start with a “data wish list” (based on what others have done, “pearl
growing”), incl. unit of study (macro, meso [branches, regions], micro
[firms, individuals]), likely sample size, cross-section/time series/panel,
frequency of observation. List all your variables, try to find proxies for
those you might not be able to operationalize, or check whether you can
do a valid analysis without them (from model/content and estimation
points of view), “controlling econometrically”.
• Get the data.
• Talk to your supervisor about problems with data availability or
econometrics (they might also warn you at an early stage that your
approach is not appropriate given your model or standard
econometric practice in “your field”)
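One way to keep such a wish list honest is to write it down as structured records and check which concepts still lack a proxy or a source. The sketch below is only a suggestion: the class name WishListItem, its fields, the example entries and the "hypothetical survey" source are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WishListItem:
    concept: str            # theoretical variable from the model
    proxy: Optional[str]    # observable indicator, None if not found yet
    source: Optional[str]   # where the data might come from, None if unknown


def unresolved(items):
    """Return the concepts that still lack a proxy or a data source."""
    return [it.concept for it in items
            if it.proxy is None or it.source is None]


# Hypothetical entries for the crime model discussed earlier.
wish_list = [
    WishListItem("hours spent on criminal activity",
                 "number of arrests per year", "hypothetical survey"),
    WishListItem("legal hourly wage",
                 "reported hourly wage", "hypothetical survey"),
    WishListItem("wage per criminal hour", None, None),
]

print(unresolved(wish_list))
```

Whatever remains on the unresolved list is what to raise with your supervisor: either a proxy has to be found, or you need to argue that the analysis is still valid without that variable.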
“unique datasets” make papers very interesting…
• Morgan Kelly and Cormac Ó Gráda (AER 2000), “Market Contagion: Evidence from the
Panics of 1854 and 1857”
“To test a model of contagion--where individuals hear some bad news and communicate
it to their acquaintances, who then pass it on, leading to a market panic--requires a
knowledge of the information networks of participants, something hitherto unavailable.
For two panics in the 1850s this paper examines the behavior of Irish depositors in a
New York bank. As recent immigrants, their social network was determined largely by
their place of origin in Ireland, and where they lived in New York. During both panics
this social network turns out to be the prime determinant of behavior.”
• Steven D. Levitt and Sudhir Alladi Venkatesh (QJE 2000), “An Economic Analysis of a
Drug-Selling Gang’s Finances”
“We use a unique data set detailing the financial activities of a drug-selling street gang
to analyze gang economics. On average, earnings in the gang are somewhat above the
legitimate labor market alternative. The enormous risks of drug selling, however, more
than offset this small wage premium. Compensation within the gang is highly skewed,
and the prospect of future riches, not current wages, is the primary economic
motivation. The gang engages in repeated gang wars and sometimes prices below
marginal cost. Our results suggest that economic factors alone are unlikely to
adequately explain individual participation in the gang or gang behavior.”
…because they are difficult to obtain
• they are not systematically and comparatively assembled by statistical
agencies on an international level
• they normally make for original research, since authors can claim that
“no one else has used similar data to look at this kind of question” or
even that “this allows for the first empirical test of this or that theory in
this and that institutional framework”
• they might exist by chance or be the result of special efforts by
researchers in collecting the data or searching for it in sometimes
seemingly unlikely sources (lucky “matching” of theoretical interest and
data collection by NGOs or state authorities for reasons that are “unique”)
• This does not mean that similar data is unavailable, but it does mean
that finding it might depend on similar “luck” (but by learning how the
original authors found it, you might learn to search) or “efforts” (collect
the data yourself or convince others to collect it)
• A safer bet is to go for something more standard… (sorry!)
Data sources I: UC3M Library

• You might also check “multidisciplinar” and “general” (databases which
might “also” contain data)
• Results: some only accessible in the library itself, most online
• The Economist Intelligence Unit
• IHS. Sources, among others: OECD, IMF, World Bank, United Nations,
Eurostat, Goldman Sachs, Investors Business Daily, The McGraw-Hill
Companies, Morgan Stanley, Reuters, The Wall Street Journal.
Other sources
• Very useful overview: American Economic Association, Resources for
Economists on the Internet (http://www.rfe.org), exploring it takes time!
• Data at the NBER (http://www.nber.org/data/), among them Penn World
Tables (macro), Macrohistory Database, and lots of replication datasets
• BREAD, Duke University, Household surveys from Developing Countries
(http://ipl.econ.duke.edu/dthomas/dev_data/index.html)
• Inter-university Consortium for Political and Social Research, ICPSR
(http://www.icpsr.umich.edu), mainly, but not only, US data (some free, some
only obtainable for members; Spain has a national membership via CIS)
• http://mimas.ac.uk (UK data repository)
• Cepal Stat, all kinds of data about Latin America and the Caribbean
(http://websie.eclac.cl/sisgen/ConsultaIntegrada.asp)
• National statistical bureaus have lots of data: look at a list at
http://www.census.gov/aboutus/stat_int.html or
http://unstats.un.org/unsd/methods/inter-natlinks/sd_natstat.asp or
http://www.bls.gov/bls/other.htm
• Look at the papers you read, authors often tell you where the data comes from
or even have replication datasets available (on their homepage or at another
place like NBER or journal websites – but check whether the author has
already used the data to answer “your question” in another paper)
Sources for this presentation
• Loraine Blaxter et al., Cómo se hace una investigación, Barcelona:
Gedisa, 2000, ch. 6.
• Jeffrey M. Wooldridge, Introductory Econometrics. A Modern
Approach, Mason, OH: Thomson, 2006, ch. 1.
• Gary Koop, Analysis of Economic Data, Chichester: Wiley, 2009, ch. 1
• Gary S. Becker, “Crime and Punishment: An Economic Approach”,
Journal of Political Economy 76 (1968), pp. 169-217,
www.jstor.org/stable/1830482
• Elizabeth Garnsey, “Designing Research on Management topics”,
http://www.ifm.eng.cam.ac.uk/mtms/events/documents/research_meth
ods_design.pdf
• Tero Mamia, “Quantitative Research Methods II: Descriptive
univariate analysis Inferential statistics”,
http://www.uta.fi/~tero.mamia/opetus/luennot/lecture2.pdf
• Reed College, Economics Department: “Writing a thesis”,
http://academic.reed.edu/economics/theses/thesis_writing.html
