Marvin Titus
Higher
Education Policy
Analysis Using
Quantitative
Techniques
Data, Methods and Presentation
Quantitative Methods in the Humanities
and Social Sciences
Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich,
Alyn Rockwood
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgments
There are many individuals who encouraged and inspired me, over the past
few years, to write this book. I am grateful to my colleagues, Alberto Cabrera
and Sharon Fries-Britt, at the University of Maryland who encouraged me to
write this book. I am grateful to my students, whom I taught in a graduate
course that covered many of the topics that are introduced in this book. They
contributed to my deeper understanding of the topics that I introduced in
the course and also inspired me to take on this project. I am particularly
grateful to my former students who reviewed the draft chapters of this book.
They are as follows: Christie De Leon, MacGregor Obergfell, Matt Renn,
and Liz Wasden. They provided valuable comments, edits, and suggestions
for improvement.
I thank Ozan Jaquette, who graciously makes institution- and state-level
data available to me and other researchers. Some of these data are used in
many of the examples in this book.
I would also like to thank Springer Publishing for their support in
publishing this book and their patience as I wrote and revised the book.
I would also like to thank my academic department, college, and university
for a semester-long sabbatical, which I used to develop the book proposal.
Finally, I would like to thank my wife Beverly, who provided encourage-
ment and support as I spent an enormous amount of time away from her
working on the draft manuscript of this book. I owe a great deal of gratitude
to her.
Contents
1 Introduction  1
  References  6
2 Asking the Right Policy Questions  9
  2.1 Introduction  9
  2.2 Asking the Right Policy Questions  10
    2.2.1 The What Questions  11
    2.2.2 The How Questions  14
    2.2.3 The How Questions and Quantitative Techniques  15
    2.2.4 So Many Answers and Not Enough Time  17
    2.2.5 Answers in Search of Questions  17
  2.3 Summary  18
  References  18
3 Identifying Data Sources  19
  3.1 Introduction  19
  3.2 International Data  20
  3.3 National Data  20
  3.4 State-Level Data  24
  3.5 Institution-Level Data  26
  3.6 Summary  27
  References  28
4 Creating Datasets and Managing Data  33
  4.1 Introduction  33
  4.2 Stata Dataset Creation  34
    4.2.1 Primary Data  34
    4.2.2 Secondary Data  35
  4.3 Summary  48
  4.4 Appendix  49
  References  51
Index  241
About the Author
Marvin Titus's research focuses on the economics and finance of higher
education and quantitative methods. While he has explored how institutional
and state finance influences student retention and graduation, Dr. Titus’
most recent work is centered on examining the determinants of institutional
cost and productivity efficiency. He investigates how state higher education
finance policies influence degree production. Through the use of a variety
of econometric techniques, Dr. Titus is also exploring how state business
cycles influence volatility in state funding of higher education. Named a
TIAA Institute Fellow in 2018, Dr. Titus has published in top-tier research
journals, including the Journal of Higher Education, Research in Higher
Education, and Review of Higher Education. He is an associate editor of
Higher Education: Handbook of Theory and Research and has served on
the editorial board of Research in Higher Education, Review of Higher
Education, and the Journal of Education Finance. Dr. Titus also serves on
several technical review panels for national surveys produced by the National
Center for Education Statistics. To conduct his research utilizing national
and customized state- and institution-level datasets, Dr. Titus uses several
statistical software packages such as Stata, Limdep, and HLM. He earned a
BA in economics and history from York College of the City University of
New York, MA in economics from the University of Wisconsin-Milwaukee,
and a PhD in higher education policy, planning, and administration from the
University of Maryland.
Chapter 1
Introduction
This book will also touch on the subject of policy research questions.
Higher education policy analysis is not only about asking the right questions,
it’s also about using the appropriate quantitative techniques to answer
those questions. While acknowledging and touching on the former, this
book focuses on the latter. Some books on higher education policy
analysis show how to frame a research agenda (e.g., Hillman et al. 2015).
A plethora of literature in a variety of journals addresses a wide range
of higher education policy areas such as state funding, tuition, student
financial aid, governance, accountability, and college completion. A smaller
body of literature introduces higher education researchers to the use of
specific quantitative techniques. As pointed out above, whole chapters in
Higher Education: Handbook of Theory and Research have been devoted to
a particular quantitative research method in higher education. However, to
date, there is no comprehensive reference text that provides guidance to
higher education policy analysts, researchers, and students with respect to
the research design that may be necessary to answer important questions
using quantitative techniques. A research design would include asking the
“right” questions, identifying existing data sources or creating a customized
dataset, and using the appropriate statistical techniques.
This book goes beyond providing guidance to higher education policy
analysts with respect to research design. On the front end, it also covers
the identification of data sources and the management and exploration of data.
On the back end, the book not only introduces advanced quantitative techniques
but also demonstrates how to present research results to higher education
policymakers and other lay people. Consequently, the book is organized in
the following fashion. Chapter 2 discusses the questions that higher education
policy analysts and researchers who use quantitative methods should ask,
and may not be able to answer. These questions may involve the use of a
variety of data and statistical techniques.
Chapter 3 introduces the reader to various secondary data sources that can
be used to answer policy or research questions or build custom datasets. This
chapter will provide an overview of easily accessible data for higher education
policy analysis across countries, U.S. states, institutions, and students. Most
of these data are publicly available but others are restricted and require a
license. In this book, only data from publicly available sources are accessed
and used in examples. Many higher education analysts and researchers have
used data from these publicly available sources to examine various policy-
related topics. It should be noted that this chapter does not provide an
exhaustive list of higher education data sources.
Chapter 4 shows how to create, organize, and manage analytic datasets
that can be used to answer specific higher education policy questions.
By way of step-by-step instructions on how to build a custom dataset, this
chapter shows how to import data into Stata datasets for analysis. Using
examples, the organization and management of customized datasets are also
demonstrated.
References
Acock, A. C. (2018). A Gentle Introduction to Stata (6th ed.). A Stata Press Publication,
StataCorp LLC.
Arellano, E. C., & Martinez, M. C. (2009). Does Educational Preparation Match
Professional Practice: The Case of Higher Education Policy Analysts. Innovative Higher
Education, 34 (2), 105–116. https://doi.org/10.1007/s10755-009-9097-0
Bielby, R. M., House, E., Flaster, A., & DesJardins, S. L. (2013). Instrumental variables:
Conceptual issues and an application considering high school course taking. In Higher
education: Handbook of theory and research (pp. 263–321). Springer.
Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The
Review of Higher Education, 23 (2), 119–132. https://doi.org/10.1353/rhe.2000.0002
DesJardins, S. L. (2003). Event history methods: Conceptual issues and an application to
student departure from college. In J. C. Smart (Ed.), Higher Education: Handbook of
Theory and Research (Vol. 18, pp. 421–471). Springer.
Fowles, J. T., & Tandberg, D. A. (2017). State Higher Education Spending: A Spatial
Econometric Perspective. American Behavioral Scientist, 61 (14), 1773–1798. https://
doi.org/10.1177/0002764217744835
Furquim, F., Corral, D., & Hillman, N. (2020). A Primer for Interpreting and Designing
Difference-in-Differences Studies in Higher Education Research. In L. W. Perna (Ed.),
Higher Education: Handbook of Theory and Research: Volume 35 (pp. 667–723).
Springer International Publishing. https://doi.org/10.1007/978-3-030-31365-4_5
Hillman, N. W., Tandberg, D. A., & Sponsler, B. A. (2015). Public Policy and Higher
Education: Strategies for Framing a Research Agenda. ASHE Higher Education Report,
41 (2), 1–98.
Jaquette, O., & Curs, B. R. (2015). Creating the Out-of-State University: Do Public
Universities Increase Nonresident Freshman Enrollment in Response to Declining
State Appropriations? Research in Higher Education, 56 (6), 535–565. https://doi.org/
10.1007/s11162-015-9362-2
Jaquette, O., Kramer, D. A., & Curs, B. R. (2018). Growing the Pie? The Effect
of Responsibility Center Management on Tuition Revenue. The Journal of Higher
Education, 89 (5), 637–676.
Lacy, T. A. (2015). Event history analysis: A primer for higher education researchers. In
M. Tight & J. Huisman (Eds.), Theory and Method in Higher Education Research (Vol.
1, pp. 71–91). Emerald Publishing Group.
McCall, B. P., & Bielby, R. M. (2012). Regression discontinuity design: Recent develop-
ments and a guide to practice for researchers in higher education. In Higher education:
Handbook of theory and research (pp. 249–290). Springer.
Rios-Aguilar, C., & Titus, M. A. (Eds.). (2018). Spatial Thinking and Analysis in Higher
Education Research: New Directions for Institutional Research: Vol 2018, No 180 (Vol.
2018). Wiley Press. https://onlinelibrary.wiley.com/toc/1536075x/2018/2018/180
Chapter 2
Asking the Right Policy Questions
Abstract This chapter discusses asking the right policy questions. It points
out how the nature of those questions and answers is shaped by the policy
context. Even with the most appropriate methodological tools, policy analysts
should be prepared to address follow-up questions. These include “what” and
“how” questions. The chapter also discusses how academic researchers have to
simultaneously use rigorous methods and provide results of their research
that are of use to policymakers and the general public.
2.1 Introduction
This chapter discusses higher education policy analysis and evaluation with
respect to the nature of policy questions. The first part of the chapter discusses the
policy context within which the right policy question is addressed by policy
analysts. The next section provides a perspective on the “what” questions
that policymakers ask policy analysts to address. The chapter then discusses
the “how” questions, followed by a section that explains how academic
researchers may also provide answers in search of questions. The chapter ends
with some concluding remarks in the summary section.
Policy analysis involves asking the right questions and providing the
answers. But how does one determine what constitutes the right questions?
It is necessary to clearly identify the policy issue at hand, who is concerned
about the issue, how to frame questions about the issue, and the possibility
of providing the relevant answers. Identification of a policy issue in higher
education is not as straightforward as one may think. Take for example the
issue of college affordability. The context and focus of that same policy issue
differ by who is discussing it. In the popular press, college affordability may
be presented in terms of the increase in the price of college (i.e., tuition
and fees). Among higher education advocacy groups such as the Institute
for Higher Education Policy, college affordability may be discussed within
the context of the extent to which students from low-income families are
being priced out of the higher education market. Therefore, with respect to
identifying policy issues, the audience also matters. Even if the issue and
audience have been identified, policy research and the policy issue have to be
bridged (Ness 2010). With regard to an identified policy issue, the question
that policy researchers and policymakers are asking may not be one and the
same. Moreover, the decisions of policymakers may not be linked to answers
to questions addressed by policy researchers. According to Ness (2010), a
direct application of policy research to the policymaking process is more closely
connected to the rational choice model. But a more realistic policymaking
process is the “multiple streams” model (Kingdon 2011). Policy analysts who
operate under the assumptions of the “multiple streams” model of the policy
process produce research for multiple audiences such as academics, advocacy
groups, policymakers, the media, as well as the general public. Consequently,
research findings have to be clearly articulated or written for a wide audience
of users who may or may not influence the policy process or policymakers.
Given the variety of groups, a variety of questions and answers may
have to be posed and addressed. This is rather challenging for the policy
analyst who must be cognizant of her or his audience, the policy process,
as well as a variety of analytical techniques, modes of communicating
the results, and the possible implications for policy. Different questions
will require different methods and analytical techniques. In general, the
“why” questions usually require a qualitative research design. The “what”
and “how” questions generally necessitate a quantitative research design,
which includes continuous and categorical data, measures or variables, and
statistical techniques. But to answer the questions, the policy analyst or
researcher must choose the appropriate data and statistical techniques, which
depend on several factors.
1 For more discussion on this, see Toutkoushian, R. K., & Paulsen, M. B. (2016). Economics
of Higher Education: Background, Concepts, and Applications (1st ed.).
Springer.
2 The issue of college affordability has increasingly received attention at the state and
national level. For example, see Miller, G., Alexander, F. K., Carruthers, G., Cooper, M.
A., Douglas, J. H., Fitzgerald, B. K., Gregoire, C., & McKeon, H. P. “Buck.” (2020). A
Many higher education policy inquiries are “how” questions. A state policy-
maker may inquire how a particular policy may have affected a particular
outcome or output. For example, policymakers in Maryland may want to
know how the adoption of a state-wide policy on articulation has affected
transfer rates from community colleges to 4-year institutions in Maryland.
Using quantitative techniques, the policy analyst can approach this question
in several different ways. First, he or she may want to answer this question
from the perspective of Maryland’s transfer rates before and after the
adoption of a state-wide policy on articulation, without comparison to
other states that have articulation policies. This is probably the easiest but
not necessarily the best way to answer this question. The second way to
answer this question is to compare Maryland’s transfer rates before and
after the adoption of a state-wide policy on articulation, with comparison
to comparable states that have no articulation policy. This approach to
answering the question involves collecting data on comparable states. But
this prompts the analyst to ask the following set of questions:
What states are considered to be comparable to Maryland?
Are only border states comparable to Maryland?
Are states in the same regional compact, the Southern Regional Education
Board (SREB), comparable to Maryland?
Estimates from quasi-experimental techniques such as instrumental variables
(IV), difference-in-differences (DiD), and regression discontinuity (RD)
regression support stronger causal inferences than those from ordinary least
squares (OLS), fixed-effects (FE), and random-effects (RE) regression models.
Other quantitative techniques that support causal inference include synthetic
control methods (SCM), a recently developed technique. An experimental
research (ER) design or “scientific” method is utilized to establish the cause-
effect relationship among a group of variables, with a random assignment to
treatment and control groups. While it is considered the “gold standard” of
research design where the researcher can manipulate the policy intervention
or “treatment”, in most instances, ER cannot be used to conduct policy
analysis or evaluation, due to legal, ethical, or practical reasons. Therefore,
the vast majority of analyses of higher education policy is conducted using
either descriptive statistics or correlational methods such as OLS, FE, and
RE or quasi-experimental methods such as IV, DiD, RD regression, or SCM.
The nature of the policy research question and data should determine the
most appropriate method to be utilized by the analyst. For example, if the
question is referring to the incidence of the adoption of a state policy (e.g.,
free tuition for community college students) across the United States by year,
the use of descriptive statistics or exploratory data analysis (EDA) may be all
that is needed. If the question is about the relationship between an outcome
(e.g., the enrollment of full-time students in community colleges within a
state) and a state higher education policy (e.g., free tuition at community
colleges), an ordinary least squares (OLS) regression model may be more
appropriate.3 If few states (e.g., 20) have implemented similar free tuition
policies among all 50 states and across many (e.g., 10) years, then a
fixed-effects regression model may be the most appropriate technique to address
the question in terms of the “average” influence of such policies.4 If complete
data are available in only a subset of states, then a random-effects regression
model probably should be employed.5 Finally, if the question is referring
to how the adoption of a particular policy in a particular state affected an
outcome in that state (compared to similar states without such a policy),
then a difference-in-differences (DiD) regression may be the most appropriate
method.6
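The examples in this book use Stata; purely as a language-neutral sketch of the correlational (OLS) option described above, the following Python snippet fits a simple regression of an enrollment outcome on a policy indicator using the closed-form least-squares solution. All state values here are invented for illustration.

```python
# Hypothetical state-level data: policy = 1 if the state offers free
# community college tuition, enroll = full-time community college
# enrollment (in thousands). Values are illustrative only.
policy = [0, 0, 0, 1, 1, 1]
enroll = [50.0, 48.0, 52.0, 60.0, 63.0, 57.0]

n = len(policy)
mean_x = sum(policy) / n
mean_y = sum(enroll) / n

# Closed-form simple OLS: slope = cov(x, y) / var(x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(policy, enroll))
sxx = sum((x - mean_x) ** 2 for x in policy)
slope = sxy / sxx              # estimated policy "effect" (association only)
intercept = mean_y - slope * mean_x

print(f"intercept={intercept:.1f}, slope={slope:.1f}")
# -> intercept=50.0, slope=10.0
```

With a binary policy indicator, the OLS slope is simply the difference in mean enrollment between adopting and non-adopting states, which is why such an estimate describes an association rather than a causal effect.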
If one chooses to address the question in terms of the effect of the
policy in a specific state (e.g., Tennessee) or group of states (e.g., Tennessee
and Maryland) compared to states that did not adopt the policy, and has access
to data for only a few comparable states (e.g., members of the Southern
Regional Education Board, or SREB) and a few years, then DiD regression or
SCM may be the method of choice.
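The two-group, two-period logic behind a DiD estimate can be sketched in a few lines. The book's own examples use Stata; the states, years, and transfer rates below are invented for illustration.

```python
# Hypothetical transfer rates (%) before/after an articulation policy.
# Maryland adopts the policy; a comparable SREB state (here, Virginia,
# chosen arbitrarily) does not.
rates = {
    ("MD", "pre"): 20.0, ("MD", "post"): 26.0,   # treated state
    ("VA", "pre"): 21.0, ("VA", "post"): 23.0,   # comparison state
}

# Difference-in-differences: (treated change) minus (comparison change)
treated_change = rates[("MD", "post")] - rates[("MD", "pre")]      # 6.0
comparison_change = rates[("VA", "post")] - rates[("VA", "pre")]   # 2.0
did_estimate = treated_change - comparison_change

print(f"DiD estimate: {did_estimate:.1f} percentage points")
# -> DiD estimate: 4.0 percentage points
```

The comparison state's change serves as the counterfactual trend for the treated state; this parallel-trends assumption is what DiD relies on, which is why the choice of comparable states matters so much.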
2.3 Summary
This chapter discussed asking and answering higher education policy
questions. It was pointed out how the nature of those questions and answers
is shaped by their context. These questions are not always straightforward
and may lead to additional questions by policymakers. Policy analysts should
be prepared to address follow-up questions. This chapter also discussed the
nature of policy inquiries, which may include “what” questions or “how”
questions or both. Policy analysts have to choose the appropriate methods to
address these questions. The chapter ended with a discussion of how academic
researchers may have to simultaneously use rigorous methods and provide
results of their research that are of use to policymakers and the general public.
References
Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The
Review of Higher Education, 23 (2), 119–132.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and
mixed methods approaches (5th ed.). Sage Publications.
Kingdon, J. W. (2011). Agendas, Alternatives, and Public Policies. Netherlands: Longman.
Ness, E. C. (2010). The role of information in the policy process: Implications for the
Examination of Research Utilization in Higher Education Policy. In J. C. Smart (Ed.),
Higher education: Handbook of theory and research (Vol. 25, pp. 1–49). Springer.
Zinth, K., & Smith, M. (2012). Tuition-Setting Authority for Public Colleges and
Universities (p. 10). Education Commission of the States.
Chapter 3
Identifying Data Sources
3.1 Introduction
This chapter identifies and discusses some of the major data sources that are
available to conduct higher education policy research. The first part of the
chapter introduces sources of data that include international organizations.
The next section discusses the U.S. national-level data from the U.S.
Department of Education and other sources. Higher education institutional-
level data are introduced and discussed in the following section of the chapter.
The last section of the chapter provides concluding statements on data
sources.
The High School Longitudinal Study of 2009 (HSLS:09) followed a nationally
representative sample of students in 2009 beginning in the ninth grade at 944
schools. The first follow-up of the HSLS:09 was in 2012. In 2013, there was an
update to HSLS:09.
A second follow-up, conducted in 2016, collected information on students
in postsecondary education and/or the workforce. In 2017, HSLS:09 was
supplemented with PETS. Information on accessing HSLS:09 can be found at:
https://nces.ed.gov/surveys/hsls09/. Recently, a few higher education policy
analysts and researchers have used HSLS:09 to examine college readiness
(e.g., Alvarado and An 2015; George Mwangi et al. 2018; Kurban and Cabrera
2020; Pool and Vander Putten 2015) and college enrollment (e.g., Engberg
and Gilbert 2014; Goodwin et al. 2016; Nienhusser and Oshio 2017; Schneider
and Saw 2016).
National Postsecondary Student Aid Study (NPSAS). NPSAS is a nation-
ally representative cross-sectional survey, with a focus on financial aid, of
students enrolled in postsecondary education institutions. Beginning in 1987,
the NPSAS survey has been conducted almost every other year. An NPSAS
survey is planned for 2020 and will include state-representative data for
most states. In addition to student interviews, NPSAS includes data from
institution records and government databases. Analysts and researchers can
perform analysis on NPSAS data only through NCES, via its Datalab at:
https://nces.ed.gov/surveys/npsas/. NPSAS microdata or restricted use file
data are only available to analysts and researchers who have been granted a
license from IES/NCES. The federal government, higher education advocacy
groups, and researchers have used NPSAS data to produce reports to help
inform policy on federal financial aid.
Beginning Postsecondary Students Longitudinal Study (BPS). The BPS,
a spin-off of the NPSAS, is a nationally representative survey, based on
a multistage sample of postsecondary education institutions and first-time
students. Drawing on cohorts from the NPSAS, the BPS surveys collect
data on student demographic characteristics, PSE experiences, persistence,
transfer, degree attainment, entry into the labor force and/or enrollment
in graduate or professional school. The first BPS survey was conducted in
1990 (BPS:90/94) and followed a cohort of students through 1994. Since then,
BPS surveys of students have been conducted at the end of their first, third
and sixth year after entering a postsecondary education (PSE) institution.
The BPS has been repeated every few years. Beginning with the BPS:04/09,
PETS information is also provided. The most recent BPS (BPS:12/17) survey
followed a cohort of 2011–2012 first-time beginning students, with a follow-
up in 2017. The next BPS survey will collect information on students who
began their postsecondary education in the academic year 2019–2020 and
will follow that cohort in surveys to be conducted in 2020, 2022, and 2025.
Users can access a limited amount of BPS data through the NCES Datalab.
Information on the accessing data from the BPS can be obtained from NCES
at: https://nces.ed.gov/surveys/bps/. The complete BPS with microdata are
available to restricted use file license holders. Many higher education policy
analysts and researchers (too numerous to mention) have used the BPS to
investigate college student persistence and completion.
The Baccalaureate and Beyond Longitudinal Study (B&B) is a nationally
representative survey, based on a sample of postsecondary education students
and institutions, of college students’ education and labor force experiences
after they complete a bachelor’s degree. Drawing from cohorts in the NPSAS,
the B&B surveys also collect information on degree recipients’ earnings,
debt repayment, as well as enrollment in and completion of graduate and
professional school. Students in the B&B survey are followed up in their
first, fourth, and tenth year after receiving their baccalaureate degree. The
first B&B survey was conducted in 1993, with follow-ups in 1994, 1997, and
2003. The second B&B survey (B&B:2000/01) had only one follow-up, which
was in 2001. The B&B:2008/12, which focuses on graduates from STEM
education programs, was completed in 2008 and included follow-ups in 2009
and 2012. The B&B:2008/18 will include a follow-up in 2018. Using the NCES
Datalab, analysts can perform limited analyses of B&B data. Microdata from
the B&B surveys, which include PETS information, are only available to users
who are given a license by IES/NCES to use restricted use files. Numerous
analysts and researchers have used the B&B to examine such topics as: labor
market experiences and workforce outcomes of college graduates (e.g., Bastin
and Kalist 2013; Bellas 2001; Joy 2003; Strayhorn 2008; Titus 2007, 2010);
graduate and professional school enrollment and completion (e.g., English
and Umbach 2016; Millett 2003; Monaghan and Jang 2017; Perna 2004;
Strayhorn et al. 2013; Titus 2010); student debt and repayment (e.g., Gervais
and Ziebarth 2019; Millett 2003; Scott-Clayton and Li 2016; Velez et al. 2019;
Zhang 2013); and family formation (e.g., Velez et al. 2019) and career choices
(e.g., Xu 2013, 2017; Zhang 2013) of bachelor’s degree recipients.
Digest of Education Statistics. In addition to providing microdata on
institutions and students, the U.S. Department of Education (DOE) also pro-
duces statistics at an aggregated or macro level on postsecondary education.
For example, IES/NCES publishes the Digest of Education Statistics, which
provides national- and state-level statistics on various areas of education,
including postsecondary education (PSE). For PSE, these areas include:
institutions; expenditures; revenues; tuition and other student expenses;
financial aid; staff; student enrollment; degrees completed; and security and
crime. The statistics on PSE are mostly based on aggregated data from NCES
surveys discussed above (e.g., IPEDS, NPSAS, BPS, B&B). The statistics,
aggregated over time and in some cases across states, are provided in tables.
The tables can be downloaded in an Excel format, which can be used to either
produce reports or merge with data from other sources to conduct statistical
analyses.
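As a minimal sketch of that merge step (the book performs such merges in Stata; the table contents below are invented, and real Digest tables would first be exported from Excel to CSV), two state-level tables can be joined on a shared state identifier:

```python
import csv
import io

# Stand-ins for two downloaded state-level tables, e.g. a Digest
# enrollment table and a finance table from another source. All
# values are illustrative only.
digest_csv = "state,enrollment\nMD,360000\nVA,520000\n"
finance_csv = "state,approp_per_fte\nMD,7200\nVA,6100\n"

def read_table(text):
    """Index each row of a CSV table by its state identifier."""
    return {row["state"]: row for row in csv.DictReader(io.StringIO(text))}

digest = read_table(digest_csv)
finance = read_table(finance_csv)

# Merge on the common key, keeping states present in both tables
merged = {
    st: {**digest[st], **finance[st]}
    for st in digest.keys() & finance.keys()
}
print(merged["MD"])
# -> {'state': 'MD', 'enrollment': '360000', 'approp_per_fte': '7200'}
```

Joining on a shared key such as the state abbreviation is the same logic as Stata's merge command, which Chapter 4 demonstrates on actual datasets.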
Current Population Survey (CPS). The U.S. Census Bureau also provides
national-level postsecondary education data to the public in the form of
the CPS. U.S. Census Bureau microdata sample data files are available to
researchers who are given authorization to use specific datasets at one of the
secure Federal Statistical Research Data Centers. For example, the restricted
use dataset of the CPS, School Enrollment Supplement provides microdata
at the household level. The CPS School Enrollment Supplement has been
used to examine demographic differences in postsecondary enrollment (e.g.,
Hudson et al. 2005; Jacobs and Stoner-Eby 1998; Kim 2012).
Other Sources of National Data. The U.S. government collects and
disseminates national higher education data that focus on specific areas.
The Office of Postsecondary Education of the U.S. Department of Education
(DOE) provides data on campus safety and security. The College Scorecard,
which is maintained by the U.S. DOE, produces a national database on
student completion, debt and repayment, earnings, and other data.
The College Board. There are other sources of aggregate national
postsecondary education data, such as the College Board, that draw on nationally
representative surveys and federal administrative information. The College
Board data, however, are focused mainly on tuition price and college student
financial aid across years and to a limited extent across states. The data can
be accessed at the College Board website (https://research.collegeboard.org/
trends/trends-higher-education) and can be downloaded in Excel format.
Policy analysts use the College Board data to explain patterns and trends
in average higher education tuition prices (e.g., Baum and Ma 2012; Heller
2001; Mitchell 2017; Mitchell et al. 2016) and college student financial aid
(e.g., Baum and Payea 2012; Deming and Dynarski 2010).
State-level data have also been used to examine state need- and merit-based
financial aid (e.g., Cohen-Vogel et al. 2008; Doyle 2006; Hammond et al.
2019; Titus 2006).
State Higher Education Executive Officers. Another source of state-level
higher education data is the State Higher Education Executive Officers
(SHEEO). SHEEO provides data on higher education finance (i.e., state
appropriations and net tuition revenue) and postsecondary student unit
record systems. The SHEEO finance data, some of which go as far back as fis-
cal year 1980, can be downloaded (https://shef.sheeo.org/data-downloads/)
in an Excel file format. SHEEO finance data have been used by several higher
education policy analysts and researchers to produce reports and studies on
state support for higher education (e.g., Doyle 2013; Lacy and Tandberg 2018;
Lenth et al. 2014; Longanecker 2006).
National Science Foundation (NSF). The National Science Foundation
(NSF) is another source of state-level higher education data. More specifically,
NSF provides statistics based on Science and Engineering Indicators (SEI)
State Indicators (https://ncses.nsf.gov/indicators/states/). These statistics
include the number of science and engineering (S&E) degrees conferred,
academic research and development (R&D) expenditures at state colleges and
universities, academic S&E article output, and academic patents awarded.
The data are available to the public and can be downloaded in Excel file
format. Utilizing NSF/SEI state-level data, a few analysts and researchers
(e.g., Coupé 2003; Fanelli 2010; Wetter 2009) have addressed the topic of
academic R&D.
Regional Compacts. There are several academic common markets or
regional compacts that provide state-level higher education data. The
Southern Regional Education Board (SREB) is a regional compact of 16
member states in the South that provides state-level information to the
public. With respect to higher education, SREB produces a “factbook”
(https://www.sreb.org/fact-book-higher-education-0) which contains tables
on data such as the population and economy, enrollment, degrees, student
tuition and financial aid, faculty, administrators, revenue, and expenditures.
These tables can be downloaded in an Excel file format.
The Western Interstate Commission for Higher Education (WICHE) is an
academic common market that is composed of 15 Western states and member
U.S. Pacific Territories and Freely Associated States (which currently include
the Commonwealth of the Northern Mariana Islands and Guam). WICHE
produces a regional “factbook” for higher education that contains “policy
indicators” (https://www.wiche.edu/pub/factbook). Similar to SREB’s, the
WICHE higher education factbook provides state-level data in the following
areas: demographics (including projections); student preparation, enrollment,
and completion; affordability; and finance.
The Midwest Higher Education Compact (MHEC) is an academic common
market that is composed of 12 states in the Midwest. MHEC, via its Interactive
Dashboard (https://www.mhec.org/policy-research/mhec-interactive-
dashboard), provides state-level data and key performance indicators for
higher education.
3.6 Summary
2 This is particularly the case with respect to the Finance (F) survey.
Many higher education data, including those made available via IPEDS, are
available to the public but also have limitations. The overview of data
sources provided in this chapter is by no means an exhaustive list of all
sources of data.
References
Alvarado, S. E., & An, B. P. (2015). Race, Friends, and College Readiness: Evidence from
the High School Longitudinal Study. Race and Social Problems, 7(2), 150–167.
https://doi.org/10.1007/s12552-015-9146-5
Azevedo, J. P. (2020). WBOPENDATA: Stata module to access World Bank databases.
In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s457234.html
Bastin, H., & Kalist, D. E. (2013). The Labor Market Returns to AACSB Accreditation.
Journal of Labor Research, 34(2), 170–179.
https://doi.org/10.1007/s12122-012-9155-8
Baum, S., & Ma, J. (2012). Trends in College Pricing, 2012. Trends in Higher Education
Series. (pp. 1–40). College Board Advocacy & Policy Center. https://files.eric.ed.gov/
fulltext/ED536571.pdf
Baum, S., & Payea, K. (2012). Trends in Student Aid, 2012. Trends in Higher Education
Series. (pp. 1–36). College Board Advocacy & Policy Center. https://files.eric.ed.gov/
fulltext/ED536570.pdf
Belasco, A. S., & Trivette, M. J. (2015). Aiming low: Estimating the scope and predictors
of postsecondary undermatch. The Journal of Higher Education, 86 (2), 233–263.
Bellas, M. L. (2001). Investment in higher education: Do labor market opportunities differ
by age of recent college graduates? Research in Higher Education, 42 (1), 1–25.
Chatterji, M. (1998). Tertiary education and economic growth. Regional Studies, 32 (4),
349–354.
Cohen-Vogel, L., Ingle, W. K., Levine, A. A., & Spence, M. (2008). The “Spread” of Merit-
Based College Aid: Politics, Policy Consortia, and Interstate Competition. Educational
Policy, 22 (3), 339–362. https://doi.org/10.1177/0895904807307059
Coupé, T. (2003). Science Is Golden: Academic R&D and University Patents. The Journal
of Technology Transfer, 28 (1), 31–46. https://doi.org/10.1023/A:1021626702728
Deming, D., & Dynarski, S. (2010). College aid. In P. B. Levine & D. J. Zimmerman
(Eds.), Targeting investments in children: Fighting poverty when resources are limited
(pp. 283–302). University of Chicago Press. https://www.nber.org/chapters/c11730.pdf
Doyle, W. R. (2006). Adoption of merit-based student grant programs: An event history
analysis. Educational Evaluation and Policy Analysis, 28 (3), 259–285.
Doyle, W. R. (2013). Playing the Numbers: State Funding for Higher Education: Situation
Normal? Change: The Magazine of Higher Learning, 45 (6), 58–61.
Engberg, M. E., & Gilbert, A. J. (2014). The Counseling Opportunity Structure:
Examining Correlates of Four-Year College-Going Rates. Research in Higher Education,
55 (3), 219–244. https://doi.org/10.1007/s11162-013-9309-4
English, D., & Umbach, P. D. (2016). Graduate school choice: An examination of individual
and institutional effects. The Review of Higher Education, 39 (2), 173–211.
Fanelli, D. (2010). Do Pressures to Publish Increase Scientists’ Bias? An Empirical Support
from US States Data. PLoS ONE, 5 (4). https://doi.org/10.1371/journal.pone.0010271
Fontenay, S. (2018). SDMXUSE: Stata module to import data from statistical agencies
using the SDMX standard. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s458231.html
George Mwangi, C. A., Cabrera, A. F., & Kurban, E. R. (2018). Connecting School and
Home: Examining Parental and School Involvement in Readiness for College Through
Multilevel SEM. Research in Higher Education.
https://doi.org/10.1007/s11162-018-9520-4
Gervais, M., & Ziebarth, N. L. (2019). Life After Debt: Postgraduation Consequences of
Federal Student Loans. Economic Inquiry, 57 (3), 1342–1366. https://doi.org/10.1111/
ecin.12763
Glennie, E. J., Dalton, B. W., & Knapp, L. G. (2015). The influence of precollege access
programs on postsecondary enrollment and persistence. Educational Policy, 29 (7), 963–
983.
Gonçalves, D. (2016). GETDATA: Stata module to import SDMX data from several
providers. In Statistical Software Components. Boston College Department of Eco-
nomics. https://ideas.repec.org/c/boc/bocode/s458093.html
Goodwin, R. N., Li, W., Broda, M., Johnson, H., & Schneider, B. (2016). Improving College
Enrollment of At-Risk Students at the School Level. Journal of Education for Students
Placed at Risk, 21 (3), 143–156. https://doi.org/10.1080/10824669.2016.1182027
Hammond, L., Baser, S., & Cassell, A. (2019, June 7). Community Col-
lege Governance Structures & State Appropriations for Student Financial Aid.
36th Annual SFARN Conference. http://pellinstitute.org/downloads/sfarn_2019-
Hammond_Baser_Cassell.pdf
Heller, D. E. (2001). The States and Public Higher Education Policy: Affordability, Access,
and Accountability. JHU Press.
Hemelt, S. W., & Marcotte, D. E. (2016). The changing landscape of tuition and enrollment
in American public higher education. RSF: The Russell Sage Foundation Journal of
the Social Sciences, 2 (1), 42–68.
Holmes, C. (2013). Has the expansion of higher education led to greater economic growth?
National Institute Economic Review, 224 (1), R29–R47.
Hudson, L., Aquilino, S., & Kienzl, G. (2005). Postsecondary Participation Rates by Sex
and Race/Ethnicity: 1974–2003. Issue Brief. NCES 2005-028. (NCES 2005–028; pp.
1–3). National Center for Education Statistics, Institute of Education Sciences, U.S.
Department of Education.
Jacobs, J. A., & Stoner-Eby, S. (1998). Adult Enrollment and Educational Attainment.
The Annals of the American Academy of Political and Social Science, 559, 91–108.
JSTOR.
Joy, L. (2003). Salaries of recent male and female college graduates: Educational and labor
market effects. ILR Review, 56 (4), 606–621.
Kim, D., & Nuñez, A.-M. (2013). Diversity, situated social contexts, and college enrollment:
Multilevel modeling to examine student, high school, and state influences. Journal of
Diversity in Higher Education, 6 (2), 84.
Kim, J. (2012). Welfare Reform and College Enrollment among Single Mothers. Social
Service Review, 86 (1), 69–91. https://doi.org/10.1086/664951
Knowles, S. (1997). Which level of schooling has the greatest economic impact on output?
Applied Economics Letters, 4 (3), 177–180. https://doi.org/10.1080/135048597355465
Kurban, E. R., & Cabrera, A. F. (2020). Building Readiness and Intention Towards STEM
Fields of Study: Using HSLS: 09 and SEM to Examine This Complex Process among
High School Students. The Journal of Higher Education, 91 (4), 1–31.
Lacy, T. A., & Tandberg, D. A. (2018). Data, Measures, Methods, and the Study of
the SHEEO. In D. A. Tandberg, A. Sponsler, R. W. Hanna, J. P. Guilbeau, & R. E.
Anderson (Eds.), The State Higher Education Executive Officer and the Public
Good: Developing New Leadership for Improved Policy, Practice, and Research (pp.
282–299). Teachers College Press.
Lee, K. A., Leon Jara Almonte, J., & Youn, M.-J. (2013). What to do next: An exploratory
study of the post-secondary decisions of American students. Higher Education, 66 (1),
1–16. https://doi.org/10.1007/s10734-012-9576-6
Lenth, C. S., Zaback, K. J., Carlson, A. M., & Bell, A. C. (2014). Public Financing of
Higher Education in the Western States: Changing Patterns in State Appropriations
and Tuition Revenues. In Public Policy Challenges Facing Higher Education in the
American West (pp. 107–142). Springer.
Longanecker, D. (2006). A tale of two pities. Change: The Magazine of Higher Learning,
38 (1), 4–25.
Millett, C. M. (2003). How undergraduate loan debt affects application and enrollment in
graduate or first professional school. The Journal of Higher Education, 74 (4), 386–427.
Mitchell, J. (2017, July 23). In reversal, colleges rein in tuition. The Wall
Street Journal. http://opportunityamericaonline.org/wp-content/uploads/2017/07/IN-
REVERSAL-COLLEGES-REIN-IN-TUITION.pdf
Mitchell, M., Leachman, M., & Masterson, K. (2016). Funding down, tuition up. Center on
Budget and Policy Priorities. https://www.cbpp.org/sites/default/files/atoms/files/5-
19-16sfp.pdf
Mokher, C. G., & McLendon, M. K. (2009). Uniting Secondary and Postsecondary
Education: An Event History Analysis of State Adoption of Dual Enrollment Policies.
American Journal of Education, 115 (2), 249–277. https://doi.org/10.1086/595668
Monaghan, D., & Jang, S. H. (2017). Major Payoffs: Postcollege Income, Graduate School,
and the Choice of “Risky” Undergraduate Majors. Sociological Perspectives, 60 (4), 722–
746. https://doi.org/10.1177/0731121416688445
Morgan, G. B., D’Amico, M. M., & Hodge, K. J. (2015). Major differences: Modeling
profiles of community college persisters in career clusters. Quality & Quantity, 49 (1),
1–20. https://doi.org/10.1007/s11135-013-9970-x
Nienhusser, H. K., & Oshio, T. (2017). High School Students’ Accuracy in Estimating
the Cost of College: A Proposed Methodological Approach and Differences Among
Racial/Ethnic Groups and College Financial-Related Factors. Research in Higher
Education, 58 (7), 723–745. https://doi.org/10.1007/s11162-017-9447-1
Perna, L. W. (2004). Understanding the decision to enroll in graduate school: Sex and
racial/ethnic group differences. The Journal of Higher Education, 75(5), 487–527.
Pool, R., & Vander Putten, J. (2015). The No Child Left Behind Generation Goes to
College: A Longitudinal Comparative Analysis of the Impact of NCLB on the Culture
of College Readiness (SSRN Scholarly Paper ID 2593924). Social Science Research
Network. https://doi.org/10.2139/ssrn.2593924
Rowan-Kenyon, H. T., Blanchard, R. D., Reed, B. D., & Swan, A. K. (2016). Predictors of
Low- SES Student Persistence from the First to Second Year of College. In Paradoxes
of the Democratization of Higher Education (Vol. 22, pp. 97–125). Emerald Group
Publishing Limited. https://doi.org/10.1108/S0196-115220160000022004
Savas, G. (2016). Gender and race differences in American college enrollment: Evidence
from the Education Longitudinal Study of 2002. American Journal of Educational
Research, 4 (1), 64–75.
Schneider, B., & Saw, G. (2016). Racial and Ethnic Gaps in Postsecondary Aspirations
and Enrollment. RSF: The Russell Sage Foundation Journal of the Social Sciences,
2 (5), 58–82. JSTOR. https://doi.org/10.7758/rsf.2016.2.5.04
Schudde, L. (2016). The Interplay of Family Income, Campus Residency, and Student
Retention (What Practitioners Should Know about Cultural Mismatch). Journal of
College and University Student Housing, 43 (1), 10–27.
Schudde, L. T. (2011). The causal effect of campus residency on college student retention.
The Review of Higher Education, 34 (4), 581–610.
Scott-Clayton, J., & Li, J. (2016). Black-white disparity in student loan debt more than
triples after graduation. Economic Studies, 2 (3), 1–9.
4.1 Introduction
Data used in Stata may be generated from surveys created and entered by
the analyst or imported from an external source. Data produced by the
analyst from original surveys are primary data, while data originally
compiled by another party are secondary data. In the sections below, we
discuss both.
If we are entering data from a very short survey, then we use the input com-
mand. The example below shows how data for three variables (variable_x,
variable_y, and variable_z) can be entered in Stata by typing the following:
input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end
To see the data that were entered above, type:
list
which would show the following:
. list
+--------------------------------+
| variab~x variab~y variab~z |
|--------------------------------|
1. | 31 57 18 |
2. | 25 68 12 |
3. | 35 60 13 |
4. | 38 59 17 |
5. | 30 59 15 |
+--------------------------------+
To save the above data, type:
save "Example 1.0.dta"
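Note that if a file with that name already exists, Stata refuses to overwrite it unless the replace option is added:

```stata
* Overwrite an existing file of the same name
save "Example 1.0.dta", replace
```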
4.2 Stata Dataset Creation 35
To use the Stata editor to enter additional data in Example 1.0, type:
edit
Importing data from a data management (e.g., dBase) file or a spreadsheet
(e.g., Excel) file would be a more efficient way to enter data in Stata. There
are several ways we can do this. We can import data from comma delimited
Excel files (csv). For example, the data above may be imported from an Excel
comma delimited file (csv) by typing in the following:
insheet using "Example 1.csv", comma
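As an aside, in Stata 13 and later the import delimited command supersedes insheet for reading delimited text files. A sketch using the same example file:

```stata
* Read the comma-delimited file; the clear option drops any data in memory
import delimited using "Example 1.csv", clear
list
```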
The use of primary data requires careful planning and a well-developed
data collection process. Many of these processes involve conducting computer-
assisted personal interviews (CAPI). If we need to collect data, there are
several Stata-based tools available to assist in such an effort. One such tool
is a Stata-user created package of Stata commands, iefieldkit, developed
by the World Bank’s Development Research Group Impact Evaluations team
(DIME). The most recent version of the package can be installed in Stata by
typing in "ssc install iefieldkit, replace". Information on iefieldkit
can be found at the website address: https://dimewiki.worldbank.org/wiki/
Iefieldkit. Once installed, iefieldkit allows for the automatic creation of
Excel files containing the collected data.
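For example, the package can be installed and its documentation opened from the Stata command line as follows:

```stata
* Install (or update) iefieldkit from the SSC archive
ssc install iefieldkit, replace

* Open the package's help file, which lists its individual commands
help iefieldkit
```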
5. Because the total number (N ) of cases is 51 (50 states plus the District of
Columbia), the state id numbers should be entered, ranging from 1 to 51
to reflect N. (If the analyst chooses to delete one or more cases, then the
range of the state id would reflect the modified N.)
6. All numbers should be formatted as numeric with the appropriate decimal
places and not as text characters.
7. Any characters that are not alpha-numeric should be removed from all
cells.
8. After steps 1–7, the file should be saved in an Excel format in the “working”
directory (as discussed in the previous chapter).
9. Open Stata and change to the same “working directory” as in step 8.
For the example used in this chapter, the Stata command to change
to the “working directory” which contains the Excel file is as follows:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
10. The entire Excel file can be imported into Stata. Be sure to indicate,
as an option, that the first row contains the variable names. Using the
same file from above, the Stata command is:

import excel "tabn302.50 - reformatted.xls", firstrow
11. Open the Stata Data editor either in edit or browse mode to look at the
imported data.
In the Stata Data editor, you should see the imported data, as shown in
Fig. 4.1. Take note of the column with the State names, which are in
red text. This indicates State is a string variable. We may want to include
Federal Information Processing Standard Publication (FIPS) codes and the
abbreviations of state names in a state-level dataset. Using the user-created
Stata program “statastates”, the FIPS codes and state abbreviations can be
easily added to any state-level dataset that includes the state name. (In our
example from above, the state name is “States”.) This is demonstrated in the
steps below:
1. ssc install statastates
2. statastates, name(<State name>)
3. We can delete the variable _merge, which was created when we added the
FIPS codes and state abbreviations. This is done by simply typing
drop _merge
We may also want to move the FIPS codes and state abbreviations
somewhere near the front of our dataset. This can be accomplished by typing
the following Stata command:
order state_abbrev state_fips, before(state)
The dataset should look like Fig. 4.2:
We can then save this file with a new more descriptive name, such as “US
high school graduates in 2012 enrolled in PSE, by state”, in a working direc-
tory containing Stata files (e.g., C:\Users\Marvin\Dropbox\Manuscripts\
Book\Chapter 4\Stata files). After changing to the working directory and
38 4 Creating Datasets and Managing Data
reopening the new Stata file, we can show a description of our dataset by
typing:
describe
The output is the following:
. describe

Contains data from US high school graduates in 2012 enrolled in
PSE, by state.dta
obs: 51
vars: 11
----------------------------------------------------------------
              storage  display   value
variable name   type   format    label    variable label
----------------------------------------------------------------
Stateid         byte   %10.0gc            Stateid
state_abbrev    str2   %9s
state_fips      byte   %8.0g
state           str20  %20s               state
total           long   %10.0g             total
public          long   %10.0g             public
private         int    %10.0g             private
anystate        long   %10.0g             anystate
homestate       long   %10.0g             homestate
anyrate         double %10.0g             anyrate
homerate        double %10.0g             homerate
----------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Take note that none of the variables have descriptive labels. To create
labels, based on the column names in the Excel file, we use the label
variable (lab var) command for each variable. Here is an example:
lab var Stateid "Stateid"
lab var state_abbrev "State abbreviation"
lab var state_fips "FIPS code"
lab var state "State name"
lab var total "Total number of graduates from HS located in the state"
lab var public "Number of graduates from public HS located in the state"
lab var private "Number of graduates from private HS located in the state"
lab var anystate "Number of first-time freshmen graduating
A word of caution when using secondary data such as the NCES Excel
files: many of those files contain non-numeric characters, such as commas
and dollar signs, which will yield string variables in Stata. Before we copy
and paste data from those types of files, we have to properly reformat the
cells so they contain only numeric characters. Using the time-series data,
we can create graphs (which we will discuss in the next chapter).
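If such characters do slip through and a variable is imported as a string, the destring command can often repair it inside Stata rather than in Excel (a minimal sketch; the variable name revenue is illustrative):

```stata
* Convert a string variable containing values such as "1,234" or "$5,678"
* to numeric, ignoring commas and dollar signs during the conversion
destring revenue, replace ignore(", $")
```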
In many instances, cross-sectional time-series or panel data are used
to conduct higher education policy research. In some cases, analysts have
direct access to data in a panel format, such as some tables that are
published in the Digest of Education Statistics. Consistent with the examples
above and the use of panel data, we download the Excel version of Table
304.70 from the 2018 version of The Digest.2 Because the data on total
fall enrollment of undergraduate students in degree-granting postsecondary
education institutions by state are for selected years 2000 through 2017, we
can characterize those data as panel in nature. Unlike the above example
of time-series data, none of the panel data in this format can be easily
copied and pasted into the Stata Data Editor. Prior to copying or importing
them into Stata, the data have to be properly formatted. The easiest way to
reformat data is in Excel worksheets, containing data on each of the variables
to be subsequently analyzed in Stata. For example, some of the data on
undergraduate students by state from Table 304.70 of the 2018 version of
2 …of enrollment and state or jurisdiction: Selected years, 2000 through 2017. The table
can be found at: https://nces.ed.gov/programs/digest/d18/tables/dt18_304.70.asp.
As a result of steps 1–10, our new worksheet should look like this (Fig.
4.5):
As we can see, this worksheet allows us to view and manage the data
that we are interested in and if necessary, access the source of that data
in the other worksheet (i.e., Digest 2018 Table 304.70). We can import this
worksheet from this Excel workbook into Stata, via the following syntax (all
on one line):
import excel "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files\College enrollment data.xls", sheet("Ugrad") firstrow
The result is as follows:
Take note that the option sheet("Ugrad") refers to the specific worksheet
we would like to import. The option firstrow tells Stata that we would like
to designate the first row of the worksheet as variable names.
. import excel "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4
\Excel files\College enrollment data.xls", sheet("Ugrad") firstrow
(8 vars, 50 obs)
We now have 300 observations and four variables, including a year variable.
This new dataset, in a long format, now has to be “declared” a panel dataset
by typing:
xtset id year, yearly
The result is:
. xtset id year, yearly
panel variable: id (strongly balanced)
time variable: year, 2000 to 2017, but with gaps
delta: 1 year
The example above is a strongly balanced panel dataset with gaps in the
years. Panel datasets can be strongly balanced, strongly balanced with gaps,
weakly balanced, or unbalanced. In a panel dataset, the total number (N ) of
observations equals the number of units (e.g., states or institutions) or
panels (p) multiplied by the number of time points (t) (e.g., days, weeks,
months, or years); that is, N = p × t. A strongly balanced dataset is
one in which all the panels have been observed for the same number of time
points. Panel datasets in which all the panels have been observed for the same
number of time points but have gaps in time points are known as strongly
balanced with gaps in the years. A weakly balanced dataset exists if each
panel has the same number of observations but not the same time points. An
unbalanced dataset is one in which the panels do not all have the same
number of observations.
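These distinctions can be illustrated with a small artificial panel (the data below are invented for demonstration):

```stata
* Three units (id) each observed in 2000, 2005, and 2010: the same
* number of time points per panel, with gaps between the years
clear
input id year y
1 2000 10
1 2005 12
1 2010 14
2 2000 20
2 2005 21
2 2010 23
3 2000 31
3 2005 33
3 2010 36
end
xtset id year, yearly   // reported as strongly balanced, but with gaps
xtdes                   // the pattern (e.g., 1....1....1) makes the gaps visible
```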
We join the two datasets, based on id, into one that contains two variables
we can analyze: FirsTim and HSGrad. We do this by specifying the dataset
("First-Time - Long.dta") that contains the two variables we would like to
add to the dataset that is currently open. We carry out this procedure by
typing the following:
joinby id year using "First-Time - Long.dta", unmatched(none)
Because the file contains the same yearly data on first-time college students
as the data on public high school students, we do not have to specify year
as a variable. But as shown in the next example below, it is good practice
to include that variable as well. Given our example, our new Stata file looks
like this (Fig. 4.7):
In the data editor, we can see the two variables (HSGrad and
FirsTim) that we can later analyze.
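The unmatched() option of joinby determines how observations appearing in only one of the two files are handled; besides none, the documented choices include both, master, and using. A sketch that keeps all observations and tabulates the match results (file name as in the example above):

```stata
* Keep observations from both files; joinby then creates _merge,
* which records where each observation originated
joinby id year using "First-Time - Long.dta", unmatched(both)
tab _merge
```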
If we have data for additional variables in other worksheets located in the
same working directory (e.g., “C:\Users\Marvin\Dropbox\Manuscripts
\Book\Chapter 4\Excel files”), we would simply repeat the steps above
referring to the specific Excel files/worksheets that we want to import and
the Stata files that we want to reshape from wide to long and ultimately
join to our current file in memory.
We could also join two or more Stata files that were reshaped from wide
to long and have the variables State, id, and year. For example, if in our
current directory, we have a file that contains state-level undergraduate need-
based financial aid (Undergraduate state financial aid - need.dta) and
Fig. 4.7 Stata file based on Digest 2018 Table 219.20 (Excel)
another that has merit-based financial aid (Undergraduate state financial
aid - merit.dta) data, we could add the data from those two files to
our long-format panel dataset on undergraduate college enrollment (College
enrollment data.dta) by executing the following commands:
use "Undergraduate enrollment data - Long.dta", clear
joinby id year using "Undergraduate state financial aid - need"
joinby id year using "Undergraduate state financial aid - merit"
xtset id year, yearly
save "Example - 4.1.dta"
Notice that in the joinby syntax, we did not have to include the option
unmatched(none). We also did not have to include the extension .dta as
part of the names of the Stata files. We did, however, have to declare our
dataset as panel data and save it with a new file name (e.g., Example - 4.1).
We can see in our Stata data editor, we now have six variables in our new
panel dataset (Fig. 4.8).
After closing the Stata data editor, we can see how our new panel dataset
is structured, by typing the command xtdescribe or the shortened version
(xtdes):
. xtdes

    id:  1, 2, ..., 50                                  n =      50
  year:  2000, 2010, ..., 2016                          T =       5
         Delta(year) = 1 year
         Span(year) = 17 periods
         (id*year uniquely identifies each observation)

Distribution of T_i:   min    5%    25%    50%    75%    95%   max
                         5     5     5      5      5      5     5

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-------------------
        50   100.00  100.00 |  1.........1.1..11
 ---------------------------+-------------------
        50   100.00         |  X.........X.X..XX

Fig. 4.8 Modified Stata file based on Digest 2018 Table 219.20 (Excel)
We can see that like our original dataset of only undergraduate college
enrollment, our new appended panel dataset spans 17 years, has 250
observations (50 states × 5 years), is strongly balanced, but with gaps in
the years. This structure is acceptable when conducting basic data analysis
such as descriptive statistics, and running some regression models (which we
will cover in other chapters). But as we shall see in the other chapters, a
strongly balanced panel data set with no gaps in the time periods is required
to conduct more advanced statistical analyses.
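When a later analysis requires a strongly balanced panel with no gaps, one option is to create rows for the missing years explicitly. After xtset, Stata's tsfill command inserts the missing time points within (or, with the full option, across) panels as observations with missing values, which can then be filled or handled as the analysis requires (a sketch assuming the dataset is already declared as yearly panel data):

```stata
* Insert observations for the years missing from each panel; the added
* rows contain missing values for all variables other than id and year
xtset id year, yearly
tsfill, full
xtdes   // the observation pattern should now show no gaps
```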
4.3 Summary
4.4 Appendix
*Chapter 4 Syntax
*Primary data
*example below shows how data for three variables (variable_x, variable_y, ///
and variable_z) can be entered in Stata
input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end
*To use the Stata editor to enter additional data in Example 1.0, type:
edit
*the data above may be imported from an Excel comma delimited file (csv) ///
by typing in the following:
insheet using "Example 1.csv", comma
*Secondary data
*The Stata command to change to the “working directory” which contains ///
the Excel file is as follows:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
*Using the same file from above, the Stata command is:
import excel "tabn302.50 - reformatted.xls", firstrow
*Using the user-created Stata program “statastates”, the FIPS codes ///
and state abbreviations can be easily added to any state-level data ///
set that includes the state name. (In our example from above, the ///
state name is “States”.) This is demonstrated in the two steps below:
ssc install statastates
statastates, name(<State name>)
*We can delete the variable _merge, which was created when we added ///
the FIPS codes and state abbreviations. This is done by simply typing:
drop _merge
*We may also want to move the FIPS codes and state abbreviations ///
somewhere near the front of our dataset. This can be accomplished by ///
typing the following Stata command:
order state_abbrev state_fips, before(state)
*To create labels, based on the column names in the Excel file, ///
we use the label variable (lab var) command for each variable. ///
Here is an example:
lab var Stateid "Stateid"
lab var state_abbrev "State abbreviation"
lab var state_fips "FIPS code"
lab var state "State name"
lab var total "Total number of graduates from HS located in the state"
lab var public "Number of graduates from public HS located in the state"
lab var private "Number of graduates from private HS located in the state"
lab var anystate ///
*We relocate the year variable to the beginning of the dataset by typing:
order year, first
*import worksheet from Excel workbook into Stata, via the following syntax
clear all
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”
import excel "College enrollment data.xls", sheet("Ugrad") firstrow
*convert the data from a “wide” to a “long” format using the reshape ///
or the much faster user-created sreshape (Simons 2016)
*install sreshape
net install dm0090.pkg, replace
sreshape long Ugrad, i(id) j(year)
*change our working directory to where we want to save our Stata ///
file and save it, by typing:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
save "HSGrad - Wide.dta"
*reformat our file from wide to long, declare it a panel data set, ///
and save it to a new file
sreshape long HSGrad, i(id) j(year)
xtset id year, yearly
save "HSGrad - Long.dta"
*join the two datasets, based on id, into one dataset that would ///
contain two variables
joinby id year using "First-Time - Long.dta", unmatched(none)
*join two or more Stata files that were reshaped from wide to long ///
and have the variables State, id, and year.
use "Undergraduate enrollment data - Long.dta", clear
joinby id year using "Undergraduate state financial aid - need"
joinby id year using "Undergraduate state financial aid - merit"
xtset id year, yearly
save "Example - 4.1.dta"
*see how our new panel dataset is structured, by typing the command ///
xtdescribe or the shortened version:
xtdes
*end
References
Simons, K. L. (2016). A sparser, speedier reshape. The Stata Journal, 16 (3), 632–649.
Chapter 5
Getting to Know Thy Data
5.1 Introduction
Contains data
obs: 56
vars: 2
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
year float %ty
totalpct float %8.0g
---------------------------------------------------------------
Sorted by: year
We see that “year” and “totalpct” are stored as a floating or float type. By
default, Stata stores all numbers as floats, also known as single-precision or 4-
byte reals (StataCorp 2019). Compared to the integer storage type, the float
storage type uses more memory.1 While it may be necessary for the “totalpct”
variable, this level of precision is not necessary for the year variable, which
is an integer. So we can reduce the amount of memory required by float
by compressing the data using the compress command.2 The use of this
command automatically changes the storage type for the year variable from
float to integer (int). We see from the output below that we save 112 bytes.
. compress
variable year was float now int
(112 bytes saved)
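A quick arithmetic check, assuming the documented storage sizes (a float occupies 4 bytes and an int occupies 2 bytes), confirms the savings over the 56 observations:

```latex
56 \times (4\ \text{bytes} - 2\ \text{bytes}) = 112\ \text{bytes}
```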
. describe
Contains data
1 For a complete description of the storage types, see page 89 of Stata User’s Guide Release
16.
2 For more information on compress, see pages 77–78 of the Stata User’s Guide Release
16.
obs: 56
vars: 2
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
year int %ty
totalpct float %8.0g
---------------------------------------------------------------
Sorted by: year
Note: Dataset has changed since last saved.
Therefore, it is a good practice to invoke the compress command,
particularly using large datasets with numeric variables that are actually
integers. As an example, we will use an enhanced version of one of our panel
data files that we created in the previous chapter and saved to a new file name,
Example 5.0. With the exception of state expenditures on financial aid for
undergraduates, which are measured in millions of dollars, this file contains
the same data as the file we used in the Chap. 4 example (Example 4.1). Most
likely, we would have either imported these data on state expenditures on
financial aid for undergraduates from National Association of State Student
Grant and Aid Programs (NASSGAP) Excel files or copied and pasted the
data, or manually entered the data from NASSGAP pdf files into a Stata file.
Because it has implications for how our variables are stored, it is important
that we are aware of whether or not the state financial aid data in our dataset
are also measured in millions. If they are measured in millions, then those
variables are not stored as integers. We can verify this by typing the describe
command.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.0.dta"
. describe
Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 5\Stata files\Example 5.0.dta
obs: 250
vars: 6 18 Jul 2020 15:57
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
id float %10.0gc id
year int %ty
State str20 %20s State
Ugrad long %10.0g Undergraduate enrollment
need float %9.0g State spending on need-
3 If we are using Stata/IC, the maximum number of variables is 798. If we are using
Stata/MP, the maximum number of variables is 65,532. In this example, we are using
Stata/SE, which has a maximum of 10,998 variables.
Memory usage
used allocated
------------------------------------------------------------
data 1,026,681,549 1,241,513,984
strLs 0 0
------------------------------------------------------------
data & strLs 1,026,681,549 1,241,513,984
------------------------------------------------------------
5 The egen command, which is short for extensions to generate, can be employed to create
variables that also require an additional function. For a detailed explanation of the egen
command, see the pages 203–223 of the Stata User’s Guide Release 16.
Result # of obs.
-----------------------------------------
not matched 18
from master 18
from using 0
matched 450
-----------------------------------------
This creates two additional variables, state_abbrev and state_fips. The
first is the two-character state abbreviation and the second is the state FIPS
code. To create a variable, stateid, based on state names, we use egen.
egen stateid = group(State)
We use the compress command to save computer memory.
. compress
variable stateid was float now byte
variable State was str20 now str14
(4,212 bytes saved)
After compressing the data, we use stateid and FY to declare the dataset
to be a panel.
We use the following syntax, xtset stateid FY, yearly.
. xtset stateid FY, yearly
panel variable: stateid (strongly balanced)
time variable: FY, 2010 to 2018
delta: 1 year
We see the dataset is strongly balanced with no gaps in the time periods.
The data are saved to a file with a new name.
save "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files\Example 5.2.dta"
Much of the data that we use for higher education policy analysis, including
secondary data, may be missing. We need a sense not only of how much of
the data is missing but also of the pattern of "missingness". Using selected
variables from the public-use version of the High School Longitudinal Study
of 2009 (HSLS:09) that we saved in a Stata file (i.e., Example 5.3), we
demonstrate how to identify missing data. In the next section (5.4), we show
how to analyze missing data.
Like many other NCES longitudinal datasets, the HSLS:09 contains many
variables that are labeled with codes that indicate missing data. In some
instances, missing data are coded as −9. A good way to determine if and
how missing data are coded in datasets from secondary data sources is to
use the codebook command in Stata. In this particular example, we focus
on one variable, S3CLGPELL, which indicates whether in November 2013
a high school student was offered a scholarship or grant for the 2013–2014
school year.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.3.dta"
codebook S3CLGPELL
From the last command, we see the following output:
. codebook S3CLGPELL
---------------------------------------------------------------
S3CLGPELL S3 D07C Offered scholarship/grant to attend
Nov 1 2013 school for 2013-2014 year
---------------------------------------------------------------
_all as a part of the syntax when using mvdecode. The result of the latter
is:
. mvdecode _all, mv(-9=.)
STU_ID: string variable ignored
X1SEX: 6 missing values generated
X1RACE: 1006 missing values generated
X4ATPRLVLA: 136 missing values generated
We then save this file to a new version of itself (Example 5.4) and are
ready to do some missing data analysis, which is shown below.
We can also see that missing values of the variables X1SEX and X4ATPRLVLA
are less than 1%, while we have complete data for the variables X1SES and
X1SESQ5. Notice that none of our student identification numbers (STU_ID),
which is a string variable, are missing.
We can also use the Stata command misstable tree, with various options,
to show the pattern of “missingness” in the data. The output for this
command is shown below:
. misstable tree
Nested pattern of missing values
X1RACE S3CLGPELL X4ATPRLVLA X1SEX
-------------------------------------------
4% <1% 0% 0%
0
<1 0
<1
4 <1 0
<1
4 <1
4
96 2 <1 0
<1
2 0
2
94 <1 0
<1
93 <1
93
-------------------------------------------
(percent missing listed first)
We can also use Stata command misstable patterns to produce the
following output:
. misstable patterns
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3 4
------------+-------------
93% | 1 1 1 1
|
4 | 1 1 1 0
2 | 1 1 0 1
<1 | 1 0 1 1
<1 | 1 1 0 0
<1 | 0 1 1 0
<1 | 1 0 0 1
<1 | 1 0 1 0
<1 | 0 1 1 1
------------+-------------
100% |
Variables are (1) X1SEX (2) X4ATPRLVLA (3) S3CLGPELL (4) X1RACE
If we are using panel data, we can also conduct missing data analysis
employing the user-created Stata program xtmis (Nguyen 2008). The
program must be installed by typing: ssc install xtmis. For xtmis to
work, another Stata program, tomata, must also be installed by typing:
ssc install tomata. The xtmis program will produce a report of the
number and percent of missing and non-missing values for each variable
in groups (e.g., states) indicated. Suppose we downloaded IPEDS data on
the amount of grants and scholarships awarded to low-income students (i.e.,
from families with annual incomes of $30,000 and lower) by private higher
education four-year institutions for the years 2010 to 2018 (Example 5.5).
The file has been declared a panel dataset based on the variable “unitid”
(the IPEDS code) and year. We need to determine the extent to which these
institutions did not provide data on the amount of grants and scholarships
Variable: grantlow
Group by | Obs Missing Feq.Missings NonMiss Feq.NonMiss
-------------------+---------------------------------------------------------
456348 | 9045 2234 24.698729 6811 75.301271
367909 | 112 37 33.035714 75 66.964286
445072 | 252 95 37.698413 157 62.301587
438601 | 48 17 35.416667 31 64.583333
220941 | 30 14 46.666667 16 53.333333
177162 | 84 47 55.952381 37 44.047619
164571 | 15 12 80 3 20
109013 | 6 6 100 0 0
181011 | 4 4 100 0 0
-------------------+---------------------------------------------------------
| 9596 2466 25.698208 7130 74.301792
We can see that about 26% of all observations have missing values for
the variable of interest. It appears that one institution in particular has a
substantial amount of missing data on the amount of grants and scholarships
awarded to low-income students. This may warrant dropping that institution
from any further analysis of the data. In addition to the procedures in Stata
discussed above to determine if and how data are missing, there is a whole
suite of utilities embedded in the user-created Stata missings command.6
We can use the missings command to examine missing data by a categorical
variable, such as income group (e.g., quintiles). Using data extracted from
the public-use version of the HSLS:09, we can show the patterns of missing
data by student income level. First, install the most recent version of missings
(net install dm0085_1.pkg, replace). Then examine missing data by SES
quintiles.
. bysort X1SESQ5 : missings table
-----------------------------------------------------------------------------
-> X1SESQ5 = Unit non
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,023 49.68 49.68
1 | 1,005 48.81 98.49
2 | 31 1.51 100.00
------------+-----------------------------------
Total | 2,059 100.00
-----------------------------------------------------------------------------
-> X1SESQ5 = First qu
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,370 98.14 98.14
1 | 64 1.86 100.00
------------+-----------------------------------
Total | 3,434 100.00
-----------------------------------------------------------------------------
-> X1SESQ5 = Second q
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,632 98.03 98.03
1 | 73 1.97 100.00
------------+-----------------------------------
Total | 3,705 100.00
-----------------------------------------------------------------------------
-> X1SESQ5 = Third qu
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
-----------------------------------------------------------------------------
-> X1SESQ5 = Fourth q
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 4,431 97.32 97.32
1 | 120 2.64 99.96
2 | 2 0.04 100.00
------------+-----------------------------------
Total | 4,553 100.00
-----------------------------------------------------------------------------
-> X1SESQ5 = Fifth qu
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 5,344 96.83 96.83
1 | 174 3.15 99.98
2 | 1 0.02 100.00
------------+-----------------------------------
Total | 5,519 100.00
-----------------------------------------------------------------------------
-> X1RACE = Amer. In
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 159 96.36 96.36
1 | 6 3.64 100.00
------------+-----------------------------------
Total | 165 100.00
-----------------------------------------------------------------------------
-> X1RACE = Asian, n
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,894 97.03 97.03
1 | 57 2.92 99.95
2 | 1 0.05 100.00
------------+-----------------------------------
Total | 1,952 100.00
-----------------------------------------------------------------------------
-> X1RACE = Black/Af
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,383 97.27 97.27
1 | 66 2.69 99.96
2 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,450 100.00
-----------------------------------------------------------------------------
-> X1RACE = Hispanic
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 407 96.45 96.45
1 | 15 3.55 100.00
------------+-----------------------------------
Total | 422 100.00
-----------------------------------------------------------------------------
-> X1RACE = Hispanic
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,302 97.84 97.84
1 | 73 2.16 100.00
------------+-----------------------------------
Total | 3,375 100.00
-----------------------------------------------------------------------------
-> X1RACE = More tha
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,904 98.09 98.09
1 | 37 1.91 100.00
------------+-----------------------------------
Total | 1,941 100.00
-----------------------------------------------------------------------------
-> X1RACE = Native H
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 105 95.45 95.45
1 | 5 4.55 100.00
------------+-----------------------------------
Total | 110 100.00
-----------------------------------------------------------------------------
-> X1RACE = White, n
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 11,776 97.47 97.47
-----------------------------------------------------------------------------
-> X1RACE = .
# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
1 | 975 96.92 96.92
2 | 31 3.08 100.00
------------+-----------------------------------
Total | 1,006 100.00
From the output above, it appears that missing data are more prevalent
among non-whites.
Because the p-value is less than 0.05 in the above output, missing data in the
two variables (S3CLGPELL and P1TUITION) are not MCAR.
We can conduct the test with unequal variances.
. mcartest S3CLGPELL P1TUITION, unequal
note: 32 observations omitted from EM estimation because of all
imputation variables missing
Expectation-maximization
estimation Number obs = 22465
Number missing = 1772
Number patterns = 3
Prior: uniform Obs per pattern: min = 404
avg = 7488.333
max = 20693
------------------------------------
| S3CLGPELL P1TUITION
-------------+----------------------
Coef |
1b.X1RACE | 0 0
2.X1RACE | 2.566391 .5038324
3.X1RACE | 1.102913 .590291
4.X1RACE | -.3349718 -1.390813
5.X1RACE | .9184874 .2190636
6.X1RACE | 1.246321 .9427008
7.X1RACE | .4102142 .2168802
8.X1RACE | 1.798048 1.152957
_cons | -3.909168 -6.200928
-------------+----------------------
Sigma |
S3CLGPELL | 20.58743 4.445989
P1TUITION | 4.445989 11.75155
------------------------------------
There are at least two implications of data that are MCAR. First, when the
data are not MCAR, it is probably not a good idea to simply delete the
observations with missing values, regardless of whether the variances are
equal. Second, statistical methods that assume no missing data are valid
when missing data are MCAR. In the next few chapters, some of those
statistical methods will be discussed.
5.5 Summary
5.6 Appendix
*Chapter 5 Syntax
*use time series data from Chap. 4.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "Percent of US high school graduates in PSE, 1960 to 2016.dta"
*reduce the amount of memory required by float by compressing the data
compress
*recast the id variable to integer storage
recast int id
describe
*save
save "Example 5.0.dta", replace
*clear all
*using a large amount of data from secondary data sources such as the ///
National Center for Education Statistics' (NCES) ///
public-use High School Longitudinal Study of 2009 (HSLS:09) student dataset
*close dataset
clear all
*We use the list command to take a quick look at the data, particularly ///
with respect to FY 2010. We will also make the command ///
conditional by using
list if FY==2010
*Because we want only states in our dataset, we drop all observations for ///
the U.S. total and Washington DC.
drop if State=="US"
drop if State=="Washington DC"
*we employ the user-created statastates (Schpero 2018) program to create ///
fips codes and other state identifiers; include the nogenerate option to ///
prevent the generation of the variable _merge
statastates, name(State) nogenerate
*Using selected variables from the public-use version of the HSLS:09 ///
that we saved in a Stata file (i.e., Example 5.3)
use "Example 5.3.dta"
*determine if and how missing data are coded for the variable S3CLGPELL
codebook S3CLGPELL
*produce a table with the number of missing values, total number of cases, ///
and percent missing for each variable in our file.
mdesc
*use the Stata command misstable tree, with various options, to show the ///
pattern of “missingness” in the data
misstable tree
*use the Stata command tostring to create a string variable (unitid_s), ///
based on the numeric IPEDS variable (unitid). Then we invoke xtmis.
tostring unitid, generate(unitid_s)
xtmis grantlow, id(unitid_s)
*set maximum variables to 10,000 and open a large dataset - the ///
public-use version of the HSLS:09
set maxvar 10000
use "HSLS09.dta"
*exit Stata
exit
*end
References
Cox, N. J. (2015). Speaking Stata: A set of utilities for managing missing values. The Stata
Journal, 15(4), 1174–1185.
Li, C. (2013). Little's test of missing completely at random. The Stata Journal, 13(4),
795–809. https://doi.org/10.1177/1536867X1301300407
Little, R. J. (1988). A test of missing completely at random for multivariate data with
missing values. Journal of the American Statistical Association,83 (404), 1198–1202.
Medeiros, R. A., & Blanchette, D. (2011). MDESC: Stata module to tabulate prevalence
of missing values. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s457318.html
Nguyen, M. C. (2008). XTMIS: Stata module to report missing observations for each
variable in xt data. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s456945.html
Schpero, W. L. (2018). STATASTATES: Stata module to add US state identifiers to
dataset. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s458205.html
StataCorp. (2019). Stata User’s Guide Release 16. Stata Press.
Chapter 6
Using Descriptive Statistics
and Graphs
6.1 Introduction
We commonly use the arithmetic mean or average and the median to provide
basic information to policymakers and other data users. The average is
reflected in the formula below as:
A = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (6.1)
where A is the average, n is the number of terms (e.g., items, cases, etc.,
being averaged), and x_i is the value of each individual term in the list of
terms being averaged. Using cross-sectional data introduced in Chap. 4 and
the Stata command, ameans, we can easily demonstrate how to compute
the arithmetic means for both public and private high school graduates in
2012 who enrolled in post-secondary education (PSE) institutions by state
and the District of Columbia (DC).
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "US high school graduates in 2012 enrolled in PSE, by state.dta"
ameans public private
In the output above, we can see the geometric and harmonic means in
addition to the arithmetic means. (The output also includes the number
of observations and the 95% confidence intervals, which we will discuss
later.) While interesting, the geometric and harmonic means are almost never
provided to policymakers and other data users.1 So if we wanted to generate
only the arithmetic mean, we could use the Stata command, mean, which
would result in the following output.
. mean public private
Mean estimation Number of obs = 51
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
public | 61748.73 10371.81 40916.34 82581.11
private | 6054.314 950.7169 4144.743 7963.885
--------------------------------------------------------------
In addition to the mean, we also see the standard errors (Std. Err.) and the
95% confidence intervals (95% Conf. Interval), both of which we will ignore
for now. From this output, we see the average (mean) number of public high
school graduates who enrolled in PSE institutions across all 50 states and DC
during 2012 was 61,749. The average number of private high school graduates
who enrolled in PSE institutions was 6054.
If we are interested in the other measures of central tendency, such as the
median, we can use the Stata command summarize, detail or its shortened
form, sum, detail.
. sum, detail
Stateid
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 3 2
10% 6 3 Obs 51
25% 13 4 Sum of Wgt. 51
50% 26 Mean 26
Largest Std. Dev. 14.86607
75% 39 48
90% 46 49 Variance 221
95% 49 50 Skewness 0
99% 51 51 Kurtosis 1.799077
State abbreviation
-------------------------------------------------------------
1 The geometric mean multiplies rather than sums values, then takes the nth root rather
than dividing by n. The harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals of the numbers in a dataset.
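The footnote's verbal definitions correspond to the standard formulas for n positive values x_1, …, x_n (a sketch of the textbook definitions, not Stata output):

```latex
G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}, \qquad
H = \frac{n}{\sum_{i=1}^{n} 1/x_i}
```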
no observations
FIPS code
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 4 2
10% 8 4 Obs 51
25% 16 5 Sum of Wgt. 51
State name
-------------------------------------------------------------
no observations
-------------------------------------------------------------
Percentiles Smallest
1% 450 450
5% 2413 2040
10% 4443 2413 Obs 51
25% 6179 2426 Sum of Wgt. 51
The output above shows that FTE students are more dispersed than net
tuition revenue within the U.S. between FY 2010 and FY 2018. The tabstat
command can include options to display other statistics, such as the mean,
median (50th percentile), standard deviation, minimum, and maximum, by
unit (e.g., state). The options can also include specifying the width of the
variable labels, a long format, displaying the statistics in columns rather than
rows, and suppressing the column total. For example, the syntax would be as
follows (all on one line):
tabstat Netuition FTEStudents, stat(mean median sd min max
cv) labelwidth(30) long format by(state) col(stat) nototal
In Fig. 6.1 (the remainder of the output after Idaho is omitted), we can
compare the descriptive statistics for net tuition revenue and FTE students
Fig. 6.1 Net tuition revenue and FTE students by state, descriptive statistics
across states. Using tabstat with options, we can show the same set of
statistics by fiscal year.
In Fig. 6.2, we see that the coefficient of variation (CV) of net tuition
revenue has declined slightly between FY 2000 and 2018. We also see that
the CV of net tuition revenue across states has been consistently less than
that of FTE students over the same time period.
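For reference, the coefficient of variation reported by the cv statistic is the standard deviation s scaled by the mean x̄, which is what makes dispersion comparable across variables measured in different units:

```latex
CV = \frac{s}{\bar{x}}
```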
6.2.3 Distributions
Fig. 6.2 Net tuition revenue and FTE students by year, descriptive statistics
cross tabulations (crosstabs). Using data from the High School Longitudinal
Study of 2009 (HSLS:09), we demonstrate how to show the frequencies of
various racial/ethnic categories of the variable X1RACE.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files"
use "Example 6.1.dta", clear
prop X1RACE
The output for the last line of syntax is shown in Fig. 6.7.
An alternative to the above procedure is to use the table command with
variable X1RACE, reflecting the original race/ethnicity categories to depict
mean hourly earnings by race/ethnicity by sex.
. table X1RACE X1SEX, contents(mean EarnHr)
This is shown in the output below in Fig. 6.8.
The table command provides more options with regard to the statistics
(e.g., median, percentiles, etc.) that can be shown. Formatting options can
also be included, such as stub (first column) width and other features.
Fig. 6.7 HSLS:09 race/ethnicity categories by sex, mean earnings per hour
Fig. 6.8 HSLS:09 race/ethnicity categories by sex, mean earnings per hour
From the output above, we see that our data are balanced with each of
the 50 states having 27 years of data. We invoke the xttab command.
. xttab region_compact
6.3 Graphs
6.3.1 Graphs—EDA
When conducting exploratory data analysis (EDA), graphs are useful tools
to quickly and initially determine whether certain assumptions of various
statistical techniques, such as regression, are valid. To ascertain if data for
a particular continuous variable has a normal distribution, one can create a
histogram with a superimposed normal curve. We illustrate this by using the
dataset above. First, we create a new variable, stapr_fte (state appropriations
per FTE student). Then we create a histogram, with a superimposed normal
curve, of the stapr_fte data.
gen stapr_fte = stapr/fte
histogram stapr_fte, normal
Figure 6.9 shows that stapr_fte data are not normally distributed and
skewed to the right. This indicates that before any additional analysis (e.g.,
regression) is conducted, a transformation (e.g., logarithmic) of the data may
be required. (More on this will be discussed in the next chapter.)
We can also create a box chart of the same data to examine the distribution
of state appropriations data per FTE student by using the following syntax:
graph box stapr_fte
If the data are normally distributed, the line (the median) would be in
the middle of the box (the 25th and 75th percentiles). We can see that in
Fig. 6.10, however, the median is closer to the lower end of the box. The
graph also shows outliers at the upper end of the box, indicating a positive
skew.
A histogram can also be created to provide a quick depiction of the
frequency of categories. For example, if we wanted to see the distribution
of states by regional compact, we can easily do so using the following syntax
with the added options to include labels and percent (all on one line):
histogram region_compact, discrete addlabels ylabel(,grid) xlabel(0 1 2 3
4, valuelabel) percent
From Fig. 6.11, we can easily see that the largest proportion (32%) of
the states are in the SREB compact. From an analytical perspective, this
information is useful if we need to know to what extent we may need
to collapse the data into a smaller number of categories, due to skewed
distributions across categories, prior to additional analysis.
Given the distribution of states across regional compacts, we may also want
to see if state appropriations are distributed normally by regional compact.
This can be easily done by invoking the following syntax:
Fig. 6.13 Box chart of state appropriations per FTE student by regional compact
and has a few outliers at the upper end (75th percentile) of the box. So Fig.
6.13 provides additional information regarding the characteristics of the data.
As part of EDA, scatter plots can also be used by analysts to show the
simple relationship between two continuous variables at a given point in time.
For example, we can show how net tuition revenue per FTE student is related
to state appropriations per FTE student in fiscal year 2016. Here is the syntax
and results:
graph twoway scatter stapr_fte netuit_fte if year==2016
Figure 6.14 shows there is a negative relationship between state appropri-
ations per FTE student and net tuition revenue per FTE student. We can fit
a regression line (more on this in the next chapter) through the data points
in Fig. 6.14 by slightly changing the previous syntax and typing the following
(all on one line):
twoway (scatter stapr_fte netuit_fte) (lfit stapr_fte netuit_fte) if
year==2016
or
twoway (scatter stapr_fte netuit_fte, mlabel(state)) (lfit stapr_fte
netuit_fte) if year==2016
Fig. 6.14 Scatter plot of state appropriations and net tuition revenue per FTE student
Fig. 6.15 Scatter plot of state appropriations and net tuition revenue per FTE student
with a fitted line
While we can see how far they are from the regression line, we do not know
which states are outliers (see Fig. 6.15). We can, however, do so by simply
adding the option mlabel(state), as in the second version of the syntax above.
Figure 6.16 shows that Alaska (AK), Arizona (AZ), Wyoming (WY),
Hawaii (HI), and Connecticut (CT) are among the outliers and can be
interpreted as having an influence on the regression line. This suggests those
states should be excluded from any subsequent analysis of the relationship
between state appropriations and net tuition revenue per FTE student in
2016.
Finally, scatter plots can be used to determine if the relationship between
two variables changes over time by employing a Stata user-written program,
aaplot (Cox 2015). After installing the aaplot (ssc install aaplot), we create
scatter plots for two different time periods (1990 and 2016) to see if the
relationship remains the same over time.
aaplot netuit_fte stapr_fte if year==1990
aaplot netuit_fte stapr_fte if year==2016
Figure 6.17 is the scatter plot with data for 1990. We see not only the
scatter plot but also some statistics. First, we see a negative relationship
between net tuition revenue per FTE student and state appropriations per
FTE student. Second, the R2, which measures the fit between the data points
and the regression line, is 9.3%. This means 9.3% of the variance in net
tuition revenue per FTE student is explained by the variance in state
appropriations per FTE student in 1990.
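As a reminder of what this statistic measures, R2 is one minus the ratio of residual variation to total variation; values closer to 1 indicate a closer fit between the data points and the regression line:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```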
Fig. 6.16 Scatter plot of state appropriations and net tuition revenue per FTE student
with a fitted line
Fig. 6.17 State appropriations and net tuition revenue per FTE student and regression
line, FY1990
Fig. 6.18 State appropriations and net tuition revenue per FTE student and regression
line, FY 2016
Like the previous figure, Fig. 6.18 indicates there is a negative relationship
between net tuition revenue per FTE student and state appropriations per
FTE student. Compared to 1990, there is a slightly closer fit (R2 = 13%)
between the data points and the regression line. Together, the two graphs
(Figs. 6.17 and 6.18) suggest the relationship between net tuition revenue
per FTE student and state appropriations per FTE student did not change
over time.
6.4 Conclusion
The measures of central tendency and graphs that are discussed above are
examples of descriptive statistics and EDA that can be used to provide
basic information to data users. These basic methods can and should also be
employed to better understand the nature of data used with intermediate and
advanced methods such as multiple regression as well as other techniques that
are used in higher education policy analysis and evaluation. In the following
chapters, we will turn our attention to those methods and techniques.
6.5 Appendix
*Chapter 6 Syntax
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "US high school graduates in 2012 enrolled in PSE, by state.dta"
*compute the arithmetic means
ameans public private
mean public private
*Measures of dispersion
*Distributions
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files"
use "Example 6.1.dta", clear
*with labels
table X1RACE X1SEX, contents(mean EarnHr)
clear
*Graphs - EDA
*histogram, with a superimposed normal curve
*create a new variable
gen stapr_fte = stapr/fte
*scatter plots to show the simple relationship between two continuous variables
graph twoway scatter stapr_fte netuit_fte if year==2016
*scatter plots to show the simple relationship between two continuous ///
variables with fitted regression line
twoway (scatter stapr_fte netuit_fte) (lfit stapr_fte netuit_fte) if year==2016
*run aaplot for two different time periods (1990 & 2016)
aaplot netuit_fte stapr_fte if year==1990
aaplot netuit_fte stapr_fte if year==2016
*close dataset
clear
*end
Reference
Cox, N. J. (2015). AAPLOT: Stata module for scatter plot with linear and/or quadratic
fit, automatically annotated. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s457286.html
Chapter 7
Introduction to Intermediate
Statistical Techniques
7.1 Introduction
where Yi is the actual and Ŷi is the expected outcome for unit (e.g., state) i.
For an OLS regression model with one independent variable, or what is known
as a bivariate OLS regression model, the estimated beta coefficients (slope and
intercept) in Figs. 6.22 and 6.23 are calculated as:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} \quad (7.3)$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \quad (7.4)$$
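Equations (7.3) and (7.4) can be sketched numerically. The chapter's own examples use Stata; the Python below is only an illustration with hypothetical data constructed so the line y = 1.6 + 0.6x fits exactly.

```python
# Bivariate OLS slope (Eq. 7.3) and intercept (Eq. 7.4), computed from
# deviations about the means; hypothetical data for illustration only.
def ols_bivariate(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # slope: sum of cross-products over sum of squared deviations in X
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar          # intercept (Eq. 7.4)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.2, 2.8, 3.4, 4.0, 4.6]        # exactly y = 1.6 + 0.6x
b0, b1 = ols_bivariate(x, y)
print(b0, b1)                        # ≈ 1.6 and 0.6, up to rounding
```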
106 7 Introduction to Intermediate Statistical Techniques
------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -.354383 .132125 -2.68 0.010 -.6200382 -.0887278
_cons | 10192.75 1107.684 9.20 0.000 7965.599 12419.9
------------------------------------------------------------------------------
We can see that R2 , 0.1303, is the same value as what was shown
in Fig. 6.18. But the regression output provides an analysis-of-variance
(ANOVA) table with the model and residual (errors) sum of squares (SS),
the degrees of freedom (df), and mean square (MS).2 Information from the
ANOVA table can be used to calculate the R2 , which is the regression
model sum of squares (RSS) divided by the total sum of squares (TSS) or
R2 = RSS/TSS, where
$$RSS = \sum_{i=1}^{n}\left(\hat{Y}_i - \bar{Y}\right)^2$$

$$TSS = \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2$$

1 For OLS regression formulas with more than one independent variable, see introductory
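The identity R2 = RSS/TSS can be checked by hand. A minimal sketch with hypothetical data (Python for illustration; the fitted values come from a bivariate OLS line fit to the same data):

```python
# R-squared as the regression (model) sum of squares over the total sum
# of squares, using fitted values from an OLS line; hypothetical data.
def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    rss = sum((yh - y_bar) ** 2 for yh in y_hat)   # model SS
    tss = sum((yi - y_bar) ** 2 for yi in y)       # total SS
    return rss / tss

y     = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]   # fitted values from the line y = 2.2 + 0.6x
print(r_squared(y, y_hat))          # ≈ 0.6, up to rounding
```

Note that the residual sum of squares makes up the remainder: RSS plus the residual SS equals TSS.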
With respect to the overall regression model, the output includes the F-
statistic and its statistical significance, the adjusted R2, and the root mean
square error (root MSE), where

$$\text{adjusted } R^2 = R^2 - \frac{k}{n-1} \cdot \frac{n-1}{n-k-1}$$
The smaller sβ, the larger tn−2, and the more likely it is that the null
hypothesis will be rejected and the parameter (β) estimate claimed to be
statistically significant. If we can reject the null hypothesis with more
than 95% certainty (95% of the values of β lie within mean ±1.96 * standard
deviation) then we can say β is not equal to zero or not the result of statistical
chance. This is the same as saying there is less than a 5% probability (p value
<0.05) the estimated β coefficient is equal to zero. In education and most of
the social sciences, p < 0.05 is acceptable to claim statistical significance. If
the p value is greater than 0.05, then we cannot reject the null hypothesis.
Therefore, to make the claim that there is statistical significance with respect
to the state appropriations per FTE student variable, we would need to reject
the null hypothesis (H0) that the beta coefficient β1 is equal to zero and
accept the alternative hypothesis (Ha) that β1 is not equal to zero. This is
represented as:

H0: β1 = 0
Ha: β1 ≠ 0
So the standard errors of the βs are VERY important!
The adjusted R2 is lower than the unadjusted R2 , taking into account the
one independent variable X. This suggests that 11% of the variability of net
tuition revenue per FTE student is explained by the regression model.
The estimated beta coefficient for state appropriations per FTE student is
−.354383, equal to what is shown in Fig. 6.18. This suggests that, on average,
a one dollar increase in state appropriations per FTE student will result in
a decrease of 35 cents in net tuition revenue per FTE student. The t statistic
for stapr_fte equals −2.68, which means the coefficient is statistically
significantly different from zero at the 0.05 level (p = 0.010).3 Therefore,
based on this example, we can say net tuition revenue per FTE student is
negatively related to state appropriations per FTE student across 50 states
in 2016.
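As a quick cross-check on the regression table above, the t statistic can be recomputed by hand from the reported coefficient and standard error (Python used only for illustration):

```python
# t statistic = estimated coefficient / standard error,
# using the values reported for stapr_fte in the output above.
coef = -0.354383
se = 0.132125
t = coef / se
print(round(t, 2))  # → -2.68, matching the Stata output
```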
In the real world of higher education policy analysis and research, however,
OLS regression models with only one independent variable should never be
used to address a question about the importance of a policy-oriented variable.
At the very least, the regression should include control variables. These are
not policy-oriented variables that can be manipulated by higher education
policymakers. So, we turn our attention to an OLS regression model with two
or more independent variables, otherwise known as multiple or multivariate
OLS regression. Expanding on Eq. (7.1), a multivariate OLS regression model
is represented mathematically as the following:
3 The t statistic is equal to the estimated beta coefficient divided by the standard error. So
the smaller the standard error, the larger the absolute value of the t statistic.
7.2 Review of OLS Regression 109
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_n X_{ni} + \varepsilon_i \quad (7.5)$$
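The beta coefficients in a multivariate OLS model are obtained by solving the normal equations, (X'X)b = X'y. A minimal pure-Python sketch with hypothetical data built so that y = 1 + 2·x1 + 3·x2 exactly (the chapter's own analyses use Stata's reg command):

```python
# Multivariate OLS via the normal equations (X'X) b = X'y, solved with
# Gaussian elimination; hypothetical data for illustration only.
def ols(X, y):
    # prepend a constant column for the intercept
    X = [[1.0] + list(row) for row in X]
    k = len(X[0])
    # build X'X and X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b  # [intercept, b1, b2, ...]

X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
print([round(v, 6) for v in ols(X, y)])  # → [1.0, 2.0, 3.0]
```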
-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -1.608785 .5061895 -3.18 0.003 -2.627692 -.5898788
stapr_fte2 | .0000543 .0000232 2.33 0.024 7.45e-06 .0001011
pc_income | .1322943 .0533078 2.48 0.017 .0249912 .2395974
_cons | 9744.101 3472.105 2.81 0.007 2755.115 16733.09
-------------------------------------------------------------------------------
We see the adjusted R2 is now 0.278, suggesting the model explains 28%
of the variability in net tuition revenue per FTE student across states in
2016. More importantly, the size of the estimated beta coefficient for state
appropriations per FTE student is now −1.61. Because it only relies on cross-
sectional data in 2016, this multiple regression model may have actually
produced biased estimates of the beta coefficients. With data from only
50 cases (i.e., states), a multiple regression model limits the number of
independent variables that may be included in the model.
For example, suppose we include seven independent variables in our model
(e.g., by adding the variable reflecting states grouped by regional compacts).
Relative to the number of independent units of analysis (50 states), this
means that the degrees of freedom (the number of observations minus the
number of estimated beta coefficients, including the constant) will be
reduced. Multiple regression models with very low degrees of freedom may
result in inefficient estimates of the beta coefficients.
If available, data should be used that allow us to overcome possible
problems of low degrees of freedom and, consequently, inefficient estimates
of beta coefficients. The availability of panel data (discussed in
Chap. 4) would enable us to run pooled OLS (POLS) regression models. The
following example illustrates this point where we regress net tuition revenue
per FTE student on the same variables shown in the previous example.
However, now we use panel data (50 states across 27 years).
reg netuit_fte stapr_fte stapr_fte2 pc_income
The output is:
. reg netuit_fte stapr_fte stapr_fte2 pc_income
-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -1.018307 .0773341 -13.17 0.000 -1.170015 -.8665983
stapr_fte2 | .0000329 4.33e-06 7.60 0.000 .0000244 .0000413
pc_income | .2036221 .0048243 42.21 0.000 .1941581 .2130862
_cons | 2403.068 320.5399 7.50 0.000 1774.256 3031.88
-------------------------------------------------------------------------------
From the results of the POLS, we see the number of observations, 1350
(50 × 27), is substantially larger than in the previous output. The adjusted
R2 at 57.6% is also greater, while the root MSE is smaller, indicating a
better model fit. But more relevant to a higher education policy analyst
are the estimated beta coefficients, specifically for stapr_fte, which is now
−1.018. This indicates that while there is still a negative relationship
between net tuition revenue per FTE student and state appropriations per
FTE student, the magnitude of the beta coefficient is smaller than when
using the 2016 cross-sectional data.
The larger number of observations will also enable us to include more
independent variables in our POLS regression model without being too
concerned about low degrees of freedom. For example, we can now include
the categorical variable representing region compacts (region_compact) in
-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -1.04053 .0750848 -13.86 0.000 -1.187826 -.8932333
stapr_fte2 | .0000383 4.22e-06 9.09 0.000 .00003 .0000466
pc_income | .1917324 .0047585 40.29 0.000 .1823976 .2010672
|
region_compact |
SREB | 185.804 194.7014 0.95 0.340 -196.1481 567.7562
WICHE | -957.9857 199.7539 -4.80 0.000 -1349.85 -566.1219
MHEC | 99.67403 197.3705 0.51 0.614 -287.5143 486.8623
NEBHE | 1100.607 215.7601 5.10 0.000 677.3429 1523.87
|
_cons | 2712.485 366.2787 7.41 0.000 1993.944 3431.027
-------------------------------------------------------------------------------
We see that controlling for regional compact does not substantially change
the estimated beta coefficient for state appropriations per FTE student or for
any of the other variables. It is worth noting that compared to states that are not
members of regional compacts, WICHE states have lower net tuition revenue
per FTE student and NEBHE states have higher net tuition revenue per FTE
student.
Because we are using pooled data, we can also include more variables,
including interaction terms. Interaction terms are combinations of existing
variables. The combination may include the following:
1. two or more categorical variables
2. two or more continuous variables
3. one or more categorical variables with one or more continuous variables
------------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------------+--------------------------------------------------------------
stapr_fte | .0275013 .0289854 0.95 0.343 -.0293604 .084363
|
region_compact |
None | 0 (base)
SREB | -4376.048 663.3507 -6.60 0.000 -5677.367 -3074.728
WICHE | -4350.232 610.8338 -7.12 0.000 -5548.527 -3151.937
MHEC | -3356.754 660.639 -5.08 0.000 -4652.754 -2060.754
NEBHE | -917.9704 637.625 -1.44 0.150 -2168.823 332.8823
|
ugradmerit |
No | 0 (base)
Yes | -2149.968 648.7122 -3.31 0.001 -3422.571 -877.3649
|
region_compact#ugradmerit |
None#No | 0 (base)
None#Yes | 0 (base)
SREB#No | 0 (base)
SREB#Yes | 3477.178 732.8958 4.74 0.000 2039.429 4914.927
WICHE#No | 0 (base)
WICHE#Yes | 2837.446 696.7433 4.07 0.000 1470.619 4204.274
MHEC#No | 0 (base)
MHEC#Yes | 3084.481 735.2593 4.20 0.000 1642.096 4526.867
NEBHE#No | 0 (base)
NEBHE#Yes | 3028.864 743.8004 4.07 0.000 1569.723 4488.005
|
_cons | 6658.134 600.8424 11.08 0.000 5479.439 7836.829
------------------------------------------------------------------------------------------
Because the F test comparing the main effects model with the interaction
model yields a p value of 0.0001, we can reject the null hypothesis and
conclude that the model with the interaction terms helps to explain more
variance in net tuition revenue per FTE enrollment. Using the testparm
command, the statistical significance of the interaction terms can also be
checked.
. testparm i.region_compact#i.ugradmerit
( 1) 1.region_compact#1.ugradmerit = 0
( 2) 2.region_compact#1.ugradmerit = 0
( 3) 3.region_compact#1.ugradmerit = 0
( 4) 4.region_compact#1.ugradmerit = 0
F( 4, 1339) = 5.84
Prob > F = 0.0001
The test results above indicate that the interaction terms as a whole are
statistically significant.
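The logic behind such a joint test can be sketched with the R2-change F statistic, F = [(R2_full − R2_reduced)/q] / [(1 − R2_full)/(n − k − 1)]. The values below are hypothetical, chosen only to match the degrees of freedom in the output above (q = 4 added interaction terms, n = 1350, so n − k − 1 = 1339 when k = 10); Python is used only for illustration:

```python
# F statistic for comparing a reduced (main-effects) model with a full
# (interaction) model via the change in R-squared; hypothetical values.
def f_change(r2_full, r2_reduced, q, n, k):
    # q = number of restrictions (added interaction terms)
    # k = number of predictors in the full model
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - k - 1))

# hypothetical example: adding 4 interaction terms raises R2 from .570 to .577
F = f_change(r2_full=0.577, r2_reduced=0.570, q=4, n=1350, k=10)
print(round(F, 2))
```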
What if we wanted to investigate if the difference in net tuition revenue
per FTE enrollment by tuition-setting authority (i.tuitset) changes with the
amount of state appropriations per FTE enrollment? This is an example of
number 3 above, where the interaction term is composed of one continuous
variable and one categorical variable. The following syntax includes “c.”,
which indicates state appropriations per FTE enrollment (c.stapr_fte) is a
continuous variable.
. reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset
-------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
ugradmerit |
No | 0 (base)
Yes | 561.8776 152.1065 3.69 0.000 263.4843 860.271
|
region_compact|
None | 0 (base)
SREB | -1500.987 276.8269 -5.42 0.000 -2044.049 -957.9247
WICHE | -2130.328 283.3574 -7.52 0.000 -2686.201 -1574.454
MHEC | -1020.981 280.5026 -3.64 0.000 -1571.254 -470.7078
NEBHE | 1018.53 320.9729 3.17 0.002 388.8648 1648.195
|
stapr_fte | .2060764 .217321 0.95 0.343 -.2202508 .6324036
|
tuitset |
Legislature | 0 (base)
State-Wide Board | 10193.37 1678.683 6.07 0.000 6900.234 13486.51
System Board | 1900.957 1390.018 1.37 0.172 -825.8971 4627.81
Campus | 3296.063 1411.991 2.33 0.020 526.1053 6066.022
|
tuitset#c.stapr_fte |
State-Wide Board | -1.310195 .273725 -4.79 0.000 -1.847173 -.7732179
System Board | -.1069439 .2198251 -0.49 0.627 -.5381836 .3242958
Campus | -.2043566 .224125 -0.91 0.362 -.6440316 .2353184
|
_cons | 1895.032 1400.074 1.35 0.176 -851.5488 4641.613
-------------------------------------------------------------------------------------
. testparm c.stapr_fte#i.tuitset
( 1) 2.tuitset#c.stapr_fte = 0
( 2) 3.tuitset#c.stapr_fte = 0
( 3) 4.tuitset#c.stapr_fte = 0
F( 3, 1337) = 17.31
Prob > F = 0.0000
The results of the regression model show that the difference in net
tuition revenue per FTE enrollment by tuition-setting authority (specifically
for state-wide boards compared to the reference category, the legislature),
declines with increases in state appropriations per FTE enrollment. The
results of the post-estimation test indicate that the interaction terms are
statistically significant.
What if we wanted to find out how the relationship between net tuition
revenue per FTE enrollment and state appropriations changes as the amount
of state total need-based financial aid (state_needFTE) changes? This is an
example of number 2 above. Therefore, the regression model should include
an interaction term that is composed of two continuous variables, as shown
below in the output.
. reg netuit_fte i.region_compact c.stapr_fte##c.state_needFTE
---------------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------------------+----------------------------------------------------------------
region_compact |
None | 0 (base)
The results above indicate that the relationship between net tuition
revenue and state appropriations per FTE enrollment is captured in the
interaction term that reflects state appropriations and total need-based aid.
If we focus on state total need-based aid changes, this means that the
relationship between net tuition revenue per FTE enrollment and state total
need-based aid changes as state appropriations per FTE enrollment changes.
(This would be the case, even if state appropriations per FTE enrollment by
itself was not statistically significant.)
The interpretation of the results of a regression with an interaction term
that is composed of two continuous variables is facilitated with the use of the
margins and marginsplot post-estimation commands. To restrict some of
the output, we include the vsquish option.
. margins, dydx(stapr_fte) at(state_needFTE=(0(3000)10000)) vsquish
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte |
_at |
1 | .1285269 .0364127 3.53 0.000 .057095 .1999588
2 | -1.047798 .2012571 -5.21 0.000 -1.442611 -.6529854
3 | -2.224123 .4220569 -5.27 0.000 -3.052086 -1.39616
4 | -3.400448 .6435904 -5.28 0.000 -4.663001 -2.137895
------------------------------------------------------------------------------
The margins command above computes the marginal effect (dy/dx) of state
appropriations per FTE enrollment at values of state total need-based aid
per FTE enrollment of $0, $3000, $6000, and $9000. The output indicates
that state appropriations per FTE enrollment is statistically significant at
all of those values of state total need-based aid per FTE enrollment. We can
show how this relationship changes at each value in a graph by entering the
following syntax.
. qui margins, at(stapr_fte=(0 10000) state_needFTE=(0(3000)10000)) vsquish
.
. marginsplot, noci x(stapr_fte) recast(line) xlabel(0(3000)10000)
We can see from Fig. 7.1 that where there is no state need-based
financial aid per FTE enrollment, the relationship between net tuition
revenue per FTE enrollment and state appropriations per FTE enrollment
is slightly positive. But as the amount of state need-based financial aid
per FTE enrollment increases the relationship between net tuition revenue
per FTE enrollment and state appropriations per FTE enrollment becomes
increasingly negative. This suggests that as states increase their funding
directly to students, net tuition revenue to institutions declines more rapidly
in response to higher amounts of state appropriations.
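The marginal effects that margins reports can be reproduced by hand: with a continuous-by-continuous interaction, dy/dx for stapr_fte is its own coefficient plus the interaction coefficient times the value of state_needFTE. A sketch in Python, with the interaction coefficient backed out from the margins output above (so treat it as approximate):

```python
# Marginal effect of stapr_fte at a given level of state_needFTE:
# dy/dx = b_stapr + b_interaction * state_needFTE.
b_stapr = 0.1285269                              # dy/dx at state_needFTE = 0
b_interaction = (-1.047798 - 0.1285269) / 3000   # implied by the first two rows

def dydx(state_needFTE):
    return b_stapr + b_interaction * state_needFTE

for z in (0, 3000, 6000, 9000):
    print(z, round(dydx(z), 6))
# dydx(6000) ≈ -2.224123 and dydx(9000) ≈ -3.400448, matching the output
```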
But it is possible that the estimated beta coefficients in this POLS
regression are biased due to violations of one or more of the seven classical
OLS assumptions presented in Sect. 7.2.1. More specifically, we can and
should check to see if some of the assumptions have been violated by
performing post-estimation diagnostics. One such diagnostic is a residual-
Fig. 7.1 Predictive margins of net tuition revenue per FTE by state need-based aid per
FTE
versus-fitted plot that can be created immediately after running the regression
by simply typing the Stata command syntax, rvfplot. This command graphs
the following plot.
We can see from Fig. 7.2 that the residuals are more dispersed in the
middle of the graph than at the right and left. This indicates there is a
violation of the assumption that the error term (ε) has a constant variance
or of homoscedasticity. Additionally, it is quite possible that the errors are
not normally distributed. So a comprehensive post-estimation test should be
conducted to detect whether, in addition to the violation of the assumption of
normally distributed errors, there is also heteroscedasticity. This is done by typing the
Stata command syntax estat imtest, which produces the following output:
. estat imtest
---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 189.76 24 0.0000
Skewness | 63.95 7 0.0000
Kurtosis | 9.56 1 0.0020
---------------------+-----------------------------
Total | 263.27 32 0.0000
---------------------------------------------------
The p values indicate the assumptions of homoscedasticity and normally
distributed errors have been violated. To take into account these two
violations of assumptions, we should rerun our POLS regression model using
the robust option.
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, robust
------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -1.04053 .0721036 -14.43 0.000 -1.181978 -.8990817
stapr_fte2 | .0000383 4.03e-06 9.51 0.000 .0000304 .0000462
pc_income | .1917324 .0060699 31.59 0.000 .1798248 .20364
|
region_compact |
SREB | 185.804 199.3342 0.93 0.351 -205.2365 576.8446
WICHE | -957.9857 180.9863 -5.29 0.000 -1313.033 -602.9389
From this output we see the estimated beta coefficients are the same but
some of the standard errors have changed. But, it is also possible that the
variability of the dependent variable is unequal across a range of independent
variables or there is group-wise heteroscedasticity. In other words, net tuition
revenue per FTE student within each state may not be independent, leading
to residuals that are correlated within states.
heteroscedasticity, another test should be conducted. This test, which is
robust to non-normality, is called the Levene test of homogeneity and is
conducted in the following steps.4
quietly: reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
predict double eps, residual
robvar eps, by(state)
4 For a full description of the Levene test, see Levene, H. (1960). Robust tests for equality
State        |        Summary of Residuals
abbreviation |     Mean     Std. Dev.    Freq.
------------+------------------------------------
AK | 1122.2639 939.42995 27
AL | 1616.6929 1956.758 27
AR | 162.56587 334.56975 27
AZ | -137.19617 477.91889 27
CA | -2498.2149 1121.2912 27
CO | -90.322544 701.35696 27
CT | -845.30401 699.56268 27
DE | 4531.8731 3107.7021 27
FL | -2373.7873 730.65775 27
GA | -810.79875 723.35512 27
HI | 1297.0811 471.53693 27
IA | 1141.2379 504.31709 27
ID | 468.24506 360.61989 27
IL | -1369.4974 1150.3627 27
IN | 1404.8174 1242.093 27
KS | -933.50045 366.96024 27
KY | 623.90684 721.21209 27
LA | -1011.2335 521.60785 27
MA | -2230.438 826.23532 27
MD | -752.82225 415.40112 27
ME | 873.48914 1048.5964 27
MI | 1873.1069 1620.3943 27
MN | -134.58677 664.61368 27
MO | -865.87429 391.30801 27
MS | 504.26361 703.23813 27
MT | 483.33723 393.45575 27
NC | -630.84519 306.35169 27
ND | 227.6154 755.3359 27
NE | -662.25153 372.39504 27
NH | -1012.948 760.14257 27
NJ | 223.57714 440.59753 27
NM | 254.49388 293.12376 27
NV | -464.44503 327.67539 27
NY | -1739.7618 502.25295 27
OH | 385.59024 496.92629 27
OK | -845.73432 563.76589 27
OR | 500.09117 664.07122 27
PA | 1516.1847 582.46832 27
RI | -78.700309 720.27019 27
SC | 878.38713 769.57448 27
SD | 172.33711 638.97413 27
TN | -197.12965 483.98213 27
TX | -1033.8177 637.22766 27
UT | 680.8684 424.88764 27
VA | -1089.88 508.53689 27
VT | 3293.9012 1334.3393 27
WA | -1094.1928 523.50132 27
WI | -1238.9945 541.81062 27
WV | 428.35934 789.9017 27
WY | -522.0092 1628.6087 27
------------+------------------------------------
Total | -1.169e-13 1567.427 1,350
The output above shows that Delaware (DE), Alabama (AL), Wyoming
(WY), Michigan (MI), and Vermont (VT) have very large standard devia-
tions, which suggests they are outliers. But more relevant to the Levene test,
the p value of W0 (which is more robust to non-normality than the other
tests) indicates the null hypothesis of equality of variances is rejected. This strongly
suggests there is group-wise heteroscedasticity. To address this particular
violation of the assumption of homoscedasticity, we use the cluster option,
with state as the cluster variable in our POLS regression model.
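The W0 statistic that robvar reports is Levene's test: a one-way ANOVA F statistic computed on the absolute deviations of each observation from its group mean. A minimal pure-Python sketch with small hypothetical groups (the chapter's own test is run in Stata on the residuals):

```python
# Levene's test statistic W0: one-way ANOVA F computed on
# z_ij = |x_ij - group mean|; large W0 suggests unequal group variances.
def levene_w0(groups):
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(x - m) for x in g])
    N = sum(len(zg) for zg in z)
    k = len(z)
    grand = sum(sum(zg) for zg in z) / N
    between = sum(len(zg) * (sum(zg) / len(zg) - grand) ** 2 for zg in z)
    within = sum(sum((v - sum(zg) / len(zg)) ** 2 for v in zg) for zg in z)
    return ((N - k) / (k - 1)) * between / within

print(levene_w0([[1, 2, 3], [4, 5, 6]]))    # equal spread: W0 = 0
print(levene_w0([[1, 2, 3], [0, 5, 10]]))   # unequal spread: W0 > 0
```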
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)
Compared to the previous regression model with the robust option, this
model produces different results with respect to the statistical significance
of the categorical variables reflecting regional compacts. In this model, net
tuition revenue per FTE student is not related to state membership in
regional compacts. Using this example, we can see that not taking into
account the clustered nature of the residuals with respect to the states
results in making false claims about the statistical significance of certain
variables or Type I errors (rejection of true null hypotheses). Therefore,
when employing POLS regression models we should always test for group-wise
heteroscedasticity and, if called for, use appropriate standard errors that
reflect the relaxation of the intragroup independence assumption.

7.4 Fixed-Effects Regression
$$\hat{Y}_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \cdots + \beta_n X_{nit} + \alpha_1 D_{2i} + \alpha_2 D_{3i} + \cdots + \alpha_{N-1} D_{Ni} + u_{it} + \varepsilon_{it} \quad (7.7)$$

where each α is the estimated coefficient for each of the respective state
dummy variables (D). Equation (7.7) excludes the first dummy variable (D1),
which is the reference group. Applying this equation to a state-level
panel dataset, αi is a state fixed-effect as the “effect” of state i is “fixed”
across all years. In Eq. (7.7), each α represents a different state fixed-effect,
while β 1 . . . β n are the same for all states.
Using the panel data from the example above and dummy variables, we
show how state fixed-effects can be taken into account by adding i.stateid
to the multivariate POLS regression model (without the regional compact
categorical variable) above.
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state)
F(2, 49) = .
Prob > F = .
R-squared = 0.8989
Root MSE = 837.7
( 1) 2.stateid = 0
( 2) 3.stateid = 0
( 3) 4.stateid = 0
( 4) 5.stateid = 0
( 5) 6.stateid = 0
( 6) 7.stateid = 0
( 7) 8.stateid = 0
( 8) 9.stateid = 0
( 9) 10.stateid = 0
[omitted output]
(45) 46.stateid = 0
(46) 47.stateid = 0
(47) 48.stateid = 0
(48) 49.stateid = 0
(49) 50.stateid = 0
F( 3, 49) = 30.34
Prob > F = 0.0000
We reject the null that the coefficients for all 49 state dummy variables
are jointly equal to zero. Therefore, state fixed-effects can be retained in
the regression model. We see from the output that every state except the
first state, Alabama, was included in the regression results. Compared to
Alabama, net tuition revenue per FTE student is lower in every state
Except for the estimated beta coefficients for the 49 states, the results are
exactly the same as the previous output. This option is very useful when running
a FEDV multivariate POLS regression model with many units or groups
(e.g., institutions). For example, suppose we are conducting a study of how
education and general (EG) expenditures across 220 public master’s colleges
and universities (over 10 years) are related to state appropriations (controlling
While the use of the Stata command areg enables us to run a FEDV regression
model that takes into account unobserved time-invariant heterogeneity,
xtreg allows us to do the same via the within-group estimator. The within-
group estimator involves the indirect use of the between-effects model, which
regresses the group mean of the dependent variable on the group means of
the independent variables. This is reflected in Eq. (7.8). The within-group
estimator fixed-effects regression is obtained by subtracting Eq. (7.8) from
Eq. (7.6).
$$\bar{Y}_i = \beta_1 \bar{X}_{1i} + \beta_2 \bar{X}_{2i} + \cdots + \beta_n \bar{X}_{ni} + \mu_i + \bar{\varepsilon}_i \quad (7.8)$$
The result of this subtraction, also known as “time demeaning” the data, is
the disappearance of the μi term, the time-invariant unobserved heterogeneity.
In Stata, this is equivalent to using the xtreg command with the fe option.
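The within-group (time-demeaning) transformation can be sketched numerically: subtract each group's mean from its observations, then run OLS on the demeaned data. With a hypothetical two-state panel where each state has its own intercept but a common slope of 2, pooled OLS is distorted by the group effects while the within estimator recovers the slope (Python for illustration only):

```python
# Within-group (fixed-effects) estimator by time demeaning, versus pooled OLS.
def slope(x, y):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
           sum((xi - xb) ** 2 for xi in x)

def demean(values, groups):
    means = {g: sum(v for v, gg in zip(values, groups) if gg == g) /
                groups.count(g) for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]

# state A: high intercept (10), state B: low intercept (0); true slope = 2
groups = ["A", "A", "A", "B", "B", "B"]
x = [4, 5, 6, 1, 2, 3]
y = [18, 20, 22, 2, 4, 6]                           # y = state effect + 2*x

print(slope(x, y))                                  # pooled OLS: biased (≈ 4.57)
print(slope(demean(x, groups), demean(y, groups)))  # within estimator: 2.0
```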
. xtreg eg statea tuition totfteiarep ftfac ptfac, fe cluster(opeid5_new)
F(5,219) = 284.17
corr(u_i, Xb) = -0.7836 Prob > F = 0.0000
While the output shows that the estimated beta coefficients are the same as
those produced by the FEDV POLS regression model with dummy variables
using the areg command, the above output provides more information.
First, it shows the within R2, between R2, and the overall R2. The within
R2 measures how much variation in the dependent variable within groups
(e.g., institutions) is explained over time by the regression model. The
between R2 measures how much variation in the dependent variable between
groups is captured by the model. The overall R2 is a weighted average of
the within R2 and the between R2 . In some cases, the within R2 will be
higher than the between R2 and in other cases, the reverse may hold true.
Because most higher education policy research is more concerned with the
importance (i.e., the statistical significance of beta coefficients) of policy-
oriented variables, there is less focus on the R2 s.
Second, information is provided about the time-invariant group-specific
error term (μi ) and the idiosyncratic error term (εi ). The sigma_u is the
Drawing heavily from Furquim et al. (2020) and using their notation, the
DiD estimator is based on the following:
$$\delta_{DiD} = \left(\bar{Y}_1^T - \bar{Y}_0^T\right) - \left(\bar{Y}_1^C - \bar{Y}_0^C\right) \quad (7.9)$$
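Equation (7.9) is just arithmetic on four group means: the pre-to-post change for the treated group minus the pre-to-post change for the comparison group. A sketch with hypothetical outcome values (Python for illustration only):

```python
# Difference-in-differences estimator (Eq. 7.9) from four group means;
# hypothetical outcome data for a treated and a comparison group.
def mean(v):
    return sum(v) / len(v)

def did(treat_pre, treat_post, control_pre, control_post):
    return (mean(treat_post) - mean(treat_pre)) - \
           (mean(control_post) - mean(control_pre))

delta = did(treat_pre=[10, 12], treat_post=[16, 18],
            control_pre=[8, 10], control_post=[11, 13])
print(delta)  # → 3.0: treated rose by 6, comparison by 3
```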
In 2004, Colorado enacted Senate Bill 189 (SB 04-189) to establish the College
Opportunity Fund (COF) program. Starting in 2005, COF-designated
higher education institutions no longer received state appropriations. Instead,
funding was provided to resident undergraduate students in the form of a
stipend to help pay their tuition. The legislation also required that 20% of
increased resident tuition be set aside for financial aid. This suggested that
net tuition should not increase substantially. If Colorado state policymakers
ask whether COF had an effect on net tuition revenue, then a fixed-effects
regression-based DiD model is an appropriate technique that analysts can
use to address this question.
use "Example 7.1.dta", clear
. reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob
note: 2016.year omitted because of collinearity
note: 8.fips omitted because of collinearity
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.T | -1111.343 365.6843 -3.04 0.002 -1829.184 -393.5028
1.P | 4634.093 427.505 10.84 0.000 3794.899 5473.288
|
T#P |
1 1 | 501.2044 202.7361 2.47 0.014 103.2322 899.1765
|
stapr_fte | -.1933747 .0320378 -6.04 0.000 -.2562652 -.1304842
pc_income | .0001359 .0198814 0.01 0.995 -.0388913 .0391632
|
year |
2001 | 219.4993 172.0082 1.28 0.202 -118.1539 557.1525
[omitted output]
|
fips |
2 | -165.3607 397.6397 -0.42 0.678 -945.9298 615.2084
[omitted output]
|
_cons | 5565.35 539.019 10.32 0.000 4507.252 6623.447
------------------------------------------------------------------------------
We see from the output above that the DiD coefficient (δDiD) is positive
and statistically significant (beta = 501, p < 0.05). This suggests that net
tuition revenue per FTE enrollment was, on average, higher by $501 in
Colorado after passage of SB 04-189, compared to net tuition revenue per
FTE enrollment in all other states.
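The regression with the T#P interaction recovers the same quantity: in the simple 2 × 2 case with no controls, the OLS coefficient on the interaction term equals the difference-in-means estimator of Eq. (7.9). A sketch under that assumption, in Python with a tiny normal-equations solver and invented noiseless data (not the chapter's panel or its Stata estimation):

```python
# OLS via the normal equations (X'X)b = X'y, solved by Gaussian elimination.
# One noiseless observation per cell; columns are [const, T, P, T*P].
X = [[1, 0, 0, 0],   # control, pre
     [1, 0, 1, 0],   # control, post
     [1, 1, 0, 0],   # treated, pre
     [1, 1, 1, 1]]   # treated, post
# Hypothetical outcomes constructed so the true DiD effect is 500.
y = [8300.0, 9100.0, 8200.0, 9500.0]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

n, k = len(X), len(X[0])
XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
beta = solve(XtX, Xty)
print(round(beta[3], 6))  # coefficient on T*P: the DiD estimate, 500.0
```

With controls, year effects, and state effects added (as in the Stata model above), the interaction coefficient is no longer an exact difference of four raw means, but the interpretation as the DiD effect is the same.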
The within-group fixed-effects DiD regression model (xtreg) can also be
employed.
xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob
-------------+----------------------------------------------------------------
1.T | 0 (omitted)
1.P | 4634.093 1141.528 4.06 0.000 2340.108 6928.079
|
T#P |
1 1 | 501.2044 162.1192 3.09 0.003 175.4137 826.995
|
stapr_fte | -.1933747 .0767252 -2.52 0.015 -.3475598 -.0391896
pc_income | .0001359 .0554728 0.00 0.998 -.1113407 .1116126
|
year |
2001 | 219.4993 66.723 3.29 0.002 85.41442 353.5842
[omitted output]
|
_cons | 4291.512 1517.395 2.83 0.007 1242.192 7340.831
-------------+----------------------------------------------------------------
sigma_u | 2066.6057
sigma_e | 685.81506
rho | .90079681 (fraction of variance due to u_i)
------------------------------------------------------------------------------
For comparison, we run the within-group fixed-effects model with the second
control group (states in WICHE):
xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob
F(12,12) = .
corr(u_i, Xb) = -0.0404 Prob > F = .
T#P |
1 1 | 947.8925 215.3771 4.40 0.001 478.6261 1417.159
|
stapr_fte | -.1722081 .1092674 -1.58 0.141 -.4102812 .065865
pc_income | .0047754 .0758187 0.06 0.951 -.1604195 .1699702
|
year |
2001 | 106.4016 75.83226 1.40 0.186 -58.82273 271.6259
[omitted output]
|
_cons | 3187.94 2251.055 1.42 0.182 -1716.689 8092.568
-------------+----------------------------------------------------------------
sigma_u | 1228.8195
sigma_e | 543.94789
rho | .83615752 (fraction of variance due to u_i)
------------------------------------------------------------------------------
We see that when the second control group is used, the DiD coefficient
(δDiD) is also positive and statistically significant, but its value is higher
($948).
The preferred regression-based DiD specification is ultimately a matter of
the analyst's judgment. The choice depends on the treatment period in which
the analyst believes the policy began to take full effect, the control variables,
and the control group.
As a robustness check, we can conduct a placebo test by assigning a false
(pre-policy) treatment period and rerunning the fixed-effects DiD model:
F(3,12) = .
corr(u_i, Xb) = -0.2875 Prob > F = .
We see from the output directly above that the coefficient for the placebo
is statistically insignificant. This suggests the effect of SB 04-189 on net
tuition revenue per FTE enrollment is “real”. Policy analysts are encouraged
to conduct several placebo tests and use different control groups to validate
their findings.
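The logic of a placebo test can be sketched as follows, in illustrative Python with fabricated, noiseless data (not the chapter's panel): when the treated and control states follow parallel trends before the policy, a fake treatment period placed in the pre-policy years should yield a DiD estimate of zero, while the true period recovers the effect.

```python
# Fabricated annual outcomes: parallel trends pre-2005, plus a 500-unit
# jump for the treated state from 2005 onward.
years = list(range(2000, 2010))
control = {yr: 8000.0 + 100.0 * (yr - 2000) for yr in years}
treated = {yr: 7800.0 + 100.0 * (yr - 2000) + (500.0 if yr >= 2005 else 0.0)
           for yr in years}

def did(policy_year, last_year):
    """DiD over [2000, last_year] with 'post' defined by policy_year."""
    pre = [yr for yr in years if yr < policy_year]
    post = [yr for yr in years if policy_year <= yr <= last_year]
    mean = lambda d, ys: sum(d[yr] for yr in ys) / len(ys)
    return (mean(treated, post) - mean(treated, pre)) - \
           (mean(control, post) - mean(control, pre))

true_effect = did(2005, 2009)   # actual policy year
placebo = did(2002, 2004)       # fake policy year, pre-policy window only
print(true_effect, placebo)     # 500.0 0.0
```

In real data the placebo estimate will not be exactly zero; the point is that it should be statistically insignificant, as in the output above.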
The Breusch and Pagan (1980) Lagrange multiplier test (xttest0) can be
used to test whether panel-level effects are present.6 The output is below.
Estimated results:
| Var sd = sqrt(Var)
---------+-----------------------------
netuit_e | 6674588 2583.522
e | 701748.5 837.7043
u | 1525396 1235.069
Test: Var(u) = 0
chibar2(01) = 8010.80
Prob > chibar2 = 0.0000
Because the null hypothesis that Var(u) = 0 is rejected (p < 0.001), there is
significant variance across panels, and a random-effects model is preferred
over pooled OLS.
6 For a complete discussion of this test, see Breusch and Pagan (1980).
7 For a technical discussion of the Hausman test, see Hausman (1978).
7.5 Random-Effects Regression 137
The Hausman test can be conducted in Stata in five steps. First, we quietly
run (i.e., not showing the results) a within-group fixed-effects model.
Second, we store those estimated results (i.e., est
sto fixed) to memory, which is illustrated below. Third, we quietly run a
random-effects model. Fourth, we store those estimated results (i.e., est sto
random). Fifth, we run the Hausman test (i.e., hausman fixed random,
sigmamore).
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
est sto fixed
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
est sto random
hausman fixed random
Given the ordering of the stored estimated results in the last line of syntax,
a rejection of the null would indicate the fixed-effects regression is the more
appropriate model. The output is below.
. quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
.
. est sto fixed
.
. quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
.
. est sto random
.
. hausman fixed random
Note: the rank of the differenced variance matrix (3) does not equal the
number of coefficients being tested (5); be sure this is what you expect,
or there may be problems computing the test. Examine the output of
your estimators for anything unexpected and possibly consider scaling your
variables so that the coefficients are on a similar scale.
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fixed random Difference S.E.
-------------+----------------------------------------------------------------
statea | .6359503 .711084 -.0751337 .0078128
tuition | 1.20119 1.078007 .1231832 .0101439
totfteiarep | 1050.312 -332.6668 1382.979 296.0531
ftfac | 32819.51 10317.97 22501.54 6505.906
ptfac | 7375.78 5765.71 1610.069 2575.428
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg
chi2(3) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B)
= 45.67
Prob>chi2 = 0.0000
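The chi-squared statistic printed above is the quadratic form (b − B)'[V_b − V_B]^(−1)(b − B). A sketch of the computation for a hypothetical two-coefficient case, in Python (the coefficient values and the 2 × 2 variance-difference matrix below are invented, not taken from the output above):

```python
# Hausman statistic H = (b - B)' [Vb - VB]^(-1) (b - B), two coefficients.
b = [0.74, 1.00]   # fixed-effects estimates (hypothetical)
B = [0.64, 1.20]   # random-effects estimates (hypothetical)
q = [b[0] - B[0], b[1] - B[1]]          # differences: [0.1, -0.2]

# Difference of the estimated covariance matrices (hypothetical values)
V = [[0.010, 0.002],
     [0.002, 0.020]]

# Invert the 2x2 matrix analytically
det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
Vinv = [[ V[1][1] / det, -V[0][1] / det],
        [-V[1][0] / det,  V[0][0] / det]]

# Quadratic form q' Vinv q
H = sum(q[i] * Vinv[i][j] * q[j] for i in range(2) for j in range(2))
print(round(H, 3))  # 3.469: below the 5% chi2(2) cutoff of 5.99, so in this
                    # hypothetical case we would fail to reject and keep RE
```

In the chapter's output, by contrast, the statistic (45.67) far exceeds the critical value, so the null is rejected and fixed effects are preferred.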
While the above results of the Hausman test suggest we should use the
fixed-effects regression model, the note in the beginning of the output states
there may be possible problems with the test and recommends rescaling the
variables. To rescale the variables, we log transform the variables and rerun
the test. This time we will show the entire output, including the results of
each regression model, by omitting the quietly (qui) prefix.
. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, fe
F(5,1753) = 708.86
corr(u_i, Xb) = -0.8252 Prob > F = 0.0000
-------------------------------------------------------------------------------
lneg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+---------------------------------------------------------------
lnstatea | .0128249 .0045956 2.79 0.005 .0038114 .0218384
lntuition | .5562887 .0157124 35.40 0.000 .5254718 .5871057
lntotfteiarep | .113466 .0386945 2.93 0.003 .0375738 .1893581
lnftfac | .5642428 .0458174 12.32 0.000 .4743802 .6541054
ptfac | .0003861 .0000482 8.01 0.000 .0002915 .0004806
_cons | 3.801971 .3236401 11.75 0.000 3.16721 4.436732
--------------+---------------------------------------------------------------
sigma_u | .32099191
sigma_e | .12532947
rho | .86771903 (fraction of variance due to u_i)
-------------------------------------------------------------------------------
F test that all u_i=0: F(219, 1753) = 16.90 Prob > F = 0.0000
.
. est sto fixed
.
. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re
-------------------------------------------------------------------------------
lneg | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+---------------------------------------------------------------
lnstatea | .0192334 .004174 4.61 0.000 .0110526 .0274143
lntuition | .5254071 .0151305 34.72 0.000 .4957518 .5550624
lntotfteiarep | .0644028 .0331053 1.95 0.052 -.0004824 .129288
lnftfac | .3408924 .0344727 9.89 0.000 .273327 .4084577
lnptfac | .0417042 .0071786 5.81 0.000 .0276343 .0557741
_cons | 5.823929 .1957728 29.75 0.000 5.440222 6.207637
--------------+---------------------------------------------------------------
sigma_u | .15783994
sigma_e | .12699568
rho | .60703286 (fraction of variance due to u_i)
-------------------------------------------------------------------------------
.
. est sto random
.
. hausman fixed random
chi2(4) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B)
= 454.45
Prob>chi2 = 0.0000
While the results of the Hausman test using the log-transformed variables
are more accurate, they are based on models that do not allow us to take
into account heteroscedasticity via cluster-robust standard errors. This is a
limitation of the standard Hausman test provided by Stata. For this reason,
we now turn to a Stata user-written Hausman routine (rhausman) by Kaiser
(2015) that addresses this limitation. (We have to download this program by
typing ssc install rhausman.) Using this program, the log-transformed
variables, and the models with cluster-robust errors, we rerun the Hausman
test. The options reps(400) and cluster are included to allow for random sampling
with replacement (i.e., 400 times) and to take into account the cluster
variable (the institution identifier opeid5_new), respectively.8
. quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) fe
. est sto fixed
. quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) re
. est sto random
. rhausman fixed random, reps(400) cluster
bootstrap in progress
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
(This bootstrap will approximately take another 0h. 1min. 10sec.)
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
-------------------------------------------------------------------------------
Cluster-Robust Hausman Test
(based on 400 bootstrap repetitions)
b1: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) fe
b2: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) re
Given the output directly above, now we can be confident that the results
of the Hausman test accurately indicate the fixed-effects regression model is
more appropriate. Using a fixed-effects regression model and our panel data
of public master’s universities and colleges, we can now conclude that E&G
expenditures are positively related to state appropriations (lnstatea), tuition
revenue (lntuition), total FTE students (lntotfteiarep), full-time faculty
(lnftfac), and part-time faculty (ptfac).
7.6 Summary
7.7 Appendix
*Chapter 7 Stata syntax
*test to see if there is an interaction effect by quietly (qui) running the ///
models and storing (est sto) the model results without (model1) and with (model2) the interaction
*Using the testparm command, the statistical significance of the interaction ///
terms can also be checked.
testparm i.region_compact#i.ugradmerit
*if the interaction term is composed of one continuous (c) variable and one ///
categorical (i) variable
reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset
testparm c.stapr_fte#i.tuitset
*comprehensive post-estimation
estat imtest
*regression model.
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)
*Fixed-Effects Regression
* Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) ///
Regression Estimating FEDV Multivariate POLS Regression Models
reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state)
*Based on every state other than the treatment state (Colorado), we create the ///
first control group.
gen C1 = 0
replace C1=1 if state !="CO"
*we use the global command to create temporary variables reflecting the ///
dependent variable net tuition revenue per FTE enrollment (y)
global y "netuit_fte"
*and the set of control variables state appropriations to higher education per ///
FTE enrollment (stapr_fte) and state per capita income (pc_income).
global controls "stapr_fte pc_income"
*To take into account heteroscedasticity, we include robust (rob) as ///
an option in the syntax.
reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob
*For comparison, we run the within-group fixed-effects model with the second ///
control group (states in WICHE).
xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob
*Random-Effects Regression
xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, ///
re cluster(stateid)
* Hausman test
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
est sto fixed
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
est sto random
hausman fixed random
*run rHausman
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) fe
est sto fixed
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) re
*end
References
Breusch, T. S., & Pagan, A. R. (1980). The Lagrange Multiplier Test and its Applications
to Model Specification in Econometrics. The Review of Economic Studies, 47 (1), 239–
253. JSTOR. https://doi.org/10.2307/2297111
Furquim, F., Corral, D., & Hillman, N. (2020). A Primer for Interpreting and Designing
Difference-in-Differences Studies in Higher Education Research. In L. W. Perna (Ed.),
Higher Education: Handbook of Theory and Research: Volume 35 (pp. 667–723).
Springer International Publishing. https://doi.org/10.1007/978-3-030-31365-4_5
Guan, W. (2003). From the help desk: Bootstrapped standard errors. The Stata Journal,
3 (1), 71–80.
Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46 (6), 1251–
1271. JSTOR. https://doi.org/10.2307/1913827
Hoechle, D. (2007). Robust Standard Errors for Panel Regressions with Cross-Sectional
Dependence. The Stata Journal: Promoting Communications on Statistics and Stata,
7 (3), 281–312. https://doi.org/10.1177/1536867X0700700301
Hutchinson, S. R., & Lovell, C. D. (2004). A review of methodological characteristics
of research published in key journals in higher education: Implications for graduate
research training. Research in Higher Education, 45 (4), 383–403.
Judge, G. G., Hill, R. C., Griffiths, W. E., Lutkepohl, H., & Lee, T.-C. (1988). Introduction
to the Theory and Practice of Econometrics (2nd ed.). Wiley.
Kaiser, B. (2015). RHAUSMAN: Stata module to perform Robust Hausman Specification
Test. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s457909.html
Wells, R. S., Kolek, E. A., Williams, E. A., & Saunders, D. B. (2015). “How We Know
What We Know”: A Systematic Comparison of Research Methods Employed in Higher
Education Journals, 1996–2000 v. 2006–2010. The Journal of Higher Education,
86 (2), 171–198.
Chapter 8
Advanced Statistical Techniques: I
8.1 Introduction
When conducting policy analysis using time series data and regression
techniques, it is very likely that we will encounter violations of many OLS
assumptions. Among other things, OLS assumes that the error terms are
uncorrelated with one another; in other words, it assumes no autocorrelation
is present. Autocorrelation occurs when the residual or idiosyncratic error
(ε) for one time period (εt) is correlated with the error for the subsequent
time period (εt+1). In a regression framework, the first-order autoregressive
disturbance term (ρ) is reflected in the following equations:
Yt = βXt + ut , (8.1)
where
ut = ρut−1 + εt
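A quick simulation in Python (illustrative only; the value ρ = 0.7 is arbitrary) makes the AR(1) structure in Eq. (8.1) visible: disturbances generated this way show a strong lag-1 correlation, which is exactly what the autocorrelation tests discussed below are designed to detect.

```python
import random

# Simulate AR(1) disturbances u_t = rho * u_{t-1} + e_t with rho = 0.7
random.seed(42)
rho, n = 0.7, 500
u = [0.0]
for _ in range(n):
    u.append(rho * u[-1] + random.gauss(0, 1))

# Sample correlation between u_t and its first lag u_{t-1}
x, z = u[1:], u[:-1]
mx, mz = sum(x) / len(x), sum(z) / len(z)
cov = sum((a - mx) * (c - mz) for a, c in zip(x, z))
var_x = sum((a - mx) ** 2 for a in x)
var_z = sum((c - mz) ** 2 for c in z)
r = cov / (var_x * var_z) ** 0.5
print(round(r, 2))  # close to 0.7: pronounced first-order autocorrelation
```

Setting rho to 0 in the sketch produces a lag-1 correlation near zero, the no-autocorrelation case OLS assumes.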
Second, we set the data to a time series by typing tsset year. Third, after
observing that the data are skewed, we log transform and visually inspect
the data for each of the three variables over time. We use the following Stata
syntax (which must be entered on one line) to create a line graph of the log
of enrollment, tuition, and unemployment. (Take note of the options.)
twoway (line lnenpub2yr year, lcolor(black) lpattern(solid))
(line lntupub2yr year, lcolor(black) lpattern(dash))
(line lnunemprate year, lcolor(black) lpattern(dot)),
xlabel(1970 (6) 2017, labsize(small)) ytitle(Logs)
title("Trends in Enrollment in 2 YR, Tuition at 2 YR, and
Unemployment Rates" "1970 to 2017", size(medium))
Figure 8.1 shows that log transformed enrollment and tuition changes with
time (years). This observation suggests the time series data are nonstationary,
which would produce a spurious relationship between the variables and
unreliable beta coefficients when using a regression model. But more evidence
is needed to conclude that the data are nonstationary, or in other words that
they contain a unit root. To detect a unit root, statistical tests are conducted.
The best-known unit root test is the augmented Dickey–Fuller (ADF) test
(Dickey and Fuller 1979). However, the modified Dickey–Fuller test (known
as the DF-GLS test) is more powerful than the ADF test (Elliott et al.
1996).1 Therefore, we will use the DF-GLS unit root test.
The null hypothesis of the DF-GLS test is that the time series of the variable
has a unit root, while the alternative is (1) stationary about a linear time trend
or (2) stationary with a possibly nonzero mean but with no linear time
trend. When we conduct the DF-GLS test via Stata (dfgls), we use the
first alternative.
. dfgls lnenpub2yr
DF-GLS for lnenpub2yr Number of obs = 38
1 The DF-GLS unit root test uses generalized least squares (GLS) regression to de-trend
the data.
Fig. 8.1 Enrollment, tuition, and unemployment, changes over time (1970–2017)
From the test results for the log of enrollment, we can see that the null
hypothesis of a unit root is not rejected for any of the lags.
. dfgls lntupub2yr
From the test results for the log of tuition, we can see that the null
hypothesis of a unit root is rejected for only lag 1, but not at the 1% level.
. dfgls lnunemprate
DF-GLS for lnunemprate Number of obs = 38
Maxlag = 9 chosen by Schwert criterion
DF-GLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value
------------------------------------------------------------------------------
9 -1.674 -3.770 -2.723 -2.425
8 -1.603 -3.770 -2.783 -2.490
7 -2.413 -3.770 -2.850 -2.559
6 -2.039 -3.770 -2.921 -2.630
5 -2.230 -3.770 -2.994 -2.701
4 -1.908 -3.770 -3.066 -2.769
3 -2.330 -3.770 -3.133 -2.833
2 -2.294 -3.770 -3.195 -2.889
1 -3.013 -3.770 -3.247 -2.937
Opt Lag (Ng-Perron seq t) = 1 with RMSE .1173187
Min SIC = -4.09427 at lag 1 with RMSE .1173187
Min MAIC = -3.763267 at lag 2 with RMSE .1165867
The test results indicate there is a unit root in the unemployment rate data.
So, the DF-GLS unit root test results above confirm that all three variables
are nonstationary.
Because the data are nonstationary, we have to transform the time series
data by taking their first differences and then run the regression model
on the first-differenced data. Differencing is simply computing the differences
between consecutive observations; in other words, subtracting the previous
value from the current value. While the logarithmic transformation of the
data may stabilize the variance, differencing may produce a constant mean
for a time series. In Stata, we can automatically difference data by inserting
the D1. operator in the syntax, as shown below. We first create a graph of
the first-differenced time
series. We enter the following syntax (all on one line).
twoway (line D1.lnenpub2yr year, lcolor(black) lpattern(solid))
(line D1.lntupub2yr year, lcolor(black) lpattern(dash))
(line D1.lnunemprate year, lcolor(black) lpattern(dot)),
xlabel(1971 (5) 2017, labsize(small)) ytitle(Change in Logs)
title("First-Differenced Enrollment in 2 YR, Tuition at
2 YR, and Unemployment Rates" "1971 to 2017", size(small))
Figure 8.2 shows that the first-differenced logged data (except enrollment)
do not change with time and are, for the most part, stationary. Now we can
regress the first-differenced log of enrollment on the first-differenced log of
tuition and unemployment.
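The mechanics of first-differencing a logged series can be sketched in a few lines of Python (illustrative numbers, not the chapter's data): differencing the logs turns a steadily trending series into a constant growth rate, which is exactly the stabilization we are after.

```python
import math

# A hypothetical series growing 10% per year: trending in levels
enroll = [100.0, 110.0, 121.0, 133.1]

logs = [math.log(v) for v in enroll]
dlogs = [b - a for a, b in zip(logs, logs[1:])]  # first differences of logs

# Each first-differenced value equals log(1.1): a constant growth rate
print([round(d, 4) for d in dlogs])  # [0.0953, 0.0953, 0.0953]
```

This is also why the first-differenced coefficients later in the chapter are read as short-run (year-over-year) relationships.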
Fig. 8.3 Autocorrelation (correlogram) of the residuals from the regression model
We then examine the partial autocorrelations by generating the residuals
from the model (predict residuals, resid) and creating a graph of partial
autocorrelations (pac residuals, yw). We see that after the first lag, the
partial autocorrelations of the residuals dissipate at the higher lags and are
well within the 95% confidence interval (Fig. 8.4).
Combined, these visuals suggest evidence of first-order autocorrelation
(AR1) that should be addressed before using a final regression model.
However, for more definitive evidence, we should conduct statistical tests.
When using time series data, the most common tests for autocorrelation are
the Durbin–Watson (D-W) test (Durbin and Watson 1950) and the Breusch–
Godfrey (B-G) test (Breusch 1978; Godfrey 1978). The D-W test is based
on a measure of autocorrelation in the residuals from a regression model.
That measure or D-W (d ) statistic always has a value between 0 and 4. A
d statistic with a value of 2.0 indicates there is no autocorrelation, while a
value from 0 to less than 2 indicates a positive autocorrelation. A d statistic
with a value from 2 to 4 indicates negative autocorrelation. Different versions
of the D-W test are based on different assumptions regarding the exogeneity
of the independent variables. The results of the D-W test that are based on
the work of Durbin and Watson (1950) assume the independent variables
are exogenous.
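The d statistic itself is simple to compute: d = Σ(e_t − e_{t−1})² / Σe_t². A short illustrative Python function (with made-up residual patterns, not the chapter's residuals) shows how persistence pushes d below 2 and sign-flipping pushes it above 2:

```python
def durbin_watson(e):
    """d = sum of squared successive differences / sum of squared residuals."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

persistent = [1, 1, 1, -1, -1, -1]     # positively autocorrelated residuals
alternating = [1, -1, 1, -1, 1, -1]    # negatively autocorrelated residuals

print(round(durbin_watson(persistent), 3))   # 0.667 (well below 2)
print(round(durbin_watson(alternating), 3))  # 3.333 (well above 2)
```

Residuals with no serial pattern yield a d close to 2, which is why the statistic of 0.813 below points to positive autocorrelation.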
To demonstrate how to conduct a D-W test when using time series data,
we use the same data from above, assume the independent variables are
exogenous, and use the Stata post-estimation time series command estat
dwatson.
. estat dwatson
Durbin-Watson d-statistic( 3, 47) = .8127196
8.4 Time Series Regression Models with AR terms 153
While the value of the D-W d statistic shown above is 0.813, the D-
W test does not tell us whether or not the value is statistically different
from 2. In addition, the results are based on the assumptions of exogenous
independent variables, a normal distribution of the residuals or errors (ε),
and homoscedastic errors. (The OLS regression model that we ran did not
take into account possible heteroscedasticity.) So, we “quietly” (i.e., do not
show the output) rerun the regression model with the robust (rob) option
and use the alternative D-W test, the post-estimation time series command
estat durbinalt, with the force option.
. quietly: reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
.
---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 24.039 1 0.0000
---------------------------------------------------------------------------
H0: no serial correlation
We can see from the above output of the D-W alternative test that the
null hypothesis (Ho ) of no serial correlation (no autocorrelation) is rejected
(p < 0.001).
The rho (ρ) reported in the Prais–Winsten (P-W) regression output above
shows there is positive autocorrelation. The regression model with an AR1
term shows that the first-differenced log transformed tuition and
unemployment variables are statistically significant.
We also see that the value of the transformed D-W d statistic is 2, suggesting
no autocorrelation. However, we should examine the autocorrelation and
partial autocorrelation functions of the residuals from the P-W regression.
We do so by first generating residuals from the P-W regression (predict
residuals_PW, resid) and creating graphs of autocorrelations and partial
autocorrelations of those residuals.
Given Fig. 8.5, it appears as if there is still autocorrelation even after we ran
the P-W regression. The partial autocorrelation function in Fig. 8.6 further
provides evidence of first-order autocorrelation (AR1).
Because the alternative D-W test for autocorrelation does not work
after running a P-W regression in Stata, we use the Cumby-Huizinga (C-H)
general test of the residuals. However, the Stata user-written program for
the C-H test has to be downloaded (ssc install actest). We will check for
autocorrelation of the residuals from the P-W regression (residuals_PW) for
up to four lags (lag (4)), specify the null hypothesis of no autocorrelation at
any lag order (q=0), and take into account possible heteroscedasticity (rob).
So our Stata syntax for this test is: actest residuals_PW, lag(4) q0 rob.
The output is as follows.
Looking at the panel on the right in the output from the C-H test, we can
see that the null hypothesis of no autocorrelation is rejected at both the first
(chi2 = 5.369, p < 0.05) and second (chi2 = 5.812, p < 0.05) lags, indicating
first-order (AR1) and second-order (AR2) autocorrelation.2 Unfortunately,
Stata’s Prais–Winsten (prais) regression allows for including only AR1.
Therefore, we have to use an autoregressive–moving-average (ARMA) model
with only autoregressive terms and exogenous independent variables, or
commonly known as an ARMAX model.3 ARMA models can accommodate
autoregressive disturbance terms with more than one lag and are reflected in
an expansion of Eq. (7.1) in the previous chapter to include the following:
Yt = βXi + μt
p
q
(8.2)
μt = ρi μt−1 + θj εt−j + εt
i=1 j−1
From the results of the ARMAX model shown above, we see that the
AR1 disturbance term is statistically significant (beta = 0.455, p < 0.01) but
not the AR2 disturbance term. However, we examine the residuals from the
ARMAX model to see if there is any autocorrelation and conduct a final test.
. predict residuals_ARMX12, resid
(1 missing value generated)
Both Figs. 8.7 and 8.8 suggest there is no autocorrelation of the residuals
from the ARMAX model with AR1 and AR2 disturbance terms. Using the
C-H general test, we conduct a final test to detect autocorrelation.
. actest residuals_ARMX12 , lag(4) q0 rob
Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 0.120 1 0.7288 | 1 | 0.120 1 0.7288
1 - 2 | 0.208 2 0.9014 | 2 | 0.089 1 0.7648
1 - 3 | 2.721 3 0.4367 | 3 | 2.567 1 0.1091
1 - 4 | 2.756 4 0.5995 | 4 | 0.746 1 0.3877
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
We see from the C-H general test results above, the null hypothesis
of no autocorrelation cannot be rejected. So, the results of the C-H test
combined with the Figs. 8.7 and 8.8 allow us to definitively conclude there
is no autocorrelation when using our time series data and the ARMAX
model above. Given the ARMAX model results, it can now be stated with
confidence that enrollment in public 2-year colleges is negatively related to
published tuition and fees at public 2-year colleges and positively related to
unemployment rates.
Because the ARMAX model was used with first-differenced variables, the
interpretation of the results is based on an average short-term (1 year) rather
than an average over the long-term (e.g., 47 years) relationship. If we wanted
to make a statement based on the latter, we would have to fit an ARMAX
model to the data levels rather than their first differences. Fortunately, we
can do this by using the diffuse option in Stata.4 (The nolog option is also
included to not show the iteration log.) We show the output below.
. arima lnenpub2yr lntupub2yr lnunemprate, ar(1 2 ) rob diffuse nolog
ARIMA regression
Sample: 1970 - 2016 Number of obs = 47
Wald chi2(4) = 30163.62
Log pseudolikelihood = 86.21852 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Semirobust
lnenpub2yr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr |
lntupub2yr | -.1989154 .0891667 -2.23 0.026 -.3736788 -.024152
lnunemprate | .1478185 .029172 5.07 0.000 .0906425 .2049945
_cons | 16.97047 .7746909 21.91 0.000 15.45211 18.48884
4 For more information on the diffuse option, see the Stata Reference Time-Series Manual,
Release 16, and Ansley and Kohn (1985) and Harvey (1989).
-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | 1.305528 .0153133 85.25 0.000 1.275514 1.335542
L2. | -.3506986 .0030851 -113.67 0.000 -.3567453 -.3446518
-------------+----------------------------------------------------------------
/sigma | .022082 .0020042 11.02 0.000 .0181539 .0260101
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.
From the results above, we can see that all the independent variables are
statistically significant. We also see that both autocorrelation disturbance
terms (AR1 and AR2) are statistically significant. Like with the ARMAX
model using first-differences, the C-H test is used to detect any remaining
autocorrelation.
. predict residuals_nsARMA12dn, resid
. actest residuals_nsARMA12dn, q0 rob lag(4)
Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 0.892 1 0.3450 | 1 | 0.892 1 0.3450
1 - 2 | 0.977 2 0.6136 | 2 | 0.067 1 0.7958
1 - 3 | 1.797 3 0.6156 | 3 | 0.503 1 0.4782
1 - 4 | 2.237 4 0.6923 | 4 | 0.256 1 0.6127
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | 1.225775 .190853 6.42 0.000 .8517105 1.59984
L2. | -.2997325 .1754243 -1.71 0.088 -.6435578 .0440928
-------------+----------------------------------------------------------------
/sigma | .02378 .0025161 9.45 0.000 .0188485 .0287114
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.
We see that the results are the same as in the previous output. One
final check after fitting a final ARMAX model is to examine the model's
stability: the estimated dependent variable should not increase without bound
over time, and its variance should be independent of time. More specifically,
the estimated parameters (ρ) in our second-order AR (AR2) model must meet
the following conditions:
ρ2 + ρ1 < 1
ρ2 − ρ1 < 1
−1 < ρ2 < 1
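These conditions can be checked directly against the estimated AR coefficients. Using the rounded AR estimates from the levels (diffuse) ARMAX output above (ρ1 ≈ 1.3055, ρ2 ≈ −0.3507), a quick Python check confirms all three hold:

```python
# AR coefficients from the levels (diffuse) ARMAX output above, rounded
rho1, rho2 = 1.3055, -0.3507

cond1 = rho2 + rho1 < 1    # 0.9548 < 1 -> True
cond2 = rho2 - rho1 < 1    # -1.6562 < 1 -> True
cond3 = -1 < rho2 < 1      # True

print(cond1 and cond2 and cond3)  # True: the AR(2) process is stable
```

Note that ρ1 + ρ2 = 0.9548 is close to 1, so the process is stable but highly persistent.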
5 For more information on inverse roots, see the Stata Reference Time-Series Manual,
Release 16.
We include the option output to show the results of the regression of the
first-differenced variables.
. xtserial lnnetuit lnstapr lnfte lnpc_income, output
Linear regression Number of obs = 1,300
F(3, 49) = 266.10
Prob > F = 0.0000
R-squared = 0.3332
Root MSE = .09355
(Std. Err. adjusted for 50 clusters in stateid)
------------------------------------------------------------------------------
| Robust
D.lnnetuit | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnstapr |
D1. | -.2611843 .0510568 -5.12 0.000 -.3637868 -.1585818
|
lnfte |
D1. | .6437485 .1010357 6.37 0.000 .4407098 .8467873
|
lnpc_income |
D1. | 1.377408 .060067 22.93 0.000 1.256699 1.498117
------------------------------------------------------------------------------
Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
F( 1, 49) = 83.583
Prob > F = 0.0000
We see from the output that the estimated AR1 parameter (rho_ar) is 0.86. It should be noted that the xtregar command is limited in several ways. First, it does not allow for higher-order autoregressive (AR) disturbance terms. Second, there is no option to estimate robust standard errors. Third, it cannot take into account possible cross-sectional dependence in the data, which we will discuss later in the chapter.
The use of xtregar, as shown above, is appropriate if the time series data in our panel are stationary. If we are uncertain whether our data are stationary, however, we should conduct a series of tests prior to using xtregar. Fortunately, there are several first-generation panel unit root tests (PURTs) to choose from in Stata.6 Here, we use the Stata user-written routine xtpurt, which implements the most recently developed second-generation PURTs, tests that account for serial correlation and heteroskedasticity (Herwartz et al. 2018). Herwartz and Siedenburg (2008) contend that second-generation PURTs allow for cross-sectional error correlation. (The xtpurt routine, however, requires a balanced panel dataset.) We include the default option hs, reflecting the Herwartz and Siedenburg test.7
. xtpurt lnnetuit
Herwartz and Siedenburg (2008) unit-root test for lnnetuit
-----------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 22
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 2.8239 0.9976
6 For more information on unit root tests for panel data, see the Stata Longitudinal-Data/Panel-Data Reference Manual.
------------------------------------------------------------------------------
. xtpurt lnstapr
Herwartz and Siedenburg (2008) unit-root test for lnstapr
----------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 23
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 1.3166 0.9060
------------------------------------------------------------------------------
. xtpurt lnfte
Herwartz and Siedenburg (2008) unit-root test for lnfte
--------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 23
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.6345 0.7371
------------------------------------------------------------------------------
. xtpurt lnpc_income
Herwartz and Siedenburg (2008) unit-root test for lnpc_income
--------------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 22
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=1 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.2176 0.5861
------------------------------------------------------------------------------
The results of the unit root tests show that the null hypothesis that the panels contain unit roots was not rejected for any of the variables, indicating the panel contains nonstationary time series. This means we have to include first-differenced variables in our final fixed- or random-effects regression model with an AR1 disturbance term. We run this model using quietly (or qui for short) to suppress the regression output.
. qui xtregar D1.lnnetuit D1.lnstapr D1.lnfte D1.lnpc_income, re
8.8 Cross-Sectional Dependence 167
We see from the results of the test that autocorrelation of the model’s
residuals is not present. Unfortunately, xtregar is limited in that its
estimated standard errors are not robust to heteroscedasticity and cross-
sectional dependence, which we will discuss in the next section.
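The first-differencing used above (Stata's D1. operator) amounts to differencing each variable within each panel unit. Here is a minimal pandas sketch with invented values; the stateid, year, and lnnetuit names mirror the variables above, but the data are hypothetical:

```python
import pandas as pd

# Hypothetical mini-panel: two states, three years each, in long format.
df = pd.DataFrame({
    "stateid": [1, 1, 1, 2, 2, 2],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "lnnetuit": [8.0, 8.1, 8.3, 7.5, 7.6, 7.8],
})

# First-difference within each state (the analogue of Stata's D1. operator);
# sorting by year first ensures differences are taken in time order.
df = df.sort_values(["stateid", "year"])
df["D_lnnetuit"] = df.groupby("stateid")["lnnetuit"].diff()
print(df["D_lnnetuit"].tolist())
```

The first observation of each state is missing by construction, just as D1. produces a missing value for the first year of each panel unit.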
There are several tests that use the uncommon factor approach to detect cross-sectional dependence in panel data. These include the Pesaran (2004), Friedman (1937), and Frees (1995) tests, which were made available in Stata by De Hoyos and Sarafidis (2006). Each of these tests is based on the correlation coefficients of the residuals from OLS regressions of the time series data within each individual unit (e.g., institution, state) in a panel. After running a fixed-effects (xtreg, fe) or random-effects (xtreg, re) regression model in Stata, we can conduct the Pesaran, Friedman, and Frees tests with the post-estimation commands xtcsd, pesaran; xtcsd, friedman; and xtcsd, frees, respectively. We demonstrate the use of all three tests below.
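To make the logic of these residual-based tests concrete, the following Python sketch computes Pesaran's CD statistic, CD = √(2T/(N(N−1))) · Σi<j ρ̂ij, from a hypothetical N × T matrix of unit-level residuals. This is a simplified illustration of the formula, not the xtcsd implementation:

```python
import numpy as np

def pesaran_cd(resid: np.ndarray) -> float:
    """Pesaran (2004) CD statistic from an N x T matrix of unit-level residuals."""
    n, t = resid.shape
    corr = np.corrcoef(resid)      # N x N pairwise correlations of the residuals
    iu = np.triu_indices(n, k=1)   # upper triangle: each pair (i, j) with i < j
    return np.sqrt(2.0 * t / (n * (n - 1))) * corr[iu].sum()

# Toy example: three units sharing an identical residual series, so every
# pairwise correlation is 1 and CD takes its maximum value sqrt(T*N*(N-1)/2).
e = np.tile(np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0, -0.5, 1.5]), (3, 1))
print(round(pesaran_cd(e), 4))  # -> 4.899 (= sqrt(24) for N=3, T=8)
```

Under cross-sectional independence CD is approximately standard normal, which is why the large values reported below lead to rejection.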
First, we have to install the Stata user-written routine, xtcsd (De Hoyos
and Sarafidis 2006).
. ssc install xtcsd
checking xtcsd consistency and verifying not already installed
all files already exist and are up to date.
We change our working directory and open our dataset.
. cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 8\Stata files"
. use "Unbalanced panel data - institutional.dta"
We use the xtdescribe (or the shortened version, xtdes) command to get
a sense of the distribution of observations per unit (i.e., institution) in the
panel dataset.
. xtdes
opeid5_new: 1004, 1005, ..., 31703 n = 203
endyear: 2004, 2005, ..., 2013 T = 10
Delta(endyear) = 1 year
Span(endyear) = 10 periods
(opeid5_new*endyear uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
8 8 9 9 10 10 10
Freq. Percent Cum. | Pattern
---------------------------+------------
95 46.80 46.80 | 1111111111
43 21.18 67.98 | 1.11111111
33 16.26 84.24 | 1.1.111111
7 3.45 87.68 | 111.111111
7 3.45 91.13 | 1111111.11
4 1.97 93.10 | 1.111.1111
4 1.97 95.07 | 111.1.1111
3 1.48 96.55 | 11111.1111
2 0.99 97.54 | 1.11111.11
5 2.46 100.00 | (other patterns)
---------------------------+------------
203 100.00 | XXXXXXXXXX
From the output above, we see clearly that this is a slightly unbalanced panel dataset, with observations per institution ranging from 8 to 10 years. Next, we “quietly” run our fixed-effects regression model using the within regression estimator (xtreg with the fe option).
. qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac
lnptfac , fe
We run the Pesaran test and the Friedman test.
. xtcsd, pesaran
Pesaran’s test of cross sectional independence = 82.069, Pr = 0.0000
. xtcsd, friedman
Friedman’s test of cross sectional independence = 293.510, Pr = 0.0000
The results of both tests show that the null of cross-sectional independence
is rejected (p < 0.001), which indicates cross-sectional dependence.
The Frees test is also conducted.
. xtcsd, frees
Frees’ test of cross sectional independence = 44.948
|--------------------------------------------------------|
Critical values from Frees’ Q distribution
alpha = 0.10 : 0.4892
170 8 Advanced Statistical Techniques: I
The Frees test statistic of 44.948 exceeds the critical values of Frees’ Q distribution at the conventional α levels, indicating rejection of the null of cross-sectional independence. So, based on all three tests, we can say with some degree of certainty that there is cross-sectional dependence.8
Using the common factor approach, Eberhardt (2011) extended the Stata
routine by De Hoyos and Sarafidis and developed a cross-sectional dependence
test (xtcd) that can be applied to variables in the pre-estimation rather than
the post-estimation stage. Below, we show how this test can be conducted
using a few variables from the same panel dataset. First, we download the
most recent version of xtcd (Eberhardt 2011).
. ssc install xtcd, replace
As we can see from the results above, the null hypotheses of cross-sectional
independence are rejected for all the variables. If we cannot or choose not
to include all of the variables at one time, we can test the residuals from
a regression model. Using the variables that we included in a fixed-effects
8 For more information on the use of these tests, see De Hoyos and Sarafidis (2006).
CD = 77.124
p-value = 0.000
From the results of the test we can see there is at least weak cross-sectional
dependence across all the variables and residuals from the fixed-effects
regression model. The output also shows the mean and mean absolute
correlation (ρ) between institutions.
8.9 Panel Regression Models That Take Cross-Sectional Dependency. . . 173
------------------------------------------------------------------------------
Log of state appropriations 0.040** 0.019* 0.019
We see that, compared to models 1 and 2, model 3 does not produce statistically significant beta coefficient estimates for the log of state appropriations and the log of FTE students. This suggests that when we do not take cross-sectional dependence into account, the regression models we fit to panel data may produce biased estimates of some beta coefficients.
8.10 Summary
8.11 Appendix
*create Fig. 8.2.1.1. Enrollment, Tuition, and Unemployment, Changes Over ///
Time (1970 to 2017)
twoway (line lnenpub2yr year, lcolor(black) lpattern(solid)) (line lntupub2yr year, ///
lcolor(black) lpattern(dash)) (line lnunemprate year, lcolor(black) lpattern(dot)), ///
xlabel(1970 (6) 2017, labsize(small)) ytitle(Logs) title("Trends in Enrollment in ///
2 YR, Tuition at 2 YR, and Unemployment Rates" "1970 to 2017", size(medium))
*regress the first-differenced log of enrollment on the first-differenced log of tuition ///
and unemployment
reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate
*create an autocorrelation function or correlogram of the residuals from the regression model
racplot
*generate the residuals from the model
predict residuals, resid
*DW test
estat dwatson
*alternative DW test
quietly: reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
estat durbinalt, force
*time series regression model with an AR term calibrated via the Prais-Winsten (P-W) estimator
prais D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
*estimate an ARMAX model with first-order (AR1) and second-order (AR2) ///
autoregressive terms
arima D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, ar(1 2 ) vce(robust)
*examine the residuals from the ARMAX model to see if there is any autocorrelation and ///
conduct a final test
predict residuals_ARMX12, resid
actest residuals_ARMX12 , lag(4) q0 rob
*fit an ARMAX model to the levels of the data rather than their first-differences using the ///
diffuse option, with no iterations shown (nolog)
arima lnenpub2yr lntupub2yr lnunemprate, ar(1 2 ) rob diffuse nolog
*To avoid “reverse causality”, regress enrollment on at least a one-year lag of tuition. ///
Include the lag operator (L1) in a re-calibrated ARMAX model and use data through 2017.
arima lnenpub2yr L1.lntupub2yr lnunemprate, ar(1 2 ) rob diff nolog
*Fit an ARIMA model to the same data, using slightly different Stata syntax, where ///
the arima (2 0 0) indicates the model should include a first-order (AR1) and ///
second-order (AR2) autoregressive term.
*panel unit root tests (PURTs); install xtpurt (to install in Stata, ///
type "search xtpurt, all", click on "st0519" and install) or type:
net install st0519, replace
xtpurt lnnetuit
xtpurt lnstapr
xtpurt lnfte
xtpurt lnpc_income
*get a sense of the distribution of observations per unit (i.e., institution) in the ///
panel dataset
xtdes
*run our fixed-effects regression model using the within regression ///
estimator (xtreg, with the fe option)
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac , fe
*Then we run the test on variables of interest from the same panel dataset.
xtcd lneg lntuition lnftfac lnptfac
*use xtcdf (Wursten 2017) to allow for a much faster estimation of the ///
Pesaran cross-sectional dependence test
*run fixed-effects regression model with D-K standard errors and 2 lags of the AR term
xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe lag(2)
*check for cross-sectional dependence in the residuals of the regression, including ///
year fixed-effects
qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)
predict xtscc_residuals_fe2y, resid
xtcdf xtscc_residuals_fe2y
*Use esttab command (with the label, p[(fmt)], and keep options as well as the ///
Estout varwidth option) to create a table of the stored regression results to ///
compare the estimated beta coefficients of variables of interest ///
across the three models
esttab, label keep(lnstatea lntuition lntotfteiarep lnftfac lnptfac) varwidth(30) beta(%8.3f)
*end
References
Ansley, C. F., & Kohn, R. (1985). Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions. Annals of Statistics, 13(4), 1286–1316. https://doi.org/10.1214/aos/1176349739
Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo
Evidence and an Application to Employment Equations. The Review of Economic
Studies, 58 (2), 277–297. https://doi.org/10.2307/2297968
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis; forecasting and control.
Holden-Day. http://www.gbv.de/dms/hbz/toc/ht000495926.pdf
Breusch, T. S. (1978). Testing for autocorrelation in dynamic linear models. Australian
Economic Papers, 17 (31), 334–355.
Cumby, R. E., & Huizinga, J. (1992). Testing the Autocorrelation Structure of Disturbances
in Ordinary Least Squares and Instrumental Variables Regressions. Econometrica,
60 (1), 185–195.
Davidson, R., & MacKinnon, J. G. (1993). Estimation and Inference in Econometrics (1
edition). Oxford University Press.
De Hoyos, R. E., & Sarafidis, V. (2006). Testing for cross-sectional dependence in panel-
data models. The Stata Journal, 6 (4), 482–496.
Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive
time series with a unit root. Journal of the American Statistical Association, 74 (366a),
427–431.
Driscoll, J. C., & Kraay, A. C. (1998). Consistent covariance matrix estimation with
spatially dependent panel data. Review of Economics and Statistics, 80 (4), 549–560.
Durbin, J., & Watson, G. S. (1950). Testing for serial correlation in least squares regression:
I. Biometrika, 37 (3/4), 409–428.
Eberhardt, M. (2011). XTCD: Stata module to investigate Variable/Residual Cross-Section
Dependence. https://econpapers.repec.org/software/bocbocode/s457237.htm
Elliott, G., Rothenberg, T. J., & Stock, J. H. (1996). Efficient Tests for an Autoregressive
Unit Root. Econometrica, 64 (4), 813–836.
Frees, E. W. (1995). Assessing cross-sectional correlation in panel data. Journal of
Econometrics, 69 (2), 393–414.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association, 32 (200), 675–701.
Godfrey, L. G. (1978). Testing against general autoregressive and moving average error
models when the regressors include lagged dependent variables. Econometrica: Journal
of the Econometric Society, 1293–1301.
Hamilton, J. D. (1994). Time Series Analysis (1 edition). Princeton University Press.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Herwartz, H., & Siedenburg, F. (2008). Homogenous panel unit root tests under cross sec-
tional dependence: Finite sample modifications and the wild bootstrap. Computational
Statistics & Data Analysis, 53 (1), 137–150. https://doi.org/10.1016/j.csda.2008.07.008
Herwartz, H., Maxand, S., Raters, F. H., & Walle, Y. M. (2018). Panel unit-root tests for heteroskedastic panels. The Stata Journal, 18(1), 184–196.
Hoechle, D. (2007). Robust Standard Errors for Panel Regressions with Cross-Sectional
Dependence. The Stata Journal: Promoting Communications on Statistics and Stata,
7 (3), 281–312. https://doi.org/10.1177/1536867X0700700301
Hoechle, D. (2018). XTSCC: Stata module to calculate robust standard errors for panels
with cross-sectional dependence. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s456787.html
Pesaran, M. H. (2004). General diagnostic tests for cross section dependence in panels.
Pesaran, M. H. (2015). Testing weak cross-sectional dependence in large panels. Econo-
metric Reviews, 34 (6–10), 1089–1117.
Toutkoushian, R. K., & Paulsen, M. B. (2016). Economics of Higher Education: Back-
ground, Concepts, and Applications (1st ed. 2016 edition). Springer.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press.
Wursten, J. (2017). XTCDF: Stata module to perform Pesaran’s CD-test for cross-sectional
dependence in panel context. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s458385.html
Chapter 9
Advanced Statistical
Techniques: II
9.1 Introduction
systematically test for those violations and then apply a specific technique
that may be used with macro panel data to address questions with regard to
short-run and long-run relationships between variables. The Stata commands
and syntax used to demonstrate the use of these advanced correlational
statistical techniques are included in an appendix at the end of the chapter.
panel datasets, one has to match data on the same state identification code,
such as the Federal Information Processing Standards (FIPS) code and same
year. Because of missing state or year data, the matching may yield a panel
dataset with a small number (N ) of states or short time (T ) period or both. A
likely outcome is T will be substantially smaller than the maximum number
(50) of states (N ).
With respect to state-level panel data for higher education, in only a very few cases will T begin to approach N. When T does approach N, however, we have a macro panel dataset and can begin to address a variety of empirical questions. These questions may include the following: (1) What are the average short-run and long-run relationships between important state-level policy variables? (2) Among individual states, what are the short-run and long-run relationships between important policy variables? (3) Given shocks to the long-run relationship between policy variables, or “equilibrium”, how long does it take states to adjust back to their “equilibrium”?
Using macro panel data and a heterogeneous coefficient regression (HCR) approach with DCCE and MG estimators, we can address all of the above questions. This approach allows for consistent estimates in the face of variables with nonstationary data (i.e., means and variances that do not remain constant over time) and takes into account cross-sectional dependence, or spillover effects between groups (e.g., states).
Pesaran (2006) laid the foundation for the use of panel regression models augmented with cross-sectional averages, the common correlated effects (CCE) estimation procedure, in which the averages approximate the common factors, or strong cross-sectional dependence. On the other hand, spatially correlated common shocks that are geographically based or result in spillover effects among specific regions are known as weak cross-sectional dependence (Chudik et al. 2011).
Pesaran (2006) developed the common correlated effects (CCE) estimator
as a technique to address cross-sectional dependence. Building on his work,
Chudik and Pesaran (2015) extended the CCE estimator and employed it
in a dynamic panel data modeling framework. Pesaran (2006) as well as Kapetanios et al. (2011) combined the CCE estimator with the mean group (MG) estimator, referred to as the CCEMG estimator. Kapetanios et al. (2011) contend that the CCEMG estimator is robust to variables composed of nonstationary data.
Patel (2019) used regression models with dynamic CCE (DCCE) estima-
tors to examine the short-run and long-run relationship between state-level
minimum wage and the number of employees businesses plan to hire. Employ-
ing state-level data, Liddle (2017) applied regression models with DCCE
estimators to examine factors related to energy consumption. Passamani and
9.2 The Context of Macro Panel Data and an Appropriate Statistical. . . 185
ui,t = γ′i ft + ei,t
yi,t = αi + λi yi,t−1 + βi xi,t + ui,t + Σ(l=0 to pT) θ′i,l z̄t−l + εi,t    (9.2)
where θi,l is a vector of coefficients; z̄t−l = (ȳt−l, x̄t−l) is the vector of cross-section means at time t − l; l is the lag index; and pT is the maximum number of cross-section lags. According to Chudik and Pesaran (2015), in a dynamic context where a lag of the dependent variable is an independent variable, the minimum number of cross-section lags should be the cube root of the total number of time periods, ∛T. Because the averages ȳt−1 and x̄t−1 serve solely to control for unobserved common factors between groups, the vector of θi,l coefficients in Eq. (9.2) has no meaningful interpretation.
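The cube-root rule for the number of cross-sectional lags is easy to compute. As a small Python check (not part of the Stata workflow), the T = 40 panel used later in this chapter implies three lags:

```python
import math

def dcce_cs_lags(T: int) -> int:
    """Minimum number of cross-section lags per Chudik and Pesaran (2015):
    the (floored) cube root of the number of time periods."""
    return math.floor(T ** (1 / 3))

print(dcce_cs_lags(40))  # -> 3, consistent with the cross-sectional lags used below
```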
Δyi,t = γ0i + αi (yi,t−1 − β2i xi,t) + Σ(l=1 to p) β1li Δyi,t−l + Σ(l=1 to p) β2li Δxi,t−l + Σ(l=0 to pT) θ′i,l z̄t−l + ui,t + εi,t    (9.3)
where the short-run relationships involve the terms with Δs, while the long-run relationship is represented by the expression in parentheses. The ECM framework also allows us to estimate how short-run changes adjust toward the long-run relationship between the variables, otherwise known as the “equilibrium” or “steady state”. The estimated error correction (EC) parameter indicates the extent to which disequilibrium is dissipated before the next time period. In general, when used with panel data, the underlying assumption of ECMs is that the short-run and long-run coefficients are the same across groups (e.g., states), that is, homogeneous.
Equation (9.3) allows us to estimate how a change in an independent variable (e.g., GSP) affects state appropriations both at impact (Δx → Δy) and in the long run through “disturbing” the equilibrium relationship within the parentheses. That disturbance to the equilibrium is “corrected” at a rate of −100α% per year, which is interpreted as the speed at which states adjust back to their long-run trend. It should be noted that the long-run trend may be either increasing or decreasing. Using Eq. (9.3), we can: (1) estimate both short-run and long-run coefficients for each state and then average them; (2) restrict the long-run coefficients to be the same across all states; or (3) assume all coefficients are homogeneous across all states. If we relax the assumption of homogeneous coefficients and estimate state-specific short-run and long-run coefficients, then we have to invoke the mean group (MG) estimator.
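The −100α% correction rate has a convenient numerical implication, illustrated here with a hypothetical EC parameter (not an estimate from this chapter): if α = −0.25, a quarter of any disequilibrium dissipates each year, and the half-life of a shock follows directly:

```python
import math

def ec_half_life(alpha: float) -> float:
    """Years until half of a deviation from the long-run equilibrium dissipates.

    Each year a fraction (1 + alpha) of the deviation remains, so the
    half-life h solves (1 + alpha)**h = 0.5.
    """
    return math.log(0.5) / math.log(1 + alpha)

# Hypothetical EC parameter: a quarter of the disequilibrium corrected per year.
print(round(ec_half_life(-0.25), 2))  # -> 2.41 years
```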
The mean group (MG) estimator was proposed by Pesaran and Smith (1995)
and developed by Pesaran et al. (1999). It involves calculating the mean
of the coefficients from separate regressions for each group (e.g., state) in
a panel dataset with long time series and a large number of groups. The
MG estimator requires that T is large enough to estimate an ordinary least
squares (OLS) regression model for each group (e.g., state). Consequently, the MG estimator is
9.3 Demonstration of HCR with DCCE and MG Estimators 187
π̂MG = (1/N) Σ(i=1 to N) π̂i    (9.4)
Equation (9.4) simply reflects the mean of the individual coefficients that
are estimated for each group.
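A minimal numerical sketch of Eq. (9.4) in Python, with invented data for two groups rather than the chapter's dataset: fit a separate OLS slope for each group, then average the slopes:

```python
import numpy as np

def mean_group_slope(groups: list[tuple[np.ndarray, np.ndarray]]) -> float:
    """Mean group estimate: the average of per-group OLS slopes (Eq. 9.4)."""
    slopes = [np.polyfit(x, y, 1)[0] for x, y in groups]  # slope per group
    return float(np.mean(slopes))

x = np.array([0.0, 1.0, 2.0, 3.0])
groups = [
    (x, 2.0 * x + 1.0),  # group 1: exact slope 2
    (x, 4.0 * x - 3.0),  # group 2: exact slope 4
]
print(round(mean_group_slope(groups), 6))  # -> 3.0
```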
The relaxation of the assumption of homogeneous coefficients warrants the use of a HCR model. More specifically, a statistical test of the homogeneity-of-coefficients assumption should be conducted (Pesaran et al. 2008). The results of this test will also be shown below.
Using 40 years of state-level higher education finance and other data across 50 states, we demonstrate how a HCR model with a DCCE estimator can be utilized to address the first question by producing short-run and long-run coefficients for selected variables of interest to higher education policy analysts. We also show how to utilize a HCR model with DCCE and MG estimators to address the second question, estimating short-run and long-run coefficients for individual states. So, now we demonstrate how we can answer all three questions:
1. What are the average short-run and long-run relationships between important state-level policy variables?
By looking at trends over time, we can check to see if data are stationary.
Figs. 9.1 and 9.2 show trends for log transformed state appropriations and
gross state product (GSP), respectively, by state. We can see there is an
upward trend in both series over time in all states. At least for these two
variables, the data appear to be nonstationary. What are the implications of
using nonstationary data in regression models? If we use an OLS regression
model (with state fixed-effects) to regress nonstationary state appropriations
data on nonstationary GSP data, we get an extremely high R2 (0.987).
. xtpurt lny1, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lny1
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 2.3444 0.9905
------------------------------------------------------------------------------
. xtpurt lnx1, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx1
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 3.6476 0.9999
------------------------------------------------------------------------------
. xtpurt lnx2, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx2
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 1.0687 0.8574
------------------------------------------------------------------------------
. xtpurt lnx4, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx4
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.9808 0.8366
------------------------------------------------------------------------------
. xtpurt lny1, test(dh)
Demetrescou and Hanck (2012) unit-root test for lny1
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 3.1686 0.9992
------------------------------------------------------------------------------
. xtpurt lnx1, test(dh)
Demetrescou and Hanck (2012) unit-root test for lnx1
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 4.0982 1.0000
------------------------------------------------------------------------------
. xtpurt lnx2, test(dh)
Demetrescou and Hanck (2012) unit-root test for lnx2
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
------------------------------------------------------------------------------
. xtpurt lnx4, test(hmw) trend
Herwartz et al. (2017) unit-root test for lnx4
-----------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 36
Constant: Included Prewhitening: BIC
Time trend: Included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hmw 3.0343 0.9988
------------------------------------------------------------------------------
(The unit-root tests were also run on the first-differenced data; that output is not shown.)
The test results above suggest that the levels of the variables are nonstationary across states. The first differences of the variables, however, are stationary. This indicates that each variable follows an integrated process of order one, I(1): nonstationary in levels but stationary after first-differencing.
------------------------------------------------------------------------------
. xtcointtest westerlund lny1 lnx1 lnx2 lnx4, demean
Westerlund test for cointegration
---------------------------------
Ho: No cointegration Number of panels = 50
Ha: Some panels are cointegrated Number of periods = 40
Cointegrating vector: Panel specific
Panel means: Included
Time trend: Not included
AR parameter: Panel specific
Cross-sectional means removed
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Variance ratio -1.9944 0.0231
------------------------------------------------------------------------------
The results of the cointegration test reveal that the variables in the panels (i.e., states) are cointegrated. We also conduct an ECM-based cointegration
test, developed by Westerlund (2007), which is robust to abrupt changes in
the estimated beta coefficients (i.e., structural breaks), serial correlation, and
heteroscedasticity.
. xtwest lny1 lnx1 lnx2 lnx4, constant lags(0 3)
Calculating Westerlund ECM panel cointegration tests..........
Results for H0: no cointegration
With 50 series and 3 covariates
Average AIC selected lag length: 1.02
Average AIC selected lead length: 0
-----------------------------------------------+
Statistic | Value | Z-value | P-value |
-----------+-----------+-----------+-----------|
Gt | -2.903 | -5.023 | 0.000 |
Ga | -14.634 | -3.686 | 0.000 |
Pt | -17.658 | -3.858 | 0.000 |
Pa | -11.786 | -4.665 | 0.000 |
-----------------------------------------------+
------------------------------------------------------------------------------
Variable | CD-test p-value average joint T | mean ρ mean abs(ρ)
----------------+--------------------------------------+----------------------
lny1 | 210.504 0.000 40.00 | 0.95 0.95
lnx1 | 216.82 0.000 40.00 | 0.98 0.98
lnx2 | 193.755 0.000 40.00 | 0.88 0.88
lnx4 | 217.801 0.000 40.00 | 0.98 0.98
------------------------------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.
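The CD statistic reported above is the scaled sum of all pairwise correlations between panels, CD = sqrt(2T/(N(N−1))) Σ_{i<j} ρ̂_ij, which is approximately N(0,1) under cross-sectional independence. A hedged Python sketch with simulated data (not the state panel used in this chapter) shows how a common factor inflates the statistic:

```python
import numpy as np

def pesaran_cd(panel):
    """Pesaran CD statistic for an (N x T) panel.

    CD = sqrt(2*T / (N*(N-1))) * sum over i<j of pairwise correlations.
    Under cross-sectional independence, CD is approximately N(0, 1).
    """
    N, T = panel.shape
    corr = np.corrcoef(panel)            # N x N matrix of pairwise correlations
    iu = np.triu_indices(N, k=1)         # the i < j pairs
    return np.sqrt(2.0 * T / (N * (N - 1))) * corr[iu].sum()

rng = np.random.default_rng(seed=7)
N, T = 50, 40                            # 50 states, 40 periods, as above

independent = rng.normal(size=(N, T))    # no cross-sectional dependence
factor = rng.normal(size=T)              # a common shock hitting every state
dependent = 0.9 * factor + 0.3 * rng.normal(size=(N, T))

cd_indep = pesaran_cd(independent)
cd_dep = pesaran_cd(dependent)
print(abs(cd_indep) < 5)   # near zero under independence
print(cd_dep > 50)         # enormous under strong common-factor dependence
```

The very large CD values in the table above (around 200) are therefore consistent with strong common factors operating across states.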
Together, the results of the tests warrant the use of a heterogeneous coef-
ficient regression (HCR) approach. Because one of the independent variables
is a lag of the dependent variable and we want to take into account common
unobserved factors across states, the dynamic common correlated estimation
(DCCE) estimator is applied. Because we also want to estimate state-specific
coefficients, the mean group (MG) estimator is applied. To address the question about both
the short-run and long-run dynamics existing between the dependent variable
(i.e., state appropriations) and selected independent variables, we utilize the
error correction modeling (ECM) framework as reflected in Eq. (9.3). The
ECM framework includes an autoregressive distributed lag (ARDL) structure of (1 1
1) and cross-sectional lags of (3 3 3 3). The ARDL structure enables us
to simultaneously estimate the short-run and long-run relationships among the
variables, while the cross-sectional lags allow us to take into account
cross-sectional dependence. (Three cross-sectional lags of the variables were
chosen to take into account cross-sectional dependence.)
---------------+---------------------------------------------------------------
Mean Group: |
LD.lny1| .0247934 .0480515 0.52 0.606 -.0693857 .1189726
LD.lnx1| -.0290142 .0436854 -0.66 0.507 -.114636 .0566075
LD.lnx2| .011547 .1082887 0.11 0.915 -.200695 .223789
LD.lnx4| .2466661 .1515284 1.63 0.104 -.0503242 .5436563
---------------+---------------------------------------------------------------
Long Run Est. |
---------------+---------------------------------------------------------------
Mean Group: |
ec| -.7831215 .0569085 -13.76 0.000 -.8946601 -.6715829
lnx1| -.555794 .240716 -2.31 0.021 -1.027589 -.0839993
lnx2| .5886828 .3961468 1.49 0.137 -.1877507 1.365116
lnx4| .4912524 .2131553 2.30 0.021 .0734756 .9090292
_cons| -4.048029 4.794766 -0.84 0.399 -13.4456 5.34954
-------------------------------------------------------------------------------
Mean Group Variables: LD.lny1 LD.lnx1 LD.lnx2 LD.lnx4 _cons
Cross-sectional Averaged Variables: lny1(3) lnx1(3) lnx2(3) lnx4(3)
Long Run Variables: ec lnx1 lnx2 lnx4 _cons
Cointegration variable(s): L.lny1
Estimation of Cross-Sectional Exponent (alpha)
--------------------------------------------------------------
variable| alpha Std. Err. [95% Conf. Interval]
---------------+----------------------------------------------
residuals| .1295795 .0096851 .1105971 .1485619
--------------------------------------------------------------
0.5 <= alpha < 1 implies strong cross sectional dependence.
Above, we see the estimated EC coefficient, which has the same value
and statistical significance as in the previous output, but it is labeled ec
and located under the mean group estimates of the long-run coefficients.
At the end of the output, we also see the estimated cross-sectional exponent
(alpha) of the model residuals. The results indicate there is no strong, but
perhaps “semi-weak,” cross-sectional dependence (Chudik et al. 2011). If we
want to see the estimates for the individual states, we include the option
showindividual.
. xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc cr(_all) cr_lags(1 3 3 3)
lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(ardl) exponent showin
(Dynamic) Common Correlated Effects Estimator - Mean Group (CS-ARDL)
In the interest of space, only the EC coefficients for the individual states are shown below.
[output cut]
L.lny1| -.788311 .0568437 -13.87 0.000 -.8997226 -.6768995
-------------------------------------------------------------------------------
Individual Results
-------------------------------------------------------------------------------
L.lny1_1| -.5837888 .2024887 -2.88 0.004 -.9806593 -.1869183
L.lny1_2| -.5674347 .2964714 -1.91 0.056 -1.148508 .0136386
L.lny1_3| -.7773012 .6831592 -1.14 0.255 -2.116269 .5616663
L.lny1_4| -1.037638 .223311 -4.65 0.000 -1.475319 -.5999563
L.lny1_5| -1.565602 .4164949 -3.76 0.000 -2.381917 -.749287
L.lny1_6| -.8035877 .2823022 -2.85 0.004 -1.35689 -.2502855
L.lny1_7| -1.39013 .380831 -3.65 0.000 -2.136545 -.6437151
L.lny1_8| -1.018001 .444861 -2.29 0.022 -1.889912 -.1460892
L.lny1_9| -1.422223 .3332329 -4.27 0.000 -2.075347 -.7690981
L.lny1_10| -.8597892 .2950313 -2.91 0.004 -1.43804 -.2815385
L.lny1_11| -.6425055 .1664852 -3.86 0.000 -.9688106 -.3162005
L.lny1_12| -.7691972 .4746639 -1.62 0.105 -1.699521 .1611269
L.lny1_13| -1.058413 .2581337 -4.10 0.000 -1.564346 -.5524802
L.lny1_14| -.3722795 .433068 -0.86 0.390 -1.221077 .4765182
L.lny1_15| -.4163244 .2378935 -1.75 0.080 -.8825871 .0499383
L.lny1_16| -.4805138 .5863073 -0.82 0.412 -1.629655 .6686274
L.lny1_17| -1.087561 .3537436 -3.07 0.002 -1.780886 -.3942362
L.lny1_18| -1.010681 .2113508 -4.78 0.000 -1.424921 -.5964413
[output cut]
Based on the output shown above, we can see that there is a substantial
amount of variability across states with respect to the estimated EC
coefficients. In most states, the EC coefficient is statistically significant.
In some states (EC coefficients between 0 and −1), appropriations partially
adjust to shocks in the long-run relationship with the other variables. In
other states (EC coefficients below −1), appropriations over-adjust to shocks
in the short run, overshooting the long-run equilibrium before converging.
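The contrast between the two adjustment patterns can be made concrete with a small simulation (a Python sketch, separate from the Stata workflow; the EC values are illustrative, not estimates). In an ECM, next period's disequilibrium gap is (1 + EC) times the current gap, so an EC coefficient between −1 and 0 shrinks the gap smoothly, while one between −2 and −1 overshoots the equilibrium each period but still converges because |1 + EC| < 1.

```python
import numpy as np

def adjust_path(ec, gap0=1.0, periods=12):
    """Path of the disequilibrium gap under error correction:
    gap_t = (1 + ec) * gap_{t-1}.

    ec in (-1, 0): partial adjustment, gap shrinks monotonically.
    ec in (-2, -1): over-adjustment, gap overshoots and oscillates,
    but still converges because |1 + ec| < 1.
    """
    gaps = [gap0]
    for _ in range(periods):
        gaps.append((1 + ec) * gaps[-1])
    return np.array(gaps)

partial = adjust_path(ec=-0.5)    # like states with EC between 0 and -1
overshoot = adjust_path(ec=-1.5)  # like states with EC below -1

print(bool(np.all(partial >= 0) and partial[-1] < 0.01))  # True: smooth decay
print(bool(overshoot[1] < 0 < overshoot[2]))              # True: sign flips
print(bool(abs(overshoot[-1]) < 0.01))                    # True: still converges
```

Only an EC coefficient at or below −2 (or at or above 0) would imply that the disequilibrium fails to correct; none of the state estimates above fall in that range.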
9.4 Summary
This chapter demonstrated the use of macro panel data and appropriate
statistical techniques to examine dynamic relationships between variables
when examining state-level higher education policy-oriented issues. The
macro panel data are composed of long time series (e.g., >20 years) across
many states. These statistical techniques include heterogeneous coefficient
regression (HCR) with dynamic common correlated estimation (DCCE) and
mean group (MG) estimators, which allow for distinguishing between
short-run and long-run relationships between variables.
9.5 Appendix
*Use the Stata routine xtpurt, with test options proposed by Herwartz and ///
Siedenburg (2008), Demetrescu and Hanck (2012), and ///
Herwartz et al. (2019). In the three test options, the null ///
hypothesis is that the panels (i.e., states) contain non-stationary data ///
or unit roots.
* xtpurt, with test options proposed by Herwartz, Maxand, and Walle (hmw)
xtpurt lny1, test(hmw) trend
xtpurt lnx1, test(hmw) trend
xtpurt lnx2, test(hmw) trend
xtpurt lnx4, test(hmw) trend
*Tests using Stata user-written routine xtcdf (Wursten 2017) for ///
cross-sectional independence, using updated version
ssc install xtcdf, replace
xtcdf lny1 lnx1 lnx2 lnx4
*If we want to see the estimates for the individual states, then we include the ///
option showindividual.
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, ///
reportc cr(_all) cr_lags(1 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) ///
lr_options(ardl) exponent showin
*end
References
Baltagi, B. (2008). Econometric analysis of panel data. John Wiley & Sons.
Blomquist, J., & Westerlund, J. (2013). Testing slope homogeneity in large panels with
serial correlation. Economics Letters, 121 (3), 374–378.
Cheslock, J. J., & Rios-Aguilar, C. (2011). Multilevel analysis in higher education research:
A multidisciplinary approach. In J. Smart & M. B. Paulsen (Eds.), Higher education:
Handbook of theory and research (Vol. 46, pp. 85–123). Springer.
Chudik, A., & Pesaran, M. H. (2015). Common correlated effects estimation of hetero-
geneous dynamic panel data models with weakly exogenous regressors. Journal of
Econometrics, 188 (2), 393–420.
Chudik, A., Pesaran, M. H., & Tosetti, E. (2011). Weak and strong cross-section
dependence and estimation of large panels. The Econometrics Journal, 14 (1), C45–
C90.
Demetrescu, M., & Hanck, C. (2012). A simple nonstationary-volatility robust panel unit
root test. Economics Letters, 117 (1), 10–13.
Ditzen, J. (2016). xtdcce: Estimating dynamic common correlated effects in Stata (SEEC
Discussion Paper No. 1601). Spatial Economics and Econometrics Centre.
Westerlund, J. (2005). New simple tests for panel cointegration. Econometric Reviews,
24 (3), 297–316.
Westerlund, J. (2007). Testing for error correction in panel data. Oxford Bulletin of
Economics and Statistics, 69 (6), 709–748.
Wursten, J. (2017). XTCDF: Stata module to perform Pesaran’s CD-test for cross-sectional
dependence in panel context. In Statistical Software Components. Boston College
Department of Economics.
Chapter 10
Presenting Analyses to Policymakers
10.1 Introduction
The analyses that were discussed and demonstrated in the previous chapters
range from simple descriptive statistics to advanced statistical techniques.
The consumers of the results of these analyses are varied and include, but
are not limited to, policymakers. Because many analysts target their work
toward policymakers, it is necessary to produce policymaker-friendly
presentations. Using some of the routines in Stata, this chapter demonstrates
how we can accomplish this critical part of higher education policy analysis
and evaluation. These routines, commands, and syntax are included in an
appendix at the end of the chapter.
The Stata user-written module asdoc (Shah 2019) is one of the most
comprehensive routines for creating presentation-ready tables in Microsoft
Word. For the most recent version of asdoc, in Stata, type:
net install asdoc, from(http://fintechprofessor.com) replace
To get a sense of the comprehensive nature of the asdoc module, type:
help asdoc
In this demonstration, we will use data (supplemented with state tax
revenue and personal income data) from the previous chapter. First, we
change our working directory to where we want to save our tables.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"
Then we invoke the sum command for the previously noted variables of interest:
state appropriations (y), net tuition revenue (x1), full-time equivalent
enrollment (x2), state total personal income (x3), gross state product (x4),
and state tax revenue (x5).
. sum y x1 x2 x3 x4 x5
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 1,900 9.73e+08 1.45e+09 1.81e+07 1.57e+10
x1 | 1,900 5.57e+08 7.38e+08 7900000 5.22e+09
x2 | 1,900 178357.1 213093.8 10530 1639923
x3 | 1,900 1.60e+11 2.28e+11 4.02e+09 2.26e+12
x4 | 1,900 1.87e+11 2.73e+11 4.40e+09 2.66e+12
-------------+---------------------------------------------------------
x5 | 1,900 1.70e+10 2.58e+10 4.60e+08 2.43e+11
10.2 Presenting Descriptive Statistics
With the exception of x2, the variables have values that are displayed in
scientific notation. Therefore, before we can create a presentation-ready
table, the data for y, x1, x3, x4, and x5 need to be rescaled to millions. We
can either create new variables that are rescaled by hand or utilize the Stata
user-written routine rescale to rescale the variables automatically. To
do the latter, type the following:
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace
To rescale y, x1, x3, x4, and x5 into millions, we use rescale with the
millions option.
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions
We then rerun the sum command and see the results below.
. sum y x1 x2 x3 x4 x5
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 1,900 973.1793 1452.862 18.1 15692.18
x1 | 1,900 556.8281 738.1183 7.9 5216.492
x2 | 1,900 178357.1 213093.8 10530 1639923
x3 | 1,900 159660.9 228314.1 4015.1 2263890
x4 | 1,900 187043.9 273491.1 4398.6 2657798
-------------+---------------------------------------------------------
x5 | 1,900 16974.89 25750.39 459.909 243082.1
Above, we see the values are no longer in scientific notation, but they
are still not in the form we typically present to policymakers and other users.
Each variable should also be normalized in a way that makes it comparable
across states and over time. For example, state appropriations (y) are divided
by either population or FTE enrollment. Net tuition revenue (x1) should be
divided by full-time equivalent (FTE) enrollment. State total personal income
(x3), gross state product (x4), and state tax revenue (x5) should be divided by
population. Tandberg and Griffith (2013) suggest that state appropriations
per capita is a measure of adequacy or effort which is easily understood by
policymakers and the general public. However, they also caution that the
measure is limited in that larger population states are not always higher
income states.
In most cases, users would like to see one or two statistics, the mean and
the median. Generally, we also do not want to include decimal places, but
do want to include commas (format(%9.0fc)). We use the Stata command
tabstat (for documentation, type help tabstat), with the options below, to
produce the following:
| mean p50
-------------+----------------------
y_pop | 201.0187 189.0372
x1fte | 4177.586 3479.234
x3_pop | 32641.16 31753.6
x4_pop | 38230.36 37069.8
x5_pop | 3474.925 3268.786
(note: file Table 10.1.doc not found)
Click to Open File: Table 10.1.doc
When we click to open Table 10.1, we see the formatted table. Next, to
compare Maryland with all other states, we create an indicator variable MD
and label it accordingly:
gen MD=0
lab var MD "Comparisons"
replace MD=1 if fips==24
label define MD1 1 Maryland 0 "All Other States"
label values MD MD1
It is also useful to create a categorical variable that reflects different time
periods. In this example, we create a variable decade and code and label it
accordingly.
gen decade=0
lab var decade "Decades"
replace decade=1 if fy>=1980 & fy<=1989
replace decade=2 if fy>=1990 & fy<=1999
replace decade=3 if fy>=2000 & fy<=2009
replace decade=4 if fy>=2010 & fy<=2018
label define decade1 1 "1980 to 1989" 2 "1990 to 1999" 3 "2000 to 2009" 4 "2010 to 2018"
label values decade decade1
We click on Table 10.2.doc, adjust the column widths, bold some of the
text and numbers, and see the resulting table.
Table 10.2 enables policymakers and other users to easily compare
Maryland’s state appropriations to all other states in the U.S. during different
time periods. Policy analysts can modify the Stata syntax code to create
different time periods and comparison groups in similar Microsoft Word
tables.
Policymakers and other users may also be interested in how key indicators
or metrics look across individual states. A table with statistics for even one
variable across 50 states, however, would not be aesthetically pleasing. The
information may be conveyed more effectively if key variables for a specific
time period are displayed by state in a choropleth map (i.e., a thematic map
in which areas are shaded based on the statistics connected to a variable).
This, however, requires a number of steps in Stata. These steps are shown
below.
1. Create a subdirectory “Map” in the current working directory.
2. Change the working directory to the Map subdirectory.
3. Install the Stata user-written map creation module, maptile (Stepner 2017):
maptile_install using "http://files.michaelstepner.com/geo_state.zip", replace
For more details on maptile, see michaelstepner.com/maptile%20slides%202015-03%20_handout.pdf.
10.3 Choropleth Maps
Fig. 10.1 Map of appropriations per capita by state for fiscal year 2017
Using the options in maptile, the maps can be displayed using different
titles, legends, and color schemes.
Fig. 10.2 Percent change in state appropriations per FTE enrollment between FY 2009
and FY 2017
10.4 Graphs
Graphs are also very useful tools to convey information to policymakers and
other users of data. They should, however, be simple and uncluttered. Among
the most informative graphs are simple line charts showing variables over
time. Using the Stata user-written module lgraph (Mak 2015), we can show
state appropriations per population for Maryland and the rest of the nation
over time (Fig. 10.3).
ssc install lgraph, replace
lgraph y_pop fy, nom by(MD) xlabel(1980(3)2018) bw title("State Appropriations
Per Population" "FY 1980-2018") ytitle(Dollars) legend(pos(12) col(2))
We should also show a state of interest compared to the rest of the states
within that state’s region or academic common market. In the following
example, we demonstrate how to create a graph with the appropriate labels
and titles showing Maryland state appropriations per FTE compared to other
states within the Southern Regional Education Board (SREB) (Fig. 10.4).
label define MDSREB1 0 "All Other SREB States" 1 "Maryland"
label values MDSREB MDSREB1
lgraph yfte fy if region_compact==1, nom by(MDSREB)
xlabel(1980(2)2018, labsize(vsmall)) bw title("State Appropriations
Per FTE" "FY 1980-2018") ytitle(Dollars) legend(pos(12) col(2))
Fig. 10.3 State appropriations per capita in Maryland and all other states, FY 1980–FY
2018
Fig. 10.4 State appropriations per FTE in Maryland and all other SREB states, FY
1980 to FY 2018
Drawing on the example from Chap. 7, we can create a graph that depicts when Colorado
enacted Senate Bill 189 (SB 04-189) to establish the College Opportunity
Fig. 10.5 Colorado net tuition revenue per FTE before and after SB 189 and all other
states
Fund (COF) program. We are able to see trends in Colorado’s net tuition
revenue before and after the enactment of SB 04-189, compared to net tuition
revenue in all other states during the same time period.
global y "netuit_fte"
lgraph $y year, by(T) stat(mean) xline(2005) xlabel(1990(2)2016,
labsize(small)) ylab(, nogrid) scheme(s2mono) bw title("Colorado's Net
Tuition Revenue Per FTE" "Before and After Colorado Senate Bill 189")
ytitle(Dollars) legend(pos(12) col(2))
While Fig. 10.5 shows how Colorado compares to all other states, another
graph could show the Western Interstate Commission for Higher Education
(WICHE) states as a control group (Fig. 10.6).
It is also useful to show graphs of regression results in a simple, clear way.
One of the easiest and most flexible ways to do this is to use the Stata
user-written module coefplot (Jann 2019a). (To download the most recent
version of coefplot in Stata, type ssc install coefplot, replace.) We demonstrate
policymakers and other users.
Fig. 10.6 Colorado net tuition revenue per FTE before and after SB 189 and all other
WICHE states
We start with the following question that state higher education policymakers
may want to ask: on average, what is the short-run relationship
between changes in net tuition revenue and state appropriations? How
an analyst answers this question may depend on the set of assumptions he or
she makes with regard to the statistical techniques that are employed. How
the results are presented, however, should be based on the audience. If the
audience includes policymakers and others who are less interested in the
statistical methods and the assumptions of those methods and more interested
in the results, then a simple graph may suffice. An analyst may choose to
employ pooled ordinary least squares (OLS) or a more advanced statistical
technique such as heterogeneous coefficient regression (HCR) with dynamic
common correlated estimation (DCCE) and mean group (MG) estimators.
However, she or he should display results, in a simple and clear manner
for policymakers and other users who may or may not be familiar with
the technique employed. To demonstrate this, we provide examples that are
based on regression models ranging from pooled OLS regression to HCR
with DCCE and MG estimators. In these examples, we use macro panel
data spanning 38 years across 50 states. In each of the examples, all the
variables are log transformed for easier interpretation of the results. Because
we are interested in the short-run relationship, the first-difference of net
tuition revenue (lnnetut) is regressed on the lagged first-differences of state
appropriations (lnstateap), full-time equivalent enrollment (lnfte), and state
personal income (lnperinc).
Figure 10.7 resembles a box chart, but it is actually a bar chart. It shows
the independent variables on the vertical axis and the change (scaled up by
10) in the dependent variable on the horizontal axis. The bars (which reflect
95% confidence intervals) that touch the zero line indicate the regression
coefficients of those particular independent variables are not significantly
different from zero. (The lines extending from each of the bars reflect a 99%
confidence interval). We can see from Fig. 10.7 that the bar representing
state appropriations touches the zero line. Therefore, we can easily show and
explain the results from the pooled OLS regression in the figure above.
2 Mata is a programming language. For a complete description of Mata, see the Mata Reference Manual.
Fig. 10.7 Pct. change in appropriations, FTE and personal income due to a Pct. change
in net tuition revenue
But what if we use the Stata user-written routine xtmg (Eberhardt 2013)
which allows us to relax the OLS assumptions of homogeneous coefficients and
cross-sectional independence when using panel data? We will invoke xtmg to
run the regression model using Common Correlated Effects and Mean Group
(CCEMG) estimators. (To install the most recent version of xtmg in Stata,
type ssc install xtmg, replace). The CCE estimator takes into account
cross-sectional dependence. The MG estimator produces state-specific model
beta coefficients, which are averaged across the panel.
. xtmg Dlnnetut LDlnstateap LDlnfte LDlnperinc, cce
Pesaran (2006) Common Correlated Effects Mean Group estimator
All coefficients present represent averages across groups (newid)
Coefficient averages computed as unweighted means
Mean Group type estimation Number of obs = 1,900
Group variable: newid Number of groups = 50
Obs per group:
min = 38
avg = 38.0
max = 38
Wald chi2(3) = 26.72
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------------
Dlnnetut | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------------+----------------------------------------------------------------
LDlnstateap | -.0097169 .0411761 -0.24 0.813 -.0904206 .0709869
LDlnfte | .1521267 .0761399 2.00 0.046 .0028952 .3013583
LDlnperinc | -.466823 .1204983 -3.87 0.000 -.7029952 -.2306507
__00000M_Dlnnetut | .9814087 .1471938 6.67 0.000 .6929142 1.269903
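The averaging step behind the mean group estimates reported above can be sketched in a few lines (a Python illustration with simulated data and coefficients, not the state panel used here): fit a separate regression for each group, then take the unweighted mean of the group-specific slopes, just as xtmg reports "coefficient averages computed as unweighted means."

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Heterogeneous-coefficient panel: each "state" i has its own slope b_i,
# drawn around a common average of 0.5. The mean group (MG) estimator
# fits one regression per state and averages the state-specific slopes.
N, T = 50, 38
true_slopes = rng.normal(loc=0.5, scale=0.2, size=N)

mg_slopes = []
for i in range(N):
    x = rng.normal(size=T)
    y = true_slopes[i] * x + rng.normal(scale=0.1, size=T)
    b_i = np.polyfit(x, y, deg=1)[0]     # state-specific OLS slope
    mg_slopes.append(b_i)

mg_estimate = float(np.mean(mg_slopes))  # unweighted mean across groups
print(abs(mg_estimate - 0.5) < 0.1)      # True: recovers the average slope
```

The CCE part of CCEMG additionally augments each state's regression with cross-sectional averages of the variables to absorb common factors; this sketch shows only the averaging of heterogeneous slopes.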
We repeat the Stata syntax to extract the estimated coefficients from the
matrix produced by the regression model with the CCEMG estimator:
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \ st_matrix("e(b)") :+ 2))
Then we reenter the coefplot syntax from above [not shown here]. The
result is the graph shown below.
We see from Fig. 10.9 that the results are the same with regard to the influence
of a short-run change in net tuition revenue (not statistically significant) as in
the pooled OLS model.
3 For a complete description and examples of the options for coefplot, see Jann, B. (2019,
May 28). coefplot—Plotting regression coefficients and other estimates in Stata. http://
repec.sowi.unibe.ch/stata/coefplot/getting-started.html.
10.5 Marginal Effects (with Continuous Variables) and Graphs
Fig. 10.8 Pct. change in appropriations, FTE and personal income due to a Pct. change in
net tuition revenue
Marginal effects and graphs are another way to present the results of
regression models to policymakers and other users. Combined with most
regression models that are composed of continuous variables, the Stata
commands margins and coefplot provide a way to carry this out. This
section will discuss and demonstrate the use of these very flexible commands
as a way to provide information to policymakers.
Marginal effects are the changes in the dependent variable due to changes
in a specific continuous independent variable, holding all other independent
variables constant. For a dependent variable (y) and a continuous independent
variable (x), the marginal effect is the change (Δ) in y for a small change
in x, or the partial derivative of y with respect to x.
Fig. 10.9 Pct. change in appropriations, FTE and personal income due to a Pct. change in
net tuition revenue
First, we change the working directory:
C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data
Then we open the relevant Stata file, based on IPEDS data.
. use "Example 10.dta", clear
Upon careful inspection of the data, we see that the dataset spans 46 states
and 13 years (2000–2012) with a one-year gap (2001 is missing). Because it
is preferable to have no yearly gaps in the data when we are including 1-year
lags of independent variables in our regression models, we drop the first
year (2000).
drop if year==2000
(46 observations deleted)
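The danger of yearly gaps when constructing 1-year lags can be illustrated outside Stata with a small pandas sketch (made-up numbers; the variable names are hypothetical): a naive within-panel shift pairs each observation with the previous row, not the previous year, so a missing year silently turns a "1-year lag" into a 2-year lag.

```python
import pandas as pd

# One state with 2001 missing, as in the text's data.
df = pd.DataFrame({
    "state": ["MD"] * 4,
    "year": [2000, 2002, 2003, 2004],
    "x": [10.0, 12.0, 13.0, 14.0],
})

# Naive row-based lag within each state: ignores the gap.
df["x_lag_naive"] = df.groupby("state")["x"].shift(1)

# Year-aware lag: match each row with the (year - 1) observation.
prev = df[["state", "year", "x"]].rename(columns={"x": "x_lag"})
prev["year"] = prev["year"] + 1
df = df.merge(prev, on=["state", "year"], how="left")

# For 2002 the naive lag wrongly uses 2000's value (a 2-year gap),
# while the year-aware lag is correctly missing.
row_2002 = df.loc[df["year"] == 2002].iloc[0]
print(row_2002["x_lag_naive"])      # 10.0
print(pd.isna(row_2002["x_lag"]))   # True
```

Dropping the year before the gap, as done above, leaves a consecutive run of years in which the two definitions of the lag coincide.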
Descriptive statistics [not shown] indicate the data are highly skewed.
Because prior testing [not shown] revealed serial correlation and
cross-sectional dependence among the variables we plan to use, we estimate a
pooled OLS regression model with Driscoll-Kraay (D-K) standard errors. Because
we want to avoid reverse causation, we lag the independent variables by 1 year
in the regression model.
. xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
Regression with Driscoll-Kraay standard errors Number of obs = 460
Method: Pooled OLS Number of groups = 46
Group variable (i): id F( 4, 9) = 184.89
maximum lag: 2 Prob > F = 0.0000
R-squared = 0.8083
Root MSE = 867.9764
-------------------------------------------------------------------------------------
| Drisc/Kraay
adminstaff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | 1.21e-06 1.24e-07 9.81 0.000 9.34e-07 1.49e-06
|
state_appro_adj |
L1. | 2.47e-07 1.27e-07 1.95 0.083 -3.99e-08 5.34e-07
|
fedrev_r |
L1. | -1.67e-07 1.62e-07 -1.03 0.330 -5.35e-07 2.00e-07
|
FTE_enroll |
L1. | .0025754 .0014926 1.73 0.119 -.000801 .0059518
|
_cons | 136.6033 74.6967 1.83 0.101 -32.37241 305.5789
-------------------------------------------------------------------------------------
Given the very large numbers, the average marginal effects (AMEs) are difficult
to interpret. So we should instead calculate elasticities, using the option eyex.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4)
could not calculate numerical derivatives -- discontinuous region with missing
values
encountered
r(459);
This clearly does not work! Why? The “average” elasticity cannot be calculated
for any of the independent variables. Instead, we should try to calculate
the elasticities of each of the variables at their average (mean) levels.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((mean) _all)
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 8.99e+08 (mean)
L.state_apj = 1.11e+09 (mean)
L.fedrev_r = 8.26e+08 (mean)
L.FTE_enroll = 200801.8 (mean)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .5804154 .0613005 9.47 0.000 .4602685 .7005622
|
state_appro_adj |
L1. | .1456065 .0764509 1.90 0.057 -.0042345 .2954474
|
fedrev_r |
L1. | -.0734286 .0703213 -1.04 0.296 -.2112558 .0643986
|
FTE_enroll |
L1. | .2748182 .1556239 1.77 0.077 -.0301991 .5798356
-------------------------------------------------------------------------------------
Because we know the data are highly skewed, we should also calculate
elasticities for the variables at the median rather than the mean to see if the
results are substantially different.
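Why the evaluation point matters can be seen with a small sketch (Python, with illustrative numbers only, not the IPEDS variables): for a linear model y = a + bx, the elasticity eyex at a point x0 is (dy/dx)(x0/y0) = b·x0/(a + b·x0), which is exactly what margins evaluates at the values supplied in at(). With skewed data, the mean sits above the median, so the two evaluation points yield different elasticities.

```python
import numpy as np

rng = np.random.default_rng(seed=11)

# Linear model y = a + b*x with hypothetical coefficients.
a, b = 100.0, 0.002
x = rng.lognormal(mean=10, sigma=1, size=1000)  # highly skewed regressor

def elasticity(x0):
    # eyex at x0 for a linear model: b * x0 / (a + b * x0)
    return b * x0 / (a + b * x0)

e_mean = elasticity(np.mean(x))
e_median = elasticity(np.median(x))

# Skew pushes the mean above the median, so the elasticity evaluated
# at the mean exceeds the one evaluated at the median.
print(e_mean > e_median)  # True
```

This mirrors the output above, where the elasticity of net tuition revenue is 0.58 at the mean but 0.54 at the median.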
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all)
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 6.07e+08 ( median)
L.state_apj = 7.86e+08 ( median)
L.fedrev_r = 5.95e+08 ( median)
L.FTE_enroll = 150336 ( median)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .5436902 .067622 8.04 0.000 .4111535 .6762268
|
state_appro_adj |
L1. | .1432976 .0776547 1.85 0.065 -.0089028 .2954979
|
fedrev_r |
L1. | -.0735113 .0701083 -1.05 0.294 -.210921 .0638984
|
FTE_enroll |
L1. | .2857197 .1582156 1.81 0.071 -.0243771 .5958164
-------------------------------------------------------------------------------------
We see that at the median, only the change in net tuition revenue has
an effect on the change in the number of administrators. The results suggest
that a 1% increase in net tuition revenue contributes to a 0.54% increase in
administrators at public colleges and universities. This is only slightly less
than the 0.58% increase at the mean of net tuition revenue.
Next, we should display these results in a graph similar to Fig. 10.9. To do so,
we save the marginal effects (in terms of elasticities) by including the option
post in the following syntax.
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all) post
We then modify the coefplot syntax to produce the graph with the
relevant titles.
coefplot, xline(0) keep(L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r
L.FTE_enroll) mlabel format(%9.2g) mlabposition(0) msymbol(i)
ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white)
lwidth(. medium)) levels(95 99) coeflabels(L.net_tuition_rev_adj
= "{bf:Net Tuition Revenue}" L.state_appro_adj = "State
Appropriations" L.fedrev_r = "Federal Revenue" L.FTE_enroll
= "FTE Enrollment") title("Percent Change in {bf:Administrators}
Due to a 1% Change in" "{bf:Net Tuition Revenue} (controlling for
other factors)", size(medium) margin(small) justification(center))
Fig. 10.10 Pct. change in administrators due to a Pct. change in net tuition revenue
To carry out steps 1 through 3, we enter a very long line of syntax that
produces the graph below.
coefplot (., keep(L.net_tuition_rev_adj) color(black))
(., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r)
color(gray)) (., keep(L.FTE_enroll) color(gray)), legend(on) xline(0)
nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8)
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}"
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal
Revenue" L.FTE_enroll = "FTE Enrollment", labsize(small)) title("Percent
Change in {bf:Administrators} Due to a 10% Change in" "{bf:Net Tuition
Revenue} (controlling for other factors)", size(medium) margin(small)
justification(center)) addplot(scatter @b @at, ms(i) mlabel(@b)
mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) rescale(10)
p2(nokey) p3(nokey) p1(label("Different from Zero")) p4(label("Ignore -
not different from zero")) ytitle(Percent) xtitle("At the Median",
size(small))
Fig. 10.11 Pct. change in administrators due to a Pct. change in net tuition revenue
Fig. 10.12 Pct. change in administrators due to a Pct. change in net tuition revenue
10.6 Marginal Effects and Word Tables
After rerunning the regression model, the margins syntax is then changed
to reflect elasticities at the 75th percentile of net tuition revenue and other
variables.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p75) _all) post
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 1.21e+09 (p75)
L.state_apj = 1.32e+09 (p75)
L.fedrev_r = 1.02e+09 (p75)
L.FTE_enroll = 232360.3 (p75)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .6224404 .0570452 10.91 0.000 .5106339 .7342469
|
state_appro_adj |
L1. | .1381845 .0706619 1.96 0.051 -.0003102 .2766792
|
fedrev_r |
L1. | -.0719685 .0693115 -1.04 0.299 -.2078166 .0638796
|
FTE_enroll |
L1. | .2534847 .1466091 1.73 0.084 -.0338638 .5408332
-------------------------------------------------------------------------------------
Fig. 10.13 Pct. change in administrators due to a Pct. change in net tuition revenue
Table 10.3 Percent change in administrators due to a 1% change in net tuition revenue,
controlling for other factors (state appropriations, federal revenue, and FTE enrollment)
                        25th percentile   Median      75th percentile
L.Net tuition revenue   0.427***          0.531***    0.590***
                        (0.079)           (0.078)     (0.076)
L.State appropriations  0.135             0.149       0.150
                        (0.073)           (0.080)     (0.079)
L.Federal revenue       -0.0195           -0.0246     -0.0258
                        (0.085)           (0.107)     (0.112)
L.FTE enrollment        0.193             0.211       0.202
                        (0.151)           (0.164)     (0.159)
Observations            322               322         322
Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001
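Tables like Table 10.3 can be assembled by storing each set of margins results and exporting them side by side. A minimal sketch using Jann's estout/esttab module (cited in the references) follows; it assumes the panel setup, the y variable, and the $x1-$x4 global macros defined earlier in the chapter, and the output filename is illustrative.

```stata
* Hedged sketch: margins at each percentile, stored and exported together.
* Assumes y, $x1-$x4, and the xtset panel structure from the chapter.
quietly xtscc y L1.$x1 L1.$x2 L1.$x3 L1.$x4
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p25) _all) post
estimates store p25
quietly xtscc y L1.$x1 L1.$x2 L1.$x3 L1.$x4
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all) post
estimates store p50
quietly xtscc y L1.$x1 L1.$x2 L1.$x3 L1.$x4
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p75) _all) post
estimates store p75
* export one Word-readable table with standard errors and stars
esttab p25 p50 p75 using "Table 10.3.rtf", se ///
    star(* 0.05 ** 0.01 *** 0.001) ///
    mtitles("25th percentile" "Median" "75th percentile") replace
```

Because margins, post replaces the estimation results in memory, the regression is rerun before each margins call.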
Marginal effects can also be used with categorical variables to answer a range
of policy questions. For example, suppose higher education policymakers
would like to know whether the relationship between administrators and net
tuition revenue differs by the extent to which higher education is regulated
by the state. In this example, we measure regulation by whether (Yes = 1) or
not (No = 0) a state has a higher education consolidated governing board (CGB).
The following steps are carried out to produce a graph of the marginal
effects by whether or not states have a consolidated governing board.
1. Shorthand notation and global macros are used to save keystrokes.
gen y = adminstaff
global x "L1.net_tuition_rev_adj L1.state_appro_adj L1.fedrev_r L1.FTE_enroll"
2. We “quietly” run a pooled OLS regression with D-K standard errors for
states with no consolidated governing board.
qui xtscc y $x if CGB==0
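The coefplot command that follows expects stored estimates named NoCGB and CGB. The intervening steps are not shown in the text; a plausible sketch, assuming the same elasticity-at-the-median margins specification used earlier in the chapter, is:

```stata
* Hedged sketch of the steps implied between the regression and coefplot.
* Assumes y, the $x macro, and the CGB indicator defined above.
qui xtscc y $x if CGB==0
margins, eyex(L1.net_tuition_rev_adj L1.state_appro_adj ///
    L1.fedrev_r L1.FTE_enroll) at((median) _all) post
estimates store NoCGB
qui xtscc y $x if CGB==1
margins, eyex(L1.net_tuition_rev_adj L1.state_appro_adj ///
    L1.fedrev_r L1.FTE_enroll) at((median) _all) post
estimates store CGB
```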
The graph, with appropriate labels and titles, is then created using the
following syntax.
coefplot NoCGB CGB, xline(0) format(%9.0f) rescale(10) recast(bar) ///
    barwidth(0.3) fcolor(*.5) ///
    coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
    L.state_appro_adj = "State Appropriations" ///
    L.fedrev_r = "Federal Revenues" ///
    L.FTE_enroll = "FTE Enrollment", labsize(small)) ///
    vertical p1(label("No CGB")) p4(label("CGB")) ytitle(Percent) ///
    ylabel(-4(2)10) ///
    title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
    "{bf:Net Tuition Revenue} (controlling for other factors)", ///
    size(medium) margin(small) justification(center))
Fig. 10.14 Pct. change in administrators due to a 10% change in net tuition revenue
(and other factors) by Consolidated Governing Board (CGB)
10.8 Summary
This chapter demonstrates how the Stata commands margins and coefplot can
be used to create graphs to show the results to policymakers and others who
may not be familiar with or interested in regression models.
10.9 Appendix
*Chapter 10 Syntax
*Use the Stata user-written module asdoc (Shah, 2019) to create ///
presentation-ready tables in Word
net install asdoc, from(http://fintechprofessor.com) replace
*open a dataset
use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data\Example 10.1.dta"
*We can either create rescaled versions of the original variables by ///
hand or use the Stata user-written routine rescale to rescale ///
them automatically
*install rescale
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace
*rescale y, x1, x3, x4, and x5 into millions using the millions option
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions
*combine asdoc and tabstat, using the abb(.) option and other options
asdoc tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) ///
    column(statistics) format(%9.0fc) dec(0) long ///
    title(Table 10.1 Descriptive Statistics) save(Table 10.1.doc) ///
    replace label abb(.)
*label variable
lab var decade "Decades"
replace decade =1 if fy>=1980 & fy<=1989
replace decade =2 if fy>=1990 & fy<=1999
replace decade =3 if fy>=2000 & fy<=2009
replace decade =4 if fy>=2010 & fy<=2018
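Because decade is a coded categorical variable, value labels make the Word table produced below self-explanatory. The labels here are an assumption, not part of the original syntax:

```stata
* Hedged sketch: attach value labels to the decade codes created above
* (label text is assumed, not from the original syntax)
label define decade1 1 "1980s" 2 "1990s" 3 "2000s" 4 "2010s"
label values decade decade1
```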
*create a Word table comparing Maryland to the rest of the nation
asdoc table decade MD, contents(mean y_pop) format(%9.0fc) dec(0) ///
title(Table 10.2 Average State Appropriations per Population) ///
save(Table 10.2.doc) replace label abb(.) replace
*Choropleth Maps
*change the working directory to the Maps sub-directory
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Maps"
*install the Stata user-written map creation module, maptile (Stepner, 2017)
maptile_install using "http://files.michaelstepner.com/geo_state.zip", replace
*run statastates to add U.S. state identifiers (abbreviation, FIPS code, ///
and name)
statastates, name(state)
*Create a choropleth map showing the values of one variable in one year ///
or change between two time periods, using the Stata user-written module ///
maptile (Stepner, 2017)
ssc install maptile, replace
*create a map - Fig. 10.1 Map of Appropriations per Capita by State for ///
Fiscal Year 2017
maptile y_pop if fy==2017, geo(state) geoid(statefips) nquantiles(5) ///
rangecolor(gray*0.075 gray*1.0) legd(0) ///
twopt(title("State Appropriations per Capita, 2017" "(in dollars)"))
*create a map - Fig. 10.2 Percent Change in State Appropriations per ///
FTE Enrollment Between FY 2009 & FY 2017
maptile pctchnge , geo(state) geoid(statefips) ///
    rangecolor(gray*0.01 gray*1.2) nq(7) legd(0) ///
    twopt(title("Percent Change in State Appropriations per FTE Enrollment" ///
    "Between FY 2009 & FY 2017"))
*create a graph with the appropriate labels and titles showing Maryland ///
state appropriations per FTE compared to other states within the Southern ///
Regional Education Board (SREB)
*create Fig. 10.4. State Appropriations per Capita in Maryland and All ///
Other SREB States, FY 1980 to FY 2018
lgraph yfte fy if region_compact==1 , nom by(MDSREB) xlabel(1980(2)2018, ///
    labsize(vsmall)) bw title("State Appropriations Per FTE" "FY 1980-2018") ///
    ytitle(Dollars) legend(pos(12) col(2))
*create a graph that depicts when Colorado enacted Senate Bill 189 ///
(SB 04-189) to establish the College Opportunity Fund (COF) program to see ///
trends in Colorado’s net tuition revenue before and after the enactment of ///
SB 04-189, compared to net tuition revenue in all other states during the ///
same time period.
*create Fig. 10.5. Colorado Net Tuition Revenue per FTE Before and ///
After SB 189 and All Other States
global y "netuit_fte"
lgraph $y fy if fy>1999, by(T) stat(mean) xline(2005) xlabel(2000(2)2016, ///
    labsize(small)) ylab(, nogrid) scheme(s2mono) bw ///
    title("Colorado's Net Tuition Revenue Per FTE" ///
    "Before and After Colorado Senate Bill 189") ytitle(Dollars) ///
    legend(pos(12) col(2))
*create Fig. 10.6. Colorado Net Tuition Revenue per FTE Before and ///
After SB 189 and All Other WICHE States
gen COWICHE = 0 if region_compact==2
replace COWICHE = 1 if fips==8
label define COWICHE1 0 "All Other WICHE States" 1 "Colorado"
label values COWICHE COWICHE1
global y "netuit_fte"
lgraph $y fy if region_compact==2 & fy>1999, nom by(COWICHE) ///
    stat(mean) xline(2005) ///
    xlabel(2000(2)2016, labsize(small)) ylab(, nogrid) scheme(s2mono) bw ///
    title("Colorado's Net Tuition Revenue Per FTE" ///
    "Before and After Colorado Senate Bill 189") ytitle(Dollars) ///
    legend(pos(12) col(2))
*we use Stata’s mata syntax to extract the estimated coefficients from the ///
matrix produced by the regression models
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \ st_matrix("e(b)") :+ 2))
*After slightly modifying the coefplot syntax provided by Jann, we create ///
a graph of the coefficients from the OLS regression results above.
*Fig. 10.7. Pct. Change in Appropriations, FTE and Personal Income due ///
to a Pct Change in Net Tuition Revenue
coefplot, xline(0) drop(_cons) mlabel format(%9.2g) mlabposition(0) ///
    msymbol(i) ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
    lwidth(. medium)) rescale(10) levels(95 99) ///
    coeflabels(LD.lnstateap = "State Appropriations" ///
    LD.lnfte = "FTE Enrollment" LD.lnperinc = "State Personal Income") ///
    ytitle(10 Percent Change in . . .) xtitle(Change in Net Tuition Revenue)
*We use the Stata user-written routine xtmg (Eberhardt, 2013) that allows ///
us to relax the OLS assumptions of homogeneous coefficients and ///
cross-sectional independence when using panel data
*we use the CCE estimator, which takes into account ///
cross-sectional dependence. The MG estimator produces ///
state-specific model beta coefficients, which are averaged across the panel
xtmg Dlnnetut LDlnstateap LDlnfte LDlnperinc, cce
*We repeat the Stata syntax to extract the estimated coefficients from ///
the matrix produced by the regression model with the CCEMG estimator
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \ st_matrix("e(b)") :+ 2))
*we modify the coefplot syntax to include the variables of interest from ///
the regression model with the CCEMG estimator. We also change the ///
orientation from horizontal to vertical, add titles, and bold the text ///
we want to draw attention to in the graph
*create Fig. 10.8. Pct. Change in Appropriations, FTE and Personal Income due ///
to a Pct Change in Net Tuition Revenue (controlling for other factors)
coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) ///
    mlabel format(%9.2g) mlabposition(0) msymbol(i) ///
    ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
    lwidth(. medium)) rescale(10) levels(95 99) ///
    coeflabels(LDlnstateap = "{bf:State Appropriations}" ///
    LDlnfte = "FTE Enrollment" ///
    LDlnperinc = "State Personal Income", ///
    labsize(medium)) ///
    vertical title("Short-Run Change in {bf:Net Tuition Revenue} Due to a 10% Change in" ///
    "{bf:State Appropriations} (controlling for other factors)", ///
    size(medium) margin(small) justification(center))
*create a graph from a HCR model with DCCE and MG estimators and ///
a first-order autoregressive distributed lag (ARDL) of each of the variables
qui xtdcce2 Dlnnetut L1.Dlnnetut LDlnstateap LDlnfte LDlnperinc, ///
reportc cr(_all) cr_lags(3 3 3 3) lr(L1.Dlnnetut LDlnstateap LDlnfte ///
LDlnperinc) lr_options(ardl)
*create Fig. 10.9. Pct. Change in Appropriations, FTE and Personal Income due ///
to a Pct Change in Net Tuition Revenue (controlling for other factors)
coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) ///
    mlabel format(%9.2g) mlabposition(0) msymbol(i) ///
    ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
    lwidth(. medium)) rescale(10) levels(95 99) ///
    coeflabels(LDlnstateap = "{bf:State Appropriations}" ///
    LDlnfte = "FTE Enrollment" ///
    LDlnperinc = "State Personal Income", ///
    labsize(medium)) ///
    vertical title("Short-Run Change in {bf:Net Tuition Revenue} Due to a 10% Change in" ///
    "{bf:State Appropriations} (controlling for other factors)", ///
    size(medium) margin(small) justification(center))
*we use a pooled OLS regression model with Driscoll-Kraay (D-K) standard ///
errors and independent variables lagged by one year
xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
*calculate elasticities for variables at the median rather than the mean ///
to see if the results are substantially different
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all)
*show the percent change (a rescaled to 10) on the vertical axis; show the ///
independent variables on the horizontal axis and; create custom legends ///
with regard to the significance of the independent variables
*create Fig. 10.11
coefplot (., keep(L.net_tuition_rev_adj) color(black)) ///
    (., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r) color(gray)) ///
    (., keep(L.FTE_enroll) color(gray)), legend(on) xline(0) ///
    nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8) ///
    coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
    L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" ///
    L.FTE_enroll = "FTE Enrollment", labsize(small)) ///
    title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
    "{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) ///
    margin(small) justification(center)) addplot(scatter @b @at, ms(i) ///
    mlabel(@b) mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) ///
    rescale(10) p2(nokey) p3(nokey) p1(label("Different from Zero")) ///
    p4(label("Ignore - not different from zero")) ytitle(Percent) ///
    xtitle("At the Median", size(small))
*change the working directory to where we would like to place a Word table
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"
*Fig. 10.14. Pct. Change in Administrators Due to 10% Change in Net ///
Tuition Revenue (and other factors) by Consolidated Governing Board (CGB)
coefplot NoCGB CGB, xline(0) format(%9.0f) rescale(10) recast(bar) ///
    barwidth(0.3) fcolor(*.5) ///
    coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
    L.state_appro_adj = "State Appropriations" ///
    L.fedrev_r = "Federal Revenues" ///
    L.FTE_enroll = "FTE Enrollment", labsize(small)) ///
    vertical p1(label("No CGB") color(gray)) ///
    p4(label("CGB") color(black)) ytitle(Percent) ylabel(-4(2)10) ///
    title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
    "{bf:Net Tuition Revenue} (controlling for other factors)", ///
    size(medium) margin(small) justification(center))
*end
References
Eberhardt, M. (2013). XTMG: Stata module to estimate panel time series models with
heterogeneous slopes. https://econpapers.repec.org/software/bocbocode/s457238.htm
Jann, B. (2014). Plotting regression coefficients and other estimates. The Stata Journal,
14 (4), 708–737.
Jann, B. (2019a). COEFPLOT: Stata module to plot regression coefficients and other
results. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s457686.html
Jann, B. (2019b). ESTOUT: Stata module to make regression tables. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s439301.html
Mak, T. (2015). LGRAPH: Stata module to draw line graphs with optional error bars. In
Statistical Software Components. Boston College Department of Economics. https://
ideas.repec.org/c/boc/bocode/s456849.html
Pisati, M. (2018). SPMAP: Stata module to visualize spatial data. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s456812.html
Schpero, W. L. (2018). STATASTATES: Stata module to add US state identifiers to
dataset. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s458205.html
Shah, A. (2019). ASDOC: Stata module to create high-quality tables in MS Word from
Stata output. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s458466.html
Stepner, M. (2017). MAPTILE: Stata module to map a variable. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s457986.html
Tandberg, D. A., & Griffith, C. (2013). State Support of Higher Education: Data, Measures,
Findings, and Directions for Future Research. In M. B. Paulsen (Ed.), Higher Education:
Handbook of Theory and Research (Vol. 28, pp. 613–685). Springer Netherlands.
https://doi.org/10.1007/978-94-007-5836-0_13
Index

H
Hausman test, 136, 138–140, 143
Heterogeneous coefficient regression, 183, 184, 199, 203, 217
Heteroscedasticity, 105, 117, 118, 120, 121, 139, 153, 154, 167, 173, 198, 199, 204
High School Longitudinal Study of 2009 (HSLS:09), 19
Histogram, 92, 93, 101, 102
Homoscedasticity, 117, 120, 142, 151

I
Interaction effect, 112
Interaction terms, 111–114, 141

L
Levene test, 120
Line charts, 213
Long-run, 183–186, 188, 189, 198, 199, 201–203

M
Marginal effect at the average (MEA), 222
Mean group, 183, 186, 188, 200–203, 217
Median, 10, 13, 80, 81, 85, 90, 93, 95, 209, 210, 222, 225–227, 230, 233, 234, 237, 239
Modified Dickey-Fuller test, 147
Moving-average parameter, 156
Moving average terms, 156
Multivariate regression, 103, 121, 122

O
Ordinary least squares (OLS), 1, 4, 103, 188, 217
Other Sources of National Data, 22

P
Partial autocorrelations, 150, 154, 163, 177
Pesaran cross-sectional dependence test, 172
Pesaran's test of cross-sectional independence, 169
Pooled (POLS) regression model with dummy variables, 127
Pooled OLS (POLS) regression, 108
Prais-Winsten (P-W) estimator, 153
Presentation-ready tables in Microsoft Word, 208

R
Random-effects regression, 4, 103, 134–136, 141, 163, 174
Regional Compacts, 23
Regression models with Driscoll and Kraay (D-K) standard errors, 173
Residual-versus-fitted plot, 116

S
Scatter plots, 79, 96, 98, 102
Short-run, 183–186, 188, 189, 199, 201–203, 216, 221
Skewed distributions, 93
Standard deviation, 80, 85, 88, 107, 108, 127

T
Test for assumption of homogeneous coefficients, 199
Test for weak cross-sectional dependence, 171, 201, 204
The College Board, 22
Time-invariant categorical variables, 91, 92
Time series regression model, 153, 177
Two-way tables, 88

W
Weighted least squares (WLS), 121
Westerlund (2005) test, 196
Within-group estimator, 122

Y
Year fixed-effects, 175, 178, 179