
Quantitative Methods in the Humanities and Social Sciences

Marvin Titus

Higher Education Policy Analysis Using Quantitative Techniques
Data, Methods and Presentation

Quantitative Methods in the Humanities and Social Sciences

Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich, Alyn Rockwood

Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster research-based conversation with all parts of the university campus – from buildings of ivy-covered stone to technologically savvy walls of glass. Scholarship from international researchers and the esteemed editorial board represents the far-reaching applications of computational analysis, statistical models, computer-based programs, and other quantitative methods. Methods are integrated in a dialogue that is sensitive to the broader context of humanistic study and social science research. Scholars, including among others historians, archaeologists, new media specialists, classicists, and linguists, promote this interdisciplinary approach. These texts teach new methodological approaches for contemporary research. Each volume exposes readers to a particular research method. Researchers and students then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative methods come to fruition.
More information about this series at http://www.springer.com/series/11748
Marvin Titus

Higher Education Policy Analysis Using Quantitative Techniques
Data, Methods and Presentation

Marvin Titus
Counseling, Special, and Higher Education
University of Maryland
College Park, MD, USA

ISSN 2199-0956    ISSN 2199-0964 (electronic)
Quantitative Methods in the Humanities and Social Sciences
ISBN 978-3-030-60830-9    ISBN 978-3-030-60831-6 (eBook)
https://doi.org/10.1007/978-3-030-60831-6

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgments

Many individuals encouraged and inspired me, over the past few years, to write this book. I am grateful to my colleagues at the University of Maryland, Alberto Cabrera and Sharon Fries-Britt, for their encouragement. I am also grateful to the students in my graduate course, which covered many of the topics introduced in this book; they deepened my understanding of those topics and inspired me to take on this project. I am particularly grateful to my former students Christie De Leon, MacGregor Obergfell, Matt Renn, and Liz Wasden, who reviewed the draft chapters of this book and provided valuable comments, edits, and suggestions for improvement.
I thank Ozan Jaquette, who graciously makes institution- and state-level
data available to me and other researchers. Some of these data are used in
many of the examples in this book.
I would also like to thank Springer Publishing for their support in
publishing this book and their patience as I wrote and revised the book.
I would also like to thank my academic department, college, and university
for a semester-long sabbatical, which I used to develop the book proposal.
Finally, I would like to thank my wife Beverly, who provided encouragement and support as I spent an enormous amount of time away from her working on the draft manuscript of this book. I owe a great deal of gratitude to her.

Contents

1 Introduction
  References
2 Asking the Right Policy Questions
  2.1 Introduction
  2.2 Asking the Right Policy Questions
    2.2.1 The What Questions
    2.2.2 The How Questions
    2.2.3 The How Questions and Quantitative Techniques
    2.2.4 So Many Answers and Not Enough Time
    2.2.5 Answers in Search of Questions
  2.3 Summary
  References
3 Identifying Data Sources
  3.1 Introduction
  3.2 International Data
  3.3 National Data
  3.4 State-Level Data
  3.5 Institution-Level Data
  3.6 Summary
  References
4 Creating Datasets and Managing Data
  4.1 Introduction
  4.2 Stata Dataset Creation
    4.2.1 Primary Data
    4.2.2 Secondary Data
  4.3 Summary
  4.4 Appendix
  References
5 Getting to Know Thy Data
  5.1 Introduction
  5.2 Getting to Know the Structure of Our Datasets
  5.3 Getting to Know Our Data
  5.4 Missing Data Analysis
    5.4.1 Missing Data—Missing Completely at Random
  5.5 Summary
  5.6 Appendix
  References
6 Using Descriptive Statistics and Graphs
  6.1 Introduction
  6.2 Descriptive Statistics
    6.2.1 Measures of Central Tendency
    6.2.2 Measures of Dispersion
    6.2.3 Distributions
  6.3 Graphs
    6.3.1 Graphs—EDA
  6.4 Conclusion
  6.5 Appendix
  Reference
7 Introduction to Intermediate Statistical Techniques
  7.1 Introduction
  7.2 Review of OLS Regression
    7.2.1 The Assumptions of OLS Regression
    7.2.2 Bivariate OLS Regression
    7.2.3 Multivariate OLS Regression
    7.2.4 Multivariate Pooled OLS Regression
  7.3 Weighted Least Squares and Feasible Generalized Least Squares Regression
  7.4 Fixed-Effects Regression
    7.4.1 Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression
    7.4.2 Estimating FEDV Multivariate POLS Regression Models
    7.4.3 Fixed-Effects Regression and Difference-in-Differences
  7.5 Random-Effects Regression
    7.5.1 Hausman Test
  7.6 Summary
  7.7 Appendix
  References
8 Advanced Statistical Techniques: I
  8.1 Introduction
  8.2 Time Series Data and Autocorrelation
  8.3 Testing for Autocorrelations
    8.3.1 Examples of Autocorrelation Tests—Time Series Data
  8.4 Time Series Regression Models with AR Terms
    8.4.1 Autocorrelation of the Residuals from the P-W Regression
  8.5 Summary of Time Series Data, Autocorrelation, and Regression
  8.6 Examples of Autocorrelation Tests—Panel Data
  8.7 Panel-Data Regression Models with AR Terms
  8.8 Cross-Sectional Dependence
    8.8.1 Cross-Sectional Dependence—Unobserved Common Factors
    8.8.2 Tests to Detect Cross-Sectional Dependence—Unobserved Common Factors
  8.9 Panel Regression Models That Take Cross-Sectional Dependency into Account
  8.10 Summary
  8.11 Appendix
  References
9 Advanced Statistical Techniques: II
  9.1 Introduction
  9.2 The Context of Macro Panel Data and an Appropriate Statistical Approach
    9.2.1 Heterogeneous Coefficient Regression
    9.2.2 Macro Panel Data
    9.2.3 Common Correlated Effects Estimators
    9.2.4 HCR with a DCCE Estimator
    9.2.5 Error Correction Model Framework
    9.2.6 Mean Group Estimator
  9.3 Demonstration of HCR with DCCE and MG Estimators
    9.3.1 Macroeconomic Panel Data
    9.3.2 Tests for Nonstationary Data
    9.3.3 Tests for Cointegration
    9.3.4 Tests for Cross-Sectional Independence
    9.3.5 Test of Homogeneous Coefficients
    9.3.6 Results of the HCR with DCCE and MG Estimators
  9.4 Summary
  9.5 Appendix
  References
10 Presenting Analyses to Policymakers
  10.1 Introduction
  10.2 Presenting Descriptive Statistics
    10.2.1 Descriptive Statistics in Microsoft Word Tables
  10.3 Choropleth Maps
  10.4 Graphs
    10.4.1 Graphs of Regression Results
  10.5 Marginal Effects (with Continuous Variables) and Graphs
    10.5.1 Marginal Effects (Elasticities) and Graphs
  10.6 Marginal Effects and Word Tables
  10.7 Marginal Effects (with Categorical Variables) and Graphs
  10.8 Summary
  10.9 Appendix
  References

Index
About the Author

Marvin Titus  His research focuses on the economics and finance of higher education and quantitative methods. While he has explored how institutional and state finance influences student retention and graduation, Dr. Titus' most recent work is centered on examining the determinants of institutional cost and productivity efficiency. He investigates how state higher education finance policies influence degree production. Through the use of a variety of econometric techniques, Dr. Titus is also exploring how state business cycles influence volatility in state funding of higher education. Named a TIAA Institute Fellow in 2018, Dr. Titus has published in top-tier research journals, including the Journal of Higher Education, Research in Higher Education, and Review of Higher Education. He is an associate editor of Higher Education: Handbook of Theory and Research and has served on the editorial boards of Research in Higher Education, Review of Higher Education, and the Journal of Education Finance. Dr. Titus also serves on several technical review panels for national surveys produced by the National Center for Education Statistics. To conduct his research utilizing national and customized state- and institution-level datasets, Dr. Titus uses several statistical software packages such as Stata, Limdep, and HLM. He earned a BA in economics and history from York College of the City University of New York, an MA in economics from the University of Wisconsin-Milwaukee, and a PhD in higher education policy, planning, and administration from the University of Maryland.
Chapter 1
Introduction

Keywords Introduction · Chapters

Why write a book about using quantitative techniques in higher education policy analysis? There are several reasons why I decided to write this book.
First, the idea for this book evolved out of a graduate-level course that I have been teaching over the past few years at the University of Maryland. In that course, I instruct students on how to conduct state-level higher education policy research that addresses such questions as how college enrollment rates across states are influenced by the economic and political context of state higher education policy, or how college completion rates across states are affected by state governance and the regulation of higher education. Based on their interests, students are instructed on how to design and manage panel datasets. Students are introduced to, discuss, and may draw from a variety of data sources. In the course, students are encouraged to think deeply about higher education policy research questions within the context of the concerns of policymakers and the broader public. This prompted me to think about how quantitative techniques in higher education policy research should be rigorous, relevant, and accessible to policymakers as well as the general public, but also forward-looking (hence, Chap. 9).
Second, the idea for this book emerged out of a realization that higher
education policy research involves the use of many different quantitative
techniques. These techniques mostly include descriptive statistics, ordinary
least squares (OLS) regression, panel data analysis (e.g., fixed-effects and
random-effects regression), and, most recently, difference-in-differences. Comprehensive discussions of some of these techniques have appeared in separate
volumes of Higher Education: Handbook of Theory and Research. What is missing in the higher education literature on quantitative methods is a comprehensive discussion and demonstration of some of the techniques that have recently been developed in other disciplines and fields. Discussions are also needed about when higher education policy analysts and researchers should apply a particular technique to address a given policy question. Additionally, as data become available over longer periods of time, some of these techniques become appropriate for use in higher education policy research. Hence, there is a need for a comprehensive discussion and demonstration of both the commonly used and the recently developed policy research techniques. A book provides a good venue for those discussions and demonstrations.
Third, there is a need to discuss and demonstrate how we should present the results of higher education policy research to policymakers and the general public. Much of what is done in higher education policy research remains in academic journals and technical reports. Many of those articles and reports present the results of regression models in ways that are not easily digestible for some policy analysts, policymakers, and other lay people. Additionally, there have been claims that a "disconnect" exists between higher education policy researchers and policymakers (Birnbaum 2000). These claims, however, are not unique to the field of higher education. In general, higher education policy analysts should be able to conduct research and draw useful information from it; in other words, there should be no divide. A study conducted several years ago found a divergence between the quantitative skills expected of individuals working as higher education policy analysts and the training that graduate students receive from faculty (Arellano and Martinez 2009). The study attributed this divergence to how quantitative techniques are taught to master's and doctoral students in higher education and public policy programs: on the one hand, master's students are "undereducated" with respect to the quantitative skills required to conduct policy analysis; on the other hand, doctoral students are "overeducated" with respect to the quantitative skills that policy analysts need. The most likely "truth" is that both higher education graduate students and policy analysts would benefit from a comprehensive reference that provides a unified approach to understanding the use of quantitative techniques. Such a reference should also provide guidance on presenting the results of quantitative policy research to a lay audience. In the form of a book, this reference introduces and demonstrates to both audiences the use of quantitative techniques and the presentation of results from those techniques to policymakers and the general public. This book, it is hoped, will help bridge the gaps between graduate students, practicing policy analysts, and policymakers.

This book will also touch on the subject of policy research questions. Higher education policy analysis is not only about asking the right questions; it is also about using the appropriate quantitative techniques to answer those questions. While acknowledging and touching on the former, this book focuses on the latter. Some books on higher education policy analysis show how to frame a research agenda (e.g., Hillman et al. 2015). A plethora of literature in a variety of journals addresses a wide range of higher education policy areas such as state funding, tuition, student financial aid, governance, accountability, and college completion. A smaller body of literature introduces higher education researchers to the use of specific quantitative techniques. As pointed out above, whole chapters in Higher Education: Handbook of Theory and Research have been devoted to a particular quantitative research method in higher education. However, to date, there is no comprehensive reference text that guides higher education policy analysts, researchers, and students through the research design necessary to answer important questions using quantitative techniques. Such a research design would include asking the "right" questions, identifying existing data sources or creating a customized dataset, and using the appropriate statistical techniques.
This book goes beyond providing guidance to higher education policy analysts with respect to research design. On the front end, it also covers the identification of data sources and the management and exploration of data. On the back end, the book introduces advanced quantitative techniques and demonstrates how to present research results to higher education policymakers and other lay people. Consequently, the book is organized in the following fashion. Chapter 2 discusses the questions that higher education policy analysts and researchers who use quantitative methods should ask, and may not be able to answer. These questions may involve the use of a variety of data and statistical techniques.
Chapter 3 introduces the reader to various secondary data sources that can
be used to answer policy or research questions or build custom datasets. This
chapter will provide an overview of easily accessible data for higher education
policy analysis across countries, U.S. states, institutions, and students. Most
of these data are publicly available but others are restricted and require a
license. In this book, only data from publicly available sources are accessed
and used in examples. Many higher education analysts and researchers have
used data from these publicly available sources to examine various policy-
related topics. It should be noted that this chapter does not provide an
exhaustive list of higher education data sources.
Chapter 4 shows how to create, organize, and manage analytic datasets that can be used to answer specific higher education policy questions. Through step-by-step instructions on how to build a custom dataset, this chapter shows how to import data into Stata datasets for analysis. Using examples, the organization and management of customized datasets are also demonstrated. This chapter discusses and demonstrates the use of Excel as well as Stata to create, organize, and manage datasets.
Chapter 5 discusses the importance of getting to “know thy data” even
before doing any kind of data analysis. Because many higher education policy
analysts and researchers import data from other sources, it is important to
“clean” and “prep” such data before use. Utilizing examples, this chapter
demonstrates how to address the nuances of imported data (e.g., missing
data, string variables) before they are analyzed.
Chapter 6 demonstrates the use of various descriptive statistical methods and graphs that can provide basic descriptive information to higher education policymakers and lay people. Building on the previous chapter, this chapter shows how exploratory data analysis (EDA) techniques can be used to present descriptive statistics in a way that enables policymakers and others to better understand the nature of the data used to inform higher education policy. In many ways, EDA is the most important part of higher education policy analysis. It precedes and determines the extent to which intermediate- or advanced-level analyses are needed. If such analyses are needed, EDA also provides guidance with respect to the specific quantitative techniques that should be employed.
Chapter 7 shows how intermediate-level statistical techniques can be used to answer higher education policy-oriented questions. In this chapter, statistical techniques that include ordinary least squares (OLS), fixed-effects, and random-effects regression models are introduced to address the "what" questions with respect to the relationship of policy variables to outcomes of interest to policymakers. This chapter also introduces the use of regression-based models that can be modified to infer causation and address the "what effect" questions with respect to the effects of the adoption of, or changes in, specific policies on policy outcomes. More specifically, Chap. 7 introduces difference-in-differences regression.
Chapter 8 introduces advanced statistical techniques to address violations of the assumptions of OLS regression. This chapter covers time series analysis and autocorrelation, including autoregressive–moving-average (ARMA and ARMAX) regression models, which are not yet widely used in higher education policy analysis but should be. Chapter 8 also introduces advanced statistical techniques that address cross-sectional dependence.
Chapter 9 introduces additional advanced statistical techniques that could be used to address higher education policy questions. This chapter demonstrates how these advanced techniques can take into account the complex nature of data that are increasingly becoming available to policy analysts. This is particularly the case with respect to the cross-sectional dependence inherent in geographically oriented units of analysis such as higher education institutions and jurisdictions such as states. A good part of Chap. 9 therefore addresses how to deal with cross-sectional dependence in panel data by using recently developed advanced statistical techniques. In this sense, Chap. 9 is more forward-looking with respect to the "state of the art" of quantitative techniques in higher education policy analysis and evaluation. Given the development of longer time series and larger panel datasets, the chapter lays out a set of methodological tools that policy analysts and researchers should use now and even more so in the future.
Chapter 10, the final chapter, demonstrates how to present the results
of policy research to policymakers and other lay people. This chapter
demonstrates how the results of descriptive statistics can be presented in
Word files and thematic maps. In Chap. 10, it is also shown how the
most relevant results from intermediate and advanced statistical techniques
can be presented in simple graphs. These graphs make the results of
sometimes complex analyses available to policymakers and the general public
in “pictures” rather than numbers and technical jargon.
Beginning in Chap. 4 and continuing throughout the remainder of the
book, Stata code and output are provided to demonstrate how we can conduct
the analyses being discussed. Rather than relying on Stata’s menus, I use
Stata code in interactive mode. This will enable readers to copy, paste,
and modify the code in text or ado files for future use in their own work. Beginning in
Chap. 4, an appendix is provided with the Stata code that was used in the
respective chapter.
This book does not comprehensively cover all quantitative techniques that
have or could be used in higher education policy analysis and research. I do
not discuss event history analysis (EHA), which has mainly been employed
to explain when a state higher education policy is adopted. Others (e.g.,
DesJardins 2003; Lacy 2015) have provided comprehensive descriptions and
demonstrations of the use of EHA in higher education.
With the exception of difference-in-differences (DiD) regression, quantitative
techniques that infer causation rather than correlation are not covered
in this book. More specifically, I do not cover instrumental variable (IV)
regression, synthetic control methods (SCM), and regression discontinuity
(RD). Bielby et al. (2013) provide a good discussion and demonstration
of IV regression, while McCall and Bielby (2012) present a comprehensive
exposition of how RD can be used in higher education policy research.
Because it has only recently been introduced in the higher education policy
literature, I will not discuss the use of SCM to evaluate higher education
policy. For those who are interested in how SCM has been applied to
examining policy outcomes in higher education, I refer them to the work of
Jaquette and associates (e.g., Jaquette et al. 2018; Jaquette and Curs 2015).
While I introduce and demonstrate the use of difference-in-differences (DiD)
regression, Furquim et al. (2020) provide a more comprehensive discussion
of how to apply that technique when conducting higher education policy
evaluation.
I do not cover spatial analysis and regression, quantitative techniques
that are emerging in and increasingly being applied to higher education
policy research. Several higher education scholars have begun to discuss
(e.g., Rios-Aguilar and Titus 2018) and apply (Fowles and Tandberg 2017)
spatial techniques to higher education policy analysis and
evaluation. Like quantitative techniques that infer causality, spatial analysis
in higher education policy evaluation is a topic for another book.
Beginning with Chap. 4, I provide an appendix with the Stata commands
and syntax that were used to demonstrate procedures and examples
throughout the chapter. The syntax is meant to provide a template for
modification specific to the reader’s data and statistical techniques rather
than a guide to programming in Stata. This approach to the use of Stata is
consistent with that of Acock (2018), who encourages readers to use help
commandname in an interactive mode to get more information about
specific Stata commands and routines.
Because of my research interests, most of the examples in this book involve
the use of higher education finance-oriented policy data. But the statistical
methods and techniques presented in this book can be applied to other
quantitative data used in other areas as well.

References

Acock, A. C. (2018). A Gentle Introduction to Stata (6th ed.). A Stata Press Publication,
StataCorp LLC.
Arellano, E. C., & Martinez, M. C. (2009). Does Educational Preparation Match
Professional Practice: The Case of Higher Education Policy Analysts. Innovative Higher
Education, 34 (2), 105–116. https://doi.org/10.1007/s10755-009-9097-0
Bielby, R. M., House, E., Flaster, A., & DesJardins, S. L. (2013). Instrumental variables:
Conceptual issues and an application considering high school course taking. In Higher
education: Handbook of theory and research (pp. 263–321). Springer.
Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The
Review of Higher Education, 23 (2), 119–132. https://doi.org/10.1353/rhe.2000.0002
DesJardins, S. L. (2003). Event history methods: Conceptual issues and an application to
student departure from college. In J. C. Smart (Ed.), Higher Education: Handbook of
Theory and Research (Vol. 18, pp. 421–471). Springer.
Fowles, J. T., & Tandberg, D. A. (2017). State Higher Education Spending: A Spatial
Econometric Perspective. American Behavioral Scientist, 61 (14), 1773–1798.
https://doi.org/10.1177/0002764217744835
Furquim, F., Corral, D., & Hillman, N. (2020). A Primer for Interpreting and Designing
Difference-in-Differences Studies in Higher Education Research. In L. W. Perna (Ed.),
Higher Education: Handbook of Theory and Research: Volume 35 (pp. 667–723).
Springer International Publishing. https://doi.org/10.1007/978-3-030-31365-4_5
Hillman, N. W., Tandberg, D. A., & Sponsler, B. A. (2015). Public Policy and Higher
Education: Strategies for Framing a Research Agenda. ASHE Higher Education Report,
41 (2), 1–98.
Jaquette, O., & Curs, B. R. (2015). Creating the Out-of-State University: Do Public
Universities Increase Nonresident Freshman Enrollment in Response to Declining
State Appropriations? Research in Higher Education, 56 (6), 535–565.
https://doi.org/10.1007/s11162-015-9362-2
Jaquette, O., Kramer, D. A., & Curs, B. R. (2018). Growing the Pie? The Effect
of Responsibility Center Management on Tuition Revenue. The Journal of Higher
Education, 89 (5), 637–676.
Lacy, T. A. (2015). Event history analysis: A primer for higher education researchers. In
M. Tight & J. Huisman (Eds.), Theory and Method in Higher Education Research (Vol.
1, pp. 71–91). Emerald Publishing Group.
McCall, B. P., & Bielby, R. M. (2012). Regression discontinuity design: Recent develop-
ments and a guide to practice for researchers in higher education. In Higher education:
Handbook of theory and research (pp. 249–290). Springer.
Rios-Aguilar, C., & Titus, M. A. (Eds.). (2018). Spatial Thinking and Analysis in Higher
Education Research: New Directions for Institutional Research, No. 180. Wiley.
https://onlinelibrary.wiley.com/toc/1536075x/2018/2018/180
Chapter 2
Asking the Right Policy Questions

Abstract This chapter discusses asking the right policy questions. It points
out how the nature of those questions and answers is shaped by the policy
context. With the most appropriate methodological tools, policy analysts
should be prepared to address follow-up questions. These include “what” and
“how” questions. The chapter discusses how academic researchers have to
simultaneously use rigorous methods and provide results of their research
that are of use to policymakers and the general public.

Keywords Policy questions · The why questions · The how questions

2.1 Introduction

This chapter discusses higher education policy analysis and evaluation with
respect to the nature of policy questions. The first part of the chapter discusses the
policy context within which the right policy question is addressed by policy
analysts. The next section provides a perspective on the “what” questions
that policymakers ask policy analysts to address. The chapter then discusses
the “how” question, followed by the next section that explains how academic
researchers may also provide answers in search of questions. The chapter ends
with some concluding remarks in the summary section.

© Springer Nature Switzerland AG 2021
M. Titus, Higher Education Policy Analysis Using Quantitative Techniques,
Quantitative Methods in the Humanities and Social Sciences,
https://doi.org/10.1007/978-3-030-60831-6_2

2.2 Asking the Right Policy Questions

Policy analysis involves asking the right questions and providing the
answers. But how does one determine what constitutes the right questions?
It is necessary to clearly identify the policy issue at hand, who is concerned
about the issue, how to frame questions about the issue, and the possibility
of providing the relevant answers. Identification of a policy issue in higher
education is not as straightforward as one may think. Take for example the
issue of college affordability. The context and focus of that same policy issue
differ depending on who is discussing it. In the popular press, college affordability may
be presented in terms of the increase in the price of college (i.e., tuition
and fees). Among higher education advocacy groups such as the Institute
for Higher Education Policy, college affordability may be discussed within
the context of the extent to which students from low-income families are
being priced out of the higher education market. Therefore, with respect to
identifying policy issues, the audience also matters. Even if the issue and
audience have been identified, policy research and the policy issue have to be
bridged (Ness 2010). With regard to an identified policy issue, the questions
that policy researchers and policymakers are asking may not be one and the
same. Moreover, the decisions of policymakers may not be linked to answers
to questions addressed by policy researchers. According to Ness (2010), a
direct application of policy research to the policymaking process is more closely
connected to the rational choice model. But a more realistic model of the
policymaking process is the “multiple streams” model (Kingdon 2011). Policy analysts who
operate under the assumptions of the “multiple streams” model of the policy
process produce research for multiple audiences such as academics, advocacy
groups, policymakers, the media, as well as the general public. Consequently,
research findings have to be clearly articulated or written for a wide audience
of users who may or may not influence the policy process or policymakers.
Given the variety of groups, a variety of questions and answers may
have to be posed and addressed. This is rather challenging for the policy
analyst who must be cognizant of her or his audience, the policy process,
as well as a variety of analytical techniques, modes of communicating
the results, and the possible implications for policy. Different questions
will require different methods and analytical techniques. In general, the
“why” questions usually require a qualitative research design. The “what”
and “how” questions generally necessitate a quantitative research design,
which includes continuous and categorical data, measures or variables, and
statistical techniques. But to answer the questions, the policy analyst or
researcher must choose the appropriate data and statistical techniques, which
depend on several factors.

2.2.1 The What Questions

In some cases, policymakers may want to know how an outcome or
phenomenon was related to a set of policy-oriented variables. For example, a
state higher education policymaker may inquire about the following:
1. When changes in resident undergraduate tuition occurred at public 4-year
state colleges and universities, what was the outcome with respect to changes
in resident undergraduate college enrollment at those institutions in the
state?
A similar question about specific groups could also be posed:
2. When changes in resident undergraduate tuition occurred at public 4-year
state colleges and universities, what was the outcome with respect to changes
in resident undergraduate college enrollment among low-income students
at those institutions in the state?
Some policymakers may even go further and expand on the question above
and ask the following:
3. When changes in resident undergraduate tuition at public 4-year state
colleges and universities and changes in state need-based aid occurred,
what was the outcome with respect to changes in resident undergraduate
college enrollment among low-income students at those institutions in the
state?
It may be prudent for policy analysts to anticipate the second and third
“what” questions. In some cases, a cascade of questions may follow an initial
“what” question. Consequently, the policy analyst must be prepared to answer
the “what” questions that have not been posed but may be coming.
The astute reader may have already noticed that the “what” questions
above are retrospective and relational in nature. But in many instances,
policymakers may want to have answers to “what if” questions. For example,
policymakers may want to know the following:
4. If resident undergraduate tuition increased (decreased) at public 4-year
state colleges and universities, then what would be the outcome with
respect to changes in resident undergraduate college enrollment at those
institutions in the state?
At first glance, this may appear to be a rather challenging question
to answer. But a skilled policy analyst could approach this question in
several ways. First, the question could be approached by observing the
history (i.e., time series) of resident undergraduate tuition at state colleges
and universities and resident undergraduate college enrollment at those
institutions in the state. Based on historical trends, the analyst could
then project the changes going forward. A second approach could be to
examine the relationship between resident undergraduate tuition at state
colleges/universities and resident undergraduate college enrollment at those
institutions across institutions or states in a particular year. Based on a
snapshot (i.e., cross-section) in time, the analyst would be able to determine if
a relationship exists and then make an assumption about the particular state
of interest. While the first approach involves an extrapolation over time, the
second approach involves an extrapolation across units of analysis or groups
(institutions or states). A third approach could include the use of data across
time and units of analysis. All three approaches involve a set of implicit
assumptions regarding the relationship among variables. Those assumptions
are originally embedded in the “what” questions. In our example above, the
policymaker has to implicitly assume or hypothesize there is a relationship
between enrollment in college and tuition price that should be tested. This
assumption or hypothesis is based on an underlying theory with regard to
the relationship between enrollment in college and tuition price.
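The first approach described above, projecting a historical trend forward, can be sketched in a few lines. This is an illustrative stand-in rather than the book's Stata code, and the enrollment figures are invented for the example:

```python
import numpy as np

# Hypothetical yearly series: resident undergraduate enrollment (in
# thousands) at a state's public 4-year institutions, 2010-2019.
years = np.arange(2010, 2020)
enrollment = np.array([100, 102, 103, 105, 104, 106, 108, 109, 111, 112],
                      dtype=float)

# Fit a linear time trend to the observed history...
slope, intercept = np.polyfit(years, enrollment, 1)

# ...and extrapolate that trend to project enrollment in a future year.
projected_2022 = intercept + slope * 2022
print(f"trend: {slope:.2f} per year; projected 2022: {projected_2022:.1f}")
```

Note that such a projection implicitly assumes the historical relationship continues unchanged, which is exactly the "all else held constant" assumption discussed below.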
In an effort to address the policy questions above, the policy analyst must
make an effort to test the implicit hypothesis regarding the relationship
between college enrollment and tuition price or more generally, the market
demand for higher education within a state or across states. This may seem
like a straightforward task. It is, however, quite complex and involves a set
of underlying questions. If the policy analyst chooses to answer the initial
question by looking at how resident undergraduate college enrollment has
changed over time, how does she or he know whether the trend reflects
changes in the demand for college, changes in the supply of college (e.g.,
physical capacity, admissions, etc.), or both?1 If the analyst ignores possible
changes in supply and focuses on changes in demand with respect to tuition
price, then she or he is implicitly assuming that “all else is held constant”.
But what if median family income or the traditional college-age population
(18- to 24-year-old) or the college wage premium (the difference between
the wages earned by college graduates and high school graduates) or the
“tastes” (based on expected social norms) for attending college changed over
time? Obviously, the analyst cannot possibly attribute a change in resident
undergraduate college enrollment solely due to a change in tuition price if
all of these other things are assumed to change as well. Therefore, she or
he must simplify reality by assuming the other variables did not change
or propose an alternative set of policy questions. Perhaps that alternative
set of policy questions could be centered on changes in college enrollment and
changes in affordability rather than tuition price.2 This set of questions would

1 For more discussion on this, see Toutkoushian, R. K., & Paulsen, M. B. (2016). Economics
of Higher Education: Background, Concepts, and Applications (1st ed.). Springer.
2 The issue of college affordability has increasingly received attention at the state and
national level. For example, see Miller, G., Alexander, F. K., Carruthers, G., Cooper, M.
A., Douglas, J. H., Fitzgerald, B. K., Gregoire, C., & McKeon, H. P. “Buck.” (2020). A
New Course for Higher Education. Bipartisan Policy Center. https://bipartisanpolicy.org/
wp-content/uploads/2020/01/WEB_BPC_Higher_Education_Report_RV8.pdf.

require data on affordability, which could be measured as a ratio of average
tuition price to income. But this would require some agreement with respect
to the “right” measure of tuition price. Should the policy analyst use the
sticker (before financial aid) tuition price or the net (after financial aid)
tuition price? It may also require agreement with regard to the “right”
measure of income. Should the analyst use average family income or median
family income? Even if there is agreement with regard to the use of college
affordability, the other variables mentioned above will still have to be
ignored or held constant. Ignoring the other variables overly simplifies
reality, while holding constant the other variables has implications for what
statistical techniques the analyst will use to answer the policy questions.
Additionally, while it may be the most useful information to the general
public, college affordability may not always be what higher education
policymakers can directly change. Why? College affordability is composed of
both tuition price (and possibly other prices related to attending college,
such as housing, meals, books, etc.) and family income. State-level higher
education policymakers may have varying control over tuition prices. For
example, in 38 states, tuition price setting was controlled by multicampus
or single-campus boards during 2012 (Zinth and Smith 2012). Clearly, state
higher education policymakers do not have control over changes (at least
not in the short run) in family income.
For the same reasons mentioned above, the third policy question may also
have to be re-stated in terms of affordability, but with a slight modification.
For example, one could ask:
5. When changes in state need-based financial aid and resident undergraduate
tuition occurred at public 4-year colleges, what changes in resident
undergraduate enrollment occurred at those same institutions?
If the policymaker is interested in changes in both tuition and state need-
based aid as well as their implicit influence on changes in enrollment, then
the answer to question 5 becomes a bit more nuanced. Question 5 generates
three analytical questions:
5(a) What is the relationship between changes in resident undergraduate
enrollment and changes in resident undergraduate tuition at public 4-year
colleges and universities in the state?
5(b) What is the relationship between changes in resident undergraduate
enrollment and changes in state need-based financial aid for undergraduate
students at public 4-year colleges and universities in the state?
5(c) How do changes in state need-based financial aid for undergraduate
students at public 4-year colleges and universities in the state condition
(influence) the relationship between changes in resident undergraduate
enrollment and changes in tuition at public 4-year colleges and universities
in the state?
In this example, it is not clear whether questions 5(a) and 5(b) can be addressed
without addressing question 5(c). It is quite possible that the relationship
between changes in resident undergraduate enrollment and changes in tuition
at public 4-year colleges and universities in the state can only be discerned
by observing changes in state need-based financial aid for undergraduate
students at public 4-year colleges and universities in the state. Therefore, a
“how” question may actually be embedded within a “what” question.
It is also possible that a particular type of “what” policy question also
requires qualitative data and techniques or a mixed methods approach to
addressing the question (Creswell and Creswell 2018). For example, a
policymaker may ask whether there are differences in the interpretation of
articulation policy between high school administrators and college administrators
and, if such differences exist, “what” those differences mean for students in terms
of their enrollment in college courses. To address the first part of this question, the
analyst will have to interview high school and college administrators. To
answer the second part of the question, the analyst will have to examine
student enrollment in college courses.

2.2.2 The How Questions

Many higher education policy inquiries are “how” questions. A state policy-
maker may inquire how a particular policy may have affected a particular
outcome or output. For example, policymakers in Maryland may want to
know how the adoption of a state-wide policy on articulation has affected
transfer rates from community colleges to 4-year institutions in Maryland.
Using quantitative techniques, the policy analyst can approach this question
in several different ways. First, he or she may want to answer this question
from the perspective of Maryland’s transfer rates before and after the
adoption of a state-wide policy on articulation, without comparison to
other states that have articulation policies. This is probably the easiest but
not necessarily the best way to answer this question. The second way to
answer this question is to compare Maryland’s transfer rates before and
after the adoption of a state-wide policy on articulation, with comparison
to comparable states that have no articulation policy. This approach to
answering the question involves collecting data on comparable states. But
this prompts the analyst to ask the following set of questions:
What states are considered to be comparable to Maryland?
Are only border states comparable to Maryland?
Are states in the same regional compact, the Southern Regional Education
Board (SREB), comparable to Maryland?
Are states with characteristics similar to Maryland comparable to Maryland?
The answers to these more analytical questions follow the original policy-
oriented question and have implications for the data used and quantitative
techniques employed.
But quantitative techniques may not be appropriate for some “how”
questions that policymakers may ask. For example, a policymaker may
ask how a particular state-wide higher education policy (e.g., dual
enrollment) is being implemented across a state. The answer to this inquiry
would require interviews with different stakeholders (e.g., high school and
college administrators) across the state. Therefore, to address that particular
type of “how” question, it is necessary for an analyst, who does not have
qualitative or interviewing skills, to consult with her or his colleagues who
do possess those skills.

2.2.3 The How Questions and Quantitative Techniques

With respect to “how” questions, descriptive statistics of data on current
patterns or past trends of policy indicators or variables may be necessary
but are certainly not sufficient to show relationships or test hypotheses. On
the other hand, most regression models used in higher education policy
research are correlational and used to examine the relationships among
variables. The relationships may be between variables that policymakers can
influence (e.g., undergraduate resident tuition at public 4-year colleges and
universities) and variables (e.g., the enrollment of undergraduate resident
students in public 4-year colleges and universities) that are implicitly or
explicitly theorized to be influenced by the actions of policy. The regression
models are used to take into account other observed variables (e.g., state
median family income, traditional college-age population, etc.) or unobserved
variables (state culture or habitus with regard to college enrollment) that
may be related to the outcome of policy action. Policymakers, however,
may not be able to “control” those other related variables. Consequently,
policy analysts should include “control” variables and take into account
“unobservable” factors. Most regression models are used in higher education
policy research to examine relationships among variables. They involve asking
the “what” question. More specifically, the questions being asked are: what
policy-oriented variables (controlling for other variables) are important with
respect to the policy outcome?
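The logic of "taking into account other observed variables" can be illustrated with a small simulation. The book demonstrates such models in Stata; the sketch below is a Python stand-in with invented numbers, in which enrollment is generated from tuition plus two control variables, and an OLS fit recovers the tuition coefficient while holding the controls constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical state-year observations

# Invented policy-oriented variables (all units hypothetical).
tuition = rng.uniform(5, 15, n)   # resident tuition, $ thousands
income = rng.uniform(40, 90, n)   # state median family income, $ thousands
pop = rng.uniform(100, 900, n)    # college-age population, thousands

# Assumed data-generating process: enrollment falls with tuition and
# rises with income and the size of the college-age cohort.
enrollment = (50 - 2.0 * tuition + 0.5 * income + 0.1 * pop
              + rng.normal(0, 1.0, n))

# OLS with control variables: including income and population in the
# design matrix "holds all else constant" in the model, so the tuition
# coefficient is not contaminated by the controls' influence.
X = np.column_stack([np.ones(n), tuition, income, pop])
beta, *_ = np.linalg.lstsq(X, enrollment, rcond=None)
print(f"tuition coefficient: {beta[1]:.2f} (true value: -2.0)")
```

If tuition were correlated with an omitted control, the tuition coefficient would absorb that variable's influence, which is why the choice of controls matters for the policy interpretation.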
When using most quantitative techniques, the answers do not prove
cause. Therefore, it is very important to use the appropriate language
when presenting the results. The results from regression models such as
instrumental variable (IV), difference-in-differences (DiD), and regression
discontinuity (RD) support stronger causal inferences than results from ordinary least squares
(OLS), fixed-effects (FE), and random-effects (RE) regression models. Other
quantitative techniques that suggest a causal inference include synthetic
control methods (SCM), a recently developed technique. An experimental
research (ER) design or “scientific” method is utilized to establish the cause-
effect relationship among a group of variables, with a random assignment to
treatment and control groups. While it is considered the “gold standard” of
research design where the researcher can manipulate the policy intervention
or “treatment”, in most instances, ER cannot be used to conduct policy
analysis or evaluation, due to legal, ethical, or practical reasons. Therefore,
the vast majority of analyses of higher education policy are conducted using
either descriptive statistics or correlational methods such as OLS, FE, and
RE or quasi-experimental methods such as IV, DiD, RD regression, or SCM.
The nature of the policy research question and data should determine the
most appropriate method to be utilized by the analyst. For example, if the
question is referring to the incidence of the adoption of a state policy (e.g.,
free tuition for community college students) across the United States by year,
the use of descriptive statistics or exploratory data analysis (EDA) may be all
that is needed. If the question is about the relationship between an outcome
(e.g., the enrollment of full-time students in community colleges within a
state) and a state higher education policy (e.g., free tuition at community
colleges), an ordinary least squares (OLS) regression model may be more
appropriate.3 If few states (e.g., 20) have implemented similar free tuition
policies among all 50 states and across many (e.g.,10) years, then a fixed-
effects regression model may be the most appropriate technique to address
the question in terms of the “average” influence of such policies.4 If complete
data are available in only a subset of states, then a random-effects regression
model probably should be employed.5 Finally, if the question is referring
to how the adoption of a particular policy in a particular state affected an
outcome in that state (compared to similar states without no such policy),
then a difference-in-difference (DiD) regression may be the most appropriate
method.6
If one chooses to address the question with respect to the effect of the
policy in a specific state (e.g., Tennessee) or group of states (e.g., Tennessee
and Maryland) compared to states that did not adopt the policy, and has access
to data for only a few comparable states (e.g., members of the Southern
Regional Education Board—SREB) and a few years, then DiD regression or
SCM may be the method of choice.

3 OLS regression is discussed in Chap. 7.
4 Fixed-effects regression is presented in Chap. 7.
5 Random-effects regression is explained in Chap. 7.
6 DiD regression is discussed in Chap. 7.
If the analyst is aware of the assumptions of DiD regression and chooses to
relax them, the SCM may be the preferred method. In addition to the nature
of the question and available data, the skill level of the analyst may also
determine the method that is actually used to address the question. In many
instances, the higher education analyst or researcher makes a judgment with
regard to what method is used to answer a question, based on the set of tools
that are in her or his “toolbox”. Therefore, it is important that policy analysts
and researchers have a full set of “tools” in their “toolboxes” to answer
different questions in different ways for different audiences.
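To make the difference-in-differences case above concrete, the estimator can be sketched with a small simulation. This is my Python illustration, not the book's Stata demonstration (which appears in Chap. 7), and the data are invented: "treated" states adopt a policy, comparison states do not, and both share a common time trend:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400  # hypothetical state-year observations

treated = rng.integers(0, 2, n)  # 1 = state adopted the policy
post = rng.integers(0, 2, n)     # 1 = observed after adoption

# Assumed data-generating process with parallel trends: groups differ
# in level (5), share a common time shift (2), and the policy adds 3.
y = (20 + 5 * treated + 2 * post + 3.0 * treated * post
     + rng.normal(0, 1.0, n))

# DiD as a regression: the coefficient on the interaction term
# (treated x post) is the estimate of the policy effect.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalently, the "difference of differences" of the four cell means.
d_treated = (y[(treated == 1) & (post == 1)].mean()
             - y[(treated == 1) & (post == 0)].mean())
d_control = (y[(treated == 0) & (post == 1)].mean()
             - y[(treated == 0) & (post == 0)].mean())
print(f"DiD estimate: {beta[3]:.2f} (true effect: 3.0)")
```

The parallel-trends assumption that this simulation builds in by construction is exactly what must be defended in real applications (see Furquim et al. 2020).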
In many ways, the presentation of the analyses is one of the most important
aspects of policy analysis and evaluation. Whether conducting EDA or more
advanced statistical techniques like DiD regression or SCM, tables
and graphs should be employed to clearly convey the results of the
analyses. This is particularly pertinent when presenting the results of an
analysis to policymakers, who may have a limited amount of time available
to consume the information. Therefore, it is imperative that the results of
analyses or evaluation be presented in such a way that clearly and succinctly
highlights key points.

2.2.4 So Many Answers and Not Enough Time

When responding to questions from policymakers, policy analysts run the
risk of providing “many answers” that distract from the main answer. As
discussed above, there may be several reasons for providing answers. But
policy analysts have to be careful not to lose the audience by providing too
many answers to the analytical questions that may arise during the analysis
or evaluation of policy. Many of the analytical questions posed by the analyst
may be interesting from a methodological perspective but of less importance
to the world of policymakers. In the interest of time (and space), it may
be prudent for policy analysts to provide answers to only the main policy-
oriented questions. For many policymakers, time is of the essence with regard
to receiving answers. This does not mean that secondary analytical-related
answers should never be provided to policymakers and the general public. It
may be possible to include those questions and answers in appendices or
supplementary reports.

2.2.5 Answers in Search of Questions

Many policy analysts, particularly academic researchers, provide answers that
are in search of questions. These answers may be very important to the
analyst and possibly others, particularly academic researchers, with regard
to philosophical or theoretical questions. As Birnbaum (2000) asserts, it takes
time for academic researchers to develop theories. So, we cannot expect
academic researchers who are interested in higher education policy and theory
to simply abandon the latter in favor of the former. Birnbaum also claims that
policymakers may eventually make use of policy research that later enters the
policy world. Hence when academic researchers publish the results of their
policy analyses or evaluations, they should do so with a mixed audience in
mind. This requires providing rigorous research for the academic community
and language, free of technical jargon, that informs policy discussions. In
many ways, this is more challenging to achieve than presenting the results of
a policy analysis or evaluation to policymakers or the general public. Those
who are able to strike this balance are the most successful in influencing the
world of policymakers over the long run.

2.3 Summary

This chapter discussed asking and answering higher education policy
questions. It was pointed out how the nature of those questions and answers
is shaped by their context. These questions are not always straightforward
and may lead to additional questions by policymakers. Policy analysts should
be prepared to address follow-up questions. This chapter also discussed the
nature of policy inquiries, which may include “what” questions or “how”
questions or both. Policy analysts have to choose the appropriate methods to
address these questions. The chapter ended with a discussion of how academic
researchers may have to simultaneously use rigorous methods and present
results of their research that are of use to policymakers and the general public.

References

Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The
Review of Higher Education, 23 (2), 119–132.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and
mixed methods approaches (5th ed.). Sage Publications.
Kingdon, J. W. (2011). Agendas, Alternatives, and Public Policies. Netherlands: Longman.
Ness, E. C. (2010). The role of information in the policy process: Implications for the
Examination of Research Utilization in Higher Education Policy. In J. C. Smart (Ed.),
Higher education: Handbook of theory and research (Vol. 25, pp. 1–49). Springer.
Zinth, K., & Smith, M. (2012). Tuition-Setting Authority for Public Colleges and
Universities (p. 10). Education Commission of the States.
Chapter 3
Identifying Data Sources

Abstract There are varied sources of data available to higher education
analysts and researchers at the international, national, state, and institutional
levels. These data are provided by international organizations, the federal
government, regional compacts, and independent organizations. Most of these
data are available to the public without restrictions. Many higher education
analysts and researchers have used these data to examine numerous topics.

Keywords Data sources · International · National · State-level · Institutional

3.1 Introduction

This chapter identifies and discusses some of the major data sources that are
available to conduct higher education policy research. The first part of the
chapter introduces sources of data that include international organizations.
The next section discusses the U.S. national-level data from the U.S.
Department of Education and other sources. Higher education institutional-
level data are introduced and discussed in the following section of the chapter.
The last section of the chapter provides concluding statements on data
sources.

3.2 International Data

Many international organizations provide data on a variety of topics,
including higher education (sometimes referred to as tertiary education).
One of those premier organizations is the World Bank, which collects and
makes comprehensive higher education data available to policy analysts and
researchers. World Bank (WB) education data are compiled by the United
Nations Educational, Scientific, and Cultural Organization (UNESCO) Insti-
tute for Statistics from the surveys and reports provided by education
officials in each country. These data are accessible via its website at: https://
data.worldbank.org/topic/education. A Stata add-on module, wbopendata
(Azevedo 2020), can be used to access WB data. The user can access a menu of
specific data across countries or a set of data from a specific country. The data
can be downloaded directly into Stata or Excel files. Using wbopendata, WB
education data can be joined or merged with other data on other WB “topics”,
such as Agriculture and Rural Development, Aid Effectiveness, Economy and
Growth, Environment, Health, and Social Development. World Bank data can
also be accessed with other Stata user-written programs getdata (Gonçalves
2016) and sdmxuse (Fontenay 2018).
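For readers working outside of Stata, the join-and-merge workflow described above can be sketched with pandas in Python. The two frames below stand in for extracts from two World Bank "topics"; the country-year values are invented for illustration.

```python
import pandas as pd

# Illustrative long-format extracts, as wbopendata or a similar tool might
# return them; all values are hypothetical.
education = pd.DataFrame({
    "country": ["BRA", "BRA", "GHA", "GHA"],
    "year": [2018, 2019, 2018, 2019],
    "ter_enroll_rate": [51.3, 55.1, 15.7, 17.2],  # hypothetical values
})
economy = pd.DataFrame({
    "country": ["BRA", "BRA", "GHA", "GHA"],
    "year": [2018, 2019, 2018, 2019],
    "gdp_pc_usd": [9151, 8845, 2263, 2202],  # hypothetical values
})

# Join the two "topics" on country-year, mirroring the merge step
# described in the text.
merged = education.merge(economy, on=["country", "year"], how="inner")
print(merged.shape)  # (4, 4)
```

With real extracts, the same country-year merge would let an analyst relate tertiary enrollment to economic indicators across countries, as in the studies cited below.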
Many of the higher education-oriented studies that used World Bank data
have focused on the relationship between economic growth and educational
attainment (e.g., Chatterji 1998; Holmes 2013; Knowles 1997). World
Bank data enable policy analysts and researchers to examine the relationship
between higher education and other topical areas across countries.
The Organisation for Economic Co-operation and Development (OECD)
also provides international education data. The OECD covers 37 countries
that span Europe, Asia, and North and South America. Like the World
Bank, OECD organizes its data by topic. The OECD topic of education
covers higher (tertiary) education data, which can be accessed in several
ways. In addition to going to the OECD webpage on education (https://data.
oecd.org/education.htm), analysts can utilize Stata programs (i.e., getdata
and sdmxuse) to extract data. Education policy analysts, researchers,
advocacy organizations, and government agencies have used OECD data to
examine educational attainment rates and spending across countries.

3.3 National Data

Countries collect and provide varying amounts of data on higher education.
This section will focus on the United States. The U.S. government collects a
tremendous amount of data from a variety of sources. Much of the education
data are collected by the U.S. Department of Education (DOE). The U.S.
DOE provides data from several nationally representative surveys that can
be used to examine higher education policy at many different levels. The
U.S. DOE’s National Center for Education Statistics (NCES) is primarily
responsible for collecting information on education. NCES collects, via
surveys, information on primary (elementary), secondary, and postsecondary
students. It also collects data on primary (elementary) and secondary schools
and postsecondary education institutions. Brief descriptions of NCES surveys
that focus on aspects of postsecondary education are provided below.
National Education Longitudinal Study of 1988 (NELS:88). NELS:88 is
a nationally representative longitudinal survey of eighth graders in 1988
that included follow-up questionnaires through their secondary education,
postsecondary education, and/or labor market years. NELS:88 is based on a
multistage sampling frame in which middle schools were first selected then
followed by random sampling of students within each school. In addition
to providing information on students, NELS:88 also provides information
on the schools they attended. The NELS:88 follow-up in 2000 also includes
postsecondary education transcripts (PETS). The NELS:88 base year and
follow-up microdata are accessible as a public use file (PUF) and restricted
use file (RUF). The NELS:88 PUF is a subset of the NELS:88 and does not
include all the student variables that are available in the NELS:88 RUF.
Users have to apply for a license from the Institute of Education Sciences
(IES) and NCES to obtain the NELS:88 RUF. (The IES is the research
and evaluation arm of the U.S. Department of Education.) Information on
accessing the NELS:88 PUF and RUF can be found at: https://nces.ed.gov/
surveys/nels88/data_products.asp.
Many higher education policy analysts and researchers have used NELS:88
data to examine determinants of college enrollment. The final NELS:88
follow-up, however, was completed in 2000. Consequently, the most recent
NELS:88 data are at least 20 years old.
Education Longitudinal Study of 2002 (ELS:2002). The ELS:2002 is a
nationally representative longitudinal study of tenth graders in 2002 and
12th graders in 2004 who were followed throughout their secondary educa-
tion, postsecondary education, and/or labor market years. Like NELS:88,
ELS:2002 is based on a multistage sampling of schools and students. The
ELS:2002 final follow-up was completed in 2012. ELS:2002 includes PETS
information for 2013. Information on accessing ELS:2002 can be found
at: https://nces.ed.gov/surveys/els2002/. With the exception of associated
PETS information, ELS:2002 is available as a PUF.
Several higher education policy analysts and researchers have used
ELS:2002 to examine college enrollment (e.g., D. Kim and Nuñez 2013;
Lee et al. 2013; Savas 2016; You and Nguyen 2012), college choice (e.g.,
Belasco and Trivette 2015; Hemelt and Marcotte 2016; Kim and Nuñez 2013;
Lee et al. 2013), and college retention (e.g., Glennie et al. 2015; Morgan et
al. 2015; Rowan-Kenyon et al. 2016; Schudde 2011, 2016).
High School Longitudinal Study of 2009 (HSLS: 09). The HSLS:09 is
a nationally representative longitudinal study that surveyed more than 23,000
students beginning in the ninth grade in 2009 at 944 schools. The first follow-
up of the HSLS:09 was in 2012. In 2013, there was an update to HSLS:09.
A second follow-up, conducted in 2016, collected information on students
in postsecondary education and/or the workforce. In 2017, HSLS:09 was
supplemented with PETS. Information on accessing HSLS:09 can be found at:
https://nces.ed.gov/surveys/hsls09/. Recently, a few higher education policy
analysts and researchers have used HSLS:09 to examine college readiness
(e.g., Alvarado and An 2015; George Mwangi et al. 2018; Kurban and Cabrera
2020; Pool and Vander Putten 2015) and college enrollment (e.g., Engberg
and Gilbert 2014; Goodwin et al. 2016; Nienhusser and Oshio 2017; Schneider
and Saw 2016).
National Postsecondary Student Aid Study (NPSAS). NPSAS is a nation-
ally representative cross-sectional survey, with a focus on financial aid, of
students enrolled in postsecondary education institutions. Beginning in 1987,
the NPSAS survey has been conducted almost every other year. An NPSAS
survey is planned for 2020, which will include state-representative data for
most states. In addition to student interviews, NPSAS includes data from
institution records and government databases. Analysts and researchers can
perform analysis on NPSAS data only through NCES, via its Datalab at:
https://nces.ed.gov/surveys/npsas/. NPSAS microdata or restricted use file
data are only available to analysts and researchers who have been granted a
license from IES/NCES. The federal government, higher education advocacy
groups, and researchers have used NPSAS data to produce reports to help
inform policy on federal financial aid.
Beginning Postsecondary Students Longitudinal Study (BPS). The BPS,
a spin-off of the NPSAS, is a nationally representative survey, based on
a multistage sample of postsecondary education institutions and first-time
students. Drawing on cohorts from the NPSAS, the BPS surveys collect
data on student demographic characteristics, PSE experiences, persistence,
transfer, degree attainment, entry into the labor force and/or enrollment
in graduate or professional school. The first BPS survey was conducted in
1990 (BPS:90/94) and followed a cohort of students through 1994. Since then,
BPS surveys of students have been conducted at the end of their first, third
and sixth year after entering a postsecondary education (PSE) institution.
The BPS has been repeated every few years. Beginning with the BPS:04/09,
PETS information is also provided. The most recent BPS (BPS:12/17) survey
followed a cohort of 2011–2012 first-time beginning students, with a follow-
up in 2017. The next BPS survey will collect information on students who
began their postsecondary education in the academic year 2019–2020 and
will follow that cohort in surveys to be conducted in 2020, 2022, and 2025.
Users can access a limited amount of BPS data through the NCES Datalab.
Information on the accessing data from the BPS can be obtained from NCES
at: https://nces.ed.gov/surveys/bps/. The complete BPS with microdata are
available to restricted use file license holders. Many higher education policy
analysts and researchers (too numerous to mention) have used the BPS to
investigate college student persistence and completion.
The Baccalaureate and Beyond Longitudinal Study (B&B) is a nationally
representative survey, based on a sample of postsecondary education students
and institutions, of college students’ education and labor force experiences
after they complete a bachelor’s degree. Drawing from cohorts in the NPSAS,
the B&B surveys also collect information on degree recipients’ earnings,
debt repayment, as well as enrollment in and completion of graduate and
professional school. Students in the B&B survey are followed up in their
first, fourth, and tenth year after receiving their baccalaureate degree. The
first B&B survey was conducted in 1993, with follow-ups in 1994, 1997, and
2003. The second B&B survey (B&B:2000/01) had only one follow-up, which
was in 2001. The B&B:2008/12, which focuses on graduates from STEM
education programs, was first conducted in 2008, with follow-ups in 2009
and 2012. The B&B:2008/18 will include a follow-up in 2018. Using the NCES
Datalab, analysts can perform limited analyses of B&B data. Microdata from
the B&B surveys, which include PETS information, are only available to users
who are given a license by IES/NCES to use restricted use files. Numerous
analysts and researchers have used the B&B to examine such topics as: labor
market experiences and workforce outcomes of college graduates (e.g., Bastin
and Kalist 2013; Bellas 2001; Joy 2003; Strayhorn 2008; Titus 2007, 2010);
graduate and professional school enrollment and completion (e.g., English
and Umbach 2016; Millett 2003; Monaghan and Jang 2017; Perna 2004;
Strayhorn et al. 2013; Titus 2010); student debt and repayment (e.g., Gervais
and Ziebarth 2019; Millett 2003; Scott-Clayton and Li 2016; Velez et al. 2019;
Zhang 2013); and family formation (e.g., Velez et al. 2019) and career choices
(e.g., Xu 2013, 2017; Zhang 2013) of bachelor’s degree recipients.
Digest of Education Statistics. In addition to providing microdata on
institutions and students, the U.S. Department of Education (DOE) also pro-
duces statistics at an aggregated or macro level on postsecondary education.
For example, IES/NCES publishes the Digest of Education Statistics, which
provides national- and state-level statistics on various areas of education,
including postsecondary education (PSE). For PSE, these areas include:
institutions; expenditures; revenues; tuition and other student expenses;
financial aid; staff; student enrollment; degrees completed; and security and
crime. The statistics on PSE are mostly based on aggregated data from NCES
surveys discussed above (e.g., IPEDS, NPSAS, BPS, B&B). The statistics,
aggregated over time and in some cases across states, are provided in tables.
The tables can be downloaded in an Excel format, which can be used to either
produce reports or merge with data from other sources to conduct statistical
analyses.
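As a sketch of that workflow, the Python snippet below merges two hypothetical state-level tables of the kind that might be extracted from Digest Excel downloads (read with pandas.read_excel, for example); the state figures are invented for illustration.

```python
import pandas as pd

# Hypothetical state-level tables standing in for Digest of Education
# Statistics Excel downloads; all figures are invented.
enrollment = pd.DataFrame({
    "state": ["MD", "VA", "PA"],
    "fall_enrollment": [365000, 577000, 756000],
})
appropriations = pd.DataFrame({
    "state": ["MD", "VA", "PA"],
    "state_approp_usd": [2.1e9, 2.3e9, 1.6e9],
})

# Merge on the state identifier and derive a per-student funding measure,
# a typical step before further statistical analysis.
state_data = enrollment.merge(appropriations, on="state")
state_data["approp_per_fte"] = (
    state_data["state_approp_usd"] / state_data["fall_enrollment"]
)
print(state_data[["state", "approp_per_fte"]].round(0))
```

In practice, the in-memory frames would be replaced with tables downloaded from the Digest or another source, keyed on a common state identifier.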
Current Population Survey (CPS). The U.S. Census Bureau also provides
national-level postsecondary education data to the public in the form of
the CPS. U.S. Census Bureau microdata sample data files are available to
researchers who are given authorization to use specific datasets at one of the
secure Federal Statistical Research Data Centers. For example, the restricted
use dataset of the CPS School Enrollment Supplement provides microdata
at the household level. The CPS School Enrollment Supplement has been
used to examine demographic differences in postsecondary enrollment (e.g.,
Hudson et al. 2005; Jacobs and Stoner-Eby 1998; Kim 2012).
Other Sources of National Data. The U.S. government collects and
disseminates national higher education data that focus on specific areas.
The Office of Postsecondary Education of the U.S. Department of Education
(DOE) provides data on campus safety and security. The College Scorecard,
which is maintained by the U.S. DOE, produces a national database on
student completion, debt and repayment, earnings, and other data.
The College Board. There are other sources of aggregate national postsec-
ondary education data, such as the College Board, that draw on nationally
representative surveys and federal administrative information. The College
Board data, however, are focused mainly on tuition price and college student
financial aid across years and, to a limited extent, across states. The data can
be accessed at the College Board website (https://research.collegeboard.org/
trends/trends-higher-education) and can be downloaded in Excel format.
Policy analysts use the College Board data to explain patterns and trends
in average higher education tuition prices (e.g., Baum and Ma 2012; Heller
2001; Mitchell 2017; Mitchell et al. 2016) and college student financial aid
(e.g., Baum and Payea 2012; Deming and Dynarski 2010).

3.4 State-Level Data

There are a variety of sources of state-level postsecondary education data.
Based on NCES collection efforts (see the Integrated Postsecondary Educa-
tion Data System below) and federal government administrative data, the
Digest of Education Statistics is a source of much of the state-level post-
secondary education data. These data are institutional-level postsecondary
education data aggregated at the state level. Some of the state-level data
are available across years. For example, the Digest of Education Statistics
provides state-level postsecondary education statistics by year on enrollment,
degrees, institutional revenues, and institutional expenditures.
National Association of State Student Grant and Aid Programs. Other
sources of state-level postsecondary education data include the National
Association of State Student Grant and Aid Programs (NASSGAP). Data
from the NASSGAP surveys (https://www.nassgapsurvey.com/), which focus
on state financial aid, are available in Excel file format.1 Many higher
education policy analysts and researchers have used NASSGAP data to
examine state need- and merit-based financial aid (e.g., Cohen-Vogel et al.
2008; Doyle 2006; Hammond et al. 2019; Titus 2006).

1 Surveys prior to 2015–2016 are available in pdf format.

State Higher Education Executive Officers. Another source of state-level
higher education data is the State Higher Education Executive Officers
(SHEEO). SHEEO provides data on higher education finance (i.e., state
appropriations and net tuition revenue) and postsecondary student unit
record systems. The SHEEO finance data, some of which go as far back as fis-
cal year 1980, can be downloaded (https://shef.sheeo.org/data-downloads/)
in an Excel file format. SHEEO finance data have been used by several higher
education policy analysts and researchers to produce reports and studies on
state support for higher education (e.g., Doyle 2013; Lacy and Tandberg 2018;
Lenth et al. 2014; Longanecker 2006).
National Science Foundation (NSF). The National Science Foundation
(NSF) is another source of state-level higher education data. More specifically,
NSF provides statistics based on Science and Engineering Indicators (SEI)
State Indicators (https://ncses.nsf.gov/indicators/states/). These statistics
include the number of science and engineering (S&E) degrees conferred,
academic research and development (R&D) expenditures at state colleges and
universities, academic S&E article output, and academic patents awarded.
The data are available to the public and can be downloaded in Excel file
format. Utilizing NSF/SEI state-level data, a few analysts and researchers
(e.g., Coupé 2003; Fanelli 2010; Wetter 2009) have addressed the topic of
academic R&D.
Regional Compacts. There are several academic common market or
regional compacts that provide state-level higher education data. The
Southern Regional Education Board (SREB) is a regional compact that
includes 16 member states in the South and provides state-level information
to the public. With respect to higher education, SREB produces a “factbook”
(https://www.sreb.org/fact-book-higher-education-0) which contains tables
on data such as the population and economy, enrollment, degrees, student
tuition and financial aid, faculty, administrators, revenue, and expenditures.
These tables can be downloaded in an Excel file format.
The Western Interstate Commission for Higher Education (WICHE) is an
academic common market that is composed of 15 Western states and member
U.S. Pacific Territories and Freely Associated States (which currently include
the Commonwealth of the Northern Mariana Islands and Guam). WICHE
produces a regional “factbook” for higher education that contains “policy
indicators” (https://www.wiche.edu/pub/factbook). Similar to SREB’s, the
WICHE higher education factbook provides state-level data in the following
areas: demographics (including projections); student preparation, enrollment,
and completion; affordability; and finance.
The Midwest Higher Education Compact (MHEC) is an academic common
market that is composed of 12 states in the Midwest. MHEC, via its Inter-
active Dashboard (https://www.mhec.org/policy-research/mhec-interactive-
dashboard), provides state-level data and key performance indicators of
college context, preparation, participation, affordability, completion, finance,
and the benefits of college. Data for all states in the MHEC common market
are provided and can be downloaded in several formats, except in Excel.
The New England Board of Higher Education (NEBHE) is an academic
common market that is composed of six New England states. The state-
level data from NEBHE, which are limited to tuition and fees across
six states (accessed via https://nebhe.org/policy-research/policy-analysis-
research-home/data/), can be downloaded to Excel files.
The Education Commission of the States (ECS) is an interstate compact
that provides information on education policy, including postsecondary
education policy. ECS maintains a database of information on state-level
postsecondary education governance structures, policies, and regulations.
Education analysts and researchers can track and identify changes in those
areas. That information, however, has to be manually entered into the indi-
vidual datasets that are created and analyzed by analysts. Several researchers
have used ECS information to examine the role of governance structures on
the adoption (e.g., Mokher and McLendon 2009) and conditioning effect (e.g.,
Tandberg 2013) of state higher education policies on policy outcomes.
Other State-Level Sources. There are few other sources of state-level
postsecondary education data and information. These sources include the
National Association of State Budget Officers (NASBO) and the Center for
the Study of Education Policy (University of Illinois). While both sources
of data cover all states over several years, those data are limited to higher
education finance (i.e., state spending on higher education).

3.5 Institution-Level Data

The federal government collects postsecondary education institutional-


level data, via the Integrated Postsecondary Education Data System
(IPEDS). Colleges and universities as well as proprietary institutions submit
institutional-level data to IPEDS, via 12 different surveys. These surveys
include:
1. 12-month Enrollment (E12);
2. Academic Libraries (AL);
3. Admissions (ADM);
4. Completions (C);
5. Fall Enrollment (EF);
6. Finance (F);
7. Graduation Rates (GR);
8. Graduation Rates 200% (GR200);
9. Human Resources (HR);
10. Institutional Characteristics (IC);
11. Outcome Measures (OM);
12. Student Financial Aid (SFA).
A brief description of each of the 12 IPEDS surveys, which are
collected over different collection periods, can be found at this website:
https://nces.ed.gov/ipeds/report-your-data/overview-survey-components-
data-cycle. IPEDS data can be accessed by users at: https://nces.ed.gov/
ipeds/use-the-data. Using either SAS, SPSS or Stata, users can download
statistical program routines to extract data from an entire survey or selected
variables within any of the 12 IPEDS surveys. Additionally, longitudinal
data (Delta Cost Project) from several IPEDS surveys are made available
by the American Institutes for Research on an NCES webpage at https://
nces.ed.gov/ipeds/deltacostproject/. The data can be downloaded using
statistical software packages (SAS, SPSS, Stata) or Excel. IPEDS data
can be aggregated at the state and national level. Higher education policy
analysts, institutional researchers, and others have relied heavily on IPEDS
data to produce thousands of reports and studies on many topics related to
postsecondary education institutions. Researchers have also merged IPEDS
data with data from other NCES datasets such as NELS, BPS, and B&B, to
address questions related to students, taking into account the characteristics
and policies of the institutions they attend and/or the states in which they
reside.
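Both operations described above, aggregating institution-level records to the state level and attaching institutional characteristics to student records, can be sketched in Python with pandas. The unitid values below mimic IPEDS-style institution identifiers, and all enrollment figures and student records are invented.

```python
import pandas as pd

# Invented institution-level records standing in for an IPEDS extract.
institutions = pd.DataFrame({
    "unitid": [100654, 100663, 100706],  # IPEDS-style institution IDs
    "state": ["AL", "AL", "AL"],
    "enrollment": [5200, 21900, 9700],
})

# Aggregate institution-level data to the state level.
state_totals = institutions.groupby("state", as_index=False)["enrollment"].sum()
print(state_totals)

# Attach institutional characteristics to (invented) student-level records,
# as one might when combining IPEDS with BPS or B&B microdata.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "unitid": [100654, 100706, 100706],
})
students = students.merge(institutions, on="unitid", how="left")
print(students.columns.tolist())
```

The same pattern scales to the full IPEDS files: the institution identifier serves as the merge key, and state or national totals come from a group-by aggregation.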
IPEDS data are limited in that they do not provide detailed information
within an institution at the academic unit (e.g., college, school, or depart-
ment) level. Over several years, there have been changes in the way in which
data on selected items in some of the IPEDS surveys have been collected.2

3.6 Summary

This chapter provided an overview of data sources available for higher
education analysts and researchers. These sources include international,
national (U.S.), state, and institutional data, surveys, and datasets. The
national data include public use and restricted use files from nationally
representative surveys conducted by IES/NCES and the CPS. The state-
level data are largely aggregated data originating from national surveys and
federal administrative datasets. Independent organizations also collect and
make available state-level data. Those data, however, are limited to finance-
related information. Regional compacts also collect and make available higher
education data. In most cases, those data are only collected for certain states
within specific regions of the country. Institutional data collected by NCES,
via IPEDS, are made available to the public but also have limitations. The
overview of data sources provided in this chapter is by no means an
exhaustive list of all sources of data.

2 This is particularly the case with respect to the Finance (F) survey.

References

Alvarado, S. E., & An, B. P. (2015). Race, Friends, and College Readiness: Evidence from
the High School Longitudinal Study. Race and Social Problems, 7 (2), 150–167. https:/
/doi.org/10.1007/s12552-015-9146-5
Azevedo, J. P. (2020). WBOPENDATA: Stata module to access World Bank databases.
In Statistical Software Components. Boston College Department of Economics. https:/
/ideas.repec.org/c/boc/bocode/s457234.html
Bastin, H., & Kalist, D. E. (2013). The Labor Market Returns to AACSB Accreditation.
Journal of Labor Research, 34 (2), 170–179. https://doi.org/10.1007/s12122-012-9155-
8
Baum, S., & Ma, J. (2012). Trends in College Pricing, 2012. Trends in Higher Education
Series. (pp. 1–40). College Board Advocacy & Policy Center. https://files.eric.ed.gov/
fulltext/ED536571.pdf
Baum, S., & Payea, K. (2012). Trends in Student Aid, 2012. Trends in Higher Education
Series. (pp. 1–36). College Board Advocacy & Policy Center. https://files.eric.ed.gov/
fulltext/ED536570.pdf
Belasco, A. S., & Trivette, M. J. (2015). Aiming low: Estimating the scope and predictors
of postsecondary undermatch. The Journal of Higher Education, 86 (2), 233–263.
Bellas, M. L. (2001). Investment in higher education: Do labor market opportunities differ
by age of recent college graduates? Research in Higher Education, 42 (1), 1–25.
Chatterji, M. (1998). Tertiary education and economic growth. Regional Studies, 32 (4),
349–354.
Cohen-Vogel, L., Ingle, W. K., Levine, A. A., & Spence, M. (2008). The “Spread” of Merit-
Based College Aid: Politics, Policy Consortia, and Interstate Competition. Educational
Policy, 22 (3), 339–362. https://doi.org/10.1177/0895904807307059
Coupé, T. (2003). Science Is Golden: Academic R&D and University Patents. The Journal
of Technology Transfer, 28 (1), 31–46. https://doi.org/10.1023/A:1021626702728
Deming, D., & Dynarski, S. (2010). College aid. In P. B. Levine & D. J. Zimmerman
(Eds.), Targeting investments in children: Fighting poverty when resources are limited
(pp. 283–302). University of Chicago Press. https://www.nber.org/chapters/c11730.pdf
Doyle, W. R. (2006). Adoption of merit-based student grant programs: An event history
analysis. Educational Evaluation and Policy Analysis, 28 (3), 259–285.
Doyle, W. R. (2013). Playing the Numbers: State Funding for Higher Education: Situation
Normal? Change: The Magazine of Higher Learning, 45 (6), 58–61.
Engberg, M. E., & Gilbert, A. J. (2014). The Counseling Opportunity Structure:
Examining Correlates of Four-Year College-Going Rates. Research in Higher Education,
55 (3), 219–244. https://doi.org/10.1007/s11162-013-9309-4
English, D., & Umbach, P. D. (2016). Graduate school choice: An examination of individual
and institutional effects. The Review of Higher Education, 39 (2), 173–211.
Fanelli, D. (2010). Do Pressures to Publish Increase Scientists’ Bias? An Empirical Support
from US States Data. PLoS ONE, 5 (4). https://doi.org/10.1371/journal.pone.0010271
Fontenay, S. (2018). SDMXUSE: Stata module to import data from statistical agencies
using the SDMX standard. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s458231.html
George Mwangi, C. A., Cabrera, A. F., & Kurban, E. R. (2018). Connecting School and
Home: Examining Parental and School Involvement in Readiness for College Through
Multilevel SEM. Research in Higher Education. https://doi.org/10.1007/s11162-018-
9520-4
Gervais, M., & Ziebarth, N. L. (2019). Life After Debt: Postgraduation Consequences of
Federal Student Loans. Economic Inquiry, 57 (3), 1342–1366. https://doi.org/10.1111/
ecin.12763
Glennie, E. J., Dalton, B. W., & Knapp, L. G. (2015). The influence of precollege access
programs on postsecondary enrollment and persistence. Educational Policy, 29 (7), 963–
983.
Chapter 4
Creating Datasets and Managing
Data

Abstract This chapter provides a discussion and demonstration of creating datasets, which include both primary and secondary data. The management of Excel and Stata datasets is also presented. While the chapter discusses and demonstrates how to create datasets based on primary data, it focuses on the creation and management of datasets based on secondary data.

Keywords Creating datasets · Managing datasets · Primary data · Secondary data

4.1 Introduction

A substantial amount of the time spent conducting higher education policy research and analysis is devoted to dataset creation and management. Even though they may draw on secondary data sources, such as those discussed in the previous chapter, researchers and analysts may need to create and manage customized datasets to address specific policy-related questions. This chapter discusses customized datasets that are based on secondary sources of data. It also demonstrates how to create, organize, and manage datasets using Excel and Stata.1 The Stata commands and syntax that are used throughout this chapter are included in an appendix.

1 It is assumed the reader is familiar with Excel.

© Springer Nature Switzerland AG 2021
M. Titus, Higher Education Policy Analysis Using Quantitative Techniques,
Quantitative Methods in the Humanities and Social Sciences,
https://doi.org/10.1007/978-3-030-60831-6_4

4.2 Stata Dataset Creation

Data used in Stata may be generated from surveys created and entered by the analyst or imported from an external source. Data produced by the analyst from original surveys are considered primary data, while data originally compiled by another party are secondary data. In the sections below, we discuss both.

4.2.1 Primary Data

If we are entering data from a very short survey, then we use the input com-
mand. The example below shows how data for three variables (variable_x,
variable_y, and variable_z) can be entered in Stata by typing the following:
input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end
To see the data that were entered above, type:
list
which would show the following:
. list
+--------------------------------+
| variab~x variab~y variab~z |
|--------------------------------|
1. | 31 57 18 |
2. | 25 68 12 |
3. | 35 60 13 |
4. | 38 59 17 |
5. | 30 59 15 |
+--------------------------------+
To save the above data, type:
save “Example 1.0.dta”

To use the Stata editor to enter additional data in Example 1.0, type:
edit
Importing data from a data management (e.g., dBase) file or a spreadsheet (e.g., Excel) file would be a more efficient way to enter data in Stata. There are several ways we can do this. We can import data from comma-delimited (CSV) files. For example, the data above may be imported from a comma-delimited (CSV) file by typing the following:
insheet using “Example 1.csv”, comma
The use of primary data requires careful planning and a well-developed
data collection process. Many of these processes involve conducting computer-
assisted personal interviews (CAPI). If we need to collect data, there are
several Stata-based tools available to assist in such an effort. One such tool
is a user-created package of Stata commands, iefieldkit, developed
by the World Bank’s Development Research Group Impact Evaluations team
(DIME). The most recent version of the package can be installed in Stata by
typing in “ssc install iefieldkit, replace”. Information on iefieldkit
can be found at the website address: https://dimewiki.worldbank.org/wiki/
Iefieldkit. Once installed, iefieldkit allows for the automatic creation of
Excel files containing the collected data.

4.2.2 Secondary Data

Using secondary data sources, customized datasets can be easily created for use when conducting higher education policy research. The most basic dataset is one that captures a snapshot in time or is cross-sectional in nature. For example, we can download a table containing data on the participation rate of U.S. high school graduates in 2012 who attended degree-granting postsecondary education institutions in the same year, by state, from the 2017 version of The Digest of Education Statistics (Table 302.50) in an Excel format: (https://nces.ed.gov/programs/digest/d16/tables/xls/tabn302.50.xls). The data, however, need to be reformatted before they can be imported or copied and pasted into Stata for further analysis. Hence, the following steps need to be taken:
1. Blank columns and rows should be deleted.
2. All rows with text containing titles, subtitles, notes, and footnotes should
be deleted.
3. A column needs to be inserted to include the state id number.
4. Each column should have an appropriate simple one-word title.
For example, the columns from left to right could be named stateid,
state, total, public, private, anystate, homestate, anyrate, and
homerate.

5. Because the total number (N) of cases is 51 (50 states plus the District of Columbia), the state id numbers should be entered, ranging from 1 to 51, to reflect N. (If the analyst chooses to delete one or more cases, then the range of the state id would reflect the modified N.)
6. All numbers should be formatted as numeric with the appropriate decimal
places and not as text characters.
7. Any characters that are not alpha-numeric should be removed from all
cells.
8. After steps 1–7, the file should be saved in an Excel format in the “working”
directory (as discussed in the previous chapter).
9. Open Stata and change to the same “working directory” as in step 8. For this chapter, the Stata command to change to the working directory that contains the Excel file is as follows:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”

10. The entire Excel file can be imported into Stata. Be sure to specify, as an option, that the first row contains variable names. Using the same file from above,
the Stata command is:

import excel “tabn302.50 - reformatted.xls”, firstrow

11. Open the Stata Data editor either in edit or browse mode to look at the
imported data.
In the Stata Data editor, you should see the following:

Fig. 4.1 Stata dataset, based on Excel tabn302.50



In Fig. 4.1, take note of the column with the State names, which are in
red text. This indicates State is a string variable. We may want to include
Federal Information Processing Standard Publication (FIPS) codes and the
abbreviations of state names in a state-level dataset. Using the user-created
Stata program “statastates”, the FIPS codes and state abbreviations can be easily added to any state-level dataset that includes the state name. (In our example from above, the state name variable is state.) This is demonstrated in the steps below:
1. ssc install statastates
2. statastates, name(<State name>)
3. We can delete the variable _merge, which was created when we added the
FIPS codes and state abbreviations. This is done by simply typing
drop _merge
We may also want to move the FIPS codes and state abbreviations
somewhere near the front of our dataset. This can be accomplished by typing
the following Stata command:
order state_abbrev state_fips, before( state)
The dataset should look like Fig. 4.2:

Fig. 4.2 Stata dataset, based on the modified Excel tabn302.50
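What statastates does is, in effect, a lookup-table merge on the state name. As a purely illustrative aside (this is Python rather than Stata, and the two-entry lookup table and sample row are invented for the sketch; the real table covers all 51 jurisdictions), the logic looks like this:

```python
# A lookup-table merge on state name, which is conceptually what
# statastates performs. Only two states are shown here.
STATE_INFO = {"Alabama": ("AL", 1), "Alaska": ("AK", 2)}

def add_state_codes(rows):
    """Attach state_abbrev and state_fips to each row by state name."""
    for row in rows:
        abbrev, fips = STATE_INFO[row["state"]]
        row["state_abbrev"] = abbrev
        row["state_fips"] = fips
    return rows

data = [{"state": "Alabama", "total": 48523}]    # invented total
print(add_state_codes(data)[0]["state_abbrev"])  # AL
```

Rows whose state names do not match the lookup table would raise an error here; statastates instead flags them through the _merge variable mentioned above.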

We can then save this file with a new more descriptive name, such as “US
high school graduates in 2012 enrolled in PSE, by state”, in a working direc-
tory containing Stata files (e.g., C:\Users\Marvin\Dropbox\Manuscripts\
Book\Chapter 4\Stata files). After changing to the working directory and
reopening the new Stata file, we can show a description of our dataset by
typing:
describe
The output is the following:
. describe
Contains data
obs: 51
vars: 11
----------------------------------------------------------------
storage display value
variable name type format label variable label
-----------------------------------------------------------
Stateid byte %10.0gc Stateid
state_abbrev str2 %9s
state_fips byte %8.0g
state str20 %20s state
total long %10.0g total
public long %10.0g public
private int %10.0g private
anystate long %10.0g anystate
homestate long %10.0g homestate
anyrate double %10.0g anyrate
homerate double %10.0g homerate
-----------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Contains data from US high school graduates in 2012 enrolled in
PSE, by state.dta
Take note that none of the variables have labels. To create labels based on the column names in the Excel file, we use the label variable (lab var) command for each variable. Here is an example:
lab var Stateid “Stateid”
lab var state_abbrev “State abbreviation”
lab var state_fips “FIPS code”
lab var state “State name”
lab var total “Total number of graduates from HS located in
the state”
lab var public “Number of graduates from public HS
located in the state”
lab var private “Number of graduates from private HS
located in the state”
lab var anystate “Number of first-time freshmen graduating from HS 12 months enrolled in any state”

(Notice that labels cannot be more than 80 characters, so we have to shorten the label.)

lab var anystate “Number of 1st-time freshmen graduating from HS enrolled in any state”
lab var homestate “Number of 1st-time freshmen graduating
from HS enrolled in home state”
lab var anyrate “Estimated rate of HS graduates going to
college in any state”
lab var homerate “Estimated rate of HS graduates going to
college in home state”
Typing the describe command, the output is this:

. describe

Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata\US high school graduates in 2012 enrolled in PSE, by state.dta
obs: 51
vars: 11 25 Jun 2018 16:16
----------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------
Stateid byte %10.0gc Stateid
state_abbrev str2 %9s State abbreviation
state_fips byte %8.0g FIPS code
state str20 %20s State name
total long %10.0g Total number of
graduates from
HS located in the state
public long %10.0g Number of graduates from
public
HS located in the state
private int %10.0g Number of graduates from
private
HS located in the state
anystate long %10.0g Number of 1st-time
freshmen graduating from
HS enrolled in any state
homestate long %10.0g Number of 1st-time
freshmen graduating from
HS enrolled in home state
anyrate double %10.0g Estimated rate of HS
                      graduates going to
                      college in any state
homerate double %10.0g Estimated rate of HS
                       graduates going to
                       college in home state
----------------------------------------------------------------
Sorted by:

We then re-save the dataset with the same name.


The example above is a cross-sectional dataset that can be used to provide
descriptive statistics, which will be discussed in the next chapter. Time-
series datasets can be used to observe changes in phenomena over time. For
example, data on the enrollment of recent high school completers in college
from 1960 through 2016 is a time series. These data are also provided by the National Center for Education Statistics (NCES) in The Digest (2019) and
can be downloaded to an Excel file by going to https://nces.ed.gov/programs/
digest/d17/tables/dt17_302.10.asp. Focusing on the percent of recent high
school completers who enrolled in college between 1960 and 2016, the data
can be copied directly from the downloaded Excel table into Stata. More
specifically, we can copy data in column H (total percent of recent high school
completers who enrolled in college) into the Stata data editor. In Stata, if
we type describe, we will see there are 68 instead of 57 observations (1960–
2016). Because the Excel file had blank rows, Stata treated those blank rows
as cases with missing data (.). Therefore, we drop the cases with missing
data.
drop if var1==.
We rename var1 totalpct by typing:
rename var1 totalpct
We then create a year variable that has values that range from 1960
(1959 + 1) to 2016.
gen year = 1959 + _n
We relocate the year variable to the beginning of the dataset by typing:
order year, first
Then we declare the dataset to be a time series.
tsset year, yearly
(This has to be done only once before saving the file.)
We then change the working directory to the one with our Stata files.
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata”
Finally, we save the file with a descriptive name (Fig. 4.3).
save “Percent of US high school graduates in PSE, 1960 to 2016.dta”
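In Stata, _n is the current observation number, so gen year = 1959 + _n assigns 1960 to the first row and 2016 to the 57th. A quick Python sketch of that arithmetic (illustrative only; the variable names are ours):

```python
# Mimic Stata's "gen year = 1959 + _n" for a 57-row dataset (1960-2016).
# _n in Stata is 1-based, so we add 1959 to observation numbers 1..57.
n_obs = 57  # observations remaining after dropping the blank rows
year = [1959 + n for n in range(1, n_obs + 1)]

print(year[0], year[-1], len(year))  # 1960 2016 57
```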
A word of caution when using secondary data such as the NCES Excel files: many of those files contain non-numeric characters, such as commas and dollar signs, which will yield string variables in Stata. Before we copy and paste data from those types of files, we have to properly reformat the cells with data so they contain no non-numeric characters. Using the time-series data, we can create graphs (which we will discuss in the next chapter).
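The cleaning step can also be checked programmatically. Below is a small Python sketch (not part of the book's Stata workflow; the function name and sample cell values are invented for illustration) that strips commas and dollar signs so values parse as numbers:

```python
def clean_cell(value):
    """Strip characters, such as commas and dollar signs, that would
    make Stata read a number as a string variable."""
    stripped = value.strip().replace(",", "").replace("$", "")
    return float(stripped) if stripped else None  # blank cells -> missing

# Invented sample cells as they might appear in an NCES Excel table
raw = ["$7,025", "1,234,567", "58.3", ""]
print([clean_cell(v) for v in raw])  # [7025.0, 1234567.0, 58.3, None]
```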
In many instances, cross-sectional time-series or panel data are used
to conduct higher education policy research. In some cases, analysts have
direct access to data in a panel format, such as some tables that are
published in the Digest of Education Statistics. Consistent with the examples
above and the use of panel data, we download the Excel version of Table
304.70 from the 2018 version of The Digest.2 Because the data on total
fall enrollment of undergraduate students in degree-granting postsecondary
education institutions by state are for selected years 2000 through 2017, we
can characterize those data as panel in nature. Unlike the above example of time-series data, panel data in this format cannot easily be copied and pasted into the Stata Data Editor. Prior to copying or importing them into Stata, the data have to be properly formatted. The easiest way to reformat the data is in Excel worksheets containing data on each of the variables to be subsequently analyzed in Stata.

Fig. 4.3 Stata dataset, based on Excel Table 302.10

2 Table 304.70—Total fall enrollment in degree-granting postsecondary institutions, by level of enrollment and state or jurisdiction: Selected years, 2000 through 2017. The table can be found at: https://nces.ed.gov/programs/digest/d18/tables/dt18_304.70.asp.

For example, some of the data on undergraduate students by state from Table 304.70 of the 2018 version of The Digest can be stored in an Excel worksheet named, “Undergrads”. This worksheet could be one of many in an Excel workbook named, “Enrollment”. The Excel worksheet looks like this (Fig. 4.4):
Before we can import or copy and paste the data into Stata, we must do the following:
1. Copy the worksheet “Digest 2018 Table 304.70” to another worksheet
and rename it Ugrad in the same workbook. In the Ugrad worksheet:
2. Remove all borders and unmerge all cells.
3. Remove all irrelevant (e.g., table titles, United States, District of
Columbia, table footnotes, etc.) and blank rows.
4. Remove all irrelevant columns.
5. Insert a new column and create a column header named, “id”.
6. Beginning with the number 1, create an id number for each state.
7. Create a column header for the State names, “State”.
8. Create variable labels for each year of data, beginning with “Ugrad”,
which reflects undergraduate enrollment (e.g., Ugrad2000 for the year
2000, Ugrad2010 for the year 2010, etc.).
9. Reformat all data cells so they contain no non-numeric characters (e.g., commas, dollar signs, etc.). Note—If we copy and paste or import numbers with non-numeric characters into Stata, it will treat them as string variables, which cannot be analyzed numerically.
10. Save the Excel workbook to a new name, such as “Undergraduate
enrollment data”.

Fig. 4.4 Digest 2018 Table 304.70 (Excel)



As a result of steps 1–10, our new worksheet should look like this (Fig.
4.5):
As we can see, this worksheet allows us to view and manage the data
that we are interested in and if necessary, access the source of that data
in the other worksheet (i.e., Digest 2018 Table 304.70). We can import this
worksheet from this Excel workbook into Stata, via the following syntax (all
on one line):
import excel “C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 4\Excel files\College enrollment data.xls”,
sheet(“Ugrad”) firstrow
The result is as follows:

. import excel “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4
\Excel files\College enrollment data.xls”, sheet(“Ugrad”) firstrow
(8 vars, 50 obs)

Take note that the option sheet(“Ugrad”) refers to the specific worksheet we would like to import. The option firstrow tells Stata that we would like to designate the first row of the worksheet as variable names.

We then save this Stata file with our panel data:


save “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata\
Undergraduate enrollment data - Wide.dta”
Take note that we used “Wide” as part of the naming convention. This
is because our panel dataset is in a “wide” format. While it is easier to
manage in this format, it is difficult, if not impossible, to conduct analysis on “wide” format panel data.

Fig. 4.5 Digest 2018 Table 304.70 (Excel)—modified

To conduct panel data analysis, we have to convert the data from a “wide” to a “long” format using the reshape command or the much faster user-created sreshape command (Simons 2016). In Stata, type search sreshape, all, click on dm0090 to install it, and type:
sreshape long Ugrad, i(id) j(year)
Notice the results:
. import excel “C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 4\Excel files\College enrollment data.xls”,sheet(“Ugrad”)
firstrow
(8 vars, 50 obs)
. sreshape long Ugrad, i(id) j(year)
(note: j = 2000 2010 2012 2015 2016 2017)
Data                               wide -> long
----------------------------------------------------------------
Number of obs. 50 -> 300
Number of variables 8 -> 4
j variable (6 values) -> year
xij variables:
Ugrad2000 Ugrad2010 ... Ugrad2017 -> Ugrad
-----------------------------------------------------------------

We now have 300 observations and four variables, including a year variable.
This new dataset, in a long format, now has to be “declared” a panel dataset
by typing:
xtset id year, yearly
The result is:
. xtset id year, yearly
panel variable: id (strongly balanced)
time variable: year, 2000 to 2017, but with gaps
delta: 1 year
The example above is a strongly balanced panel dataset with gaps in the years. Panel datasets can be strongly balanced, strongly balanced with gaps, weakly balanced, or unbalanced. In a panel dataset, the total number (N) of observations equals the number of units (e.g., states or institutions) or panels (p) multiplied by the number of time points (t) (e.g., days, weeks, months, or years): N = p × t. A strongly balanced dataset is one in which all the panels have been observed at the same number of time points. Panel datasets in which all the panels have been observed at the same time points but with gaps between those points are known as strongly balanced with gaps in the years. A weakly balanced dataset exists if each panel has the same number of observations but not the same time points. An unbalanced dataset is one in which the panels do not all have the same number of observations. In order of priority, we should strive for strongly balanced and then weakly balanced panel datasets. For reasons we will discuss later in the book, we should try to avoid using unbalanced panel datasets.
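These four categories can be checked mechanically from the observed (panel, time) pairs. As an aside, here is a hedged Python sketch of that classification logic (the function name and sample data are ours, not Stata output):

```python
def describe_panel(obs):
    """Classify a panel dataset from (panel_id, time) pairs, echoing
    the categories reported by Stata's xtset/xtdescribe."""
    times_by_panel = {}
    for pid, t in obs:
        times_by_panel.setdefault(pid, set()).add(t)
    time_sets = list(times_by_panel.values())
    first = time_sets[0]
    if any(s != first for s in time_sets):
        # Same number of observations but different time points is weakly
        # balanced; differing counts make the panel unbalanced.
        counts = {len(s) for s in time_sets}
        return "weakly balanced" if len(counts) == 1 else "unbalanced"
    shared = sorted(first)
    gaps = (shared[-1] - shared[0] + 1) != len(shared)
    return "strongly balanced with gaps" if gaps else "strongly balanced"

# Two panels observed in the same years, but 2001 is missing: gaps.
obs = [(1, 2000), (1, 2002), (1, 2003), (2, 2000), (2, 2002), (2, 2003)]
print(describe_panel(obs))  # strongly balanced with gaps
```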
Once we “declare” our data to be a panel dataset (which only has to be
done one time), we save it to a new file. A good practice is to save it with
“Long” as part of its naming convention. For example, we save our “declared”
panel data file as follows:
save “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata\
Undergraduate enrollment data - Long.dta”
We then close Stata, by typing:
exit
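The wide-to-long conversion that sreshape long Ugrad, i(id) j(year) performs stacks the Ugrad2000 through Ugrad2017 columns into a single Ugrad column, with the year values parsed from the column suffixes. A Python sketch of the same transformation, using two invented rows (the function name and figures are ours):

```python
def reshape_long(wide_rows, stub="Ugrad"):
    """Turn wide rows like {'id': 1, 'Ugrad2000': ...} into long rows
    with one observation per (id, year), as a reshape-long does."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key.startswith(stub):
                year = int(key[len(stub):])  # j value parsed from the suffix
                long_rows.append({"id": row["id"], "year": year, stub: value})
    return sorted(long_rows, key=lambda r: (r["id"], r["year"]))

# Invented figures for two states across three of the selected years
wide = [{"id": 1, "Ugrad2000": 100, "Ugrad2010": 120, "Ugrad2017": 130},
        {"id": 2, "Ugrad2000": 200, "Ugrad2010": 210, "Ugrad2017": 240}]
long = reshape_long(wide)
print(len(long))  # 6 rows: 2 panels x 3 years
print(long[0])    # {'id': 1, 'year': 2000, 'Ugrad': 100}
```

The same panel-times-periods arithmetic explains the Stata output above: 50 wide rows with 6 year columns become 50 × 6 = 300 long observations.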
Unlike our example above, most panel datasets have more than one
variable that can be analyzed, such as Ugrad. We can add more variables
for the same years to our dataset in a few ways. We can manually add
more variables. But as pointed out above, this is a very time-consuming
and possibly error-prone process. The additional variables can be added in
another Excel worksheet in the same or another Excel workbook. We use a
similar naming convention for the variable in the worksheet. For example, we
can download data on state appropriations for public high school graduation
data from the NCES’ Digest of Education Statistics (https://nces.ed.gov/
programs/digest/d18/tables/dt18_219.20.asp) in an Excel file and follow the
steps above, including “HSGrad” as a naming convention for the same years
as in the Ugrad worksheet. Our updated Excel workbook (which we saved as
Example 4) now looks like this (Fig. 4.6):
We open Stata, change our working directory to the directory that contains the Excel file, and import the HSGrad worksheet into Stata by typing:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel
files”
import excel “Example 4.xls”, sheet(“HSGrad”) firstrow
We then change our working directory to where we want to save our Stata
file and save it, by typing:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata”
save “HSGrad - Wide.dta”
Using the same syntax as above but simply substituting HSGrad for Ugrad, we can reformat our file from wide to long, declare it a panel dataset, and save it to a new file. We show these steps below.
sreshape long HSGrad, i(id) j(year)
xtset id year, yearly
save “HSGrad - Long.dta”
Fig. 4.6 Digest 2018 Table 219.20 (Excel)

Like the dataset of first-time college students, this dataset is also strongly balanced but with gaps in the years. This is not a problem if we want to merge the two datasets, based on id, into one that contains the two variables we can analyze: FirsTim and HSGrad. We do this by specifying the dataset (“First-Time - Long.dta”) that contains the variables we would like to add to the dataset that is currently open. We carry out this procedure by typing the following:
joinby id year using “First-Time - Long.dta”, unmatched(none)
Because the file contains the same yearly data on first-time college students as the data on public high school students, we do not strictly need to specify “year” as a joining variable. But as shown in the next example below, it is good practice to include that variable as well. Given our example, our new Stata file looks like this (Fig. 4.7):
In the data editor, we can see the same two variables (HSGrad and FirsTim) that we can later analyze.
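The matching rule behind joinby ... , unmatched(none) is that of an inner join: only (id, year) combinations present in both files are kept. A Python sketch of that rule, with invented values (the function and row contents are ours, not the book's data):

```python
def joinby(master, using, keys=("id", "year")):
    """Inner-join two lists of dict rows on the key variables, mirroring
    Stata's "joinby id year using ..., unmatched(none)": rows whose key
    combination appears in only one dataset are dropped."""
    index = {tuple(row[k] for k in keys): row for row in using}
    joined = []
    for row in master:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:  # unmatched(none): drop non-matching rows
            joined.append({**row, **match})
    return joined

# Invented rows: (1, 2017) exists only in the first file, so it is dropped
hsgrad = [{"id": 1, "year": 2000, "HSGrad": 50},
          {"id": 1, "year": 2010, "HSGrad": 60},
          {"id": 1, "year": 2017, "HSGrad": 65}]
firstim = [{"id": 1, "year": 2000, "FirsTim": 30},
           {"id": 1, "year": 2010, "FirsTim": 35}]
print(joinby(hsgrad, firstim))
```

This is also why the merged panel later shows 250 rather than 300 observations: years absent from one of the joined files drop out.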
If we have data for additional variables in other worksheets located in the same working directory (e.g., “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”), we would simply repeat the steps above, referring to the specific Excel files/worksheets that we want to import. We would then reshape the resulting Stata files from wide to long and ultimately join them to our current file in memory.
Fig. 4.7 Stata file based on Digest 2018 Table 219.20 (Excel)

We could also join two or more Stata files that were reshaped from wide to long and have the variables State, id, and year. For example, if in our current directory we have a file that contains state-level undergraduate need-based financial aid data (Undergraduate state financial aid - need.dta) and another that has merit-based financial aid data (Undergraduate state financial aid - merit.dta), we could add the data from those two files to our long-format panel dataset on undergraduate college enrollment (College enrollment data.dta) by executing the following commands:
use “Undergraduate enrollment data - Long.dta”, clear
joinby id year using “Undergraduate state financial aid - need”
joinby id year using “Undergraduate state financial aid - merit”
xtset id year, yearly
save “Example - 4.1.dta”
Notice that in the joinby syntax, we did not have to include the option
unmatched(none). We also did not have to include the extension dta as
a part of the names of the Stata files. We did, however, have to declare our
dataset as panel data and save it with a new file name (e.g., Example 4.1).
We can see in our Stata data editor, we now have six variables in our new
panel dataset (Fig. 4.8).
After closing the Stata data editor, we can see how our new panel dataset
is structured, by typing the command xtdescribe or the shortened version
(xtdes):
. xtdes

      id:  1, 2, ..., 50                            n =  50
    year:  2000, 2010, ..., 2016                    T =   5
           Delta(year) = 1 year
           Span(year)  = 17 periods
           (id*year uniquely identifies each observation)

Distribution of T_i:   min   5%   25%   50%   75%   95%   max
                         5    5     5     5     5     5     5

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-------------------
        50   100.00  100.00 |  1.........1.1..11
 ---------------------------+-------------------
        50   100.00         |  X.........X.X..XX

Fig. 4.8 Modified Stata file based on Digest 2018 Table 219.20 (Excel)
We can see that like our original dataset of only undergraduate college
enrollment, our new appended panel dataset spans 17 years, has 250
observations (50 states × 5 years), is strongly balanced, but with gaps in
the years. This structure is acceptable when conducting basic data analysis,
such as descriptive statistics, and when running some regression models (which
we will cover in other chapters). But as we shall see, a strongly balanced
panel dataset with no gaps in the time periods is required to conduct more
advanced statistical analyses.
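The two properties that xtdescribe summarizes — balance and gaps — can be expressed as simple set checks. The Python sketch below mirrors the logic for our example (five observed years across a 17-year span); the specific years are invented for illustration:

```python
# Checks behind xtdescribe: a panel is strongly balanced when every id
# is observed in the same set of years; it has gaps when the observed
# years do not fill the full min-to-max span.

years_by_id = {
    1: {2000, 2010, 2012, 2015, 2016},  # invented years for the sketch
    2: {2000, 2010, 2012, 2015, 2016},
}

year_sets = list(years_by_id.values())
strongly_balanced = all(s == year_sets[0] for s in year_sets)

span = max(year_sets[0]) - min(year_sets[0]) + 1   # 17 periods here
has_gaps = len(year_sets[0]) < span                # True: 5 observed < 17

print(strongly_balanced, span, has_gaps)
```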

4.3 Summary

This chapter discussed and demonstrated how higher education policy
analysts can create primary and secondary datasets to address specific
policy-related questions. More specifically, the chapter demonstrated how to
create and organize datasets using Excel and Stata. Furthermore, it described
how these customized datasets need to be managed.

4.4 Appendix

*Chapter 4 Syntax
*Primary data
*example below shows how data for three variables (variable_x, variable_y, ///
and variable_z) can be entered in Stata
input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end

*To see the data that was entered above, type


list

*To save the above data, type:


save “Example 1.0.dta”

*To use the Stata editor to enter additional data in Example 1.0, type:
edit

*the data above may be imported from an Excel comma delimited file (csv) ///
by typing in the following:
insheet using “Example 1.csv”, comma

*Secondary data
*The syntax to change to the “working directory” which contains ///
the Excel file is as follows:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”

*Using the same file from above, the Stata command is:
import excel “tabn302.50 - reformatted.xls”, firstrow

*Using the user-created Stata program “statastates”, the FIPS codes ///
and state abbreviations can be easily added to any state-level data ///
set that includes the state name. (In our example from above, the ///
state name is “States”.) This is demonstrated in the two steps below:
ssc install statastates
statastates, name(<State name>)

*We can delete the variable _merge, which was created when we added ///
the FIPS codes and state abbreviations. This is done by simply typing:
drop _merge

*We may also want to move the FIPS codes and state abbreviations ///
somewhere near the front of our dataset. This can be accomplished ///
typing the following Stata command:
order state_abbrev state_fips, before( state)

*Stata dataset, based on the modified Excel tabn302.50


describe

*To create labels, based on the column names in the Excel file, ///
we use the label variable (lab var) command for each variable. ///
Here is an example:
lab var Stateid “Stateid”
lab var state_abbrev “State abbreviation”
lab var state_fips “FIPS code”
lab var state “State name”
lab var total “Total number of graduates from HS located in the state”
lab var public “Number of graduates from public HS located in the state”
lab var private “Number of graduates from private HS located in the state”
lab var anystate ///
“Number of first-time freshmen graduating from HS 12 months enrolled in any state”

*Labels cannot be more than 80 characters. So we have to shorten the label.


lab var anystate ///
“Number of 1st-time freshmen graduating from HS enrolled in any state”
lab var homerate ///
“Estimated rate of HS graduates going to college in home state”
describe

*we drop the cases with missing data.


drop if var1==.

*We rename var1 totalpct by typing:


rename var1 totalpct
gen year = 1959 + _n

*We relocate the year variable to the beginning of the dataset by typing:
order year, first

*Then we declare the dataset to be a time series.


tsset year, yearly
*change the working directory to the one that contains our Stata files.
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata”

*Finally, we save the file with a descriptive name.


save “Percent of US high school graduates in PSE, 1960 to 2016.dta”

*import worksheet from Excel workbook into Stata, via the following syntax
clear all
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”
import excel “College enrollment data.xls”,sheet(“Ugrad”) firstrow

*save this Stata file with our panel data:


cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata”
save “Undergraduate enrollment data - Wide.dta”

*convert the data from a “wide” to a “long” format using the reshape ///
or the much faster user-created sreshape (Simons 2016)

*install sreshape
net install dm0090.pkg, replace
sreshape long Ugrad, i(id) j(year)

*declare a panel dataset by typing:


xtset id year, yearly

*save our “declared” panel data file as follows:


save “Undergraduate enrollment data - Long.dta”

cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files”
import excel “Example 4.xls”, sheet(“HSGrad”) firstrow

*change our working directory to where we want to save our Stata ///
file and save it, by typing:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata”
save “HSGrad - Wide.dta”

*reformat our file from wide to long, declare it a panel data set, ///
and save it to a new file
sreshape long HSGrad, i(id) j(year)
xtset id year, yearly
save “HSGrad - Long.dta”

*join the two datasets, based on id, into one dataset that would ///
contain two variables
joinby id year using “First-Time - Long.dta”, unmatched(none)

*join two or more Stata files that were reshaped from wide to long ///
and have the variables State, id, and year.
use “Undergraduate enrollment data - Long.dta”, clear
joinby id year using “Undergraduate state financial aid - need”
joinby id year using “Undergraduate state financial aid - merit”
xtset id year, yearly
save “Example - 4.1.dta”

*see how our new panel dataset is structured, by typing the command ///
xtdescribe or the shortened version:
xtdes
*end

References

Simons, K. L. (2016). A sparser, speedier reshape. The Stata Journal, 16 (3), 632–649.
Chapter 5
Getting to Know Thy Data

Abstract This chapter discusses and demonstrates the importance of
getting to know the data that we use to conduct higher education policy
analysis and evaluation. More specifically, this chapter addresses the need to
know the structure of datasets. The identification and exploration of missing
data are also discussed in this chapter.

Keywords Dataset structure · Missing data · Missing data analysis

5.1 Introduction

In the first section of this chapter, we demonstrate how to explore the
structure of a dataset. In the next section, we explore or “get to know thy
data”. But this is part of a broader point with regard to our datasets. We
should be well acquainted with all aspects of our data, including the strengths
and limits of their use. The limitations include the extent to which we have
missing data. Therefore, the last part of this chapter discusses how to identify
and analyze missing data patterns. This chapter presents ways in which we
can determine and discuss the strengths and limitations of the data we use to
conduct higher education policy analysis. The Stata commands and syntax
that are used throughout this chapter are included in an appendix.


5.2 Getting to Know the Structure of Our Datasets

In Chap. 4, we discussed the types of data (primary and secondary) and
how we construct our own dataset from secondary sources. We ended that
chapter by showing how we can explore the structure of our panel dataset using
the xtdescribe (or xtdes) command in Stata. We can also use the describe
command to show information on data storage and with respect to variables
in any type of dataset. Using that command, we can look at the structure of
our time series data that we introduced in the previous chapter.
. describe

Contains data
obs: 56
vars: 2
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
year float %ty
totalpct float %8.0g
---------------------------------------------------------------
Sorted by: year
We see that “year” and “totalpct” are stored as a floating-point or float type. By
default, Stata stores all numbers as floats, also known as single-precision or 4-
byte reals (StataCorp 2019). Compared to the integer storage type, the float
storage type uses more memory.1 While it may be necessary for the “totalpct”
variable, this level of precision is not necessary for the year variable, which
is an integer. So we can reduce the amount of memory required by float
by compressing the data using the compress command.2 The use of this
command automatically changes the storage type for the year variable from
float to integer (int). We see from the output below that we save 112 bytes.
. compress
variable year was float now int
(112 bytes saved)
. describe

Contains data

1 For a complete description of the storage types, see page 89 of the Stata User’s Guide
Release 16.
2 For more information on compress, see pages 77–78 of the Stata User’s Guide
Release 16.

obs: 56
vars: 2
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
year int %ty
totalpct float %8.0g
---------------------------------------------------------------
Sorted by: year
Note: Dataset has changed since last saved.
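The 112 bytes reported by compress can be checked by hand: 56 observations × (4 bytes per float − 2 bytes per int) = 112. The following Python sketch reproduces the same arithmetic using the standard library's fixed-width array type (the year values are illustrative):

```python
# Verifying the "112 bytes saved" arithmetic: 56 yearly observations,
# each moved from a 4-byte float to a 2-byte integer storage type.
from array import array

n_obs = 56
as_float = array("f", range(1960, 1960 + n_obs))  # 4 bytes per element
as_int = array("h", range(1960, 1960 + n_obs))    # 2 bytes per element

saved = (as_float.itemsize - as_int.itemsize) * n_obs
print(saved)  # 112
```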
Therefore, it is a good practice to invoke the compress command,
particularly when using large datasets with numeric variables that are actually
integers. As an example, we will use an enhanced version of one of the panel
data files that we created in the previous chapter, saved to a new file name,
Example 5.0. With the exception of state expenditures on financial aid
for undergraduates, which are measured in millions of dollars, this file contains
the same data as the file we used in the Chap. 4 example (Example 4.1). Most
likely, we would have imported these data on state expenditures on
financial aid for undergraduates from National Association of State Student
Grant and Aid Programs (NASSGAP) Excel files, copied and pasted the data,
or manually entered the data from NASSGAP pdf files into a Stata file.
Because it has implications for how our variables are stored, it is important
that we are aware of whether or not the state financial aid data in our dataset
are also measured in millions. If they are measured in millions, then those
variables are not stored as integers. We can verify this by typing the describe
command.
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files”
use “Example 5.0.dta”
. describe
Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 5\Stata files\Example 5.0.dta
obs: 250
vars: 6 18 Jul 2020 15:57
---------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------
id float %10.0gc id
year int %ty
State str20 %20s State
Ugrad long %10.0g Undergraduate enrollment
need float %9.0g State spending on need-
based aid (in millions)


merit float %9.0g State spending on merit-
based aid (in millions)
---------------------------------------------------------------
Sorted by: id year
We see that only the “year” variable is stored as an integer. Because we
know that “id” is also numeric and an integer, we can compress the data.
This results in the following output.
. compress
variable id was float now byte
variable Ugrad was long now int
variable State was str20 now str14
(2,750 bytes saved)
We see this reduces the required memory for the dataset by changing the
storage type for “id” to byte and the storage type for “State” from a string
variable with 20 bytes to one with 14 bytes, resulting in a total of 2750 bytes
saved. After typing the command describe again, we see the following:
Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 5\Stata files\Example 5.0.dta
obs: 250
vars: 6 18 Jul 2020 15:59
---------------------------------------------------------------
variable storage display value
name type format label variable label
---------------------------------------------------------------
id byte %10.0gc id
year int %ty
State str14 %14s State
Ugrad int %10.0g Undergraduate enrollment
need float %9.0g State spending on need-
based aid (in millions)
merit float %9.0g State spending on merit-
based aid (in millions)
---------------------------------------------------------------
Sorted by: id year
We see that “id” is still not stored as an int. So we will have to use another
Stata command, recast, to accomplish that task.
. recast int id
We then retype describe, and see the following:
. describe

Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book
\Chapter 5\Stata files\Example 5.0.dta
obs: 250
vars: 6 18 Jul 2020 15:59
---------------------------------------------------------------
variable storage display value
name type format label variable label
---------------------------------------------------------------
id int %10.0gc id
year int %ty
State str14 %14s State
Ugrad int %10.0g Undergraduate enrollment
need float %9.0g State spending on need-
based aid (in millions)
merit float %9.0g State spending on merit-
based aid (in millions)
---------------------------------------------------------------
Sorted by: id year
We then save our file with the same name (e.g., Example 5.0).
In some cases, we may be using a large amount of data from secondary
data sources such as the National Center for Education Statistics’ public-use
High School Longitudinal Study of 2009 (HSLS:09) student dataset. Because
this dataset has several thousand variables, we set the maximum number of
variables to 10,000 (set maxvar 10000) in Stata.3 We download and import all the
student data from the HSLS:09 dataset in Stata and we use the command
describe, short. We can see that we have 23,503 observations and 8,509
variables. If we use the memory command, we can also see that this huge
dataset uses about 1 gigabyte of memory.
. set maxvar 10000
. use ”C:\Users\Marvin\Google Drive\HSLS\hsls_16_student_v1_0.dta“
. describe, short
Contains data from C:\Users\Marvin\Google Drive\HSLS\hsls_16_
student_v1_0.dta
obs: 23,503
vars: 8,509
Sorted by:

. memory

3 If we are using Stata/IC, then the maximum number of variables is 798. If we are using
Stata/MP, then the maximum number of variables is 65,532. In this example, we are using
Stata/SE, which has a maximum of 10,998 variables.

Memory usage
used allocated
------------------------------------------------------------
data 1,026,681,549 1,241,513,984
strLs 0 0
------------------------------------------------------------
data & strLs 1,026,681,549 1,241,513,984

------------------------------------------------------------

[the rest of the output omitted]


If we compress the data, we find that no bytes are saved.
. compress
(0 bytes saved)
This suggests that the dataset is structured in such a way that it is
efficiently using our computer’s memory.
In yet another example, we download, reformat, modify, and import some
state higher education finance data from Excel files provided by the State
Higher Education Executive Officers (SHEEO). These data are saved to
a Stata file (Example 5.2.dta).4 This particular example will include post-
Great Recession (i.e., after fiscal year 2009) state-level data on net tuition
revenue (gross tuition revenue minus state financial aid), state appropriations
to public higher education, state financial aid to students, full-time equivalent
(FTE) students (net of medical students) and cost indices (COLI, EMI,
HECA), which SHEEO uses to adjust the data when comparing the data
across years and states. (For the purposes of this example, we will not
use those indices.) To access the data from the worksheet within the
Excel workbook that contains our downloaded data, we change the working
directory:
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Excel files”
and import the Excel file (type the import command in its entirety on one line)
import excel “SHEEO_SHEF_FY18_Nominal_Data.xlsx”, sheet(“State and U.S. Nominal Data (2”) firstrow
Because we want to use only post-Great Recession data, we drop observations
if they are prior to fiscal year (FY) 2010. (According to the National
Bureau of Economic Research, the Great Recession in the U.S. began in
December 2007 and ended in June 2009).

4 These data can be found at: https://sheeo.org/project/state-higher-education-finance/.
Based on the output below, we see that 1532 observations were dropped.
. drop if FY<2010
(1,532 observations deleted)
We use the list command to take a quick look at the data, particularly with
respect to FY 2010. We will also make the command conditional by using “if
FY==2010”. This tells Stata that we only want to list those observations for
FY 2010.
list if FY==2010
We see that aggregate data on the total U.S. and Washington DC are in our
dataset. Because we want only states in our dataset, we drop all observations
for the U.S. total and Washington DC.
. drop if State==“US”
(9 observations deleted)
. drop if State==“Washington DC”
(9 observations deleted)
In Fig. 5.1, we also see there is no numeric State id number. Until we create
such a variable, we cannot declare our Stata data to be a panel dataset. We
can employ the user-created Stata program “statastates”, introduced in the
previous chapter, or the Stata command egen to create State id numbers
based on the states grouped by state name (State).5 Why should we create
an id number based on the state FIPS code and our own id number based
on State names? We may want to add additional variables from another file
created from state-level data that are grouped by FIPS code. We may also
want to add variables from another file (e.g., imported from Excel) in which
the data are grouped by id numbers that we generated.
With the Stata user-created statastates (Schpero 2018) program, we use
the nogenerate option to prevent the generation of the _merge variable.

5 The egen command, which is short for extensions to generate, can be employed to create
variables that also require an additional function. For a detailed explanation of the egen
command, see pages 203–223 of the Stata User’s Guide Release 16.

Fig. 5.1 Stata data—SHEEO Finance Data, FY 2010

. statastates, name(State) nogenerate


(459 real changes made)
(note: variable state_name was str14, now str20 to accommodate
using data’s values)

Result                          # of obs.
-----------------------------------------
not matched                            18
    from master                        18
    from using                          0

matched                               450
-----------------------------------------
This creates two additional variables, state_abbrev and state_fips. The
first is the two-character state abbreviation and the second is the state FIPS
code. To create a variable, stateid, based on state names, we use egen.
egen stateid = group(State)
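The group() function simply assigns consecutive integers 1, 2, …, N to the sorted distinct values of the grouping variable. A minimal Python sketch of that logic, with made-up state values:

```python
# What egen stateid = group(State) does, in miniature: sort the
# distinct values of the grouping variable and number them 1..N.

states = ["Alabama", "Alaska", "Alabama", "Arizona", "Alaska"]

# Map each distinct name to its 1-based rank in sorted order.
codes = {name: i for i, name in enumerate(sorted(set(states)), start=1)}
stateid = [codes[s] for s in states]

print(stateid)  # [1, 2, 1, 3, 2]
```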
We use the compress command to save computer memory.
. compress
variable stateid was float now byte
variable State was str20 now str14
(4,212 bytes saved)
After compressing the data, we use stateid and FY to declare the dataset
to be a panel.
We use the following syntax, xtset stateid FY, yearly.
. xtset stateid FY, yearly
panel variable: stateid (strongly balanced)
time variable: FY, 2010 to 2018
delta: 1 year
We see the dataset is strongly balanced with no gaps in the time periods.
The data are saved to a file with a new name.
save “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files\Example 5.2.dta”

5.3 Getting to Know Our Data

Much of the data, including secondary data that we use for higher education
policy analysis, may be missing. We need to have a sense of not only how
much of the data is missing but also the pattern of “missingness”. Using selected
variables from the public-use version of the High School Longitudinal Study
of 2009 (HSLS:09) that we saved in a Stata file (i.e., Example 5.3), we
demonstrate how to identify missing data. In the next section (5.4), we show
how to analyze missing data.
Like many other NCES longitudinal datasets, the HSLS:09 contains many
variables that are labeled with codes that indicate missing data. In some
instances, missing data are coded as −9. A good way to determine if and

how missing data are coded in datasets from secondary data sources is to
use the codebook command in Stata. In this particular example, we focus
on one variable, S3CLGPELL, which indicates whether in November 2013
a high school student was offered a scholarship or grant for the 2013–2014
school year.
cd “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files”
use “Example 5.3.dta”
codebook S3CLGPELL
From the last command, we see the following output:
. codebook S3CLGPELL
---------------------------------------------------------------
S3CLGPELL S3 D07C Offered scholarship/grant to attend
Nov 1 2013 school for 2013-2014 year
---------------------------------------------------------------

type: numeric (byte)


label: S3CLGPEL

range: [-9,3] units: 1


unique values: 7 missing .: 0/23,503

tabulation: Freq. Numeric Label


459 -9 Missing
4,945 -8 Unit non-response
5,022 -7 Item legitimate skip/NA
588 -4 Item not administered:
abbreviated interview
5,229 1 Yes
5,333 2 No
1,927 3 Don’t know

In addition to the information on the variable type and label, we can
also see how the variable is coded and the frequency of each code. More
specifically, missing data are coded −9. Because Stata does not treat
the value −9 as an indicator of missing data, the output does not show the
number of missing observations on this variable out of the 23,503 observations
in our dataset. To conduct missing data analysis, we need to change
−9 to “.”. To do so, we use the mvdecode command. In fact, after
using the codebook command for each of our variables, we can see which
variables are coded −9 for missing data and include all of those variables
simply by including _all as a part of the syntax when using mvdecode.
The result of the latter is:
. mvdecode _all, mv(-9=.)
STU_ID: string variable ignored
X1SEX: 6 missing values generated
X1RACE: 1006 missing values generated
X4ATPRLVLA: 136 missing values generated

We then save this file to a new version of itself (Example 5.4) and are
ready to do some missing data analysis, which is shown below.
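Conceptually, mvdecode scans every numeric variable and replaces the sentinel value −9 with a genuine missing-value marker, leaving string variables alone. A Python sketch of that idea, with invented example rows (None stands in for Stata's “.”):

```python
# The idea behind mvdecode _all, mv(-9=.): replace the sentinel -9 in
# every numeric variable with a missing marker (None here), leaving
# string variables such as STU_ID untouched. Rows are invented.

rows = [
    {"STU_ID": "S001", "X1SEX": 1, "X1RACE": -9},
    {"STU_ID": "S002", "X1SEX": -9, "X1RACE": 3},
]

for row in rows:
    for key, value in row.items():
        if isinstance(value, int) and value == -9:  # strings are skipped
            row[key] = None

print(rows)
```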

5.4 Missing Data Analysis

Before we start analyzing data that we have, it is important to know what we


don’t have or what is missing. Among the variables we have in our dataset, we
need to determine to what extent there are missing data. Because it will have
implications for how we later conduct and interpret data analyses, missing
data analysis is very important.
The most straightforward approach to missing data analysis is to
determine the number and frequency of missing data among our variables.
This can be carried out by invoking the user-created Stata program, mdesc
(Medeiros and Blanchette 2011). For the most recent version of this program,
in Stata, type ssc install mdesc, replace. This program produces a table
with the number of missing values, total number of cases, and percent missing
for each variable in our file. Using our example based on data extracted
from the public-use version of the HSLS:09 (Example 5.4) and mdesc, we
demonstrate how this is done.
. mdesc
Variable | Missing Total Percent Missing
----------------+---------------------------------------------
STU_ID | 0 23,503 0.00
X1SEX | 6 23,503 0.03
X1RACE | 1,006 23,503 4.28
X1SES | 0 23,503 0.00
X1SESQ5 | 0 23,503 0.00
X4ATPRLVLA | 136 23,503 0.58
S3CLGPELL | 459 23,503 1.95
----------------+---------------------------------------------
From the output, we can see that 4.28% of the values of the variable
X1RACE and 1.95% of the values of the variable S3CLGPELL are missing. We

can also see that missing values of the variable X1SEX and X4ATPRLVLA
are less than 1%, while we have complete data for the variables X1SES and
X1SESQ5. Notice that none of our student identification numbers (STU_ID),
which is a string variable, are missing.
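The tabulation that mdesc produces is simple to reason about: for each variable, count the missing values and divide by the total number of cases. The Python sketch below mirrors that computation; the variable names match the example, but the values are invented, with None standing in for Stata's “.”:

```python
# The tabulation behind mdesc: missing count, total cases, and
# percent missing for each variable. Data values are invented.

data = {
    "X1SEX": [1, 2, None, 1],
    "X1RACE": [3, None, None, 8],
    "X1SES": [0.2, -0.1, 0.5, 1.1],
}

summary = {}
for var, values in data.items():
    n_missing = sum(v is None for v in values)
    pct = round(100 * n_missing / len(values), 2)
    summary[var] = (n_missing, len(values), pct)
    print(f"{var:10s} {n_missing:4d} {len(values):6d} {pct:8.2f}")
```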
We can also use the Stata command misstable tree, with various options,
to show the pattern of “missingness” in the data. The output for this
command is shown below:
. misstable tree

   Nested pattern of missing values
        X1RACE   S3CLGPELL   X4ATPRLVLA   X1SEX
  ---------------------------------------------
            4%         <1%           0%      0%
                                             0
                                    <1       0
                                            <1
                        4           <1       0
                                            <1
                                     4      <1
                                             4
           96           2           <1       0
                                            <1
                                     2       0
                                             2
                       94           <1       0
                                            <1
                                    93      <1
                                            93
  ---------------------------------------------
  (percent missing listed first)
We can also use Stata command misstable patterns to produce the
following output:
. misstable patterns

Missing-value patterns
(1 means complete)

| Pattern
Percent | 1 2 3 4
------------+-------------
93% | 1 1 1 1
|
4 | 1 1 1 0
2 | 1 1 0 1
<1 | 1 0 1 1
<1 | 1 1 0 0
<1 | 0 1 1 0

<1 | 1 0 0 1
<1 | 1 0 1 0
<1 | 0 1 1 1
------------+-------------
100% |

Variables are (1) X1SEX (2) X4ATPRLVLA (3) S3CLGPELL (4) X1RACE

We can use another option, misstable tree, frequency, which produces:

. misstable tree, frequency

   Nested pattern of missing values
        X1RACE   S3CLGPELL   X4ATPRLVLA    X1SEX
  ----------------------------------------------
         1,006          23            0        0
                                               0
                                     23        0
                                              23
                       983            3        0
                                               3
                                    980        5
                                             975
        22,497         436            3        0
                                               3
                                    433        0
                                             433
                    22,061          130        0
                                             130
                                 21,931        1
                                          21,930
  ----------------------------------------------
  (number missing listed first)
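Underlying these misstable displays is one simple idea: each observation's missingness profile becomes a pattern of complete/missing flags, and the patterns are then counted. A Python sketch with invented observations (None marks a missing value):

```python
# The idea behind misstable patterns: encode each observation as a
# tuple of flags (1 = complete, 0 = missing) and count how often
# each pattern occurs. Observations are invented for the sketch.
from collections import Counter

observations = [
    (1, None, 3),     # second variable missing
    (1, 2, 3),        # complete
    (1, 2, 3),        # complete
    (None, 2, None),  # first and third variables missing
]

patterns = Counter(
    tuple(0 if v is None else 1 for v in obs) for obs in observations
)
print(patterns.most_common())
```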

If we are using panel data, we can also conduct missing data analysis
employing the user-created Stata program xtmis (Nguyen 2008). The
program must be installed by typing: ssc install xtmis. For xtmis to
work, another Stata program, tomata, must also be installed by typing:
ssc install tomata. The xtmis program will produce a report of the
number and percent of missing and non-missing values for each variable
in groups (e.g., states) indicated. Suppose we downloaded IPEDS data on
the amount of grants and scholarships awarded to low-income students (i.e.,
from families with annual incomes of $30,000 and lower) by private higher
education four-year institutions for the years 2010 to 2018 (Example 5.5).
The file has been declared a panel dataset based on the variable “unitid”
(the IPEDS code) and year. We need to determine the extent to which these
institutions did not provide data on the amount of grants and scholarships
awarded to low-income students. But before we use the xtmis program, we
must create a string variable from the IPEDS code (unitid) to determine the
frequency of missing values for the variable reflecting the amount of grants
and scholarships awarded to low-income students. We use the Stata command
tostring to create a string variable (unitid_s), based on the numeric IPEDS
variable (unitid). Then we invoke the xtmis command.
tostring unitid, generate(unitid_s)
xtmis grantlow , id(unitid_s)
The output from the second line of syntax is:
. xtmis grantlow , id(unitid_s)

Variable: grantlow
Group by | Obs Missing Feq.Missings NonMiss Feq.NonMiss
-------------------+---------------------------------------------------------
456348 | 9045 2234 24.698729 6811 75.301271
367909 | 112 37 33.035714 75 66.964286
445072 | 252 95 37.698413 157 62.301587
438601 | 48 17 35.416667 31 64.583333
220941 | 30 14 46.666667 16 53.333333
177162 | 84 47 55.952381 37 44.047619
164571 | 15 12 80 3 20
109013 | 6 6 100 0 0
181011 | 4 4 100 0 0
-------------------+---------------------------------------------------------
| 9596 2466 25.698208 7130 74.301792

We can see that about 26% of all observations have missing values for
the variable of interest. It appears that one institution in particular has a
substantial amount of missing data on the amount of grants and scholarships
awarded to low-income students. This may warrant dropping that institution
from any further analysis of the data. In addition to the procedures in Stata
discussed above to determine if and how data are missing, there is a whole
suite of utilities embedded in the Stata user-created missings command.6
We can use the missings command to examine missing data by a categorical
variable, such as income group (e.g., quintiles). Using data extracted from
the public-use version of the HSLS:09, we can show the patterns of missing
data by student income level. First, install the most recent version of missings
(net install dm0085_1.pkg, replace). Then examine missing data by SES
quintiles.
. bysort X1SESQ5 : missings table

-----------------------------------------------------------------------------
-> X1SESQ5 = Unit non

6 For the full documentation for missings, see Cox (2015).



Checking missings in all variables:


1036 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,023 49.68 49.68
1 | 1,005 48.81 98.49
2 | 31 1.51 100.00
------------+-----------------------------------
Total | 2,059 100.00

-----------------------------------------------------------------------------
-> X1SESQ5 = First qu

Checking missings in all variables:


64 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,370 98.14 98.14
1 | 64 1.86 100.00
------------+-----------------------------------
Total | 3,434 100.00

-----------------------------------------------------------------------------
-> X1SESQ5 = Second q

Checking missings in all variables:


73 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,632 98.03 98.03
1 | 73 1.97 100.00
------------+-----------------------------------
Total | 3,705 100.00

-----------------------------------------------------------------------------
-> X1SESQ5 = Third qu

Checking missings in all variables:


103 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 4,130 97.57 97.57


1 | 103 2.43 100.00
------------+-----------------------------------
Total | 4,233 100.00

-----------------------------------------------------------------------------
-> X1SESQ5 = Fourth q

Checking missings in all variables:


122 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 4,431 97.32 97.32
1 | 120 2.64 99.96
2 | 2 0.04 100.00
------------+-----------------------------------
Total | 4,553 100.00

-----------------------------------------------------------------------------
-> X1SESQ5 = Fifth qu

Checking missings in all variables:


175 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 5,344 96.83 96.83
1 | 174 3.15 99.98
2 | 1 0.02 100.00
------------+-----------------------------------
Total | 5,519 100.00

We do not see a clear pattern in missing data by income group.


The same command can be repeated by racial-ethnic groups.
. bysort X1RACE : missings table

-----------------------------------------------------------------------------
-> X1RACE = Amer. In

Checking missings in all variables:


6 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 159 96.36 96.36
1 | 6 3.64 100.00
------------+-----------------------------------
Total | 165 100.00

-----------------------------------------------------------------------------
-> X1RACE = Asian, n

Checking missings in all variables:


58 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,894 97.03 97.03
1 | 57 2.92 99.95
2 | 1 0.05 100.00
------------+-----------------------------------
Total | 1,952 100.00

-----------------------------------------------------------------------------
-> X1RACE = Black/Af

Checking missings in all variables:


67 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,383 97.27 97.27
1 | 66 2.69 99.96
2 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,450 100.00

-----------------------------------------------------------------------------
-> X1RACE = Hispanic

Checking missings in all variables:


15 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 407 96.45 96.45
1 | 15 3.55 100.00
------------+-----------------------------------
Total | 422 100.00

-----------------------------------------------------------------------------
-> X1RACE = Hispanic

Checking missings in all variables:


73 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,302 97.84 97.84
1 | 73 2.16 100.00
------------+-----------------------------------
Total | 3,375 100.00

-----------------------------------------------------------------------------
-> X1RACE = More tha

Checking missings in all variables:


37 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,904 98.09 98.09
1 | 37 1.91 100.00
------------+-----------------------------------
Total | 1,941 100.00

-----------------------------------------------------------------------------
-> X1RACE = Native H

Checking missings in all variables:


5 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 105 95.45 95.45
1 | 5 4.55 100.00
------------+-----------------------------------
Total | 110 100.00

-----------------------------------------------------------------------------
-> X1RACE = White, n

Checking missings in all variables:


306 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 11,776 97.47 97.47
1 | 305 2.52 99.99
2 | 1 0.01 100.00
------------+-----------------------------------
Total | 12,082 100.00

-----------------------------------------------------------------------------
-> X1RACE = .

Checking missings in all variables:

1006 observations with missing values

# of |
missing |
values | Freq. Percent Cum.
------------+-----------------------------------
1 | 975 96.92 96.92
2 | 31 3.08 100.00
------------+-----------------------------------
Total | 1,006 100.00

From the output above, it appears that missing data are more prevalent
among non-whites.
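The logic behind missings table under bysort — count how many variables are
missing in each record, then tabulate those counts within each group — can be
sketched outside Stata. Below is a minimal, illustrative pure-Python version.
The records, group labels, and variable names are made up (they are not
HSLS:09 values); -9 is the sentinel code for missing, as in our extract.

```python
from collections import Counter

# Made-up student records; -9 codes a missing value, as in the HSLS:09 extract
rows = [
    {"group": "Q1", "pell": 1, "tuition": 9500},
    {"group": "Q1", "pell": -9, "tuition": 9500},
    {"group": "Q5", "pell": 0, "tuition": -9},
    {"group": "Q5", "pell": -9, "tuition": -9},
]

def n_missing(row):
    """Count how many analysis variables are coded missing (-9) in one record."""
    return sum(1 for key, value in row.items() if key != "group" and value == -9)

# Tabulate the count of missing values per record within each group,
# mirroring: bysort group : missings table
tables = {}
for row in rows:
    tables.setdefault(row["group"], Counter())[n_missing(row)] += 1

print(tables)
```

Comparing the per-group tabulations is exactly how we eyeballed the Stata
output above for differences in missingness across income groups.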

5.4.1 Missing Data—Missing Completely at Random

The issue of missing data is crucial when we employ statistical methods,
which require assumptions about the nature of any missing data. Those
techniques typically assume that, if we have missing data, those data are
missing at random (MAR). Because the values of the missing data are not
observed, the MAR assumption cannot be tested directly. Analysts can,
however, test the stronger assumption that data are missing completely at
random (MCAR) using only the observed data (Little 1988). Using our data
extracted from the HSLS:09 dataset, we demonstrate how to conduct this
test. Before we do so, we have to install the Stata user-written program,
mcartest (Li 2013). In Stata, type “search mcartest, all”, click on “st0318”,
and then install, or type:
net install st0318.pkg, replace
Let’s suppose we want to test whether data on students being offered a
scholarship or grant to attend college in the 2013–2014 academic year
(S3CLGPELL) and on tuition and mandatory fees at a specific college
(P1TUITION) are missing completely at random (MCAR).


We conduct the MCAR test, with equal variances, as follows:
use "HSLS09.dta", clear
keep STU_ID X1SEX X1RACE X1SES X1SESQ5 X4ATPRLVLA S3CLGPELL P1TUITION
mvdecode _all, mv(-9=.)

. mvdecode _all, mv(-9=.)


STU_ID: string variable ignored
X1SEX: 6 missing values generated
X1RACE: 1006 missing values generated
X4ATPRLVLA: 136 missing values generated
S3CLGPELL: 459 missing values generated
P1TUITION: 1407 missing values generated

. mcartest S3CLGPELL P1TUITION


note: 32 observations omitted from EM estimation because of all
imputation variables missing

Little’s MCAR test

Number of obs = 23471


Chi-square distance = 68.1557
Degrees of freedom = 2
Prob > chi-square = 0.0000

Because the p-value is less than 0.05 in the above output, missing data in the
two variables (S3CLGPELL and P1TUITION) are not MCAR.
We can conduct the test with unequal variances.
. mcartest S3CLGPELL P1TUITION, unequal
note: 32 observations omitted from EM estimation because of all
imputation variables missing

Little’s MCAR test with unequal variances

Number of obs = 23471


Chi-square distance = 76.0439
Degrees of freedom = 4
Prob > chi-square = 0.0000
We can also add covariates to test the covariate-dependent missingness (CDM).
In this example, we add student race-ethnicity (X1RACE).

. mcartest S3CLGPELL P1TUITION = i.X1RACE if X1RACE !=. ,


unequal emoutput nolog
note: 32 observations omitted from EM estimation because of all
imputation variables missing

Expectation-maximization estimation     Number obs      =    22465
                                        Number missing  =     1772
                                        Number patterns =        3
Prior: uniform         Obs per pattern: min =      404
                                        avg = 7488.333
                                        max =    20693

Observed log likelihood = -80052.204 at iteration 6

------------------------------------
| S3CLGPELL P1TUITION
-------------+----------------------
Coef |
1b.X1RACE | 0 0
2.X1RACE | 2.566391 .5038324
3.X1RACE | 1.102913 .590291
4.X1RACE | -.3349718 -1.390813
5.X1RACE | .9184874 .2190636
6.X1RACE | 1.246321 .9427008
7.X1RACE | .4102142 .2168802
8.X1RACE | 1.798048 1.152957
_cons | -3.909168 -6.200928
-------------+----------------------
Sigma |
S3CLGPELL | 20.58743 4.445989
P1TUITION | 4.445989 11.75155
------------------------------------

Little’s CDM test with unequal variances

Number of obs = 22465


Chi-square distance = 105.2008
Degrees of freedom = 18
Prob > chi-square = 0.0000
We see from the output above that even after including race-ethnicity, the
data in the two variables (S3CLGPELL and P1TUITION) are not MCAR.

There are at least two implications. First, because the missing data in these
variables are not MCAR (whether or not we assume equal variances), it is
probably not a good idea to simply delete observations with missing values.
Second, statistical methods that assume no missing data are valid when
missing data are MCAR. In the next few chapters, some of those statistical
methods will be discussed.
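The intuition behind these tests can be illustrated with a much simpler
diagnostic. Under MCAR, the rate of missingness should not depend on any
observed grouping variable. The sketch below is not Little's test (which
compares means across missingness patterns via EM estimation); it is a
hand-computed Pearson chi-square on a 2 x 2 table of made-up counts,
checking whether a group's missingness rate departs from what MCAR-style
independence would imply.

```python
# Not Little's test, but a simpler check in the same spirit: under MCAR, the
# rate of missingness should not depend on an observed grouping variable.
# All counts below are made up for illustration.
obs = {               # rows: group; columns: (n missing, n observed)
    "A": (30, 170),
    "B": (10, 190),
}

row_tot = {g: m + o for g, (m, o) in obs.items()}
col_tot = [sum(cells[j] for cells in obs.values()) for j in (0, 1)]
n = sum(row_tot.values())

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2 = 0.0
for g, cells in obs.items():
    for j, o in enumerate(cells):
        expected = row_tot[g] * col_tot[j] / n
        chi2 += (o - expected) ** 2 / expected

CRIT_1DF_05 = 3.841  # chi-square critical value for df = 1, alpha = .05
print(round(chi2, 2), chi2 > CRIT_1DF_05)  # 11.11 True -> reject the null
```

Here group A's 15% missingness versus group B's 5% yields a chi-square
statistic well above the critical value, so we would reject the hypothesis
that missingness is unrelated to group, just as the mcartest output led us
to reject MCAR above.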

5.5 Summary

This chapter discussed and demonstrated the importance of getting to know
the data that we use to conduct higher education policy analysis and
evaluation. More specifically, this chapter addressed the need to know the
structure of our datasets. The identification and exploration of missing data
were also discussed and demonstrated.

5.6 Appendix
*Chapter 5 Syntax
*use time series data from Chap. 4.

cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "Percent of US high school graduates in PSE, 1960 to 2016.dta"

*examine structure of the dataset


describe

*reduce the amount of memory required by float storage by compressing the data
compress

*compare after compressing, show structure


describe

*open panel dataset


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.0.dta"

*compress the data and show structure


compress
describe

*recast the variable id as an integer to save memory
recast int id
describe

*save
save "Example 5.0.dta", replace

*clear all

*using a large amount of data from secondary data sources such as the ///
National Center for Education Statistics' (NCES) ///
public-use High School Longitudinal Study of 2009 (HSLS:09) student dataset

*set the maximum variables to 10,000
set maxvar 10000

*download all student data from the HSLS:09 dataset in Stata

*examine a shortened version of the HSLS:09 dataset’s structure


describe, short

*look at how much memory this dataset uses


memory

*try to see if we can compress the data


compress

*close dataset
clear all

*import an Excel file (SHEEO finance data)


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Excel files"
import excel ///
"SHEEO_SHEF_FY18_Nominal_Data.xlsx", sheet("State and U.S. Nominal Data (2") ///
firstrow
*Because we want to use only post-Great Recession data, we drop observations ///
prior to fiscal year (FY) 2010.
drop if FY<2010

*We use the list command to take a quick look at the data, particularly ///
with respect to FY 2010. We will also make the command ///
conditional by using
list if FY==2010

*Because we want only states in our dataset, we drop all observations for ///
the U.S. total and Washington DC.
drop if State=="US"
drop if State=="Washington DC"

*we employ the user-created statastates (Schpero 2018) program to create ///
fips codes and other state identifiers; include the nogenerate option to ///
prevent the generation of the variable _merge
statastates, name(State) nogenerate

*To create a variable, stateid, based on state names, we use egen.


egen stateid = group(State)

*We use the compress command to save computer memory.


compress

*we use stateid and FY to declare the dataset to be a panel


xtset stateid FY, yearly

*data are saved to a file with a new name.


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
save "Example 5.2.dta"
clear all

*Using selected variables from the public-use version of the HSLS:09 ///
that we saved in a Stata file (i.e., Example 5.3)
use "Example 5.3.dta"

*determine if and how missing data are coded for the variable S3CLGPELL
codebook S3CLGPELL

*before conducting missing data analysis, we need to ///
change -9 to "." for all variables
mvdecode _all, mv(-9=.)

*Missing data analysis


*install Stata user-created program, mdesc (Medeiros and Blanchette 2011)
ssc install mdesc, replace

*produce a table with the number of missing values, total number of cases, ///
and percent missing for each variable in our file.
mdesc

*use the Stata command misstable tree, with various options, to show the ///
pattern of “missingness” in the data
misstable tree

*use Stata command, misstable


misstable patterns

*use another option,


misstable tree, frequency

*conduct missing data analysis employing the Stata user-created ///
routine "xtmis" (Nguyen 2008). The program must be installed by typing:
ssc install tomata
ssc install xtmis

*use the Stata command "tostring" to create a string variable (unitid_s), ///
based on the numeric IPEDS variable (unitid). Then we invoke xtmis.
tostring unitid, generate(unitid_s)
xtmis grantlow , id(unitid_s)

*Stata user-created "missings" command; install the most recent version
net install dm0085_1.pkg, replace

*examine missing data by SES quartiles


bysort X1SESQ5 : missings table

*same command can be repeated by racial-ethnic groups.


bysort X1RACE : missings table

*Missing Data - Missing Completely at Random


*install the Stata user-written program, mcartest (Li 2013)
net install st0318.pkg, replace

*set maximum variables to 10,000 and open the large public-use ///
version of the HSLS:09 dataset
set maxvar 10000
use "HSLS09.dta"

*keep selected variables


keep STU_ID X1SEX X1RACE X1SES X1SESQ5 X4ATPRLVLA S3CLGPELL P1TUITION

*convert code (-9) for missing data to "."


mvdecode _all, mv(-9=.)

*test assumption of missing completely at random (MCAR) of two variables


mcartest S3CLGPELL P1TUITION

*add covariates to test the covariate-dependent missingness (CDM)


mcartest S3CLGPELL P1TUITION = i.X1RACE if X1RACE !=. , unequal emoutput nolog

*exit Stata
exit
*end

References

Cox, N. J. (2015). Speaking Stata: A set of utilities for managing missing values. The Stata
Journal, 15(4), 1174–1185.
Li, C. (2013). Little's test of missing completely at random. The Stata Journal, 13(4),
795–809. https://doi.org/10.1177/1536867X1301300407
Little, R. J. (1988). A test of missing completely at random for multivariate data with
missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
Medeiros, R. A., & Blanchette, D. (2011). MDESC: Stata module to tabulate prevalence
of missing values. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s457318.html
Nguyen, M. C. (2008). XTMIS: Stata module to report missing observations for each
variable in xt data. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s456945.html
Schpero, W. L. (2018). STATASTATES: Stata module to add US state identifiers to
dataset. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s458205.html
StataCorp. (2019). Stata User's Guide Release 16. Stata Press.
Chapter 6
Using Descriptive Statistics and Graphs

Abstract This chapter discusses the use of descriptive statistics and
graphs to present information to policymakers and to conduct exploratory
data analysis (EDA). Descriptive statistics, which include measures of central
tendency and dispersion, are discussed and demonstrated using real data. The
use of graphs, which includes histograms, box charts, and scatter plots, is
also presented.

Keywords Descriptive statistics · Graphs

6.1 Introduction

The use of descriptive statistics and graphs is essential in higher education
policy analysis. Many policymakers rely, sometimes solely, on the information
provided by analysts and researchers who use those basic analytical tools. In
many ways, descriptive statistics and graphs also serve as a basis upon which
more advanced statistical analyses are built and used to convey additional
information and evaluate higher education policy outcomes.
This chapter discusses the use of descriptive statistics and graphs in
higher education policy analysis, in terms of providing basic information to
policymakers and conducting exploratory data analysis (EDA). Building on
the previous chapter, this chapter elaborates on how descriptive statistics
and graphs can be employed to better understand the nature of data being
used for higher education policy analysis. The Stata commands and syntax
that are used throughout this chapter are included in an appendix.

© Springer Nature Switzerland AG 2021
M. Titus, Higher Education Policy Analysis Using Quantitative Techniques,
Quantitative Methods in the Humanities and Social Sciences,
https://doi.org/10.1007/978-3-030-60831-6_6

6.2 Descriptive Statistics

Measures of central tendency are commonly used to generate descriptive
statistics from continuous data. These measures of central tendency include
the mean or average and the median. Measures of dispersion, such as the
variance and standard deviation, as well as minimum and maximum, are
sometimes generated as a part of descriptive statistics that are provided
to data users. Percentiles and proportions, which could be characterized as
measures of distribution, may also be included in a set of descriptive statistics
that are made available to policymakers.

6.2.1 Measures of Central Tendency

We commonly use the arithmetic mean or average and the median to provide
basic information to policymakers and other data users. The average is
reflected in the formula below:

    A = (1/n) * sum_{i=1}^{n} x_i                              (6.1)

where A is the average, n is the number of terms (e.g., items, cases, etc.,
being averaged), and x_i is the value of each individual term in the list of
terms being averaged. Using cross-sectional data introduced in Chap. 4 and
the Stata command, ameans, we can easily demonstrate how to compute
the arithmetic means for both public and private high school graduates in
2012 who enrolled in post-secondary education (PSE) institutions by state
and the District of Columbia (DC).
cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata“
use ”US high school graduates in 2012 enrolled in PSE, by state.dta“

. ameans public private


Variable | Type Obs Mean [95% Conf. Interval]
-------------+---------------------------------------------------------------
public | Arithmetic 51 61748.73 40916.34 82581.11
| Geometric 51 36454.28 27008.66 49203.29
| Harmonic 51 21282.97 16156.79 31173.64
-------------+---------------------------------------------------------------
private | Arithmetic 51 6054.314 4144.743 7963.885
| Geometric 51 3190.959 2204.461 4618.915
| Harmonic 51 1009.752 555.4843 5541.604
-------------+---------------------------------------------------------------

In the output above, we can see the geometric and harmonic means in
addition to the arithmetic means. (The output also includes the number
of observations and the 95% confidence intervals, which we will discuss

later.) While interesting, the geometric and harmonic means are almost never
provided to policymakers and other data users.1 So if we wanted to generate
only the arithmetic mean, we could use the Stata command, mean, which
would result in the following output.
. mean public private
Mean estimation Number of obs = 51
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
public | 61748.73 10371.81 40916.34 82581.11
private | 6054.314 950.7169 4144.743 7963.885
--------------------------------------------------------------
In addition to the mean, we also see the standard errors (Std. Err.) and the
95% confidence intervals (95% Conf. Interval), both of which we will ignore
for now. From this output, we see the average (mean) number of public high
school graduates who enrolled in PSE institutions across all 50 states and DC
during 2012 was 61,749. The average number of private high school graduates
who enrolled in PSE institutions was 6054.
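For readers who want to verify how the three means reported by ameans relate
to one another, here is a minimal Python sketch using a small made-up list of
values (not the PSE data). For positive, non-constant data, the arithmetic
mean always exceeds the geometric mean, which in turn exceeds the harmonic
mean.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Made-up values, chosen so the three means are easy to check by hand
x = [100, 400, 1600]

arith = mean(x)            # sum of the values divided by n
geo = geometric_mean(x)    # nth root of the product of the values
harm = harmonic_mean(x)    # reciprocal of the mean of the reciprocals

print(arith, geo, harm)    # arith = 700; geo is ~400; harm is ~228.6
```

Note that geometric_mean and harmonic_mean require Python 3.8 or later in
the statistics module.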
If we are interested in other measures of central tendency, such as the
median, we can use the Stata command summarize, detail (abbreviated
sum, detail).

. sum, detail
Stateid
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 3 2
10% 6 3 Obs 51
25% 13 4 Sum of Wgt. 51

50% 26 Mean 26
Largest Std. Dev. 14.86607
75% 39 48
90% 46 49 Variance 221
95% 49 50 Skewness 0
99% 51 51 Kurtosis 1.799077

State abbreviation
-------------------------------------------------------------

1 The geometric mean multiplies rather than sums values, then takes the nth root rather

than dividing by n. The harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals of the numbers in a dataset.

no observations

FIPS code
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 4 2
10% 8 4 Obs 51
25% 16 5 Sum of Wgt. 51

50% 29 Mean 28.96078


Largest Std. Dev. 15.83283
75% 42 53
90% 50 54 Variance 250.6784
95% 54 55 Skewness -.0192853
99% 56 56 Kurtosis 1.895209

State name
-------------------------------------------------------------
no observations

Total number of graduates from HS located in the


state
-------------------------------------------------------------
Percentiles Smallest
1% 5603 5603
5% 7322 5680
10% 8456 7322 Obs 51
25% 18238 7789 Sum of Wgt. 51

50% 44575 Mean 67919.12


Largest Std. Dev. 80127.72
75% 76177 171404
90% 146493 209216 Variance 6.42e+09
95% 209216 306591 Skewness 2.842139
99% 451364 451364 Kurtosis 12.70425

Number of graduates from public HS located in the


state
-------------------------------------------------------------
Percentiles Smallest
1% 3860 3860
5% 6859 5553
10% 8196 6859 Obs 51
25% 17568 6942 Sum of Wgt. 51

50% 38681 Mean 61748.73


Largest Std. Dev. 74069.52
75% 65667 151964
90% 131733 180806 Variance 5.49e+09
95% 180806 292531 Skewness 2.932485
99% 418664 418664 Kurtosis 13.23787

Number of graduates from private HS located in the


state
-------------------------------------------------------------
Percentiles Smallest
1% 50 50
5% 260 200
10% 670 260 Obs 51
25% 1750 380 Sum of Wgt. 51

50% 3040 Mean 6054.314


Largest Std. Dev. 6789.477
75% 8520 14760
90% 14030 19440 Variance 4.61e+07
95% 19440 28410 Skewness 2.096875
99% 32700 32700 Kurtosis 7.855749

Number of first-time freshmen graduating from HS


enrolled in any state
-------------------------------------------------------------
Percentiles Smallest
1% 2463 2463
5% 3732 3170
10% 5825 3732 Obs 51
25% 10241 4142 Sum of Wgt. 51

50% 29023 Mean 41638.04


Largest Std. Dev. 48201.53
75% 53836 107716
90% 87075 146458 Variance 2.32e+09
95% 146458 176871 Skewness 2.615638
99% 263843 263843 Kurtosis 11.16016

Number of first-time freshmen graduating from HS


enrolled in home state

-------------------------------------------------------------
Percentiles Smallest
1% 450 450
5% 2413 2040
10% 4443 2413 Obs 51
25% 6179 2426 Sum of Wgt. 51

50% 23268 Mean 33913.9


Largest Std. Dev. 41608.52
75% 38812 94985
90% 69039 117960 Variance 1.73e+09
95% 117960 156566 Skewness 2.826083
99% 231215 231215 Kurtosis 12.40817

Estimated rate of HS graduates going to college in


any state
-------------------------------------------------------------
Percentiles Smallest
1% 43.36268 43.36268
5% 46.93976 45.57333
10% 50.82883 46.93976 Obs 51
25% 56.57683 48.01237 Sum of Wgt. 51

50% 61.46578 Mean 60.72702


Largest Std. Dev. 7.220225
75% 65.25911 70.00325
90% 69.3848 70.67225 Variance 52.13166
95% 70.67225 70.75149 Skewness -.3147326
99% 78.78177 78.78177 Kurtosis 3.100983

Estimated rate of HS graduates going to college in


home state
-------------------------------------------------------------
Percentiles Smallest
1% 7.922535 7.922535
5% 29.46636 26.19078
10% 35.77116 29.46636 Obs 51
25% 41.41141 33.27016 Sum of Wgt. 51

50% 48.85525 Mean 47.22305


Largest Std. Dev. 10.45517
75% 54.84477 58.8365
90% 56.38192 60.36491 Variance 109.3106
95% 60.36491 60.77484 Skewness -1.002735
99% 73.12088 73.12088 Kurtosis 5.818673

In addition to the mean, the output, as shown above, provides the
median (50th percentile or 50%) for both public and private high school
graduates enrolled in PSE institutions. We also see the standard deviation
(Std. Dev.), variance, skewness, and kurtosis (all of which we will discuss
later). Comparing the median or 50th percentile to the mean gives us a
rough indication as to whether or not our data are normally distributed (i.e.,
a bell curve with approximately 68% of the values of a variable lying within
one standard deviation of the mean).
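This rough check is easy to illustrate with made-up numbers (a minimal
Python sketch, not the graduate data): a single large value pulls the mean
well above the median, while a symmetric batch leaves the two equal.

```python
from statistics import mean, median

# Made-up counts: right-skewed (one large value pulls the mean up)
skewed = [5, 6, 7, 8, 9, 10, 60]
# ...and a roughly symmetric set for comparison
symmetric = [5, 6, 7, 8, 9, 10, 11]

m_skew, med_skew = mean(skewed), median(skewed)
m_sym, med_sym = mean(symmetric), median(symmetric)

print(m_skew, med_skew)   # 15 vs. 8: mean far above median -> right skew
print(m_sym, med_sym)     # 8 vs. 8: mean equals median in symmetric data
```

This mirrors what the sum, detail output shows for the graduate counts, where
means sit well above medians and skewness statistics are large and positive.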

6.2.2 Measures of Dispersion

A standard deviation is a measure of how dispersed the data are
around the central tendency or mean. The standard deviation (std. dev.) is
simply the square root of the variance. Other measures of dispersion include
the relative standard deviation or the coefficient of variation (CV). The CV,
which is the ratio of the standard deviation to the mean, does not depend
on the unit of measurement (e.g., dollars, FTE students, etc.). Therefore,
in most instances, we can use the CV to compare the dispersion of values
for variables that have different units of measurement. For example, using a
dataset discussed in Chap. 5, we can compute the CV for net tuition revenue
(NetTuition) and FTE students (FTEStudents) by employing the tabstat
command:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.2.dta"
tabstat NetTuition FTEStudents, stat(cv)

. tabstat NetTuition FTEStudents, stat(cv)

   stats |  NetTuin  FTEStus
---------+--------------------
      cv | .8863343 1.166096
------------------------------

The output above shows that FTE students are more dispersed than net
tuition revenue within the U.S. between FY 2010 and FY 2018. The tabstat
command can include options to include other statistics such as the mean,
median (50th percentile), standard deviation, minimum, and maximum by
unit (e.g., state). The options can also include specifying the width of the
variable labels, a long format, displaying the statistics in columns rather than
rows, and suppressing the column total. For example, the syntax would be as
follows (all on one line):
tabstat NetTuition FTEStudents, stat(mean median sd min max
cv) labelwidth(30) long format by(state) col(stat) nototal
In Fig. 6.1 (the remainder of the output after Idaho is omitted), we can
compare the descriptive statistics for net tuition revenue and FTE students
across states. Using tabstat with options, we can show the same set of
statistics by fiscal year.

Fig. 6.1 Net tuition revenue and FTE students by state, descriptive statistics
In Fig. 6.2, we see that the coefficient of variation (CV) of net tuition
revenue declined slightly between FY 2010 and FY 2018. We also see that
the CV of net tuition revenue across states has been consistently less than
that of FTE students over the same time period.
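The unit-invariance that makes the CV comparable across differently measured
variables is easy to verify. Below is a minimal Python sketch with made-up
state-level figures (not the SHEEO data); rescaling a variable, say from
dollars to thousands of dollars, leaves its CV unchanged.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return stdev(values) / mean(values)

# Made-up state-level figures, in different units (illustrative only)
net_tuition = [120.0, 80.0, 200.0, 55.0]    # $ millions
fte_students = [30000, 12000, 45000, 9000]  # FTE students

# The two CVs are directly comparable despite the different units
print(round(cv(net_tuition), 3), round(cv(fte_students), 3))

# Rescaling units does not change the CV (millions -> thousands)
rescaled = [x * 1000 for x in net_tuition]
assert abs(cv(net_tuition) - cv(rescaled)) < 1e-12
```

This is why, in the tabstat output above, the CVs of net tuition revenue
(dollars) and FTE students (headcounts) can be compared directly.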

6.2.3 Distributions

We can include various percentiles, in which we can observe how many of
the cases in the dataset fall within a certain percentage range. If our data
includes categorical variables (e.g., gender, race/ethnicity, types of student
financial aid, level of institution, state-level higher education governance
structure, etc.), then our descriptive statistics could include frequencies and
cross tabulations (crosstabs). Using data from the High School Longitudinal
Study of 2009 (HSLS:09), we demonstrate how to show the frequencies of
various racial/ethnic categories of the variable X1RACE.

Fig. 6.2 Net tuition revenue and FTE students by year, descriptive statistics

Fig. 6.3 HSLS:09 race/ethnicity categories
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files"
use "Example 6.1.dta", clear
prop X1RACE

The output is shown in Fig. 6.3.


Notice that the proportions are shown in decimal rather than percentage
format. This is not particularly appealing to a lay audience. Instead, a more
user-friendly output is generated by the tabulate (tab) command. We can
also easily show these statistics in descending order of frequency by simply
adding the option, sort.
tab X1RACE, sort
In addition to the frequencies, the percentages for each race/ethnicity and
cumulative percentages are also shown in Fig. 6.4. While this may be useful
information, it may be incomplete with regard to other key data, such
as hourly earnings by race/ethnicity. To generate those statistics, we can use
the following syntax: tab X1RACE, summarize(EarnHr). This produces
the output shown in Fig. 6.5, where we see the mean hourly earnings, along
with the standard deviation and frequency by race/ethnicity.
When we have more than one categorical variable, we may also want to
generate two-way tables showing a summary statistic (e.g., means) across
those categories. Expanding on the example above, if we want to show the
mean hourly earnings by race/ethnicity by sex, we can use the following
syntax:
tab X1RACE X1SEX, sum(EarnHr) means
This will produce the output, as shown in Fig. 6.6.
We notice that this particular command results in the truncation of long
value labels for the race/ethnicity category in the first column. This problem
may be addressed in a couple of ways. First, we create a new categorical
variable (i.e., RaceEthnic) reflecting race/ethnicity with value labels based
on fewer categories, linking the new variable to the value labels. We do this
using the following syntax:
codebook X1RACE
gen RaceEthnic = 0
replace RaceEthnic = 1 if X1RACE==2
replace RaceEthnic = 2 if X1RACE==3
replace RaceEthnic = 3 if X1RACE==4 | X1RACE==5

Fig. 6.4 HSLS:09 race/ethnicity categories, sorted by percent



Fig. 6.5 HSLS:09 race/ethnicity categories, mean earnings per hour

Fig. 6.6 HSLS:09 race/ethnicity categories by sex, mean earnings per hour

replace RaceEthnic = 4 if X1RACE==6
replace RaceEthnic = 5 if X1RACE==1 | X1RACE==7
replace RaceEthnic = 6 if X1RACE==8
lab var RaceEthnic "Race/Ethnicity"
label define RaceEthnic1 1 Asian 2 Black 3 Hispanic 4 Multiracial 5 Other 6 White
label values RaceEthnic RaceEthnic1
Then we type the following syntax:
tabulate RaceEthnic X1SEX, sum(EarnHr) means

The output for the last line of syntax is shown in Fig. 6.7.
An alternative to the above procedure is to use the table command with
variable X1RACE, reflecting the original race/ethnicity categories to depict
mean hourly earnings by race/ethnicity by sex.
. table X1RACE X1SEX, contents(mean EarnHr)
This is shown in the output below in Fig. 6.8.
The table command provides more options with regard to the statistics
(e.g., median, percentiles, etc.) that can be shown. Formatting options can
also be included, such as stub (first column) width and other features.
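The computation behind these two-way tables — a mean of the outcome within
each cell defined by the two categorical variables — can be sketched in a few
lines of Python. The records below are made up for illustration (they are not
HSLS:09 values).

```python
from collections import defaultdict

# Made-up records of (race_ethnicity, sex, hourly earnings) -- illustrative only
records = [
    ("Asian", "Male", 14.0), ("Asian", "Female", 13.0),
    ("Black", "Male", 11.0), ("Black", "Female", 12.0),
    ("Asian", "Male", 16.0),
]

# Accumulate sums and counts per (race, sex) cell
sums = defaultdict(float)
counts = defaultdict(int)
for race, sex, earnings in records:
    sums[(race, sex)] += earnings
    counts[(race, sex)] += 1

# Cell means: the body of a two-way table of mean earnings by race and sex
means = {cell: sums[cell] / counts[cell] for cell in sums}
print(means[("Asian", "Male")])  # 15.0, the mean of 14.0 and 16.0
```

Both tab varlist, sum() means and table ..., contents(mean ...) are doing
exactly this grouping-and-averaging, then laying the cell means out in a grid.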

Fig. 6.7 HSLS:09 race/ethnicity categories by sex, mean earnings per hour

Fig. 6.8 HSLS:09 race/ethnicity categories by sex, mean earnings per hour

The examples above demonstrate how we can generate descriptive statistics
from cross-sectional data. When we are using panel data, we can
also generate descriptive statistics by employing the xttab command.
This command is useful when showing the distribution of time-invariant
categorical variables. Using a panel dataset (50 states across 27 years,
1990–2016) from the same working directory above, this is demonstrated below:
use "Example 6.3.dta"
. xtdes
stateid: 1, 2, ..., 50 n = 50
year: 1990, 1991, ..., 2016 T = 27
Delta(year) = 1 year
Span(year) = 27 periods
(stateid*year uniquely identifies each observation)

Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                        27     27     27     27     27     27     27

Freq. Percent Cum. | Pattern


---------------------------+-----------------------------
50 100.00 100.00 | 111111111111111111111111111
---------------------------+-----------------------------
50 100.00 | XXXXXXXXXXXXXXXXXXXXXXXXXXX

From the output above, we see that our data are balanced, with each of
the 50 states having 27 years of data. We then invoke the xttab command.
. xttab region_compact

Overall Between Within


region_t | Freq. Percent Freq. Percent Percent
----------+-----------------------------------------------------
None | 81 6.00 3 6.00 100.00
SREB | 432 32.00 16 32.00 100.00
WICHE | 351 26.00 13 26.00 100.00
MHEC | 324 24.00 12 24.00 100.00
NEBHE | 162 12.00 6 12.00 100.00
----------+-----------------------------------------------------
Total | 1350 100.00 50 100.00 100.00
(n = 50)

In this particular example, we are looking at the distribution of states


by their membership in regional interstate compacts for education (where
SREB is the Southern Regional Education Board; WICHE is the Western
Interstate Commission for Higher Education; MHEC is the Midwestern
Higher Education Compact; and NEBHE is the New England Board of
Higher Education). The overall frequency in each category is larger than
the between frequency, but the overall percent in each category is the same
as the between percent. Take note that we have 1350 state-years of data.
Because state membership in regional interstate compacts does not change
over time, the between frequency is the most useful part of the output above.

If we have time-variant categorical variables, then xttrans is a more


appropriate command to show distributions. Using the same dataset, we
estimate the transition probabilities of changing categories. For example, we
can show the probability of a state changing from providing undergraduate
merit aid to not providing undergraduate merit aid, and vice versa.
. xttrans ugradmerit

Undergradu | Undergraduate Merit


ate Merit | Aid
Aid | 0 1 | Total
-----------+----------------------+----------
0 | 91.88 8.12 | 100.00
1 | 2.01 97.99 | 100.00
-----------+----------------------+----------
Total | 26.69 73.31 | 100.00
In this example, of the states that did not offer undergraduate merit aid in
a given year, 92% still did not offer it in the next year, while 8% began
offering it. States that offered merit aid had only a 2% chance of becoming
states that did not offer merit aid. Another way of looking at this output:
merit aid was offered in 73% of the state-years, and 98% of those were
followed by a year in which merit aid was again offered.
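The row percentages that xttrans reports are simply year-to-year transition counts divided by their row totals. Outside Stata, the same computation can be sketched in Python; the toy 0/1 panel below is made up for illustration and is not the book's state dataset:

```python
# Sketch: row-normalized transition matrix for a binary panel variable,
# mirroring what Stata's xttrans reports. Toy data, not the book's dataset.
from collections import defaultdict

# panel[state] = merit-aid indicator by year (0/1), made-up values
panel = {
    "A": [0, 0, 1, 1, 1],
    "B": [1, 1, 1, 1, 1],
    "C": [0, 0, 0, 1, 1],
}

counts = defaultdict(int)
for series in panel.values():
    for prev, curr in zip(series, series[1:]):
        counts[(prev, curr)] += 1

def transition_row(state_from):
    # divide each year-to-year count by its row total
    total = counts[(state_from, 0)] + counts[(state_from, 1)]
    return [counts[(state_from, c)] / total for c in (0, 1)]

row0 = transition_row(0)  # Pr(0 -> 0), Pr(0 -> 1)
row1 = transition_row(1)  # Pr(1 -> 0), Pr(1 -> 1)
print(row0, row1)
```

Each row sums to one, just as each row of the xttrans table sums to 100.00.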

6.3 Graphs

In addition to descriptive statistics, researchers should also use graphs to
conduct exploratory data analysis (EDA) and to convey information to
policymakers and other consumers of data.

6.3.1 Graphs—EDA

When conducting exploratory data analysis (EDA), graphs are useful tools
to quickly and initially determine whether certain assumptions of various
statistical techniques, such as regression, are valid. To ascertain if data for
a particular continuous variable has a normal distribution, one can create a
histogram with a superimposed normal curve. We illustrate this by using the
dataset above. First, we create a new variable, stapr_fte (state appropriations
per FTE student). Then we create a histogram, with a superimposed normal
curve, of the stapr_fte data.
gen stapr_fte = stapr/fte
histogram stapr_fte, normal

Figure 6.9 shows that stapr_fte data are not normally distributed and
skewed to the right. This indicates that before any additional analysis (e.g.,
regression) is conducted, a transformation (e.g., logarithmic) of the data may
be required. (More on this will be discussed in the next chapter.)
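The effect of a logarithmic transformation on a right-skewed variable can also be sketched outside Stata. A minimal Python illustration with simulated lognormal data (the simulation and sample size are illustrative; sample skewness is computed from its standard formula):

```python
# Sketch: a log transformation pulls in the long right tail of a
# positively skewed variable, as measured by sample skewness.
import math
import random

random.seed(42)
data = [random.lognormvariate(0, 1) for _ in range(2000)]

def skewness(xs):
    # standard (biased) sample skewness: mean cubed standardized deviation
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

skew_raw = skewness(data)                          # strongly positive
skew_log = skewness([math.log(x) for x in data])   # near zero
print(round(skew_raw, 2), round(skew_log, 2))
```

The raw series is heavily right-skewed, while its log is approximately symmetric, which is exactly the pattern the histogram of stapr_fte suggests a log transformation would produce.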
We can also create a box chart of the same data to examine the distribution
of state appropriations data per FTE student by using the following syntax:
graph box stapr_fte
If the data were normally distributed, the line (the median) would be in
the middle of the box (which spans the 25th to the 75th percentile). We can
see in Fig. 6.10, however, that the median is closer to the lower end of the
box. The graph also shows outliers at the upper end of the box, indicating
a positive skew.
A histogram can also be created to provide a quick depiction of the
frequency of categories. For example, if we wanted to see the distribution
of states by regional compact, we can easily do so using the following syntax
with the added options to include labels and percent (all on one line):
histogram region_compact, discrete addlabels ylabel(,grid) xlabel(0 1 2 3
4, valuelabel) percent
From Fig. 6.11, we can easily see that the largest proportion (32%) of
the states are in the SREB compact. From an analytical perspective, this
information is useful if we need to know to what extent we may need
to collapse the data into a smaller number of categories, due to skewed
distributions across categories, prior to additional analysis.

Fig. 6.9 Histogram of state appropriations per FTE student



Fig. 6.10 Box chart of state appropriations per FTE student

Fig. 6.11 Histogram of membership in regional compacts

Given the distribution of states across regional compacts, we may also want
to see if state appropriations are distributed normally by regional compact.
This can be easily done by invoking the following syntax:

histogram stapr_fte, by(region_compact)


As we can see in Fig. 6.12, only state appropriations per FTE student in
the MHEC regional compact appear to be normally distributed. In terms
of descriptive statistics for data users and exploratory data analysis for
researchers, this by itself may be very interesting information. But it also may
be of interest with respect to the statistical assumptions when using more
advanced statistical techniques. (This is something we will further discuss in
subsequent chapters of the book.)
The use of box charts can also be extended to examine the distribution of
data by categories to show skewness and whether there are outliers.
graph box stapr_fte, by(region_compact)
Figure 6.13 indicates that with respect to state appropriations per FTE
student, there is positive skewness across all regional compact areas and
extreme outliers in the WICHE region, specifically in the upper tail of the
distribution, above the 75th percentile. While Fig. 6.12 appears to show that
state appropriations per FTE student in the MHEC regional compact states
are normally distributed, Fig. 6.13 suggests that those same data are not
normally distributed. The latter figure indicates that the line (the median) is
not in the middle of the box but instead pulled toward the low end (25th percentile)

Fig. 6.12 State appropriations per FTE student by regional compact



Fig. 6.13 Box chart of state appropriations per FTE student by regional compact

and has a few outliers at the upper end (75th percentile) of the box. So Fig.
6.13 provides additional information regarding the characteristics of the data.
As part of EDA, scatter plots can also be used by analysts to show the
simple relationship between two continuous variables at a given point in time.
For example, we can show how net tuition revenue per FTE student is related
to state appropriations per FTE student in fiscal year 2016. Here is the syntax
and results:
graph twoway scatter stapr_fte netuit_fte if year==2016
Figure 6.14 shows there is a negative relationship between state appropri-
ations per FTE student and net tuition revenue per FTE student. We can fit
a regression line (more on this in the next chapter) through the data points
in Fig. 6.14 by slightly changing the previous syntax and typing the following
(all on one line):
twoway (scatter stapr_fte netuit_fte) (lfit stapr_fte netuit_fte) if
year==2016

or
twoway scatter stapr_fte netuit_fte if year==2016 || lfit stapr_fte
netuit_fte if year==2016

Fig. 6.14 Scatter plot of state appropriations and net tuition revenue per FTE student

Fig. 6.15 Scatter plot of state appropriations and net tuition revenue per FTE student with a
fitted line

While we can see how far the data points are from the regression line, we do
not know which states are outliers (see Fig. 6.15). We can, however, identify
them by adding the option mlabel(state), as in the following syntax (all on one line).

twoway scatter stapr_fte netuit_fte if year==2016, mlabel(state) || lfit
stapr_fte netuit_fte if year==2016

Figure 6.16 shows that Alaska (AK), Arizona (AZ), Wyoming (WY),
Hawaii (HI), and Connecticut (CT) are among the outliers and can be
interpreted as having an influence on the regression line. This suggests those
states may need to be excluded from any subsequent analysis of the relationship
between state appropriations and net tuition revenue per FTE student in
2016.
Finally, scatter plots can be used to determine if the relationship between
two variables changes over time by employing a Stata user-written program,
aaplot (Cox 2015). After installing aaplot (ssc install aaplot), we create
scatter plots for two different time periods (1990 and 2016) to see if the
relationship remains the same over time.
aaplot netuit_fte stapr_fte if year==1990
aaplot netuit_fte stapr_fte if year==2016
Figure 6.17 shows the scatter plot for the 1990 data, annotated with some
statistics. First, we see a negative relationship between net tuition revenue
per FTE student and state appropriations per FTE student. Second, the R²,
which measures the fit between the data points and the regression line, is
9.3%. This means 9.3% of the variance in net tuition revenue per FTE
student is explained by the variance in state appropriations per FTE student
in 1990.

Fig. 6.16 Scatter plot of state appropriations and net tuition revenue per FTE student with a
fitted line

Fig. 6.17 State appropriations and net tuition revenue per FTE student and regression
line, FY1990

Fig. 6.18 State appropriations and net tuition revenue per FTE student and regression
line, FY 2016

Like the previous figure, Fig. 6.18 indicates there is a negative relationship
between net tuition revenue per FTE student and state appropriations per
FTE student. Compared to 1990, there is a slightly closer fit (R2 = 13%)

between the data points and the regression line. Together, the two graphs
(Figs. 6.17 and 6.18) suggest the negative relationship between net tuition
revenue per FTE student and state appropriations per FTE student persisted
over time.

6.4 Conclusion

The measures of central tendency and graphs that are discussed above are
examples of descriptive statistics and EDA that can be used to provide
basic information to data users. These basic methods can and should also be
employed to better understand the nature of data used with intermediate and
advanced methods such as multiple regression as well as other techniques that
are used in higher education policy analysis and evaluation. In the following
chapters, we will turn our attention to those methods and techniques.

6.5 Appendix

*Chapter 6 Syntax
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "US high school graduates in 2012 enrolled in PSE, by state.dta"
*compute the arithmetic means
ameans public private
mean public private

*measures of central tendency, we can use summarize, detail or (sum, detail)


sum, detail

*Measures of dispersion

*employing the tabstat command:


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.2.dta"
tabstat NetTuition FTEStudents, stat(cv)

*Distributions
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files"
use "Example 6.1.dta", clear

*frequencies and cross tabulations (crosstabs)


tab X1RACE
tab X1RACE, sort

*frequencies and cross tabulations (crosstabs) with descriptive statistics


tab X1RACE, summarize(EarnHr)
tab X1RACE X1SEX, sum(EarnHr) means

*create a new categorical variable (e.g., RaceEthnic) reflecting ///


race/ethnicity with value labels based on a fewer categories, ///
linking the new variable to the value labels

*look at how the variable is coded


codebook X1RACE

*create new categorical variable


gen RaceEthnic = 0
replace RaceEthnic = 1 if X1RACE==2
replace RaceEthnic = 2 if X1RACE==3
replace RaceEthnic = 3 if X1RACE==4 | X1RACE==5
replace RaceEthnic = 4 if X1RACE==6
replace RaceEthnic = 5 if X1RACE==1 | X1RACE==7
replace RaceEthnic = 6 if X1RACE==8

*create variable and value labels


lab var RaceEthnic "Race/Ethnicity"
label define RaceEthnic1 1 Asian 2 Black 3 Hispanic 4 Multiracial 5 Other 6 White

*link the new variable to the value labels


label values RaceEthnic RaceEthnic1

*frequencies and cross tabulations (crosstabs) with descriptive statistics


tabulate RaceEthnic X1SEX, sum(EarnHr) means

*with labels
table X1RACE X1SEX, contents(mean EarnHr)
clear

*descriptive statistics for panel data


*open panel dataset
use "Example 6.3.dta"
xtdes

*descriptive statistics for time-invariant variables


xttab region_compact

*descriptive statistics for time-variant variables


xttrans ugradmerit

*Graphs - EDA
*histogram, with a superimposed normal curve
*create a new variable
gen stapr_fte = stapr/fte

*create histogram with a superimposed normal curve


histogram stapr_fte, normal
*box chart of the same data
graph box stapr_fte

*histogram with the frequency of categories with percent


histogram region_compact, discrete addlabels ///
ylabel(,grid) xlabel(0 1 2 3 4, valuelabel) percent

*histograms of the continuous variable by categories
histogram stapr_fte, by(region_compact)

*box charts to examine the distribution of data by categories


graph box stapr_fte, by(region_compact)

*scatter plots to show the simple relationship between two continuous variables
graph twoway scatter stapr_fte netuit_fte if year==2016

*scatter plots to show the simple relationship between two continuous ///
variables with fitted regression line
twoway (scatter stapr_fte netuit_fte) (lfit stapr_fte netuit_fte) if year==2016

*adding the option mlabel(state)


twoway scatter stapr_fte netuit_fte, ///
mlabel(state) || lfit stapr_fte netuit_fte || if year==2016

*use of a Stata user-written program, aaplot (Cox 2015)


*install aaplot

ssc install aaplot

*run aaplot for two different time periods (1990 & 2016)
aaplot netuit_fte stapr_fte if year==1990
aaplot netuit_fte stapr_fte if year==2016

*close dataset
clear

*end

Reference

Cox, N. J. (2015). AAPLOT: Stata module for scatter plot with linear and/or quadratic
fit, automatically annotated. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s457286.html
Chapter 7
Introduction to Intermediate
Statistical Techniques

Abstract This chapter introduces intermediate statistical techniques, which


include pooled ordinary least squares (OLS), fixed-effects, and random-
effects regression. This chapter demonstrates how we can use these statistical
techniques to analyze panel data. It shows how various tests can be
conducted to determine the appropriate method that should be employed in
correlational studies. The chapter also introduces how multivariate regression
can be modified to infer causal effects by including difference-in-differences
estimators.

Keywords Ordinary least squares (OLS) regression · Fixed-effects
regression · Random-effects regression

7.1 Introduction

This chapter is an introduction to intermediate statistical methods that


are used in higher education policy analysis and evaluation. Starting with
ordinary least squares (OLS) regression models, this chapter presents the
tools that are indispensable to higher education researchers and policy
analysts to conduct correlational studies. Many policy analysts and most
higher education researchers continue to rely heavily on correlational methods
(Hutchinson and Lovell 2004; Wells et al. 2015). Therefore, this chapter will
discuss the use of correlational methods, such as OLS, fixed-effects, and
random-effects regression models to conduct higher education policy analysis
and evaluation. The Stata commands and syntax used to demonstrate these
methods are included in an appendix at the end of the chapter.

© Springer Nature Switzerland AG 2021 103


M. Titus, Higher Education Policy Analysis Using Quantitative Techniques,
Quantitative Methods in the Humanities and Social Sciences,
https://doi.org/10.1007/978-3-030-60831-6_7

7.2 Review of OLS Regression

OLS regression, a special case of the generalized linear model (GLM), is the
most widely used quantitative analytical technique in education and social
science research. It is easy to use and interpret and involves a single
continuous dependent variable (also called the outcome, endogenous, or
response variable) and one or more independent variables (also called
predictors, covariates, exogenous, or explanatory variables). It is commonly
used in the social sciences and education to:
• model reality;
• test hypotheses/theories; and
• forecast or predict outcomes.
The purpose of the OLS regression technique is to explain differences in the
dependent variable (DV) that may depend on the independent variables (IVs).
It is also employed to predict when a specific value of the DV might occur
based on the values of the IVs. More specifically, OLS regression techniques
involve estimating:
• the form of the relationship between the DV and IVs;
• the direction and strength of association between DV and IVs; and
• which IVs are important (i.e., statistically significant) and which are not.
When using OLS regression, it is this last point that is most relevant to
higher education policy analysts and researchers.
Scatter plots with a fitted line are visual representations of simple or
bivariate OLS regression models where the dependent variable is on the
vertical axis (y) and the independent variable is on the horizontal axis (x ).
Mathematically, this is represented as:

Ŷi = β0 + β1 Xi + εi    (7.1)

where Ŷi is the expected value of Yi, β0 is the constant, or the expected or
average value of Yi when the independent variable X is zero, β1 is the
estimated parameter for independent variable X, and εi is the error term
uniquely associated with state i.

7.2.1 The Assumptions of OLS Regression

OLS regression is based on seven “classical” assumptions, which are as follows:



1. Linearity—The regression model is linear in the coefficients and the error
term εi.
2. The error term (εi ) has a population mean of zero.
3. Homoscedasticity—The error term (εi ) has a constant variance. There is
no heteroscedasticity.
4. No autocorrelation or serial correlation—There is no systematic pattern to
the errors (εi ). Observations of the error term (εi ) are uncorrelated with
each other.
5. No independent variable is a perfect linear function of other explanatory
variables.
6. Exogeneity—All independent variables are uncorrelated with the error
term (εi ).
7. The error term (εi ) is normally distributed (optional).
Although it need not be assumed, the number of observations should be
greater than the number of independent variables. We will return to this point
when the limitations of using small cross-sectional datasets are discussed.
Later in this chapter, we will also discuss how to check for violations of these
assumptions when using OLS and other regression techniques.
OLS regression minimizes the sum of the squared errors (SSE).
SSE = Σ_{i=1}^{n} (Yi − Ŷi)²    (7.2)

where Yi is the actual outcome and Ŷi is the expected outcome for unit
(e.g., state) i.

7.2.2 Bivariate OLS Regression

For an OLS regression model with one independent variable, or what is known
as a bivariate OLS regression model, the estimated beta coefficient (the slope
shown in Figs. 6.17 and 6.18) is calculated as:

β1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)²    (7.3)

where Xi is the observed value of the independent variable, X̄ is the mean
value of X, Yi is the observed value of the dependent variable, and Ȳ is the
mean value of Y. After computing β1, the intercept (β0) is computed as:

β̂0 = Ȳ − β̂1 X̄    (7.4)

The above formula is for a bivariate regression model.1


If X or a vector of X (set of X independent variables) is equal to zero,
then β 0 is simply the expected mean of Y. In most social science and higher
education policy research, X or a vector of X is almost never zero. Therefore
when using OLS regression, β 0 is rarely of interest to higher education
policy analysts and researchers. If, however, the regression model has binary
categorical independent variables, the intercept is the mean value only for the
reference group when all other independent variables are zero. But again, this
is almost never the case when conducting higher education policy research.
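Formulas (7.3) and (7.4) can be verified by hand outside Stata. A minimal Python sketch with made-up (x, y) pairs, including a spot check that the slope from (7.3) does minimize the sum of squared errors in (7.2) relative to nearby slopes:

```python
# Sketch: slope and intercept from formulas (7.3) and (7.4), plus a check
# that the fitted slope minimizes the sum of squared errors (7.2).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up independent variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up dependent variable

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# formula (7.3): covariance-like numerator over variance-like denominator
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
beta0 = y_bar - beta1 * x_bar    # formula (7.4)

def sse(b0, b1):
    # formula (7.2): sum of squared errors
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# the OLS slope should beat slightly perturbed slopes
assert sse(beta0, beta1) < sse(beta0, beta1 + 0.1)
assert sse(beta0, beta1) < sse(beta0, beta1 - 0.1)
print(round(beta1, 3), round(beta0, 3))
```

For these made-up data the slope works out to 1.99 and the intercept to 0.05, and no nearby slope yields a smaller SSE.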
Using the cross-sectional 2016 data from the example in the previous
chapter and Stata syntax, we can generate bivariate (one independent
variable) OLS regression output by typing the following:
regress netuit_fte stapr_fte if year ==2016
We get this output:
. regress netuit_fte stapr_fte if year ==2016

Source | SS df MS Number of obs = 50


-------------+---------------------------------- F(1, 48) = 7.19
Model | 62525800.7 1 62525800.7 Prob > F = 0.0100
Residual | 417181345 48 8691278.01 R-squared = 0.1303
-------------+---------------------------------- Adj R-squared = 0.1122
Total | 479707145 49 9789941.74 Root MSE = 2948.1

------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -.354383 .132125 -2.68 0.010 -.6200382 -.0887278
_cons | 10192.75 1107.684 9.20 0.000 7965.599 12419.9
------------------------------------------------------------------------------

We can see that R2 , 0.1303, is the same value as what was shown
in Fig. 6.18. But the regression output provides an analysis-of-variance
(ANOVA) table with the model and residual (errors) sum of squares (SS),
the degrees of freedom (df), and mean square (MS).2 Information from the
ANOVA table can be used to calculate the R2 , which is the regression
model sum of squares (RSS) divided by the total sum of squares (TSS) or
R2 = RSS/TSS, where
RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²

1 For OLS regression formulas with more than one independent variable, see introductory

mathematical statistics texts.


2 It is assumed the reader is familiar with ANOVA.


TSS = Σ_{i=1}^{n} (Yi − Ȳ)²

With respect to the overall regression model, the output includes the
F-statistic and its statistical significance, the adjusted R², and the root
mean square error (MSE), where

adjusted R² = (R² − k/(n − 1)) × ((n − 1)/(n − k − 1))

and k is the number of independent variables and n is the number of
observations.

The F-statistic compares a model with no independent variables (an
intercept-only model) to the model with one or more independent variables.
The null hypothesis is that the intercept-only model and the model with
independent variables fit the data equally well. The null hypothesis is
rejected if the model with one or more independent variables significantly
reduces the unexplained variation relative to the intercept-only model. If
we can reject the null hypothesis (Prob > F is
less than 0.05), then we can conclude that the model with one or more
We see from the output above that this is indeed the case. In the above
output, we also see the root MSE is the square root of the MS of the residual
(shown in the ANOVA table) or the standard deviation of the residuals. It
is an indication of the concentration of the data around the regression line.
The lower the values of the root MSE, the better the regression model fits
the data.
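These fit statistics can be recomputed directly from the ANOVA table in the bivariate output above (model SS = 62525800.7, residual SS = 417181345, n = 50, k = 1 independent variable). A quick arithmetic check in Python:

```python
# Sketch: reproduce R-squared, adjusted R-squared, root MSE, and F from the
# sums of squares reported in the bivariate regression output above.
import math

rss = 62525800.7      # model (explained) sum of squares
sse = 417181345.0     # residual sum of squares
n, k = 50, 1          # observations, independent variables

tss = rss + sse
r2 = rss / tss                                        # about 0.1303
adj_r2 = (r2 - k / (n - 1)) * (n - 1) / (n - k - 1)   # about 0.1122
root_mse = math.sqrt(sse / (n - k - 1))               # about 2948.1
f_stat = (rss / k) / (sse / (n - k - 1))              # about 7.19
print(round(r2, 4), round(adj_r2, 4), round(root_mse, 1), round(f_stat, 2))
```

All four values match what Stata prints in the upper panel of the regression output.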
The estimated beta coefficients for state appropriations per FTE student
(stapr_fte) and the constant (_cons), as well as the standard error (Std.
Err.), t statistic (t), statistical significance (P > |t|), and 95% confidence
intervals are also shown. The standard error reflects the average distance
between the data points and the regression line and is represented in the
following formula:

 

sβ = √[ (Σ(Yi − Ŷi)² / (N − 2)) / Σ(Xi − X̄)² ]

The t statistic is estimated as follows:

tn−2 = β̂ / sβ

The smaller sβ is, the larger tn−2 is, and the more likely it is that the
null hypothesis will be rejected and that the parameter estimate (β) will
be judged statistically significant. If we can reject the null hypothesis with more

than 95% certainty (95% of the values of β lie within mean ±1.96 * standard
deviation) then we can say β is not equal to zero or not the result of statistical
chance. This is the same as saying there is less than a 5% probability (p value
<0.05) the estimated β coefficient is equal to zero. In education and most of
the social sciences, p < 0.05, is acceptable to claim statistical significance. If
the p value is greater than 0.05, then it is not acceptable and we cannot reject
the null hypothesis. To claim statistical significance with respect to the
state appropriations per FTE student variable, we would need to reject the
null hypothesis (H0) that the beta coefficient β1 is equal to zero and accept
the alternative hypothesis (Ha) that β1 is not equal to zero. This is
represented as:

H0: β1 = 0
Ha: β1 ≠ 0
So the standard errors of the βs are VERY important!
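The reported t statistic and 95% confidence interval for stapr_fte follow directly from the coefficient and its standard error. A quick check in Python; the 97.5th-percentile t value for 48 degrees of freedom is hardcoded as an approximation rather than looked up from a distribution function:

```python
# Sketch: t statistic and 95% confidence interval for stapr_fte, from the
# coefficient and standard error in the bivariate output above.
beta = -0.354383
se = 0.132125
t_stat = beta / se                 # about -2.68, matching the output

t_crit = 2.0106                    # t(0.975, df=48), hardcoded approximation
ci_low = beta - t_crit * se
ci_high = beta + t_crit * se       # about (-0.620, -0.089), matching the output
print(round(t_stat, 2), round(ci_low, 4), round(ci_high, 4))
```

Because zero lies outside the interval, the null hypothesis that β1 = 0 is rejected at the 5% level.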
The adjusted R², which takes into account the one independent variable X,
is lower than the unadjusted R². It suggests that 11% of the variability of
net tuition revenue per FTE student is explained by the regression model.
The estimated beta coefficient for state appropriations per FTE student is
−.354383, equal to what is shown in Fig. 6.18. This suggests that, on average,
a one dollar increase in state appropriations per FTE student will result in
a decrease of 35 cents in net tuition revenue per FTE student. The t statistic
for stapr_fte equals −2.68, which means the coefficient is statistically
significantly different from zero (p = 0.010 < 0.05).3 Therefore,
based on this example, we can say net tuition revenue per FTE student is
negatively related to state appropriations per FTE student across 50 states
in 2016.

7.2.3 Multivariate OLS Regression

In the real world of higher education policy analysis and research, however,
OLS regression models with only one independent variable should never be
used to address a question about the importance of a policy-oriented variable.
At the very least, the regression should include control variables, that is,
variables that cannot be manipulated by higher education
policymakers. So, we turn our attention to an OLS regression model with two
or more independent variables, otherwise known as multiple or multivariate
OLS regression. Expanding on Eq. (7.1), a multivariate OLS regression model
is represented mathematically as the following:

3 Thet statistic is equal to the estimated beta coefficient divided by the standard error. So
the smaller the standard error, the larger the absolute value of the t statistic.

Ŷi = β0 + β1 X1 + β2 X2 + … + βn Xn + εi    (7.5)

Equation (7.5) enables analysts and researchers to also include polynomials


of main policy-oriented independent variables as well as control variables.
Indeed, it may be advantageous to include a squared term of the main
independent or policy-oriented variable to ascertain whether the relationship
between the dependent variable and that variable is linear or
non-linear. Using Stata syntax, we can easily create a new
variable reflecting, say, the squared term (or quadratic) of another variable
and then include that new variable in our regression. For example, we can
include the squared term and also state per capita personal income as a
control variable.
To do this, we type the following:
gen stapr_fte2 = stapr_fte*stapr_fte
regress netuit_fte stapr_fte stapr_fte2 pc_income if year ==2016

The output is:


. gen stapr_fte2 = stapr_fte*stapr_fte
. regress netuit_fte stapr_fte stapr_fte2 pc_income if year ==2016

Source | SS df MS Number of obs = 50


-------------+---------------------------------- F(3, 46) = 7.29
Model | 154551308 3 51517102.7 Prob > F = 0.0004
Residual | 325155837 46 7068605.16 R-squared = 0.3222
-------------+---------------------------------- Adj R-squared = 0.2780
Total | 479707145 49 9789941.74 Root MSE = 2658.7

-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -1.608785 .5061895 -3.18 0.003 -2.627692 -.5898788
stapr_fte2 | .0000543 .0000232 2.33 0.024 7.45e-06 .0001011
pc_income | .1322943 .0533078 2.48 0.017 .0249912 .2395974
_cons | 9744.101 3472.105 2.81 0.007 2755.115 16733.09
-------------------------------------------------------------------------------

We see the adjusted R2 is now 0.278, suggesting the model explains 28%
of the variability in net tuition revenue per FTE student across states in
2016. More importantly, the size of the estimated beta coefficient for state
appropriations per FTE student is now −1.61. Because it only relies on cross-
sectional data in 2016, this multiple regression model may have actually
produced biased estimates of the beta coefficients. With data from only
50 cases (i.e., states), a multiple regression model limits the number of
independent variables that may be included in the model.
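With a quadratic term in the model, the implied marginal effect of state appropriations is no longer constant: dY/dX = β1 + 2β2X. A sketch of the turning point implied by the 2016 cross-sectional estimates above (illustrative arithmetic only, not a substantive claim about state finance):

```python
# Sketch: marginal effect and turning point implied by the quadratic
# specification, using the 2016 cross-sectional estimates above.
b1 = -1.608785    # coefficient on stapr_fte
b2 = 0.0000543    # coefficient on stapr_fte2

def marginal_effect(x):
    """Slope of fitted net tuition with respect to appropriations at x."""
    return b1 + 2 * b2 * x

turning_point = -b1 / (2 * b2)     # where the marginal effect is zero
print(round(turning_point))        # roughly $14,800 per FTE student
```

Below the turning point the fitted relationship is negative; the positive squared term flattens it out at higher appropriations levels.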
For example, suppose we include seven independent variables in our model
(e.g., by adding the categorical variable reflecting states grouped by regional
compacts, which enters as several dummy variables) while having only 50
independent units of analysis. This means that the degrees of freedom (the
number of observations minus the number of estimated parameters) will be

reduced. Multiple regression models with very low degrees of freedom may
result in inefficient estimates of the beta coefficients.
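The degrees-of-freedom arithmetic behind this concern is simple: residual df = n − k − 1, where k counts every estimated slope, including each dummy generated by a categorical variable. A minimal sketch; the count of seven regressors (e.g., three continuous variables plus four regional-compact dummies) is illustrative:

```python
# Sketch: residual degrees of freedom with a small cross-section versus a
# pooled panel; the regressor count of 7 is illustrative.
def residual_df(n_obs, n_regressors):
    # n - k - 1: observations minus slopes minus the intercept
    return n_obs - n_regressors - 1

k = 7                        # e.g., 3 continuous + 4 regional-compact dummies
print(residual_df(50, k))    # 42 with one cross-section of states
print(residual_df(1350, k))  # 1342 with 27 years of pooled state data
```

Pooling the panel raises the residual degrees of freedom by more than an order of magnitude for the same set of regressors.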

7.2.4 Multivariate Pooled OLS Regression

If they are available, then data should be used that allows us to overcome
possible problems of low degrees of freedom and consequently inefficient
estimates of beta coefficients. The availability of panel data (discussed in
Chap. 4) would enable us to run pooled OLS (POLS) regression models. The
following example illustrates this point where we regress net tuition revenue
per FTE student on the same variables shown in the previous example.
However, now we use panel data (50 states across 27 years).
reg netuit_fte stapr_fte stapr_fte2 pc_income
The output is:
. reg netuit_fte stapr_fte stapr_fte2 pc_income

Source | SS df MS Number of obs = 1,350


-------------+---------------------------------- F(3, 1346) = 610.98
Model | 5.1916e+09 3 1.7305e+09 Prob > F = 0.0000
Residual | 3.8124e+09 1,346 2832408.8 R-squared = 0.5766
-------------+---------------------------------- Adj R-squared = 0.5756
Total | 9.0040e+09 1,349 6674588.45 Root MSE = 1683

-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -1.018307 .0773341 -13.17 0.000 -1.170015 -.8665983
stapr_fte2 | .0000329 4.33e-06 7.60 0.000 .0000244 .0000413
pc_income | .2036221 .0048243 42.21 0.000 .1941581 .2130862
_cons | 2403.068 320.5399 7.50 0.000 1774.256 3031.88
-------------------------------------------------------------------------------

From the results of the POLS, we see the number of observations, at 1350
(50 × 27), is substantially larger than in the previous output. The
adjusted R2 at 57.6% is also greater, while the root MSE is smaller, indicating
a better model fit. But more relevant to a higher education policy analyst
are the estimated beta coefficients, specifically for stapr_fte, which
is now −1.018. This indicates that while there is still a negative relationship
between net tuition revenue per FTE student and state appropriations per
FTE student, the value of the beta coefficient is lower than when using
the 2016 cross-sectional data.
The larger number of observations will also enable us to include more
independent variables in our POLS regression model without being too
concerned about low degrees of freedom. For example, we can now include
the categorical variable representing region compacts (region_compact) in
our model. Because region_compact is a categorical or factor variable, we
include i.region_compact in the Stata syntax below.
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
The output is:
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact

Source | SS df MS Number of obs = 1,350


-------------+---------------------------------- F(7, 1342) = 329.13
Model | 5.6898e+09 7 812822803 Prob > F = 0.0000
Residual | 3.3143e+09 1,342 2469642.47 R-squared = 0.6319
-------------+---------------------------------- Adj R-squared = 0.6300
Total | 9.0040e+09 1,349 6674588.45 Root MSE = 1571.5

-------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -1.04053 .0750848 -13.86 0.000 -1.187826 -.8932333
stapr_fte2 | .0000383 4.22e-06 9.09 0.000 .00003 .0000466
pc_income | .1917324 .0047585 40.29 0.000 .1823976 .2010672
|
region_compact |
SREB | 185.804 194.7014 0.95 0.340 -196.1481 567.7562
WICHE | -957.9857 199.7539 -4.80 0.000 -1349.85 -566.1219
MHEC | 99.67403 197.3705 0.51 0.614 -287.5143 486.8623
NEBHE | 1100.607 215.7601 5.10 0.000 677.3429 1523.87
|
_cons | 2712.485 366.2787 7.41 0.000 1993.944 3431.027
-------------------------------------------------------------------------------

We see that controlling for regional compact does not substantially change
the estimated beta coefficient for state appropriations per FTE student or for
any of the other variables. It is worth noting that, compared to states that are
not members of regional compacts, WICHE states have lower net tuition revenue
per FTE student and NEBHE states have higher net tuition revenue per FTE
student.

7.2.4.1 Multivariate Pooled OLS Regression with Interaction Terms

Because we are using pooled data, we can also include more variables,
including interaction terms. Interaction terms are combinations of existing
variables. The combination may include the following:
1. two or more categorical variables
2. two or more continuous variables
3. one or more categorical variables with one or more continuous variables
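Regardless of the software, an interaction term is simply the elementwise product of its component variables. A minimal Python sketch with hypothetical values (a 0/1 merit-aid dummy crossed with a continuous appropriations measure):

```python
# Hypothetical component variables.
merit = [0, 1, 0, 1]                       # categorical dummy (0/1)
stapr = [5000.0, 5000.0, 8000.0, 8000.0]   # continuous variable

# The interaction term is the row-by-row product of the two columns;
# Stata's ## operator builds this (plus the main effects) automatically.
merit_x_stapr = [m * s for m, s in zip(merit, stapr)]
print(merit_x_stapr)  # [0.0, 5000.0, 0.0, 8000.0]
```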
For an example of 1, we will use regional compact (region_compact) and
undergraduate merit aid program (ugradmerit). The double hashtag (##)
creates the interaction terms and also includes each of the component
variables separately in the regression model. To see the omitted reference
categories in the output, we include allbaselevels as an option.
. reg netuit_fte stapr_fte i.region_compact##i.ugradmerit, allbaselevels

Source | SS df MS Number of obs = 1,350


-------------+---------------------------------- F(10, 1339) = 33.36
Model | 1.7957e+09 10 179571843 Prob > F = 0.0000
Residual | 7.2083e+09 1,339 5383346.82 R-squared = 0.1994
-------------+---------------------------------- Adj R-squared = 0.1935
Total | 9.0040e+09 1,349 6674588.45 Root MSE = 2320.2

------------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------------+--------------------------------------------------------------
stapr_fte | .0275013 .0289854 0.95 0.343 -.0293604 .084363
|
region_compact |
None | 0 (base)
SREB | -4376.048 663.3507 -6.60 0.000 -5677.367 -3074.728
WICHE | -4350.232 610.8338 -7.12 0.000 -5548.527 -3151.937
MHEC | -3356.754 660.639 -5.08 0.000 -4652.754 -2060.754
NEBHE | -917.9704 637.625 -1.44 0.150 -2168.823 332.8823
|
ugradmerit |
No | 0 (base)
Yes | -2149.968 648.7122 -3.31 0.001 -3422.571 -877.3649
|
region_compact#ugradmerit |
None#No | 0 (base)
None#Yes | 0 (base)
SREB#No | 0 (base)
SREB#Yes | 3477.178 732.8958 4.74 0.000 2039.429 4914.927
WICHE#No | 0 (base)
WICHE#Yes | 2837.446 696.7433 4.07 0.000 1470.619 4204.274
MHEC#No | 0 (base)
MHEC#Yes | 3084.481 735.2593 4.20 0.000 1642.096 4526.867
NEBHE#No | 0 (base)
NEBHE#Yes | 3028.864 743.8004 4.07 0.000 1569.723 4488.005
|
_cons | 6658.134 600.8424 11.08 0.000 5479.439 7836.829
------------------------------------------------------------------------------------------

We can test whether there is an interaction effect between being a
member of a regional compact and having a state merit aid program for
undergraduates that explains more variance in net tuition revenue per FTE
enrollment. We do so by quietly (qui) running the models and storing (est
sto) the results of the model without (model1) and with (model2) the
interaction terms.
. qui reg netuit_fte stapr_fte i.region_compact

. est sto model1

. qui reg netuit_fte stapr_fte i.region_compact##i.ugradmerit

. est sto model2



Then we conduct a likelihood-ratio test (analogous to a nested F-test) of
whether the difference between the R2 of the main effects model and the R2
of the interaction model is equal to zero.
. lrtest model1 model2

Likelihood-ratio test LR chi2(4) = 23.33


(Assumption: model1 nested in model2) Prob > chi2 = 0.0001

Because the p value of the test is 0.0001, we can reject the null hypothesis
and conclude that the model with the interaction terms does help to explain
more variance in net tuition revenue per FTE enrollment. Using the testparm
command, the statistical significance of the interaction terms can also be
checked.
. testparm i.region_compact#i.ugradmerit

( 1) 1.region_compact#1.ugradmerit = 0
( 2) 2.region_compact#1.ugradmerit = 0
( 3) 3.region_compact#1.ugradmerit = 0
( 4) 4.region_compact#1.ugradmerit = 0

F( 4, 1339) = 5.84
Prob > F = 0.0001
The test results above indicate that the interaction terms as a whole are
statistically significant.
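The nested comparison behind such tests can be sketched numerically: the nested F statistic is the change in residual sum of squares per restriction, divided by the full model's residual mean square. A Python sketch with hypothetical figures (the RSS values below are invented for illustration):

```python
# Nested F-test computed from residual sums of squares.
# All figures here are hypothetical, for illustration only.
rss_restricted = 7.334e9   # RSS of the model without interactions
rss_full = 7.208e9         # RSS of the model with interactions
q = 4                      # number of restrictions (interaction terms)
df_full = 1339             # residual degrees of freedom, full model

f_stat = ((rss_restricted - rss_full) / q) / (rss_full / df_full)
print(round(f_stat, 2))  # 5.85
```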
What if we wanted to investigate whether the difference in net tuition revenue
per FTE enrollment by tuition-setting authority (i.tuitset) changes with the
amount of state appropriations per FTE enrollment? This is an example of
number 3 above, where the interaction term is composed of one continuous
variable and one categorical variable. The following syntax includes "c.",
which indicates that state appropriations per FTE enrollment (c.stapr_fte) is
a continuous variable.
. reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset

Source | SS df MS Number of obs = 1,350


-------------+---------------------------------- F(12, 1337) = 39.02
Model | 2.3356e+09 12 194633061 Prob > F = 0.0000
Residual | 6.6684e+09 1,337 4987601.41 R-squared = 0.2594
-------------+---------------------------------- Adj R-squared = 0.2527
Total | 9.0040e+09 1,349 6674588.45 Root MSE = 2233.3

-------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
ugradmerit |
No | 0 (base)
Yes | 561.8776 152.1065 3.69 0.000 263.4843 860.271
|
region_compact|
None | 0 (base)
SREB | -1500.987 276.8269 -5.42 0.000 -2044.049 -957.9247
WICHE | -2130.328 283.3574 -7.52 0.000 -2686.201 -1574.454
MHEC | -1020.981 280.5026 -3.64 0.000 -1571.254 -470.7078
NEBHE | 1018.53 320.9729 3.17 0.002 388.8648 1648.195
|
stapr_fte | .2060764 .217321 0.95 0.343 -.2202508 .6324036
|
tuitset |
Legislature | 0 (base)
State-Wide Board | 10193.37 1678.683 6.07 0.000 6900.234 13486.51
System Board | 1900.957 1390.018 1.37 0.172 -825.8971 4627.81
Campus | 3296.063 1411.991 2.33 0.020 526.1053 6066.022
|
tuitset#c.stapr_fte |
State-Wide Board | -1.310195 .273725 -4.79 0.000 -1.847173 -.7732179
System Board | -.1069439 .2198251 -0.49 0.627 -.5381836 .3242958
Campus | -.2043566 .224125 -0.91 0.362 -.6440316 .2353184
|
_cons | 1895.032 1400.074 1.35 0.176 -851.5488 4641.613
-------------------------------------------------------------------------------------

. testparm c.stapr_fte#i.tuitset

( 1) 2.tuitset#c.stapr_fte = 0
( 2) 3.tuitset#c.stapr_fte = 0
( 3) 4.tuitset#c.stapr_fte = 0

F( 3, 1337) = 17.31
Prob > F = 0.0000

The results of the regression model show that the difference in net
tuition revenue per FTE enrollment by tuition-setting authority (specifically
for state-wide boards compared to the reference category, the legislature)
declines with increases in state appropriations per FTE enrollment. The
results of the post-estimation test indicate that the interaction terms are
statistically significant.
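These group-specific relationships can be recovered directly from the coefficients above: the slope on stapr_fte for each tuition-setting authority is the base slope plus that group's interaction coefficient. A sketch in Python using the estimates as printed in the output (rounding follows the displayed digits):

```python
# Slope of net tuition revenue on state appropriations, by
# tuition-setting authority, from the reported coefficients.
b_stapr = 0.2060764  # base slope (reference: Legislature)
interactions = {
    "Legislature": 0.0,            # reference category
    "State-Wide Board": -1.310195,
    "System Board": -0.1069439,
    "Campus": -0.2043566,
}
slopes = {g: round(b_stapr + b, 4) for g, b in interactions.items()}
print(slopes["State-Wide Board"])  # -1.1041
```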
What if we wanted to find out how the relationship between net tuition
revenue per FTE enrollment and state appropriations changes as the amount
of state total need-based financial aid (state_needFTE) changes? This is an
example of number 2 above, so the regression model should include an
interaction term that is composed of two continuous variables, as shown in
the output below.
. reg netuit_fte i.region_compact c.stapr_fte##c.state_needFTE

Source | SS df MS Number of obs = 1,350


-------------+---------------------------------- F(7, 1342) = 50.88
Model | 1.8885e+09 7 269778857 Prob > F = 0.0000
Residual | 7.1156e+09 1,342 5302211.49 R-squared = 0.2097
-------------+---------------------------------- Adj R-squared = 0.2056
Total | 9.0040e+09 1,349 6674588.45 Root MSE = 2302.7

---------------------------------------------------------------------------------------------
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------------------+----------------------------------------------------------------
region_compact |
None | 0 (base)
SREB | -729.9702 309.623 -2.36 0.019 -1337.368 -122.5725
WICHE | -1546.882 316.8258 -4.88 0.000 -2168.41 -925.3543
MHEC | -187.8332 309.1457 -0.61 0.544 -794.2946 418.6282
NEBHE | 1762.95 327.8046 5.38 0.000 1119.885 2406.016
|
stapr_fte | .1285269 .0364127 3.53 0.000 .057095 .1999588
state_needFTE | 3.372105 .4932087 6.84 0.000 2.404562 4.339649
|
c.stapr_fte#c.state_needFTE | -.0003921 .000074 -5.30 0.000 -.0005372 -.000247
|
_cons | 3349.423 373.6221 8.96 0.000 2616.476 4082.37
---------------------------------------------------------------------------------------------

The results above indicate that the relationship between net tuition
revenue per FTE enrollment and state appropriations per FTE enrollment
changes as state total need-based aid changes; this is captured in the
interaction term. Equivalently, the relationship between net tuition revenue
per FTE enrollment and state total need-based aid changes as state
appropriations per FTE enrollment changes. (This would be the case even if
state appropriations per FTE enrollment by itself were not statistically
significant.)
The interpretation of the results of a regression with an interaction term
that is composed of two continuous variables is facilitated with the use of the
margins and marginsplot post-estimation commands. To restrict some of
the output, we include the vsquish option.
. margins, dydx(stapr_fte) at(state_needFTE=(0(3000)10000)) vsquish

Average marginal effects Number of obs = 1,350


Model VCE : OLS

Expression : Linear prediction, predict()


dy/dx w.r.t. : stapr_fte
1._at : state_needFTE = 0
2._at : state_needFTE = 3000
3._at : state_needFTE = 6000
4._at : state_needFTE = 9000

------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte |
_at |
1 | .1285269 .0364127 3.53 0.000 .057095 .1999588
2 | -1.047798 .2012571 -5.21 0.000 -1.442611 -.6529854
3 | -2.224123 .4220569 -5.27 0.000 -3.052086 -1.39616
4 | -3.400448 .6435904 -5.28 0.000 -4.663001 -2.137895
------------------------------------------------------------------------------
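These dy/dx values follow directly from the regression coefficients: the marginal effect of stapr_fte is the base slope plus the interaction coefficient times the chosen value of state_needFTE. A Python sketch using the rounded estimates from the output above (small discrepancies with the margins output reflect rounding of the printed coefficients):

```python
# Marginal effect of stapr_fte at selected values of state_needFTE:
# dY/d(stapr_fte) = b_stapr + b_interaction * state_needFTE
b_stapr = 0.1285269    # coefficient on stapr_fte
b_inter = -0.0003921   # coefficient on c.stapr_fte#c.state_needFTE

for need in (0, 3000, 6000, 9000):
    print(need, round(b_stapr + b_inter * need, 4))
# 0 0.1285, 3000 -1.0478, 6000 -2.2241, 9000 -3.4004
```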

The output of the margins command indicates the amount of change in
net tuition revenue per FTE enrollment with a one-unit (i.e., one-dollar)
change in state appropriations per FTE enrollment at different values of state
total need-based aid per FTE enrollment. In our example, we hold state total
need-based aid per FTE enrollment at $0, $3000, $6000, and $9000. The
output above indicates that state appropriations per FTE enrollment is
statistically significant at all of those values of state total need-based aid per
FTE enrollment. We can show how this relationship changes at each value in a
graph by entering the following syntax.
. qui margins, at(stapr_fte=(0 10000) state_needFTE=(0(3000)10000)) vsquish
.
. marginsplot, noci x(stapr_fte) recast(line) xlabel(0(3000)10000)

We can see from Fig. 7.1 that when there is no state need-based
financial aid per FTE enrollment, the relationship between net tuition
revenue per FTE enrollment and state appropriations per FTE enrollment
is slightly positive. But as the amount of state need-based financial aid
per FTE enrollment increases, the relationship between net tuition revenue
per FTE enrollment and state appropriations per FTE enrollment becomes
increasingly negative. This suggests that as states increase their funding
directly to students, net tuition revenue to institutions declines more rapidly
in response to higher amounts of state appropriations.
But it is possible that the estimated beta coefficients in this POLS
regression are biased due to violations of one or more of the seven classical
OLS assumptions presented in Sect. 7.2.1. More specifically, we can and
should check to see if some of the assumptions have been violated by
performing post-estimation diagnostics.

Fig. 7.1 Predictive margins of net tuition revenue per FTE by state need-based aid per FTE

One such diagnostic is a residual-versus-fitted plot that can be created
immediately after running the regression by simply typing the Stata command
syntax, rvfplot. This command produces the plot shown in Fig. 7.2.
We can see from Fig. 7.2 that the residuals are more dispersed in the
middle of the graph than at the left and right. This indicates a violation
of the assumption that the error term (ε) has a constant variance
(homoscedasticity). Additionally, it is quite possible that the errors are
not normally distributed. So a comprehensive post-estimation test should be
conducted to detect whether, in addition to a violation of the assumption of
normally distributed errors, there is also heteroscedasticity. This is done by
typing the Stata command syntax estat imtest, which produces the following
output:
. estat imtest

Cameron & Trivedi’s decomposition of IM-test

---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 189.76 24 0.0000
Skewness | 63.95 7 0.0000
Kurtosis | 9.56 1 0.0020
---------------------+-----------------------------
Total | 263.27 32 0.0000
---------------------------------------------------
The p values indicate the assumptions of homoscedasticity and normally
distributed errors have been violated. To take into account these two
violations of assumptions, we should rerun our POLS regression model using
the robust option.
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, robust

Linear regression Number of obs = 1,350


F(7, 1342) = 249.37
Prob > F = 0.0000
R-squared = 0.6319
Root MSE = 1571.5

------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -1.04053 .0721036 -14.43 0.000 -1.181978 -.8990817
stapr_fte2 | .0000383 4.03e-06 9.51 0.000 .0000304 .0000462
pc_income | .1917324 .0060699 31.59 0.000 .1798248 .20364
|
region_compact |
SREB | 185.804 199.3342 0.93 0.351 -205.2365 576.8446
WICHE | -957.9857 180.9863 -5.29 0.000 -1313.033 -602.9389
MHEC | 99.67403 178.2546 0.56 0.576 -250.014 449.362
NEBHE | 1100.607 221.0546 4.98 0.000 666.9565 1534.257
|
_cons | 2712.485 353.6409 7.67 0.000 2018.736 3406.235
------------------------------------------------------------------------------

Fig. 7.2 Residual-versus-fitted plot (rvfplot)

From this output we see that the estimated beta coefficients are the same,
but some of the standard errors have changed. It is also possible, however,
that the variability of the dependent variable is unequal across groups, that
is, that there is group-wise heteroscedasticity. In other words, observations of
net tuition revenue per FTE student within each state may not be
independent, leading to residuals that are not independent within states. To
detect group-wise heteroscedasticity, another test should be conducted. This
test, which is robust to non-normality, is called the Levene test of
homogeneity of variances and is conducted in the following steps.4
quietly: reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
predict double eps, residual
robvar eps, by(state)

. robvar eps, by(state)

4 For a full description of the Levene test, see Levene, H. (1960). Robust tests for equality
of variances. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.),
Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 278–
292). Stanford University Press.
State        |
abbreviation |      Summary of Residuals
             |      Mean   Std. Dev.   Freq.
-------------+------------------------------------
AK | 1122.2639 939.42995 27
AL | 1616.6929 1956.758 27
AR | 162.56587 334.56975 27
AZ | -137.19617 477.91889 27
CA | -2498.2149 1121.2912 27
CO | -90.322544 701.35696 27
CT | -845.30401 699.56268 27
DE | 4531.8731 3107.7021 27
FL | -2373.7873 730.65775 27
GA | -810.79875 723.35512 27
HI | 1297.0811 471.53693 27
IA | 1141.2379 504.31709 27
ID | 468.24506 360.61989 27
IL | -1369.4974 1150.3627 27
IN | 1404.8174 1242.093 27
KS | -933.50045 366.96024 27
KY | 623.90684 721.21209 27
LA | -1011.2335 521.60785 27
MA | -2230.438 826.23532 27
MD | -752.82225 415.40112 27
ME | 873.48914 1048.5964 27
MI | 1873.1069 1620.3943 27
MN | -134.58677 664.61368 27
MO | -865.87429 391.30801 27
MS | 504.26361 703.23813 27
MT | 483.33723 393.45575 27
NC | -630.84519 306.35169 27
ND | 227.6154 755.3359 27
NE | -662.25153 372.39504 27
NH | -1012.948 760.14257 27
NJ | 223.57714 440.59753 27
NM | 254.49388 293.12376 27
NV | -464.44503 327.67539 27
NY | -1739.7618 502.25295 27
OH | 385.59024 496.92629 27
OK | -845.73432 563.76589 27
OR | 500.09117 664.07122 27
PA | 1516.1847 582.46832 27
RI | -78.700309 720.27019 27
SC | 878.38713 769.57448 27
SD | 172.33711 638.97413 27
TN | -197.12965 483.98213 27
TX | -1033.8177 637.22766 27
UT | 680.8684 424.88764 27
VA | -1089.88 508.53689 27
VT | 3293.9012 1334.3393 27
WA | -1094.1928 523.50132 27
WI | -1238.9945 541.81062 27
WV | 428.35934 789.9017 27
WY | -522.0092 1628.6087 27
------------+------------------------------------
Total | -1.169e-13 1567.427 1,350

W0 = 21.243149 df(49, 1300) Pr > F = 0.00000000

W50 = 10.597663 df(49, 1300) Pr > F = 0.00000000

W10 = 19.183198 df(49, 1300) Pr > F = 0.00000000
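The W0 statistic reported here is straightforward to compute by hand: replace each observation with its absolute deviation from its group mean, then run a one-way ANOVA F-test on those deviations. A Python sketch on hypothetical two-group data (all values invented for illustration):

```python
# Levene's W0: one-way ANOVA F statistic on absolute deviations
# of each observation from its group mean.
def levene_w0(groups):
    z = [[abs(y - sum(g) / len(g)) for y in g] for g in groups]
    n = sum(len(g) for g in z)      # total observations
    k = len(z)                      # number of groups
    grand = sum(sum(g) for g in z) / n
    means = [sum(g) / len(g) for g in z]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, means))
    within = sum((v - m) ** 2 for g, m in zip(z, means) for v in g)
    return ((n - k) / (k - 1)) * between / within

# Hypothetical groups with clearly unequal variances.
low_var = [10.0, 10.1, 9.9, 10.05, 9.95]
high_var = [5.0, 15.0, 2.0, 18.0, 10.0]
print(round(levene_w0([low_var, high_var]), 2))  # 12.34
```

(Stata's W50 and W10 variants replace the group mean with the group median and 10% trimmed mean, respectively.)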

The robvar output shows that Delaware (DE), Alabama (AL), Wyoming
(WY), Michigan (MI), and Vermont (VT) have very large standard devia-
tions, which suggests they are outliers. But more relevant to the Levene test,
the p value of W0 (which is more robust to non-normality than the other
tests) indicates that the null hypothesis of equality of variances should be
rejected. This strongly suggests there is group-wise heteroscedasticity. To
address this particular violation of the assumption of homoscedasticity, we
use the cluster option, with state as the cluster variable, in our POLS
regression model.
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)

Linear regression Number of obs = 1,350


F(7, 49) = 34.92
Prob > F = 0.0000
R-squared = 0.6319
Root MSE = 1571.5

(Std. Err. adjusted for 50 clusters in state)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -1.04053 .2449388 -4.25 0.000 -1.532753 -.5483067
stapr_fte2 | .0000383 .0000122 3.13 0.003 .0000137 .0000629
pc_income | .1917324 .0167239 11.46 0.000 .1581245 .2253404
|
region_compact |
SREB | 185.804 915.4101 0.20 0.840 -1653.781 2025.39
WICHE | -957.9857 851.9214 -1.12 0.266 -2669.986 754.0144
MHEC | 99.67403 841.7089 0.12 0.906 -1591.803 1791.151
NEBHE | 1100.607 1053.032 1.05 0.301 -1015.539 3216.753
|
_cons | 2712.485 1294.855 2.09 0.041 110.3759 5314.595
------------------------------------------------------------------------------

Compared to the previous regression model with the robust option, this
model produces different results with respect to the statistical significance
of the categorical variables reflecting regional compacts. In this model, net
tuition revenue per FTE student is not related to state membership in
regional compacts. This example shows that failing to take into account
the clustered nature of the residuals with respect to states can result in
false claims about the statistical significance of certain variables, that is,
Type I errors (rejection of true null hypotheses). Therefore,
when employing POLS regression models, we should always test for group-wise
heteroscedasticity and, if called for, use standard errors that reflect the
relaxation of the assumption of intragroup independence.

7.3 Weighted Least Squares and Feasible Generalized Least Squares Regression

When the assumption of homoscedasticity is violated and the variance of
the dependent variable is known, we can use weighted least squares (WLS).
When the form of the heteroscedasticity is known, we can employ feasible
generalized least squares (FGLS). The use of WLS, however, requires a
great deal of judgment on the part of the analyst regarding the weight
that should be used. Consequently, different analysts estimating the same
regression model with the same variables but with different weights may
obtain different results. According to Hoechle (2007), FGLS regression
models are inappropriate for data in which the number of time periods (T)
(e.g., years) is less than the number of panels (m) (e.g., states). In higher
education policy research, T < m is the rule rather than the exception.
Additionally, both WLS and FGLS multivariate regression models may be
inappropriate when using what econometricians refer to as microeconometric
(T < m) panel data in an effort to take into account unobserved
heterogeneity.
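To make the WLS idea concrete: with a known variance structure, weighting each observation by the inverse of its variance is equivalent to rescaling each row of the data by the square root of its weight and running OLS. A Python sketch with simulated data (the variance structure and all figures are invented for illustration):

```python
import numpy as np

# Simulated data where Var(error) is proportional to x (known form).
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=np.sqrt(x))

# WLS: weight w_i = 1/Var_i, then OLS on sqrt(w_i)-scaled rows.
w = 1.0 / x
sw = np.sqrt(w)
X = np.column_stack([np.ones_like(x), x])
beta_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print(beta_wls.round(2))  # close to the true values [3.0, 2.0]
```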

7.4 Fixed-Effects Regression

Even with the use of cluster-robust standard errors, multivariate POLS
regression models may be limited by their inability to take into account
unobserved differences, or heterogeneity, between units of analysis (e.g.,
students, institutions, states). This takes us to a discussion of unobserved
heterogeneity and fixed-effects regression models.
7.4.1 Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression

When conducting higher education policy research, unobserved heterogeneity
may influence findings. For example, state culture with regard to higher
education (which we may not be able to observe) may influence the extent
to which states allow public higher education institutions to be funded by
tuition revenue. With respect to a regression model, unobserved heterogeneity
is included in the equation below, where the error is composed of ui
(unobserved characteristics of a group or entity such as an institution, state,
or country) and εit (the idiosyncratic residual):

Ŷit = β0 + β1X1it + β2X2it + . . . + βnXnit + uit + εit    (7.6)

In Eq. (7.6), we include uit, which is constant or fixed over a reasonable
amount of time, that is, a time-invariant group effect (e.g., institutional
culture, state culture, national identity), and can be represented by "dummy"
variables in a multivariate regression model. Therefore, Eq. (7.6) can be
expanded and rewritten as a regression model containing one dummy variable
for each of the N groups minus one (N − 1):

Ŷit = β0 + β1X1it + β2X2it + . . . + βnXnit + α1D2t + α2D3t + . . . + αN−1DNt + uit + εit    (7.7)

where each α is the estimated coefficient for the respective dummy
variable (D). Equation (7.7) excludes the first dummy variable (D1),
which represents the reference group. Applying the above equation to a
state-level panel dataset, αi is a state fixed-effect, as the "effect" of state i is
"fixed" across all years. In Eq. (7.7), each α represents a different state
fixed-effect, while β1 . . . βn are the same for all states.
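The dummy-variable formulation in Eq. (7.7) can be verified numerically: estimating it by OLS with explicit dummies yields the same slope as "demeaning" each variable within its group (the within transformation). A Python sketch on a simulated panel (all data invented for illustration):

```python
import numpy as np

# Simulated panel: 3 states x 10 years, with state-specific intercepts.
rng = np.random.default_rng(0)
state = np.repeat(np.arange(3), 10)
x = rng.normal(size=30) + state
y = 2.0 * x + 5.0 * state + rng.normal(size=30)

# (1) FEDV: OLS with an intercept, two state dummies, and x.
D = np.column_stack([np.ones(30), state == 1, state == 2, x])
slope_dummies = np.linalg.lstsq(D, y, rcond=None)[0][-1]

# (2) Within estimator: demean x and y by state, then simple OLS.
x_dm = x - np.array([x[state == s].mean() for s in state])
y_dm = y - np.array([y[state == s].mean() for s in state])
slope_within = (x_dm @ y_dm) / (x_dm @ x_dm)

print(np.isclose(slope_dummies, slope_within))  # True
```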

7.4.2 Estimating FEDV Multivariate POLS Regression Models

Using the panel data from the example above and dummy variables, we
show how state fixed-effects can be taken into account by adding i.stateid
to the multivariate POLS regression model (without the regional compact
categorical variable) above.
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state)

Linear regression Number of obs = 1,350


F(2, 49) = .
Prob > F = .
R-squared = 0.8989
Root MSE = 837.7

(Std. Err. adjusted for 50 clusters in state)


-------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+-------------------------------------------------------------
stapr_fte | -.5639993 .1274783 -4.42 0.000 -.8201766 -.3078221
stapr_fte2 | 9.09e-06 7.15e-06 1.27 0.210 -5.29e-06 .0000235
pc_income | .2109376 .0114659 18.40 0.000 .1878961 .2339792
|
stateid |
Alaska | -864.7734 504.8033 -1.71 0.093 -1879.214 149.6669
Arizona | -2570.101 179.2046 -14.34 0.000 -2930.226 -2209.976
Arkansas | -1487.844 35.80258 -41.56 0.000 -1559.792 -1415.896
California | -5408.939 120.4583 -44.90 0.000 -5651.009 -5166.87
Colorado | -2641.914 231.0978 -11.43 0.000 -3106.323 -2177.506
Connecticut | -1804.824 263.9297 -6.84 0.000 -2335.21 -1274.437
Delaware | 2795.367 112.9649 24.75 0.000 2568.355 3022.378
Florida | -4036.336 82.02393 -49.21 0.000 -4201.169 -3871.502
Georgia | -2552.067 74.2346 -34.38 0.000 -2701.247 -2402.887
Hawaii | -1178.139 376.8727 -3.13 0.003 -1935.493 -420.7845
Idaho | -2360.993 45.92879 -51.41 0.000 -2453.29 -2268.696
Illinois | -3203.982 83.94839 -38.17 0.000 -3372.683 -3035.282
Indiana | -323.1194 58.14713 -5.56 0.000 -439.9704 -206.2683
Iowa | -661.722 48.34359 -13.69 0.000 -758.8721 -564.5719
Kansas | -2656.893 97.87067 -27.15 0.000 -2853.571 -2460.214
Kentucky | -1039.455 33.75404 -30.79 0.000 -1107.286 -971.6237
Louisiana | -2613.661 26.13406 -100.01 0.000 -2666.179 -2561.143
Maine | 45.72522 35.23359 1.30 0.200 -25.07933 116.5298
Maryland | -2589.094 157.934 -16.39 0.000 -2906.474 -2271.714
Massachusetts | -3270.657 154.2259 -21.21 0.000 -3580.585 -2960.728
Michigan | 290.352 136.2497 2.13 0.038 16.54801 564.1559
Minnesota | -2032.941 87.92352 -23.12 0.000 -2209.63 -1856.252
Mississippi | -1056.209 33.40168 -31.62 0.000 -1123.332 -989.0861
Missouri | -2468.3 120.12 -20.55 0.000 -2709.69 -2226.91
Montana | -2046.337 137.3619 -14.90 0.000 -2322.376 -1770.298
Nebraska | -2522.536 59.83506 -42.16 0.000 -2642.779 -2402.293
Nevada | -3389.331 60.91085 -55.64 0.000 -3511.736 -3266.926
New Hampshire | -1343.527 313.4941 -4.29 0.000 -1973.517 -713.5372
New Jersey | -1970.144 153.7648 -12.81 0.000 -2279.146 -1661.142
New Mexico | -2521.355 101.9705 -24.73 0.000 -2726.273 -2316.438
New York | -3830.118 132.7709 -28.85 0.000 -4096.931 -3563.305
North Carolina | -2394.794 85.28162 -28.08 0.000 -2566.174 -2223.415
North Dakota | -1529.233 55.4164 -27.60 0.000 -1640.596 -1417.87
Ohio | -1248.71 117.3901 -10.64 0.000 -1484.615 -1012.806
Oklahoma | -2539.597 20.97655 -121.07 0.000 -2581.751 -2497.443
Oregon | -2087.886 153.6461 -13.59 0.000 -2396.649 -1779.122
Pennsylvania | -235.4195 158.476 -1.49 0.144 -553.8891 83.05
Rhode Island | -794.3782 137.5683 -5.77 0.000 -1070.832 -517.9245
South Carolina | -667.1569 58.70243 -11.37 0.000 -785.1238 -549.1899
South Dakota | -1501.664 108.4995 -13.84 0.000 -1719.702 -1283.626
Tennessee | -1878.185 23.56557 -79.70 0.000 -1925.541 -1830.828
Texas | -2752.335 49.84661 -55.22 0.000 -2852.506 -2652.165
Utah | -2040.838 38.52044 -52.98 0.000 -2118.248 -1963.428
Vermont | 2953.999 246.3382 11.99 0.000 2458.964 3449.034
Virginia | -2760.425 161.7475 -17.07 0.000 -3085.469 -2435.381
Washington | -4003.913 114.8606 -34.86 0.000 -4234.734 -3773.092
West Virginia | -1083.476 46.69777 -23.20 0.000 -1177.319 -989.6333
Wisconsin | -2920.358 121.1009 -24.12 0.000 -3163.719 -2676.996
Wyoming | -3131.875 236.0225 -13.27 0.000 -3606.18 -2657.57
|
_cons | 2177.63 574.5532 3.79 0.000 1023.023 3332.238
-------------------------------------------------------------------------------

Before we interpret the results of the above output, we should determine
if the state fixed-effects as a whole are statistically significant. Immediately
after we run the above regression, we do this by typing the following:
testparm i.stateid
We see an output that looks like this:
. testparm i.stateid

( 1) 2.stateid = 0
( 2) 3.stateid = 0
( 3) 4.stateid = 0
( 4) 5.stateid = 0
( 5) 6.stateid = 0
( 6) 7.stateid = 0
( 7) 8.stateid = 0
( 8) 9.stateid = 0
( 9) 10.stateid = 0
[omitted output]
(45) 46.stateid = 0
(46) 47.stateid = 0
(47) 48.stateid = 0
(48) 49.stateid = 0
(49) 50.stateid = 0

F( 3, 49) = 30.34
Prob > F = 0.0000
We reject the null that the coefficients for all 49 state dummy variables
are jointly equal to zero. Therefore, state fixed-effects can be retained in
the regression model. We see from the output that every state except the
first state, Alabama, was included in the regression results. Compared to
Alabama, net tuition revenue per FTE student is lower in every state
except Delaware, Maine (no statistically significant difference), Michigan, and
Vermont. Compared to the multivariate POLS regression model without state
fixed-effects, the beta coefficient for state appropriations per FTE student is
also statistically significant but substantially smaller at −0.564. (We also see
that the squared term of state appropriations per FTE student is no longer
statistically significant.)
Many analysts (particularly economists) view dummy variables as “nui-
sance” variables that are not discussed when presented in studies. Therefore,
it may not be necessary to show the estimated beta coefficients of the dummy
variables reflecting group (e.g., states) fixed-effects. In most instances, we can
simply indicate (e.g., Yes) that state or institution fixed-effects have been
included in a POLS regression model that has been fitted to panel data.
By adding the dummy variables (Ds) for each state, we are estimating the
pure effects of the independent variables (Xs). Each dummy variable (D) is
absorbing the effects particular to each state. This concept is the basis for
the alternative approach to producing the same results that are shown above
by using the following Stata syntax:
areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid)
absorb(stateid)

The resulting output is:


. areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid)
absorb(stateid)

Linear regression, absorbing indicators Number of obs = 1,350


Absorbed variable: stateid No. of categories = 50
F( 3, 49) = 118.57
Prob > F = 0.0000
R-squared = 0.8989
Adj R-squared = 0.8949
Root MSE = 837.7043

(Std. Err. adjusted for 50 clusters in stateid)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stapr_fte | -.5639993 .1274783 -4.42 0.000 -.8201766 -.3078221
stapr_fte2 | 9.09e-06 7.15e-06 1.27 0.210 -5.29e-06 .0000235
pc_income | .2109376 .0114659 18.40 0.000 .1878961 .2339792
_cons | 339.0282 574.4332 0.59 0.558 -815.3384 1493.395
------------------------------------------------------------------------------

Minus the estimated beta coefficients for 49 states, the results are exactly
the same as the previous output. This option is very useful when running
a FEDV multivariate POLS regression model with many units or groups
(e.g., institutions). For example, suppose we are conducting a study of how
education and general (E&G) expenditures across 220 public master’s colleges
and universities (over 10 years) are related to state appropriations (controlling
for other variables) using a FEDV multivariate POLS regression model.
Clearly, it would be more efficient to use this (areg) option than including
219 dummy variables in the regression model. This is shown below:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 7\Stata files"
use "Example 7.1.dta"
areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new)
absorb(opeid5_new)

The output is as follows:


. areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new)
absorb(opeid5_new)

Linear regression, absorbing indicators Number of obs = 1,978


Absorbed variable: opeid5_new No. of categories = 220
F( 6, 219) = 221.65
Prob > F = 0.0000
R-squared = 0.9714
Adj R-squared = 0.9677
Root MSE = 9.127e+06

(Std. Err. adjusted for 220 clusters in opeid5_new)


------------------------------------------------------------------------------
| Robust
eg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
statea | .6341792 .0779977 8.13 0.000 .480457 .7879014
tuition | 1.193035 .0634888 18.79 0.000 1.067908 1.318162
totfteiarep | 1167.479 1007.936 1.16 0.248 -819.0184 3153.976
ftfac | 27583.39 29489.68 0.94 0.351 -30536.51 85703.28
ptfac | 7026.641 14691.85 0.48 0.633 -21928.88 35982.16
D | -5818460 2680039 -2.17 0.031 -1.11e+07 -536491.5
_cons | -1.93e+07 7709459 -2.50 0.013 -3.45e+07 -4108069
------------------------------------------------------------------------------

(Note: eg = education and general expenditures; statea = state appropriations;
tuition = tuition revenue; totfteiarep = total FTE students per IPEDS
report; ftfac = full-time faculty; ptfac = part-time faculty; D = whether the
institution confers doctoral degrees (0 = no/1 = yes).)

7.4.2.1 Unobserved Heterogeneity and Within-Group Estimator Fixed-Effects Regression

While the use of the Stata command areg enables us to run a FEDV regression
model that takes into account unobserved time-invariant heterogeneity,
xtreg allows us to do the same via the within-group estimator. The within-
group estimator involves the indirect use of the between-effects model, which
regresses the group mean of the dependent variable on the group means of
the independent variables. This is reflected in Eq. (7.8). The within-group
estimator fixed-effects regression is obtained by subtracting Eq. (7.8) from
Eq. (7.6).

Ȳi = β1 X̄1i + β2 X̄2i + . . . βn X̄ni + μi + ε̄i (7.8)

The result of this subtraction, also known as “time demeaning” the data, is
the disappearance of the μi term, or time-invariant unobserved heterogeneity.
In Stata, this is equivalent to using the xtreg command with the fe option.
. xtreg eg statea tuition totfteiarep ftfac ptfac, fe cluster(opeid5_new)

Fixed-effects (within) regression Number of obs = 1,978


Group variable: opeid5_new Number of groups = 220

R-sq: Obs per group:


within = 0.7784 min = 2
between = 0.9312 avg = 9.0
overall = 0.9011 max = 10

F(5,219) = 284.17
corr(u_i, Xb) = -0.7836 Prob > F = 0.0000

(Std. Err. adjusted for 220 clusters in opeid5_new)


------------------------------------------------------------------------------
| Robust
eg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
statea | .6359503 .0721045 8.82 0.000 .4938427 .7780578
tuition | 1.20119 .0598787 20.06 0.000 1.083178 1.319202
totfteiarep | 1050.312 1031.059 1.02 0.309 -981.7563 3082.38
ftfac | 32819.51 31076.35 1.06 0.292 -28427.49 94066.5
ptfac | 7375.78 14858.29 0.50 0.620 -21907.76 36659.32
_cons | -2.51e+07 6197457 -4.05 0.000 -3.73e+07 -1.29e+07
-------------+----------------------------------------------------------------
sigma_u | 21316368
sigma_e | 9198872.8
rho | .84300893 (fraction of variance due to u_i)
------------------------------------------------------------------------------

While the output shows that the estimated beta coefficients are the same as
those produced by the FEDV POLS regression model with dummy variables
using the areg command, the above output provides more information.
First, it shows the within R2, between R2, and the overall R2. The within
R2 measures how much variation in the dependent variable within groups
(e.g., institutions) is explained over time by the regression model. The
between R2 measures how much variation in the dependent variable between
groups is captured by the model. The overall R2 is a weighted average of
the within R2 and the between R2 . In some cases, the within R2 will be
higher than the between R2 and in other cases, the reverse may hold true.
Because most higher education policy research is more concerned with the
importance (i.e., the statistical significance of beta coefficients) of policy-
oriented variables, there is less focus on the R2 s.
Second, information is provided about the time-invariant group-specific
error term (μi) and the idiosyncratic error term (εi). The sigma_u is the
standard deviation of μi and sigma_e is the standard deviation of εi.
The rho (fraction of variance due to u_i) indicates the proportion of
the unexplained variance that is due to unobserved time-invariant group
heterogeneity. In the above example, we see that 84% of the unexplained
variance is due to unobserved time-invariant group heterogeneity.
Third, the output provides information on the number of observations per
group. The example above shows that the panel data set is unbalanced with a
minimum of two observations per group and a maximum of ten observations
per group.
Fourth, corr(u_i, Xb) shows the correlation between unobserved time-
invariant group heterogeneity and the independent variables in the fixed-
effects regression model. The output above indicates there is correlation
(−0.7836) between the unobserved time-invariant group heterogeneity and
the independent variables. When using a fixed-effects model, any such
correlation is acceptable. It is not acceptable when using a random-effects
model (discussed below), which assumes no correlation between unobserved
time-invariant group heterogeneity and the independent variables.
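The equivalence between the dummy-variable (areg) and within-group (xtreg, fe) estimators can be verified numerically. The Python sketch below uses simulated data (all names and values are invented for illustration): time demeaning removes the group effect μi and yields exactly the same slope as OLS with a full set of group dummies.

```python
import numpy as np

# Simulated panel with unobserved time-invariant heterogeneity (mu_i)
rng = np.random.default_rng(1)
n_groups, T = 6, 8
g = np.repeat(np.arange(n_groups), T)
mu = rng.normal(0, 5, n_groups)[g]            # group effect, constant over time
x = rng.normal(size=n_groups * T) + 0.5 * mu  # x correlated with mu_i
y = 2.0 * x + mu + rng.normal(size=n_groups * T)

# (1) FEDV: OLS of y on x plus a full set of group dummies
D = (g[:, None] == np.arange(n_groups)).astype(float)
b_dv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# (2) Within estimator: subtract group means ("time demeaning"); mu_i drops out
def demean(v):
    group_means = np.array([v[g == j].mean() for j in range(n_groups)])
    return v - group_means[g]

b_within = np.linalg.lstsq(demean(x)[:, None], demean(y), rcond=None)[0][0]
# b_dv and b_within are identical (Frisch-Waugh-Lovell theorem)
```

Pooled OLS without the dummies would be biased here because x is correlated with μi; both fixed-effects estimators recover a slope close to the true value of 2.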

7.4.2.2 Limitations of Fixed-Effects Regression Models

While they take into account unobserved time-invariant group heterogeneity,
fixed-effects regression models cannot include an observed time-invariant
group variable. For example, a fixed-effects regression model cannot estimate
the variable reflecting membership in a regional compact, which does not
vary over time.

7.4.3 Fixed-Effects Regression and Difference-in-Differences

Like a pooled OLS or random-effects regression model, a fixed-effects
regression model does not infer causation. In our example above, we
cannot conclude that a change in any of the independent variables “causes”
the dependent variable (E&G expenditures) to change. In order to infer
causation, a difference-in-differences (DiD) estimator has to be included in
a fixed-effects regression model. In general, the DiD estimator takes the
difference in average outcomes for a treated group (e.g., students, institutions,
states) compared to an untreated comparison or control group before and
after the treatment.

7.4.3.1 The DiD Estimator

Drawing heavily from Furquim et al. (2020) and using their notation, the
DiD estimator is based on the following:

δDiD = (Ȳ1T − Ȳ0T) − (Ȳ1C − Ȳ0C) (7.9)

where Ȳ is the average outcome, T is the treated group, C is the control
group, 0 is before treatment, and 1 is after treatment. In a regression model,
this is represented as:

Yit = α + βTi + γPt + δDiD Ti × Pt + θCon + εit (7.10)

where Yit is the outcome for group i in year t, Ti is a binary
variable indicating treatment status (treated group = 1 and untreated
group = 0), and Pt is a binary variable indicating the time periods t when the
treatment takes effect. Con is a vector of control variables. In Eq. (7.10), the
interaction of T and P takes a value of 1 for all observations of groups in
the treatment group in the treatment and post-treatment time periods. The
treatment effect is δ DiD , or the “average” treatment effect.5
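In the simple two-group, two-period case without controls, the means-based estimator in Eq. (7.9) and the interaction coefficient in Eq. (7.10) are numerically identical. A minimal Python sketch with made-up cell means (purely for illustration, not the book's data):

```python
import numpy as np

# Hypothetical cell means for (treated T, post-period P); values invented
rng = np.random.default_rng(0)
cell_means = {(0, 0): 10.0, (0, 1): 12.0, (1, 0): 9.0, (1, 1): 14.0}
rows = [(T, P, m + rng.normal(0, 0.5))
        for (T, P), m in cell_means.items() for _ in range(50)]
arr = np.array(rows)
T, P, y = arr[:, 0], arr[:, 1], arr[:, 2]

# Eq. (7.9): difference in average outcomes
d_means = (y[(T == 1) & (P == 1)].mean() - y[(T == 1) & (P == 0)].mean()) \
        - (y[(T == 0) & (P == 1)].mean() - y[(T == 0) & (P == 0)].mean())

# Eq. (7.10) without controls: y = a + b*T + g*P + d*(T x P)
X = np.column_stack([np.ones_like(T), T, P, T * P])
d_reg = np.linalg.lstsq(X, y, rcond=None)[0][3]
# d_means equals d_reg; both estimate the true effect (14 - 9) - (12 - 10) = 3
```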
To demonstrate the use of a fixed-effects regression-based DiD model, we
will refer to an actual higher education policy change that occurred in a state.
Using a fixed-effects regression-based DiD model and the relevant data, we
will show how a policy question arising from that change can be addressed.

7.4.3.2 Fixed-Effects Regression-Based DiD: An Example

In 2004, Colorado enacted Senate Bill 189 (SB 04-189) to establish the College
Opportunity Fund (COF) program. Starting in 2005, COF-designated
higher education institutions no longer received state appropriations. Instead,
funding was provided to resident undergraduate students in the form of a
stipend to help pay their tuition. The legislation also required that 20% of
increased resident tuition be set aside for financial aid. This suggested that
net tuition should not increase substantially. If Colorado state policymakers
ask whether COF had an effect on net tuition revenue, then a fixed-effects
regression-based DiD model is an appropriate technique that analysts can
use to address this question.
use "Example 7.1.dta", clear

5 For an excellent comprehensive description, discussion, and example of regression-based
DiD techniques, see Furquim et al. (2020).


130 7 Introduction to Intermediate Statistical Techniques

We create the treatment variable (T).


gen T=0
replace T=1 if state=="CO"
The post-treatment (P) is then created.
gen P=0
replace P=1 if year>=2004
Based on every state other than the treatment state (Colorado), we create
the first control group.
gen C1 = 0
replace C1=1 if state !="CO"
Based on every state that is a member of the Western Interstate
Commission for Higher Education (WICHE) other than the treatment state
(Colorado), we create a second control group.
gen C2 = 0
replace C2=1 if state !="CO" & region_compact==2
In order to avoid additional keystrokes, we use the Stata global command
to create temporary variables reflecting the dependent variable net tuition
revenue per FTE enrollment (y)
global y "netuit_fte"
and the set of control variables, state appropriations to higher education
per FTE enrollment (stapr_fte) and state per capita income (pc_income).
global controls "stapr_fte pc_income"
We run a DiD regression model that includes controls, year dummy
variables (i.year), and state dummy variables (i.stateid). The model is run
covering the year 2000 to the most recently available year and states in
the treatment group or first control group. To take into account
heteroscedasticity, we include the robust (rob) option in the syntax.
reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1),
rob

. reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1),
rob
note: 2016.year omitted because of collinearity
note: 8.fips omitted because of collinearity

Linear regression Number of obs = 850


F(68, 781) = 192.85
Prob > F = 0.0000
R-squared = 0.9393
Root MSE = 685.82
------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.T | -1111.343 365.6843 -3.04 0.002 -1829.184 -393.5028
1.P | 4634.093 427.505 10.84 0.000 3794.899 5473.288
|
T#P |
1 1 | 501.2044 202.7361 2.47 0.014 103.2322 899.1765
|
stapr_fte | -.1933747 .0320378 -6.04 0.000 -.2562652 -.1304842
pc_income | .0001359 .0198814 0.01 0.995 -.0388913 .0391632
|
year |
2001 | 219.4993 172.0082 1.28 0.202 -118.1539 557.1525

[omitted output]
|
fips |
2 | -165.3607 397.6397 -0.42 0.678 -945.9298 615.2084

[omitted output]
|
_cons | 5565.35 539.019 10.32 0.000 4507.252 6623.447
------------------------------------------------------------------------------

We see from the output above that the DiD coefficient (δDiD) is positive
and statistically significant (beta = 501, p < 0.05). This suggests that net
tuition revenue per FTE enrollment was, on average, higher by $501 in
Colorado after passage of SB 04-189, compared to net tuition revenue per
FTE enrollment in all other states.
The within-group fixed-effects DiD regression model (xtreg) can also be
employed.
xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob

. xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob


note: 1.T omitted because of collinearity
note: 2016.year omitted because of collinearity

Fixed-effects (within) regression Number of obs = 850


Group variable: fips Number of groups = 50

R-sq: Obs per group:


within = 0.8217 min = 17
between = 0.1378 avg = 17.0
overall = 0.3530 max = 17
F(18,49) = .
corr(u_i, Xb) = 0.0532 Prob > F = .

(Std. Err. adjusted for 50 clusters in fips)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.T | 0 (omitted)
1.P | 4634.093 1141.528 4.06 0.000 2340.108 6928.079
|
T#P |
1 1 | 501.2044 162.1192 3.09 0.003 175.4137 826.995
|
stapr_fte | -.1933747 .0767252 -2.52 0.015 -.3475598 -.0391896
pc_income | .0001359 .0554728 0.00 0.998 -.1113407 .1116126
|
year |
2001 | 219.4993 66.723 3.29 0.002 85.41442 353.5842

[omitted output]
|
_cons | 4291.512 1517.395 2.83 0.007 1242.192 7340.831
-------------+----------------------------------------------------------------
sigma_u | 2066.6057
sigma_e | 685.81506
rho | .90079681 (fraction of variance due to u_i)
------------------------------------------------------------------------------

In the within-group fixed-effects model, the DiD coefficient (δDiD) is
positive and statistically significant and has the same value as the regression
model with state dummy variables.
For comparison, we run the within-group fixed-effects model with the
second control group (states in WICHE).
xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob

. xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob


note: 1.T omitted because of collinearity
note: 2016.year omitted because of collinearity

Fixed-effects (within) regression Number of obs = 221


Group variable: fips Number of groups = 13

R-sq: Obs per group:


within = 0.8405 min = 17
between = 0.1823 avg = 17.0
overall = 0.4986 max = 17

F(12,12) = .
corr(u_i, Xb) = -0.0404 Prob > F = .

(Std. Err. adjusted for 13 clusters in fips)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.T | 0 (omitted)
1.P | 3694.226 1339.566 2.76 0.017 775.5632 6612.888
|
T#P |
1 1 | 947.8925 215.3771 4.40 0.001 478.6261 1417.159
|
stapr_fte | -.1722081 .1092674 -1.58 0.141 -.4102812 .065865
pc_income | .0047754 .0758187 0.06 0.951 -.1604195 .1699702
|
year |
2001 | 106.4016 75.83226 1.40 0.186 -58.82273 271.6259

[omitted output]
|
_cons | 3187.94 2251.055 1.42 0.182 -1716.689 8092.568
-------------+----------------------------------------------------------------
sigma_u | 1228.8195
sigma_e | 543.94789
rho | .83615752 (fraction of variance due to u_i)
------------------------------------------------------------------------------

We see that when the second control group is used, the DiD coefficient
(δDiD) is also positive and statistically significant, but the value is higher
($948).
The preferred regression-based DiD model is a matter of choice for
analysts. The choice depends on the selection of the treatment period in which
the analyst thinks the adoption of the policy began to take full effect, the
control variables, and the control group.

7.4.3.3 DiD Placebo Tests

In order to determine whether the estimated effect reflects a “real” policy
effect or some unknown factor, placebo tests are required. The tests involve
estimating the treatment effect after changing the treatment timing. This is
demonstrated below, where we change the timing in our example to 2000 and
simulate the treatment to occur before 2005.
gen placebo_2000 = 1 if year>=2000
recode placebo_2000 (.=0)
xtreg $y T##placebo_2000 $controls if (year>1995 | year<2005)
& (C2==1 | T==1),
fe rob

. xtreg $y T##placebo_2000 $controls if (year>1995 | year<2005)


& (C2==1 | T==1),
fe rob
note: 1.T omitted because of collinearity

Fixed-effects (within) regression Number of obs = 351


Group variable: fips Number of groups = 13

R-sq: Obs per group:


within = 0.7901 min = 27
between = 0.1559 avg = 27.0
overall = 0.5263 max = 27

F(3,12) = .
corr(u_i, Xb) = -0.2875 Prob > F = .

(Std. Err. adjusted for 13 clusters in fips)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+--------------------------------------------------------------
1.T | 0 (omitted)
1.placebo_2000 | -278.5172 228.006 -1.22 0.245 -775.2996 218.2653
|
T#placebo_2000 |
1 1 | 404.7803 377.3187 1.07 0.304 -417.3266 1226.887
|
stapr_fte | -.3260768 .1256597 -2.59 0.023 -.5998658 -.0522878
pc_income | .1858099 .0281743 6.60 0.000 .1244233 .2471964
_cons | -629.3751 939.2766 -0.67 0.516 -2675.883 1417.133
---------------+--------------------------------------------------------------
sigma_u | 1113.4546
sigma_e | 687.0107
rho | .7242707 (fraction of variance due to u_i)
------------------------------------------------------------------------------

We see from the output directly above that the placebo DiD coefficient (T#placebo_2000)
is statistically insignificant. This suggests the effect of SB 04-189 on net
tuition revenue per FTE enrollment is “real”. Policy analysts are encouraged
to conduct several placebo tests and use different control groups to validate
their findings.
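The logic of a placebo test can also be checked with simulated data. In the Python sketch below (a hypothetical panel, not the COF data; all numbers are invented), the true effect begins in 2005; a DiD estimate at the real cutoff recovers it, while a placebo estimate with a fake 2000 cutoff on pre-treatment data only is close to zero.

```python
import numpy as np

# Hypothetical panel: 13 states, 1990-2016, common linear trend across states
rng = np.random.default_rng(3)
effect = 900.0                       # true treatment effect, starting in 2005
rows = []
for s in range(13):
    base = rng.normal(4000, 300)
    for year in range(1990, 2017):
        out = base + 50 * (year - 1990) + rng.normal(0, 100)
        if s == 0 and year >= 2005:  # state 0 is "treated" from 2005 on
            out += effect
        rows.append((s, year, out))
arr = np.array(rows)
s_id, yr, y = arr[:, 0], arr[:, 1], arr[:, 2]

def did(cutoff, mask):
    """DiD interaction coefficient for a given treatment cutoff and sample."""
    T = (s_id == 0).astype(float)[mask]
    P = (yr >= cutoff).astype(float)[mask]
    X = np.column_stack([np.ones(T.size), T, P, T * P])
    return np.linalg.lstsq(X, y[mask], rcond=None)[0][3]

d_real = did(2005, np.full(yr.size, True))  # actual timing: estimate near 900
d_placebo = did(2000, yr < 2005)            # fake timing, pre-treatment years only: near 0
```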

7.5 Random-Effects Regression

Like fixed-effects regression, random-effects regression models (also known as
random intercept models) allow us to take into account unobserved time-
invariant variables. With random-effects models, we can use generalized
least squares (GLS) or maximum likelihood (ML) estimating techniques. ML
estimating techniques have desirable asymptotic properties (as the sample size
increases, the estimates remain consistent and efficient).
The random-effects regression model is reflected in the following equation:

Yit = β0 + β1 X1it + β2 X2it + · · · + βn Xnit + γZi + μi + εit (7.11)

where Z is an observed time-invariant categorical variable and γ is the
estimated beta coefficient for Z. In Eq. (7.11), Z does not vary with time
(t), while the unobserved time-invariant group error (μi) is
assumed to be random and uncorrelated with the independent variables, which
allows time-invariant variables to play a role as explanatory variables.
The random-effects estimator is the weighted average between the “within”
and “between” estimators. The weight is based on between-group variances
and derived from the variances of μi and εit, which produces the random-effects
model estimates. Using the data from one of the examples above, we can run
a random-effects regression model that includes the regional compact variable
(region_compact) by entering the Stata command xtreg with the option re.
When running random-effects models, GLS techniques tend to produce larger
standard errors than ML techniques. Because it is the default option and can
be used when estimating cluster-robust standard errors, GLS is implicitly
part of the following syntax.
xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, re
cluster(stateid)

The output is below.


. xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, re
cluster(stateid)

Random-effects GLS regression Number of obs = 1,350


Group variable: stateid Number of groups = 50

R-sq: Obs per group:


within = 0.8085 min = 27
between = 0.4145 avg = 27.0
overall = 0.6173 max = 27

Wald chi2(7) = 396.73


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

(Std. Err. adjusted for 50 clusters in stateid)


------------------------------------------------------------------------------
| Robust
netuit_fte | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+--------------------------------------------------------------
stapr_fte | -.5766974 .1224583 -4.71 0.000 -.8167112 -.3366836
stapr_fte2 | 9.83e-06 6.98e-06 1.41 0.159 -3.84e-06 .0000235
pc_income | .2106405 .0113809 18.51 0.000 .1883344 .2329467
|
region_compact |
SREB | 343.5133 950.5288 0.36 0.718 -1519.489 2206.516
WICHE | -629.2418 910.5012 -0.69 0.490 -2413.791 1155.308
MHEC | 276.6268 904.9169 0.31 0.760 -1496.978 2050.231
NEBHE | 1304.324 1162.57 1.12 0.262 -974.2709 3582.919
|
_cons | 226.986 967.0053 0.23 0.814 -1668.31 2122.282
---------------+--------------------------------------------------------------
sigma_u | 1235.0691
sigma_e | 837.70428
rho | .68491108 (fraction of variance due to u_i)
-------------------------------------------------------------------------------

Notice that with the exception of corr(u_i, X) = 0 (assumed), the
format of the output is the same as what we would find when we run a fixed-
effects regression model. How do we know whether the random-effects model
is more appropriate than a POLS regression model? By conducting a test
immediately after we run the random-effects regression model, we can provide
an answer to this question. This test is the Breusch and Pagan Lagrangian
multiplier test for random effects (xttest0).6
. xttest0
Breusch and Pagan Lagrangian multiplier test for random effects

netuit_fte[stateid,t] = Xb + u[stateid] + e[stateid,t]

Estimated results:
| Var sd = sqrt(Var)
---------+-----------------------------
netuit_e | 6674588 2583.522
e | 701748.5 837.7043
u | 1525396 1235.069

Test: Var(u) = 0
chibar2(01) = 8010.80
Prob > chibar2 = 0.0000

The rejection of the null hypothesis above indicates that, compared to a
POLS regression model, the random-effects regression is more appropriate.
But how do we know whether a random-effects or a fixed-effects regression
is the most appropriate model? According to some econometricians (Judge
et al. 1988), it depends on judgment and/or statistical tests. If we are using
a sample of units (e.g., 35 out of 50 states) or we suspect our independent
variables and unobserved heterogeneity are correlated or we are including
time-invariant variables, then we may want to use a random-effects regression
model. But what if we are doing an analysis that does not include observed
time-invariant variables? This is where the use of statistical tests is required.

7.5.1 Hausman Test

The Hausman test is most commonly employed to determine whether to
use a fixed-effects or random-effects model.7 Using data from one of our
examples above and a set of Stata commands, we can easily run this test

6 For a complete discussion of this test, see Breusch and Pagan (1980).
7 For a technical discussion of the Hausman test, see Hausman (1978).
in five steps. First we quietly run (i.e., not showing the results) a within-
group fixed-effects model. Second, we store those estimated results (i.e., est
sto fixed) to memory, which is illustrated below. Third, we quietly run a
random-effects model. Fourth, we store those estimated results (i.e., est sto
random). Fifth, we run the Hausman test (i.e., hausman fixed random).
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
est sto fixed
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
est sto random
hausman fixed random

Given the ordering of the stored estimated results in the last line of syntax,
a rejection of the null would indicate the fixed-effects regression is the more
appropriate model. The output is below.
. quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
.
. est sto fixed
.
. quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
.
. est sto random
.
. hausman fixed random
Note: the rank of the differenced variance matrix (3) does not equal the
number of coefficients being tested (5); be sure this is what you expect,
or there may be problems computing the test. Examine the output of
your estimators for anything unexpected and possibly consider scaling your
variables so that the coefficients are on a similar scale.
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fixed random Difference S.E.
-------------+----------------------------------------------------------------
statea | .6359503 .711084 -.0751337 .0078128
tuition | 1.20119 1.078007 .1231832 .0101439
totfteiarep | 1050.312 -332.6668 1382.979 296.0531
ftfac | 32819.51 10317.97 22501.54 6505.906
ptfac | 7375.78 5765.71 1610.069 2575.428
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(3) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B)
= 45.67
Prob>chi2 = 0.0000

While the above results of the Hausman test suggest we should use the
fixed-effects regression model, the note at the beginning of the output states
there may be problems with the test and recommends rescaling the
variables. To rescale the variables, we log transform them and rerun
the test. This time we will show the entire output, including the results of
each regression model, by excluding the quietly (qui) option.
. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, fe

Fixed-effects (within) regression Number of obs = 1,978


Group variable: opeid5_new Number of groups = 220

R-sq: Obs per group:


within = 0.6691 min = 2
between = 0.9157 avg = 9.0
overall = 0.8825 max = 10

F(5,1753) = 708.86
corr(u_i, Xb) = -0.8252 Prob > F = 0.0000

-------------------------------------------------------------------------------
lneg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+---------------------------------------------------------------
lnstatea | .0128249 .0045956 2.79 0.005 .0038114 .0218384
lntuition | .5562887 .0157124 35.40 0.000 .5254718 .5871057
lntotfteiarep | .113466 .0386945 2.93 0.003 .0375738 .1893581
lnftfac | .5642428 .0458174 12.32 0.000 .4743802 .6541054
ptfac | .0003861 .0000482 8.01 0.000 .0002915 .0004806
_cons | 3.801971 .3236401 11.75 0.000 3.16721 4.436732
--------------+---------------------------------------------------------------
sigma_u | .32099191
sigma_e | .12532947
rho | .86771903 (fraction of variance due to u_i)
-------------------------------------------------------------------------------
F test that all u_i=0: F(219, 1753) = 16.90 Prob > F = 0.0000

.
. est sto fixed

.
. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re

Random-effects GLS regression Number of obs = 1,978


Group variable: opeid5_new Number of groups = 220

R-sq: Obs per group:


within = 0.6536 min = 2
between = 0.8999 avg = 9.0
overall = 0.8697 max = 10
Wald chi2(5) = 5302.90


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

-------------------------------------------------------------------------------
lneg | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+---------------------------------------------------------------
lnstatea | .0192334 .004174 4.61 0.000 .0110526 .0274143
lntuition | .5254071 .0151305 34.72 0.000 .4957518 .5550624
lntotfteiarep | .0644028 .0331053 1.95 0.052 -.0004824 .129288
lnftfac | .3408924 .0344727 9.89 0.000 .273327 .4084577
lnptfac | .0417042 .0071786 5.81 0.000 .0276343 .0557741
_cons | 5.823929 .1957728 29.75 0.000 5.440222 6.207637
--------------+---------------------------------------------------------------
sigma_u | .15783994
sigma_e | .12699568
rho | .60703286 (fraction of variance due to u_i)
-------------------------------------------------------------------------------

.
. est sto random

.
. hausman fixed random

---- Coefficients ----


| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fixed random Difference S.E.
-------------+----------------------------------------------------------------
lnstatea | .0128249 .0192334 -.0064085 .0019229
lntuition | .5562887 .5254071 .0308817 .0042361
lntotfteiap | .113466 .0644028 .0490632 .0200325
lnftfac | .5642428 .3408924 .2233504 .0301806
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(4) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B)
= 454.45
Prob>chi2 = 0.0000

While the results of the Hausman test using the log transformed variables
are more accurate, they are based on models that do not allow us to take
into account heteroscedasticity via cluster-robust errors. This is a limitation
of the standard Hausman test provided by Stata. For this reason, we now
turn to a Stata user-written Hausman routine (rhausman) by Kaiser (2015)
that addresses this limitation. (We have to download this program by typing
ssc install rhausman.) Using this program, the log transformed variables,
and the models with cluster-robust errors, we rerun the Hausman test. The
options reps(400) and cluster are included to allow for random sampling
with replacement (i.e., 400 times) and to take into account the cluster variable
opeid5_new, respectively.8
. quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) fe
. est sto fixed
. quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) re
. est sto random
. rhausman fixed random, reps(400) cluster
bootstrap in progress
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
(This bootstrap will approximately take another 0h. 1min. 10sec.)
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
-------------------------------------------------------------------------------
Cluster-Robust Hausman Test
(based on 400 bootstrap repetitions)

b1: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) fe
b2: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac,
cluster(opeid5_new) re

Test: Ho: difference in coefficients not systematic

chi2(5) = (b1-b2)' * [V_bootstrapped(b1-b2)]^(-1) * (b1-b2)


= 49.03
Prob>chi2 = 0.0000

Given the output above, we can now be confident that the results
of the Hausman test accurately indicate that the fixed-effects regression model is
more appropriate. Using a fixed-effects regression model and our panel data
of public master’s universities and colleges, we can now conclude that E&G
expenditures are positively related to state appropriations (lnstatea), tuition
revenue (lntuition), total FTE students (lntotfteiarep), full-time faculty
(lnftfac), and part-time faculty (ptfac).
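The Hausman statistic reported in these outputs is simply a quadratic form in the difference between the two coefficient vectors. As a language-agnostic illustration, here is a minimal Python (numpy) sketch with made-up coefficient vectors and covariance matrices (not the values from this example):

```python
import numpy as np

# Hypothetical coefficients from a fixed-effects (b1) and a
# random-effects (b2) model, plus their covariance matrices.
b1 = np.array([0.55, 0.11])
b2 = np.array([0.52, 0.06])
V1 = np.array([[0.0004, 0.0], [0.0, 0.0009]])
V2 = np.array([[0.0003, 0.0], [0.0, 0.0005]])

# Hausman statistic: (b1-b2)' [V1-V2]^(-1) (b1-b2), chi2 with k df
diff = b1 - b2
stat = diff @ np.linalg.inv(V1 - V2) @ diff
print(round(stat, 2))   # -> 15.25
```

With 2 degrees of freedom, the 5% chi-squared critical value is about 5.99, so a statistic of 15.25 would lead us to reject the null hypothesis that the two sets of coefficients do not differ systematically.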

8 For more on bootstrapping in Stata, see Guan (2003).



7.6 Summary

This chapter introduced intermediate statistical methods that are used in


higher education policy correlational studies. Starting with pooled ordinary
least squares (POLS) and continuing with fixed-effects and random-effects
regression models, this chapter demonstrated how we can use these statistical
techniques to analyze panel data with Stata syntax. We also showed how
various tests can be conducted to determine the appropriate method that
should be employed in correlational studies. The chapter also introduced how
fixed-effects regression can be modified to infer causal effects by including
difference-in-differences estimators.

7.7 Appendix
*Chapter 7 Stata syntax

*Bivariate OLS Regression

*use dataset from the previous chapter


use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files\Example 6.3.dta", clear

*generate bivariate (one independent variable) OLS regression output


regress netuit_fte stapr_fte if year ==2016
*create a new variable reflecting, say, the squared term ///
(or quadratic) of another variable
gen stapr_fte2 = stapr_fte*stapr_fte

*include new variable in the regression model


regress netuit_fte stapr_fte stapr_fte2 pc_income if year ==2016

*Multivariate Pooled OLS Regression


reg netuit_fte stapr_fte stapr_fte2 pc_income

*include the categorical variable (region_compact) in regression model


reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact

*Multivariate Pooled OLS Regression with Interaction Terms


reg netuit_fte stapr_fte i.region_compact##i.ugradmerit, allbaselevels

*test to see if there is an interaction effect by quietly (qui) running and ///
storing (est sto) the models without (model1) and with (model2) the ///
interaction terms


qui reg netuit_fte stapr_fte i.region_compact
est sto model1
qui reg netuit_fte stapr_fte i.region_compact##i.ugradmerit
est sto model2
lrtest model1 model2

*Using the testparm command, the statistical significance of the interaction ///
terms can also be checked.
testparm i.region_compact#i.ugradmerit

*if the interaction term is composed of one continuous (c) variable and one ///
categorical (i) variable
reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset
testparm c.stapr_fte#i.tuitset

*change working directory


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 7\Stata files"

*use new dataset


use "Example 7.1.dta", clear

*if two continuous variables are included


reg netuit_fte i.region_compact c.stapr_fte##c.state_needFTE

*using margins (with the vsquish option)


margins, dydx(stapr_fte) at(state_needFTE=(0(3000)10000)) vsquish
qui margins, at(stapr_fte=(0 10000) state_needFTE=(0(3000)10000)) vsquish

*and marginsplot with different patterns


marginsplot, noci x(stapr_fte) recast(line) xlabel(0(3000)10000) ///
plot1opts(lpattern("...")) plot2opts(lpattern("-..-") color(black)) ///
plot3opts(lpattern("---") color(black)) plot4opts(color(black))
*residual-versus-fitted plot
rvfplot, mcolor(black)

*comprehensive post-estimation
estat imtest

*POLS regression model using the robust option


reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, robust

*Levene test of homogeneity


quietly: reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
predict double eps, residual
robvar eps, by(state)

*To address this particular violation of the assumption of homoscedasticity, ///
we use the cluster option, with state as the cluster variable, in our ///
POLS regression model.
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)

*Fixed-Effects Regression
*Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression
*Estimating FEDV multivariate POLS regression models
reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state)

*we determine if the state fixed-effects as a whole are statistically ///


significant, immediately after we run the above regression
testparm i.stateid

* alternative approach to producing the same results


areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid) absorb(stateid)

*open another dataset


use "Example 7.1.dta"
areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new)

*Unobserved Heterogeneity and Within-Group Estimator Fixed-Effects Regression ///


using the xtreg command with the fe option
xtreg eg statea tuition totfteiarep ftfac ptfac, fe cluster(opeid5_new)

*Fixed-effects regression and difference-in-differences (DiD)


*The DiD Estimator
*Fixed-effects Regression-based DiD: An Example
use "Example 7.1.dta", clear

*We create the treatment variable (T).


gen T=0
replace T=1 if state=="CO"

*The post-treatment (P) is then created.


gen P=0
replace P=1 if year>=2004

*Based on every state other than the treatment state (Colorado), we create the ///
first control group.
gen C1 = 0
replace C1=1 if state !="CO"

*we create a second control group.


gen C2 = 0
replace C2=1 if state !="CO" & region_compact==2

*we use the global command to create temporary variables reflecting the ///
dependent variable net tuition revenue per FTE enrollment (y)
global y "netuit_fte"

*and the set of control variables state appropriations to higher education per ///
FTE enrollment (stapr_fte) and state per capita income (pc_income).
global controls "stapr_fte pc_income"

*To take into account heteroscedasticity, we include the robust (rob) ///
option in the syntax.
reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob

*The within-group fixed-effects DiD regression model can also be employed.


xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob

*For comparison, we run the within-group fixed-effects model with the second ///
control group (states in WICHE).
xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob

*DiD Placebo Tests


gen placebo_2000 = 1 if year>=2000
recode placebo_2000 (.=0)
xtreg $y T##placebo_2000 $controls if (year>1995 & year<2005) & (C2==1 | ///
T==1), fe rob

*Random-Effects Regression
xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, ///
re cluster(stateid)

*Breusch and Pagan Lagrangian multiplier test for random effects


xttest0

* Hausman test
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
est sto fixed
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
est sto random
hausman fixed random

*log transform the variables and rerun the test


xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, fe
est sto fixed
xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, re
est sto random
hausman fixed random

*install a Stata user-written Hausman routine by Kaiser (2015)


ssc install rhausman

*run rHausman
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) fe
est sto fixed
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) re

est sto random


rhausman fixed random, reps(400) cluster

*end

References

Breusch, T. S., & Pagan, A. R. (1980). The Lagrange Multiplier Test and its Applications
to Model Specification in Econometrics. The Review of Economic Studies, 47(1),
239–253. https://doi.org/10.2307/2297111
Furquim, F., Corral, D., & Hillman, N. (2020). A Primer for Interpreting and Designing
Difference-in-Differences Studies in Higher Education Research. In L. W. Perna (Ed.),
Higher Education: Handbook of Theory and Research: Volume 35 (pp. 667–723).
Springer International Publishing. https://doi.org/10.1007/978-3-030-31365-4_5
Guan, W. (2003). From the help desk: Bootstrapped standard errors. The Stata Journal,
3(1), 71–80.
Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46(6),
1251–1271. https://doi.org/10.2307/1913827
Hoechle, D. (2007). Robust Standard Errors for Panel Regressions with Cross-Sectional
Dependence. The Stata Journal, 7(3), 281–312.
https://doi.org/10.1177/1536867X0700700301
Hutchinson, S. R., & Lovell, C. D. (2004). A review of methodological characteristics
of research published in key journals in higher education: Implications for graduate
research training. Research in Higher Education, 45(4), 383–403.
Judge, G. G., Hill, R. C., Griffiths, W. E., Lutkepohl, H., & Lee, T.-C. (1988). Introduction
to the Theory and Practice of Econometrics (2nd ed.). Wiley.
Kaiser, B. (2015). RHAUSMAN: Stata module to perform Robust Hausman Specification
Test. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s457909.html
Wells, R. S., Kolek, E. A., Williams, E. A., & Saunders, D. B. (2015). "How We Know
What We Know": A Systematic Comparison of Research Methods Employed in Higher
Education Journals, 1996–2000 v. 2006–2010. The Journal of Higher Education,
86(2), 171–198.
Chapter 8
Advanced Statistical Techniques: I

Abstract This chapter introduces advanced correlational statistical tech-


niques that are employed when there are violations of OLS regression
assumptions. This chapter discusses and demonstrates how graphs and formal
tests can be utilized to detect the violations of OLS assumptions when using
time series and cross-sectional data. The introduction of advanced statistical
techniques to address the violation of assumptions is also presented.

Keywords Autocorrelation · Autoregressive (AR1) · Time series


regression · Cross-sectional dependence

8.1 Introduction

This chapter introduces and demonstrates the use of advanced correlational


statistical techniques. We use these techniques when some of the assumptions
of OLS regression are violated. In this chapter, we discuss and show how
to create graphs and conduct tests to detect when those assumptions are
violated. We also introduce and demonstrate the use of advanced statistical
techniques that address the violation of OLS assumptions. These statistical
techniques include regression models that are used with time series, cross-
sectional, and panel data. The Stata commands and syntax that are used
to demonstrate the use of these graphs, tests, and statistical techniques are
included in an appendix at the end of the chapter.

© Springer Nature Switzerland AG 2021


M. Titus, Higher Education Policy Analysis Using Quantitative Techniques,
Quantitative Methods in the Humanities and Social Sciences,
https://doi.org/10.1007/978-3-030-60831-6_8

8.2 Time Series Data and Autocorrelation

When conducting policy analysis using time series data and regression
techniques, it is very likely that we will encounter violations of many OLS
assumptions. OLS assumes that the error terms are uncorrelated over time. In
other words, no autocorrelation is present. Autocorrelation occurs when the
residual or idiosyncratic error (ε) for one time period (εt+1) is correlated
with the error for the subsequent time period (εt+2). In a regression
framework, the first-order autoregressive disturbance term (ρ) is reflected
in the following equations:

Yt = βXt + ut,                                            (8.1)

where

ut = ρut−1 + εt

In regression models that do not address autocorrelated errors, the
estimators are inefficient and the estimated standard errors are biased,
which can lead to incorrect inferences about the beta coefficients.
If we are planning to use a regression model with time series or panel data,
then we should investigate whether autocorrelation is present. There
are several ways to detect autocorrelation, including a visual inspection of
the residuals from a preliminary OLS regression model of time series data,
and formal statistical tests for use with time series or panel data.
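To build intuition for what an AR(1) disturbance implies, we can simulate one and measure the correlation between the series and its own first lag. A short Python (numpy) sketch, with ρ = 0.6 and the sample size chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.6, 2000

# Build u_t = rho * u_(t-1) + e_t with iid standard normal shocks e_t
e = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]

# The sample lag-1 autocorrelation should be close to rho
r1 = np.corrcoef(u[:-1], u[1:])[0, 1]
print(round(r1, 2))
```

With two thousand observations, the estimated lag-1 correlation lands close to the true ρ of 0.6; residuals from a correctly specified model, by contrast, should show a lag-1 correlation near zero.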
When using regression models with time series data, we have to carefully
examine residuals for autocorrelation. Additionally, we have to take care that
when we initially model our data, we do not do so without taking into account
the possibility that the values of our variables may be increasing with time,
otherwise known as nonstationary data. Therefore, this discussion of time
series and autocorrelation also briefly introduces the concepts of stationary
data and differencing. The related concept of cointegration is discussed
in the next chapter.
We will use annual data compiled from the U.S. Department of Education
(DOE), which includes total enrollment in and tuition (adjusted for inflation)
at public 2-year higher education institutions from 1970 to 2016. This dataset
is also supplemented with annual unemployment rates, over the same time
period, from the U.S. Bureau of Labor Statistics (BLS). Using these data and
multivariate time series analysis, it will be demonstrated how we can visually
inspect residuals from an initial OLS model. First, we change our working
directory and open the file: Time series—Enrollment & Tuition & fees at 2
year public HEIs.dta.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 8\Stata files"
use "Time series - Enrollment & Tuition & fees at 2 yr public HEIs.dta"

Second, we declare the data to be a time series by typing tsset year. Third, after
observing that the data are skewed, we log transform and visually inspect
the data for each of the three variables over time. We use the following Stata
syntax (which must be entered on one line) to create a line graph of the log
of enrollment, tuition, and unemployment. (Take note of the options.)
twoway (line lnenpub2yr year, lcolor(black) lpattern(solid))
(line lntupub2yr year, lcolor(black) lpattern(dash))
(line lnunemprate year, lcolor(black) lpattern(dot)),
xlabel(1970 (6) 2017, labsize(small)) ytitle(Logs)
title("Trends in Enrollment in 2 YR, Tuition at 2 YR, and
Unemployment Rates" "1970 to 2017", size(medium))
Figure 8.1 shows that log transformed enrollment and tuition change with
time (years). This observation suggests the time series data are nonstationary,
which would produce a spurious relationship between the variables and
unreliable beta coefficients when using a regression model. But more evidence
is needed to arrive at a conclusion regarding the existence of nonstationary
data, otherwise known as data with a unit root. To detect a unit root, statistical tests
are conducted. The most well-known unit root test is the augmented Dickey–
Fuller (ADF) test (Dickey and Fuller 1979). However, the
modified Dickey–Fuller test (known as the DF-GLS test), is more powerful
than the ADF (Elliott et al. 1996).1 Therefore, we will use the DF-GLS unit
root test.
The null hypothesis of the DF-GLS test is that the time series of the variable
has a unit root, while the alternative is that it is (1) stationary about a linear time trend
or (2) stationary with a possibly nonzero mean but with no linear time
trend. When we conduct the DF-GLS test via Stata (dfgls), we use the
first alternative.
. dfgls lnenpub2yr
DF-GLS for lnenpub2yr Number of obs = 38

Maxlag = 9 chosen by Schwert criterion


DF-GLS tau 1% Critical 5% Critical 10% Critical

[lags] Test Statistic Value Value Value


------------------------------------------------------------------------------
9 -0.898 -3.770 -2.723 -2.425
8 -0.882 -3.770 -2.783 -2.490
7 -0.647 -3.770 -2.850 -2.559
6 -0.867 -3.770 -2.921 -2.630
5 -1.014 -3.770 -2.994 -2.701
4 -0.798 -3.770 -3.066 -2.769
3 -0.831 -3.770 -3.133 -2.833

1 The DF-GLS unit root test uses generalized least squares (GLS) regression to de-trend

the data.
2               -0.965            -3.770        -3.195         -2.889
1               -1.094            -3.770        -3.247         -2.937
Opt Lag (Ng-Perron seq t) = 1 with RMSE .0272961
Min SIC = -7.01057 at lag 1 with RMSE .0272961
Min MAIC = -7.075481 at lag 1 with RMSE .0272961

Fig. 8.1 Enrollment, tuition, and unemployment, changes over time (1970–2017)

From the test results for the log of enrollment, we can see that the null
hypothesis of a unit root is not rejected for any of the lags.
. dfgls lntupub2yr

DF-GLS for lntupub2yr Number of obs = 38

Maxlag = 9 chosen by Schwert criterion


DF-GLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value
------------------------------------------------------------------------------
9 -1.708 -3.770 -2.723 -2.425
8 -1.757 -3.770 -2.783 -2.490
7 -1.902 -3.770 -2.850 -2.559
6 -1.947 -3.770 -2.921 -2.630
5 -1.909 -3.770 -2.994 -2.701
4 -1.963 -3.770 -3.066 -2.769
3 -2.054 -3.770 -3.133 -2.833
2 -2.780 -3.770 -3.195 -2.889
1 -3.454 -3.770 -3.247 -2.937
Opt Lag (Ng-Perron seq t) = 3 with RMSE .0224833

Min SIC = -7.263363 at lag 1 with RMSE .0240551


Min MAIC = -7.063126 at lag 3 with RMSE .0224833

From the test results for the log of tuition, we can see that the null
hypothesis of a unit root is rejected for only lag 1, but not at the 1% level.
. dfgls lnunemprate
DF-GLS for lnunemprate Number of obs = 38
Maxlag = 9 chosen by Schwert criterion
DF-GLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value
------------------------------------------------------------------------------
9 -1.674 -3.770 -2.723 -2.425
8 -1.603 -3.770 -2.783 -2.490
7 -2.413 -3.770 -2.850 -2.559
6 -2.039 -3.770 -2.921 -2.630
5 -2.230 -3.770 -2.994 -2.701
4 -1.908 -3.770 -3.066 -2.769
3 -2.330 -3.770 -3.133 -2.833
2 -2.294 -3.770 -3.195 -2.889
1 -3.013 -3.770 -3.247 -2.937
Opt Lag (Ng-Perron seq t) = 1 with RMSE .1173187
Min SIC = -4.09427 at lag 1 with RMSE .1173187
Min MAIC = -3.763267 at lag 2 with RMSE .1165867

The test results indicate there is a unit root in the unemployment rate data.
So, the DF-GLS unit root test results above confirm that all three variables
are nonstationary.
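The mechanics behind these unit root tests can also be sketched outside Stata. Below is an illustrative Python (numpy) version of the simplest Dickey–Fuller idea (no GLS de-trending and no augmentation lags): regress Δy_t on y_(t−1) and a constant, then examine the t-statistic on y_(t−1). The series here is simulated, not the enrollment data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# A clearly stationary AR(1) series: strong mean reversion toward zero
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.3 * y[t - 1] + rng.standard_normal()

# Dickey-Fuller regression: dy_t = a + g * y_(t-1) + e_t
dy, ylag = np.diff(y), y[:-1]
X = np.column_stack([np.ones(n - 1), ylag])
beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
resid = dy - X @ beta
s2 = resid @ resid / (n - 3)
se_g = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se_g   # strongly negative -> reject a unit root
print(round(t_stat, 1))
```

Note that under the unit root null this t-statistic does not follow the usual t distribution; dedicated Dickey–Fuller critical values (roughly −3.4 at the 1% level with a constant) must be used, which is what the dfgls output tabulates.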
Because the data are nonstationary, we have to transform the time series
data by taking their first differences and then run the regression model
on the first-differenced data. Differencing is simply computing the differences
between consecutive observations; in other words, subtracting the previous
value from the current value. While the logarithmic transformation of the
data may stabilize the variance, differencing may produce a constant mean
of a time series. In Stata, we can automatically difference data by inserting "D1."
in syntax as shown below. We first create a graph of the first-differenced time
series. We enter the following syntax (all on one line).
twoway (line D1.lnenpub2yr year, lcolor(black) lpattern(solid))
(line D1.lntupub2yr year, lcolor(black) lpattern(dash))
(line D1.lnunemprate year, lcolor(black) lpattern(dot)),
xlabel(1971 (5) 2017, labsize(small)) ytitle(Change in Logs)
title("First-Differenced Enrollment in 2 YR, Tuition at
2 YR, and Unemployment Rates" "1971 to 2017", size(small))
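Differencing itself needs no special machinery. A minimal Python (numpy) sketch of the log-and-difference transformation used here, applied to a toy positive-valued series:

```python
import numpy as np

# Toy enrollment-like series (strictly positive, growing 10% per period)
y = np.array([100.0, 110.0, 121.0, 133.1])

# Log first differences = approximate period-over-period growth rates
dlny = np.diff(np.log(y))
print(np.round(dlny, 4))   # each entry is log(1.10), about 0.0953
```

A constant growth rate in levels becomes a constant (stationary) series in log first differences, which is exactly why this transformation is applied before the regression below.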
Figure 8.2 shows that the first-differenced logged data (except enrollment)
do not change with time and are, for the most part, stationary. Now we can
regress the first-differenced log of enrollment on the first-differenced log of
tuition and unemployment.

Fig. 8.2 Enrollment, tuition, and unemployment, first-differenced (1971–2017)

. reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate


Source | SS df MS Number of obs = 46
-------------+---------------------------------- F(2, 43) = 20.76
Model | .046892295 2 .023446148 Prob > F = 0.0000
Residual | .048564458 43 .001129406 R-squared = 0.4912
-------------+---------------------------------- Adj R-squared = 0.4676
Total | .095456753 45 .002121261 Root MSE = .03361
------------------------------------------------------------------------------
D.lnenpub2yr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------
lntupub2yr |
D1. | -.2340422 .0944225 -2.48 0.017 -.4244633 -.043621
|
lnunemprate |
D1. | .1763452 .0336472 5.24 0.000 .1084892 .2442012
|
_cons | .0265284 .0053751 4.94 0.000 .0156885 .0373684
------------------------------------------------------------------------------

Next, the Stata command racplot is used to create an autocorrelation


function or correlogram of the residuals from the regression model above.
The correlogram of the residuals is shown in Fig. 8.3.

Fig. 8.3 Autocorrelation (correlogram) of the residuals from the regression model

We can see that after the first and second lags, the values of the
autocorrelation of the residuals decline substantially with the number of lags
and lie inside the 95% confidence interval (the shaded area). The partial
autocorrelations of the residuals in Fig. 8.4 provide additional visual evidence
of a first-order autoregressive (AR1) disturbance. This figure is created by
generating the residuals from the model (predict residuals, resid) and
creating a graph of partial autocorrelations (pac residuals, yw). We see
that after the first lag, the partial autocorrelations of the residuals dissipate
at the higher lags and are well within the 95% confidence interval (Fig. 8.4).
Combined, these visuals suggest evidence of first-order autocorrelation
(AR1) that should be addressed before using a final regression model.
However, for more definitive evidence, we should conduct statistical tests.
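The autocorrelations that a correlogram plots are straightforward to compute directly. A small Python (numpy) sketch of the sample autocorrelation function, applied to a toy residual vector:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    denom = xd @ xd
    return np.array([(xd[:-k] @ xd[k:]) / denom
                     for k in range(1, max_lag + 1)])

# Alternating residuals -> strong negative lag-1 autocorrelation
e = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
print(np.round(acf(e, 2), 2))   # lag 1 about -0.83, lag 2 about 0.67
```

A correlogram simply plots these values against the lag k, together with confidence bands around zero.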

8.3 Testing for Autocorrelations

When using time series data, the most common tests for autocorrelation are
the Durbin–Watson (D-W) test (Durbin and Watson 1950) and Breusch–
Godfrey (B-G) test (Breusch 1978; Godfrey 1978). The D-W test is based
on a measure of autocorrelation in the residuals from a regression model.
That measure, the D-W (d) statistic, always has a value between 0 and 4. A
d statistic with a value of 2.0 indicates there is no autocorrelation, while a
value from 0 to less than 2 indicates a positive autocorrelation. A d statistic
with a value from 2 to 4 indicates negative autocorrelation. Different versions
of the D-W test are based on different assumptions regarding the exogeneity
of the independent variables. The results of the D-W test that are based on
the work of Durbin and Watson (1950) assume the independent variables

Fig. 8.4 Partial autocorrelations of residuals

are exogenous and the residuals are normally distributed. An alternative
version of the D-W test relaxes the assumptions of exogenous independent
variables, normally distributed residuals, and homoscedasticity (Davidson and
MacKinnon 1993). The B-G test is limited in that while it relaxes the
assumption of strictly exogenous regressors, it does not take into account
violations of the assumption of homoscedasticity. The Wooldridge (2002)
test or Arellano-Bond (A-B) test (Arellano and Bond 1991) is conducted
when using panel data. The less known Cumby-Huizinga (C-H) general test
(Cumby and Huizinga 1992) can be used with both time series and panel
data.
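The d statistic itself is easy to compute from a vector of residuals. A minimal Python (numpy) sketch:

```python
import numpy as np

def durbin_watson(resid):
    """D-W d = sum of squared successive differences / sum of squares.
    Ranges from 0 to 4; a value near 2 suggests no first-order
    autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Slowly drifting (positively autocorrelated) residuals -> d well below 2
print(round(durbin_watson([1.0, 0.9, 0.8, 0.7, 0.6]), 2))   # -> 0.01

# Alternating (negatively autocorrelated) residuals -> d near 4
print(round(durbin_watson([1.0, -1.0, 1.0, -1.0]), 2))   # -> 3.0
```

The formal D-W test then compares this statistic to tabulated bounds that depend on the sample size and the number of regressors.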

8.3.1 Examples of Autocorrelation Tests—Time


Series Data

To demonstrate how to conduct a D-W test when using time series data,
we use the same data from above, assume the independent variables are
exogenous, and use the Stata post-estimation time series command estat
dwatson.
. estat dwatson
Durbin-Watson d-statistic( 3, 47) = .8127196

While the value of the D-W d statistic shown above is 0.813, the D-
W test does not tell us whether or not the value is statistically different
from 2. In addition, the results are based on the assumptions of exogenous
independent variables, a normal distribution of the residuals or errors (ε),
and homoscedastic errors. (The OLS regression model that we ran did not
take into account possible heteroscedasticity.) So, we “quietly” (i.e., do not
show the output) rerun the regression model with the robust (rob) option
and use the alternative D-W test, the post-estimation time series command
estat durbinalt, with the force option.
. quietly: reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
.

. estat durbinalt, force


Durbin’s alternative test for autocorrelation

---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 24.039 1 0.0000
---------------------------------------------------------------------------
H0: no serial correlation

We can see from the above output of the D-W alternative test that the
null hypothesis (Ho ) of no serial correlation (no autocorrelation) is rejected
(p < 0.001).

8.4 Time Series Regression Models with AR terms

When we find a violation of the assumption of no autocorrelation, we have to
use a time series regression model that includes first-order serially correlated
residuals or an autoregressive (AR1) disturbance term. This regression model
can be estimated via several estimating techniques. (See Davidson and
MacKinnon (1993) for a complete discussion of these estimating techniques.)
The time series regression model with an AR term can be estimated via the
Prais–Winsten (P-W) estimator. In Stata, this is accomplished by using the
prais command in place of the regress command. The prais command is
used (with the default rhotype(regress)—base rho (ρ) on single-lag OLS of
residuals) along with the same dependent and independent variables and the
rob option as in the regression model above.
. prais D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
Iteration 0: rho = 0.0000
Iteration 1: rho = 0.5658
Iteration 2: rho = 0.6117
Iteration 3: rho = 0.6147
Iteration 4: rho = 0.6149

Iteration 5: rho = 0.6149


Iteration 6: rho = 0.6149
Prais-Winsten AR(1) regression -- iterated estimates
Linear regression Number of obs = 47
F(2, 44) = 25.33
Prob > F = 0.0000
R-squared = 0.5486
Root MSE = .02624
------------------------------------------------------------------------------
| Semirobust
D.lnenpub2yr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lntupub2yr |
D1. | -.240876 .0914333 -2.63 0.012 -.4251476 -.0566044
|
lnunemprate |
D1. | .1335829 .0257963 5.18 0.000 .0815939 .1855718
|
_cons | .0279557 .0100677 2.78 0.008 .0076656 .0482458
-------------+----------------------------------------------------------------
rho | .6149498
------------------------------------------------------------------------------
Durbin-Watson statistic (original) 0.812720
Durbin-Watson statistic (transformed) 2.145852

The rho (ρ) in the output above shows there is positive autocorrelation.
The regression model with an AR1 term shows that the first-differenced log
transformed tuition and unemployment variables are statistically significant.
We also see that the value of the transformed D-W d statistic is approximately
2 (2.146), suggesting no autocorrelation. However, we should examine the autocorrelation and
partial autocorrelation functions of the residuals from the P-W regression.
We do so by first generating residuals from the P-W regression (predict
residuals_PW, resid) and creating graphs of autocorrelations and partial
autocorrelations of those residuals.
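Conceptually, the P-W estimator quasi-differences the data with the estimated ρ and rescales the first observation rather than dropping it (which distinguishes it from the Cochrane–Orcutt procedure). A Python (numpy) sketch of the transformation, using ρ = 0.61 to echo the estimate in the output above and a toy series:

```python
import numpy as np

def prais_winsten_transform(y, rho):
    """Quasi-difference a series: y*_1 = sqrt(1 - rho^2) * y_1 and
    y*_t = y_t - rho * y_(t-1) for t > 1. OLS on data transformed this
    way (both y and X) has approximately serially uncorrelated errors
    under an AR(1) disturbance."""
    y = np.asarray(y, dtype=float)
    out = y - rho * np.concatenate(([0.0], y[:-1]))
    out[0] = np.sqrt(1 - rho ** 2) * y[0]
    return out

ystar = prais_winsten_transform([2.0, 3.0, 4.0], rho=0.61)
print(np.round(ystar, 3))
```

In practice ρ is unknown, so the estimator iterates: estimate ρ from OLS residuals, transform, re-estimate, and repeat until ρ converges, which is what the iteration log above reports.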

8.4.1 Autocorrelation of the Residuals from the P-W


Regression

Given Fig. 8.5, it appears as if there is still autocorrelation even after we ran
the P-W regression. The partial autocorrelation function in Fig. 8.6 further
provides evidence of first-order autocorrelation (AR1).
Fig. 8.5 Autocorrelation of the residuals from the P-W regression

Fig. 8.6 Partial autocorrelations of the residuals from P-W regression

Because the alternative D-W test for autocorrelation does not work
after running a P-W regression in Stata, we use the Cumby-Huizinga (C-H)
general test of the residuals. However, the Stata user-written program for
the C-H test has to be downloaded (ssc install actest). We will check for
autocorrelation of the residuals from the P-W regression (residuals_PW) for
up to four lags (lag(4)), specify the null hypothesis of no autocorrelation at
any lag order (q0), and take into account possible heteroscedasticity (rob).
So our Stata syntax for this test is: actest residuals_PW, lag(4) q0 rob.
The output is as follows.

. actest residuals_PW, lag(4) q0 rob

Cumby-Huizinga test for autocorrelation


H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 5.369 1 0.0205 | 1 | 5.369 1 0.0205
1 - 2 | 6.147 2 0.0463 | 2 | 5.812 1 0.0159
1 - 3 | 6.255 3 0.0998 | 3 | 3.141 1 0.0763
1 - 4 | 6.344 4 0.1749 | 4 | 1.201 1 0.2731
-----------------------------------------------------------------------------
Test robust to heteroskedasticity

Looking at the panel on the right in the output from the C-H test, we can
see that the null hypothesis of no autocorrelation is rejected at both the first
(chi2 = 5.369, p < 0.05) and second (chi2 = 5.812, p < 0.05) lags, indicating
first-order (AR1) and second-order (AR2) autocorrelation.2 Unfortunately,
Stata’s Prais–Winsten (prais) regression allows for including only AR1.
Therefore, we have to use an autoregressive–moving-average (ARMA) model
with only autoregressive terms and exogenous independent variables, or
commonly known as an ARMAX model.3 ARMA models can accommodate
autoregressive disturbance terms with more than one lag and are reflected in
an expansion of Eq. (7.1) in the previous chapter to include the following:

Yt = βXt + μt

        p               q
μt =    Σ ρi μt−i  +    Σ θj εt−j + εt                 (8.2)
       i=1             j=1

where p is the number of autoregressive terms, q is the number of moving-
average terms, ρi are the autoregressive parameters, θj are the moving-
average parameters, j is the lag, t is time, and εt is the error term.
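The autoregressive part of Eq. (8.2) can be illustrated by simulating an AR(2) disturbance and recovering ρ1 and ρ2 by regressing the series on its own first two lags. A Python (numpy) sketch with arbitrarily chosen true values:

```python
import numpy as np

rng = np.random.default_rng(2)
rho1, rho2, n = 0.5, 0.3, 5000

# Simulate u_t = rho1 * u_(t-1) + rho2 * u_(t-2) + e_t (stationary)
u = np.zeros(n)
e = rng.standard_normal(n)
for t in range(2, n):
    u[t] = rho1 * u[t - 1] + rho2 * u[t - 2] + e[t]

# OLS of u_t on u_(t-1) and u_(t-2) recovers the AR coefficients
X = np.column_stack([u[1:-1], u[:-2]])
coef, *_ = np.linalg.lstsq(X, u[2:], rcond=None)
print(np.round(coef, 2))   # close to (0.5, 0.3)
```

Maximum likelihood routines such as Stata's arima estimate the same AR parameters jointly with the regression coefficients on the exogenous variables.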
Using the Stata command arima, we then estimate an ARMAX
model with first-order (AR1) and second-order (AR2) autoregressive
terms with the dependent variable first-differenced log of enrollment
(D1.lnenpub2yr), and the exogenous independent variables first-differenced
log of tuition (D1.lntupub2yr) and first-differenced log of unemployment

2 According to Hoechle (2007), an AR process can be approximated by an MA process.


3 The ARMAX model is an extension of the Box–Jenkins autoregressive integrated moving
average (ARIMA) model with exogenous variables. This particular example does not include
a moving-average (MA) term. The decision to include an MA term is determined by
observing the autocorrelation function of the dependent variable. Although not shown,
including the MA term in this example does not change the results. For more information
on the ARIMA model, see Box and Jenkins (1970).
8.4 Time Series Regression Models with AR terms 157

rate (D1.lnunemprate). The vce(robust) option is included, which produces
semi-robust standard errors.
. arima D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, ar(1 2 ) vce(robust)
(setting optimization to BHHH)
Iteration 0: log pseudolikelihood = 104.65434
Iteration 1: log pseudolikelihood = 106.70775
Iteration 2: log pseudolikelihood = 107.3222
Iteration 3: log pseudolikelihood = 107.57829
Iteration 4: log pseudolikelihood = 107.68275
(switching optimization to BFGS)
Iteration 5: log pseudolikelihood = 107.7467
Iteration 6: log pseudolikelihood = 107.80569
Iteration 7: log pseudolikelihood = 107.81934
Iteration 8: log pseudolikelihood = 107.82036
Iteration 9: log pseudolikelihood = 107.82039
Iteration 10: log pseudolikelihood = 107.82039
ARIMA regression
Sample: 1971 - 2017 Number of obs = 47
Wald chi2(4) = 111.29
Log pseudolikelihood = 107.8204 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Semirobust
D.lnenpub2yr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr |
lntupub2yr |
D1. | -.2147784 .085312 -2.52 0.012 -.3819868 -.04757
|
lnunemprate |
D1. | .1324922 .0213737 6.20 0.000 .0906004 .1743839
|
_cons | .0296133 .0197496 1.50 0.134 -.0090953 .0683219
-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | .4546644 .1601833 2.84 0.005 .1407109 .7686179
L2. | .3434536 .2337644 1.47 0.142 -.1147162 .8016235
-------------+----------------------------------------------------------------
/sigma | .0241711 .0022081 10.95 0.000 .0198433 .0284988
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.

From the results of the ARMAX model shown above, we see that the
AR1 disturbance term is statistically significant (beta = 0.455, p < 0.01) but
the AR2 disturbance term is not. Nevertheless, we examine the residuals from
the ARMAX model for any remaining autocorrelation and conduct a final test.
. predict residuals_ARMX12, resid
(1 missing value generated)
158 8 Advanced Statistical Techniques: I

Fig. 8.7 Autocorrelations of residuals from ARMX12

Both Figs. 8.7 and 8.8 suggest there is no autocorrelation of the residuals
from the ARMAX model with AR1 and AR2 disturbance terms. Using the
C-H general test, we conduct a final test to detect autocorrelation.
. actest residuals_ARMX12 , lag(4) q0 rob
Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 0.120 1 0.7288 | 1 | 0.120 1 0.7288
1 - 2 | 0.208 2 0.9014 | 2 | 0.089 1 0.7648
1 - 3 | 2.721 3 0.4367 | 3 | 2.567 1 0.1091
1 - 4 | 2.756 4 0.5995 | 4 | 0.746 1 0.3877
-----------------------------------------------------------------------------
Test robust to heteroskedasticity

We see from the C-H general test results above that the null hypothesis
of no autocorrelation cannot be rejected. So, the results of the C-H test,
combined with Figs. 8.7 and 8.8, allow us to conclude there
is no autocorrelation when using our time series data and the ARMAX
model above. Given the ARMAX model results, it can now be stated with
confidence that enrollment in public 2-year colleges is negatively related to

Fig. 8.8 Partial autocorrelations of residuals from ARMX12

published tuition and fees at public 2-year colleges and positively related to
unemployment rates.
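Because enrollment and tuition both enter in logs, the tuition coefficient is an elasticity. A quick arithmetic sketch (pure Python, illustrative only) of what the −0.215 tuition estimate from the first-differenced ARMAX output above implies for a 1% tuition increase:

```python
import math

beta = -0.2147784  # tuition coefficient from the first-differenced ARMAX output

# Linear (elasticity) approximation: a 1% tuition increase implies roughly a
# beta% change in enrollment.
approx_pct = beta * 1.0
# Exact implied percentage change for a 1% tuition increase in a log-log model.
exact_pct = (1.01 ** beta - 1) * 100

print(round(approx_pct, 3), round(exact_pct, 3))  # ≈ -0.215  -0.213
```

For small percentage changes the linear approximation and the exact calculation are nearly identical, which is why log-log coefficients are read directly as elasticities.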
Because the ARMAX model was fitted to first-differenced variables, the
results capture the average short-term (1-year) relationship rather than the
average long-term (e.g., 47-year) relationship. If we wanted to make a
statement based on the latter, we would fit an ARMAX model to the levels
of the data rather than their first differences. Fortunately, we can do this by
using the diffuse option in Stata.4 (The nolog option is also included to
suppress the iteration log.) We show the output below.
. arima lnenpub2yr lntupub2yr lnunemprate, ar(1 2 ) rob diffuse nolog
ARIMA regression
Sample: 1970 - 2016 Number of obs = 47
Wald chi2(4) = 30163.62
Log pseudolikelihood = 86.21852 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Semirobust
lnenpub2yr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr |
lntupub2yr | -.1989154 .0891667 -2.23 0.026 -.3736788 -.024152
lnunemprate | .1478185 .029172 5.07 0.000 .0906425 .2049945
_cons | 16.97047 .7746909 21.91 0.000 15.45211 18.48884

4 For more information on the diffuse option, see the Stata Reference Time-Series Manual,
Release 16; Ansley and Kohn (1985); and Harvey (1989).

-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | 1.305528 .0153133 85.25 0.000 1.275514 1.335542
L2. | -.3506986 .0030851 -113.67 0.000 -.3567453 -.3446518
-------------+----------------------------------------------------------------
/sigma | .022082 .0020042 11.02 0.000 .0181539 .0260101
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.

From the results above, we can see that all the independent variables are
statistically significant. We also see that both autoregressive disturbance
terms (AR1 and AR2) are statistically significant. As with the ARMAX
model using first differences, the C-H test is used to detect any remaining
autocorrelation.
. predict residuals_nsARMA12dn, resid
. actest residuals_nsARMA12dn, q0 rob lag(4)
Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 0.892 1 0.3450 | 1 | 0.892 1 0.3450
1 - 2 | 0.977 2 0.6136 | 2 | 0.067 1 0.7958
1 - 3 | 1.797 3 0.6156 | 3 | 0.503 1 0.4782
1 - 4 | 2.237 4 0.6923 | 4 | 0.256 1 0.6127
-----------------------------------------------------------------------------
Test robust to heteroskedasticity

We see the results of the test indicate there is no remaining autocorrelation
up through lag 4. (There is no reason to think there is autocorrelation at any
lag beyond lag 4.) While these results may be good news from a statistical
perspective, they may not be helpful to policymakers who are interested in
how changes in tuition and fees at public community colleges may influence
enrollment at those institutions. It is quite possible that a shift in the demand
for higher education at public 2-year institutions may influence a
change in tuition and fees at those colleges (Toutkoushian and Paulsen 2016).
So, in order to avoid “reverse causality,” we have to regress enrollment on at
least a 1-year lag of tuition. In Stata, we do this by including the lag operator
(L1.) in a recalibrated ARMAX model. To avoid losing an additional observation,
we use data through 2017.
. arima lnenpub2yr L1.lntupub2yr lnunemprate, ar(1 2 ) rob diff nolog
numerical derivatives are approximate
flat or discontinuous region encountered
numerical derivatives are approximate

flat or discontinuous region encountered


ARIMA regression
Sample: 1971 - 2017 Number of obs = 47
Wald chi2(4) = 4111.19
Log pseudolikelihood = 83.0424 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Semirobust
lnenpub2yr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr |
lntupub2yr |
L1. | .1337166 .082935 1.61 0.107 -.028833 .2962662
|
lnunemprate | .171104 .0317356 5.39 0.000 .1089033 .2333046
_cons | 14.27244 .6893422 20.70 0.000 12.92135 15.62353
-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | 1.225775 .190853 6.42 0.000 .8517105 1.59984
L2. | -.2997325 .1754243 -1.71 0.088 -.6435578 .0440928
-------------+----------------------------------------------------------------
/sigma | .02378 .0025161 9.45 0.000 .0188485 .0287114
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.

With respect to the influence of tuition on enrollment, this model shows
results that differ substantially from the previous model: tuition lagged
1 year (L1.lntupub2yr) is statistically insignificant. Finally, we can fit an
ARIMA model to the same data, using slightly different Stata syntax, where
arima(2 0 0) indicates the model should include first-order (AR1) and
second-order (AR2) autoregressive terms, no (0) differencing, and no (0)
moving-average (MA) terms.
. arima lnenpub2yr L1.lntupub2yr lnunemprate, arima(2 0 0) rob nolog diffuse
numerical derivatives are approximate
flat or discontinuous region encountered
numerical derivatives are approximate
flat or discontinuous region encountered
ARIMA regression
Sample: 1971 - 2017 Number of obs = 47
Wald chi2(4) = 4111.19
Log pseudolikelihood = 83.0424 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Semirobust
lnenpub2yr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr |
lntupub2yr |
L1. | .1337166 .082935 1.61 0.107 -.028833 .2962662
|
lnunemprate | .171104 .0317356 5.39 0.000 .1089033 .2333046
_cons | 14.27244 .6893422 20.70 0.000 12.92135 15.62353

-------------+----------------------------------------------------------------
ARMA |
ar |
L1. | 1.225775 .190853 6.42 0.000 .8517105 1.59984
L2. | -.2997325 .1754243 -1.71 0.088 -.6435578 .0440928
-------------+----------------------------------------------------------------
/sigma | .02378 .0025161 9.45 0.000 .0188485 .0287114
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
confidence interval is truncated at zero.

We see that the results are the same as in the previous output. One
final test after fitting the ARMAX model is to check the model's stability.
More specifically, the estimated dependent variable should not increase
without bound over time, and its variance should be independent of time.
To that end, the estimated parameters (ρ) in our second-order AR (AR2)
model must meet the following conditions:

ρ2 + ρ1 < 1

ρ2 − ρ1 < 1

−1 < ρ2 < 1

In other words, the inverse roots (eigenvalues) of the AR polynomial must
all lie inside the unit circle.5 To check the stability of the ARMAX model,
the post-estimation test estat aroots is conducted. (We include the option
dlabel to show each eigenvalue along with its distance from the unit circle.)
. estat aroots, dlabel
Eigenvalue stability condition
+----------------------------------------+
| Eigenvalue | Modulus |
|--------------------------+-------------|
| .8883852 | .888385 |
| .3373902 | .33739 |
+----------------------------------------+
All the eigenvalues lie inside the unit circle.
AR parameters satisfy stability condition.
As indicated in the test results and shown in Fig. 8.9, the ARMAX model
is stable.
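The eigenvalues reported by estat aroots can be reproduced by hand: for an AR(2) model, the inverse roots are the solutions of z² − ρ1·z − ρ2 = 0. A minimal cross-check in Python (illustrative only, using the estimated ρ1 and ρ2 from the ARMAX output above):

```python
import cmath

# Inverse roots of the AR(2) polynomial: solve z^2 - rho1*z - rho2 = 0.
# Stability (stationarity) requires both roots to lie inside the unit circle.
rho1, rho2 = 1.225775, -0.2997325  # ARMAX estimates from the output above

disc = cmath.sqrt(rho1 ** 2 + 4 * rho2)  # discriminant of the quadratic
z1 = (rho1 + disc) / 2
z2 = (rho1 - disc) / 2
moduli = sorted((abs(z1), abs(z2)), reverse=True)

print(moduli)  # ≈ [0.888385, 0.337390], matching estat aroots
```

Both moduli are below 1, reproducing the eigenvalues (0.8883852 and 0.3373902) and the stability conclusion reported by estat aroots.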

5 For more information on inverse roots, see the Stata Reference Time-Series Manual,
Release 16 and Hamilton (1994).



Fig. 8.9 Inverse roots of the AR polynomial

8.5 Summary of Time Series Data, Autocorrelation, and Regression

As longer time series data become available to higher education analysts,
greater care must be taken to use the available visual and statistical tools prior
to providing regression-based information to policymakers. The visual tools
include simple line graphs as well as correlograms, partial autocorrelation
functions, and unit circles. The statistical tools include tests to detect serial
correlation, nonstationary data processes, and unstable time series regression
models.

8.6 Examples of Autocorrelation Tests—Panel Data

Autocorrelation may also be present in the errors of fixed-effects and random-
effects regression models. Consequently, we should conduct autocorrelation
tests when we are using those models. With respect to panel-data models,
the most commonly used autocorrelation test is the Wooldridge (2002)
test. Under the null of no first-order autocorrelation (AR1), the errors
from a regression of the first-differenced variables should have a first-order
autocorrelation of −0.5. In Stata, this test is conducted via the command
xtserial, which is demonstrated below.
. use "Balanced panel data - state.dta", clear

We include the option output to show the results of the regression of the
first-differenced variables.
. xtserial lnnetuit lnstapr lnfte lnpc_income, output
Linear regression Number of obs = 1,300
F(3, 49) = 266.10
Prob > F = 0.0000
R-squared = 0.3332
Root MSE = .09355
(Std. Err. adjusted for 50 clusters in stateid)
------------------------------------------------------------------------------
| Robust
D.lnnetuit | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnstapr |
D1. | -.2611843 .0510568 -5.12 0.000 -.3637868 -.1585818
|
lnfte |
D1. | .6437485 .1010357 6.37 0.000 .4407098 .8467873
|
lnpc_income |
D1. | 1.377408 .060067 22.93 0.000 1.256699 1.498117

------------------------------------------------------------------------------
Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
F( 1, 49) = 83.583
Prob > F = 0.0000

We see from the results of the test that there is first-order autocorrelation in
the errors of our model. Assuming all of the other assumptions of OLS hold,
we would use a fixed- or random-effects model with an AR(1) disturbance
term to generate our regression results.
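The −0.5 benchmark behind Wooldridge's test can be verified numerically: if the level errors are serially uncorrelated, their first differences have a lag-1 autocorrelation of exactly −0.5 in expectation. A quick sketch in pure Python (illustrative only, not the xtserial implementation):

```python
import random
import statistics

# If e_t is i.i.d., then d_t = e_t - e_{t-1} satisfies
# corr(d_t, d_{t-1}) = -Var(e) / (2*Var(e)) = -0.5,
# the benchmark used by Wooldridge's panel autocorrelation test.
random.seed(7)
e = [random.gauss(0.0, 1.0) for _ in range(100_000)]
d = [e[t] - e[t - 1] for t in range(1, len(e))]

m = statistics.fmean(d)
num = sum((d[t] - m) * (d[t - 1] - m) for t in range(1, len(d)))
den = sum((v - m) ** 2 for v in d)
r1 = num / den
print(round(r1, 2))  # ≈ -0.5
```

A sample estimate far from −0.5 is therefore evidence of first-order autocorrelation in the level errors, which is exactly what the F-statistic above formalizes.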

8.7 Panel-Data Regression Models with AR Terms

After we discover there is first-order autocorrelation in the errors of our
regression model, we need to include an AR disturbance term. In Stata,
we do this by using the command xtregar for either a fixed-effects model
(xtregar with the option fe) or a random-effects model (xtregar with the
option re).
. xtregar lnnetuit lnstapr lnfte lnpc_income, fe
FE (within) regression with AR(1) disturbances Number of obs = 1,300
Group variable: stateid Number of groups = 50
R-sq: Obs per group:
within = 0.3472 min = 26
between = 0.8435 avg = 26.0
overall = 0.8380 max = 26
F(3,1247) = 221.11

corr(u_i, Xb) = 0.5483 Prob > F = 0.0000


------------------------------------------------------------------------------
lnnetuit | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnstapr | -.2734652 .038685 -7.07 0.000 -.34936 -.1975704
lnfte | .7158662 .0699253 10.24 0.000 .578682 .8530503
lnpc_income | 1.390187 .0669224 20.77 0.000 1.258894 1.52148
_cons | 2.735897 .1438201 19.02 0.000 2.453741 3.018053
-------------+----------------------------------------------------------------
rho_ar | .85955152
sigma_u | .55845958
sigma_e | .09146923
rho_fov | .97387421 (fraction of variance because of u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(49,1247) = 8.76 Prob > F = 0.0000

We see from the output that the AR1 coefficient (rho_ar) is 0.86. It should
be noted that the xtregar command is rather limited in several ways. First,
it does not allow for the use of higher-order autoregressive (AR) disturbance
terms. Second, there is no option to estimate robust standard errors. Third,
it cannot take into account possible cross-sectional dependence in the data,
which we will discuss later in the chapter.
The use of xtregar, as shown above, is appropriate if we have stationary
time series data in our panel. However, if we are uncertain whether our data
are stationary, we should conduct a series of tests prior to using xtregar.
Fortunately, there are several first-generation panel unit root tests (PURTs)
we can choose from in Stata.6 However, we will use the Stata user-written
routine xtpurt, which implements the most recently developed second-generation
PURTs that take into account autocorrelation (Herwartz et al. 2018). Herwartz
and Siedenburg (2008) contend that second-generation PURTs allow for cross-
sectional error correlation. (The xtpurt routine, however, requires a balanced
panel dataset.) We include the default option hs, reflecting the Herwartz and
Siedenburg test.7
. xtpurt lnnetuit
Herwartz and Siedenburg (2008) unit-root test for lnnetuit
-----------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 22
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 2.8239 0.9976

6 For more information on unit root tests for panel data, see the Stata Longitudinal-Data/
Panel-Data Reference Manual, Release 16.
7 For information on the other tests, please see Herwartz et al. (2018).

------------------------------------------------------------------------------
. xtpurt lnstapr
Herwartz and Siedenburg (2008) unit-root test for lnstapr
----------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 23
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 1.3166 0.9060
------------------------------------------------------------------------------
. xtpurt lnfte
Herwartz and Siedenburg (2008) unit-root test for lnfte
--------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 23
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.6345 0.7371
------------------------------------------------------------------------------

. xtpurt lnpc_income
Herwartz and Siedenburg (2008) unit-root test for lnpc_income
--------------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 27
After rebalancing = 22
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=1 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.2176 0.5861
------------------------------------------------------------------------------

The results of the unit root tests show that the null hypothesis of the panels
containing unit roots was not rejected, indicating nonstationary time series
data in the panel. This suggests that we have to include first-differenced
variables in our final fixed- or random-effects regression model with an AR1
disturbance term. We run this model using quietly (or qui for short) to omit
the output of the regression results.
. qui xtregar D1.lnnetuit D1.lnstapr D1.lnfte D1.lnpc_income, re

However, we conduct a test to see if there is any remaining autocorrelation
in the residuals. We do this by using the Cumby-Huizinga (C-H) general test
for autocorrelation, which we discussed earlier in the chapter.
First, we generate residuals from the model.
. predict ar_residuals_re, ue
(50 missing values generated)
Then we conduct the C-H autocorrelation general test of the residuals.
. actest ar_residuals_re, lags(10) q0 robust
Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
H0: q=0 (serially uncorrelated) | H0: q=0 (serially uncorrelated)
HA: s.c. present at range specified | HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
lags | chi2 df p-val | lag | chi2 df p-val
-----------+-----------------------------+-----+-----------------------------
1 - 1 | 2.768 1 0.0962 | 1 | 2.768 1 0.0962
1 - 2 | 2.901 2 0.2345 | 2 | 0.962 1 0.3268
1 - 3 | 3.089 3 0.3781 | 3 | 0.064 1 0.7998
1 - 4 | 5.877 4 0.2086 | 4 | 3.207 1 0.0733
1 - 5 | 6.552 5 0.2562 | 5 | 0.000 1 0.9990
1 - 6 | 7.300 6 0.2940 | 6 | 1.302 1 0.2539
1 - 7 | 9.940 7 0.1920 | 7 | 2.166 1 0.1411
1 - 8 | 11.225 8 0.1892 | 8 | 0.615 1 0.4331
1 - 9 | 13.556 9 0.1390 | 9 | 0.827 1 0.3632
1 - 10 | 13.583 10 0.1929 | 10 | 0.431 1 0.5115
-----------------------------------------------------------------------------

Test robust to heteroskedasticity

We see from the results of the test that autocorrelation of the model's
residuals is not present. Unfortunately, xtregar is limited in that its
estimated standard errors are not robust to heteroskedasticity and cross-
sectional dependence, which we will discuss in the next section.

8.8 Cross-Sectional Dependence

As discussed in Chap. 4, higher education policy analysis and evaluation
also involve the use of cross-sectional data and time series/cross-sectional
or panel data. When using those data, we may encounter situations where
there is correlation between cases or units (e.g., institutions, states, nations),
known as cross-sectional dependence. This violates an implicit assumption
of OLS regression: that the data are based on randomly and independently
drawn samples from the population. Cross-sectional dependence may arise if observations

are not independently drawn, leading to those observations having an effect
on each other's outcomes. Common unobserved shocks due to state higher
education policies may result in cross-sectional dependence among
institutions. Some units of analysis, such as institutions or states, may be
highly interconnected across space, leading to cross-sectional dependence.
The latter type of interconnectedness is based on spatial dependence or
spatial autocorrelation.

8.8.1 Cross-Sectional Dependence—Unobserved Common Factors

When using regression models with cross-sectional dependence in our data,
the effect of unobserved common factors may be transmitted through the
residual or error. This may result in biased estimates of the standard errors
and, if the common factors are correlated with the included regressors, biased
estimates of the beta coefficients. Therefore, when using cross-sectional
or panel datasets consisting of units of analysis such as institutions or states,
we should test, and may have to correct, for possible nonspatial cross-sectional
dependence or spatial autocorrelation before providing regression-based
information to policymakers.

8.8.2 Tests to Detect Cross-Sectional Dependence—Unobserved Common Factors

There are several residual-correlation-based tests that detect
cross-sectional dependence in panel data. These include the Pesaran
(2004), Friedman (1937), and Frees (1995) tests, made available in Stata
by De Hoyos and Sarafidis (2006). Each of these tests is based on the
correlation coefficients of residuals from OLS regression models of time series
data within each individual unit (e.g., institution, state) in a panel. After
running a fixed-effects (xtreg, fe) or random-effects (xtreg, re) regression
model in Stata, we can conduct the Pesaran, Friedman, and Frees tests by
using the post-estimation commands xtcsd, pesaran; xtcsd, friedman;
and xtcsd, frees, respectively. Consequently, we demonstrate the use of all
three tests below.
First, we have to install the Stata user-written routine, xtcsd (De Hoyos
and Sarafidis 2006).
. ssc install xtcsd
checking xtcsd consistency and verifying not already installed
all files already exist and are up to date.
We change our working directory and open our dataset.

. cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 8\Stata files"
. use "Unbalanced panel data - institutional.dta"
We use the xtdescribe (or the shortened version, xtdes) command to get
a sense of the distribution of observations per unit (i.e., institution) in the
panel dataset.
. xtdes
opeid5_new: 1004, 1005, ..., 31703 n = 203
endyear: 2004, 2005, ..., 2013 T = 10
Delta(endyear) = 1 year
Span(endyear) = 10 periods
(opeid5_new*endyear uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
8 8 9 9 10 10 10
Freq. Percent Cum. | Pattern
---------------------------+------------
95 46.80 46.80 | 1111111111
43 21.18 67.98 | 1.11111111
33 16.26 84.24 | 1.1.111111
7 3.45 87.68 | 111.111111
7 3.45 91.13 | 1111111.11
4 1.97 93.10 | 1.111.1111
4 1.97 95.07 | 111.1.1111
3 1.48 96.55 | 11111.1111
2 0.99 97.54 | 1.11111.11
5 2.46 100.00 | (other patterns)
---------------------------+------------
203 100.00 | XXXXXXXXXX

From the output above, we see clearly that this is a slightly unbalanced
panel dataset, with observations per institution ranging from 8 to 10 years.
Next, we "quietly" run our fixed-effects regression model using the within
regression estimator (xtreg with the fe option).
. qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac
lnptfac , fe
We run the Pesaran test and the Friedman test.
. xtcsd, pesaran
Pesaran’s test of cross sectional independence = 82.069, Pr = 0.0000
. xtcsd, friedman
Friedman’s test of cross sectional independence = 293.510, Pr = 0.0000

The results of both tests show that the null of cross-sectional independence
is rejected (p < 0.001), which indicates cross-sectional dependence.
The Frees test is also conducted.
. xtcsd, frees
Frees' test of cross sectional independence = 44.948
|--------------------------------------------------------|
Critical values from Frees’ Q distribution
alpha = 0.10 : 0.4892

alpha = 0.05 : 0.6860
alpha = 0.01 : 1.1046

The Frees test statistic of 44.948 exceeds the critical values at all three α
levels, indicating a rejection of the null of cross-sectional independence. So,
based on all three tests, we can say with some degree of certainty that there is
cross-sectional dependence.8
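Under the hood, xtcsd, pesaran computes the Pesaran (2004) CD statistic from the pairwise correlations of the within-unit residuals. A minimal sketch of that statistic on simulated balanced-panel data (pure Python, illustrative only; N = 50 and T = 27 are chosen to mirror the state panel used earlier in this chapter):

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def pesaran_cd(panel):
    """CD = sqrt(2T/(N(N-1))) * sum_{i<j} rho_ij; ~ N(0,1) under independence."""
    N, T = len(panel), len(panel[0])
    s = sum(corr(panel[i], panel[j]) for i in range(N) for j in range(i + 1, N))
    return math.sqrt(2 * T / (N * (N - 1))) * s

random.seed(1)
N, T = 50, 27
# Independent units: CD should be small (approximately standard normal).
indep = [[random.gauss(0, 1) for _ in range(T)] for _ in range(N)]
# Units sharing a common unobserved shock: CD should be very large.
shock = [random.gauss(0, 1) for _ in range(T)]
dep = [[shock[t] + random.gauss(0, 1) for t in range(T)] for _ in range(N)]

print(round(pesaran_cd(indep), 2), round(pesaran_cd(dep), 2))
```

The contrast between the two panels illustrates why a large CD statistic, such as the 82.069 reported by xtcsd, pesaran above, leads us to reject cross-sectional independence.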
Using the common factor approach, Eberhardt (2011) extended the Stata
routine by De Hoyos and Sarafidis and developed a cross-sectional dependence
test (xtcd) that can be applied to variables in the pre-estimation rather than
the post-estimation stage. Below, we show how this test can be conducted
using a few variables from the same panel dataset. First, we download the
most recent version of xtcd (Eberhardt 2011).
. ssc install xtcd, replace
checking xtcd consistency and verifying not already installed...
all files already exist and are up to date.

Then we run the test on variables of interest from the same panel dataset.
. xtcd lneg lntuition lnftfac lnptfac
Average correlation coefficients & Pesaran (2004) CD test
Variables series tested: lneg lntuition lnftfac lnptfac
Group variable: opeid5_new
Number of groups: 203
Average # of observations: 8.74
Panel is: unbalanced
---------------------------------------------------------
Variable | CD-test p-value corr abs(corr)
-------------+-------------------------------------------
lneg | 362.27 0.000 0.859 0.871
-------------+-------------------------------------------
lntuition | 351.21 0.000 0.833 0.866
-------------+-------------------------------------------
lnftfac | 90.10 0.000 0.212 0.531
-------------+-------------------------------------------
lnptfac | 83.18 0.000 0.194 0.453
---------------------------------------------------------
Notes: Under the null hypothesis of cross-section
independence CD ~ N(0,1)

As we can see from the results above, the null hypotheses of cross-sectional
independence are rejected for all the variables. If we cannot or choose not
to include all of the variables at one time, we can test the residuals from
a regression model. Using the variables that we included in a fixed-effects

8 For more information on the use of these tests, see De Hoyos and Sarafidis (2006).

model above, we employ a random-effects regression model and apply the
test to the residuals.
. qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac
lnptfac, re
. predict ue_residuals_re, ue
. xtcd ue_residuals_re
Average correlation coefficients & Pesaran (2004) CD test
Variables series tested: ue_residuals_re
Group variable: opeid5_new
Number of groups: 203
Average # of observations: 8.74
Panel is: unbalanced
---------------------------------------------------------
Variable | CD-test p-value corr abs(corr)
-------------+-------------------------------------------
ue_residuae | 144.14 0.000 0.338 0.544
---------------------------------------------------------
Notes: Under the null hypothesis of cross-section
independence CD ~ N(0,1)
We can see from the above results of the test that there is cross-sectional
dependence.
Thus far, the tests we have discussed are based on "strong" correlation
of the residuals between units in a panel. In other words, the correlation
converges to a constant as the number of units approaches infinity (Pesaran
2004). If the correlation approaches zero as the number of units approaches
infinity, then we have what is called a "weak" correlation (Pesaran 2015).
The Stata routine xtcd2 allows us to test for weak cross-sectional
dependence. After installing the xtcd2 routine, this is shown below.
. ssc install xtcd2, replace

checking xtcd2 consistency and verifying not already installed...


the following files will be replaced:
c:\ado\plus\x\xtcd2.ado
c:\ado\plus\x\xtcd2.sthlpinstalling into c:\ado\plus\...
installation complete.
. quietly: xtreg lneg lnstatea lntuition
lntotfteiarep lnftfac lnptfac, fe
. xtcd2
Pesaran (2015) test for weak cross-sectional dependence.

Residuals calculated using predict, e from xtreg.



Unbalanced panel detected, test adjusted.


H0: errors are weakly cross-sectional dependent.

CD = 77.124

p-value = 0.000

The test results indicate there is at least weak cross-sectional dependence.


Finally, another Stata user-written program xtcdf (Wursten 2017) allows
for a much faster estimation of the Pesaran cross-sectional dependence test
and provides additional statistics. The xtcdf routine also enables us to
conduct a test on several variables as well as the residuals from a regression
model. As is customary, we first install the most recent version of the Wursten-written Stata routine.
. ssc install xtcdf, replace
checking xtcdf consistency and verifying not already installed...
all files already exist and are up to date.
Then we “quietly” run our fixed-effect regression model and generate the
residuals.
. qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
. predict ue_residuals_fe, ue
We conduct the test, which includes the variables as well as the residuals
from the fixed-effects regression.
. xtcdf lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac
ue_residuals_fe
The output from the test is shown below.
xtcd test on variables lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac
ue_residuals_fe
Panelvar: opeid5_new
Timevar: endyear
------------------------------------------------------------------------------
Variable        | CD-test  p-value  average joint T | mean ρ  mean abs(ρ)
----------------+--------------------------------------+----------------------
lneg            | 362.268    0.000       8.69          |  0.86      0.87
lnstatea        | 147.015    0.000       8.69          |  0.35      0.49
lntuition       | 351.212    0.000       8.69          |  0.83      0.87
lntotfteiarep   | 154.81     0.000       8.69          |  0.36      0.56
lnftfac         |  90.103    0.000       8.69          |  0.21      0.53
lnptfac         |  83.181    0.000       8.69          |  0.19      0.45
ue_residualfe   |  82.069    0.000       8.69          |  0.19      0.48
------------------------------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.

From the results of the test we can see there is at least weak cross-sectional
dependence across all the variables and residuals from the fixed-effects
regression model. The output also shows the mean and mean absolute
correlation (ρ) between institutions.

8.9 Panel Regression Models That Take Cross-Sectional Dependency into Account

In the previous section, we discussed and demonstrated how to detect one type (unobserved common factors) of cross-sectional dependence. After
discovering this type of cross-sectional dependence, what is the most appro-
priate regression model to use? It depends. If the number of periods (T ) is
greater than or equal to the number of panels (m), then a regression model
that is estimated via feasible generalized least squares (FGLS) is the most
appropriate. In higher education policy research, however, T ≥ m is rarely
the case. Additionally, Stata’s FGLS regression command xtgls can only
be used with balanced panel datasets when taking into account correlated
panels or cross-sectional dependence. Consequently, regression models with
Driscoll and Kraay (1998) standard errors are the most appropriate. The
most recent routine for estimating regression models with Driscoll and Kraay
(D-K) standard errors was made available for use in Stata by Hoechle (2018)
and can be downloaded by typing ssc install xtscc, replace.
A variety of regression models with D-K standard errors can be estimated
with pooled OLS, weighted least squares (WLS), fixed-effects (within),
or generalized least squares (GLS) random-effects. In addition to being
robust to cross-sectional dependence, D-K standard errors are also robust
to heteroscedasticity, and autocorrelation with higher-order lags. Using our
institution-level panel data, we demonstrate their use below with a fixed-
effects regression model.
. use "Unbalanced panel data - institutional.dta", clear
. xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe lag(2)

Regression with Driscoll-Kraay standard errors Number of obs = 1875

Method: Fixed-effects regression Number of groups = 203


Group variable (i): opeid5_new F( 5, 9) = 624.23
maximum lag: 2 Prob > F = 0.0000
within R-squared = 0.6572
-------------------------------------------------------------------------------
| Drisc/Kraay
lneg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
lnstatea | .0142957 .0101544 1.41 0.193 -.0086751 .0372664
lntuition | .570156 .026773 21.30 0.000 .5095913 .6307208
lntotfteiarep | .1831277 .0617837 2.96 0.016 .0433633 .3228921
lnftfac | .6344542 .1151675 5.51 0.000 .3739271 .8949813
lnptfac | .0306096 .0038811 7.89 0.000 .0218299 .0393893
_cons | 2.454215 .4902524 5.01 0.001 1.345188 3.563243
-------------------------------------------------------------------------------
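Conceptually, D-K standard errors collapse each period's score contributions (the regressors times the residual) into one cross-sectional sum and then apply Newey-West (Bartlett) weighting over time, which is what makes them robust to cross-sectional dependence, heteroscedasticity, and autocorrelation. Below is a simplified Python/numpy sketch for the pooled OLS case (our own illustration, not Stata's xtscc code; a fixed-effects version would first demean within units, and all names and the simulated data are ours):

```python
import numpy as np

def driscoll_kraay_se(X, y, t_index, lags=2):
    """Driscoll-Kraay standard errors for pooled OLS (illustrative sketch).

    X: (n, k) regressors including a constant column; y: (n,) outcome;
    t_index: (n,) integer period for each observation. The scores x_it*e_it
    are summed within each period, and the resulting T x k series receives
    Newey-West (Bartlett) weighting over the specified number of lags.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    # Cross-sectional sums of the moment conditions, one row per period
    h = np.vstack([(X[t_index == t] * e[t_index == t, None]).sum(axis=0)
                   for t in np.unique(t_index)])
    S = h.T @ h                              # lag-0 term
    for j in range(1, lags + 1):             # Bartlett-weighted lag terms
        w = 1.0 - j / (lags + 1.0)
        G = h[j:].T @ h[:-j]
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    V = XtX_inv @ S @ XtX_inv
    return beta, np.sqrt(np.diag(V))

# Simulated panel with a common shock per period (cross-sectional dependence)
rng = np.random.default_rng(1)
T, N = 20, 50
t_index = np.repeat(np.arange(T), N)
x = rng.standard_normal(T * N)
shock = np.repeat(rng.standard_normal(T), N)
y = 1.0 + 0.5 * x + shock + 0.3 * rng.standard_normal(T * N)
X = np.column_stack([np.ones(T * N), x])
beta, se = driscoll_kraay_se(X, y, t_index, lags=2)
```

In the simulation, the common per-period shock induces exactly the kind of dependence that D-K standard errors are designed to absorb.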

Next, we demonstrate the use of a random-effects model with D-K standard errors.

. xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re lag(2)

Regression with Driscoll-Kraay standard errors Number of obs = 1875

Method: Random-effects GLS regression Number of groups = 203


Group variable (i): opeid5_new Wald chi2(5) = 33301.28
maximum lag: 2 Prob > chi2 = 0.0000
corr(u_i, Xb) = 0 (assumed) overall R-squared = 0.8692
-------------------------------------------------------------------------------
| Drisc/Kraay
lneg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
lnstatea | .0186992 .008593 2.18 0.058 -.0007394 .0381378
lntuition | .5332011 .0532264 10.02 0.000 .4127946 .6536076
lntotfteiarep | .056763 .0708909 0.80 0.444 -.1036034 .2171294
lnftfac | .3563393 .0977645 3.64 0.005 .1351806 .577498
lnptfac | .0448512 .0114099 3.93 0.003 .0190403 .0706622
_cons | 5.659183 .5206644 10.87 0.000 4.481358 6.837008
--------------+----------------------------------------------------------------
sigma_u | .1600267
sigma_e | .12709341
rho | .61321262 (fraction of variance due to u_i)
-------------------------------------------------------------------------------

When we include year fixed-effects in a fixed-effects or random-effects regression model with D-K standard errors, there appears to be no cross-sectional dependence in the residuals.
. qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac
lnptfac i.endyear, fe lag(2)
. predict xtscc_residuals_fe2y, resid
. xtcdf xtscc_residuals_fe2y
xtcd test on variables xtscc_residuals_fe2y
Panelvar: opeid5_new
Timevar: endyear
------------------------------------------------------------------------------+
Variable | CD-test p-value average joint T | mean ρ mean abs(ρ) |
----------------+--------------------------------------+----------------------|
xtscc_resid2y   |    .266    0.790       8.69          |  0.00      0.41
------------------------------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.
. qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, re lag(2)
. predict xtscc_residuals_re2y, resid
. xtcdf xtscc_residuals_re2y
xtcd test on variables xtscc_residuals_re2y
Panelvar: opeid5_new
Timevar: endyear
------------------------------------------------------------------------------+
Variable | CD-test p-value average joint T | mean ρ mean abs(ρ) |
----------------+--------------------------------------+----------------------|
xtscc_resre2y   |    .417    0.677       8.69          |  0.00      0.41
------------------------------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.

As a final comparison of the estimated coefficients of interest to policy analysts or researchers, we run and store the results of three regression

models: (1) a fixed-effects model without year fixed-effects; (2) a fixed-effects model with year fixed-effects; and (3) a fixed-effects model with year fixed-effects and D-K standard errors.
. eststo: qui xtreg lneg lnstatea lntuition lntotfteiarep
lnftfac lnptfac, fe
(est1 stored)
. eststo: qui xtreg lneg lnstatea lntuition
lntotfteiarep lnftfac lnptfac i.endyear, fe
(est2 stored)
. eststo: qui xtscc lneg lnstatea lntuition
lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)
(est3 stored)
Then we use Stata’s esttab command (with the label, p[(fmt)], and
keep options as well as the estout varwidth option) to create a table of
the stored regression results to compare the estimated beta coefficients of
variables of interest across the three models.
. esttab, label keep(lnstatea lntuition lntotfteiarep lnftfac lnptfac)
varwidth(30) beta(%8.3f)
(tabulating estimates stored by eststo; specify "." to tabulate the active results)
------------------------------------------------------------------------------
(1) (2) (3)

log(eg) log(eg) log(eg)

------------------------------------------------------------------------------
Log of state appropriations 0.040** 0.019* 0.019

(3.07) (2.05) (0.77)


Log of tuition revenue 0.714*** 0.070** 0.070*

(34.83) (3.04) (2.91)


Log of FTE students 0.178*** 0.073** 0.073

(4.67) (2.60) (1.96)


Log of full-time faculty 0.554*** 0.421*** 0.421***

(13.19) (13.54) (7.48)


Log of part-time faculty 0.051*** -0.010 -0.010

(3.76) (-1.00) (-1.91)


------------------------------------------------------------------------------
Observations 1875 1875 1875
------------------------------------------------------------------------------
Standardized beta coefficients; t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

We see that, compared to models 1 and 2, model 3 does not produce statistically significant beta coefficient estimates for the log of state appropriations and the log of FTE students. Because models 2 and 3 share the same point estimates and differ only in their standard errors, this suggests that when we do not take cross-sectional dependence into account, the regression models we use to fit panel data may overstate the statistical significance of some beta coefficients.

8.10 Summary

This chapter discussed how, when using regression techniques, we should check the data we use for higher education policy analysis for possible
violations of statistical assumptions. We showed how to use graphs and
formal statistical tests to detect some of those possible violations. We
also introduced advanced statistical techniques to address those violations.
These advanced statistical techniques include regression with autoregressive disturbance terms and ARMAX regression models that are used to fit time series data. This chapter also discussed the violation of statistical assumptions
and tests to detect these violations within a panel data framework. The
chapter ended with a discussion and demonstration of a regression technique
that is robust even to several of the most common violations of statistical
assumptions.

8.11 Appendix

*Chapter 8 Stata syntax

*Change working directory and open dataset


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 8\Stata files"
use "Time series - Enrollment & Tuition & fees at 2 yr public HEIs.dta", clear

*set the data to a time series


tsset year

*create Fig. 8.2.1.1. Enrollment, Tuition, and Unemployment, Changes Over ///
Time (1970 to 2017)
twoway (line lnenpub2yr year, lcolor(black) lpattern(solid)) (line lntupub2yr year, ///
lcolor(black) lpattern(dash)) (line lnunemprate year, lcolor(black) lpattern(dot)), ///
xlabel(1970 (6) 2017, labsize(small)) ytitle(Logs) ///
title("Trends in Enrollment in 2 YR, Tuition at 2 YR, and Unemployment Rates" ///
"1970 to 2017", size(medium))

*conduct the DF-GLS tests


dfgls lnenpub2yr
dfgls lntupub2yr
dfgls lnunemprate
*create Fig. 8.2.1.2. Enrollment, Tuition, and Unemployment, ///
First-Differenced (1971 to 2017), take note of options
twoway (line D1.lnenpub2yr year, lcolor(black) lpattern(solid)) (line D1.lntupub2yr year, ///
lcolor(black) lpattern(dash)) (line D1.lnunemprate year, lcolor(black) lpattern(dot)), ///
xlabel(1971 (5) 2017, labsize(small)) ytitle(Change in Logs) ///
title("First-Differenced Enrollment in 2 YR, Tuition at 2 YR, and Unemployment" ///
"1971 to 2017", size(small))

*regress the first-differenced log of enrollment on the first-differenced log of tuition ///
and unemployment
reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate

*create an autocorrelation function or correlogram of the residuals from the regression model
racplot
*generate the residuals from the model
predict residuals, resid

*create a graph of partial autocorrelations


pac residuals, yw

*DW test
estat dwatson

*alternative DW test
quietly: reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob
estat durbinalt, force

*time series regression model with an AR term calibrated via the Prais-Winsten (P-W) estimator
prais D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob

*generate residuals from the P-W regression


predict residuals_PW, resid

*use the Cumby-Huizinga (C-H) general test of the residuals


ssc install actest
actest residuals_PW, lag(4) q0 rob

*estimate an ARMAX model with first-order (AR1) and second-order (AR2) ///
autoregressive terms
arima D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, ar(1 2 ) vce(robust)

*examine the residuals from the ARMAX model to see if there is any autocorrelation and ///
conduct a final test
predict residuals_ARMX12, resid
actest residuals_ARMX12 , lag(4) q0 rob

*fit an ARMAX model to the data levels rather than their first-differences using the ///
diffuse option, with no iteration log (nolog)
arima lnenpub2yr lntupub2yr lnunemprate, ar(1 2 ) rob diffuse nolog

*C-H test is used to detect any remaining autocorrelation


predict residuals_nsARMA12dn, resid
actest residuals_nsARMA12dn, q0 rob lag(4)

*To avoid "reverse causality", regress enrollment on at least a one-year lag of tuition. ///
Include the lag operator (L1) in a re-calibrated ARMAX model and use data through 2017.
arima lnenpub2yr L1.lntupub2yr lnunemprate, ar(1 2 ) rob diff nolog
*Fit an ARIMA model to the same data, using slightly different Stata syntax where ///
the arima (2 0 0) indicates the model should include a first-order (AR1) and ///
second-order (AR2) autoregressive term,

*no (0) differencing and no (0) moving average (MA) term.


arima lnenpub2yr L1.lntupub2yr lnunemprate, arima(2 0 0) rob nolog diffuse

*check the stability of the ARMAX model


estat aroots, dlabel

*Examples of Autocorrelation Tests - Panel Data


*open a panel dataset
use "Balanced panel data - state.dta", clear

*test for autocorrelation in the panel data


xtserial lnnetuit lnstapr lnfte lnpc_income, output

*Panel-Data Regression Models with AR terms


*fixed effects model with AR term
xtregar lnnetuit lnstapr lnfte lnpc_income, fe

*panel unit root tests (PURTs); install xtpurt (to install in Stata, ///
type "search xtpurt, all", click on "st0519" and install) or type:
net install st0519, replace
xtpurt lnnetuit
xtpurt lnstapr
xtpurt lnfte
xtpurt lnpc_income

*first-differenced variables in our final regression fixed- or random-effects ///
model with an AR1 disturbance term
qui xtregar D1.lnnetuit D1.lnstapr D1.lnfte D1.lnpc_income, re

*generate residuals from the model


predict ar_residuals_re, ue

*conduct the C-H autocorrelation general test of the residuals


actest ar_residuals_re, lags(10) q0 robust

*Tests to Detect Cross-Sectional Dependence - Unobserved Common Factors

*install the Stata user-written routine, xtcsd


ssc install xtcsd

*use unbalanced panel dataset


use "Unbalanced panel data - institutional.dta", clear

*get a sense of the distribution of observations per unit (i.e., institution) in the ///
panel dataset
xtdes

*run our fixed-effects regression model using the within regression ///
estimator (xtreg, with the fe option)
qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac , fe

*run the Pesaran test and the Friedman test.


xtcsd, pesaran
xtcsd, friedman

*The Frees test is also conducted.


xtcsd, frees

*We download the most recent version of xtcd (Eberhardt 2011).


ssc install xtcd, replace

*Then we run the test on variables of interest from the same panel dataset.
xtcd lneg lntuition lnftfac lnptfac

*Using the variables that we included in a fixed-effects model above, we employ ///
a random-effects regression model and apply the test to the residuals.


qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re
predict ue_residuals_re, ue

*install the xtcd2 routine


ssc install xtcd2, replace

*check for weak cross-sectional dependence


qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
xtcd2

*use xtcdf (Wursten 2017) to allow for a much faster estimation of the ///
Pesaran cross-sectional dependence test and provide additional statistics


ssc install xtcdf, replace
qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
predict ue_residuals_fe, ue
xtcdf lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac ue_residuals_fe

*Panel Regression Models That Take Cross-Sectional Dependency into Account

*install routine by Hoechle (2018) for regression models with ///
Driscoll and Kraay (D-K) standard errors for use in Stata
ssc install xtscc, replace

*run fixed-effects regression model with D-K standard errors and 2 lags of the AR term
xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe lag(2)

*check for cross-sectional dependence in the residuals of the regression, including ///
year fixed-effects
qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)
predict xtscc_residuals_fe2y, resid
xtcdf xtscc_residuals_fe2y

*Compare the estimated coefficients of interest to policy analysts or ///
researchers by running and storing the results of three ///
regression models: (1) a fixed-effects model without year fixed-effects; ///
(2) a fixed-effects model with year fixed-effects; and (3) a fixed-effects ///
model with year fixed-effects and D-K standard errors.


eststo: qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
eststo: qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe
eststo: qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)

*Use esttab command (with the label, p[(fmt)], and keep options as well as the ///
estout varwidth option) to create a table of the stored regression results to ///
compare the estimated beta coefficients of variables of interest ///
across the three models
esttab, label keep(lnstatea lntuition lntotfteiarep lnftfac lnptfac) varwidth(30) beta(%8.3f)

*end

References

Ansley, C. F., & Kohn, R. (1985). Estimation, Filtering, and Smoothing in State Space
Models with Incompletely Specified Initial Conditions. Ann. Statist, 13 (4), 1286–1316.
https://doi.org/10.1214/aos/1176349739
Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo
Evidence and an Application to Employment Equations. The Review of Economic
Studies, 58 (2), 277–297. https://doi.org/10.2307/2297968
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control.
Holden-Day. http://www.gbv.de/dms/hbz/toc/ht000495926.pdf
Breusch, T. S. (1978). Testing for autocorrelation in dynamic linear models. Australian
Economic Papers, 17 (31), 334–355.
Cumby, R. E., & Huizinga, J. (1992). Testing the Autocorrelation Structure of Disturbances
in Ordinary Least Squares and Instrumental Variables Regressions. Econometrica,
60 (1), 185–195.
Davidson, R., & MacKinnon, J. G. (1993). Estimation and Inference in Econometrics (1st ed.). Oxford University Press.
De Hoyos, R. E., & Sarafidis, V. (2006). Testing for cross-sectional dependence in panel-
data models. The Stata Journal, 6 (4), 482–496.
Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive
time series with a unit root. Journal of the American Statistical Association, 74 (366a),
427–431.
Driscoll, J. C., & Kraay, A. C. (1998). Consistent covariance matrix estimation with
spatially dependent panel data. Review of Economics and Statistics, 80 (4), 549–560.

Durbin, J., & Watson, G. S. (1950). Testing for serial correlation in least squares regression:
I. Biometrika, 37 (3/4), 409–428.
Eberhardt, M. (2011). XTCD: Stata module to investigate Variable/Residual Cross-Section
Dependence. https://econpapers.repec.org/software/bocbocode/s457237.htm
Elliott, G., Rothenberg, T. J., & Stock, J. H. (1996). Efficient Tests for an Autoregressive
Unit Root. Econometrica, 64 (4), 813–836.
Frees, E. W. (1995). Assessing cross-sectional correlation in panel data. Journal of
Econometrics, 69 (2), 393–414.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association, 32 (200), 675–701.
Godfrey, L. G. (1978). Testing against general autoregressive and moving average error
models when the regressors include lagged dependent variables. Econometrica: Journal
of the Econometric Society, 1293–1301.
Hamilton, J. D. (1994). Time Series Analysis (1st ed.). Princeton University Press.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Herwartz, H., & Siedenburg, F. (2008). Homogenous panel unit root tests under cross sec-
tional dependence: Finite sample modifications and the wild bootstrap. Computational
Statistics & Data Analysis, 53 (1), 137–150. https://doi.org/10.1016/j.csda.2008.07.008
Herwartz, H., Maxand, S., Raters, F. H., & Walle, Y. M. (2018). Panel unit-root tests
for heteroskedastic panels. Stata Journal, 18 (1), 184–196.
Hoechle, D. (2007). Robust Standard Errors for Panel Regressions with Cross-Sectional
Dependence. The Stata Journal: Promoting Communications on Statistics and Stata,
7 (3), 281–312. https://doi.org/10.1177/1536867X0700700301
Hoechle, D. (2018). XTSCC: Stata module to calculate robust standard errors for panels
with cross-sectional dependence. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s456787.html
Pesaran, M. H. (2004). General diagnostic tests for cross section dependence in panels.
Pesaran, M. H. (2015). Testing weak cross-sectional dependence in large panels. Econo-
metric Reviews, 34 (6–10), 1089–1117.
Toutkoushian, R. K., & Paulsen, M. B. (2016). Economics of Higher Education: Background, Concepts, and Applications (1st ed.). Springer.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press.
Wursten, J. (2017). XTCDF: Stata module to perform Pesaran’s CD-test for cross-sectional
dependence in panel context. In Statistical Software Components. Boston College
Department of Economics. https://ideas.repec.org/c/boc/bocode/s458385.html
Chapter 9
Advanced Statistical Techniques: II

Abstract This chapter continues our discussion of advanced statistical techniques. More specifically, this chapter introduces and demonstrates the
use of advanced statistical techniques that enable higher education policy
analysts to exploit the use of emerging macro panel data. These statistical
techniques include heterogeneous coefficient regression (HCR) with dynamic
coefficient common correlated estimation (DCCE) and mean group (MG)
estimators. Using these techniques, analysts can distinguish between: (1)
short-run and long-run relationships and; (2) overall average beta coefficients
and unit-specific beta coefficients.

Keywords Heterogeneous coefficient regression (HCR) · Dynamic coefficient common correlated estimation (DCCE) · Mean group (MG) estimators

9.1 Introduction

As a field of study, higher education is coming of age. Mostly quantitative in nature, studies in higher education increasingly involve the use of advanced
quantitative techniques. While there has been some growth in the number
of investigations employing ex-post facto/causal comparative and quasi-
experimental/causal inference, the field was still dominated by correlational
designs up through the last decade (Wells et al. 2015). This chapter introduces
and demonstrates the use of one of the most recently developed advanced
correlational statistical techniques to address the most commonly violated
assumptions of OLS regression. In this chapter, we discuss and show how to


systematically test for those violations and then apply a specific technique
that may be used with macro panel data to address questions with regard to
short-run and long-run relationships between variables. The Stata commands
and syntax used to demonstrate the use of these advanced correlational
statistical techniques are included in an appendix at the end of the chapter.

9.2 The Context of Macro Panel Data and an Appropriate Statistical Approach

As higher education data span more years, methodological issues connected to macro panels become more important. This is most evident with respect
to state higher education finance data, which are now “maturing” with longer
periods of time or time series that would enable analysts to conduct more
sophisticated studies.
Addressing methodological, short-run (e.g., one to two years), and long-run (e.g., 20 or more years) state-level higher education policy issues requires the existence of state-level panel data with long time series. Fortunately,
the State Higher Education Executive Officers (SHEEO) make available
data on selected state higher education finance variables that allow for a
demonstration of a statistical approach that addresses the issues discussed
above. More specifically, this chapter discusses an appropriate statistical
approach when using macro panel data. This statistical approach is a heterogeneous coefficient regression (HCR) with dynamic common correlated effects (DCCE) and mean group (MG) estimators. This chapter will demonstrate
the use of such an approach.

9.2.1 Heterogeneous Coefficient Regression

The origin of heterogeneous coefficient regression (HCR) models is grounded in the statistics (Hildreth and Houck 1968) and econometrics literature. This
literature addressed testing (Swamy 1970) for and estimating (Hsiao 1975)
random coefficients when using micro or two dimensional (e.g., a large number
of individuals within firms over a short period of time) panel data. Unlike
other panel-based regression models (e.g., pooled ordinary least squares,
fixed-effects, and random-effects), random coefficient regression models do
not assume homogeneity with respect to estimated parameters or coefficients
across units (e.g., individuals, institutions, states). In random coefficient
models, estimated coefficients are heterogeneous; in other words, they are free to vary across units.
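The contrast with pooled estimators can be made concrete: fit the regression separately for each unit and, if a single summary is wanted, average the unit-specific slopes, which is the mean group (MG) idea discussed below. A small Python/numpy sketch with simulated data (Python is used here purely for illustration; all names and data are ours):

```python
import numpy as np

def unit_slopes(y, x, unit):
    """OLS slope of y on x, estimated separately for each panel unit."""
    slopes = {}
    for g in np.unique(unit):
        m = unit == g
        Z = np.column_stack([np.ones(m.sum()), x[m]])
        slopes[g] = np.linalg.lstsq(Z, y[m], rcond=None)[0][1]
    return slopes

# Simulated panel where the true slope differs across units
rng = np.random.default_rng(2)
T, N = 30, 10
unit = np.repeat(np.arange(N), T)
true_b = rng.normal(1.0, 0.3, N)                 # heterogeneous slopes
x = rng.standard_normal(N * T)
y = true_b[unit] * x + 0.1 * rng.standard_normal(N * T)
b = unit_slopes(y, x, unit)
mg = np.mean(list(b.values()))                   # mean-group summary
```

Each unit recovers its own slope, and the mean-group average summarizes them, something a pooled fixed-effects estimator, which forces one common slope, cannot do.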

Cheslock and Rios-Aguilar (2011) discuss how random coefficient regression models are closely related to multilevel regression models. They explain
how economists use the former to control for varying or heterogeneous
coefficients across groups or time. In comparison, multilevel models are
often employed by higher education researchers to model the differences in
average beta coefficients at one level (e.g., students) by including variables
from another level (e.g., institutions or states) at a given period of time.
Economists (e.g., Engle and Granger 1987) have historically addressed the
topic of short-run and long-run coefficients within the framework of panel-
based multivariate time series analysis, more specifically error correction
models (ECM).
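The ECM logic can be illustrated with a short simulation: the change in y responds to last period's deviation from the long-run relation, here taken as y = 2x and, for brevity, treated as known rather than estimated (an Engle-Granger first step would estimate it). A Python/numpy sketch, our own illustration rather than anything from the text:

```python
import numpy as np

# Simulate a cointegrated pair where y adjusts toward the long-run relation y = 2x
rng = np.random.default_rng(5)
T = 400
x = np.cumsum(rng.standard_normal(T))       # nonstationary (random-walk) regressor
y = np.empty(T)
y[0] = 2 * x[0]
for t in range(1, T):
    # ECM data-generating process:
    # dy_t = -0.4*(y_{t-1} - 2*x_{t-1}) + 0.5*dx_t + e_t
    y[t] = (y[t - 1] - 0.4 * (y[t - 1] - 2 * x[t - 1])
            + 0.5 * (x[t] - x[t - 1]) + 0.2 * rng.standard_normal())

# Estimate the ECM by OLS: regress dy on the lagged disequilibrium and dx
dy, dx = np.diff(y), np.diff(x)
diseq = (y - 2 * x)[:-1]                    # lagged error-correction term
Z = np.column_stack([np.ones(T - 1), diseq, dx])
coefs = np.linalg.lstsq(Z, dy, rcond=None)[0]
alpha_hat = coefs[1]                        # speed of adjustment, near -0.4
```

The coefficient on the lagged disequilibrium is the speed of adjustment (the long-run side of the model), while the coefficient on dx captures the short-run response, the same short-run/long-run distinction the HCR approach below exploits.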

9.2.2 Macro Panel Data

Many higher education researchers have acknowledged that regression coefficients are not homogeneous. Utilizing multilevel longitudinal or micro
panel data, higher education analysts mainly focused on explaining how
estimated intercepts (i.e., the average of student outcomes or experiences) and
coefficients (i.e., the relationship between a student outcome and a student-
level predictor) may vary and how that variation is related to variables at
a higher dimension (e.g., colleges-level variables). Using mostly fixed-effects
regression models, some higher education researchers have utilized panel data
in which the number of time periods (T ) is substantially less than the number of groups (N ) to examine the effects of higher education policy
on higher education outcomes (e.g., college graduation rates) and outputs
(e.g., the number of college degrees). Fixed-effect regression models with
pooled estimators, however, allow only the intercept to vary and assume
homogeneous coefficients across groups (e.g., institutions, states).
Economists extended the concept of random coefficient models and applied
them to data in which T is substantial (T ≥ 20). These are known as macro
panel datasets (Baltagi 2008). In most cases, studies using macro panel
datasets also involve examining the long-run relationship or cointegration
(more on this below) between economic variables among countries or juris-
dictions (e.g., provinces or states) within countries. This research expanded as macro panel datasets became available over the past few years (Baltagi 2008). For example, macro panel data are now provided by the World
Bank (i.e., https://data.worldbank.org/) and National Bureau of Economic
Research (i.e., Penn World Tables). State-level U.S. data are also available
from a variety of sources such as the Census Bureau, Bureau of Economic
Analysis (BEA), Bureau of Labor Statistics, Council of State Governments,
and National Center for Educational Statistics (NCES). Data from these
disparate sources, however, have to be assembled by individual researchers
for their own use as customized panel datasets. To create such customized

panel datasets, one has to match data on the same state identification code,
such as the Federal Information Processing Standards (FIPS) code and same
year. Because of missing state or year data, the matching may yield a panel
dataset with a small number (N ) of states or short time (T ) period or both. A
likely outcome is T will be substantially smaller than the maximum number
(50) of states (N ).
With respect to state-level panel data for higher education, in only a very
few cases will T begin to approach N. However, when T approaches N or
results in a macro panel dataset, we can begin to address a variety of empirical
questions going forward. These questions may include the following: (1) What are the average short-run and long-run relationships between important state-level policy variables? (2) Among individual states, what are the short-run and long-run relationships between important policy variables? (3) Given shocks to the long-run relationship between policy variables or “equilibrium”, how long does it take states to adjust to their “equilibrium”?
Using macro panel data and a HCR approach with DCCE and MG estimators,
we can address all of the above questions. This approach allows for consistent
estimates in the face of variables with nonstationary data (i.e., means and
variances that do not remain constant over time) and takes into account cross-
sectional dependence or spillover effects between groups (e.g., states).

9.2.3 Common Correlated Effects Estimators

Pesaran (2006) laid the foundation for the use of panel regression models
augmented with cross-sectional averages, the common correlated effects
(CCE) estimation procedure, in which the averages serve as proxies for the
common factors that generate strong cross-sectional dependence. By contrast,
spatially correlated common shocks that are geographically based or produce
spillover effects among specific regions are known as weak cross-sectional
dependence (Chudik et al. 2011).
Pesaran (2006) developed the common correlated effects (CCE) estimator
as a technique to address cross-sectional dependence. Building on his work,
Chudik and Pesaran (2015) extended the CCE estimator and employed it
in a dynamic panel data modeling framework. Pesaran (2006) as well as
Kapetanios et al. (2011) combined the CCE estimator with the mean group
(MG) estimator, referred to as the CCEMG estimator. Kapetanios et al.
(2011) contend that the CCEMG estimator is robust to variables composed
of nonstationary data.
Patel (2019) used regression models with dynamic CCE (DCCE) estima-
tors to examine the short-run and long-run relationship between state-level
minimum wage and the number of employees businesses plan to hire. Employ-
ing state-level data, Liddle (2017) applied regression models with DCCE
estimators to examine factors related to energy consumption. Passamani and
Tomaselli (2018) used a regression model with a DCCE estimator to analyze
air pollution and health risk across different sites in an Italian province.
But before we employ a HCR approach with DCCE estimators, we test for
the violation of the assumption of cross-sectional independence. We will use
these tests to detect both strong and weak cross-sectional dependence. The
results of these tests will be shown below.

9.2.4 HCR with a DCCE Estimator

In a HCR model with a DCCE estimator, the coefficients are randomly
distributed around a common mean, as reflected in the following regression
equation:

y_{i,t} = \alpha_i + \lambda_i y_{i,t-1} + \beta_i x_{i,t} + u_{i,t}, \qquad (9.1)

u_{i,t} = \gamma_i f_t + e_{i,t}

where y_{i,t} is the outcome variable, x_{i,t} is a vector of predictor variables,
\alpha_i is the constant term or intercept, \lambda_i and \beta_i are coefficient
vectors, i indexes states, and t indexes years. Compared to a fixed-effect
regression model, the error, u_{i,t}, is more complex: it is composed of
unobserved common factors, f_t, with state-specific factor loadings, \gamma_i.
The error e_{i,t} is independent and identically distributed (iid). The
heterogeneous coefficients are randomly
distributed around a common mean. If lags of the cross-section averages of the
outcome and predictor variables are added to the Eq. (9.1), then unobserved
common factors between states are taken into account. This is reflected in
the following equation:


y_{i,t} = \alpha_i + \lambda_i y_{i,t-1} + \beta_i x_{i,t} + \sum_{l=0}^{p_T} \theta_{i,l} \bar{z}_{t-l} + \varepsilon_{i,t} \qquad (9.2)

 
where \bar{z}_{t-l} = (\bar{y}_{t-l}, \bar{x}_{t-l}) is the vector of cross-
section means at time t - l, \theta_{i,l} is the associated vector of
coefficients, l is the number of lags, and p_T is the maximum number of
cross-section lags. According to Chudik and Pesaran (2015), in a dynamic
context where a lag of the dependent variable is an independent variable,
the minimum number of cross-section lags should be the cube root of the
total number of time periods (\sqrt[3]{T}). Because the averages \bar{y}
and \bar{x} serve solely to control for unobserved common factors between
groups, the vector of \theta_{i,l} coefficients in Eq. (9.2) has no
meaningful interpretation.
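As a minimal sketch of the augmentation in Eq. (9.2), the cross-section averages and their lags can be built directly from a panel array. The simulated data and dimensions below are illustrative only, not the state panel used later in this chapter.

```python
import numpy as np

# Sketch: constructing the cross-section averages that augment Eq. (9.2).
# Hypothetical panel: N = 4 states, T = 5 years, one predictor x.
rng = np.random.default_rng(0)
y = rng.normal(size=(4, 5))          # y[i, t]
x = rng.normal(size=(4, 5))          # x[i, t]

# Cross-section averages at each t: the mean over states (axis 0). These
# are the same for every state and proxy the unobserved common factors.
y_bar = y.mean(axis=0)               # shape (5,)
x_bar = x.mean(axis=0)

# z_bar at lags l = 0 and l = 1; taking a lag drops the first period, so
# both regressor blocks cover the T - 1 usable periods.
z_bar_l0 = np.column_stack([y_bar[1:], x_bar[1:]])
z_bar_l1 = np.column_stack([y_bar[:-1], x_bar[:-1]])

print(z_bar_l0.shape, z_bar_l1.shape)   # (4, 2) (4, 2)
```

In the DCCE regression, these shared columns are simply appended to each state's regressors; their coefficients are nuisance terms with no substantive interpretation, as noted above.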
186 9 Advanced Statistical Techniques: II

9.2.5 Error Correction Model Framework

Because we are also interested in estimating short-run and long-run
coefficients, Eq. (9.2) can be re-written in an error correction model (ECM)
framework with autoregressive distributed lags (p):


\Delta y_{i,t} = \gamma_{0i} + \alpha_i (y_{i,t-1} - \beta_{2i} x_{i,t}) + \sum_{l=1}^{p} \beta_{1li} \Delta y_{i,t-l} + \sum_{l=1}^{p} \beta_{2li} \Delta x_{i,t-l} + \sum_{l=0}^{p_T} \theta_{i,l} \bar{z}_{t-l} + \varepsilon_{i,t} \qquad (9.3)

where the short-run relationships involve the terms with Δs, while the
long-run relationships are represented within the parentheses. The ECM
framework also allows for an estimation of how short-run changes adjust
toward the long-run relationship between variables, otherwise known as the
"equilibrium" or "steady state". The estimated error correction
(EC) parameter indicates the extent to which disequilibrium is dissipated
before the next time period. In general, when used with panel data, the
underlying assumption of ECMs is that the short-run coefficients and long-
run coefficients are the same across groups (e.g., states) or homogeneous.
Equation (9.3) allows us to estimate how a change in an independent
variable (e.g. GSP) affects state appropriations both at impact (Δx → Δy)
and in the long-run through “disturbing” the equilibrium relationship within
parentheses. That disturbance to the equilibrium is “corrected” at a rate of
−100α% per year, which is interpreted as the extent to which states adjust to
their long-run trend. It should be noted that the long-run trend may be either
increasing or decreasing. Using Eq. (9.3), we can: (1) estimate both short-run
and long-run coefficients for each state and then average them; (2) restrict
the long-run coefficients to be the same across all states; or (3) assume all
coefficients across all states are homogeneous. If we relax the assumption of
homogeneous coefficients and estimate state-specific short-run and long-run
coefficients, then we will have to invoke the mean group (MG) estimator.
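To make the adjustment-speed interpretation concrete, here is a small Python sketch with a hypothetical EC parameter of -0.25. It is purely illustrative, not an estimate from the data analyzed later in this chapter.

```python
import math

# Sketch: speed of adjustment implied by an error correction parameter.
# With a hypothetical EC coefficient alpha = -0.25, 25% of any
# disequilibrium is dissipated each year (-100 * alpha percent).
alpha = -0.25

gap = 1.0                       # initial deviation from the long-run path
path = [gap]
for _ in range(10):
    gap += alpha * gap          # gap_{t+1} = (1 + alpha) * gap_t
    path.append(gap)

# Half-life of a shock: years until half the disequilibrium remains.
half_life = math.log(0.5) / math.log(1 + alpha)
print(round(half_life, 2))      # about 2.41 years
```

So an EC coefficient of -0.25 implies that a state returns halfway to its long-run trend in roughly two and a half years; a coefficient closer to -1 would imply near-immediate adjustment.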

9.2.6 Mean Group Estimator

The mean group (MG) estimator was proposed by Pesaran and Smith (1995)
and developed by Pesaran et al. (1999). It involves calculating the mean
of the coefficients from separate regressions for each group (e.g., state) in
a panel dataset with long time series and a large number of groups. The
MG estimator requires that T is large enough to estimate an ordinary least
squares (OLS) regression model for each group (e.g., state). Consequently, the
estimated coefficients may differ across groups. According to Ditzen (2018a),
if we "stack" the estimated coefficients of each group (e.g., state) from Eq.
(9.3) into the vector \pi_i = (\gamma_i, \alpha_i, \beta_i), then the MG
estimates are:

\hat{\pi}_{MG} = \frac{1}{N} \sum_{i=1}^{N} \hat{\pi}_i \qquad (9.4)

Equation (9.4) simply reflects the mean of the individual coefficients that
are estimated for each group.
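Equation (9.4) can be sketched directly: fit a separate OLS regression for each group and average the slope estimates. The simulated panel below is hypothetical; with little noise and T = 40, the mean group estimate should land close to the average of the true state-specific slopes.

```python
import numpy as np

# Sketch of Eq. (9.4): run a separate OLS regression for each group and
# average the coefficients. Simulated data; true slopes differ by "state".
rng = np.random.default_rng(1)
N, T = 6, 40
true_slopes = rng.normal(1.0, 0.3, size=N)     # heterogeneous coefficients

slopes = []
for i in range(N):
    x = rng.normal(size=T)
    y = 0.5 + true_slopes[i] * x + rng.normal(scale=0.1, size=T)
    X = np.column_stack([np.ones(T), x])
    beta_i = np.linalg.lstsq(X, y, rcond=None)[0]   # group-specific OLS
    slopes.append(beta_i[1])

beta_mg = np.mean(slopes)      # the mean group estimate of the slope
print(round(beta_mg, 2))
```

This is why the MG estimator needs T large enough to fit an OLS model per group: each group's slope is estimated on its own T observations before the averaging in Eq. (9.4).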
The relaxation of the assumption of homogeneous coefficients warrants the
use of a HCR model. More specifically, a statistical test of the violation of
the homogeneity of coefficient assumption should be conducted (Pesaran et
al. 2008). The results of this test will also be shown below.

9.2.6.1 Short-Run and Long-Run Coefficients

The ECM framework includes autoregressive distributed lags (ARDLs),
which are used to determine the presence of long-run relationships between
time series. ARDLs enable us to simultaneously estimate the short-run
and long-run relationships among the variables. Within an ECM framework, long-
run coefficients refer to the relationship between variables over an extended
period of time, usually the length of the available time series (e.g., 30 years)
while short-run coefficients reflect the relationship within a shorter time
period (e.g., 1 or 2 years). The long-run coefficients are usually based on levels
variables. The short-run coefficients are usually based on first- or second-
differences of variables.
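The link between the two sets of coefficients can be written out explicitly. For an ARDL(1,1), the long-run coefficient combines the short-run effects with the degree of persistence; the numbers below are hypothetical.

```python
# Sketch: long-run multiplier implied by hypothetical ARDL(1,1) estimates
# for y_t = a + lam*y_{t-1} + b0*x_t + b1*x_{t-1} + e_t.
lam, b0, b1 = 0.6, 0.3, 0.1   # hypothetical short-run estimates

# Setting y_t = y_{t-1} = y* and x_t = x_{t-1} = x* (the steady state)
# and solving for y* gives the long-run coefficient on x:
long_run = (b0 + b1) / (1 - lam)
print(round(long_run, 2))
```

With these hypothetical values the long-run coefficient is (0.3 + 0.1)/(1 - 0.6) = 1.0, larger than either short-run coefficient because the lagged dependent variable propagates the effect forward over time.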

9.3 Demonstration of HCR with DCCE and MG Estimators

Using 40 years of state-level higher education finance and other data across
50 states, we demonstrate how a HCR model with a DCCE estimator can
be utilized to address the first question below by producing short-run and
long-run coefficients for selected variables of interest to higher education
policy analysts. We show how to utilize a HCR model with DCCE and MG
estimators to address the second question with regard to estimating short-
and long-run coefficients for individual states. So, now we demonstrate how
we can answer all three questions:
1. What are the average short-run and long-run relationships between
important state-level policy variables?
2. Among individual states, what are the short-run and long-run relationships
between important policy variables?
3. Given shocks to the long-run relationship between policy variables or
“equilibrium”, how long does it take states to adjust to their “equilibrium”?
We use the Stata user-written xtdcce2 (Ditzen 2018b) routine to
produce the estimates that will help address these questions. (See Ditzen
(2016) for a comprehensive description of xtdcce2.)

9.3.1 Macroeconomic Panel Data

This demonstration draws on State Higher Education Executive Officers
(SHEEO) data. SHEEO now provides up to 40 years of selected state-
level higher education finance data across 50 states. This allows for the
construction of state-level higher education finance data with a non-trivial
span of years. We draw from the U.S. Bureau of Economic Analysis (BEA)
to supplement the SHEEO data and construct a macro panel dataset. This
could be considered a macroeconomic panel. According to Pesaran (2015b),
macroeconomic panel data are typically characterized by cross-sectional
dependence, due to omitted common effects, spatial dependence and linkages
between nations or states.
Because this is a demonstration of statistical techniques of how to deal with
these issues, rather than a comprehensive examination of the determinants
of state appropriations to higher education, only a few variables are included
in this analysis. The dependent variable (lny1) in this analysis is state
appropriations to higher education. The independent variables include net
tuition revenue (lnx1), the number of full-time equivalent students (lnx2),
and gross state product (lnx4). For the sake of simplicity, we assume all the
independent variables are exogenous. Because they are skewed, the data for
all variables are log transformed.

9.3.2 Tests for Nonstationary Data

By looking at trends over time, we can check to see if data are stationary.
Figs. 9.1 and 9.2 show trends for log transformed state appropriations and
gross state product (GSP), respectively, by state. We can see there is an
upward trend in both series over time in all states. At least for these two
variables, the data appear to be nonstationary. What are the implications of
using nonstationary data in regression models? If we use an OLS regression
model (with state fixed-effects) to regress nonstationary state appropriations
data on nonstationary GSP data, we get an extremely high R2 (0.987). The
Fig. 9.1 Trends in log of appropriations by state, FY 1980 to FY 2018

results of the model suggest that a 1% change in GSP is associated with a
0.8% change in state appropriations (beta = 0.773, p < 0.001). These findings
are highly suspicious and are more likely an artifact of using nonstationary data.
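The spurious-regression danger can be reproduced in a few lines. This Python sketch uses two independent simulated random walks rather than the SHEEO/BEA data, so the exact numbers will differ, but the pattern, an inflated R-squared in levels that vanishes in first differences, is the point.

```python
import numpy as np

# Sketch: spurious regression between two INDEPENDENT random walks.
rng = np.random.default_rng(42)
T = 500
x = np.cumsum(rng.normal(size=T))   # nonstationary series 1
y = np.cumsum(rng.normal(size=T))   # nonstationary series 2, unrelated to x

def r_squared(dep, reg):
    """R^2 from an OLS regression of dep on a constant and reg."""
    X = np.column_stack([np.ones(reg.size), reg])
    beta = np.linalg.lstsq(X, dep, rcond=None)[0]
    resid = dep - X @ beta
    return 1 - resid.var() / dep.var()

r2_levels = r_squared(y, x)                    # often large despite no link
r2_diffs = r_squared(np.diff(y), np.diff(x))   # near zero, as it should be
print(round(r2_levels, 2), round(r2_diffs, 3))
```

Because both level series drift, they tend to track each other by chance; differencing removes the shared drift and the apparent relationship with it.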
The proper way to check for nonstationary data is to conduct statistical
tests. We test to determine whether the means and variances of all the variables
remain constant over time or whether the panel data are nonstationary. We use the
Stata routine xtpurt, with test options proposed by Herwartz and Siedenburg
(2008), Demetrescu and Hanck (2012), and Herwartz et al. (2019). In the
three test options, the null hypothesis is that the panels (i.e., states) contain
nonstationary data or unit roots. The results of the tests are shown below.

Fig. 9.2 Trends in Log of GSP by state, FY 1980 to FY 2018

. xtpurt lny1, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lny1
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 2.3444 0.9905
------------------------------------------------------------------------------
. xtpurt lnx1, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx1
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 3.6476 0.9999
------------------------------------------------------------------------------
. xtpurt lnx2, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx2
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 1.0687 0.8574
------------------------------------------------------------------------------
. xtpurt lnx4, test(hs)
Herwartz and Siedenburg (2008) unit-root test for lnx4
-------------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs 0.9808 0.8366
------------------------------------------------------------------------------
. xtpurt lny1, test(dh)
Demetrescu and Hanck (2012) unit-root test for lny1
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 3.1686 0.9992
------------------------------------------------------------------------------
. xtpurt lnx1, test(dh)
Demetrescu and Hanck (2012) unit-root test for lnx1
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 4.0982 1.0000
------------------------------------------------------------------------------
. xtpurt lnx2, test(dh)
Demetrescu and Hanck (2012) unit-root test for lnx2
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 1.9286 0.9731
------------------------------------------------------------------------------
. xtpurt lnx4, test(dh)
Demetrescu and Hanck (2012) unit-root test for lnx4
-----------------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_dh 0.0673 0.5268
------------------------------------------------------------------------------
. xtpurt lny1, test(hmw) trend
Herwartz et al. (2017) unit-root test for lny1
-----------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 36
Constant: Included Prewhitening: BIC
Time trend: Included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hmw 1.6813 0.9536
------------------------------------------------------------------------------
. xtpurt lnx1, test(hmw) trend
Herwartz et al. (2017) unit-root test for lnx1
-----------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 36
Constant: Included Prewhitening: BIC
Time trend: Included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hmw 0.5814 0.7195
------------------------------------------------------------------------------
. xtpurt lnx2, test(hmw) trend
Herwartz et al. (2017) unit-root test for lnx2
-----------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hmw 0.4823 0.6852
------------------------------------------------------------------------------
. xtpurt lnx4, test(hmw) trend
Herwartz et al. (2017) unit-root test for lnx4
-----------------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 40
After rebalancing = 36
Constant: Included Prewhitening: BIC
Time trend: Included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hmw 3.0343 0.9988
------------------------------------------------------------------------------
(Tests for non-stationary data with first-differences)

We run all versions of the unit-root tests for heteroskedastic panels
(xtpurt) on all the variables in the model.
. xtpurt D1lny1, test(all)
All methods unit-root test for D1lny1
--------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 39
After rebalancing = 34
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs -2.9944 0.0014
t_dh -2.8429 0.0022
------------------------------------------------------------------------------
. xtpurt D1lnx1, test(all)
All methods unit-root test for D1lnx1
--------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 39
After rebalancing = 34
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs -4.3014 0.0000
t_dh -4.4279 0.0000
------------------------------------------------------------------------------
. xtpurt D1lnx2, test(all)
All methods unit-root test for D1lnx2
--------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 39
After rebalancing = 34
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=4
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs -3.0376 0.0012
t_dh -2.5600 0.0052
------------------------------------------------------------------------------
. xtpurt D1lnx4, test(all)
All methods unit-root test for D1lnx4
--------------------------------------
Ho: Panels contain unit roots Number of panels = 50
Ha: Panels are stationary Number of periods = 39
After rebalancing = 35
Constant: Included Prewhitening: BIC
Time trend: Not included Lag orders: min=0 max=3
------------------------------------------------------------------------------
Name Statistic p-value
------------------------------------------------------------------------------
t_hs -2.3108 0.0104
t_dh -1.2934 0.0979
------------------------------------------------------------------------------

The test results above suggest that the levels of the variables are
nonstationary across states. The first-differences of the variables, however,
are stationary. This indicates that all the variables are integrated of order
one, I(1).
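The intuition behind these unit-root tests can be sketched with a simple Dickey-Fuller-style regression on simulated data. This illustrates the logic only; it is not a substitute for xtpurt's panel tests.

```python
import numpy as np

# Sketch: an informal Dickey-Fuller-style check that a series is I(1).
# Regress delta_y_t on y_{t-1}; for a random walk the coefficient is near
# zero (unit root not rejected), while for the differenced, stationary
# series it is strongly negative.
rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(size=400))     # simulated I(1) series

def df_coef(series):
    """Coefficient on the lagged level in a regression of the first
    difference on a constant and the lagged level."""
    dz = np.diff(series)
    lag = series[:-1]
    X = np.column_stack([np.ones(lag.size), lag])
    return np.linalg.lstsq(X, dz, rcond=None)[0][1]

rho_level = df_coef(y)           # close to 0: behaves like a unit root
rho_diff = df_coef(np.diff(y))   # close to -1: differences are stationary
print(round(rho_level, 2), round(rho_diff, 2))
```

The pattern, a near-zero coefficient in levels and a strongly negative one in first differences, mirrors the xtpurt results above: nonstationary levels, stationary differences, hence I(1).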

9.3.3 Tests for Cointegration

An error correction modeling framework requires that we determine whether
there is a cointegrating relationship among the variables. If we regress the
first differences of the dependent variable on the independent variables and
find that the combination of the estimated beta coefficients of the
independent variables is less than 1, then the variables are cointegrated. However,
we should use a formal statistical test to detect cointegrating relationships
among variables.
To test for these relationships, we use the Stata xtcointtest routine with
the Kao, Pedroni, and Westerlund (2005) test options. For these test options,
the null hypothesis is no cointegration, while rejection of the null indicates
cointegration in all of the panels (i.e., states). We test for no cointegration
with and without demeaning (first subtracting the cross-sectional averages
from the series) the data.
. xtcointtest kao lny1 lnx1 lnx2 lnx4
Kao test for cointegration
--------------------------
Ho: No cointegration Number of panels = 50
Ha: All panels are cointegrated Number of periods = 38

Cointegrating vector: Same
Panel means: Included Kernel: Bartlett
Time trend: Not included Lags: 1.98 (Newey-West)
AR parameter: Same Augmented lags: 1
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Modified Dickey-Fuller t -5.7545 0.0000
Dickey-Fuller t -4.4890 0.0000
Augmented Dickey-Fuller t -4.2430 0.0000
Unadjusted modified Dickey-Fuller t -6.3754 0.0000
Unadjusted Dickey-Fuller t -4.7369 0.0000
------------------------------------------------------------------------------
. xtcointtest kao lny1 lnx1 lnx2 lnx4, demean
Kao test for cointegration
--------------------------
Ho: No cointegration Number of panels = 50
Ha: All panels are cointegrated Number of periods = 38
Cointegrating vector: Same
Panel means: Included Kernel: Bartlett
Time trend: Not included Lags: 1.90 (Newey-West)
AR parameter: Same Augmented lags: 1
Cross-sectional means removed
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Modified Dickey-Fuller t -3.0070 0.0013
Dickey-Fuller t -2.2358 0.0127
Augmented Dickey-Fuller t -0.6417 0.2605
Unadjusted modified Dickey-Fuller t -5.7433 0.0000
Unadjusted Dickey-Fuller t -3.5973 0.0002
------------------------------------------------------------------------------
. xtcointtest pedroni lny1 lnx1 lnx2 lnx4
Pedroni test for cointegration
------------------------------
Ho: No cointegration Number of panels = 50
Ha: All panels are cointegrated Number of periods = 39
Cointegrating vector: Panel specific
Panel means: Included Kernel: Bartlett
Time trend: Not included Lags: 0.00 (Newey-West)
AR parameter: Panel specific Augmented lags: 1
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Modified Phillips-Perron t 0.4033 0.3433
Phillips-Perron t -3.9530 0.0000
Augmented Dickey-Fuller t -3.7738 0.0001
------------------------------------------------------------------------------
. xtcointtest pedroni lny1 lnx1 lnx2 lnx4, demean
Pedroni test for cointegration
------------------------------
Ho: No cointegration Number of panels = 50
Ha: All panels are cointegrated Number of periods = 39
Cointegrating vector: Panel specific
Panel means: Included Kernel: Bartlett
Time trend: Not included Lags: 3.00 (Newey-West)
AR parameter: Panel specific Augmented lags: 1
Cross-sectional means removed
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Modified Phillips-Perron t -1.0765 0.1408
Phillips-Perron t -5.3902 0.0000
Augmented Dickey-Fuller t -4.6909 0.0000
------------------------------------------------------------------------------
. xtcointtest westerlund lny1 lnx1 lnx2 lnx4
Westerlund test for cointegration
---------------------------------
Ho: No cointegration Number of panels = 50
Ha: Some panels are cointegrated Number of periods = 40
Cointegrating vector: Panel specific
Panel means: Included
Time trend: Not included
AR parameter: Panel specific
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Variance ratio -3.7111 0.0001
------------------------------------------------------------------------------
. xtcointtest westerlund lny1 lnx1 lnx2 lnx4, demean
Westerlund test for cointegration
---------------------------------
Ho: No cointegration Number of panels = 50
Ha: Some panels are cointegrated Number of periods = 40
Cointegrating vector: Panel specific
Panel means: Included
Time trend: Not included
AR parameter: Panel specific
Cross-sectional means removed
------------------------------------------------------------------------------
Statistic p-value
------------------------------------------------------------------------------
Variance ratio -1.9944 0.0231
------------------------------------------------------------------------------

The results of the cointegration tests reveal that the variables in all the panels
(i.e., states) are cointegrated. We also conduct an ECM-based cointegration
test, developed by Westerlund (2007), which is robust to abrupt changes in
the estimated beta coefficients (i.e., structural breaks), serial correlation, and
heteroscedasticity.
. xtwest lny1 lnx1 lnx2 lnx4, constant lags(0 3)
Calculating Westerlund ECM panel cointegration tests..........
Results for H0: no cointegration
With 50 series and 3 covariates
Average AIC selected lag length: 1.02
Average AIC selected lead length: 0
-----------------------------------------------+
Statistic | Value | Z-value | P-value |
-----------+-----------+-----------+-----------|
Gt | -2.903 | -5.023 | 0.000 |
Ga | -14.634 | -3.686 | 0.000 |
Pt | -17.658 | -3.858 | 0.000 |
Pa | -11.786 | -4.665 | 0.000 |
-----------------------------------------------+

The results of the ECM-based cointegration test reveal the presence of
a cointegrating relationship among the variables. The Gt and Ga statistics
show there is cointegration in at least one state, while the Pt and Pa
statistics indicate there is cointegration in the panel as a whole. The
Westerlund ECM-based cointegration test thus confirms a long-run
relationship among the levels of the variables that holds for a linear
combination of the variables rather than for the levels of the variables
separately.
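What cointegration means mechanically can be sketched with simulated series: a random walk x and a y built from x plus stationary noise share a stochastic trend, so the residual from regressing y on x stays bounded even though both series wander. The data below are illustrative, not the state panel.

```python
import numpy as np

# Sketch: a cointegrating relationship. x is a random walk; y shares x's
# stochastic trend plus stationary noise, so (y - beta*x) is stationary.
rng = np.random.default_rng(3)
T = 600
x = np.cumsum(rng.normal(size=T))                    # nonstationary trend
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=T)    # cointegrated with x

X = np.column_stack([np.ones(T), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# The cointegrating residual has a small, stable spread, unlike x itself.
print(round(resid.std(), 2), round(x.std(), 2))
```

It is exactly this bounded residual, the deviation from "equilibrium", that the error correction term in Eq. (9.3) pulls back toward zero.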

9.3.4 Tests for Cross-Sectional Independence

To detect the existence of cross-sectional dependence, a test was conducted
using the Stata user-written routine xtcdf (Wursten 2017). The null
hypothesis is cross-section independence. The results are shown in the output
below.
. xtcdf lny1 lnx1 lnx2 lnx4
xtcd test on variables lny1 lnx1 lnx2 lnx4
Panelvar: newid
Timevar: fy
------------------------------------------------------------------------------
 Variable | CD-test   p-value   average joint T | mean ρ    mean abs(ρ)
----------+-------------------------------------+-----------------------------
 lny1     | 210.504    0.000        40.00       |  0.95        0.95
 lnx1     | 216.82     0.000        40.00       |  0.98        0.98
 lnx2     | 193.755    0.000        40.00       |  0.88        0.88
 lnx4     | 217.801    0.000        40.00       |  0.98        0.98
------------------------------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.

The results of this test suggest cross-sectional dependence. More
specifically, the large CD statistics, with p-values close to zero, indicate
there is strong cross-sectional dependence between states (Pesaran 2015a).
This suggests common national or global shocks impact states differently
and warrants the use of dynamic common correlated effects (CCE) estimators.
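The CD statistic that xtcdf reports is simple to compute by hand: it standardizes the sum of all pairwise correlation coefficients across panel units. The sketch below uses simulated data with a built-in common factor, so the statistic should land far outside the N(0, 1) range.

```python
import numpy as np

# Sketch of the Pesaran CD statistic:
# CD = sqrt(2T / (N(N-1))) * sum over i<j of pairwise correlations rho_ij.
# Under cross-section independence, CD ~ N(0, 1). Simulated data with a
# common factor, so dependence should be detected.
rng = np.random.default_rng(5)
N, T = 10, 40
factor = rng.normal(size=T)                       # common shock
data = 0.9 * factor + rng.normal(scale=0.5, size=(N, T))

corr = np.corrcoef(data)                          # N x N pairwise correlations
iu = np.triu_indices(N, k=1)                      # upper triangle, i < j
cd = np.sqrt(2 * T / (N * (N - 1))) * corr[iu].sum()
print(round(cd, 1))   # well beyond +/- 1.96 here: strong dependence
```

This mirrors the table above: when units load on a shared factor, the pairwise correlations are uniformly large and positive, and the CD statistic explodes relative to its N(0, 1) benchmark.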

9.3.5 Test of Homogeneous Coefficients

A final set of tests examines whether there is a violation of the assumption
of homogeneous coefficients that would warrant the use of an HCR model. This
set of tests, based on the work of several researchers (Blomquist and West-
erlund 2013; Pesaran and Yamagata 2008), is robust to heteroscedasticity
and serial correlation (H0: slope coefficients are homogenous). We run one
test for the levels and another test for the first differences of the dependent
and independent variables presented above. We utilize the Stata user-written
xthst (Ditzen and Bersvendsen 2020) routine to run these tests. As shown
in the output below, the results of the tests indicate a violation of the
assumption of homogeneous coefficients.
. xthst D1.lny1 D1.L1.lny1 D1.lnx1 D1.lnx2 D1.lnx4, hac whitening
Test for slope homogeneity
(Blomquist, Westerlund. 2013. Economic Letters)
H0: slope coefficients are homogenous
-------------------------------------
Delta p-value
-4.481 0.000
adj. -4.619 0.000
-------------------------------------
HAC Kernel: bartlett with bandwidth 3
Variables partialled out: constant
. xthst lny1 L1.lny1 lnx1 lnx2 lnx4, hac whitening
Test for slope homogeneity
(Blomquist, Westerlund. 2013. Economic Letters)
H0: slope coefficients are homogenous
-------------------------------------
Delta p-value
233.740 0.000
adj. 240.719 0.000
-------------------------------------
HAC Kernel: bartlett with bandwidth 3
Variables partialled out: constant

Together, the results of these tests warrant the use of a heterogeneous coefficient regression (HCR) approach. Because one of the independent variables is a lag of the dependent variable and we want to take into account common unobserved factors across states, the dynamic common correlated effects (DCCE) estimator is applied. Because we also want to estimate state-specific coefficients, the MG estimator is applied. To address the question about both the short-run and long-run dynamics existing between the dependent variable (i.e., state appropriations) and selected independent variables, we utilize the error correction modeling (ECM) framework reflected in Eq. (9.3). The ECM framework includes autoregressive distributed lags (ARDLs) of (1 1 1) and cross-sectional lags of (3 3 3 3). The ARDLs enable us to simultaneously estimate the short-run and long-run relationships among the variables, while three cross-sectional lags of each variable are included to take into account cross-sectional dependence.

9.3.6 Results of the HCR with DCCE and MG Estimators

The results are shown below.


. xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc cr(_all)
cr_lags(3 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(ardl)
(Dynamic) Common Correlated Effects Estimator - Mean Group (CS-ARDL)
Panel Variable (i): newid Number of obs = 1750
Time Variable (t): fy Number of groups = 50
Degrees of freedom per group: Obs per group (T) = 35
without cross-sectional averages = 25
with cross-sectional averages = 9
Number of F(1250, 500) = 1.14
cross-sectional lags 3 to 3 Prob > F = 0.04
variables in mean group regression = 450 R-squared = 0.26
variables partialled out = 800 R-squared (MG) = 0.60
Root MSE = 0.04
CD Statistic = -1.33
p-value = 0.1828
-------------------------------------------------------------------------------
D.lny1| Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+---------------------------------------------------------------
Short Run Est.|
---------------+---------------------------------------------------------------
Mean Group: |
LD.lny1| .0247934 .0480515 0.52 0.606 -.0693857 .1189726
LD.lnx1| -.0290142 .0436854 -0.66 0.507 -.114636 .0566075
LD.lnx2| .011547 .1082887 0.11 0.915 -.200695 .223789
LD.lnx4| .2466661 .1515284 1.63 0.104 -.0503242 .5436563
_cons| -1.092508 3.312723 -0.33 0.742 -7.585326 5.40031
L.lny1| -.7831215 .0569085 -13.76 0.000 -.8946601 -.6715829
lnx1| -.1796518 .074036 -2.43 0.015 -.3247596 -.0345439
lnx2| .2257471 .1218119 1.85 0.064 -.0129998 .4644941
lnx4| .4242121 .1473301 2.88 0.004 .1354503 .7129739
---------------+---------------------------------------------------------------
Long Run Est. |
---------------+---------------------------------------------------------------
Mean Group: |
lr__cons| -.2395862 1.761613 -0.14 0.892 -3.692284 3.213112
lr_lnx1| -.1005612 .0433068 -2.32 0.020 -.1854408 -.0156815
lr_lnx2| .1241079 .0722627 1.72 0.086 -.0175244 .2657402
lr_lnx4| .2068079 .0784353 2.64 0.008 .0530776 .3605382
lr_lny1| -1.783121 .0569085 -31.33 0.000 -1.89466 -1.671583
-------------------------------------------------------------------------------
Mean Group Variables: LD.lny1 LD.lnx1 LD.lnx2 LD.lnx4 _cons
Cross-sectional Averaged Variables: lny1(3) lnx1(3) lnx2(3) lnx4(3)
Long Run Variables: lr__cons lr_lnx1 lr_lnx2 lr_lnx4 lr_lny1
Cointegration variable(s): lr_lny1

. xtcd2
Pesaran (2015) test for weak cross-sectional dependence.
Residuals calculated using predict, residuals from xtdcce2.
H0: errors are weakly cross-sectional dependent.
CD = -1.332
p-value = 0.183

The results above show that the R2 generated by the DCCE estimator is substantially lower than the R2 of the MG estimator. The cross-sectional dependence (CD) statistic reveals that the null hypothesis of weak cross-sectional dependence cannot be rejected. The output also shows that none of the short-run coefficients is statistically significant, while the long-run coefficients for lnx1 and lnx4 are statistically significant at the 5% level (lnx2 only at the 10% level). Special note should be taken of the estimated error correction (EC) coefficient, which is represented in the short-run estimates by L.lny1 (beta = −0.783, p < 0.001). The EC coefficient indicates that each year state appropriations partially adjust to shocks to the long-run, or "equilibrium," relationship with the other variables in the model. It implies that about 22% (1 − 0.783) of any disequilibrium remains in the next time period. On average, it takes about two years for state appropriations to return to their equilibrium relationship with the other variables in this particular model.
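As a rough back-of-envelope check of this interpretation, we can compute the share of a shock that remains after each year directly from the estimated EC coefficient. This is purely illustrative arithmetic, not part of the estimation:

```stata
* Illustrative check of the EC coefficient interpretation
display 1 - 0.7831215      // share of disequilibrium remaining after one year: ~0.22
display (1 - 0.7831215)^2  // share remaining after two years: ~0.05 (about 95% adjusted)
```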
If we run xtdcce2 with the options lr(xtpmg) and exponent, we would
see the following output:
. xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc cr(_all) cr_lags(3 3 3 3)
lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(xtpmg) exponent
(Dynamic) Common Correlated Effects Estimator - Mean Group
Panel Variable (i): newid Number of obs = 1750
Time Variable (t): fy Number of groups = 50
Degrees of freedom per group: Obs per group (T) = 35
without cross-sectional averages = 25
with cross-sectional averages = 9
Number of F(1250, 500) = 1.14
cross-sectional lags 3 to 3 Prob > F = 0.04
variables in mean group regression = 450 R-squared = 0.26
variables partialled out = 800 R-squared (MG) = 0.60
Root MSE = 0.04
CD Statistic = -1.33
p-value = 0.1828
-------------------------------------------------------------------------------
D.lny1| Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+---------------------------------------------------------------
Short Run Est.|
---------------+---------------------------------------------------------------
Mean Group: |
LD.lny1| .0247934 .0480515 0.52 0.606 -.0693857 .1189726
LD.lnx1| -.0290142 .0436854 -0.66 0.507 -.114636 .0566075
LD.lnx2| .011547 .1082887 0.11 0.915 -.200695 .223789
LD.lnx4| .2466661 .1515284 1.63 0.104 -.0503242 .5436563
---------------+---------------------------------------------------------------
Long Run Est. |
---------------+---------------------------------------------------------------
Mean Group: |
ec| -.7831215 .0569085 -13.76 0.000 -.8946601 -.6715829
lnx1| -.555794 .240716 -2.31 0.021 -1.027589 -.0839993
lnx2| .5886828 .3961468 1.49 0.137 -.1877507 1.365116
lnx4| .4912524 .2131553 2.30 0.021 .0734756 .9090292
_cons| -4.048029 4.794766 -0.84 0.399 -13.4456 5.34954
-------------------------------------------------------------------------------
Mean Group Variables: LD.lny1 LD.lnx1 LD.lnx2 LD.lnx4 _cons
Cross-sectional Averaged Variables: lny1(3) lnx1(3) lnx2(3) lnx4(3)
Long Run Variables: ec lnx1 lnx2 lnx4 _cons
Cointegration variable(s): L.lny1

Estimation of Cross-Sectional Exponent (alpha)
--------------------------------------------------------------
variable| alpha Std. Err. [95% Conf. Interval]
---------------+----------------------------------------------
residuals| .1295795 .0096851 .1105971 .1485619
--------------------------------------------------------------
0.5 <= alpha < 1 implies strong cross sectional dependence.

Above, we see the estimated EC coefficient, which has the same value and statistical significance as in the previous output, but it is now labeled ec and located under the mean group estimates of the long-run coefficients. At the end of the output, we also see the estimated cross-sectional exponent (alpha) of the model residuals. The results indicate there is no strong, but perhaps "semi-weak," cross-sectional dependence (Chudik et al. 2011). If we want to see the estimates for the individual states, we include the option showindividual.
. xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc cr(_all) cr_lags(1 3 3 3)
lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(ardl) exponent showin
(Dynamic) Common Correlated Effects Estimator - Mean Group (CS-ARDL)
In the interest of space, only the EC coefficients for the individual states are shown below.

[output cut]
L.lny1| -.788311 .0568437 -13.87 0.000 -.8997226 -.6768995
-------------------------------------------------------------------------------
Individual Results
-------------------------------------------------------------------------------
L.lny1_1| -.5837888 .2024887 -2.88 0.004 -.9806593 -.1869183
L.lny1_2| -.5674347 .2964714 -1.91 0.056 -1.148508 .0136386
L.lny1_3| -.7773012 .6831592 -1.14 0.255 -2.116269 .5616663
L.lny1_4| -1.037638 .223311 -4.65 0.000 -1.475319 -.5999563
L.lny1_5| -1.565602 .4164949 -3.76 0.000 -2.381917 -.749287
L.lny1_6| -.8035877 .2823022 -2.85 0.004 -1.35689 -.2502855
L.lny1_7| -1.39013 .380831 -3.65 0.000 -2.136545 -.6437151
L.lny1_8| -1.018001 .444861 -2.29 0.022 -1.889912 -.1460892
L.lny1_9| -1.422223 .3332329 -4.27 0.000 -2.075347 -.7690981
L.lny1_10| -.8597892 .2950313 -2.91 0.004 -1.43804 -.2815385
L.lny1_11| -.6425055 .1664852 -3.86 0.000 -.9688106 -.3162005
L.lny1_12| -.7691972 .4746639 -1.62 0.105 -1.699521 .1611269
L.lny1_13| -1.058413 .2581337 -4.10 0.000 -1.564346 -.5524802
L.lny1_14| -.3722795 .433068 -0.86 0.390 -1.221077 .4765182
L.lny1_15| -.4163244 .2378935 -1.75 0.080 -.8825871 .0499383
L.lny1_16| -.4805138 .5863073 -0.82 0.412 -1.629655 .6686274
L.lny1_17| -1.087561 .3537436 -3.07 0.002 -1.780886 -.3942362
L.lny1_18| -1.010681 .2113508 -4.78 0.000 -1.424921 -.5964413
L.lny1_19| -.4166191 .2894833 -1.44 0.150 -.983996 .1507577
L.lny1_20| -.4422955 .1922242 -2.30 0.021 -.819048 -.065543
L.lny1_21| -1.649983 .3142504 -5.25 0.000 -2.265903 -1.034064
L.lny1_22| -.3008116 .6223256 -0.48 0.629 -1.520547 .9189241
L.lny1_23| -1.705115 .6926356 -2.46 0.014 -3.062656 -.3475739
L.lny1_24| -1.074856 .2885345 -3.73 0.000 -1.640373 -.5093385
L.lny1_25| -1.009633 .3998482 -2.53 0.012 -1.793321 -.2259449
L.lny1_26| -1.27471 .4477757 -2.85 0.004 -2.152334 -.3970856
L.lny1_27| -.6986542 .3403508 -2.05 0.040 -1.365729 -.0315789
L.lny1_28| -.5784721 .2062265 -2.81 0.005 -.9826686 -.1742756
L.lny1_29| -.8236275 .2014978 -4.09 0.000 -1.218556 -.428699
L.lny1_30| -.2324873 .3557744 -0.65 0.513 -.9297923 .4648177
L.lny1_31| -1.144746 .2940774 -3.89 0.000 -1.721127 -.5683644
L.lny1_32| -.2933085 .1908441 -1.54 0.124 -.667356 .0807389
L.lny1_33| -.5613086 .2601366 -2.16 0.031 -1.071167 -.0514502
L.lny1_34| -.4055969 .2784918 -1.46 0.145 -.9514307 .1402369
L.lny1_35| -.6672424 .3268385 -2.04 0.041 -1.307834 -.0266508
L.lny1_36| -1.076327 .2489671 -4.32 0.000 -1.564294 -.5883605
L.lny1_37| -.0592591 .2728849 -0.22 0.828 -.5941037 .4755855
L.lny1_38| -.5220578 .3219512 -1.62 0.105 -1.153071 .1089549
L.lny1_39| -.303516 .2603321 -1.17 0.244 -.8137575 .2067255
L.lny1_40| -1.462613 .4056265 -3.61 0.000 -2.257627 -.6675999
L.lny1_41| -.8884171 .2206186 -4.03 0.000 -1.320822 -.4560127
L.lny1_42| -.9639504 .2115151 -4.56 0.000 -1.378512 -.5493883
L.lny1_43| -.2876419 .2843551 -1.01 0.312 -.8449677 .2696839
L.lny1_44| -.7934451 .4495056 -1.77 0.078 -1.67446 .0875697
L.lny1_45| -.8049828 .228829 -3.52 0.000 -1.253479 -.3564861
L.lny1_46| -.7944073 .2383817 -3.33 0.001 -1.261627 -.3271877
L.lny1_47| -.6110047 .3717416 -1.64 0.100 -1.339605 .1175955
L.lny1_48| -1.109738 .3377478 -3.29 0.001 -1.771711 -.4477642
L.lny1_49| -.4015632 .3699277 -1.09 0.278 -1.126608 .3234818
L.lny1_50| -.1941924 .2192937 -0.89 0.376 -.6240001 .2356153
-------------------------------------------------------------------------------

[output cut]
Based on the output shown above, we can see that there is a substantial amount of variability across states with respect to the estimated EC coefficients. In most states, the EC coefficient is statistically significant. In states where the EC coefficient lies between −1 and 0, appropriations partially adjust each year to shocks to the long-run relationship with the other variables. In states where the coefficient is below −1, appropriations over-adjust to shocks in the short run, so that they oscillate around, rather than converge smoothly to, the long-run equilibrium.

9.4 Summary

This chapter demonstrated the use of macro panel data and appropriate statistical techniques to examine dynamic relationships between variables when examining state-level higher education policy-oriented issues. The macro panel data are composed of long time series (e.g., >20 years) across many states. These statistical techniques include heterogeneous coefficient regression (HCR) with dynamic common correlated effects (DCCE) and mean group (MG) estimators, which allow for distinguishing between short-run and long-run relationships between variables. These techniques also enable analysts to examine adjustment to shocks to the long-run, or "equilibrium," relationships between policy variables. Finally, this chapter showed that HCR with DCCE and MG estimators also allows for state-specific estimates of short-run, long-run, and EC coefficients.

9.5 Appendix

*Chapter 9 Stata Syntax

*create Fig. 9.1. Trends in Log of Appropriations by State, FY 1980 to FY 2018


twoway (line lny1 fy), by(State) xlabel(1980 (8) 2018, labsize(small)) ///
ytitle(Log of Appropriations) xtitle(Fiscal Year)

*create Fig. 9.2. Trends in Log of GSP by State, FY 1980 to FY 2018


twoway (line lnx4 fy), by(State) xlabel(1980 (8) 2018, labsize(small)) ///
ytitle(Log of GSP) xtitle(Fiscal Year)

*Use the Stata routine xtpurt, with test options proposed by Herwartz and ///
Siedenburg (2008), Demetrescu and Hanck (2012), and ///
Herwartz et al. (2019). In the three test options, the null ///
hypothesis is that the panels (i.e., states) contain non-stationary data ///
or unit roots.

* xtpurt, with test options proposed by Herwartz and Siedenburg (hs)


xtpurt lny1, test(hs)
xtpurt lnx1, test(hs)
xtpurt lnx2, test(hs)
xtpurt lnx4, test(hs)

* xtpurt, with test options proposed by Demetrescu and Hanck (dh)


xtpurt lny1, test(dh)
xtpurt lnx1, test(dh)
xtpurt lnx2, test(dh)
xtpurt lnx4, test(dh)

* xtpurt, with test options proposed by Herwartz, Maxand, and Walle (hmw)
xtpurt lny1, test(hmw) trend
xtpurt lnx1, test(hmw) trend
xtpurt lnx2, test(hmw) trend
xtpurt lnx4, test(hmw) trend

* xtpurt, with all test options with first-differences (D1)


xtpurt D1lny1, test(all)
xtpurt D1lnx1, test(all)
xtpurt D1lnx2, test(all)
xtpurt D1lnx4, test(all)

* xtcointtest - tests for cointegration


*test for no cointegration with and without demeaning the data ///
(first subtracting the cross-sectional averages from the series)
xtcointtest kao lny1 lnx1 lnx2 lnx4
xtcointtest kao lny1 lnx1 lnx2 lnx4, demean
xtcointtest pedroni lny1 lnx1 lnx2 lnx4
xtcointtest pedroni lny1 lnx1 lnx2 lnx4, demean
xtcointtest westerlund lny1 lnx1 lnx2 lnx4
xtcointtest westerlund lny1 lnx1 lnx2 lnx4, demean

*ECM-based cointegration test, developed by Westerlund (2007), that is robust ///
to structural breaks in the intercept and slope of the cointegrated ///
regression, serial correlation, and heteroscedasticity.


xtwest lny1 lnx1 lnx2 lnx4, constant lags(0 3)

*Tests using Stata user-written routine xtcdf (Wursten 2017) for ///
cross-sectional independence, using updated version
ssc install xtcdf, replace
xtcdf lny1 lnx1 lnx2 lnx4

*Tests of homogeneous coefficients utilize the Stata user-written ///
xthst (Ditzen and Bersvendsen 2020) routine
ssc install xthst, replace
xthst D1.lny1 D1.L1.lny1 D1.lnx1 D1.lnx2 D1.lnx4, hac whitening
xthst lny1 L1.lny1 lnx1 lnx2 lnx4, hac whitening

*HCR with DCCE and MG estimators


*using the Stata user-written xtdcce2 (Ditzen 2018b)
search xtdcce2, all
*click on st0536, then install or type:
net install st0536.pkg, replace

*run an autoregressive distributed lag (ARDL) model of (1 1 1) with ///
cross-sectional lags of (3 3 3 3) within an ECM framework
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc ///
cr(_all) cr_lags(3 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(ardl)

*Pesaran (2015) test for weak cross-sectional dependence


xtcd2

*run xtdcce2 with the options lr(xtpmg) and exponent


xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc ///
cr(_all) cr_lags(3 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(xtpmg) exponent

*If we want to see the estimates for the individual states, then we include the ///
option showindividual.
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, ///
reportc cr(_all) cr_lags(1 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) ///
lr_options(ardl) exponent showin

*end

References

Baltagi, B. (2008). Econometric analysis of panel data. John Wiley & Sons.
Blomquist, J., & Westerlund, J. (2013). Testing slope homogeneity in large panels with
serial correlation. Economics Letters, 121 (3), 374–378.
Cheslock, J. J., & Rios-Aguilar, C. (2011). Multilevel analysis in higher education research:
A multidisciplinary approach. In J. Smart & M. B. Paulsen (Eds.), Higher education:
Handbook of theory and research (Vol. 46, pp. 85–123). Springer.
Chudik, A., & Pesaran, M. H. (2015). Common correlated effects estimation of hetero-
geneous dynamic panel data models with weakly exogenous regressors. Journal of
Econometrics, 188 (2), 393–420.
Chudik, A., Pesaran, M. H., & Tosetti, E. (2011). Weak and strong cross-section
dependence and estimation of large panels. The Econometrics Journal, 14 (1), C45–
C90.
Demetrescu, M., & Hanck, C. (2012). A simple nonstationary-volatility robust panel unit
root test. Economics Letters, 117 (1), 10–13.
Ditzen, J. (2016). xtdcce: Estimating Dynamic Common Correlated Effects in Stata. In
SEEC Discussion Papers (No. 1601; SEEC Discussion Papers). Spatial Economics and
Econometrics Centre, Heriot Watt University.
Ditzen, J. (2018a). Cross-country convergence in a general Lotka–Volterra model. Spatial
Economic Analysis, 13 (2), 191–211.
Ditzen, J. (2018b). Estimating dynamic common-correlated effects in Stata. The Stata
Journal, 18 (3), 585–617.
Ditzen, J., & Bersvendsen, T. (2020). XTHST: Stata module to test slope homogeneity
in large panels. In Statistical Software Components. Boston College Department of
Economics.
Engle, R. F., & Granger, C. W. J. (1987). Co-Integration and Error Correction:
Representation, Estimation, and Testing. Econometrica, 55 (2), 251–276.
Herwartz, H., & Siedenburg, F. (2008). Homogenous panel unit root tests under cross sec-
tional dependence: Finite sample modifications and the wild bootstrap. Computational
Statistics & Data Analysis, 53 (1), 137–150.
Herwartz, H., Maxand, S., & Walle, Y. M. (2019). Heteroskedasticity-robust unit root testing for trending panels. Journal of Time Series Analysis, 40 (5), 649–664.
Hildreth, C., & Houck, J. P. (1968). Some estimators for a linear model with random
coefficients. Journal of the American Statistical Association, 63 (322), 584–595.
Hsiao, C. (1975). Some estimation methods for a random coefficient model. Econometrica:
Journal of the Econometric Society, 43 (2), 305–325.
Kapetanios, G., Pesaran, M. H., & Yamagata, T. (2011). Panels with non-stationary
multifactor error structures. Journal of Econometrics, 160 (2), 326–348.
Liddle, B. (2017). Accounting for Nonlinearity, Asymmetry, Heterogeneity, and Cross-
Sectional Dependence in Energy Modeling: US State-Level Panel Analysis. Economies,
5 (3), 30.
Passamani, G., & Tomaselli, M. (2018). Air Pollution and Health Risks: A Statistical
Analysis Aiming at Improving Air Quality in an Alpine Italian Province. In C.
H. Skiadas & C. Skiadas (Eds.), Demography and Health Issues: Population Aging,
Mortality and Data Analysis (pp. 199–216). Springer International Publishing.
Patel, P. C. (2019). Minimum wage and transition of non-employer firms intending to hire
employees into employer firms: State-level evidence from the US. Journal of Business
Venturing Insights, 12, e00136.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a
multifactor error structure. Econometrica, 74 (4), 967–1012.
Pesaran, M. H. (2015a). Testing weak cross-sectional dependence in large panels. Econo-
metric Reviews, 34 (6–10), 1089–1117.
Pesaran, M. H. (2015b). Time Series and Panel Data Econometrics. Oxford University
Press.
Pesaran, M. H., Shin, Y., & Smith, R. P. (1999). Pooled mean group estimation of dynamic
heterogeneous panels. Journal of the American Statistical Association, 94 (446), 621–
634.
Pesaran, M. H., & Smith, R. (1995). Estimating long-run relationships from dynamic
heterogeneous panels. Journal of Econometrics, 68 (1), 79–113.
Pesaran, M. H., Ullah, A., & Yamagata, T. (2008). A bias-adjusted LM test of error cross-
section independence. The Econometrics Journal, 11 (1), 105–127.
Pesaran, M. H., & Yamagata, T. (2008). Testing slope homogeneity in large panels. Journal
of Econometrics, 142 (1), 50–93.
Swamy, P. A. (1970). Efficient inference in a random coefficient regression model.
Econometrica: Journal of the Econometric Society, 311–323.
Wells, R. S., Kolek, E. A., Williams, E. A., & Saunders, D. B. (2015). “How We Know
What We Know”: A Systematic Comparison of Research Methods Employed in Higher
Education Journals, 1996—2000 v. 2006—2010. The Journal of Higher Education,
86 (2), 171–198.
Westerlund, J. (2005). New simple tests for panel cointegration. Econometric Reviews,
24 (3), 297–316.
Westerlund, J. (2007). Testing for error correction in panel data. Oxford Bulletin of
Economics and Statistics, 69 (6), 709–748.
Wursten, J. (2017). XTCDF: Stata module to perform Pesaran’s CD-test for cross-sectional
dependence in panel context. In Statistical Software Components. Boston College
Department of Economics.
Chapter 10
Presenting Analyses to Policymakers

Abstract This chapter discusses and demonstrates how we can prepare analyses for presentation to higher education policymakers. The chapter details how to present descriptive statistics in a user-friendly Microsoft Word document format. It also shows how we can use choropleth maps to illustrate data spatially and demonstrates how graphs and tables of regression results and marginal effects are created.

Keywords Tables · Choropleth maps · Graphs · Marginal effects

10.1 Introduction

The analyses that were discussed and demonstrated in the previous chapters range from simple descriptive statistics to advanced statistical techniques. The consumers of the results of these analyses are also varied and include, but are not limited to, policymakers. Because many analysts target their work toward policymakers, it is necessary to produce policymaker-friendly presentations. Using some of the routines in Stata, this chapter demonstrates how we can accomplish this critical part of higher education policy analysis and evaluation. These routines, commands, and syntax are included in an appendix at the end of the chapter.


10.2 Presenting Descriptive Statistics

Descriptive statistics are the most common form of quantitative analysis presented to higher education policymakers. Although these forms of analysis are rather basic, care should still be taken to present the data to policymakers in a clear way. The presentation of descriptive statistics should highlight key points, such as patterns and trends. This requires the use of tables, charts, and graphs. Tables with lots of variables and data should be avoided. Charts and graphs should be clearly labeled and uncluttered. Below, we demonstrate the use of Stata commands and routines to create presentation-ready tables, charts, and graphs displaying descriptive statistics for policymakers and others.

10.2.1 Descriptive Statistics in Microsoft Word Tables

The Stata user-written module asdoc (Shah 2019) is one of the most
comprehensive routines for creating presentation-ready tables in Microsoft
Word. For the most recent version of asdoc, in Stata, type:
net install asdoc, from(http://fintechprofessor.com) replace
To get a sense of the comprehensive nature of the asdoc module, type:
help asdoc
In this demonstration, we will use data (supplemented with state tax
revenue and personal income data) from the previous chapter. First, we
change our working directory to where we want to save our tables.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"
Then we invoke the sum command for the previously noted variables of interest: state appropriations (y); net tuition revenue (x1); full-time equivalent enrollment (x2); state total personal income (x3); gross state product (x4); and state tax revenue (x5).
. sum y x1 x2 x3 x4 x5
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 1,900 9.73e+08 1.45e+09 1.81e+07 1.57e+10
x1 | 1,900 5.57e+08 7.38e+08 7900000 5.22e+09
x2 | 1,900 178357.1 213093.8 10530 1639923
x3 | 1,900 1.60e+11 2.28e+11 4.02e+09 2.26e+12
x4 | 1,900 1.87e+11 2.73e+11 4.40e+09 2.66e+12
-------------+---------------------------------------------------------
x5 | 1,900 1.70e+10 2.58e+10 4.60e+08 2.43e+11
With the exception of x2, most of the variables have values that are displayed in scientific notation. Therefore, before we can create a presentation-ready table, the data for y, x1, x3, x4, and x5 need to be rescaled to millions. We can either create new rescaled variables by hand or utilize the Stata user-written routine rescale to rescale the variables automatically. To do the latter, type the following:
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace

To rescale y, x1, x3, x4, and x5 into millions, we use rescale with the millions option.
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions
We then rerun the sum command and see the results below.
. sum y x1 x2 x3 x4 x5
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 1,900 973.1793 1452.862 18.1 15692.18
x1 | 1,900 556.8281 738.1183 7.9 5216.492
x2 | 1,900 178357.1 213093.8 10530 1639923
x3 | 1,900 159660.9 228314.1 4015.1 2263890
x4 | 1,900 187043.9 273491.1 4398.6 2657798
-------------+---------------------------------------------------------
x5 | 1,900 16974.89 25750.39 459.909 243082.1

Above, we see the values are no longer in scientific notation, but they are still not what we typically present to policymakers and other users. Each variable should also be normalized in a way that makes it comparable across states and over time. For example, state appropriations (y) are divided by either population or FTE enrollment. Net tuition revenue (x1) should be divided by full-time equivalent (FTE) enrollment. State total personal income (x3), gross state product (x4), and state tax revenue (x5) should be divided by population. Tandberg and Griffith (2013) suggest that state appropriations per capita is a measure of adequacy or effort which is easily understood by policymakers and the general public. However, they also caution that the measure is limited in that larger-population states are not always higher-income states.
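Although the creation of the normalized variables used below (y_pop, x1fte, x3_pop, x4_pop, x5_pop) is not shown, they can be generated along the following lines. This is an illustrative sketch: the name of the state population variable (pop) is an assumption, and the dollar variables are in millions after rescaling.

```stata
* Illustrative sketch; the population variable name (pop) is an assumption
gen y_pop  = (y*1000000)/pop    // state appropriations per capita
gen x1fte  = (x1*1000000)/x2    // net tuition revenue per FTE enrollment
gen x3_pop = (x3*1000000)/pop   // state personal income per capita
gen x4_pop = (x4*1000000)/pop   // gross state product per capita
gen x5_pop = (x5*1000000)/pop   // state tax revenue per capita
```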
In most cases, users would like to see one or two statistics, the mean and
the median. Generally, we also do not want to include decimal places, but
do want to include commas (format(%9.0fc)). We use the Stata command
tabstat (for documentation, type help tabstat), with the options below, to
produce the following:
. tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) column(statistics) format(%9.0fc)
variable | mean p50
-------------+--------------------
y_pop | 201 189
x1fte | 4,178 3,479
x3_pop | 32,641 31,754
x4_pop | 38,230 37,070
x5_pop | 3,475 3,269
----------------------------------
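As an aside, the effect of the %9.0fc format used above (round to whole numbers, insert commas) can be previewed directly in Stata:

```stata
* Preview of the %9.0fc display format
display %9.0fc 4177.586
```

which prints 4,178.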

We combine the use of asdoc and tabstat to make this presentation-ready with the necessary table titles and variable labels. To ensure that we are replacing any existing tables with the same name, we include the option replace. Because we want to show variable labels with long names, we also include the option abb(.).
. asdoc tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median)
column(statistics) format(%9.0fc) dec(0) long title(Table 10.1 Descriptive
Statistics) save(Table 10.1.doc) replace label abb(.)

| mean p50
-------------+----------------------
y_pop | 201.0187 189.0372
x1fte | 4177.586 3479.234
x3_pop | 32641.16 31753.6
x4_pop | 38230.36 37069.8
x5_pop | 3474.925 3268.786
(note: file Table 10.1.doc not found)
Click to Open File: Table 10.1.doc
When we click to open the file, we see Table 10.1.


Many state higher education policymakers, however, may be interested
in how their particular state compares to other states on several different
indicators or metrics. For example, state policymakers in Maryland may want
to know how their state compares to the rest of the nation. The policy analyst
may have to create a categorical variable representing Maryland (MD) and
create a table.

Table 10.1 Descriptive statistics
                                         Mean    Median
State appropriations per capita 201 189
Net tuition revenue per FTE enrollment 4,178 3,479
State personal income per capita 32,641 31,754
Gross state product per capita 38,230 37,070
State tax revenue per capita 3,475 3,269
Note: The current version of asdoc does not produce tables with numbers that include commas in Word, so we added the commas to the numbers in the table afterward.
gen MD=0
lab var MD "Comparisons"
replace MD=1 if fips==24
label define MD1 1 Maryland 0 "All Other States"
label values MD MD1
It is also useful to create a categorical variable that reflects different time
periods. In this example, we create a variable decade and code and label it
accordingly.
gen decade=0
lab var decade "Decades"
replace decade=1 if fy>=1980 & fy<=1989
replace decade=2 if fy>=1990 & fy<=1999
replace decade=3 if fy>=2000 & fy<=2009
replace decade=4 if fy>=2010 & fy<=2018
label define decade1 1 "1980 to 1989" 2 "1990 to 1999" 3 "2000 to 2009" 4 "2010 to 2018"
label values decade decade1

We create a Microsoft Word table comparing Maryland to the rest of the nation.
. asdoc table decade MD, contents(mean y_pop) format(%9.0fc) dec(0)
title(Table 10.2 Average State Appropriations per Population)
save(Table 10.2.doc) replace label abb(.)
-------------------------------------------------
| Comparisons
Decades | All Other States Maryland
-------------+-----------------------------------
1980 to 1989 | 116.3673 106.0188
1990 to 1999 | 171.4677 154.3371
2000 to 2009 | 238.3042 235.7748
2010 to 2018 | 265.0888 294.5185
-------------------------------------------------
Click to Open File: Table 10.2.doc

We click on Table 10.2.doc, adjust the column widths, bold some of the
text and numbers, and see the following:
Table 10.2 enables policymakers and other users to easily compare
Maryland’s state appropriations to all other states in the U.S. during different
time periods. Policy analysts can modify the Stata syntax code to create
different time periods and comparison groups in similar Microsoft Word
tables.

Table 10.2 Average state appropriations per population

             Comparisons
Decades      All Other States   Maryland
1980–1989    116                106
1990–1999    171                154
2000–2009    238                236
2010–2018    265                295
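The decade-by-group summary produced above can also be sketched outside Stata; the following is a minimal pandas version with made-up rows, where fy, fips, and y_pop are column names mirroring the Stata variables.

```python
import pandas as pd

# Hypothetical mini-panel mirroring the Stata variables: fy (fiscal year),
# fips (state FIPS code), y_pop (state appropriations per capita).
df = pd.DataFrame({
    "fy":    [1985, 1985, 1995, 1995, 2005, 2005, 2015, 2015],
    "fips":  [24, 6, 24, 6, 24, 6, 24, 6],
    "y_pop": [106, 116, 154, 171, 236, 238, 295, 265],
})

# Flag Maryland (FIPS 24) versus all other states, as in the Stata recode.
df["MD"] = (df["fips"] == 24).map({True: "Maryland", False: "All Other States"})

# Bin fiscal years into decades, the analogue of the `decade` recode.
bins = [1979, 1989, 1999, 2009, 2018]
labels = ["1980 to 1989", "1990 to 1999", "2000 to 2009", "2010 to 2018"]
df["decade"] = pd.cut(df["fy"], bins=bins, labels=labels)

# Decade-by-group mean, the analogue of `table decade MD, contents(mean y_pop)`.
tab = df.pivot_table(values="y_pop", index="decade", columns="MD",
                     aggfunc="mean", observed=True)
print(tab.round(0))
```

The same pivot generalizes to other groupings by changing the bins or the grouping column.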
212 10 Presenting Analyses to Policymakers

10.3 Choropleth Maps

Policymakers and other users may also be interested in how key indicators
or metrics look across individual states. A table with statistics for even one
variable across 50 states, however, would not be easy to read. The information
is conveyed more effectively when key variables for a specific time period are
displayed by state in a choropleth map (i.e., a thematic map shaded according
to the values of a variable). This, however, requires a number of steps in
Stata. These steps are shown below.
1. Create a subdirectory “Map” in the current working directory.
2. Change the working directory to the Map subdirectory.
3. Install the Stata user-written map creation module, maptile (Stepner 2017).
maptile_install using "http://files.michaelstepner.com/geo_state.zip", replace

4. Install the Stata user-written program spmap (Pisati 2018).1


ssc install spmap
5. Install the Stata user-written program statastates (Schpero 2018) and
run statastates to add U.S. state identifiers (abbreviation, FIPS code,
and name).
ssc install statastates
statastates, name(state)
6. Create new variables (statefips and statename)
gen statefips = state_fips
gen statename = state
7. Create a choropleth map showing the values of one variable in one year or
change between two time periods. For a map showing state appropriations
per capita by state for fiscal year 2017, we use the Stata-user written
module maptile (Stepner 2017) with the options below. (For a complete
description of maptile options, type help maptile.)
maptile y_pop if fy==2017, geo(state) geoid(statefips) nquantiles(5)
rangecolor(gray*0.075 gray*1.0) legd(0) twopt(title("State
Appropriations per Capita, 2017" "(in dollars)"))

This creates the following map (Fig. 10.1).

1 For a basic description and demonstration of maptile, see https://files.michaelstepner.com/maptile%20slides%202015-03%20_handout.pdf.

Fig. 10.1 Map of appropriations per capita by state for fiscal year 2017

Policymakers are also interested in change over time in state appropriations
per FTE enrollment, which could also be illustrated in a map. Using a dataset
with the appropriate variable representing a percent change between two time
periods (pctchnge), this is created by the following syntax (Fig. 10.2):
use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data\Data10
- State appropriations per FTE enrollment (1980 and 2017).dta", clear
maptile pctchnge , geo(state) geoid(statefips) rangecolor (gray*0.01 gray*1.2)
nq(7) legd(0) twopt(title("Percent Change in State Appropriations
per FTE Enrollment" "Between FY 2009 & FY 2017"))

Using the options in maptile, the maps can be displayed using different
titles, legends, and color schemes.

Fig. 10.2 Percent change in state appropriations per FTE enrollment between FY 2009
and FY 2017

10.4 Graphs

Graphs are also very useful tools to convey information to policymakers and
other users of data. They should, however, be simple and uncluttered. Among
the most informative graphs are simple line charts showing variables over
time. Using the Stata user-written module lgraph (Mak 2015), we can show
state appropriations per population for Maryland and the rest of the nation
over time (Fig. 10.3).
ssc install lgraph, replace
lgraph y_pop fy, nom by(MD) xlabel(1980(3)2018) bw title("State Appropriations
Per Population" "FY 1980-2018") ytitle(Dollars) legend(pos(12) col(2))

We should also show a state of interest compared to the rest of the states
within that state’s region or academic common market. In the following
example, we demonstrate how to create a graph with the appropriate labels
and titles showing Maryland state appropriations per FTE compared to other
states within the Southern Regional Education Board (SREB) (Fig. 10.4).
label define MDSREB1 0 "All Other SREB States" 1 "Maryland"
label values MDSREB MDSREB1
lgraph yfte fy if region_compact==1 , nom by(MDSREB)
xlabel(1980(2)2018, labsize(vsmall)) bw title("State Appropriations
Per FTE" "FY 1980-2018") ytitle(Dollars) legend(pos(12) col(2))

Graphs can also be used to depict changes in the trend of an outcome of
interest before and after a substantive change in state higher education
policy at a particular point in time. Using the difference-in-differences (DiD)

Fig. 10.3 State appropriations per capita in Maryland and all other states, FY 1980–FY
2018

Fig. 10.4 State appropriations per capita in Maryland and all other SREB states, FY
1980 to FY 2018

example from Chap. 7, we can create a graph that depicts when Colorado
enacted Senate Bill 189 (SB 04-189) to establish the College Opportunity

Fig. 10.5 Colorado net tuition revenue per FTE before and after SB 189 and all other
states

Fund (COF) program. We are able to see trends in Colorado’s net tuition
revenue before and after the enactment of SB 04-189, compared to net tuition
revenue in all other states during the same time period.
global y "netuit_fte"
lgraph $y year, by(T) stat(mean) xline(2005) xlabel(1990(2)2016,
labsize(small)) ylab(, nogrid) scheme(s2mono) bw title("Colorado's Net
Tuition Revenue Per FTE" "Before and After Colorado Senate Bill 189")
ytitle(Dollars) legend(pos(12) col(2))

While Fig. 10.5 shows how Colorado compares to all other states, another
graph could show the Western Interstate Commission for Higher Education
(WICHE) states as a control group (Fig. 10.6).
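The DiD logic behind these graphs reduces to a difference of two changes; a minimal numeric sketch with made-up group means (not the book's data):

```python
# Illustrative difference-in-differences on made-up group means:
# treated = Colorado, control = all other (or WICHE) states.
pre_treat, post_treat = 4000.0, 5500.0   # mean net tuition per FTE, Colorado
pre_ctrl,  post_ctrl  = 4200.0, 5000.0   # mean net tuition per FTE, controls

# DiD estimate = (change in treated) - (change in controls)
did = (post_treat - pre_treat) - (post_ctrl - pre_ctrl)
print(did)  # 700.0: Colorado's change beyond the controls' trend
```

This is exactly what the eye does when comparing the two lines before and after the vertical policy line in the graph.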

10.4.1 Graphs of Regression Results

It is also useful to show graphs of regression results in a simple, clear way. One
of the easiest and most flexible ways to do this is to use the Stata user-written
module, coefplot (Jann 2019a). (To download the most recent version of
coefplot in Stata, type ssc install coefplot, replace). We demonstrate
the use of this routine within a broader context of providing information to
policymakers and other users.

Fig. 10.6 Colorado net tuition revenue per FTE before and after SB 189 and all other
WICHE states

We start with a question that state higher education policymakers may
want to ask: on average, what is the short-run relationship between changes
in net tuition revenue and state appropriations? How analysts answer this
question may depend on the assumptions they make with regard to the
statistical techniques that are employed. How the results are presented,
however, should be based on the audience. If the audience includes
policymakers and others who are less interested in the statistical methods
and the assumptions of those methods and more interested in the results,
then a simple graph may suffice. An analyst may choose to
employ pooled ordinary least squares (OLS) or a more advanced statistical
technique such as heterogeneous coefficient regression (HCR) with dynamic
common correlated estimation (DCCE) and mean group (MG) estimators.
However, she or he should display results in a simple and clear manner
for policymakers and other users who may or may not be familiar with
the technique employed. To demonstrate this, we provide examples that are
based on regression models ranging from pooled OLS regression to HCR
with DCCE and MG estimators. In these examples, we use macro panel
data spanning 38 years across 50 states. In each of the examples, all the
variables are log transformed for easier interpretation of the results. Because
we are interested in the short-run relationship, the first-difference of net
tuition revenue (lnnetut) is regressed on lag of the first-difference of state
appropriations (lnstateap), first-difference of full-time equivalent students

(lnfte), and first-difference of state personal income (lnperinc). To address
the possibility of reverse causation, the regressors are lagged by 1 year.
The first example is based on a pooled OLS regression model.
. reg D1.lnnetut L1.D1.lnstateap L1.D1.lnfte L1.D1.lnperinc
Source | SS df MS Number of obs = 1,900
-------------+---------------------------------- F(3, 1896) = 29.15
Model | .585187777 3 .195062592 Prob > F = 0.0000
Residual | 12.6853048 1,896 .006690562 R-squared = 0.0441
-------------+---------------------------------- Adj R-squared = 0.0426
Total | 13.2704925 1,899 .006988148 Root MSE = .0818
------------------------------------------------------------------------------
D.lnnetut | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnstateap |
LD. | -.036019 .0283188 -1.27 0.204 -.0915582 .0195202
|
lnfte |
LD. | .49001 .0526422 9.31 0.000 .3867673 .5932527
|
lnperinc |
LD. | .0927009 .0633669 1.46 0.144 -.0315753 .2169771
|
_cons | .0621964 .0038032 16.35 0.000 .0547375 .0696553
------------------------------------------------------------------------------

Based on recommendations by Jann (2014), we use Stata’s mata syntax to
extract the estimated coefficients from the matrix produced by the regression
models:2
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \st_matrix("e(b)") :+ 2))
After slightly modifying the coefplot syntax provided by Jann, we create a
graph of the coefficients from the OLS regression results above.
coefplot, xline(0) drop(_cons) mlabel format(%9.2g) mlabposition(0) msymbol(i)
ciopts(recast(. rbar) barwidt(. 0.35) fcolor(. white) lwidth(. medium))
rescale(10) levels(95 99) coeflabels(LD.lnstateap = "State
Appropriations" LD.lnfte = "FTE Enrollment" LD.lnperinc = "State
Personal Income") ytitle(10 Percent Change in . . .) xtitle(Change in Net
Tuition Revenue)

Figure 10.7 resembles a box chart, but it is actually a bar chart. It shows
the independent variables on the vertical axis and the change (scaled up by
10) in the dependent variable on the horizontal axis. The bars (which reflect
95% confidence intervals) that touch the zero line indicate the regression
coefficients of those particular independent variables are not significantly
different from zero. (The lines extending from each of the bars reflect a 99%
confidence interval). We can see from Fig. 10.7 that the bar representing
state appropriations touches the zero line. Therefore, we can easily show and
explain the results from the pooled OLS regression in the figure above.
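The "bar touches zero" reading corresponds to a simple confidence-interval check. The sketch below uses the coefficients and standard errors from the pooled OLS output above with a normal (1.96) approximation to the 95% interval:

```python
# Confidence-interval check behind the "bar touches zero" reading,
# using (coefficient, standard error) pairs from the pooled OLS output.
coefs = {
    "State Appropriations":  (-0.036019, 0.0283188),
    "FTE Enrollment":        (0.49001,   0.0526422),
    "State Personal Income": (0.0927009, 0.0633669),
}

results = {}
for name, (b, se) in coefs.items():
    lo, hi = b - 1.96 * se, b + 1.96 * se   # approximate 95% CI
    results[name] = not (lo <= 0.0 <= hi)   # CI excluding zero => significant
    print(f"{name}: [{lo:.3f}, {hi:.3f}] significant={results[name]}")
```

Only the FTE enrollment interval excludes zero, matching the bars that do not touch the zero line in Fig. 10.7.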

2 Mata is a programming language. For a complete description of mata, see Mata Reference
Manual Release 16: https://www.stata.com/manuals/m.pdf.



Fig. 10.7 Pct. change in appropriations, FTE and personal income due to a Pct. change in
net tuition revenue

But what if we use the Stata user-written routine xtmg (Eberhardt 2013)
which allows us to relax the OLS assumptions of homogeneous coefficients and
cross-sectional independence when using panel data? We will invoke xtmg to
run the regression model using Common Correlated Effects and Mean Group
(CCEMG) estimators. (To install the most recent version of xtmg in Stata,
type ssc install xtmg, replace). The CCE estimator takes into account
cross-sectional dependence. The MG estimator produces state-specific model
beta coefficients, which are averaged across the panel.
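The mean group idea can be sketched in miniature; the following is a simplified example (omitting the CCE cross-sectional averages) that fits a slope per state and then averages the state-specific slopes, with made-up heterogeneous coefficients:

```python
import numpy as np

# Mean Group idea in miniature: fit a slope per state, then take the
# unweighted mean of the state-specific slopes (CCE augmentation omitted).
rng = np.random.default_rng(0)
true_slopes = [0.5, 1.0, 1.5]          # heterogeneous coefficients by state
state_slopes = []
for b in true_slopes:
    x = rng.normal(size=200)
    y = b * x + rng.normal(scale=0.1, size=200)
    # OLS slope for this state: cov(x, y) / var(x)
    state_slopes.append(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

mg_estimate = np.mean(state_slopes)    # close to the average of the true slopes
print(mg_estimate)
```

The averaged estimate recovers the mean of the underlying heterogeneous slopes rather than forcing one common coefficient on every state.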
. xtmg Dlnnetut LDlnstateap LDlnfte LDlnperinc, cce
Pesaran (2006) Common Correlated Effects Mean Group estimator
All coefficients present represent averages across groups (newid)
Coefficient averages computed as unweighted means
Mean Group type estimation Number of obs = 1,900
Group variable: newid Number of groups = 50
Obs per group:
min = 38
avg = 38.0
max = 38
Wald chi2(3) = 26.72
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------------
Dlnnetut | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------------+----------------------------------------------------------------
LDlnstateap | -.0097169 .0411761 -0.24 0.813 -.0904206 .0709869
LDlnfte | .1521267 .0761399 2.00 0.046 .0028952 .3013583
LDlnperinc | -.466823 .1204983 -3.87 0.000 -.7029952 -.2306507
__00000M_Dlnnetut | .9814087 .1471938 6.67 0.000 .6929142 1.269903

__00000L_LDlnstateap | .0389984 .0805769 0.48 0.628 -.1189294 .1969263


__00000L_LDlnfte | -.1537738 .2039478 -0.75 0.451 -.553504 .2459565
__00000L_LDlnperinc | .4572597 .1699901 2.69 0.007 .1240854 .7904341
_cons | .0011686 .0062927 0.19 0.853 -.0111648 .013502
--------------------------------------------------------------------------------------
Root Mean Squared Error (sigma): 0.0676
Cross-section averaged regressors are marked by the suffix:
_Dlnnetut, _LDlnstateap, _LDlnfte, _LDlnperinc respectively.

We repeat the Stata syntax to extract the estimated coefficients from the
matrix produced by the regression model with the CCEMG estimator:
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \st_matrix("e(b)") :+ 2))

Then we modify the coefplot syntax to include the variables of interest
from the regression model with the CCEMG estimator. We also change the
orientation from horizontal to vertical, add titles, and bold the text we want
to bring attention to in the graph.3
coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) mlabel format(%9.2g)
mlabposition(0) msymbol(i) ciopts(recast(. rbar) barwidt(. 0.35)
fcolor(. white) lwidth(. medium)) rescale(10) levels(95 99)
coeflabels(LDlnstateap = "{bf:State Appropriations}" LDlnfte = "FTE
Enrollment" LDlnperinc = "State Personal Income", labsize(medium))
vertical title("Short-Run Change in {bf:Net Tuition Revenue}
Due to a 10% Change in" "{bf:State Appropriations}
(controlling for other factors)", size(medium) margin(small)
justification(center))

The graph is shown below.
We see in Fig. 10.8 that, with titles and bolded text, the user is directed to
the areas of the graph containing the information most relevant to the results
from the regression model with CCEMG.
We can also graph the results from an even more advanced model: the
HCR model with DCCE and MG estimators and a first-order autoregressive
distributed lag (ARDL) of each of the variables. We quietly (qui) run the
model to suppress the output.
qui xtdcce2 Dlnnetut L1.Dlnnetut LDlnstateap LDlnfte LDlnperinc, reportc
cr(_all) cr_lags(3 3 3 3) lr(L1.Dlnnetut LDlnstateap LDlnfte
LDlnperinc) lr_options(ardl)
We then reenter the following syntax.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

Then we reenter the coefplot syntax from above [not shown here]. The
result is the graph shown below.
We see from Fig. 10.9 that the results are the same: the short-run change
in net tuition revenue is not statistically significant as a
3 For a complete description and examples of the options for coefplot, see Jann, B. (2019,

May 28). coefplot—Plotting regression coefficients and other estimates in Stata. http://
repec.sowi.unibe.ch/stata/coefplot/getting-started.html.

Fig. 10.8 Pct. change in appropriations, FTE and personal income due to a Pct. change in
net tuition revenue

result of a 10% change in state appropriations, controlling for other variables.


The results presented in Fig. 10.9, however, are based on the HCR model with
DCCE and MG estimators. This model takes into account nonstationary
data, heterogeneous coefficients, and cross-sectional dependence (details that
policymakers and other users are likely uninterested in but that may be of
interest to policy researchers).

10.5 Marginal Effects (with Continuous Variables) and Graphs

Marginal effects and graphs are another way to present the results of
regression models to policymakers and other users. Combined with most
regression models that are composed of continuous variables, the Stata
commands margins and coefplot provide a way to carry this out. This
section will discuss and demonstrate the use of these very flexible commands
as a way to provide information to policymakers.
Marginal effects are the changes in the dependent variable due to changes
in a specific continuous independent variable, holding all other independent
variables constant. For an outcome y and an independent variable x, the
marginal effect is the change (Δ) in y with respect to a change in x, i.e., the
partial derivative (Δy/Δx) of the function y = f(x).

Fig. 10.9 Pct. change in appropriations, FTE and personal income due to a Pct. change in
net tuition revenue

Using calculus, the derivative provides the rate of change over a very small interval that is approaching
zero. The average derivative computed across all observations is the average
marginal effect (AME). The marginal effect at the average (MEA) is the
derivative at the average of variables. Marginal effects, based on the results
of a regression model, are predictions. After running a regression model,
marginal effects can show the change in the dependent variable, given a
change in an independent variable (holding all other independent variables
at some constant level). Marginal effects also allow analysts to look at the
percent change of the dependent variable given a change in a particular
independent variable, holding other independent variables at their median,
various percentiles, or at other specified levels.
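The AME/MEA distinction can be sketched numerically. The toy nonlinear function below (not one of the book's models) shows why the two averages differ when the marginal effect varies with x:

```python
import numpy as np

# AME vs. MEA for a toy nonlinear model y = f(x) = exp(0.1 * x):
# dy/dx varies with x, so where you average matters.
def f(x):
    return np.exp(0.1 * x)

x = np.array([1.0, 5.0, 10.0, 20.0])   # made-up observations
eps = 1e-6
me = (f(x + eps) - f(x - eps)) / (2 * eps)   # numerical dy/dx at each obs

ame = me.mean()                              # average marginal effect
mea = (f(x.mean() + eps) - f(x.mean() - eps)) / (2 * eps)  # effect at the mean of x
print(ame, mea)                              # the two differ for nonlinear f
```

For a purely linear model the two coincide; the gap here comes entirely from the curvature of f.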
Stata’s margins command can also be used to estimate elasticity. Elasticity
is the percent change in a dependent variable, given a 1% change in an
independent variable, holding all other variables at some constant level.
This concept is helpful when variables are measured using
different metrics. Consequently, marginal effects are useful when interpreting
regression results for policymakers and other nontechnical audiences. Utilizing
regression models, the concept of marginal effects is demonstrated below.
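Before the Stata demonstration, the elasticity arithmetic itself can be sketched for a linear model with made-up coefficients:

```python
# Elasticity (Stata's eyex) for a linear model y = a + b*x, evaluated at x0:
# eyex = b * x0 / y(x0), the percent change in y for a 1% change in x.
a, b = 100.0, 2.0          # made-up intercept and slope
x0 = 50.0                  # evaluation point (e.g., the mean of x)
y0 = a + b * x0            # fitted value at x0
eyex = b * x0 / y0
print(eyex)                # 0.5: a 1% rise in x raises y by about 0.5%
```

Because y0 appears in the denominator, the elasticity depends on where it is evaluated, which is why the chapter computes it at the mean, the median, and other percentiles.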
Suppose a state legislator wants to find out how the number of adminis-
trators (executives and managers) in public higher education changes with
state funding of public colleges and universities. Using data and Stata, we
demonstrate how a higher education policy analyst could approach this.

We change our working directory.


. cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data"

C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data

Then we open a file with the relevant Stata file, based on IPEDS data.
. use "Example 10.dta", clear

Upon careful inspection of the data, we see that the dataset spans 46 states
and 13 years (2000–2012) with a 1-year gap (2001 is missing). So, we drop
the first year (2000). It is preferable to have no yearly gaps in the data
when we are including 1-year lags of independent variables in our regression
models.
drop if year==2000
(46 observations deleted)
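The reason gaps matter when lagging can be sketched with pandas. The hypothetical mini-panel below shows how a naive one-period shift silently pairs 2002 with 2000 when 2001 is missing:

```python
import pandas as pd

# Hypothetical mini-panel (one state) with 2001 missing.
df = pd.DataFrame({"state": ["MD"] * 3,
                   "year": [2000, 2002, 2003],
                   "x": [10.0, 12.0, 13.0]})
df = df.sort_values(["state", "year"])

# Naive lag: shift within state regardless of the year spacing.
df["x_lag_naive"] = df.groupby("state")["x"].shift(1)

# Safe lag: keep the shifted value only when it is exactly 1 year back.
year_gap = df.groupby("state")["year"].diff()
df["x_lag"] = df["x_lag_naive"].where(year_gap == 1)
print(df)
```

The naive lag assigns the 2000 value to 2002 (a 2-year lag in disguise); the gap-aware version leaves it missing, which is the effect Stata's time-series L1. operator achieves once the panel is tsset with no gaps.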

In this analysis, the dependent variable, administrators (adminstaff),
is measured by the total number of executive and managerial employees.
The main independent variable is net tuition revenue (net_tuition_rev).
The other independent variables include state appropriations (state_appro),
revenue from the federal government (fedrev_r), and full-time equivalent
students (FTE_enroll). We will use global macro names to reduce the
number of keystrokes.
. global y "adminstaff"
. global x1 "net_tuition_rev_adj"
. global x2 "state_appro_adj"
. global x3 "fedrev_r"
. global x4 "FTE_enroll"

Descriptive statistics [not shown] indicate the data are highly skewed.
Because prior testing [not shown] revealed serial correlation and
cross-sectional dependence among the variables we plan to use, we employ a
pooled OLS regression model with Driscoll-Kraay (D-K) standard errors.
Because we want to avoid reverse causation, we lag the independent variables
by 1 year in the regression model.
. xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
Regression with Driscoll-Kraay standard errors Number of obs = 460
Method: Pooled OLS Number of groups = 46
Group variable (i): id F( 4, 9) = 184.89
maximum lag: 2 Prob > F = 0.0000
R-squared = 0.8083
Root MSE = 867.9764
-------------------------------------------------------------------------------------
| Drisc/Kraay
adminstaff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | 1.21e-06 1.24e-07 9.81 0.000 9.34e-07 1.49e-06
|
state_appro_adj |
L1. | 2.47e-07 1.27e-07 1.95 0.083 -3.99e-08 5.34e-07

|
fedrev_r |
L1. | -1.67e-07 1.62e-07 -1.03 0.330 -5.35e-07 2.00e-07
|
FTE_enroll |
L1. | .0025754 .0014926 1.73 0.119 -.000801 .0059518
|
_cons | 136.6033 74.6967 1.83 0.101 -32.37241 305.5789
-------------------------------------------------------------------------------------

Then we use margins to calculate the average marginal effects (AME).


. margins, dydx(L1.$x1 L1.$x2 L1.$x3 L1.$x4)
Average marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
dy/dx w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
-------------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | 1.21e-06 1.24e-07 9.81 0.000 9.72e-07 1.46e-06
|
state_appro_adj |
L1. | 2.47e-07 1.27e-07 1.95 0.051 -1.59e-09 4.95e-07
|
fedrev_r |
L1. | -1.67e-07 1.62e-07 -1.03 0.303 -4.86e-07 1.51e-07
|
FTE_enroll |
L1. | .0025754 .0014926 1.73 0.084 -.00035 .0055007
-------------------------------------------------------------------------------------

Given the very large numbers, the AME are difficult to interpret. So we
calculate elasticities instead, using the option eyex.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4)
could not calculate numerical derivatives -- discontinuous region with missing
values
encountered
r(459);

This clearly does not work! Why? The “average” elasticity cannot be
calculated for any of the independent variables. Instead, we should try to
calculate the elasticities of each of the variables at their average or mean levels.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((mean) _all)
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 8.99e+08 (mean)
L.state_apj = 1.11e+09 (mean)
L.fedrev_r = 8.26e+08 (mean)
L.FTE_enroll = 200801.8 (mean)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .5804154 .0613005 9.47 0.000 .4602685 .7005622

|
state_appro_adj |
L1. | .1456065 .0764509 1.90 0.057 -.0042345 .2954474
|
fedrev_r |
L1. | -.0734286 .0703213 -1.04 0.296 -.2112558 .0643986
|
FTE_enroll |
L1. | .2748182 .1556239 1.77 0.077 -.0301991 .5798356
-------------------------------------------------------------------------------------

Because we know that the data are highly skewed, we should also calculate
elasticities for variables at the median rather than the mean to see if the
results are substantially different.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at(( median) _all)
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 6.07e+08 ( median)
L.state_apj = 7.86e+08 ( median)
L.fedrev_r = 5.95e+08 ( median)
L.FTE_enroll = 150336 ( median)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .5436902 .067622 8.04 0.000 .4111535 .6762268
|
state_appro_adj |
L1. | .1432976 .0776547 1.85 0.065 -.0089028 .2954979
|
fedrev_r |
L1. | -.0735113 .0701083 -1.05 0.294 -.210921 .0638984
|
FTE_enroll |
L1. | .2857197 .1582156 1.81 0.071 -.0243771 .5958164
-------------------------------------------------------------------------------------

We see that at the median, only the change in net tuition revenue has a
statistically significant effect on the change in the number of administrators.
The results suggest that a 1% increase in net tuition revenue contributes to
a 0.54% increase in administrators at public colleges and universities. This
is only slightly less than the 0.58% increase at the mean.

10.5.1 Marginal Effects (Elasticities) and Graphs

Next, we should display these results in a graph similar to Fig. 10.9. To do so,
we save the marginal effects (in terms of elasticities) by including the option
post in the following syntax.
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all) post

We enter the following syntax.


mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

We then modify the coefplot syntax to produce the graph with the
relevant titles.
coefplot, xline(0) keep(L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r
L.FTE_enroll) mlabel format(%9.2g) mlabposition(0) msymbol(i)
ciopts(recast(. rbar) barwidt(. 0.35) fcolor(. white)
lwidth(. medium)) levels(95 99) coeflabels(L.net_tuition_rev_adj
= "{bf:Net Tuition Revenue}" L.state_appro_adj = "State
Appropriations" L.fedrev_r = "Federal Revenue" L.FTE_enroll
= "FTE Enrollment") title("Percent Change in {bf:Administrators}
Due to a 1% Change in" "{bf:Net Tuition Revenue} (controlling for
other factors)", size(medium) margin(small) justification
(center))

Figure 10.10 can be displayed by analysts to clearly explain the results of
an advanced regression model. Using this figure, we can easily show that a
0.5% increase in administrators is brought about by a 1% increase in net
tuition revenue (with net tuition revenue and all other variables at their
median levels). We may want to:
1. show the percent change (rescaled by 10) on the vertical axis;
2. show the independent variables on the horizontal axis; and
3. create custom legends with regard to the significance of the independent
variables (this may require additional explanation to a lay audience).

Fig. 10.10 Pct. change in administrators due to a Pct. change in net tuition revenue

To carry out steps 1 through 3, we enter a very long line of syntax that
produces the graph below.
coefplot (., keep(L.net_tuition_rev_adj) color(black))
(., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r)
color(gray)) (., keep (L.FTE_enroll) color(gray)), legend(on) xline(0)
nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8)
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}"
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal
Revenue" L.FTE_enroll = "FTE Enrollment", labsize(small)) title("Percent
Change in {bf:Administrators} Due to a 10% Change in" "{bf:Net Tuition
Revenue} (controlling for other factors)", size(medium) margin(small)
justification(center)) addplot(scatter @b @at, ms(i) mlabel(@b)
mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) rescale(10)
p2(nokey) p3(nokey) p1(label("Different from Zero")) p4(label("Ignore -
not different from zero")) ytitle(Percent) xtitle("At the Median",
size(small))

The graph looks like this (Fig. 10.11).


Using the same steps above, a graph can also be created to show the
change in administrators with respect to the change at the 25th percentile of
net tuition revenue (and other variables). This is accomplished by replacing
“median” with “p25” in the margins syntax as follows:
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p25) _all) post
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll

Fig. 10.11 Pct. change in administrators due to a Pct. change in net tuition revenue

at : L.net_tuitj = 2.57e+08 (p25)


L.state_apj = 3.70e+08 (p25)
L.fedrev_r = 2.34e+08 (p25)
L.FTE_enroll = 67108.5 (p25)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .4636343 .0776525 5.97 0.000 .3114382 .6158304
|
state_appro_adj |
L1. | .1355566 .077485 1.75 0.080 -.0163113 .2874245
|
fedrev_r |
L1. | -.0581332 .0555159 -1.05 0.295 -.1669424 .0506759
|
FTE_enroll |
L1. | .2563384 .1394503 1.84 0.066 -.0169792 .5296561
-------------------------------------------------------------------------------------

We take note of which independent variables are statistically significant
(p < 0.05).
The mata syntax from above is then reentered.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

We modify the coefplot syntax above (i.e., xtitle("At the 25th Percentile",
size(small))) to produce the graph below.
We see from Fig. 10.12 that administrators increase by 4.6% for every 10%
increase in net tuition revenue, with all variables at their 25th-percentile levels.

Fig. 10.12 Pct. change in administrators due to a Pct. change in net tuition revenue

After rerunning the regression model, the margins syntax is then changed
to reflect elasticities at the 75th percentile of net tuition revenue and other
variables.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p75) _all) post
Conditional marginal effects Number of obs = 460
Model VCE : Drisc/Kraay
Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
at : L.net_tuitj = 1.21e+09 (p75)
L.state_apj = 1.32e+09 (p75)
L.fedrev_r = 1.02e+09 (p75)
L.FTE_enroll = 232360.3 (p75)
-------------------------------------------------------------------------------------
| Delta-method
| ey/ex Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
net_tuition_rev_adj |
L1. | .6224404 .0570452 10.91 0.000 .5106339 .7342469
|
state_appro_adj |
L1. | .1381845 .0706619 1.96 0.051 -.0003102 .2766792
|
fedrev_r |
L1. | -.0719685 .0693115 -1.04 0.299 -.2078166 .0638796
|
FTE_enroll |
L1. | .2534847 .1466091 1.73 0.084 -.0338638 .5408332
-------------------------------------------------------------------------------------

The mata syntax is reentered.


mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

A slightly modified version of the coefplot syntax (i.e., xtitle("At the 75th
Percentile", size(small))) is reentered to create the graph below.
Figure 10.13 clearly shows that administrators increase by 5.9% for every
10% increase in net tuition revenue, with all variables at their 75th-percentile
levels. Taken together, the figures (Figs. 10.11, 10.12, and 10.13) show that
the change in administrators increases with the change in net tuition revenue
at every level, and that the increase is even greater at higher levels of net
tuition revenue.
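The pattern of elasticities rising with the evaluation point is what a linear fit with a positive intercept implies, since eyex = b*x/(a + b*x) increases in x. A sketch with made-up a, b, and percentile points:

```python
# For a linear fit y = a + b*x with a > 0, the elasticity
# eyex = b*x / (a + b*x) rises with the evaluation point, consistent
# with the 25th/50th/75th-percentile pattern above.
# Made-up a, b, and percentile points for illustration:
a, b = 200.0, 0.002
points = {"p25": 2.6e5, "median": 6.1e5, "p75": 1.2e6}
elasticities = {k: b * x / (a + b * x) for k, x in points.items()}
print(elasticities)  # increases from p25 to p75
```

The intercept's share of the fitted value shrinks as x grows, so the elasticity climbs toward 1 at high levels of the regressor.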

10.6 Marginal Effects and Word Tables

It may be useful to produce a publication-ready table that could be included
as part of an appendix in reports provided to policymakers and other
consumers of the information. These tables could show the detailed results
of the regression models used to produce the graphs discussed in the
previous section. They can be created with the Stata user-written routine
esttab, which is part of the package estout (Jann 2019b). To use esttab,
we install the most recent version (net install st0085_2, replace). Then
we enter the following syntax.

Fig. 10.13 Pct. change in administrators due to a Pct. change in net tuition revenue

First, we change the working directory to where we would like to place the
Word table.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"

Then, we repeat the following steps:


1. Run our OLS regression model
2. Calculate elasticities
3. Produce a Word table with the results from step 2

Steps 1 and 2—elasticities at the 25th percentile
qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p25) _all) cont post
eststo marginalp25

Steps 1 and 2—elasticities at the 50th percentile (median)


qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((median ) _all) cont post
eststo marginalmed

Steps 1 and 2—elasticities at the 75th percentile


qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p75 ) _all) cont post
eststo marginalp75

Table 10.3 Percent change in administrators due to a 1% change in net tuition revenue,
controlling for other factors (state appropriations, federal revenue, and FTE enrollment)

                         25th percentile   Median     75th percentile
L.Net tuition revenue    0.427***          0.531***   0.590***
                         (0.079)           (0.078)    (0.076)
L.State appropriations   0.135             0.149      0.150
                         (0.073)           (0.080)    (0.079)
L.Federal revenue        -0.0195           -0.0246    -0.0258
                         (0.085)           (0.107)    (0.112)
L.FTE enrollment         0.193             0.211      0.202
                         (0.151)           (0.164)    (0.159)
Observations             322               322        322

Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

Step 3—create Word file


esttab marginalp25 marginalmed marginalp75 using Table_Appendix, label se(3)
title("Percent Change in Administrators" "Due to a One Percent Change"
"in Net Tuition Revenue, Controlling for Other Factors" "(State
Appropriations, Federal Revenue, and FTE Enrollment)") mtitle("25th
Percentile" "Median" "75th Percentile") nonumbers rtf replace
[Stata output cut]
(output written to Table_Appendix.rtf)

We can click Table_Appendix.rtf to access the Word table (Table 10.3).


In Word, the table above can be edited and placed in an appendix.

10.7 Marginal Effects (with Categorical Variables) and Graphs

Marginal effects can also be used with categorical variables to answer a range
of policy questions. For example, suppose higher education policymakers
would like to know whether the relationship between administrators and net
tuition revenue differs by the extent to which higher education is regulated
by the state. In this example, we measure regulation by whether (Yes = 1) or
not (No = 0) a state has a higher education consolidated governing board (CGB).
The following steps are carried out to produce a graph of the marginal
effects by whether or not states have a consolidated governing board.
1. Shorthand notation and global macros are used to save keystrokes.
gen y = adminstaff
global x "L1.net_tuition_rev_adj L1.state_appro_adj L1.fedrev_r L1.FTE_enroll"

2. We “quietly” run a pooled OLS regression with D-K standard errors for
states with no consolidated governing board.
qui xtscc y $x if CGB==0

3. The marginal effects, specifically the elasticities, are calculated.


qui margins, eyex(*) post

4. We enter the mata syntax.


mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

5. The calculation of the marginal effects is stored.


eststo NoCGB

Steps 1 through 5 are repeated for states with a consolidated governing
board.
qui xtscc y $x if CGB==1
qui margins, eyex(*) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
eststo CGB

The graph, with appropriate labels and titles, is then created using the
following syntax.
coefplot NoCGB CGB, xline(0) format(%9.0f) rescale(10) recast(bar)
barwidth(0.3) fcolor(*.5) coeflabels(L.net_tuition_rev_adj = "{bf:Net
Tuition Revenue}" L.state_appro_adj = "State Appropriations" L.fedrev_r
= "Federal Revenues" L.FTE_enroll = "FTE Enrollment", labsize(small))
vertical p1(label("No CGB")) p4(label("CGB")) ytitle(Percent)
ylabel(-4(2)10) title("Percent Change in {bf:Administrators} Due to a 10% Change
in" "{bf:Net Tuition Revenue} (controlling for other factors)",
size(medium) margin(small) justification(center))

The graph is shown below.


Figure 10.14 shows that the confidence interval lines cross the zero line for
most of the bars, indicating that those estimates are not different from zero.
It also indicates that the percent increase in administrators associated with
net tuition revenue occurs only in states with no consolidated governing board.
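The subgroup logic of this section can be sketched in miniature outside Stata. The Python snippet below (illustrative only; the data, group labels, and coefficients are hypothetical) fits the same linear model separately for two values of a binary indicator and compares the resulting elasticities, mirroring the No CGB versus CGB comparison above.

```python
import numpy as np

rng = np.random.default_rng(1)

def elasticity_at_mean(x, y):
    # OLS fit y = a + b*x, then eyex = b*x/(a + b*x) at the mean of x
    b, a = np.polyfit(x, y, 1)
    xm = x.mean()
    return b * xm / (a + b * xm)

# Hypothetical "No CGB" group: y responds strongly to x.
x0 = rng.uniform(1.0, 2.0, 300)
y0 = 1.0 + 0.8 * x0 + rng.normal(0, 0.05, 300)
# Hypothetical "CGB" group: y responds weakly to x.
x1 = rng.uniform(1.0, 2.0, 300)
y1 = 2.0 + 0.1 * x1 + rng.normal(0, 0.05, 300)

e0 = elasticity_at_mean(x0, y0)
e1 = elasticity_at_mean(x1, y1)
print(round(e0, 2), round(e1, 2))
```

In the chapter's application, the estimation is done with xtscc and margins; the sketch shows only the estimate-separately-then-compare structure.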

10.8 Summary

Many questions from higher education policymakers can be answered by
providing basic descriptive statistics. Using the Stata user-written routine
asdoc, this chapter showed how the results from basic descriptive statistics
can be presented in Word tables for presentations to higher education
policymakers and other consumers of this information. This chapter also
demonstrated how policy questions that can be answered with spatial
descriptive statistics can be displayed in maps. Other policy questions require
the use of more advanced statistical techniques, like regression models.

Fig. 10.14 Pct. change in administrators due to a 10% change in net tuition revenue
(and other factors) by Consolidated Governing Board (CGB)

This chapter also demonstrated how the Stata commands margins and coefplot
can be used to create graphs that show these results to policymakers and
others who may not be familiar with or interested in regression models.

10.9 Appendix
*Chapter 10 Syntax

*Use the Stata user-written module asdoc (Shah, 2019) to create ///
presentation-ready tables in Word
net install asdoc, from(http://fintechprofessor.com) replace

*change our working directory to where we want to save our tables


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"

*open a dataset
use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data\Example 10.1.dta"

*we invoke the sum command to produce descriptive statistics


sum y x1 x2 x3 x4 x5

*We can either create rescaled versions of the original variables by ///
hand or use the Stata user-written routine rescale to ///
rescale the variables automatically

*install rescale
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace

*rescale y, x1, x3, x4, and x5 into millions; we use the ///
millions option
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions

*rerun the sum command


sum y x1 x2 x3 x4 x5

*use the Stata command tabstat


tabstat y_pop x1fte x3_pop x4_pop x5_pop , statistics(mean median) ///
column(statistics) format(%9.0fc)

*combine the use of asdoc and tabstat, with the replace and abb(.) options
asdoc tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) ///
column(statistics) format(%9.0fc) dec(0) long ///
title(Table 10.1 Descriptive Statistics) save(Table 10.1.doc) ///
replace label abb(.)

*create a categorical variable representing Maryland (MD) and create a table


gen MD=0
lab var MD "Comparisons"
replace MD=1 if fips==24
label define MD1 1 Maryland 0 "All Other States"
label values MD MD1

*create a categorical variable that reflects different time periods ///
In this example, we create a variable decade and code and label it accordingly
gen decade =0

*label variable
lab var decade ”Decades“
replace decade =1 if fy>=1980 & fy<=1989
replace decade =2 if fy>=1990 & fy<=1999
replace decade =3 if fy>=2000 & fy<=2009
replace decade =4 if fy>=2010 & fy<=2018

*label values and connect with variable


label define decade1 1 "1980 to 1989" 2 "1990 to 1999" 3 "2000 to 2009" ///
4 "2010 to 2018"
label values decade decade1

*create a Word table comparing Maryland to the rest of the nation
asdoc table decade MD, contents(mean y_pop) format(%9.0fc) dec(0) ///
title(Table 10.2 Average State Appropriations per Population) ///
save(Table 10.2.doc) replace label abb(.) replace

*Choropleth Maps
*change to the working to Map sub-directory
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Maps"

*install the Stata user-written map creation module, maptile (Stepner, 2017)
maptile_install using "http://files.michaelstepner.com/geo_state.zip", replace

*install the Stata user-written program spmap (Pisati, 2018)


ssc install spmap

*install the Stata user-written program statastates (Schpero, 2018)


ssc install statastates

*run statastates to add U.S. state identifiers (abbreviation, FIPS code, ///
and name)
statastates, name(state)

*create new variables (statefips and statename)


gen statefips = state_fips
gen statename = state
gen x2_1000 = x2/1000

*Create a choropleth map showing the values of one variable in one year ///
or change between two time periods, using the Stata-user written module ///
maptile (Stepner, 2017)
ssc install maptile, replace

*create a map - Fig. 10.1 Map of Appropriations per Capita by State for ///
Fiscal Year 2017
maptile y_pop if fy==2017, geo(state) geoid(statefips) nquantiles(5) ///
rangecolor(gray*0.075 gray*1.0) legd(0) ///
twopt(title("State Appropriations per Capita, 2017" "(in dollars)"))

*change working directory


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data"
use "Data10 - State appropriations per FTE enrollment (1980 and 2017).dta", clear

*create a map - Fig. 10.2 Percent Change in State Appropriations per ///
FTE Enrollment Between FY 2009 & FY 2017
maptile pctchnge , geo(state) geoid(statefips) ///
rangecolor(gray*0.01 gray*1.2) nq(7) legd(0) ///
twopt(title("Percent Change in State Appropriations per FTE Enrollment" ///
"Between FY 2009 & FY 2017" ))

*install Stata user-written module lgraph (Mak, 2015) ///
to show state appropriations per population for Maryland and the rest of ///
the nation over time
ssc install lgraph, replace

*create Fig. 10.3. State Appropriations Per Population, FY 1980-2018


lgraph y_pop fy, nom by(MD) xlabel(1980(3)2018) bw ///
title("State Appropriations Per Population" "FY 1980-2018") ///
ytitle(Dollars) legend(pos(12) col(2))

*create a graph with the appropriate labels and titles showing Maryland ///
state appropriations per FTE compared to other states within the Southern ///
Regional Education Board (SREB)

*create variable MDSREB


gen MDSREB =0 if region_compact==1
replace MDSREB = 1 if fips==24
label define MDSREB1 0 "All Other SREB States" 1 "Maryland"
label values MDSREB MDSREB1

*create Fig. 10.4. State Appropriations per Capita in Maryland and All ///
Other SREB States, FY 1980 to FY 2018
lgraph yfte fy if region_compact==1 , nom by(MDSREB) xlabel(1980(2)2018, ///
labsize(vsmall)) bw title("State Appropriations Per FTE" "FY 1980-2018") ///
ytitle(Dollars) legend(pos(12) col(2))

*create a graph that depicts when Colorado enacted Senate Bill 189 ///
(SB 04-189) to establish the College Opportunity Fund (COF) program to see ///
trends in Colorado’s net tuition revenue before and after the enactment of ///
SB 04-189, compared to net tuition revenue in all other states during the ///
same time period.

*open dataset from Chapter 7


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 7\Stata files"
use "Example 7.1.dta", clear

*create variable netuit_fte


gen netuit_fte = x1/x2

*We create the treatment variable (T).


gen T=0
replace T=1 if state=="CO"

*create Fig. 10.5. Colorado Net Tuition Revenue per FTE Before and ///
After SB 189 and All Other States
global y "netuit_fte"
lgraph $y fy if fy>1999, by(T) stat(mean) xline(2005) xlabel(2000(2)2016, ///
labsize(small)) ylab(, nogrid) scheme(s2mono) bw ///
title("Colorado's Net Tuition Revenue Per FTE" ///
"Before and After Colorado Senate Bill 189") ytitle(Dollars) ///
legend(pos(12) col(2))

*create Fig. 10.6. Colorado Net Tuition Revenue per FTE Before and ///
After SB 189 and All Other WICHE States
gen COWICHE =0 if region_compact==2
replace COWICHE = 1 if fips==8
label define COWICHE1 0 "All Other WICHE States" 1 "Colorado"
label values COWICHE COWICHE1
global y "netuit_fte"
lgraph $y fy if region_compact==2 & fy>1999, nom by(COWICHE) ///
stat(mean) xline(2005) ///
xlabel(2000(2)2016, labsize(small)) ylab(, nogrid) scheme(s2mono) bw ///
title("Colorado's Net Tuition Revenue Per FTE" ///
"Before and After Colorado Senate Bill 189") ytitle(Dollars) ///
legend(pos(12) col(2))

*Graphs of Regression Results


* install the Stata user-written module coefplot (Jann, 2019a)
ssc install coefplot, replace

*The first example is based on a pooled OLS regression model.


reg D1.lnnetut L1.D1.lnstateap L1.D1.lnfte L1.D1.lnperinc

*we use Stata’s mata syntax to extract the estimated coefficients from the ///
matrix produced by the regression models
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \st_matrix("e(b)") :+ 2))

*After slightly modifying the coefplot syntax provided by Jann, we create ///
a graph of the coefficients from the OLS regression results above.

*Fig. 10.7. Pct. Change in Appropriations, FTE and Personal Income due to ///
a Pct Change in Net Tuition Revenue
coefplot, xline(0) drop(_cons) mlabel format(%9.2g) mlabposition(0) ///
msymbol(i) ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
lwidth(. medium)) rescale(10) levels(95 99) ///
coeflabels(LD.lnstateap = "State Appropriations" ///
LD.lnfte = "FTE Enrollment" LD.lnperinc = "State Personal Income") ///
ytitle(10 Percent Change in . . .) xtitle(Change in Net Tuition Revenue)

*We use the Stata user-written routine xtmg (Eberhardt, 2013) that allows ///
us to relax the OLS assumptions of homogeneous coefficients and ///
cross-sectional independence when using panel data

* install the most recent version of xtmg in Stata


ssc install xtmg, replace

*create differenced and lagged differenced variables


gen Dlnnetut = D1.lnnetut
gen LDlnstateap = LD1.lnstateap
gen LDlnfte = LD1.lnfte
gen LDlnperinc = LD1.lnperinc

*we use the CCE estimator, which takes into account ///
cross-sectional dependence. The MG estimator produces ///
state-specific model beta coefficients, which are averaged across the panel
xtmg Dlnnetut LDlnstateap LDlnfte LDlnperinc, cce

*We repeat the Stata syntax to extract the estimated coefficients from ///
the matrix produced by the regression model with the CCEMG estimator
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \st_matrix("e(b)") :+ 2))

*we modify the coefplot syntax to include the variables of interest from ///
the regression model with the CCEMG estimator. We also change the ///
orientation from horizontal to vertical, add titles, and bold the text ///
we want to bring attention to in the graph.

*create Fig. 10.8. Pct. Change in Appropriations, FTE and Personal Income due to ///
a Pct Change in Net Tuition Revenue (controlling for other factors)
coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) ///
mlabel format(%9.2g) mlabposition(0) msymbol(i) ///
ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
lwidth(. medium)) rescale(10) levels(95 99) ///
coeflabels(LDlnstateap ///
= "{bf:State Appropriations}" ///
LDlnfte = "FTE Enrollment" ///
LDlnperinc = "State Personal Income", ///
labsize(medium)) ///
vertical title("Short-Run Change in {bf:Net Tuition Revenue} Due to a 10% Change in" ///
"{bf:State Appropriations} (controlling for other factors)", ///
size(medium) margin(small) justification(center))

*create a graph from a HCR model with DCCE and MG estimators and ///
a first-order autoregressive distributed lag (ARDL), of each of the variables.
qui xtdcce2 Dlnnetut L1.Dlnnetut LDlnstateap LDlnfte LDlnperinc, ///
reportc cr(_all) cr_lags(3 3 3 3) lr(L1.Dlnnetut LDlnstateap LDlnfte ///
LDlnperinc) lr_options(ardl)

*create Fig. 10.9. Pct. Change in Appropriations, FTE and Personal Income due to ///
a Pct Change in Net Tuition Revenue (controlling for other factors)
coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) ///
mlabel format(%9.2g) mlabposition(0) msymbol(i) ///
ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) ///
lwidth(. medium)) rescale(10) levels(95 99) ///
coeflabels(LDlnstateap ///
= "{bf:State Appropriations}" ///
LDlnfte = "FTE Enrollment" ///
LDlnperinc = "State Personal Income", ///
labsize(medium)) ///
vertical title("Short-Run Change in {bf:Net Tuition Revenue} Due to a 10% Change in" ///
"{bf:State Appropriations} (controlling for other factors)", ///
size(medium) margin(small) justification(center))

*Marginal Effects (with Continuous Variables) and Graphs


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data"
use "Example 10.dta", clear

*We use global macro names to save on keystrokes.

global y "adminstaff"
global x1 "net_tuition_rev_adj"
global x2 "state_appro_adj"
global x3 "fedrev_r"
global x4 "FTE_enroll"

*we use a pooled OLS regression model with Driscoll-Kraay (D-K) standard ///
errors and independent variables lagged by one year.
xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4

*margins to calculate the average marginal effects (AME)


margins, dydx(L1.$x1 L1.$x2 L1.$x3 L1.$x4)

*calculate the elasticities of each of the variables at their average ///
or mean levels

*calculate elasticities for variables at the median rather than the mean ///
to see if the results are substantially different
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all)

*Marginal Effects (Elasticities) and Graphs


margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

*create Fig. 10.10. Pct. Change in Administrators Due to a Pct. ///
Change in Net Tuition Revenue (controlling for other factors)
coefplot, xline(0) keep(L.net_tuition_rev_adj L.state_appro_adj ///
L.fedrev_r L.FTE_enroll) mlabel format(%9.2g) mlabposition(0) msymbol(i) ///
ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) lwidth(. medium)) ///
levels(95 99) coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" ///
L.FTE_enroll = "FTE Enrollment") ///
title("Percent Change in {bf:Administrators} Due to a 1% Change in" ///
"{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) ///
margin(small) justification(center))

*show the percent change (a rescaled to 10) on the vertical axis; show the ///
independent variables on the horizontal axis and; create custom legends ///
with regard to the significance of the independent variables
*create Fig. 10.11
coefplot (., keep(L.net_tuition_rev_adj) color(black)) ///
(., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r) color(gray)) ///
(., keep(L.FTE_enroll) color(gray)), legend(on) xline(0) ///
nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8) ///
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" ///
L.FTE_enroll = "FTE Enrollment" , labsize(small)) ///
title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
"{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) ///
margin(small) justification(center)) addplot(scatter @b @at, ms(i) ///
mlabel(@b) mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) ///
rescale(10) p2(nokey) p3(nokey) p1(label("Different from Zero")) ///
p4(label("Ignore - not different from zero")) ytitle(Percent) ///
xtitle("At the Median", size(small))

*a graph can also be created to show the change in administrators ///
with respect to the change at the 25th percentile of net tuition revenue ///
(and other variables)
xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p25) _all) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

*create Fig. 10.12


coefplot (., keep(L.net_tuition_rev_adj) color(black)) ///
(., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r) color(gray)) ///
(., keep(L.FTE_enroll) color(gray)), legend(on) xline(0) ///
nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8) ///
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" ///
L.FTE_enroll = "FTE Enrollment" , labsize(small)) ///
title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
"{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) ///
margin(small) justification(center)) addplot(scatter @b @at, ms(i) ///
mlabel(@b) mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) ///
rescale(10) p2(nokey) p3(nokey) p1(label("Different from Zero")) ///
p4(label("Ignore - not different from zero")) ytitle(Percent) ///
xtitle("At the 25th Percentile", size(small))

*a graph can also be created to show the change in administrators ///
with respect to the change at the 75th percentile of net tuition revenue ///
(and other variables)
xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4

margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p75) _all) post


mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))

*create Fig. 10.13


coefplot (., keep(L.net_tuition_rev_adj) color(black)) ///
(., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r) color(gray)) ///
(., keep(L.FTE_enroll) color(gray)), legend(on) xline(0) ///
nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8) ///
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" ///
L.FTE_enroll = "FTE Enrollment" , labsize(small)) ///
title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
"{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) ///
margin(small) justification(center)) addplot(scatter @b @at, ms(i) ///
mlabel(@b) mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) ///
rescale(10) p2(nokey) p3(nokey) p1(label("Different from Zero")) ///
p4(label("Ignore - not different from zero")) ytitle(Percent) ///
xtitle("At the 75th Percentile", size(small))

*Marginal Effects and Word Tables


*install the Stata user-written routine esttab, which is part of the ///
package estout (Jann, 2019b)
net install st0085_2.pkg, replace

*change the working directory to where we would like to place a Word table
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"

*elasticities at the 25th percentile


qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p25 ) _all) cont post
eststo marginalp25

*elasticities at the median


qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p50 ) _all) cont post
eststo marginalmed

*elasticities at the 75th percentile


qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p75 ) _all) cont post
eststo marginalp75

*create Word file


esttab marginalp25 marginalmed marginalp75 using Table_Appendix, label ///
se(3) title("Percent Change in Administrators" "Due to a One Percent Change" ///
"in Net Tuition Revenue, Controlling for Other Factors" ///
"(State Appropriations, Federal Revenue, and FTE Enrollment)") ///
mtitle("25th Percentile" "Median" "75th Percentile") ///
nonumbers rtf replace

*Marginal Effects (with Categorical Variables) and Graphs


cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data"
use "Example 10.dta", clear

*Shorthand notation and global macros are used to save keystrokes.

global y "adminstaff"
global x "L1.net_tuition_rev_adj L1.state_appro_adj L1.fedrev_r L1.FTE_enroll"
qui xtscc y $x if CGB==0
qui margins, eyex(*) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
eststo NoCGB
qui xtscc y $x if CGB==1
qui margins, eyex(*) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
eststo CGB
eststo CGB

*Fig. 10.14. Pct. Change in Administrators Due to 10% Change in Net ///
Tuition Revenue (and other factors) by Consolidated Governing Board (CGB)
coefplot NoCGB CGB, xline(0) format(%9.0f) rescale(10) recast(bar) ///
barwidth(0.3) fcolor(*.5) ///
coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" ///
L.state_appro_adj = "State Appropriations" ///
L.fedrev_r = "Federal Revenues" ///
L.FTE_enroll = "FTE Enrollment", labsize(small)) ///
vertical p1(label("No CGB") color(gray)) ///
p4(label("CGB") color(black)) ytitle(Percent) ylabel(-4(2)10) ///
title("Percent Change in {bf:Administrators} Due to a 10% Change in" ///
"{bf:Net Tuition Revenue} (controlling for other factors)", ///
size(medium) margin(small) justification(center))

*end

References

Eberhardt, M. (2013). XTMG: Stata module to estimate panel time series models with
heterogeneous slopes. https://econpapers.repec.org/software/bocbocode/s457238.htm
Jann, B. (2014). Plotting regression coefficients and other estimates. The Stata Journal,
14 (4), 708–737.
Jann, B. (2019a). COEFPLOT: Stata module to plot regression coefficients and other
results. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s457686.html
Jann, B. (2019b). ESTOUT: Stata module to make regression tables. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s439301.html
Mak, T. (2015). LGRAPH: Stata module to draw line graphs with optional error bars. In
Statistical Software Components. Boston College Department of Economics. https://
ideas.repec.org/c/boc/bocode/s456849.html
Pisati, M. (2018). SPMAP: Stata module to visualize spatial data. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s456812.html
Schpero, W. L. (2018). STATASTATES: Stata module to add US state identifiers to
dataset. In Statistical Software Components. Boston College Department of Economics.
https://ideas.repec.org/c/boc/bocode/s458205.html
Shah, A. (2019). ASDOC: Stata module to create high-quality tables in MS Word from
Stata output. In Statistical Software Components. Boston College Department of
Economics. https://ideas.repec.org/c/boc/bocode/s458466.html
Stepner, M. (2017). MAPTILE: Stata module to map a variable. In Statistical Software
Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/
bocode/s457986.html
Tandberg, D. A., & Griffith, C. (2013). State support of higher education: Data, measures,
findings, and directions for future research. In M. B. Paulsen (Ed.), Higher Education:
Handbook of Theory and Research (Vol. 28, pp. 613–685). Springer Netherlands.
https://doi.org/10.1007/978-94-007-5836-0_13
Index

A
ARIMA model, 156, 177
Arithmetic mean, 81
ARMAX model, 156–158, 160, 162, 177
Autocorrelation of the residuals, 150, 154, 158
Autoregressive (AR1), 151, 153
Autoregressive parameter, 156
Autoregressive terms, 156
Average beta coefficients, 183, 185

B
Beginning Postsecondary Students Longitudinal Study (BPS), 20
Box chart, 93, 101, 218
Breusch and Pagan Lagrangian multiplier test, 136

C
C-H autocorrelation general test of the residuals, 167, 178
Coefficient of variation (CV), 85
Cointegration, 146, 185, 196–198, 204
Common correlated effects, 184, 186, 199
Common Correlated Effects and Mean Group (CCEMG) estimators, 218
Correlograms, 163
Crosstabs, 87
Cross tabulations, 87
Cumby-Huizinga (C-H) general test for autocorrelation, 166
Cumby-Huizinga (C-H) general test of the residuals, 154, 177
Current Population Survey (CPS), 21

D
Descriptive statistics, 1, 4, 5, 13, 38, 46, 79, 80, 85, 86, 91, 92, 95, 100, 101, 207, 208, 232, 233
DF-GLS unit root, 147, 149
Difference-in-differences (DiD) estimator, 128
Difference-in-differences (DiD) Placebo Tests, 133–134
Differences-in-differences, 1, 4
Digest of Education Statistics, 21, 22, 33, 39, 43
Durbin-Watson (D-W) test, 151
Dynamic coefficient common correlated estimation, 183, 184, 203

E
ECM-based cointegration test, 198
Elasticities, 224, 225, 228, 230, 231, 237–239
Error correction, 185, 196, 199
Exploratory data analysis (EDA), 4, 79, 92

F
Feasible generalized least squares (FGLS), 121
First differenced, 149
First-order autocorrelation (AR1), 150, 151, 154, 163, 164
Fixed-effects regression, 121, 122, 128, 129, 134, 136–138, 140, 141, 169, 170, 172, 173, 178, 185
Frequencies, 86, 88, 100, 101
Friedman's test of cross sectional independence, 169

G
Group-wise heteroscedasticity, 118

H
Hausman test, 136, 138–140, 143
Heterogeneous coefficient regression, 183, 184, 199, 203, 217
Heteroscedasticity, 105, 117, 118, 120, 121, 139, 153, 154, 167, 173, 198, 199, 204
High School Longitudinal Study of 2009 (HSLS:09), 19
Histogram, 92, 93, 101, 102
Homoscedasticity, 117, 120, 142, 151

I
Interaction effect, 112
Interaction terms, 111–114, 141

L
Levene test, 120
Line charts, 213
Long-run, 183–186, 188, 189, 198, 199, 201–203

M
Marginal effect at the average (MEA), 222
Mean group, 183, 186, 188, 200–203, 217
Median, 10, 13, 80, 81, 85, 90, 93, 95, 209, 210, 222, 225–227, 230, 233, 234, 237, 239
Modified Dickey-Fuller test, 147
Moving-average parameter, 156
Moving average terms, 156
Multivariate regression, 103, 121, 122

N
National Association of State Student Grant and Aid Programs (NASSGAP), 22
National Education Longitudinal Study of 1988 (NELS:88), 19
National Postsecondary Student Aid Study (NPSAS), 20
National Science Foundation (NSF), 23
Non-stationary data, 146, 147, 163, 186, 190–196, 204, 221
Normal distribution, 92, 153
Null hypothesis, 107, 113, 136, 147–149, 153, 154, 156, 158, 166, 170–172, 174, 191, 196, 198, 199, 201

O
Ordinary least squares (OLS), 1, 4, 103, 188, 217
Other Sources of National Data, 22

P
Partial autocorrelations, 150, 154, 163, 177
Pesaran cross-sectional dependence test, 172
Pesaran's test of cross sectional independence, 169
Pooled OLS (POLS) regression, 108
Pooled (POLS) regression model with dummy variables, 127
Prais-Winsten (P-W) estimator, 153
Presentation-ready tables in Microsoft Word, 208

R
Random-effects regression, 4, 103, 134–136, 141, 163, 174
Regional Compacts, 23
Regression models with Driscoll and Kraay (D-K) standard errors, 173
Residual-versus-fitted plot, 116

S
Scatter plots, 79, 96, 98, 102
Short-run, 183–186, 188, 189, 199, 201–203, 216, 221
Skewed distributions, 93
Standard deviation, 80, 85, 88, 107, 108, 127
Standard errors, 81, 108, 118, 121, 135, 156, 165, 167, 168, 173–175, 178, 223, 231

T
Test for assumption of homogeneous coefficients, 199
Test for weak cross-sectional dependence, 171, 201, 204
The College Board, 22
Time-invariant categorical variables, 91, 92
Time series regression model, 153, 177
Two-way tables, 88

U
Unit circle, 162
Unit root, 147–149, 165, 166, 178
Unit-specific beta coefficients, 183

W
Weighted least squares (WLS), 121
Westerlund (2005) test, 196
Within-group estimator, 122

Y
Year fixed-effects, 175, 178, 179
