PATTERNS, CAUSALITY AND PREDICTION

DATA ANALYSIS FOR BUSINESS, ECONOMICS AND POLICY
Textbook content
by Gábor Békés and Gábor Kézdi
[17/December/2016]

Summary of key features of the textbook


Patterns, Causality and Prediction: Data Analysis for Business, Economics and Policy is a
textbook aimed primarily at business, applied economics and public policy students. It
may be taught at MBA, EMBA, MA Economics (non-PhD track), MSc in Business
Economics/Management, MA in Public Policy, PhD in Management and comparable
programs.
The textbook material may be fully covered in a year-long course (for example, in the
first year of a two-year Master's or PhD program). It also covers material for a series
of courses or modules, and chapters may be used to assemble programs of various
lengths.
Our textbook covers integrated knowledge of methods and tools traditionally scattered
around various fields such as econometrics, machine learning and practical business
statistics. State-of-the-art knowledge in data analysis includes traditional regression
analysis, causal analysis of the effects of interventions, predictive analytics using
regression and machine learning tools, and practical skills for working with real-life data
and collecting data.
We cover relatively few methods but help students gain a deep intuitive understanding.
We put a lot of emphasis on the interpretation and visualization of results.
Applied knowledge can be acquired only by working through many applications.
Students will use real-life data and learn how to manage analytical projects from scratch,
as we provide data and code as part of an online ancillary platform. The textbook
supports Microsoft Excel, R and Stata, with an emphasis on the latter two.
The illustration studies and examples presented in the textbook are all connected to
real life problems in business and public policy. We are cooperating with major
companies, NGOs and policy institutions to jointly develop interesting case studies.
Our textbook is complemented with extensive online material including data, code,
additional case studies, practice questions, sample exams and data exercises.

The textbook
Proposed title
Patterns, Causality and Prediction: Data Analysis for Business, Economics and Policy

Completion date, length and illustrations


The proposed length is 600 pages, plus online material. At this stage we
consider this to be a stand-alone textbook.
In terms of illustrations, we plan to add graphs and tables. As many as 200 graphs and
100 tables are to be expected. Color is useful but not essential. LaTeX typesetting may
be carried out by the authors.
The book is intended to be completed in the first half of 2018.

Motivation
The ongoing data revolution has major consequences for businesses and policy-makers
alike: more and better data is available to support decision making. As a result there is
a growing need for professionals who can learn from available data and can collect
relevant data.
There is a need for analysts who can assess the effects of business and policy practices,
carry out predictions and work with real-life data, small and big. The ability to visualize
and interpret results is also becoming extremely important. Not only analysts but also users of
analyses need many of these skills to translate results into decisions and to commission data
analysis and data collection.
There is a need for analysts with a skill set that integrates traditional statistical analysis with
machine learning methods. There is a need for analysts with a deep and applicable
knowledge of the most reliable methods. There is a need for analysts who can write their
own code and work with real-life data that is often messy and complicated. There is
a need for analysts who can understand the business and policy context and tailor their
analysis to answer substantive questions.
The isolated and often formalistic textbooks of econometrics and machine learning
offer fragmented skills and knowledge, cover many more methods than needed, rarely
provide instructions or code for software implementation, ignore the messy and
complicated nature of real life data, and often focus on academic applications. The
more practical textbooks of business statistics, survey statistics and other applied fields
do not cover many important data analysis methods, and when they do, they do not offer
a deep understanding of those methods.
Our textbook addresses all four needs: the need for analysts with integrated
knowledge, who understand and can apply the most robust methods, who can work
with real-life data, and who build their work to address real-life problems.

Need 1: integrated knowledge


Our textbook covers integrated knowledge of methods and tools traditionally scattered
around various fields such as econometrics, machine learning and practical business
statistics. State-of-the-art knowledge in data analysis includes traditional regression
analysis, causal analysis of the effects of interventions, predictive analytics using
regression and machine learning tools, and practical skills for working with real-life data
and collecting data. Covering all in an integrated textbook allows for comparing the
strengths and weaknesses of methods and tools in an explicit way. Our textbook
provides such comparisons throughout.
We believe that most instructors will be able to teach courses based on this textbook
without a major effort to upgrade their skills. Many of the methods in the textbook are
parts of the courses most instructors teach already, and the additional methods are
introduced in intuitive ways and in connection with other methods. We will offer a great
deal of supporting material to ease the transition to using this textbook.

Need 2: few methods but deep understanding


Practitioners need a deep working knowledge of the most robust methods that can
give good answers to most questions. To respond to that need we cover relatively few
methods instead of providing a complete menu. At the same time we help students
gain a deep intuitive understanding of how, why, and for what problems those
methods work, how to interpret their results, and what their potential limitations are. We
emphasize intuition but also present simple formulae and derivations when they help
intuition. We illustrate the working of methods and their advantages and disadvantages
through real life problems.
We put a lot of emphasis on the interpretation and visualization of results. Appropriate
interpretation is an indispensable skill for connecting one's analysis to substantive
questions and for assessing the applicability of other analysts' results. Visualization
is emphasized both to help analysts make the best decisions during
the analysis and to present its results in a powerful way.

Need 3: working with real data


Applied knowledge can be acquired only by working through many applications. The
textbook fosters hands-on work through numerous examples, both within the main text
and as supplementary material. Illustration studies in the textbook explain how methods
work; fully developed case studies included in the ancillary material answer real-life
questions using real-life data; additional exercises invite students to replicate analysis
and carry out further projects with appropriate guidance.
The challenges of real-life data analysis are emphasized throughout all
examples. In contrast with most textbooks on the market, we do not provide neat and
clean data for the exercises. Instead, we start from scratch, with data that is often
unstructured and messy and guide students to get to the analysis stage using
appropriate tools. Moreover, we invite students to get the data themselves for their
assignments. The textbook includes chapters that cover data collection in detail.
The textbook supports Microsoft Excel, R and Stata, with an emphasis on the latter two.
The needs of data professionals of all stripes extend beyond Excel. R and Stata are the
most widely used software for the methods covered in the textbook. Both include
powerful tools for data management and visualization as well.

Need 4: solving real life problems


The illustration studies, case studies and exercises are all connected to real life problems
in business and public policy. Analyzing real life problems is the only way to acquire the
skills to connect data analysis to answering real life questions. Such examples also tend
to get students more engaged than artificial or academic problems.
We plan to cooperate with major companies, NGOs and policy institutions to jointly
develop these examples. This cooperation ensures that our examples are truly
connected to the issues faced by business and policy. At the same time it helps position
our partners as attractive employers for data-savvy students. We already have several
partnerships in the areas of marketing, banking, consulting, and policy research.

Key topics
Patterns: regression analysis
Uncovering patterns in the data can be an important goal in itself, and it is the
prerequisite to establishing cause and effect and carrying out predictions. The textbook
starts with simple regression analysis, the method that compares expected y for
different values of x to learn the patterns of association between the two variables. It
discusses nonparametric regression and focuses on linear regression. It builds on
simple linear regression and goes on to enrich it with nonlinear functional forms,
generalization from a particular dataset to the other data it represents, additional
explanatory variables, etc. The textbook also covers regression analysis for time series
data, panel data, binary dependent variables, as well as nonlinear models such as logit
and probit. Understanding the intuition behind the methods, their applicability in various
situations, and the correct interpretation of their results are the constant focus of the
textbook.
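To make the idea of regression as a comparison of expected y across values of x concrete, here is a minimal sketch in Python. It is an illustration only, not code from the textbook: the data values, the variable meanings and the function names `conditional_means` and `simple_ols` are all hypothetical. It first averages y within each value of x (a simple nonparametric regression) and then fits the least-squares line.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data (made-up numbers): x = years of experience, y = hourly wage.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [10.0, 12.0, 13.0, 15.0, 15.0, 17.0, 18.0, 20.0]


def conditional_means(x, y):
    """Average y for each distinct value of x (a simple nonparametric regression)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    return {xi: mean(ys) for xi, ys in sorted(groups.items())}


def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


means_by_x = conditional_means(x, y)
intercept, slope = simple_ols(x, y)
print("Mean of y by x:", means_by_x)
print("Linear fit: y =", round(intercept, 2), "+", round(slope, 2), "* x")
```

In this made-up example the slope says how much higher y is, on average, for observations whose x is one unit higher. It describes a pattern of association and does not by itself establish cause and effect.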
Causality: learning the effects of interventions
Decisions in business and policy are often centered on specific interventions, such as
changing monetary policy, modifying health care financing, changing the price or
other attributes of products, or changing the media mix in marketing. Learning the
effects of such interventions is an important purpose of data analysis. The textbook
incorporates the basic concepts and methods used by program evaluation (the
framework of potential outcomes, the benefits of randomized assignment, etc.). It also
covers related methods used in business, such as A/B testing.
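The potential outcomes framework and the benefit of randomized assignment can be sketched with a small simulation. This is an assumption-laden illustration, not the textbook's material: the outcome distribution, the constant effect of 2.0, the sample size and the function names `simulate_experiment` and `difference_in_means` are all hypothetical. Each unit has two potential outcomes, a coin flip decides which one is observed, and the difference in group means estimates the average effect, as in a basic A/B test.

```python
import random
from statistics import mean, variance


def simulate_experiment(n=10000, true_effect=2.0, seed=42):
    """Simulate a randomized experiment with a constant (assumed) treatment effect."""
    rng = random.Random(seed)
    control, treated = [], []
    for _ in range(n):
        y0 = rng.gauss(10.0, 1.0)  # potential outcome without the intervention
        y1 = y0 + true_effect      # potential outcome with the intervention
        if rng.random() < 0.5:     # random assignment: only one outcome is observed
            treated.append(y1)
        else:
            control.append(y0)
    return control, treated


def difference_in_means(control, treated):
    """Estimated average effect and its standard error (unequal-variance formula)."""
    effect = mean(treated) - mean(control)
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    return effect, se


control, treated = simulate_experiment()
effect, se = difference_in_means(control, treated)
low, high = effect - 1.96 * se, effect + 1.96 * se
print(f"Estimated effect: {effect:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Because assignment is random, the two groups have the same expected outcome without the intervention, so the simple difference in means recovers the average effect. With observational data that comparison could be contaminated by selection.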
Prediction: carrying out predictions
Data analysis in business and policy applications is often aimed at prediction. The
textbook introduces tools to evaluate predictions, such as loss functions or the Brier
score. It emphasizes the importance of out-of-sample prediction, the role of stationarity,
the dangers of overfitting and the use of training and testing samples and cross-validation.
It presents and compares the most important predictive models that may be
useful in various situations such as time series regressions, classification tools and
tree-based machine learning methods.
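Two of the evaluation tools named above, the Brier score and the split into training and testing samples, can be sketched briefly. This is a hedged illustration under stated assumptions: the probabilities and outcomes are made up, the function names `brier_score` and `k_fold_indices` are hypothetical, and the folds are formed by taking every k-th observation, which presumes the order of observations is arbitrary (it would not suit time series data).

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def k_fold_indices(n, k):
    """Yield (train, test) index lists; each observation is in exactly one test fold."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]  # every k-th observation, starting at position `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


# Hypothetical predicted probabilities and realized binary outcomes.
predicted = [0.9, 0.1, 0.8, 0.3]
realized = [1, 0, 1, 1]
print("Brier score:", round(brier_score(predicted, realized), 4))

for train, test in k_fold_indices(n=10, k=5):
    print("train:", train, "test:", test)
```

A lower Brier score indicates better probability predictions. In cross-validation a model would be fitted on each training set and scored on the matching test set, so that the reported fit is always out-of-sample.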
Real life: Collecting data and working with data
Data collection is often an integral part of data analysis. Regardless
of its source, data in real life needs cleaning and restructuring before it can be
analyzed. Even after extensive cleaning the data used in the analysis is typically
different from the ideal dataset that would serve the analysis best. Analysts need to use
appropriate tools to collect, clean and restructure data, and they need to have a
thorough understanding of the differences between ideal data and available data to
interpret their results in appropriate ways. The need to integrate practical data work
with statistical analysis and machine learning has increased with the advance of Big
Data: large and unstructured datasets are becoming more and more common. The
textbook introduces the most important tools of data collection, data management
and cleaning, and it discusses the consequences of measurement issues on the results
of analysis. These topics are included as separate chapters, and they are emphasized
in the case studies as well.
Big Data: New opportunities and new challenges
Big Data presents opportunities to better answer old questions and ask new questions. It
offers great advantages when applying many traditional statistical methods and allows
for developing new methods. At the same time analyzing Big Data presents new
challenges, too. We include explicit discussion of these opportunities and challenges in
relation to uncovering and generalizing patterns, learning the effects of interventions
and carrying out predictions, within each of the sections of the book.

Online material
The textbook will be supported by a set of additional material available online.
Data and lab
Each chapter is accompanied by the data used in the illustration studies with a full but
concise description, and the description of how to implement data management,
cleaning, analysis, and visualization in Excel, R and Stata. We also provide the R and
Stata code that produces all results shown in the textbook, starting with raw
data. Students can learn coding by first understanding and then tinkering with code
that works. We plan to store these Data and lab sections online, with some elements
possibly turned into videos or interactive exercises similar to those on datacamp.com.

Additional case studies


In addition to the illustration studies in the textbook, we will provide additional case
studies that allow for studying the entire process of data analysis from the substantive
business or policy question through collecting or accessing data, managing and
cleaning data, carrying out the analysis, presenting and interpreting its results, and
addressing the original substantive questions. Case studies aim at answering a question
rather than simply illustrating a method. In addition to showing tools in real-life use, case
studies will also present decision points (e.g. which model or functional form to pick,
what data source to rely on, how to treat missing observations).
Case studies are standalone documents, occasionally produced in collaboration with
partners. We plan to start with 5-10 studies at the launch of the textbook, but case
study development will continue after publication.
Practice questions, sample exams, data exercises
To help instructors, we will provide a large number of practice questions accompanying
each section, as well as sample exams on material covered by multiple chapters.
We plan to develop an online tool with an ample selection of practice questions and
answers to help students check whether they have acquired the knowledge and skills covered in
each chapter. These questions can also be used as parts of assignments or exams, and
they provide templates for instructors to design similar questions with little effort. We shall
attach appropriate tags to each question to relate them to tools and methods as well
as level of difficulty.
We will provide longer exam questions to choose from and guidance to modify the
questions to tailor them to specific audiences. These exam questions typically touch on
material covered in multiple chapters of the textbook; we shall indicate the required
chapters for each question.
We will also provide hands-on data analysis exercises alongside each chapter. These
exercises invite students to use the data of the illustration studies covered in each
chapter to produce additional results, collect similar data and replicate the analysis,
replicate relevant parts of the additional case studies, and produce similar case studies
in different settings.
Prerequisites
We plan to complement the textbook with a collection of prerequisites, made
available online. These include basic statistics, the basics of coding in R and Stata, as
well as data management and data cleaning tools mentioned in the introductory
chapter. We shall also direct students to additional resources that cover those
prerequisites in greater detail.

Outline
I. PATTERNS: REGRESSION ANALYSIS
1. How to approach and describe data
Key characteristics of a dataset. Types of observations (cross-section, time series,
other structures), types of variables. Describing data (source, types of
observations and variables, descriptive statistics, distributions).
Visualization of basic features of data, histograms, kernel densities, box plots
Frequent data problems and cleaning data. Common issues with data (zero
values, missing values, errors, duplicates, dates, spelling). Suspicious values and
benchmarking. How to form realistic expectations.
2. Simple regression analysis
Regression as comparison of means. Definitions. Visualizing regression.
Nonparametric regression. Linear model. Learning simple linear regression
parameters. Predicted dependent variable and the residual. Goodness of fit.
Correlation and causality.
Graphical representation, scatterplots, visualizing nonparametric and linear
regression.
3. Uncovering non-linear patterns in regression analysis
Transforming variables (taking logs, normalizing by size, standardized variables),
piecewise linear spline, quadratic and other polynomials. When to worry and
when not to worry about nonlinear patterns.
4. Inference: Generalizing from our data
Repeated samples. Confidence Interval as the prime tool to make inference. SE.
Robust SE. Layers of external validity.
Presenting regression results in standard table format. Visualizing and interpreting
confidence bounds around regression lines.
5. Multiple linear regression analysis
Uses of multiple regression: multiple associations, controlling for confounders,
improving prediction. Categorical explanatory variables, interactions. Omitted
variable bias, bad controls. Modelling and interpretation.
Presenting and summarizing results: advice on tables and graphs.
6. Probability models
Linear regression with binary outcome: the linear probability model.
Nonlinear probability models: logit and probit. Coefficients and marginal
differences.
7. Analysis of time series data
Frequency, trends, seasonality. Taking differences. SE robust to serial correlation.
Time series graphs
8. Messy data
Dealing with missing data, influential observations, weights, standardization of
variables.
Classical measurement error in the dependent variable or the explanatory
variable. Other types of measurement error. Division bias.
9. Model specification
Multiple measures of variables in focus. Selecting single measures; combining
multiple measures as score or Principal Component.
What controls to include (a substantive question); how to include them (a
statistical question). Tools dealing with multicollinearity, functional forms,
interactions (RMSE, AIC & BIC, ridge regression and LASSO).
Visualization of issues and solutions (variable importance etc.).
10. Data collection to find patterns
Access admin data; collect data from Web (scraping); design surveys.
Simple-to-complicated approach: basic ideas are simple, we show solutions to a
few simple cases and offer guidelines for more complicated cases (for which
expertise is needed).
11. What's different with Big Data
The increased role of exploratory data analysis. Data mining in big data, the
CRISP-DM process. New kinds of patterns (networks, clusters, tipping points).
Overfitting. Size and increased likelihood of finding patterns.
New possibilities for visualizing patterns and quality of analysis

II. CAUSALITY: UNCOVERING THE EFFECTS OF INTERVENTIONS


12. A framework to think about cause and effect
The potential outcomes framework and the concept of the counterfactual.
Effect heterogeneity and average effects.
13. Cause and effect in regression analysis
Unobserved heterogeneity. Selection, confounders, reverse causality (two-way causality).
Direct approaches to measure cause and effect: randomized experiments;
natural experiments in observational data.
Indirect approaches to measure cause and effect in observational data:
control variables; assessing direction of causality from timing.
14. Design and analysis of randomized experiments
Designing experiments (incl. A/B testing); randomization in practice; power
calculations. Multi-arm interventions.
Compliance, treatment contamination. Intent-to-treat effects and LATE.
Other issues with experiments. External validity, replication.
15. Hypothesis testing
Recap from chapter 4: repeated samples, confidence interval, standard
error.
Null and alternative hypotheses. False negatives, false positives, types of error.
Size and power. p-value.
Testing multiple hypotheses and p-hacking.
16. Identification from changes
Difference-in-differences. Synthetic controls.
17. Identification of policy effect with time series data
Trends, seasonality, unit roots (concept and testing). Impulse responses, VAR,
cointegration.
18. Regression with longitudinal data
Cluster standard errors. FE, regression in differences. Balanced and
unbalanced panels. Pooled time series.
19. Studying the effects of interventions in Big Data
Machine learning techniques to help assess the heterogeneity of effects;
model selection in a non-experimental setup.
20. How to think about results of causal analysis
Constructive skepticism
Understand data and methods. Robustness checks, replication. Meta-analysis. Publication bias.

III. PREDICTION: REGRESSION AND MACHINE LEARNING


21. A framework for prediction
Categorical outcomes, continuous outcomes. Predicted values. Prediction error.
Measures of fit. Loss function.
Overfitting. Cross-validation. Bias-variance trade-off. Training, validation, test
samples.
22. Regression-based prediction of categorical outcomes
Linear probability, logit, probit. Linear discriminant analysis. Evaluating predictions
of categorical outcomes (calibration, Brier score).
Visualization of model performance, ROC graph / AUC
23. Tree-based prediction of categorical outcomes
Classification and Regression Tree (CART). Pruning, boosting, ensemble methods.
Random forest in action. Discussion of other methods.
24. Predicting continuous outcomes
Regression vs random forest. Interpretability vs precision: comparison of methods.
25. Forecasting: prediction in time series
Additional issues with forecasting. ARIMA, VAR vs. tree-based methods.
26. Unsupervised learning
Prediction without Y. Cluster analysis. K-nearest neighbors classifier
27. Prediction with Big Data
Parsing and structuring data. Velocity, generalization and flexibility. Where to
look for advice. Digression on infrastructure and access
Complexity and size, random sampling.
28. How to think about results of predictive analytics
Constructive skepticism
Understand data and methods. Wisdom of the crowd.
Stationarity, the role of causality.