
Introduction Setup Using LASSO to select controls: Our Methodology Simulation Empirical

Taming the Factor Zoo

Guanhao Feng (City University of Hong Kong)
Stefano Giglio (Yale University)
Dacheng Xiu (University of Chicago)

November 13, 2017



Motivation

- Hundreds of risk factors or “anomalies” in the last 30 years, e.g., Harvey et al. (2015, RFS).
- How can we bring more discipline to this “zoo” of factors?
- How can we discriminate truly useful pricing factors from redundant and useless factors?
- Do we really have 300 factors, or 300 times the same factor?
- Suppose we have a “new” factor: is it truly new or redundant?
- We want to find new anomalies and new dimensions of risk.



Motivation

- We consider 100 prominent factors introduced in the last 30 years.
- 14 of these were published in top finance journals in the last 5 years alone:

Factor                    Avg. ret. (bp)   t-stat
Percent Accruals                      11     1.18
Cash Holdings                         19     1.34
HML Devil                             27     1.55
Gross profitability                   21     1.71
Organizational Capital                41     2.67
Betting Against Beta                  92     5.42
Quality Minus Junk                    50     4.09
HXZ Investment                        38     4.15
HXZ Profitability                     57     4.51
Employee Growth                        7     0.63
RMW                                   36     3.08
CMA                                   31     3.23
Intermediary Investment              116     3.57
Convertible Debt                       7     0.90

Challenges

- How many are new? And new relative to what?
- What is the right benchmark? FF3 (Market-Size-Value)? Others?

                            FF3 (mkt/size/value)  Avg. Ret.
id  Factor Description      λs (bp)  t-stat     avg. ret. (bp)  t-stat
85  Percent Accruals            -18  -1.81*                 11   1.18
86  Cash Holdings                26   1.24                  19   1.34
87  HML Devil                  -108  -2.27**                27   1.55
88  Gross profitability          42   2.62***               21   1.71*
89  Organizational Capital       46   2.77***               41   2.67***
90  Betting Against Beta         19   1.02                  92   5.42***
91  Quality Minus Junk           40   2.35**                50   4.09***
92  HXZ Investment               -3  -0.27                  38   4.15***
93  HXZ Profitability            34   2.22**                57   4.51***
94  Employee Growth              -2  -0.13                   7   0.63
95  RMW                          22   1.62                  36   3.08***
96  CMA                           1   0.05                  31   3.23***
98  Intermediary Investment      63   0.98                 116   3.57***
99  Convertible Debt              6   0.61                   7   0.90

Challenges
- How about controlling for all factors proposed by the literature?
  - Inefficient or even not feasible!
- We will use machine learning / model selection to reduce the dimensionality of the factor zoo.
  - Use model selection to select the benchmark.
- Which of the many machine learning models?
  - The most appropriate tool depends on your assumptions.
  - If you believe there is a small-dimensional factor model, LASSO is the most appropriate (will review in a few slides).
  - We will use LASSO as the benchmark, but our results extend to other methods (elastic nets, random forests, etc.).

This paper

A new methodology that uses machine learning/model selection to decide whether a new factor is useful

Three key ideas of our method

1. Use model selection

2. ... to evaluate new factors relative to all existing factors

3. ... taking into account potential model-selection mistakes



This paper

- Use model selection
  - Rather than comparing the new factor with all 300 controls, use model selection to pick the “best model” out of those.
  - This weeds out useless factors from the hundreds of existing ones.
- ... to evaluate new factors relative to all existing factors
- ... taking into account potential model-selection mistakes



This paper

- Use model selection
- ... to evaluate new factors relative to all existing factors
  - What machine learning (LASSO) is not useful for:
    - What are the identities of the “true” factors?
  - However, we can use LASSO to control for the Sharpe ratio achievable with existing factors:
    - Does my new factor contribute to increasing the Sharpe ratio?
- ... taking into account potential model-selection mistakes



This paper

- Use model selection
- ... to evaluate new factors relative to all existing factors
- ... taking into account potential model-selection mistakes
  - We want to ask: does the new factor improve over the LASSO-selected model?
  - What if LASSO “forgets” some factors?
  - We develop econometric methods robust to model-selection errors.

Our Empirical Findings

- We show that several factors, such as profitability and investments, have statistically significant explanatory power beyond the hundreds of factors proposed in the past.
- The selected models are unstable (hence model-selection mistakes are inevitable), but our inference is stable!
- If our method had been applied year by year starting in 1994, 14 out of the 99 factors proposed afterwards would have been considered useful, while a large majority would have been identified as redundant or useless.

Model Setup

- Consider the observed factors $v_t = \{g_t, h_t\}$ and the true factor model, where $g_t$ is $d \times 1$ and $h_t$ is $p \times 1$:

$$m_t = \gamma_0^{-1} \left(1 - \lambda_v^\top v_t\right). \tag{1}$$

- What determines expected returns?

$$E(r_t) - \iota_n \gamma_0 = C_v \lambda_v = C_g \lambda_g + C_h \lambda_h,$$

  where $C_v$, $n \times (d + p)$, are (univariate) covariances between $r_t$, $n \times 1$, and $v_t$, $(d + p) \times 1$.

- $\lambda_v$ are risk prices: nonzero if a factor is useful.
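A short derivation sketch of how equation (1) leads to the expected-return equation. It relies on two assumptions that the slide does not state and that are made here only for illustration: the factors are demeaned, $E(v_t) = 0$, and the SDF prices the test assets as $E(m_t r_t) = \iota_n$.

```latex
\begin{align*}
\iota_n = E(m_t r_t)
  &= \gamma_0^{-1}\left( E(r_t) - E(r_t v_t^\top)\,\lambda_v \right) \\
\Longrightarrow\quad
E(r_t) - \iota_n \gamma_0
  &= E(r_t v_t^\top)\,\lambda_v
   = \mathrm{Cov}(r_t, v_t)\,\lambda_v
   = C_v \lambda_v
   = C_g \lambda_g + C_h \lambda_h .
\end{align*}
```

The step $E(r_t v_t^\top) = \mathrm{Cov}(r_t, v_t)$ uses $E(v_t) = 0$, and the last equality just partitions $C_v = [C_g, C_h]$ and $\lambda_v = (\lambda_g^\top, \lambda_h^\top)^\top$.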



Model Setup

- Both useless and redundant factors have zero $\lambda$s.
- The difference is that, for redundant factors, the return covariances are cross-sectionally correlated with the exposures to the true factors.

The Curse of Dimensionality

- If the set of control factors $h_t$ is small, the problem is easy:
  - Run a cross-sectional regression of average returns on the sample covariances $\widehat{C}_g$ and $\widehat{C}_h$.
- If $h_t$ is high dimensional, such an estimator is inefficient or even infeasible:
  - the convergence rate is $p^{1/2} n^{-1/2}$;
  - it is infeasible if $p > n$.
- Use variable selection to reduce the dimensionality of $h_t$ first.
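The low-dimensional case can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names `sample_covariances` and `cross_sectional_ols` are hypothetical, an intercept is included to play the role of the zero-beta rate, and the feasibility check simply refuses to run when there are more regressors than assets.

```python
import numpy as np

def sample_covariances(R, V):
    """Sample covariances between returns R (T x n) and factors V (T x k); returns n x k."""
    Rc = R - R.mean(axis=0)
    Vc = V - V.mean(axis=0)
    return Rc.T @ Vc / R.shape[0]

def cross_sectional_ols(rbar, C):
    """OLS of average returns rbar (length n) on an intercept and covariances C (n x k)."""
    n, k = C.shape
    if n < k + 1:
        # More regressors than assets: no unique least-squares solution.
        raise ValueError("cross-sectional regression is infeasible: n < k + 1")
    Z = np.column_stack([np.ones(n), C])
    coef = np.linalg.lstsq(Z, rbar, rcond=None)[0]
    return coef[0], coef[1:]  # (intercept, estimated risk prices)
```

With only a handful of controls this works directly. The check at the top of `cross_sectional_ols` is where the approach breaks down once the number of controls approaches the number of test assets.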

LASSO (least absolute shrinkage and selection operator)

- Tibshirani (1996) proposes LASSO in a high-dimensional regression setting where few coefficients are nonzero, the so-called “sparsity”:

$$y = X\beta + \varepsilon, \qquad \|\beta\|_0 \le s,$$

where $X$ is $n \times p$ and $\beta$ is $p \times 1$.

The LASSO estimator is defined as the solution to this convex optimization problem:

$$\widehat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta \in \mathbb{R}^p} \; n^{-1} \|y - X\beta\|_2^2 + \tau n^{-1} \|\beta\|_1 .$$
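As an illustration, the objective above can be minimized by cyclic coordinate descent with soft-thresholding. This is a minimal sketch, not the solver used in the paper: the function name `lasso_cd`, the iteration cap, and the tolerance are assumptions, and no intercept or standardization is applied.

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=500, tol=1e-10):
    """Coordinate descent for min_b (1/n)||y - X b||_2^2 + (tau/n)||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # x_j' x_j for each column
    resid = y - X @ beta               # current residual
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (excluding j).
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            # Soft-threshold at tau/2, which follows from the scaling above.
            new = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

Setting `tau=0` reduces the problem to ordinary least squares, while a sufficiently large `tau` sets every coefficient exactly to zero; intermediate values produce the sparse solutions the slide refers to.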

Sparsity

Why is “sparsity” a reasonable assumption?

- The asset pricing (AP) literature has been using it without knowing it.
- It gives a parsimonious representation of expected returns, and hence potentially leads to better out-of-sample (OOS) performance.
- It is easier to interpret than alternative approaches, such as PCA.



The wrong way to use LASSO

- Take all my data (1825 assets, 99 factors, monthly, 1980-2016).
- What is the best parsimonious model that LASSO selects?
- Factors selected, first half of the sample: 1. Market; 36. Share Turnover; 47. Change in Tax Expense; 64. Industry Sales Concentration; 65. R&D to Market Capitalization; 68. Zero Trading Days; 69. Abnormal Earnings Announcement Volume
- Factors selected, second half of the sample: 16. Sales to Inventory; 17. Sales to Receivables; 30. Carhart Momentum; 48. Illiquidity; 62. Change in 6-Month Momentum; 73. Change in Shares Outstanding

The wrong way to use LASSO


- Now, let’s do this more systematically.
- Randomly draw a bootstrap subsample and check whether each factor is selected.

[Figure: selection rate (vertical axis, 0 to 0.8) against factor ID (horizontal axis, 0 to 80). Labeled factors: Sales to Inventory, Market, Carhart MOM, Growth in Capital Expenditure, Change in 6m MOM, Value, Size.]
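The resampling exercise can be sketched as follows. This is a hedged illustration rather than the authors' procedure: it assumes the bootstrap resamples the rows of a generic regression problem with replacement, uses a fixed penalty `tau`, and relies on a small coordinate-descent LASSO (`lasso_cd`, a hypothetical helper defined inline so the block is self-contained).

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=500, tol=1e-10):
    """Coordinate descent for min_b (1/n)||y - X b||_2^2 + (tau/n)||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def selection_rates(X, y, tau, n_draws=200, seed=0):
    """Fraction of bootstrap draws in which each column gets a nonzero LASSO coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_draws):
        idx = rng.integers(0, n, size=n)          # rows drawn with replacement
        counts += lasso_cd(X[idx], y[idx], tau) != 0.0
    return counts / n_draws
```

Plotting the returned vector against the column index gives a selection-rate chart of the kind shown on this slide. When several columns carry nearly the same information, their rates tend to split, which is the instability the slide illustrates.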

The wrong way to use LASSO

- LASSO is very unstable; it cannot be used to decide whether a factor is important.
- However, LASSO can still be used to select controls.
  - I don’t care about the identities of the controls.
  - Suppose A and B are the same factor. LASSO sometimes picks A, sometimes B.
  - I want to evaluate whether C improves over the existing factors (A and B).
  - For this, it doesn’t matter whether I pick A or B as the control.
- Lesson: machine learning can be used for some purposes but not others.

Naive Method

- Single-Selection Approach
  1. Do a cross-sectional LASSO of average returns on the sample covariances $\widehat{C}_h$ with a penalty on $\lambda_h$. The selected variables are collected in $I_1$.
  2. Do a cross-sectional regression of average returns on $\widehat{C}_g$ and $\widehat{C}_h[I_1]$.
- Inference on the coefficients will be wrong:
  - In any finite sample, the selected model is likely to be wrong.
  - Omitted variable bias arises because of missing true factors.

Our Inference Methodology

- Double-Selection Approach
  1. Do a cross-sectional LASSO of average returns on $\widehat{C}_h$. The factors whose corresponding columns of $\widehat{C}_h$ have nonzero coefficients are collected in $I_1$.
  2. Do another cross-sectional LASSO of $\widehat{C}_g$ onto $\widehat{C}_h$. The selected factors are collected in $I_2$.
  3. Estimate $\lambda_g$ by regressing average returns onto $\widehat{C}_g$ and $\widehat{C}_h[I_1 \cup I_2]$.
- We show this estimator is consistent and follows a CLT.
- We develop heteroscedasticity-robust standard errors.
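The three steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `lasso_cd` and `double_selection` are hypothetical names, the penalties `tau1` and `tau2` are taken as given rather than cross-validated, the intercept is handled by cross-sectional demeaning before each LASSO, and the robust standard errors mentioned on the slide are not computed.

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=1000, tol=1e-10):
    """Coordinate descent for min_b (1/n)||y - X b||_2^2 + (tau/n)||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def double_selection(rbar, Cg, Ch, tau1, tau2):
    """rbar: length-n average returns; Cg: n x d; Ch: n x p (sample covariances)."""
    n = rbar.shape[0]
    Ch_c = Ch - Ch.mean(axis=0)
    # Step 1: LASSO of average returns on Ch -> I1.
    I1 = set(np.flatnonzero(lasso_cd(Ch_c, rbar - rbar.mean(), tau1)).tolist())
    # Step 2: LASSO of each column of Cg on Ch -> I2.
    I2 = set()
    for k in range(Cg.shape[1]):
        target = Cg[:, k] - Cg[:, k].mean()
        I2 |= set(np.flatnonzero(lasso_cd(Ch_c, target, tau2)).tolist())
    # Step 3: OLS of average returns on an intercept, Cg, and Ch[I1 union I2].
    sel = np.array(sorted(I1 | I2), dtype=int)
    Z = np.column_stack([np.ones(n), Cg, Ch[:, sel]])
    coef = np.linalg.lstsq(Z, rbar, rcond=None)[0]
    lam_g = coef[1:1 + Cg.shape[1]]
    return lam_g, sel.tolist()
```

The second selection is what distinguishes this from the single-selection approach: controls that covary with $\widehat{C}_g$ enter the final regression even when the first LASSO misses them, which is the channel through which omitted-variable bias is reduced.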

Simulation Design

- Our DGP is the Fama-French five-factor model.
- $g_t$ contains three factors: one useful (CMA), one redundant, and one useless.
- $h_t$ is a large set of factors that includes four useful factors; the rest are useless or redundant factors.
- $p = 300$, $n = 500$, $T = 600$.

Simulation

[Figure: selection rate (vertical axis, 0 to 1) against simulated factor ID (horizontal axis, 1 to 300), in three panels: 1st Selection, 2nd Selection, Total Selection.]

Simulation

[Figure: six histogram panels comparing Double-Selection (left column) with Single-Selection (right column) for the useful, redundant, and useless factors; horizontal axis from -5 to 5, vertical axis from 0 to 0.4.]

Zoo of Factors

- We have collected (from Kenneth French’s website, authors’ websites, AQR, etc.) and constructed (using characteristics from Green et al. 2016) a total of 99 monthly factors covering Jul. 1980 to Dec. 2016.
- This covers all main anomaly categories: momentum, value-versus-growth, investment, profitability, intangibles, and trading frictions.
- To capture the potential nonlinearity of the SDF, we add as controls 197 factors: 99 squared terms of these primary risk factors, and 98 interaction terms between Small Minus Big and each other factor.
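The count of 197 nonlinear controls follows from 99 squares plus 98 interactions with Small Minus Big. A minimal sketch of that construction is below; the function name `nonlinear_controls` and the argument `smb_col` (the column index of Small Minus Big) are hypothetical, and whether the factors are demeaned or rescaled before forming these terms is not stated on the slide, so the sketch applies no such transformation.

```python
import numpy as np

def nonlinear_controls(F, smb_col):
    """F: T x K array of factor returns. Returns a T x (2K - 1) array:
    K squared terms followed by K - 1 interactions of SMB with every other factor."""
    squares = F ** 2
    others = [j for j in range(F.shape[1]) if j != smb_col]
    interactions = F[:, [smb_col]] * F[:, others]   # broadcast SMB against the rest
    return np.hstack([squares, interactions])
```

With K = 99 primary factors this returns 99 + 98 = 197 columns, matching the number of additional controls described above.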

Test Portfolios

- Portfolios vs. individual stocks
  - Pros: stable beta, high signal-to-noise ratio, no missing data
  - Cons: smaller sample size in the cross-section, selection bias
- We use a total of 1,825 portfolios as test assets: a standard set of 175 portfolios from Kenneth French’s website, plus 66 sets of bivariate-sorted 5 × 5 portfolios (66 × 25 = 1,650) formed on size (market equity) and each other characteristic.

Are New Factors Useful?

                            (1) DS             (2) SS             (3) FF3            (4) No Selection   (5) Avg. Ret.
id  Factor Description      λs (bp)  t-stat    λs (bp)  t-stat    λs (bp)  t-stat    λs (bp)  t-stat       (bp)  t-stat
84  Maximum Return              -50  -0.40        -134  -2.93***       15   0.43        -291  -1.03           3   0.16
85  Percent Accruals             -6  -0.25          17   1.36         -18  -1.81*        -34  -0.58          11   1.18
86  Cash Holdings              -102  -1.48          52   1.85*         26   1.24        -143  -0.87          19   1.34
87  HML Devil                    44   0.53          29   0.44        -108  -2.27**       -20  -0.10          27   1.55
88  Gross profitability         -22  -0.50          11   0.37          42   2.62***       50   0.49          21   1.71*
89  Organizational Capital      -44  -0.99         -46  -1.67*         46   2.77***       20   0.23          41   2.67***
90  Betting Against Beta        -17  -0.51         -41  -1.61          19   1.02         -22  -0.32          92   5.42***
91  Quality Minus Junk          118   2.84***       21   0.72          40   2.35**        84   0.95          50   4.09***
92  HXZ Investment               41   1.93*         -1  -0.05          -3  -0.27          25   0.58          38   4.15***
93  HXZ Profitability           136   4.00***       15   0.68          34   2.22**        10   0.18          57   4.51***
94  Employee Growth              14   0.36         -20  -0.92          -2  -0.13         -27  -0.32           7   0.63
95  RMW                         122   4.66***       -2  -0.09          22   1.62         105   1.46          36   3.08***
96  CMA                          64   2.37**         6   0.31           1   0.05          21   0.37          31   3.23***
97  Intermediary Capital        -35  -0.54         -67  -1.00          97   1.86*         92   0.58
98  Intermediary Investment      53   0.65         -71  -0.92          63   0.98          17   0.10         116   3.57***
99  Convertible Debt            -10  -0.59         -12  -0.90           6   0.61         -60  -1.70*          7   0.90

Stable Inference vs. Unstable Model Selection

- The list of factors selected as controls by LASSO does not have (nor does it need to have) a direct economic interpretation.
- We follow the statistical machine learning literature and choose tuning parameters using cross-validation.
- Selected models are, however, sensitive to the choice of tuning parameters.
- The inference about $\lambda_g$ is robust.

Resampling: Factor Selection Rates

[Figure: selection rate (vertical axis, 0 to 0.8) against factor ID (horizontal axis, 0 to 250).]

None of the 248 factors, except for the Market (No. 1) and Sales to Inventory (No. 16), is selected in more than 70% of the samples. Most of the factors are selected in only 5% to 30% of the subsamples.

Resampling: Factor t-statistics


[Figure: histograms of resampled t-statistics, one panel per factor (horizontal axis from -2 to 4, vertical axis from 0 to 0.5). Panels: Maximum Return, Percent Accruals, Cash Holdings, HML Devil, Gross profitability, Organizational Capital, Betting Against Beta, Quality Minus Junk, HXZ Investment, HXZ Profitability, Employee Growth, RMW, CMA, Intermediary Capital, Intermediary Investment, Convertible Debt.]

Evaluating Factors Recursively

- One of the motivations for using our methodology is that it can help distinguish useful from useless and redundant factors as they are introduced in the literature.
- Over time, this should help limit the proliferation of factors.
- Each year, we test whether the factors introduced during that year are useful or redundant relative to the factors existing up to then.

        (1)        (2)          (3)
Year    # Assets   # Controls   New factors (IDs)
1994    450        65           23 24
1995    500        71           25 26 27
1996    500        80           28 29
1997    550        86           30
1998    575        89           31 32 33 34 35 36
1999    725        107          37 38
2000    750        113          39 40 41 42
2001    800        125          43 44 45
2002    825        134          46 47 48
2003    875        143          49 50 51
2004    925        152          52 53 54 55 56
2005    1025       167          57 58 59 60 61
2006    1100       182          62 63 64 65 66 67
2007    1275       203          69 70 71
2008    1350       212          72 73 74 75
2009    1450       224          76 77 78 79
2010    1525       236          80 81 82 83
2011    1625       248          84 85
2012    1675       254          86
2013    1700       257          87 88 89
2014    1750       266          90 91 92 93 94
2015    1825       281          95 96
2016    1825       287          97 98 99

Robustness Checks

                            (1)                (2)                (3)                (4)                (5)
                            Bivariate 5 × 5    Sequential 5 × 5   Pre-1994           Tradable Factors   Elastic Net
id  Factor Description      λs (bp)  t-stat    λs (bp)  t-stat    λs (bp)  t-stat    λs (bp)  t-stat    λs (bp)  t-stat
84  Maximum Return              -50  -0.40        -108  -0.86        -167  -1.19          74   0.60        -103  -0.82
85  Percent Accruals             -6  -0.25         -19  -0.81           3   0.12         -38  -1.47         -11  -0.42
86  Cash Holdings              -102  -1.48        -143  -2.00**       -27  -0.40          59   0.73         -99  -1.40
87  HML Devil                    44   0.53         -22  -0.28         -35  -0.43           5   0.07          78   1.03
88  Gross profitability         -22  -0.50         -34  -0.77         -41  -0.92         -55  -1.03          -6  -0.11
89  Organizational Capital      -44  -0.99         -32  -0.75         -51  -1.20         -67  -1.52         -49  -1.15
90  Betting Against Beta        -17  -0.51         -25  -0.73         -42  -1.31          16   0.52          -2  -0.06
91  Quality Minus Junk          118   2.84***      107   2.60***      193   4.51***       31   0.81          91   2.28**
92  HXZ Investment               41   1.93*         48   2.28**        64   2.83***       48   2.10**        42   1.85*
93  HXZ Profitability           136   4.00***      121   3.95***      150   4.20***       80   2.43**        64   1.88*
94  Employee Growth              14   0.36          52   1.37         -11  -0.31          10   0.26          -2  -0.05
95  RMW                         122   4.66***      117   3.43***      117   2.77***       94   2.56***      111   3.53***
96  CMA                          64   2.37**        59   2.27**        74   2.92***       56   2.02**        19   0.70
97  Intermediary Capital        -35  -0.54         -29  -0.48         -16  -0.27         -31  -0.49         -36  -0.47
98  Intermediary Investment      53   0.65          74   0.90          27   0.32         -96  -1.17         -37  -0.40
99  Convertible Debt            -10  -0.59          -8  -0.50          13   0.95          14   1.14           6   0.30

Conclusion: Practical Lessons

1. Machine learning and model selection can make important mistakes
2. But these can be accounted for!
3. Strongest new factors: profitability and investments
4. Many others are redundant (including gross profitability, BAB)
5. Application to portfolio management: evaluate new anomalies against the existing set of possible investments
6. Application to evaluating fund managers: are you just repackaging existing risk exposures (styles, factors)? Are you adding value after properly measuring risk exposures?
   - Example: a fund that gives exposure to political shocks, not spanned by any existing factors

Link to the paper

- This paper: Taming the Factor Zoo
  https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2934020
- Related paper on how to construct hedging portfolios robust to missing factors
  https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2865922
