Taming the Factor Zoo with LASSO

Introduction Setup Using LASSO to select controls: Our Methodology Simulation Empirical
Taming the Factor Zoo
Guanhao Feng (City University of Hong Kong)
Stefano Giglio (Yale University)
Dacheng Xiu (University of Chicago)
November 13, 2017

Motivation
- Hundreds of risk factors or “anomalies” in the last 30 years, e.g., Harvey et al. (2015, RFS).
- How can we bring more discipline to this “zoo” of factors?
- How can we discriminate truly useful pricing factors from redundant and useless factors?
- Do we really have 300 factors, or 300 times the same factor?
- Suppose we have a “new” factor: is it truly new or redundant?
- We want to find new anomalies and new dimensions of risk

Motivation
- We consider 100 prominent factors introduced in the last 30 years
- 14 of them were published in top finance journals in the last 5 years alone

Factor                    Avg. ret. (bp)  t-stat
Percent Accruals                      11    1.18
Cash Holdings                         19    1.34
HML Devil                             27    1.55
Gross profitability                   21    1.71
Organizational Capital                41    2.67
Betting Against Beta                  92    5.42
Quality Minus Junk                    50    4.09
HXZ Investment                        38    4.15
HXZ Profitability                     57    4.51
Employee Growth                        7    0.63
RMW                                   36    3.08
CMA                                   31    3.23
Intermediary Investment              116    3.57
Convertible Debt                       7    0.90
Challenges
- How many are new? And new relative to what?
- What is the right benchmark? FF3 (Market-Size-Value)? Others?

                                FF3 (mkt/size/value)      Avg. Ret.
id  Factor Description          λs (bp)  tstat         avg.ret. (bp)  tstat
85  Percent Accruals                -18  -1.81*                   11  1.18
86  Cash Holdings                    26  1.24                     19  1.34
87  HML Devil                      -108  -2.27**                  27  1.55
88  Gross profitability              42  2.62***                  21  1.71*
89  Organizational Capital           46  2.77***                  41  2.67***
90  Betting Against Beta             19  1.02                     92  5.42***
91  Quality Minus Junk               40  2.35**                   50  4.09***
92  HXZ Investment                   -3  -0.27                    38  4.15***
93  HXZ Profitability                34  2.22**                   57  4.51***
94  Employee Growth                  -2  -0.13                     7  0.63
95  RMW                              22  1.62                     36  3.08***
96  CMA                               1  0.05                     31  3.23***
98  Intermediary Investment          63  0.98                    116  3.57***
99  Convertible Debt                  6  0.61                      7  0.90
Challenges
- How about controlling for all factors proposed by the literature?
  - Inefficient or even infeasible!
- We will use machine learning / model selection to reduce the dimensionality of the factor zoo
  - Use model selection to select the benchmark
- Which of the many machine learning models?
  - The most appropriate tool depends on your assumptions
  - If you believe there is a small-dimensional factor model, LASSO is the most appropriate (will review in a few slides)
  - We will use LASSO as the benchmark, but our results extend to other methods (elastic nets, random forests, etc.).
This paper
A new methodology that uses machine learning/model selection to decide whether a new factor is useful
Three key ideas of our method
1. Use model selection
2. ... to evaluate new factors relative to all existing factors
3. ... taking into account potential model-selection mistakes

This paper
- Use model selection
  - Rather than comparing the new factor with all 300 controls, use model selection to pick the “best model” out of those
  - This weeds out useless factors from the hundreds of existing ones
- ... to evaluate new factors relative to all existing factors
- ... taking into account potential model-selection mistakes

This paper

- What machine learning (LASSO) is not useful for
  - What are the identities of the “true” factors?
- However, we can use LASSO to control for the Sharpe ratio achievable with existing factors
  - Does my new factor contribute to increasing the Sharpe ratio?

This paper
- We want to ask: does the new factor improve over the LASSO-selected model?
- What if LASSO “forgets” some factors?
- We develop econometric methods robust to model-selection errors
Our Empirical Findings
- We show that several factors, such as profitability and investments, have statistically significant explanatory power beyond the hundreds of factors proposed in the past.
- The selected models are unstable (hence model-selection mistakes are inevitable), but our inference is stable.
- If our method had been applied year by year starting in 1994, 14 out of the 99 factors proposed afterwards would have been considered useful, while the large majority would have been identified as redundant or useless.
Model Setup
- Consider the observed factors $v_t = \{g_t, h_t\}$ and the true factor model, where $g_t$ is $d \times 1$ and $h_t$ is $p \times 1$:
  $m_t = \gamma_0^{-1} (1 - \lambda_v^\top v_t)$. (1)
- What determines expected returns?
  $E(r_t) - \iota_n \gamma_0 = C_v \lambda_v = C_g \lambda_g + C_h \lambda_h$,
  where $C_v$, $n \times (d + p)$, contains the (univariate) covariances between $r_t$, $n \times 1$, and $v_t$, $(d + p) \times 1$.
- $\lambda_v$ are risk prices: nonzero if a factor is useful
Model Setup
- Both useless and redundant factors have zero λs
- The difference is that, for redundant factors, the covariances with returns are cross-sectionally correlated with the true factors' exposures.
The Curse of Dimensionality
- If the set of control factors $h_t$ is small, this is easy
  - Run a cross-sectional regression of average returns on the sample covariances $\hat{C}_g$ and $\hat{C}_h$.
- If $h_t$ is high dimensional, such an estimator is inefficient or even infeasible
  - the convergence rate is $p^{1/2} n^{-1/2}$
  - it is infeasible if $p > n$
- Use variable selection to reduce the dimensionality of $h_t$ first
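The low-dimensional case described above can be sketched in a few lines. This is a minimal illustration on simulated data, not the authors' code: the function name `cross_sectional_regression`, the simulated returns, and the inclusion of an intercept for the zero-beta rate are all assumptions made for the example.

```python
import numpy as np

def cross_sectional_regression(returns, factors):
    """Regress average returns on sample covariances between returns and factors.

    returns: T x n array of test-asset returns (hypothetical input).
    factors: T x k array of factor realizations (hypothetical input).
    Returns (gamma0_hat, lambda_hat): the intercept and k risk-price estimates.
    """
    T, n = returns.shape
    r_bar = returns.mean(axis=0)                  # n average returns
    r_dm = returns - r_bar                        # demeaned returns
    f_dm = factors - factors.mean(axis=0)         # demeaned factors
    C_hat = r_dm.T @ f_dm / T                     # n x k sample covariances
    X = np.column_stack([np.ones(n), C_hat])      # intercept plus covariances
    coef, *_ = np.linalg.lstsq(X, r_bar, rcond=None)
    return coef[0], coef[1:]

# Simulated example: 3 factors, 50 test assets, 600 periods.
rng = np.random.default_rng(0)
T, n, k = 600, 50, 3
factors = rng.normal(size=(T, k))
betas = rng.normal(size=(n, k))
returns = factors @ betas.T + rng.normal(scale=0.5, size=(T, n))
gamma0_hat, lambda_hat = cross_sectional_regression(returns, factors)
```

With k + 1 regressors and n assets, the least-squares step only has a unique solution when k + 1 does not exceed n, which is the feasibility constraint the slide points to.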
LASSO (least absolute shrinkage and selection operator)
- Tibshirani (1996) proposes LASSO in a high-dimensional regression setting in which few coefficients are nonzero, the so-called “sparsity”:
  $y = X\beta + \varepsilon, \quad \|\beta\|_0 \le s$, where $X$ is $n \times p$ and $\beta$ is $p \times 1$.
  The LASSO estimator is defined as the solution to this convex optimization problem:
  $\hat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ n^{-1} \|y - X\beta\|_2^2 + \tau n^{-1} \|\beta\|_1 \right\}$
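One standard way to solve the LASSO problem stated above is cyclic coordinate descent with soft-thresholding. The sketch below is an illustration under that choice of solver, not something the slides prescribe; the names `soft_threshold` and `lasso_cd`, the fixed iteration count, and the simulated data are assumptions. Multiplying the objective by n shows that each coordinate update soft-thresholds at τ/2.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, tau, n_iter=200):
    """Minimise n^{-1} ||y - X b||_2^2 + tau n^{-1} ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # x_j' x_j for each column
    resid = y.astype(float).copy()           # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]       # remove coordinate j from the fit
            rho = X[:, j] @ resid            # x_j' (partial residual)
            beta[j] = soft_threshold(rho, tau / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]       # put the updated coordinate back
    return beta

# Simulated sparse regression: only the first 3 of 20 coefficients are nonzero.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -3.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = lasso_cd(X, y, tau=20.0)
selected = np.flatnonzero(beta_hat)          # indices with nonzero coefficients
```

The ℓ1 penalty sets small coefficients exactly to zero, which is what makes LASSO a selection device and not only a shrinkage estimator.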
Sparsity
Why is “sparsity” a reasonable assumption?
- The asset pricing (AP) literature has been using it without knowing it
- It gives a parsimonious representation of expected returns, and hence potentially leads to better out-of-sample (OOS) performance.
- It is easier to interpret than alternative approaches, such as PCA.

The wrong way to use LASSO
- Take all my data (1,825 assets, 99 factors, monthly, 1980-2016)
- What is the best parsimonious model that LASSO selects?
- Factors selected, first half of the sample: 1. Market; 36. Share Turnover; 47. Change in Tax Expense; 64. Industry Sales Concentration; 65. R&D to Market Capitalization; 68. Zero Trading Days; 69. Abnormal Earnings Announcement Volume
- Factors selected, second half of the sample: 16. Sales to Inventory; 17. Sales to Receivables; 30. Carhart Momentum; 48. Illiquidity; 62. Change in 6-Month Momentum; 73. Change in Shares Outstanding

- Now, let's do this more systematically
  - Randomly draw a bootstrap subsample and check whether each factor is selected
[Figure: selection rate (y-axis, 0 to 0.8) by factor ID (x-axis, 0 to 80) across bootstrap subsamples. Labeled factors: Sales to Inventory, Market, Carhart MOM, Growth in Capital Expenditure, Change in 6m MOM, Value, Size.]
- LASSO is very unstable and cannot be used to decide whether a factor is important
- However, LASSO can still be used to select controls
  - I don't care about the identities of the controls
  - Suppose A and B are the same factor. LASSO sometimes picks A, sometimes B
  - I want to evaluate whether C improves over the existing factors (A and B)
  - For this, it doesn't matter whether I pick A or B as the control
- Lesson: machine learning can be used for some purposes but not others
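The bootstrap exercise and the A-versus-B argument above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' procedure: the solver `lasso_cd` is a plain coordinate-descent LASSO, `selection_rates` and its defaults are hypothetical names and settings, and the data are simulated so that column 1 ("B") is nearly a copy of column 0 ("A").

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=100):
    """Coordinate descent for n^{-1} ||y - X b||_2^2 + tau n^{-1} ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def selection_rates(X, y, tau, n_boot=50, seed=0):
    """Share of bootstrap resamples (rows drawn with replacement) selecting each column."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        counts += lasso_cd(X[idx], y[idx], tau) != 0
    return counts / n_boot

# Simulated data: column 1 ("B") is almost identical to column 0 ("A").
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)
y = X[:, 0] + 0.5 * rng.normal(size=n)
rates = selection_rates(X, y, tau=40.0)
```

In a setup like this, the individual rates for A and B can move with the resample, while at least one of the two is picked every time. That is the sense in which the identity of the selected control is unreliable even when the information it carries is not.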
Naive Method
- Single-Selection Approach
  1. Run a cross-sectional LASSO of average returns on the sample covariances $\hat{C}_h$, with a penalty on $\lambda_h$. The selected variables are collected in $I_1$.
  2. Run a cross-sectional regression of average returns on $\hat{C}_g$ and $\hat{C}_h[I_1]$.
- Inference on the coefficients will be wrong
  - In any finite sample, the selected model is likely to be wrong
  - Omitted variable bias because of missing true factors
Our Inference Methodology
- Double-Selection Approach
  1. Run a cross-sectional LASSO of average returns on $\hat{C}_h$. The factors whose corresponding columns of $\hat{C}_h$ have nonzero coefficients are collected in $I_1$.
  2. Run another cross-sectional LASSO of $\hat{C}_g$ on $\hat{C}_h$. The selected factors are collected in $I_2$.
  3. Estimate $\lambda_g$ by regressing average returns on $\hat{C}_g$ and $\hat{C}_h[I_1 \cup I_2]$.
- We show this estimator is consistent and follows a CLT
- We develop heteroscedasticity-robust standard errors
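The three steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the coordinate-descent solver `lasso_cd`, the function name `double_selection`, the centring of the inputs, the penalty levels, and the simulated covariances are all assumptions, and the sketch returns point estimates only (it does not compute the robust standard errors mentioned on the slide).

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=100):
    """Coordinate descent for n^{-1} ||y - X b||_2^2 + tau n^{-1} ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def double_selection(r_bar, C_g, C_h, tau1, tau2):
    """Return (lambda_g_hat, kept control indices) from a double-selection sketch."""
    n, d = C_g.shape
    # Centre the inputs so the LASSO steps need no intercept (an assumption).
    r_c = r_bar - r_bar.mean()
    Cg_c = C_g - C_g.mean(axis=0)
    Ch_c = C_h - C_h.mean(axis=0)
    # Step 1: controls that help explain average returns.
    I1 = set(np.flatnonzero(lasso_cd(Ch_c, r_c, tau1)).tolist())
    # Step 2: controls that help explain the covariances of the new factors.
    I2 = set()
    for j in range(d):
        I2 |= set(np.flatnonzero(lasso_cd(Ch_c, Cg_c[:, j], tau2)).tolist())
    # Step 3: OLS of average returns on C_g and the union of selected controls.
    keep = sorted(I1 | I2)
    X = np.column_stack([np.ones(n), C_g, C_h[:, keep]])
    coef, *_ = np.linalg.lstsq(X, r_bar, rcond=None)
    return coef[1:1 + d], keep

# Simulated cross-section: 200 assets, 30 candidate controls, 1 new factor.
rng = np.random.default_rng(2)
n, p = 200, 30
C_h = rng.normal(size=(n, p))
C_g = (0.8 * C_h[:, 1] + 0.5 * rng.normal(size=n)).reshape(-1, 1)
r_bar = 0.5 * C_g[:, 0] + 1.0 * C_h[:, 0] + 0.1 * rng.normal(size=n)
lambda_g, keep = double_selection(r_bar, C_g, C_h, tau1=40.0, tau2=40.0)
```

The second selection step exists to catch controls that are correlated with the new factor's covariances; if such a control were missed by the first step, leaving it out would bias the estimate of the new factor's risk price.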
Simulation Design
- Our DGP is the Fama-French five-factor model.
- $g_t$ contains three factors: one useful (CMA), one redundant, and one useless.
- $h_t$ is a large set of factors that includes four useful factors; the rest are useless and redundant factors.
- $p = 300$, $n = 500$, $T = 600$.
Simulation
[Figure: selection rate (y-axis, 0 to 1) by simulated factor ID (x-axis, 50 to 300), in three panels: 1st Selection, 2nd Selection, Total Selection.]
Simulation
[Figure: six panels comparing Double-Selection (left) with Single-Selection (right) for the useful, redundant, and useless factors; x-axis from -5 to 5, y-axis from 0 to 0.4.]
Zoo of Factors
- We have collected (from Kenneth French's website, the authors' websites, AQR, etc.) and constructed (using characteristics from Green et al. 2016) a total of 99 monthly factors covering July 1980 to December 2016.
- These cover all main anomaly categories: momentum, value-versus-growth, investment, profitability, intangibles, and trading frictions.
- To capture the potential nonlinearity of the SDF, we add as controls 197 factors: 99 squared terms of these primary risk factors, and 98 interaction terms between Small Minus Big and each other factor.
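The count of 197 additional controls follows directly from the construction: 99 squares plus 98 interactions with Small Minus Big. A minimal sketch, assuming a hypothetical T x 99 factor matrix `F` filled with random numbers and a hypothetical column index `smb_col` for Small Minus Big:

```python
import numpy as np

def augment_factors(F, smb_col):
    """Build squared terms and SMB interaction terms from a T x k factor matrix."""
    squares = F ** 2                                       # k squared terms
    others = [j for j in range(F.shape[1]) if j != smb_col]
    interactions = F[:, [smb_col]] * F[:, others]          # k - 1 interactions with SMB
    return np.column_stack([squares, interactions])

# July 1980 to December 2016 is 438 months; the values here are placeholders.
rng = np.random.default_rng(0)
F = rng.normal(size=(438, 99))
extra = augment_factors(F, smb_col=1)                      # smb_col is hypothetical
n_extra = extra.shape[1]
```

Whether the original construction demeans or rescales the factors before squaring is not stated on the slide, so the sketch uses the raw values.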
Test Portfolios
- Portfolios vs. individual stocks
  - Pros: stable betas, high signal-to-noise ratio, no missing data
  - Cons: smaller sample size in the cross-section, selection bias
- We use a total of 1,825 portfolios as test assets: a standard set of 175 portfolios from Kenneth French's website, and 66 sets of bivariate-sorted 5 × 5 portfolios formed on size (market equity) and each other characteristic (175 + 66 × 25 = 1,825).
Are New Factors Useful?
                                 (1) DS              (2) SS              (3) FF3             (4) No Selection    (5) Avg. Ret.
id  Factor Description        λs (bp)  tstat (DS) λs (bp)  tstat (DS) λs (bp)  tstat      λs (bp)  tstat          (bp)  tstat
84  Maximum Return                -50  -0.40         -134  -2.93***        15  0.43          -291  -1.03             3  0.16
85  Percent Accruals               -6  -0.25           17  1.36           -18  -1.81*         -34  -0.58            11  1.18
86  Cash Holdings                -102  -1.48           52  1.85*           26  1.24          -143  -0.87            19  1.34
87  HML Devil                      44  0.53            29  0.44          -108  -2.27**        -20  -0.10            27  1.55
88  Gross profitability           -22  -0.50           11  0.37            42  2.62***         50  0.49             21  1.71*
89  Organizational Capital        -44  -0.99          -46  -1.67*          46  2.77***         20  0.23             41  2.67***
90  Betting Against Beta          -17  -0.51          -41  -1.61           19  1.02           -22  -0.32            92  5.42***
91  Quality Minus Junk            118  2.84***         21  0.72            40  2.35**          84  0.95             50  4.09***
92  HXZ Investment                 41  1.93*           -1  -0.05           -3  -0.27           25  0.58             38  4.15***
93  HXZ Profitability             136  4.00***         15  0.68            34  2.22**          10  0.18             57  4.51***
94  Employee Growth                14  0.36           -20  -0.92           -2  -0.13          -27  -0.32             7  0.63
95  RMW                           122  4.66***         -2  -0.09           22  1.62           105  1.46             36  3.08***
96  CMA                            64  2.37**           6  0.31             1  0.05            21  0.37             31  3.23***
97  Intermediary Capital          -35  -0.54          -67  -1.00           97  1.86*           92  0.58
98  Intermediary Investment        53  0.65           -71  -0.92           63  0.98            17  0.10            116  3.57***
99  Convertible Debt              -10  -0.59          -12  -0.90            6  0.61           -60  -1.70*            7  0.90
Stable Inference v.s. Unstable Model Selection
- It is important to remark that the list of factors selected as controls by LASSO does not have (nor does it need to have) a direct economic interpretation.
- We follow the statistical machine learning literature and choose tuning parameters using cross-validation.
- Selected models are, however, sensitive to the choice of tuning parameters.
- The inference about $\lambda_g$ is robust.
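Cross-validation for the tuning parameter can be sketched as a K-fold search over a grid of penalty levels. The slides do not specify the details, so everything here is an assumption for illustration: the fold count, the grid `taus`, the squared prediction error criterion, the coordinate-descent solver `lasso_cd`, and the simulated data.

```python
import numpy as np

def lasso_cd(X, y, tau, n_iter=100):
    """Coordinate descent for n^{-1} ||y - X b||_2^2 + tau n^{-1} ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - tau / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def cv_choose_tau(X, y, taus, n_folds=5, seed=0):
    """Pick the tau in `taus` with the lowest average held-out squared error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for tau in taus:
        fold_err = []
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False                 # hold this fold out
            beta = lasso_cd(X[train], y[train], tau)
            fold_err.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
        errors.append(np.mean(fold_err))
    return taus[int(np.argmin(errors))], np.array(errors)

# Simulated sparse regression with 2 relevant columns out of 15.
rng = np.random.default_rng(3)
n, p = 150, 15
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)
taus = [1.0, 5.0, 20.0, 50.0, 100.0, 400.0]
best_tau, cv_errors = cv_choose_tau(X, y, taus)
```

Rerunning the search with a different `seed` or grid can change which controls survive at the chosen penalty, which is the sensitivity the slide warns about.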
Resampling: Factor Selection Rates
[Figure: selection rate (y-axis, 0 to 0.8) by factor ID (x-axis, 0 to 250) across resampled subsamples.]
Except for the Market (No. 1) and Sales to Inventory (No. 16), none of the 248 factors is selected in more than 70% of the subsamples; most factors are selected in only 5% to 30% of the subsamples.
Resampling: Factor t-statistics

[Figure: sixteen panels, one per factor, each with x-axis from -2 to 4 and y-axis from 0 to 0.5. Panels: Maximum Return, Percent Accruals, Cash Holdings, HML Devil, Gross profitability, Organizational Capital, Betting Against Beta, Quality Minus Junk, HXZ Investment, HXZ Profitability, Employee Growth, RMW, CMA, Intermediary Capital, Intermediary Investment, Convertible Debt.]
Evaluating Factors Recursively
- One of the motivations for using our methodology is that it can help distinguish useful from useless and redundant factors as they are introduced in the literature.
- Over time, this should help limit the proliferation of factors.
- Each year, we test whether the factors introduced during that year are useful or redundant relative to the factors existing up to then.

      (1)       (2)         (3)
Year  # Assets  # Controls  New factors (IDs)
1994       450          65  23 24
1995       500          71  25 26 27
1996       500          80  28 29
1997       550          86  30
1998       575          89  31 32 33 34 35 36
1999       725         107  37 38
2000       750         113  39 40 41 42
2001       800         125  43 44 45
2002       825         134  46 47 48
2003       875         143  49 50 51
2004       925         152  52 53 54 55 56
2005      1025         167  57 58 59 60 61
2006      1100         182  62 63 64 65 66 67
2007      1275         203  69 70 71
2008      1350         212  72 73 74 75
2009      1450         224  76 77 78 79
2010      1525         236  80 81 82 83
2011      1625         248  84 85
2012      1675         254  86
2013      1700         257  87 88 89
2014      1750         266  90 91 92 93 94
2015      1825         281  95 96
2016      1825         287  97 98 99
Robustness Checks
                                 (1) Bivariate 5 × 5 (2) Sequential 5 × 5 (3) Pre-1994       (4) Tradable Factors (5) Elastic Net
id  Factor Description        λs (bp)  tstat (DS) λs (bp)  tstat (DS) λs (bp)  tstat (DS) λs (bp)  tstat (DS) λs (bp)  tstat (DS)
84  Maximum Return                -50  -0.40         -108  -0.86         -167  -1.19           74  0.60          -103  -0.82
85  Percent Accruals               -6  -0.25          -19  -0.81            3  0.12           -38  -1.47          -11  -0.42
86  Cash Holdings                -102  -1.48         -143  -2.00**        -27  -0.40           59  0.73           -99  -1.40
87  HML Devil                      44  0.53           -22  -0.28          -35  -0.43            5  0.07            78  1.03
88  Gross profitability           -22  -0.50          -34  -0.77          -41  -0.92          -55  -1.03           -6  -0.11
89  Organizational Capital        -44  -0.99          -32  -0.75          -51  -1.20          -67  -1.52          -49  -1.15
90  Betting Against Beta          -17  -0.51          -25  -0.73          -42  -1.31           16  0.52            -2  -0.06
91  Quality Minus Junk            118  2.84***        107  2.60***        193  4.51***         31  0.81            91  2.28**
92  HXZ Investment                 41  1.93*           48  2.28**          64  2.83***         48  2.10**          42  1.85*
93  HXZ Profitability             136  4.00***        121  3.95***        150  4.20***         80  2.43**          64  1.88*
94  Employee Growth                14  0.36            52  1.37           -11  -0.31           10  0.26            -2  -0.05
95  RMW                           122  4.66***        117  3.43***        117  2.77***         94  2.56***        111  3.53***
96  CMA                            64  2.37**          59  2.27**          74  2.92***         56  2.02**          19  0.70
97  Intermediary Capital          -35  -0.54          -29  -0.48          -16  -0.27          -31  -0.49          -36  -0.47
98  Intermediary Investment        53  0.65            74  0.90            27  0.32           -96  -1.17          -37  -0.40
99  Convertible Debt              -10  -0.59           -8  -0.50           13  0.95            14  1.14             6  0.30
Conclusion: Practical Lessons
1. Machine learning and model selection can make important mistakes
2. But these can be accounted for!
3. Strongest new factors: profitability and investments
4. Many others are redundant (including gross profitability, BAB)
5. Application to portfolio management: evaluate new anomalies against the existing set of possible investments
6. Application to evaluating fund managers: are you just repackaging existing risk exposures (styles, factors)? Are you adding value after properly measuring risk exposures?
   - Example: a fund that gives exposure to political shocks, not spanned by any existing factors
Link to the paper
- This paper: Taming the Factor Zoo
  https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2934020
- Related paper on how to construct hedging portfolios robust to missing factors
  https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2865922
