
Attention Intensity as Contrarian Factor Tilt

HangukQuant 1, 2 *

February 5, 2023


Abstract

There have been many efforts to use information from message boards, internet traffic
and other alternative sources of crowding data to make stock market decisions. We present
moderate evidence of excess returns from processing data on investor Google search activity. We
examine style exposure and significance after adjusting for Fama factors, and introduce
Fama-French, CAPM and performance-measure computations to the Russian Doll engine.

*1: hangukquant@gmail.com, hangukquant.substack.com
*2: DISCLAIMER: the contents of this work are not intended as investment, legal, tax or any other advice, and
are for informational purposes only. It is illegal to make unauthorized copies, forward them to an unauthorized user or to
post this article electronically without the express written consent of HangukQuant.

1 Introduction

We refer interested readers to the existing literature on the internet for background on using Google
Trends data for stock trading. Here we dive straight into the methodology, design choices and experimental
analysis of results. Updated code containing the factor analysis and mathematical computations may be found
in the attendant post. We test whether an increase in investor attention towards stocks negatively
predicts excess returns.

1.1 A Note of Precaution

There is absolutely no warranty or guarantee implied with this product. Use at your own risk. I
provide no guarantee that it will be functional, destructive or constructive in any sense of the word.
Use at your own risk. Trading is a risky operation.

1.2 Methodology

The interest over time (IOT) data on Google Trends is obtainable via their public API (at limited
rates), or via API services that implement rotating proxy servers. Our data is obtained from the
SERP API. When obtaining search queries from the IOT database, the values are scaled to integers
between 0 and 100 within the search period. Daily data are provided for periods of up to 9 months.
Shorter periods give cleaner data, since the magnitude of the rounding error is smaller. Additionally,
in financial markets, changes in search activity are likely important at local regions as opposed to
over a global range of periods lasting up to a few months, because investor attention has been shown
to last for up to a few days and to diminish quickly over weeks. When processing IOT data, it is
important that we only work with rate-of-change data, since the scaling at day t is affected by
days t' > t when querying over dates [0..t, ..t', ..T]. This is often mishandled in the existing
literature. Working with rate-of-change data still admits some lookahead bias, but it is greatly
diminished (a small numeric sketch of this rescaling effect is given after the data-acquisition code
below). To account for these quirks in the available data, while maintaining a fair resolution, we
obtain data in overlapping 6-month windows stepped forward 3 months at a time. The period under
test is 2004 ∼ 2020, for stocks present in the NDX today. We handle the survivorship bias in our
experimental analysis. Data is obtained via the following code, with the search query made as
'{stockName} stock', and simple processing methods are employed.

import os
import pytz
import asyncio
import functools
import numpy as np
import pandas as pd
from pprint import pprint
from pprint import pformat
from serpapi import GoogleSearch
from dotenv import load_dotenv
load_dotenv()

from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta
from quantyhlib.general_utils import load_file, save_file

def get_query(query, start_dt, end_dt):
    try:
        s = start_dt.strftime("%Y-%m-%d")
        e = end_dt.strftime("%Y-%m-%d")
        params = {
            "q": query,
            "date": f"{s} {e}",
            "tz": "0",
            "data_type": "TIMESERIES",
            "api_key": os.getenv("SERP_KEY"),
            "engine": "google_trends",
        }
        gsearch = GoogleSearch(params)
        results = gsearch.get_dict()
        iot_timeline = results["interest_over_time"]["timeline_data"]
        df = pd.DataFrame(iot_timeline)
        df = pd.concat([df.drop(columns="values"), df["values"].apply(lambda x: pd.Series(x[0]))], axis=1)
        df["datetime"] = df["timestamp"].apply(lambda ts: datetime.fromtimestamp(int(ts), tz=pytz.utc))
        df = df.reset_index(drop=True).set_index("datetime")
    except Exception as err:
        return pd.DataFrame()
    return df

async def get_insts_iot_data(insts):
    start_dt = datetime(2004, 1, 1, tzinfo=pytz.utc)
    end_dt = datetime(2020, 1, 1, tzinfo=pytz.utc)
    periods = []
    tmp = start_dt
    delta = 6
    while tmp < end_dt:
        # 6-month windows, stepped forward 3 months at a time to create overlaps
        periods.append((tmp, tmp + relativedelta(months=delta)))
        tmp += relativedelta(months=delta // 2)

    loop = asyncio.get_event_loop()

    for inst in insts:
        try:
            query = inst + " stock"
            # query_dfs = [get_query(query, period[0], period[1]) for period in periods]
            tasks = [
                loop.run_in_executor(
                    None,
                    functools.partial(get_query, query, period[0], period[1])
                ) for period in periods
            ]
            query_dfs = await asyncio.gather(*tasks)
            print(query_dfs)
            save_file(f"data/{inst}.dump", query_dfs)
        except Exception as err:
            print(inst, "ERROR")
            input(err)

if __name__ == "__main__":
    from data_service.data_master import DataMaster
    data_master = DataMaster()
    index_service = data_master.get_indices_service()
    equity_service = data_master.get_equity_service()
    comps = index_service.get_index_components("NDX")
    insts = list(comps.Code)
    asyncio.run(get_insts_iot_data(insts=insts))
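To make the rescaling and lookahead point above concrete, here is a minimal numeric sketch (the raw counts are hypothetical and are, of course, not observable from the API): Google Trends rescales each queried window so that its maximum is 100, so a later spike changes the levels reported for earlier days, whereas day-over-day ratios are largely unaffected.

import numpy as np

# Hypothetical raw search counts on four consecutive days (not observable).
raw = np.array([20.0, 25.0, 30.0, 60.0])

# IOT levels when querying only the first three days: rescaled so the max is 100.
iot_short = 100 * raw[:3] / raw[:3].max()   # [66.7, 83.3, 100.0]

# IOT levels when querying all four days: day 4's spike rescales the earlier days.
iot_long = 100 * raw / raw.max()            # [33.3, 41.7, 50.0, 100.0]

# Levels for days 1-3 differ across the two queries (the lookahead contamination),
# but the day-over-day rates of change for days 1-3 are identical, because the
# common scale factor cancels in the ratio.
print(iot_short[1:] / iot_short[:-1])       # [1.25, 1.2]
print(iot_long[1:3] / iot_long[:2])         # [1.25, 1.2]

# In practice the API reports rounded integers, so the cancellation is only
# approximate -- that residual is the remaining lookahead bias mentioned above.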

2 Data Processing

Unfortunately, many of our entries are '0'. To compute a rate of change on zero entries (which
would otherwise result in zero-division errors), we add a small value (ten) to all of the search interest
values. We compute the rate of change and, to get a stable signal for trading purposes, we apply a
double rolling mean. The signals are constructed by the following code:

def build_signals(inst, gtrend_dfs):
    date_pool = set().union(*[list(df.index) for df in gtrend_dfs])
    align_df = pd.DataFrame(index=sorted(date_pool))
    gtrend_sers = []
    for gtrend_df in gtrend_dfs:
        try:
            gtrend_df[["extracted_value"]] = gtrend_df[["extracted_value"]].astype("float")
            gtrend_sers.append(align_df.join(gtrend_df.extracted_value))
        except Exception as err:
            pass
    try:
        df = pd.concat(gtrend_sers, axis=1)
        df += 10
        df /= df.shift(1)  # rate of change
        df = df.apply(np.mean, axis=1)  # average over data points
        df = df.rolling(10).mean().rolling(10).mean()  # smoothed rate of change
    except Exception as err:
        print(inst, "no signal")
        df = pd.DataFrame()
    return df

A long portfolio is constructed using weights proportional to a stepwise rank function of the
negative interest values (see the sketch after this paragraph). As one may observe, our approach is
not designed to optimize the signal extraction. Our focus is on demonstrating an application of the
dataset under a rough approximator and subjecting it to sound quantitative analysis.
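As an illustration only, and not the engine's internal implementation (the weights are formed through the rank machinery in compute_forecasts shown later), a contrarian rank weighting over one day's smoothed attention signals might look like the following; the tickers and signal values are hypothetical.

import pandas as pd

# Hypothetical smoothed attention signals for one rebalance date.
signal = pd.Series({"AAPL": 1.04, "MSFT": 0.97, "NVDA": 1.10, "INTC": 0.97, "CSCO": 1.01})

# Rank the negated signal: the lowest attention growth receives the highest rank.
ranks = (-signal).rank(method="average", ascending=True)

# Long-only weights proportional to the rank, i.e. a stepwise function of -signal.
weights = ranks / ranks.sum()
print(weights)   # MSFT and INTC tied at the largest weight, NVDA at the smallest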

Using the Russian Doll framework, the backtest code can be run by the following:

import pytz
import asyncio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from datetime import datetime
from pprint import pprint
from pprint import pformat
from quantyhlib.alpha import Alpha
from quantyhlib.alpha import Amalgapha
from quantyhlib.general_utils import save_file, load_file
from data_service.data_master import DataMaster

'''
a query for the last 7 days will have hourly search trends (the so-called real-time data),
daily data is only provided for query periods shorter than 9 months and up to 36 hours before your search
>> in order to trade live you need to stitch historical and real-time data
'''
class GTREND(Alpha):

    def __init__(
        self,
        trade_range=None,
        instruments=[],
        execrates=None,
        commrates=None,
        longswps=None,
        shortswps=None,
        dfs={},
        positional_inertia=0
    ):
        super().__init__(
            trade_range=trade_range,
            instruments=instruments,
            execrates=execrates,
            commrates=commrates,
            longswps=longswps,
            shortswps=shortswps,
            dfs=dfs,
            positional_inertia=positional_inertia
        )

    def param_generator(self, shattered):
        return super().param_generator(shattered=shattered)

    async def compute_signals_unaligned(self, shattered=True, param_idx=0, index=None):
        alphas = []
        aligner = pd.DataFrame(index=index)
        for inst in self.instruments:
            tmp = self.dfs[inst + "_gtrend"].rename(inst) \
                if not len(self.dfs[inst + "_gtrend"]) == 0 \
                else pd.Series(index=index, name=inst)
            alphas.append(tmp)
        alphadf = pd.concat(alphas, axis=1)
        alphadf.columns = self.instruments
        self.pad_ffill_dfs["alphadf"] = alphadf
        return

    async def compute_signals_aligned(self, shattered=True, param_idx=0, index=None):
        return

    '''Expose strategy custom signal dfs, invriskdf, eligiblesdf'''
    def instantiate_eligibilities_and_strat_variables(self, delta_lag=0):
        eligibles = []
        self.alphadf = self.pad_ffill_dfs["alphadf"]
        for inst in self.instruments:
            inst_eligible = (~np.isnan(self.alphadf[inst])) \
                & self.activedf[inst] \
                & (self.voldf[inst] > 0).astype("bool") \
                & (self.baseclosedf[inst] > 0)
            eligibles.append(inst_eligible)
        self.invriskdf = np.log(1 / self.voldf) / np.log(1.3)
        self.eligiblesdf = pd.concat(eligibles, axis=1)
        self.eligiblesdf.columns = self.instruments
        self.eligiblesdf.astype("int8")
        self.rankdf = (-1 * self.alphadf).rank(
            axis=1, method="average", na_option="keep", ascending=True
        )
        return

    def compute_forecasts(self, portfolio_i, date, eligibles_row):
        return self.rankdf.loc[date], np.sum(eligibles_row)

    def post_risk_management(self, nominal_tot, positions, weights, eligibles_i=None, eligibles_row=None, *args, **kwargs):
        return nominal_tot, positions, weights

async def main():
    from os import listdir
    from os.path import isfile, join
    load_dump = True
    if not load_dump:
        datafiles = [f for f in listdir("./data") if isfile(join("./data", f))]
        insts = [datafile.split(".")[0] for datafile in datafiles]
        gtrend_dfss = [load_file(f"./data/{datafile}") for datafile in datafiles]
        # build_signals is defined in the data-processing listing above
        signal_dfs = [build_signals(inst, gtrend_dfs) for inst, gtrend_dfs in zip(insts, gtrend_dfss)]
        save_file("gtrendsignals.dump", (insts, signal_dfs))
    else:
        (insts, signal_dfs) = load_file("gtrendsignals.dump")

    period_start = datetime(2004, 1, 1, tzinfo=pytz.utc)
    period_end = datetime(2020, 1, 1, tzinfo=pytz.utc)

    if not load_dump:
        data_master = DataMaster()
        equity_service = data_master.get_equity_service()
        print(insts, len(insts))
        insts_dfs = await equity_service.asyn_batch_get_ohlcv(
            tickers=insts,
            read_db=False,
            insert_db=False,
            granularity="d",
            engine="eodhistoricaldata",
            period_start=period_start,
            period_end=period_end
        )
        print(insts_dfs)
        save_file("ohlcv.dump", (insts, insts_dfs, period_start, period_end))
    else:
        (insts, insts_dfs, period_start, period_end) = load_file("ohlcv.dump")

    # exc: {'CEG', 'GFS', 'RIVN', 'LCID', 'ABNB'} listed > 2020
    print("exc:", set(insts).difference(set([inst for inst, inst_df in zip(insts, insts_dfs) if len(inst_df) != 0])))
    insts = [inst for inst, inst_df in zip(insts, insts_dfs) if len(inst_df) != 0]
    signal_dfs = [signal_df for signal_df, inst_df in zip(signal_dfs, insts_dfs) if len(inst_df) != 0]
    insts_dfs = [df for df in insts_dfs if len(df) != 0]

    dfs = {}
    for inst, inst_df, signal_df in zip(insts, insts_dfs, signal_dfs):
        dfs[inst + "%USD"] = inst_df.reset_index(drop=True).set_index("datetime")
        dfs[inst + "%USD_gtrend"] = signal_df

    gtrendstrat = GTREND(
        trade_range=(period_start, period_end),
        instruments=[inst + "%USD" for inst in insts],
        execrates=[0.001] * len(insts),
        dfs=dfs
    )
    df = await gtrendstrat.run_simulation(verbose=True)

if __name__ == "__main__":
    asyncio.run(main())

3 Results

We present strategy results, which are not to be taken without the attendant commentary. First,
log returns are presented. Results are shown assuming no friction. Trade results assuming 0.1%
market spread are also presented.

sortino: 1.632 sharpe: 1.207 mean ret: 0.252
median ret: 0.394 stdev ret: 0.209 var ret: 0.044
skew ret: -0.531 kurt exc: 1.835 cagr: 0.258
omega(0): 1.223 VaR95: -0.036 cVaR95: -0.046
gain to pain: 1.451 directionality: 1.0
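For readers without the engine, a minimal sketch of how such summary statistics can be computed from a daily log-return series follows. It assumes 252 trading days per year, a zero risk-free rate, and gain-to-pain taken as the sum of returns over the sum of absolute losses; the engine's exact conventions may differ.

import numpy as np
import pandas as pd

def summary_stats(rets: pd.Series, ann: int = 252) -> dict:
    # rets: daily log returns of the strategy
    mu = rets.mean() * ann
    downside = rets[rets < 0]
    var95 = rets.quantile(0.05)
    return {
        "sharpe": mu / (rets.std() * np.sqrt(ann)),
        "sortino": mu / (downside.std() * np.sqrt(ann)),
        "cagr": np.exp(rets.sum() * ann / len(rets)) - 1,
        "skew": rets.skew(),
        "kurt_exc": rets.kurtosis(),                  # pandas reports excess kurtosis
        "VaR95": var95,                               # 5th percentile of daily returns
        "cVaR95": rets[rets <= var95].mean(),
        "gain_to_pain": rets.sum() / downside.abs().sum(),
    }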

Overall, the results indicate good risk-adjusted returns, significant left skew, and leptokurtic returns
with a high CAGR, with volatility targeted to 0.20. The return density histogram is shown, with the median in
dotted red and the mean in solid red.

Left skew is indicated both by the Pearson skew computation and by the mean sitting to the left of the
median value. The directionality figure indicates that our portfolio is constructed long only. As a result,
the results at this stage have no specific implications for the value-add of the strategy: our simulation
consists of a long basket of NDX components, picked (survivors) from among the largest and most
successful tech stocks. It would be more useful to analyze its factor loading to the components,
which we will conduct later. First, we present other statistical exhibits of our trading simulations
made available in the Russian Doll engine.
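For reference, the mean/median version of the Pearson skew referred to above can be written as skew ≈ 3 × (mean − median) / stdev; with the mean below the median this quantity is negative, agreeing in sign with the skew figure reported in the table.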

In order to see that our strategy is not replicating a buy-and-hold portfolio of a small subset of
names, we can observe its sum of weights over time. We see that the positions are evenly distributed
among the NDX components.

We also present 1-y rolling drawdowns, 1-y rolling max drawdowns, 3-y rolling CAGR, 3-y
rolling Calmar ratios, and the Ulcer index, included as part of our backtest engine:

As mentioned, since our portfolio is a long basket consisting of stocks that were successful
in the first place, we would like to see how our strategy fared against the broader market, and
what proportion of returns is attributable to beta effects. We operate under a zero-rate (risk-free)
assumption. The Russian Doll engine now supports a one-factor model against the S&P 500 index.
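A minimal sketch of such a one-factor regression using statsmodels is given below; the function and series names are placeholders, and the engine's internal implementation may differ.

import pandas as pd
import statsmodels.api as sm

def one_factor_model(strat_rets: pd.Series, mkt_rets: pd.Series):
    # Regress daily strategy returns on market returns (risk-free rate taken as zero).
    df = pd.concat([strat_rets.rename("y"), mkt_rets.rename("x")], axis=1).dropna()
    X = sm.add_constant(df["x"])          # adds the intercept, i.e. the daily alpha
    res = sm.OLS(df["y"], X).fit()
    return res                            # res.params["const"] = alpha, res.params["x"] = beta

# e.g. print(one_factor_model(strategy_returns, spx_returns).summary())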

Sure enough, it has significant factor loading to the S&P 500.

Dep. Variable: y R-squared: 0.614
Model: OLS Adj. R-squared: 0.614
Method: Least Squares F-statistic: 6379.
Date: Sat, 04 Feb 2023 Prob (F-statistic): 0.00
Time: 09:30:13 Log-Likelihood: 13604.
No. Observations: 4015 AIC: -2.720e+04
Df Residuals: 4013 BIC: -2.719e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P> |t| [0.025 0.975]
Intercept 0.0007 0.000 5.453 0.000 0.000 0.001
x 0.9065 0.011 79.868 0.000 0.884 0.929

We are interested in how much excess return is introduced by our signal tilt. The intercept
is strongly significant with a t-statistic of 5.453. Annualized alpha is 17.8 percent. However, this is a
non-representative factor model of our universe constituents. A more suitable null strategy would
be an equal-weighted portfolio of all constituents of the strategy under concern. A parity index
consisting of the strategy instruments is constructed as the market, and the one-factor analysis
is repeated.
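As a quick check on the annualization: with roughly 252 trading days per year and the daily intercept of about 0.0007 from the table, 252 × 0.0007 ≈ 0.176, consistent with the reported 17.8 percent once the rounding of the displayed intercept is taken into account.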

Dep. Variable: y R-squared: 0.791
Model: OLS Adj. R-squared: 0.791
Method: Least Squares F-statistic: 1.516e+04
Date: Sat, 04 Feb 2023 Prob (F-statistic): 0.00
Time: 09:30:38 Log-Likelihood: 14833.
No. Observations: 4015 AIC: -2.966e+04
Df Residuals: 4013 BIC: -2.965e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P> |t| [0.025 0.975]
Intercept 0.0002 9.52e-05 1.832 0.067 -1.22e-05 0.000
x 1.1383 0.009 123.116 0.000 1.120 1.156

We see that the excess returns are only moderately significant after adjusting for the asset universe,
with a p-value of 0.07. The R-squared fit increases, and the excess return from the factor tilt is only 4
percent per annum. There is still some evidence that Google Trends data has the ability to
predict future returns. Although some may be satisfied to conclude the study at this point, a trader
should always be looking to express his edge as cheaply and with as much diversification as possible.
We shall therefore question whether our signals represent classical style exposure to well-known factors. Our
Russian Doll engine now supports multi-factor modelling of strategy components, for conducting
studies such as Fama-French style analysis at daily resolution. The price-to-book ratio and
market cap are constructed on a last-publicly-available-information basis, and the factors are
computed from the returns of the highest third less the lowest third for each relevant characteristic.
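A rough sketch of how such a tercile-spread factor return series might be built from a panel of daily returns and a lagged characteristic (market cap for the size factor, price-to-book for the value-minus-growth factor) is given below; the sign conventions in the usage comments are assumptions, and the engine's definitions may differ.

import pandas as pd

def tercile_factor(rets: pd.DataFrame, char: pd.DataFrame) -> pd.Series:
    # rets: daily returns (dates x tickers); char: lagged characteristic on the same grid.
    # Returns the daily "highest third minus lowest third" factor return.
    factor = {}
    for dt in rets.index:
        c = char.loc[dt].dropna()
        if len(c) < 3:
            continue
        lo, hi = c.quantile(1 / 3), c.quantile(2 / 3)
        top = rets.loc[dt, c[c >= hi].index].mean()   # equal-weighted top tercile
        bot = rets.loc[dt, c[c <= lo].index].mean()   # equal-weighted bottom tercile
        factor[dt] = top - bot
    return pd.Series(factor).sort_index()

# Possible usage (sign conventions are assumptions):
# smb = tercile_factor(daily_rets, -market_cap_lagged)        # small minus big
# vmg = tercile_factor(daily_rets, 1 / price_to_book_lagged)  # value minus growth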

Below are the statistical tests for the small-minus-big (smb) factor:

Dep. Variable: y R-squared: 0.091
Model: OLS Adj. R-squared: 0.091
Method: Least Squares F-statistic: 401.1
Date: Sat, 04 Feb 2023 Prob (F-statistic): 4.07e-85
Time: 09:30:38 Log-Likelihood: 11885.
No. Observations: 4015 AIC: -2.377e+04
Df Residuals: 4013 BIC: -2.375e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P> |t| [0.025 0.975]
Intercept 0.0009 0.000 4.732 0.000 0.001 0.001
x 0.0207 0.001 20.027 0.000 0.019 0.023

showing a statistically significant positive factor loading on the smb factor.

Dep. Variable: y R-squared: 0.080
Model: OLS Adj. R-squared: 0.080
Method: Least Squares F-statistic: 350.0
Date: Sat, 04 Feb 2023 Prob (F-statistic): 5.93e-75
Time: 09:30:39 Log-Likelihood: 11861.
No. Observations: 4015 AIC: -2.372e+04
Df Residuals: 4013 BIC: -2.371e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P> |t| [0.025 0.975]
Intercept 0.0005 0.000 2.369 0.018 8.21e-05 0.001
x -0.0187 0.001 -18.709 0.000 -0.021 -0.017

We see a statistically significant negative factor loading on the vmg (value-minus-growth) factor. This is no
surprise, given the outperformance of big growth companies in the past decade. Now, building a
multi-factor model:

Dep. Variable: y R-squared: 0.795
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 5187.
Date: Sat, 04 Feb 2023 Prob (F-statistic): 0.00
Time: 09:30:39 Log-Likelihood: 14876.
No. Observations: 4015 AIC: -2.974e+04
Df Residuals: 4011 BIC: -2.972e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P> |t| [0.025 0.975]
Intercept 6.261e-05 9.5e-05 0.659 0.510 -0.000 0.000
mkt 1.1221 0.010 112.612 0.000 1.103 1.142
smb -0.0009 0.001 -1.634 0.102 -0.002 0.000
vmg -0.0045 0.000 -9.242 0.000 -0.005 -0.004

We see that the intercept is no longer statistically significant when adjusted for market, growth and size
factors, despite the excess return being marginally positive. We also see that the size factor diminishes
in significance, and its beta flips sign, when the market and growth regressors are included in the model.
There is some quantitative evidence that our excess return is due to stylistic exposure to smaller-size
and larger-growth tilts within the NDX basket. It may be possible for an investor to add these
style exposures to her portfolio with a cheaper replicating style ETF than the strategy proposed.

A summary of the results from the factor models is presented, with spx denoting the
factor model against the S&P 500 and mkt denoting the factor model against an equal-weight basket of our
strategy's instruments:

spxalpha: 0.001 spxbeta: 0.906 spxjensen: 0.178
spxtreynor: 0.278 sharpe smb: 0.245 sharpe vmg: -2.228
sharpe mkt: 1.12 mktcor: 0.889 mktalpha: 0.0
mktbeta: 1.138 mktjensen: 0.044 mkttreynor: 0.222
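For reference, under the zero-rate assumption the Jensen figures appear to be the annualized regression intercepts (0.178 and 0.044, matching the one-factor and parity-factor alphas above), while the Treynor ratios appear to be the annualized mean return divided by the respective beta: 0.252 / 0.906 ≈ 0.278 and 0.252 / 1.138 ≈ 0.222.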

It is notable that some figures presented might seem fairly absurd without context. For instance,
the sharpe of the value-minus-growth factor was −2.228; put otherwise, the growth-minus-value
approach would have achieved significant risk-adjusted returns. However, this is not surprising,
since the NDX components consist of precisely the basket of growth stocks that performed well in
the past. One needs to be mindful of the sampling method used in obtaining the case study. Much
of the financial literature fails to address such concerns, and I cast serious doubt on the validity of their
results and statistical significance.

Another feature introduced allows us to observe the returns of our alpha model against
the parity basket across the range of return values. We can plot the kernel-smoothed CDF of returns
generated by the alpha model (orange) and the parity model (blue).
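A sketch of how such a kernel-smoothed CDF comparison can be produced with scipy is shown below; the return series passed in are placeholders for the alpha-model and parity-model daily returns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_smoothed_cdfs(alpha_rets, parity_rets):
    lo = min(np.min(alpha_rets), np.min(parity_rets))
    hi = max(np.max(alpha_rets), np.max(parity_rets))
    grid = np.linspace(lo, hi, 500)
    for rets, color, label in [(parity_rets, "tab:blue", "parity"), (alpha_rets, "tab:orange", "alpha")]:
        kde = gaussian_kde(rets)
        # the smoothed CDF at x is the integral of the kernel density up to x
        cdf = [kde.integrate_box_1d(-np.inf, x) for x in grid]
        plt.plot(grid, cdf, color=color, label=label)
    plt.xlabel("daily return")
    plt.ylabel("smoothed CDF")
    plt.legend()
    plt.show()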

Unfortunately, stochastic dominance is not observed. However, for thresholds past zero, the
probability of the alpha model obtaining returns at least as great as the threshold is at least as large
as that of the parity model.

Last but not least, simulated against a no-inertia position-switching portfolio with 0.01% market
spread, we can obtain the cost sensitivity of our strategy to its inherent turnover. All Monte Carlo
permutation hypothesis tests for asset picking, asset timing and decision making returned p-values
of 0.20 ∼ 0.30, indicating only weak-to-moderate evidence of value-add.
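The exact permutation schemes are internal to the engine; as a generic illustration of the idea (shuffle the timing of the signal, re-score, and locate the realized statistic within the permuted distribution), a minimal sketch over aligned weight and next-day return matrices might look as follows.

import numpy as np

def timing_permutation_pvalue(weights, rets, n_perm=1000, seed=0):
    # weights, rets: aligned arrays (dates x assets) of portfolio weights and next-day returns.
    # Returns the fraction of shuffled-timing runs scoring at least as well as the realized run.
    rng = np.random.default_rng(seed)
    realized = np.nansum(weights * rets)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(weights))      # shuffle the dates of the weight matrix
        if np.nansum(weights[perm] * rets) >= realized:
            hits += 1
    return (hits + 1) / (n_perm + 1)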

4 Conclusion and Future Works

The concerns of this paper are multifold: we wanted to demonstrate the potential use of Google
Trends data to construct market signals, and to show the use of sound quantitative analysis
and return attribution to market and style factors. It is possible that many strategies available
online are derivative of classical style exposures meshed together. In such cases, replicating the style
exposures with an ETF might offer a cheaper alternative to replicating the strategy. Diversification
can be decomposed across assets, geography, time, exposure and more; stylistic diversification is
important. This machinery is now available under the Russian Doll engine for traders to
analyze their own proprietary ideas.

Author's Note: hopefully, the content within and the discussions from previous work such as in
paper link indicate the difficulty of finding true, stable alphas. I hope that the commentary from
these works trains you to adopt a skeptical view of the academic literature that is easily accessible
to us: one should maintain a simultaneously open and cynical attitude towards such works. To
find stable alphas, it is usually necessary to begin with a hypothesis about markets and where the edge
might lie. Alternatively, one may embark on quantitative data analysis, but singular alphas found
from such numerical studies tend to be unstable. In such endeavours, ensemble models and large
samples are required for profitability. Often, small edges can be found both through personal
research and through ubiquitously available literature. Although these are not tradable 'as is', the
combination and accumulation of edges can often be useful as trading signals. Alternatively, they can
introduce portfolio 'tilts' when layered on top of active risk management strategies. With respect
to this paper, the trend data was not used in relation to the directionality of prevailing market
trends. To improve upon the signals, future works may include adjustments to signal generation,
combining investor interest trends with momentum signals to adjust for information
propagation, and so on.
