HangukQuant1, 2 *
February 5, 2023
Abstract
There have been many efforts to use information from message boards, internet traffic, and other alternative sources of crowding data to make stock market decisions. We present moderate evidence of excess returns from processing data on investor Google search activity. We examine style exposure and significance after adjusting for Fama factors, and introduce Fama-French, CAPM, and performance-measure computations to the Russian Doll engine.
*1: hangukquant@gmail.com, hangukquant.substack.com
*2: DISCLAIMER: the contents of this work are not intended as investment, legal, tax, or any other advice, and are for informational purposes only. It is illegal to make unauthorized copies, forward to an unauthorized user, or to post this article electronically without express written consent from HangukQuant.
1 Introduction
We refer interested readers to the existing literature available online for background on using Google Trends data for stock trading. Here we dive straight into the methodology, design choices, and experimental analysis of results. Updated code containing the factor analysis and mathematical computations may be found in the attendant post. We test whether an increase in investor attention towards stocks negatively predicts excess returns.
There is absolutely no warranty or guarantee implied with this product; I provide no guarantee that it will be functional, destructive, or constructive in any sense of the word. Use at your own risk. Trading is a risky operation.
1.2 Methodology
The interest over time (IOT) data on Google Trends is obtainable via their public API (rate-limited), or via third-party API services implementing rotating proxy servers. Our data is obtained from the SERP API. When obtaining search queries from the IOT database, the values are scaled to integers between 0 and 100 within the search period. Daily data are provided for periods up to 9 months. However, shorter periods have more clarity, since the magnitude of rounding errors is smaller. Additionally, in financial markets, search changes are likely important over local regions as opposed to over a global range of periods lasting up to a few months, because investor attention is shown to last for up to a few days and diminish quickly over weeks. When processing IOT data, it is important that we only work with rate-of-change data, since the scaling at day t is affected by days t′ > t when querying for dates [0..t, ..t′, ..T]. This is often mishandled in the existing literature. Working with rate-of-change data still takes in some lookahead bias, but this is greatly diminished. To account for such quirks in the available data, while maintaining fair resolution, we query 6 months of data at a time, with consecutive windows overlapping by 3 months. The period under test is 2004 ∼ 2020, for stocks present in the NDX today; we handle the survivorship bias in our experimental analysis. Data is obtained via the following code, with the search query made as ’{stockName} stock’, and simple processing methods are employed.
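To illustrate why only rate-of-change data is safe to work with, consider a toy example (entirely synthetic; the shape of the interest curve and the rounding model are our own assumptions, not Google's exact behaviour): two overlapping query windows receive different 0-100 scalings, so their levels disagree, while their within-window rates of change agree up to rounding.

```python
import numpy as np
import pandas as pd

# Synthetic "true" interest: a spike early on (only inside window 1) plus a
# mild seasonal wave. The curve and the rounding model are our assumptions.
t = np.arange(120)
true_interest = pd.Series(60 + 30 * np.exp(-((t - 10) / 8.0) ** 2) + 5 * np.sin(t / 7))

def google_scale(window: pd.Series) -> pd.Series:
    # Each queried window is rescaled so its own max is 100, then rounded.
    return (window / window.max() * 100).round()

w1 = google_scale(true_interest.iloc[:90]).reset_index(drop=True)    # days 0-89
w2 = google_scale(true_interest.iloc[30:]).reset_index(drop=True)    # days 30-119

# Levels disagree on the overlap (days 30-89) because the two windows were
# scaled against different maxima...
overlap_lvl_gap = np.abs(w1.iloc[30:90].values - w2.iloc[:60].values).mean()

# ...but rates of change within each window agree up to rounding error, which
# is why stitching overlapping windows on rate-of-change data is safe.
r1 = w1.pct_change().iloc[31:90].values
r2 = w2.pct_change().iloc[1:60].values
```

The level gap is large while the rate-of-change series nearly coincide, mirroring the lookahead contamination described above.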
import os
import pytz
import asyncio
import functools
import numpy as np
import pandas as pd
from pprint import pprint
from pprint import pformat
from serpapi import GoogleSearch
from dotenv import load_dotenv

load_dotenv()
        df = df.reset_index(drop=True).set_index("datetime")
    except Exception as err:
        return pd.DataFrame()
    return df
data_master = DataMaster()
index_service = data_master.get_indices_service()
equity_service = data_master.get_equity_service()
comps = index_service.get_index_components("NDX")
insts = list(comps.Code)
asyncio.run(get_insts_iot_data(insts=insts))
2 Data Processing
Unfortunately, many of our entries are ‘0’. To compute the rate of change on zero entries (which would otherwise result in zero-division errors), we add a small value (ten) to all of the search interest values. We compute the rate of change, and to get a stable signal for trading purposes, we apply a double rolling mean. The signals are constructed from the following code:
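A minimal sketch of the processing just described (the 5-day rolling windows are our own illustrative choice, not necessarily the engine's parameters):

```python
import numpy as np
import pandas as pd

def iot_signal(iot: pd.Series, shift: float = 10.0, window: int = 5) -> pd.Series:
    # Add a small constant so zero entries do not cause division-by-zero
    # in the rate-of-change computation.
    shifted = iot + shift
    roc = shifted.pct_change()
    # Double rolling mean to stabilise the signal for trading purposes.
    return roc.rolling(window).mean().rolling(window).mean()

# Toy interest-over-time series with the zero entries typical of the data.
iot = pd.Series([0, 0, 12, 30, 25, 0, 8, 40, 33, 21, 0, 5, 18, 27, 9], dtype=float)
sig = iot_signal(iot)
```

The first valid signal value appears only after both rolling windows fill; all subsequent values are finite because the shift removes the zeros.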
A long portfolio is constructed using weights proportional to a stepwise rank function of the negated interest values. As one may observe, our approach is not designed to optimize signal extraction; our focus is on demonstrating an application of the dataset under a rough approximator and subjecting it to sound quantitative analysis.
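A hedged sketch of such a rank-based long-only weighting (the engine's exact stepwise function may differ; the tickers and signal values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily signals (rate of change of search interest) for a few names.
signals = pd.DataFrame(
    {"AAPL": [0.02, -0.01], "MSFT": [-0.03, 0.00], "NVDA": [0.01, 0.04]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)

# Rank the negated signal cross-sectionally: the most negative attention
# change receives the highest rank, hence the largest long weight.
ranks = (-signals).rank(axis=1, method="average", na_option="keep", ascending=True)
weights = ranks.div(ranks.sum(axis=1), axis=0)   # long-only, each row sums to 1
```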
Using the Russian Doll framework, the backtest code can be run by the following:
import pytz
import asyncio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
'''
query for the last 7 days will have hourly search trends (the so-called real time data),
daily data is only provided for query periods shorter than 9 months and up to 36 hours before your search
>> in order to trade live you need to stitch historical and real time data
'''
class GTREND(Alpha):

    def __init__(
        self,
        trade_range=None,
        instruments=[],
        execrates=None,
        commrates=None,
        longswps=None,
        shortswps=None,
        dfs={},
        positional_inertia=0
    ):
        super().__init__(
            trade_range=trade_range,
            instruments=instruments,
            execrates=execrates,
            commrates=commrates,
            longswps=longswps,
            shortswps=shortswps,
            dfs=dfs,
            positional_inertia=positional_inertia
        )
    def instantiate_eligibilities_and_strat_variables(self, delta_lag=0):
        eligibles = []
        self.alphadf = self.pad_ffill_dfs["alphadf"]
        for inst in self.instruments:
            inst_eligible = (~np.isnan(self.alphadf[inst])) \
                & self.activedf[inst] \
                & (self.voldf[inst] > 0).astype("bool") \
                & (self.baseclosedf[inst] > 0)
            eligibles.append(inst_eligible)
        self.invriskdf = np.log(1 / self.voldf) / np.log(1.3)
        self.eligiblesdf = pd.concat(eligibles, axis=1)
        self.eligiblesdf.columns = self.instruments
        self.eligiblesdf = self.eligiblesdf.astype("int8")
        self.rankdf = (-1 * self.alphadf).rank(
            axis=1, method="average", na_option="keep", ascending=True
        )
        return
(insts, signal_dfs) = load_file("gtrendsignals.dump")

# exc: {'CEG', 'GFS', 'RIVN', 'LCID', 'ABNB'} listed > 2020
print("exc:", set(insts).difference(set([inst for inst, inst_df in zip(insts, insts_dfs) if len(inst_df) != 0])))
insts = [inst for inst, inst_df in zip(insts, insts_dfs) if len(inst_df) != 0]
signal_dfs = [signal_df for signal_df, inst_df in zip(signal_dfs, insts_dfs) if len(inst_df) != 0]
insts_dfs = [df for df in insts_dfs if len(df) != 0]
    trade_range=(period_start, period_end),
    instruments=[inst + "%USD" for inst in insts],
    execrates=[0.001] * len(insts),
    dfs=dfs
)
df = await gtrendstrat.run_simulation(verbose=True)
3 Results
We present the strategy results, which are not to be taken without the attendant commentary. First, log returns are presented. Results are shown assuming no friction; trade results assuming a 0.1% market spread are also presented.
Results in general indicate good risk-adjusted returns, significant left skew, and leptokurtic returns with high CAGR when volatility is targeted to 0.20. The return density histogram is shown, with the median in dotted red and the mean in solid red.
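For concreteness, the quoted statistics can be computed as below on synthetic daily log returns; the definitions here are the common ones, and the engine's may differ slightly:

```python
import numpy as np
import pandas as pd

# Synthetic daily log returns, roughly vol-targeted to 0.20 annualized.
rng = np.random.default_rng(7)
logret = pd.Series(rng.normal(0.0005, 0.0126, 2530))

ann_vol = logret.std() * np.sqrt(253)                  # annualized volatility
cagr = np.exp(logret.mean() * 253) - 1                 # geometric annual growth
pearson_skew = 3 * (logret.mean() - logret.median()) / logret.std()
excess_kurtosis = logret.kurt()                        # > 0 means leptokurtic
```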
Left skew is observed both from the Pearson skew computation and from the mean sitting to the left of the median value. The directionality indicates that our portfolio is constructed long only; as a result, our results at this stage have no specific implications for the value-add of the strategy. Our simulation consists of a long basket of NDX components picked (survivors) from among the largest and most successful tech stocks. It would be more useful to analyze the strategy's factor loadings, which we will do later. First, we present other statistical exhibits of our trading simulations made available to the Russian Doll engine.
In order to see that our strategy is not replicating a buy-and-hold portfolio of a small subset of names, we can observe its sum of weights over time. We see that the positions are evenly distributed among the NDX components.
We also present 1-y rolling drawdowns, 1-y rolling max drawdowns, 3-y rolling CAGR, 3-y rolling Calmar ratios, and the Ulcer index, included as part of our backtest engine:
As mentioned, since our portfolio is a long basket consisting of stocks that were successful in the first place, we would like to see how our strategy fared against the broader market, and what proportion of returns is attributable to beta effects. We operate under zero-rate (risk-free) assumptions. The Russian Doll engine now supports a one-factor model against the S&P 500 index.
Sure enough, it has significant factor loading to the S&P 500.
We are interested in how much excess return is introduced by our signal tilt. The intercept is strongly significant, with a t-statistic of 5.453; annualized alpha is 17.8 percent. However, this is a non-representative factor model of our universe constituents. A more suitable null strategy would be an equal-weighted portfolio of all constituents of the strategy under concern. A parity index consisting of the strategy instruments is constructed to be the market, and the one-factor analysis is repeated.
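A sketch of this one-factor regression on synthetic data, using plain least squares rather than the engine's own tooling; the simulated beta, alpha, and noise levels are arbitrary, and only the procedure mirrors the text:

```python
import numpy as np

# Synthetic daily returns: a "parity" market factor and a strategy with a
# known beta of 1.1 and a small daily alpha (all values are made up).
rng = np.random.default_rng(1)
n = 4000
mkt = rng.normal(0.0004, 0.01, n)
strat = 0.0002 + 1.1 * mkt + rng.normal(0.0, 0.005, n)

# One-factor OLS: strategy returns on an intercept and the factor returns.
X = np.column_stack([np.ones(n), mkt])
coef, _, _, _ = np.linalg.lstsq(X, strat, rcond=None)
resid = strat - X @ coef
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_alpha = coef[0] / se[0]          # t-statistic of the (daily) alpha
ann_alpha = coef[0] * 253          # rough annualization of the intercept
```

The intercept captures the return not explained by the factor exposure, which is the quantity being tested for significance in the tables below.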
Dep. Variable:             y        R-squared:           0.791
Model:                     OLS      Adj. R-squared:      0.791
Method:                    Least Squares  F-statistic:   1.516e+04
Date:                      Sat, 04 Feb 2023  Prob (F-statistic): 0.00
Time:                      09:30:38  Log-Likelihood:     14833.
No. Observations:          4015     AIC:                 -2.966e+04
Df Residuals:              4013     BIC:                 -2.965e+04
Df Model:                  1
Covariance Type:           nonrobust

                coef    std err        t      P>|t|     [0.025     0.975]
Intercept     0.0002   9.52e-05    1.832      0.067  -1.22e-05      0.000
x             1.1383      0.009  123.116      0.000      1.120      1.156
We see that the excess returns are only moderately significant after adjusting for the asset universe, with a p-value of 0.07. The R-squared fit increases, and the excess return from the factor tilt is only 4 percent per annum. There is still some evidence that Google Trends data has the ability to predict future returns. Although some may be satisfied to conclude the study at this point, a trader should always be looking to express his edge in the cheapest and/or most diversified manner possible. We shall question whether our signals represent classical style exposure to well-known factors. Our Russian Doll engine now supports multi-factor modelling of strategy components, for conducting studies such as Fama-French style analysis at daily resolution. The price-to-book ratio and market cap are constructed on a last-publicly-available-information basis, and the factors are computed as the returns of the highest third less the lowest third for each relevant characteristic.
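The tercile spread construction can be sketched as follows; the tickers, returns, and price-to-book values are hypothetical, and in real usage the characteristics would be refreshed as new public information arrives:

```python
import numpy as np
import pandas as pd

def tercile_factor(returns: pd.DataFrame, char: pd.Series) -> pd.Series:
    # Daily factor return: mean return of names in the top third of the
    # characteristic minus the mean return of the bottom third.
    ranked = char.rank(pct=True)
    top = ranked[ranked > 2 / 3].index
    bot = ranked[ranked <= 1 / 3].index
    return returns[top].mean(axis=1) - returns[bot].mean(axis=1)

# Hypothetical two-day returns and static price-to-book values.
rets = pd.DataFrame({"A": [0.01, 0.02], "B": [0.00, -0.01], "C": [-0.02, 0.03]})
ptb = pd.Series({"A": 8.0, "B": 2.0, "C": 0.5})

# Value minus growth: low price-to-book (value) minus high price-to-book (growth).
vmg = -tercile_factor(rets, ptb)
```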
Dep. Variable:             y        R-squared:           0.080
Model:                     OLS      Adj. R-squared:      0.080
Method:                    Least Squares  F-statistic:   350.0
Date:                      Sat, 04 Feb 2023  Prob (F-statistic): 5.93e-75
Time:                      09:30:39  Log-Likelihood:     11861.
No. Observations:          4015     AIC:                 -2.372e+04
Df Residuals:              4013     BIC:                 -2.371e+04
Df Model:                  1
Covariance Type:           nonrobust

                coef    std err        t      P>|t|     [0.025     0.975]
Intercept     0.0005      0.000    2.369      0.018    8.21e-05      0.001
x            -0.0187      0.001  -18.709      0.000      -0.021     -0.017
We see that there is a statistically significant negative factor loading on the VMG (value-minus-growth) factor. This is no surprise, given the outperformance of big growth companies in the past decade. Now, building a multi-factor model, we see that the intercept is no longer statistically significant when adjusted for the market, growth, and size factors, despite the excess return being marginally positive. See also that the size factor diminishes in significance, and flips its beta, when the market and growth regressors are adjusted for in the model.
There is some quantitative evidence that our excess return is due to stylistic exposure to smaller-size and larger-growth factors within the NDX basket. It may be possible for an investor to adopt these style exposures in her portfolio with a cheaper replicating style ETF than the strategy proposed.
A summary of the results from the factor models is presented, with spx representing the factor model against the S&P 500, while mkt represents the factor model against an equal-weight basket within our strategy:
It is notable that some figures presented might seem fairly absurd without context. For instance, the Sharpe of the value-minus-growth factor was −2.228; put otherwise, the growth-minus-value approach would have achieved significant risk-adjusted returns. However, this is not surprising, since the NDX components consist of precisely the basket of growth stocks which performed well in the past. One needs to be mindful of the sampling method used in obtaining the case study. Much of the financial literature fails to address such concerns, and I cast serious doubt on the validity of their results and statistical significance.
Another feature introduced allows us to observe the returns of our alpha model against the parity basket along the range of return values. We can plot the kernel-smoothed CDF of returns generated from the alpha model (orange) and the parity model (blue).
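A sketch of such a kernel-smoothed CDF comparison on synthetic returns; we use a logistic kernel and a hand-picked bandwidth for simplicity, which need not match the engine's implementation:

```python
import numpy as np

def smooth_cdf(sample: np.ndarray, grid: np.ndarray, h: float) -> np.ndarray:
    # Kernel-smoothed empirical CDF using a logistic kernel of bandwidth h.
    z = (grid[:, None] - sample[None, :]) / h
    return (1.0 / (1.0 + np.exp(-z))).mean(axis=1)

# Synthetic alpha-model and parity-basket daily returns.
rng = np.random.default_rng(3)
alpha_rets = rng.normal(0.0006, 0.012, 3000)
parity_rets = rng.normal(0.0004, 0.012, 3000)

grid = np.linspace(-0.05, 0.05, 201)
F_alpha = smooth_cdf(alpha_rets, grid, h=0.002)
F_parity = smooth_cdf(parity_rets, grid, h=0.002)

# First-order stochastic dominance of alpha over parity would require
# F_alpha <= F_parity at every point of the grid.
dominates = bool(np.all(F_alpha <= F_parity))
```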
Unfortunately, stochastic dominance is not observed. However, for the range of returns past zero, the alpha model's probability of obtaining returns at least as great as that threshold is at least as large as the parity model's.
Last but not least, simulated against a no-inertia position-switching portfolio with a 0.01% market spread, we can obtain the cost sensitivity of our strategy against its inherent turnover. All Monte Carlo permutation hypothesis tests for asset picking, asset timing, and decision making returned p-values of 0.20 ∼ 0.30, indicating only weak evidence of value-add.
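A minimal sketch of one such permutation test, here for timing: shuffle the alignment between the signal and forward returns many times and compare the realized statistic against the permuted distribution. The test statistic and sampling scheme in the Russian Doll engine may differ; the data here is synthetic.

```python
import numpy as np

# Synthetic forward returns and a mildly informative timing signal.
rng = np.random.default_rng(5)
fwd = rng.normal(0.0004, 0.01, 1000)
sig = 0.2 * fwd + rng.normal(0.0, 0.01, 1000)

# Observed statistic: mean signal-weighted forward return.
observed = np.mean(sig * fwd)

# Null distribution: destroy the timing by permuting the signal's alignment.
perm = np.array([np.mean(rng.permutation(sig) * fwd) for _ in range(2000)])

# One-sided p-value with the +1 correction for the observed draw.
p_value = (np.sum(perm >= observed) + 1) / (len(perm) + 1)
```

A large p-value, as in the 0.20 ∼ 0.30 range reported above, would indicate that the observed statistic is unremarkable relative to chance alignments.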
The concerns of this paper are multifold: we wanted to demonstrate the potential use of Google Trends data to construct market signals, and we wanted to show the use of sound quantitative analysis and return attribution to market and style factors. It is possible that many strategies available online are derivatives of classical style exposures meshed together. In these cases, replicating the style exposures with an ETF might offer a cheaper alternative than replicating the strategy. Diversification can be decomposed into assets, geography, time, exposure, and more; stylistic diversification is important. This machinery is now available for use under the Russian Doll engine for traders to analyze their own proprietary ideas.
Author’s Note: hopefully, the content within, and discussions from previous work such as in paper link, indicate the difficulty of finding true, stable alphas. I hope that the commentary from these works trains you to adopt a skeptical view of the academic literature that is easily accessible to us; one should adopt a simultaneously open and cynical attitude towards such works. To find stable alphas, it is usually necessary to begin with a hypothesis about markets and where the edge might lie. Alternatively, one may embark on quantitative data analysis, but singular alphas found from such numerical studies tend to be unstable; in such endeavours, ensemble models and large samples are required for profitability. Often, small edges can be found both through personal research and through ubiquitously available literature. Although these are not tradable ‘as is’, often the combination/accumulation of edges can be useful as trading signals. Alternatively, they can also introduce portfolio ‘tilts’ when layered on top of active risk management strategies. With respect to this paper, the trend data was not used in relation to the directionality of prevailing market trends. To improve upon the signals, future works may include adjustments to signal generation, combining investor interest trends with momentum signals to adjust for information propagation, and so on.