Hilpisch 2020 Artificial Intelligence in Finance 1 477 Pages

Scientific Method
The scientific method refers to a set of generally accepted principles that should guide
any scientific project. Wikipedia defines the scientific method as follows:
The scientific method is an empirical method of acquiring knowledge that has
characterized the development of science since at least the 17th century. It involves
careful observation, applying rigorous skepticism about what is observed, given that
cognitive assumptions can distort how one interprets the observation. It involves
formulating hypotheses, via induction, based on such observations; experimental and
measurement-based testing of deductions drawn from the hypotheses; and refinement
(or elimination) of the hypotheses based on the experimental findings. These are
principles of the scientific method, as distinguished from a definitive series of steps
applicable to all scientific enterprises.
Given this definition, normative finance, as discussed in Chapter 3, is in stark
contrast to the scientific method. Normative financial theories mostly rely on
assumptions and axioms in combination with deduction as the major analytical
method to arrive at their central results.
• Expected utility theory (EUT) assumes that agents have the same utility function
no matter what state of the world unfolds and that they maximize expected utility
under conditions of uncertainty.
• Mean-variance portfolio (MVP) theory describes how investors should invest
under conditions of uncertainty assuming that only the expected return and the
expected volatility of a portfolio over one period count.
• The capital asset pricing model (CAPM) assumes that only the nondiversifiable
market risk explains the expected return and the expected volatility of a stock
over one period.
• Arbitrage pricing theory (APT) assumes that a number of identifiable risk factors
explain the expected return and the expected volatility of a stock over time;
admittedly, compared to the other theories, the formulation of APT is rather
broad and allows for wide-ranging interpretations.
What characterizes the aforementioned normative financial theories is that they were
originally derived under certain assumptions and axioms using “pen and paper” only,
without any recourse to real-world data or observations. From a historical point of
view, many of these theories were rigorously tested against real-world data only long
after their publication dates. This can be explained primarily by better data
availability and increased computational capabilities over time. After all, data and
computation are the main ingredients for the application of statistical methods in practice.
The discipline at the intersection of mathematics, statistics, and finance that applies
such methods to financial market data is typically called financial econometrics, the
topic of the next section.

Financial Econometrics and Regression
Adapting the definition provided by Investopedia for econometrics, one can define
financial econometrics as follows:
[Financial] econometrics is the quantitative application of statistical and
mathematical models using [financial] data to develop financial theories or test
existing hypotheses in finance and to forecast future trends from historical data. It
subjects real-world [financial] data to statistical trials and then compares and
contrasts the results against the [financial] theory or theories being tested.
Alexander (2008b) provides a thorough and broad introduction to the field of financial
econometrics. The second chapter of the book covers single- and multifactor
models, such as the CAPM and APT. Alexander (2008b) is part of a series of four
books called Market Risk Analysis. The first in the series, Alexander (2008a), covers
theoretical background concepts, topics, and methods, such as MVP theory and the
CAPM themselves. The book by Campbell (2018) is another comprehensive resource
for financial theory and related econometric research.
One of the major tools in financial econometrics is regression, in both its univariate
and multivariate forms. Regression is also a central tool in statistical learning in general.
What is the difference between traditional mathematics and statistical learning?
Although there is no general answer to this question (after all, statistics is a sub-field
of mathematics), a simple example should emphasize a major difference relevant to
the context of this book.
First is the standard mathematical way. Assume a mathematical function is given as
follows:
$$f: \mathbb{R} \to \mathbb{R}_+, \quad x \mapsto 2 + \frac{1}{2} x$$
Given multiple values $x_i$, $i = 1, 2, \dots, n$, one can derive function values for $f$ by
applying the above definition:
$$y_i = f(x_i), \quad i = 1, 2, \dots, n$$
The following Python code illustrates this based on a simple numerical example:
In [1]: import numpy as np

In [2]: def f(x):
            return 2 + 1 / 2 * x

In [3]: x = np.arange(-4, 5)
        x
Out[3]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [4]: y = f(x)
        y
Out[4]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
Second is the approach taken in statistical learning. Whereas in the preceding example,
the function comes first and then the data is derived, this sequence is reversed in
statistical learning. Here, the data is generally given and a functional relationship is
to be found. In this context, x is often called the independent variable and y the
dependent variable. Consequently, consider the following data:
$$(x_i, y_i), \quad i = 1, 2, \dots, n$$
The problem is to find, for example, parameters α, β such that:
$$\hat{f}(x_i) \equiv \alpha + \beta x_i = \hat{y}_i \approx y_i, \quad i = 1, 2, \dots, n$$
Another way of writing this is by including residual values $\epsilon_i$, $i = 1, 2, \dots, n$:
$$\alpha + \beta x_i + \epsilon_i = y_i, \quad i = 1, 2, \dots, n$$
In the context of ordinary least-squares (OLS) regression, α, β are chosen to
minimize the mean-squared error between the approximated values $\hat{y}_i$ and the real
values $y_i$. The minimization problem, then, is as follows:
$$\min_{\alpha, \beta} \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$
In the case of simple OLS regression, as described previously, the optimal solutions
are known in closed form and are as follows:
$$\beta = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
$$\alpha = \bar{y} - \beta \bar{x}$$
Here, Cov stands for the covariance, Var for the variance, and $\bar{x}, \bar{y}$ for the mean
values of $x, y$.
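For completeness, the following is a short sketch of where these closed-form expressions come from, written in the notation of this section. It assumes the population (1/n) definitions of covariance and variance. Setting the two partial derivatives of the mean-squared error to zero and substituting the first condition into the second gives:

```latex
\begin{aligned}
\frac{\partial}{\partial \alpha} \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i)^2
  &= \frac{2}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i) = 0
  \;\Rightarrow\; \alpha = \bar{y} - \beta \bar{x} \\
\frac{\partial}{\partial \beta} \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i)^2
  &= \frac{2}{n} \sum_{i=1}^{n} x_i (\alpha + \beta x_i - y_i) = 0 \\
\Rightarrow\; \beta \cdot \frac{1}{n} \sum_{i=1}^{n} x_i (x_i - \bar{x})
  &= \frac{1}{n} \sum_{i=1}^{n} x_i (y_i - \bar{y}) \\
\Rightarrow\; \beta \, \mathrm{Var}(x) &= \mathrm{Cov}(x, y)
\end{aligned}
```

The last step uses the fact that $\sum_i \bar{x} (x_i - \bar{x}) = 0$ and $\sum_i \bar{x} (y_i - \bar{y}) = 0$, so the two sums equal the variance of $x$ and the covariance of $x$ and $y$, respectively.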
Returning to the preceding numerical example, these insights can be used to derive
optimal parameters α, β and, in this particular case, to recover the original definition
of $f(x)$:

In [5]: x
Out[5]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [6]: y
Out[6]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

In [7]: # β as derived from the covariance matrix and the variance
        beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
        beta
Out[7]: 0.49999999999999994

In [8]: # α as derived from β and the mean values
        alpha = y.mean() - beta * x.mean()
        alpha
Out[8]: 2.0

In [9]: # Estimated values ŷ_i, i = 1, 2, ..., n, given α, β
        y_ = alpha + beta * x

In [10]: # Checks whether the ŷ_i, y_i values are numerically equal
         np.allclose(y_, y)
Out[10]: True
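The example above is noise-free, so the fit is exact. As a hedged cross-check, the following sketch applies the same closed-form expressions to data with added noise and compares the result with `np.polyfit`. The noise level, the seed, and the names `rng`, `y_noisy`, `beta_pf`, and `alpha_pf` are assumptions made for illustration only; they do not come from the text.

```python
import numpy as np

# Same x values as in the example above.
x = np.arange(-4, 5)

# Assumption for illustration: Gaussian noise with a fixed seed.
rng = np.random.default_rng(100)
y_noisy = 2 + 0.5 * x + rng.normal(0, 0.25, len(x))

# Closed-form OLS solution (population covariance over population variance).
beta = np.cov(x, y_noisy, ddof=0)[0, 1] / x.var()
alpha = y_noisy.mean() - beta * x.mean()

# np.polyfit with deg=1 returns the coefficients, highest power first.
beta_pf, alpha_pf = np.polyfit(x, y_noisy, deg=1)

# Mean-squared error of the fitted line.
mse = np.mean((alpha + beta * x - y_noisy) ** 2)

print(alpha, beta, mse)
```

Both approaches should agree up to floating-point precision, since `np.polyfit` with degree 1 solves the same least-squares problem.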

The preceding example and those in Chapter 1 illustrate that the application of OLS
regression to a given data set is in general straightforward. There are more reasons
why OLS regression has become one of the central tools in econometrics and financial
econometrics. Among them are the following:
Centuries old
The least-squares approach, particularly in combination with regression, has
been used for more than 200 years.¹
Simplicity
The mathematics behind OLS regression is easy to understand and easy to implement
in programming.
Scalability
There is basically no limit regarding the data size to which OLS regression can
be applied.
Flexibility
OLS regression can be applied to a wide range of problems and data sets.
Speed
OLS regression is fast to evaluate, even on larger data sets.
Availability
Efficient implementations in Python and many other programming languages
are readily available.

¹ See, for example, Kopf (2015).

However, as easy and straightforward as the application of OLS regression might be
in general, the method rests on a number of assumptions—most of them related to
the residuals—that are not always satisfied in practice.
Linearity
The model is linear in its parameters, with regard to both the coefficients and the
residuals.
Independence
Independent variables are not perfectly (to a high degree) correlated with each
other (no multicollinearity).
Zero mean
The mean value of the residuals is (close to) zero.
No correlation
Residuals are not (strongly) correlated with the independent variables.
Homoscedasticity
The standard deviation of the residuals is (almost) constant.
No autocorrelation
The residuals are not (strongly) correlated with each other.
In practice, it is in general quite simple to test for the validity of the assumptions
given a specific data set.
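A minimal sketch of such checks follows, using only NumPy on simulated data. The data-generating process, the seed, and the helper name `ols_diagnostics` are hypothetical and chosen for illustration. The Durbin-Watson statistic is a standard diagnostic for first-order autocorrelation that is brought in here; it is not named in the text. The comparison of residual standard deviations across the two halves of the sample is only a rough stand-in for a formal homoscedasticity test.

```python
import numpy as np


def ols_diagnostics(x, y):
    """Fit a simple OLS regression and return basic residual diagnostics."""
    beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
    alpha = y.mean() - beta * x.mean()
    res = y - (alpha + beta * x)
    half = len(res) // 2
    return {
        # Zero mean: should be (close to) zero.
        'mean': res.mean(),
        # No correlation between the residuals and the independent variable.
        'corr_x': np.corrcoef(x, res)[0, 1],
        # Homoscedasticity (rough check): ratio of the residual standard
        # deviations in the first and second half of the sample.
        'std_ratio': res[:half].std() / res[half:].std(),
        # No autocorrelation: Durbin-Watson statistic, close to 2 if the
        # residuals are not serially correlated.
        'durbin_watson': np.sum(np.diff(res) ** 2) / np.sum(res ** 2),
    }


# Simulated data (assumption): linear relationship plus i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = 1.0 + 0.5 * x + rng.normal(0, 1, len(x))

diag = ols_diagnostics(x, y)
for name, value in diag.items():
    print(f'{name:>14}: {value: .4f}')
```

Note that for a regression with an intercept, the first two quantities are zero by construction in-sample, so these two checks become informative mainly when applied to out-of-sample data or to models without an intercept.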
Data Availability
Financial econometrics is driven by statistical methods, such as regression, and the
availability of financial data. From the 1950s to the 1990s, and even into the early
2000s, theoretical and empirical financial research was mainly driven by data sets that
were relatively small by today's standards and that consisted mostly of end-of-day
(EOD) data. Data availability has changed dramatically over the last decade or so,
with more and more types of financial and other data available in ever-increasing
granularity, quantity, and velocity.

Programmatic APIs
With regard to data-driven finance, what is important is not only what data is available
but also how it can be accessed and processed. For quite a while now, finance
professionals have relied on data terminals from companies such as Refinitiv (see
Eikon Terminal) or Bloomberg (see Bloomberg Terminal), to mention just two of the
leading providers. Newspapers, magazines, financial reports, and the like have long
been replaced by such terminals as the primary source for financial information.
However, the sheer volume and variety of data provided by such terminals cannot be
consumed systematically by a single user or even large groups of finance professionals.
Therefore, the major breakthrough in data-driven finance is to be seen in the
programmatic availability of data via application programming interfaces (APIs) that
allow the usage of computer code to select, retrieve, and process arbitrary data sets.
The remainder of this section is devoted to the illustration of such APIs by which
even academics and retail investors can retrieve a wealth of different data sets.
Before such examples are provided, Table 4-1 offers an overview of categories of
data that are in general relevant in a financial context, as well as typical examples. In
the table, structured data refers to numerical data types that often come in tabular
structures, while unstructured data refers to data in the form of standard text that
often has no structure beyond headers or paragraphs, for example. Alternative data
refers to data types that are typically not considered financial data.
Table 4-1. Relevant types of financial data

Time        Structured data       Unstructured data  Alternative data
Historical  Prices, fundamentals  News, texts        Web, social media, satellites
Streaming   Prices, volumes       News, filings      Web, social media, satellites, Internet of Things
Structured Historical Data

First, structured historical data types will be retrieved programmatically. To this end,
the following Python code uses the Eikon Data API.²
To access data via the Eikon Data API, a local application, such as Refinitiv Workspace,
must be running and the API access must be configured on the Python level:
In [11]: import eikon as ek
         import configparser

In [12]: c = configparser.ConfigParser()
         c.read('../aiif.cfg')
         ek.set_app_key(c['eikon']['app_id'])
2020-08-04 10:30:18,059 P[14938] [MainThread 4521459136] Error on handshake
port 9000 : ReadTimeout(ReadTimeout())

² This data service is only available via a paid subscription.

If these requirements are met, historical structured data can be retrieved via a single
function call. For example, the following Python code retrieves EOD data for a set of
symbols and a specified time interval:
In [14]: # Defines a list of RICs (Reuters Instrument Codes) to retrieve data for
         symbols = ['AAPL.O', 'MSFT.O', 'NFLX.O', 'AMZN.O']

In [15]: # Retrieves EOD Close prices for the list of RICs
         data = ek.get_timeseries(symbols,
                                  fields='CLOSE',
                                  start_date='2019-07-01',
                                  end_date='2020-07-01')

In [16]: # Shows the meta information for the returned DataFrame object
         data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 254 entries, 2019-07-01 to 2020-07-01
         Data columns (total 4 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   AAPL.O  254 non-null    float64
          1   MSFT.O  254 non-null    float64
          2   NFLX.O  254 non-null    float64
          3   AMZN.O  254 non-null    float64
         dtypes: float64(4)
         memory usage: 9.9 KB

In [17]: # Shows the final rows of the DataFrame object
         data.tail()
Out[17]: CLOSE       AAPL.O  MSFT.O  NFLX.O   AMZN.O
         Date
         2020-06-25  364.84  200.34  465.91  2754.58
         2020-06-26  353.63  196.33  443.40  2692.87
         2020-06-29  361.78  198.44  447.24  2680.38
         2020-06-30  364.80  203.51  455.04  2758.82
         2020-07-01  364.11  204.70  485.64  2878.70
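Once such a DataFrame object is available, further processing is plain pandas. The following sketch does not call the Eikon Data API. Instead, it rebuilds by hand the five rows shown in the `data.tail()` output, an assumption made so that the example runs without a subscription, and derives normalized prices and log returns from them. The names `normalized` and `log_returns` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the five rows shown in the output above, so that
# the example runs without access to the (paid) data service.
data = pd.DataFrame(
    {
        'AAPL.O': [364.84, 353.63, 361.78, 364.80, 364.11],
        'MSFT.O': [200.34, 196.33, 198.44, 203.51, 204.70],
        'NFLX.O': [465.91, 443.40, 447.24, 455.04, 485.64],
        'AMZN.O': [2754.58, 2692.87, 2680.38, 2758.82, 2878.70],
    },
    index=pd.DatetimeIndex(
        ['2020-06-25', '2020-06-26', '2020-06-29', '2020-06-30', '2020-07-01'],
        name='Date',
    ),
)

# Prices normalized to a starting value of 1 for each symbol.
normalized = data / data.iloc[0]

# Daily log returns; the first row has no predecessor and is dropped.
log_returns = np.log(data / data.shift(1)).dropna()

print(normalized.round(4))
print(log_returns.round(4))
```

Because log returns are additive over time, the column sums of `log_returns` equal the log of the last price divided by the first price for each symbol.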
