
National Technical University of Athens

School of Civil Engineering

Applications of Stochastic Methods and Machine Learning
in Finance

Diploma Thesis

Author: Ioannis Kokotsakis
Supervisor: V. Papadopoulos, Professor NTUA

Athens
August 31, 2021
Introduction

The "time value of money" and uncertainty are central elements that influence the value of
financial instruments. When only the time aspect of finance is considered, the tools of calculus
and differential equations are adequate. When only uncertainty is considered, probability
theory illuminates the possible outcomes. When time and uncertainty are considered together,
we begin the study of mathematical finance. Finance is the study of economic agents' behavior
in allocating financial resources and managing risks across alternative financial instruments
and in time in an uncertain environment. Well-known examples of financial instruments are
bank accounts, loans, stocks, government bonds and corporate bonds. Mathematical finance is
often characterized as the study of the more sophisticated financial instruments called
derivatives. A derivative is a financial agreement between two parties that depends on
something that occurs in the future, for example the price or performance of an underlying
asset. The underlying asset could be a stock, a bond or a currency.

Summary

The "time value of money" and uncertainty are the central elements that influence the value of
financial instruments. When only the time aspect of finance is considered, the tools of calculus
and differential equations are adequate. When only the uncertainty is considered, the tools of
probability theory illuminate the possible outcomes. When time and uncertainty are considered
together we begin the study of mathematical finance. Finance is the study of economic agents’
behavior in allocating financial resources and managing risks across alternative financial instru-
ments and in time in an uncertain environment. Well-known examples of financial instruments are
bank accounts, loans, stocks, government bonds and corporate bonds. Mathematical finance is
often characterized as the study of the more sophisticated financial instruments called deriva-
tives. A derivative is a financial agreement between two parties that depends on something that
occurs in the future, for example the price or performance of an underlying asset. The underlying
asset could for example be a stock, a bond or a currency. In the context of financial derivative
pricing, there is a stage in which the asset model needs to be calibrated to market data. In other
words, the open parameters in the asset price model need to be fitted. This is typically not done
by historical asset prices, but by means of option prices, by matching the market prices of heavily
traded options to the option prices from the mathematical model, under the so-called risk-neutral
probability measure. In the case of model calibration, thousands of option prices need to be de-
termined in order to fit these asset parameters.

Acknowledgements

I would like to thank Prof. Vissarion Papadopoulos for assigning me this diploma thesis and
for the opportunity to work on financial mathematics and its numerical applications. I would
also like to thank Mr. Odysseas Kokkinos for his guidance and advice, which were crucial to
the completion of this work.

Contents

1 Introductory definitions 1
1.1 Interest rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Futures and Forwards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Theoretical Relationships Between Spot, Forwards and Futures . . . . . . 2
1.2.2 Accounting for Dividends . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.3 No Arbitrage Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Put-Call Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 American Option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Mathematical models for options pricing 9


2.1 Principles of Options Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Path-Dependent Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Pricing and Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Black-Scholes Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Parameters and calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Implied Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Stochastic Volatility Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.9 Stochastic Volatility PDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Machine learning methods 21


3.1 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Bootstrap aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Variance Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Optimization algorithms 31
4.1 Genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


4.2 Covariance matrix adaptation evolution . . . . . . . . . . . . . . . . . . . . . . . . 32


4.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Bayesian optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Applications with real options data 37


5.1 Dataset creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Calibration of BSM and Heston model . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 BSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Heston . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.1 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Results 49

A Python codes 51

References 61
List of Figures

1.1 The no arbitrage range for the market price of a financial future . . . . . . . . . . 5

3.1 Rosenblatt’s Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


3.2 Multilayer neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Partitions and CART. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Train/test splits in a 5-fold CV scheme . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Concept of directional optimization in CMA-ES algorithm. . . . . . . . . . . . . . 33


4.2 Pseudocode of the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Basic pseudo-code for Bayesian optimization . . . . . . . . . . . . . . . . . . . . . 35

5.1 Statistics for the data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


5.2 Statistics for the data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Parameters, upper and lower bounds for the BSM model . . . . . . . . . . . . . . 38
5.4 Calibration results for the BSM model . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Results for the BSM model calibration with the GA algorithm . . . . . . . . . . . 39
5.6 Convergence for the BSM model calibration with the GA algorithm . . . . . . . . 40
5.7 Results for the BSM model calibration with the CMAES algorithm . . . . . . . . . 40
5.8 Convergence for the BSM model calibration with the CMAES algorithm . . . . . 41
5.9 Results for the BSM model calibration with the Bayesian optimization algorithm 41
5.10 Parameters, upper and lower bounds for the Heston model . . . . . . . . . . . . . 42
5.11 Calibration results for the Heston model . . . . . . . . . . . . . . . . . . . . . . . . 42
5.12 Results for the Heston model calibration with the GA algorithm . . . . . . . . . . 43
5.13 Convergence for the Heston model calibration with the GA algorithm . . . . . . 43
5.14 Results for the Heston model calibration with the CMAES algorithm . . . . . . . 44
5.15 Convergence for the Heston model calibration with the CMAES algorithm . . . . 44
5.16 Results for the Heston model calibration with the Bayesian optimization algorithm 45
5.17 Data set features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.18 Parameters, upper and lower bounds for the NN model . . . . . . . . . . . . . . . 47
5.19 Grid search results for the NN model . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.20 Parameters, upper and lower bounds for the RF model . . . . . . . . . . . . . . . . 48
5.21 Grid search results for the RF model . . . . . . . . . . . . . . . . . . . . . . . . . . 48

A.1 code yahoo finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


A.2 code data creation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


A.3 code data creation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


A.4 code data creation 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A.5 code bsm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.6 code bsm GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.7 code mc heston . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.8 code heston analytical formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.9 code heston GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.10 code heston CMAES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.11 code heston bayes opt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.12 code data transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.13 code NN grid 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.14 code NN grid 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.15 code RF grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Chapter 1

Introductory definitions

1.1 Interest rates


The future value of an investment depends upon how the interest is calculated. Simple interest
is paid only on the principal amount invested, but when an investment pays compound interest
it pays interest on both the principal and previous interest payments. There are two methods
of calculating interest, based on discrete compounding and continuous compounding. Discrete
compounding means that interest payments are periodically accrued to the account, such as every
6 months or every month. Continuous compounding is a theoretical construct that assumes
interest payments are continuously accrued, although this is impossible in practice. Both simple
and compound interest calculations are possible with discrete compounding but with continuous
compounding only compound interest rates apply. The discretely compounded analogue depends
on whether simple or compound interest is used. Again let N denote the principal (i.e. the
amount invested) and let T denote the number of years of the investment. But now denote by
R_T the discretely compounded T-maturity interest rate, again quoted in annual terms. Under
simple compounding of interest the future value of a principal N invested now over a period of
T years is

V = N(1 + R_T T).

However, under compound interest,

V = N(1 + R_T)^T.

Similarly, with simple interest the present value of an amount V paid at some future time T is

N = V(1 + R_T T)^(−1),

but with compound interest it is

N = V(1 + R_T)^(−T).
The frequency of payments can be annual, semi-annual, quarterly or even monthly. If the
annual rate quoted is denoted by R and the interest payments are made n times each year, then

Annual compounding factor = (1 + R/n)^n.


For instance, (1 + R/2)^2 is the annual compounding factor when interest payments are semi-
annual and R is the 1-year interest rate. In general, if a principal amount N is invested at a
discretely compounded annual interest rate R, which has n compounding periods per year, then
its value after m compounding periods is

V = N(1 + R/n)^m.
It is worth noting that:

lim_{n→∞} (1 + R/n)^n = exp(R),

and this is why the continuously compounded interest rate takes an exponential form.
Example 1.1.1. Find the value of $500 in 3.5 years' time if it earns interest of 4% per annum
and interest is compounded semi-annually. How does this compare with the continuously com-
pounded value?

V = 500 × (1 + 0.04/2)^7 = $574.34.

But under continuous compounding the value will be greater:

V = 500 × exp(0.04 × 3.5) = $575.14.
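These compounding formulas are easy to check numerically. A minimal Python sketch (the function names are illustrative, not taken from the appendix listings):

```python
import math

def future_value_discrete(N, R, n, years):
    """Future value of principal N at annual rate R, compounded n times per year."""
    m = int(round(n * years))  # total number of compounding periods
    return N * (1 + R / n) ** m

def future_value_continuous(N, R, years):
    """Future value of principal N under continuous compounding."""
    return N * math.exp(R * years)

# Example 1.1.1: $500 at 4% per annum for 3.5 years
v_semi = future_value_discrete(500, 0.04, 2, 3.5)   # 500 * (1 + 0.04/2)**7
v_cont = future_value_continuous(500, 0.04, 3.5)    # 500 * exp(0.04 * 3.5)
print(round(v_semi, 2), round(v_cont, 2))           # 574.34 575.14
```

The continuous value exceeds the semi-annual one, consistent with the limit (1 + R/n)^n → exp(R).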

1.2 Futures and Forwards


1.2.1 Theoretical Relationships Between Spot, Forwards and Futures
The theoretical relationships between spot and forward prices may be extended to spot and fu-
tures prices. The general theory is developed using continuous compounding and forward prices,
rather than discrete compounding and futures prices, because the analysis is easier to follow.

1.2.2 Accounting for Dividends


If the underlying asset is a dividend paying stock or a coupon paying bond then strategy (a)
brings benefits that we have ignored in the above analysis. In particular, we need to take into
account any dividend payments, between now and the expiry of the future; or if the asset is a
bond we should include in the benefits to strategy (a) any coupon payments between now and
the maturity of the future. The following analysis assumes the underlying asset is a stock with
known dividend payments, but applies equally well when the asset is a fixed coupon bond. The
annual dividend yield on a stock is the present value of the annual dividends per share, divided
by the current share price. Note that if dividends are reinvested then the present value should
include the reinvestment income. The continuously compounded dividend yield y(t, T ) over the
period from time t until time T is defined as:

y(t, T) = C(t, T) / (S(t)(T − t)),

where C(t, T) is the present value at time t of the dividends accruing between time t and time
T , including any reinvestment income. Consider a US stock with quarterly dividend payments.
We hold the stock for 6 months, i.e. T = 0.5. Let the current share price be $100 and suppose that
we receive dividends of $2 per share in 1 month’s time and again in 4 months’ time. Assuming
that the dividends are not reinvested, that the 1-month zero coupon interest rate is 4.75% and the
4-month zero coupon rate is 5%, calculate:

i. The continuously compounded present value of the dividend payments; and

ii. The continuously compounded dividend yield over the 6-month period.

For the present value of dividend payments, we have:

C(0, 0.5) = 2 exp(−0.0475/12) + 2 exp(−0.05/3) = $3.959.

Therefore the continuously compounded dividend yield is:

y(0, 0.5) = 3.959/(100 × 0.5) = 7.92%.
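The dividend calculation above can be sketched in a few lines of Python; the helper names are hypothetical, not from the appendix listings:

```python
import math

def dividend_pv(dividends, zero_rates, times):
    """Present value of dividend payments, each discounted continuously
    at the zero-coupon rate for its payment date."""
    return sum(d * math.exp(-r * t) for d, r, t in zip(dividends, zero_rates, times))

def dividend_yield(pv_dividends, spot, horizon):
    """Continuously compounded dividend yield y(t, T) = C(t, T) / (S(t)(T - t))."""
    return pv_dividends / (spot * horizon)

# $2 dividends in 1 and 4 months; zero rates 4.75% and 5%; spot 100; 6-month horizon
C = dividend_pv([2, 2], [0.0475, 0.05], [1 / 12, 4 / 12])
y = dividend_yield(C, spot=100, horizon=0.5)
print(round(C, 3), round(100 * y, 2))  # 3.959 7.92
```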

We now adjust the formula for the fair value of the forward to take account of the dividends that
accrue to the stockholder between now and the expiry of the future. In the general case when
dividends or coupons are paid, the value of strategy (a) is no longer S(T) − S(t) exp(r(t, T)(T − t));
it becomes

S(T) − S(t) exp((r(t, T) − y(t, T))(T − t)).

Hence the no arbitrage condition yields the fair value:

F*(t, T) = S(t) exp((r(t, T) − y(t, T))(T − t)).

1.2.3 No Arbitrage Range


In a perfectly liquid market without transactions costs the market price of a forward would be
equal to its fair value. However, market prices of forwards deviate from their fair value because
there are transactions costs associated with buying both the underlying contract and the forward.
Write the market price of the T -maturity forward at time t < T as:

F (t, T ) = F ∗ (t, T ) + x(t, T )S(t).

Equivalently, the difference between the market price and the fair price of the forward, expressed
as a proportion of the spot price, is:

x(t, T) = (F(t, T) − F*(t, T)) / S(t).
Many authors refer to x(t, T) as the mispricing of the market price of the forward compared with its fair
value. But, as noted above, it is usually the futures contract that is most actively traded and which there-
fore serves the dominant price discovery role. Thus it is really the spot that is mispriced relative
to the futures contract, rather than the futures contract that is mispriced relative to the spot. In

practice it is only possible to make a profitable arbitrage between the spot and the futures contract
if the mispricing is large enough to cover transactions costs. In commodity markets there can be
considerable uncertainty about the net convenience yield. And in equity markets there can be
uncertainties about the size and timing of dividend payments and about the risk free interest rate
during the life of a futures contract. All these uncertainties can affect the return from holding the
spot asset. As a result there is not just one single price at which no arbitrage is possible. In fact
there is a whole range of futures prices around the fair price of the futures for which no arbitrage
is possible. We call this the no arbitrage range.

Example 1.2.1. We know that on 16 December 2005 the futures had a fair value of 5551.44
based on the spot price of 5531.63. But the 3-month FTSE 100 futures actually closed at 5546.50.
How do you account for this difference?

Solution: The percentage mispricing was:

(5546.50 − 5551.44)/5531.63 = −8.9 bps.

But the usual no arbitrage range for the FTSE index is approximately ±25 bps because the trans-
actions costs are very small in such a liquid market. So the closing market price of the futures
falls well inside the usual no arbitrage range. However, calculations of this type can lead to a
larger mispricing, especially in less efficient markets than the FTSE 100.
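The percentage mispricing in the example can be reproduced directly; a minimal sketch:

```python
def mispricing_bps(market_futures, fair_futures, spot):
    """Mispricing x(t, T) = (F(t, T) - F*(t, T)) / S(t), expressed in basis points."""
    return 10_000 * (market_futures - fair_futures) / spot

# Example 1.2.1: market close 5546.50, fair value 5551.44, spot 5531.63
print(round(mispricing_bps(5546.50, 5551.44, 5531.63), 1))  # -8.9
```

A value of −8.9 bps lies well inside the usual ±25 bps no arbitrage range.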

Having said this it should be noted that:

• the prices should be contemporaneous. The spot market often closes before the futures,
in which case closing prices are not contemporaneous. It is possible that the futures price
changes considerably after the close of the spot market.

• the fair value of a futures contract is based on the assumption of a zero margin, so that
the fair value of the futures is set equal to the fair value of the forward. But in practice the
exchange requires margin payments, and in this case a negative correlation between the
returns on the spot index and the zero coupon curve (up to 3 months) would have the
effect of decreasing the fair value of the futures.

Any apparent "mispricing" of the futures relative to the spot index should always be viewed with
the above two comments in mind.

Figure 1.1: The no arbitrage range for the market price of a financial future

The figure illustrates the no arbitrage range for a 1-year forward on a non-dividend-paying asset,
assuming that two-way arbitrage is possible, i.e. that the spot can be sold as well as bought. The
subscripts A and B refer to 'ask' and 'bid' prices or rates. Two no arbitrage strategies are depicted:

1. Receive funds at r_B by selling the spot at S_B and buying the forward at F_B. By the no
arbitrage condition, F_B = S_B exp(r_B).

2. Borrow funds at r_A to buy the spot at S_A and sell the forward at F_A. By the no arbitrage
condition, F_A = S_A exp(r_A).

The no arbitrage range for the forward price is marked on the diagram with the grey double-headed
arrow. If the market bid price F_B^M or the market ask price F_A^M of the forward lies outside
the no arbitrage range [F_B, F_A] then we can make a profit as follows:

• if F_A^M > F_A then sell the forward and buy the spot, or

• if F_B^M < F_B then sell the spot and buy the forward.



1.3 Options
An option is a contract that gives the purchaser the right to enter into another contract during a
particular period of time. We call it a derivative because it is a contract on another contract. The
underlying contract is a security such as a stock or a bond, a financial asset such as a currency, or
a commodity, or an interest rate. The option contract that gives the right to buy the underlying
is termed a call option, and the right to sell the underlying is termed a put option. Interest rate
options give the purchaser the right to pay or receive an interest rate, and in the case of swaptions
the underlying contract is a swap.

1.3.1 Foundations
The market price of a liquid traded option at any time prior to expiry is determined by the supply
and demand for the option at that time, just like any other liquid traded asset. But when options
are illiquid the price that is quoted in the market must be derived from a model. The most fun-
damental model for an option price is based on the assumption that the underlying asset price
is log-normally distributed and thus follows a stochastic differential equation called a geometric
Brownian motion. Equivalently, the log price and also the log return are normally distributed and
follow an arithmetic Brownian motion. This section opens with a review of the characteristics of
Brownian motion processes. Then we describe the concepts of hedging and no arbitrage that
lead to the risk neutral valuation methodology for pricing options. We introduce the risk neutral
measure and explain why the asset price is not a martingale under this measure. However, we can
change the numeraire to the money market account, and then the price of a non-dividend paying
asset becomes a martingale. The change of numeraire technique is useful in many circumstances,
for instance when interest rates are stochastic or when pricing an exchange option. We do not
usually price options under the assumption that the price process is a geometric Brownian motion with
constant volatility. Instead, we use a stochastic volatility in the option pricing model, and some-
times we assume there are other stochastic parameters, and/or a mean reversion in the drift, or a
jump in the price and/or the volatility. Hence the option pricing model typically contains several
parameters that we calibrate by using the model to price liquid options (usually standard Euro-
pean calls and puts) and then changing the model parameters so that the model prices match the
observed market prices. Then, with these calibrated parameters, we can use the model to price
path-dependent and exotic options. However, if we are content to assume the price process is a
geometric Brownian motion with constant volatility, European exotics have prices that are easy to derive.

1.3.2 Put-Call Parity


The prices of European calls and puts are related by a simple rule that is derived using the no
arbitrage principle. To derive this relationship we use the usual trick of comparing two different
portfolios that have the same pay-off in all circumstances. The portfolios can contain a share and
European call and put options with the same strike K and maturity T on this share. First assume
the share pays no dividends and consider the following portfolios:

• Portfolio I: Buy one share and a European put on this share.



• Portfolio II: Buy a European call on the share with strike K and maturity T and lend an
amount equal to the present value of the strike, discounted at the risk free interest rate r.
It is easy to see that both portfolios have the same pay-off when the options mature. If it turns
out that S(T ) > K then both strategies pay-off S(T ) because:
• The put is worth nothing, so portfolio I pays S(T), and
• The call is worth S(T) − K, but you receive K from the pay-back on the loan.
If it turns out that S(T) < K then both strategies pay off K because:
• The put is worth K − S(T) while the share has value S(T), and
• The call is worth nothing but you still get K from the loan.
Thus both portfolios have the pay-off max(S(T ), K). This means that the two portfolios
must have the same value at any time t prior to the options expiry date T . Otherwise it would
be possible to make a risk free profit (i.e. an arbitrage) by shorting one portfolio and buying the
other. This no arbitrage argument implies that a standard European call price C(K, T ) and a
standard European put price P (K, T ) of the same strike and maturity on the same underlying
must satisfy the following relationship at any time t prior to the expiry date T :
P (S, t∣K, T ) + S(t) = C(S, t∣K, T ) + exp(−r(T − t))K.
Now suppose the share pays dividends with continuous dividend yield y. Then a similar no
arbitrage argument, but with portfolio I buying exp(−y(T − t)) shares and reinvesting the dividends
in the share, gives the relationship:
P (S, t∣K, T ) + exp(−y(T − t))S(t) = C(S, t∣K, T ) + exp(−r(T − t))K
Or, equivalently,
C(S, t∣K, T ) − P (S, t∣K, T ) = exp(−y(T − t))S(t) − exp(−r(T − t))K.
The put call relationship implies the following:
• Given the price of a European call it is simple to calculate the fair price of the corresponding
put, and vice versa.
• If market prices do not satisfy this relationship there is, theoretically, an opportunity for
arbitrage. However the trading costs may be too high for it to be possible to profit from the
arbitrage.
The put–call parity relationship can also be used to derive lower bounds for prices of European
options. Since the price of an option is never negative, put–call parity implies that:
C(S, t∣K, T ) ≥ exp(−y(T − t))S(t) − exp(−r(T − t))K
And similarly
P (S, t∣K, T ) ≥ exp(−r(T − t))K − exp(−y(T − t))S(t).
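The parity relationship and the lower bound translate directly into code. A minimal sketch (the quoted call price in the example is a made-up illustration, not market data):

```python
import math

def put_from_call(call, S, K, T, t, r, y=0.0):
    """European put price implied by put-call parity:
    P = C + exp(-r(T - t))K - exp(-y(T - t))S."""
    tau = T - t
    return call + math.exp(-r * tau) * K - math.exp(-y * tau) * S

def call_lower_bound(S, K, T, t, r, y=0.0):
    """Lower bound exp(-y(T - t))S - exp(-r(T - t))K, floored at zero."""
    tau = T - t
    return max(0.0, math.exp(-y * tau) * S - math.exp(-r * tau) * K)

# A call quoted at 10 with S = K = 100, r = 5%, no dividends, one year to expiry
print(round(put_from_call(10.0, S=100, K=100, T=1.0, t=0.0, r=0.05), 4))  # 5.1229
```

If market prices violate either relationship by more than the trading costs, an arbitrage is available in principle.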

1.3.3 American Option


An American option has the same pay-off function as its European counterpart, but the pay-off
relates to any time before expiry because the option can be exercised early. This early exercise
feature is often included in new issues where market pricing may be unreliable. The ability to
exercise at any time and obtain the intrinsic value of the option gives investors added security,
and thus boosts the demand for a new product. The intrinsic value of an American call at time
t is S(t) − K and that of an American put is K − S(t), where K is the option strike and S(t) is the
underlying price at time t. Before expiry the price of an American option is never less than the
price of the corresponding European option, and as the expiry time approaches the two prices
converge. The option to exercise early suggests that these options may command a higher price
than their European counterpart. But, using the put–call parity relationship for standard Euro-
pean calls and puts, we can show that this is not always the case. Consider an American call
option with strike K and maturity T and denote its price by CA(S, t∣K, T ). Using the put call
parity relationship and the fact that CA(S, t∣K, T ) can never be less than the European price, we
have:
CA(S, t∣K, T ) ≥ exp(−y(T − t))S(t) − exp(−r(T − t))K.
But exp(−r(T − t)) ≤ 1, hence

CA(S, t∣K, T ) ≥ exp(−y(T − t))S(t) − K.

If there are no dividends then the right-hand side is the intrinsic value of the option, so it
never pays to exercise a non-dividend paying American call early. It also never pays to exercise an
American call, or a put, on a forward contract provided that we do not have to pay the premium.
This is the case when the options are margined, like futures contracts, and the settlement value
for the option is determined by the difference between the initial and final option prices. When
the option is on the forward price F (t, T ) and the option is margined the put–call parity for
European options becomes:

CE(F, t∣K, T ) − P E(F, t∣K, T ) = F (t, T ) − K.

Here the superscript E denotes the European option price. Hence:

CA(F, t∣K, T ) ≥ CE(F, t∣K, T ) ≥ F (t, T ) − K.

and
P A(F, t∣K, T ) ≥ P E(F, t∣K, T ) ≥ K − F (t, T ).
so in both cases we never gain by early exercise.
Chapter 2

Mathematical models for options pricing

2.1 Principles of Options Pricing


In discussing general principles of Monte Carlo, it is useful to have some simple specific examples
to which to refer. As a first illustration of a Monte Carlo method, we consider the calculation of
the expected present value of the payoff of a call option on a stock.
Let S(t) denote the price of the stock at time t. Consider a call option granting the holder
the right to buy the stock at a fixed price K at a fixed time T in the future; the current time is
t = 0. If at time T the stock price S(T ) exceeds the strike price K, the holder exercises the option
for a profit of S(T) − K; if, on the other hand, S(T) ≤ K, the option expires worthless. (This is a
European option, meaning that it can be exercised only at the fixed date T ; an American option
allows the holder to choose the time of exercise.) The payoff to the option holder at time T is thus
(S(T ) − K)+ = max{0, S(T ) − K}.
To get the present value of this payoff we multiply by a discount factor exp(−rT), with r a continuously
compounded interest rate. We denote the expected present value by E[exp(−rT)(S(T) − K)+].
For this expectation to be meaningful, we need to specify the distribution of the random
variable S(T ), the terminal stock price. In fact, rather than simply specifying the distribution at
a fixed time, we introduce a model for the dynamics of the stock price. The Black-Scholes model
describes the evolution of the stock price through the stochastic differential equation (SDE)
dS(t)/S(t) = r dt + σ dW(t),     (2.1)
with W a standard Brownian motion. This equation may be interpreted as modeling the per-
centage changes dS/S in the stock price as the increments of a Brownian motion. The parameter
σ is the volatility of the stock price and the coefficient on dt in (2.1) is the mean rate of return.
In taking the rate of return to be the same as the interest rate r, we are implicitly describing the
risk-neutral dynamics of the stock price.
The solution of the stochastic differential equation (2.1) is

S(T) = S(0) exp([r − (1/2)σ^2]T + σW(T)).     (2.2)


As S(0) is the current price of the stock, we may assume it is known. The random variable W(T)
is normally distributed with mean 0 and variance T; this is also the distribution of √T Z if Z is a
standard normal random variable (mean 0, variance 1). We may therefore represent the terminal
stock price as

S(T) = S(0) exp([r − (1/2)σ^2]T + σ√T Z).     (2.3)
The logarithm of the stock price is thus normally distributed, and the stock price itself has a
lognormal distribution.
The expectation E[exp(−rT)(S(T) − K)+] is an integral with respect to the lognormal density
of S(T). This integral can be evaluated in terms of the standard normal cumulative distribution
function Φ as BS(S(0), σ, T, r, K) with

BS(S, σ, T, r, K) = S Φ((log(S/K) + (r + (1/2)σ^2)T) / (σ√T))
                    − exp(−rT) K Φ((log(S/K) + (r − (1/2)σ^2)T) / (σ√T)).     (2.4)

This is the Black-Scholes formula for a call option.
In light of the availability of this formula, there is no need to use Monte Carlo to compute
E [e−rT (S(T ) − K)+ ]. Moreover, we noted earlier that Monte Carlo is not a competitive method
for computing one-dimensional integrals. Nevertheless, we now use this example to illustrate the
key steps in Monte Carlo. From ( 2.3) we see that to draw samples of the terminal stock price S(T )
it suffices to have a mechanism for drawing samples from the standard normal distribution. For
now we simply assume the ability to produce a sequence Z1 , Z2 , . . . of independent standard
normal random variables.
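These steps can be sketched in a few lines of Python. The sketch below is a minimal illustration, not production code: it samples S(T ) via (2.3) from standard normal draws (here Python's random module plays the role of the sampling mechanism) and compares the Monte Carlo estimate with the closed-form price (2.4). The parameter values S(0) = 100, K = 100, r = 5%, σ = 20%, T = 1 are arbitrary choices for the example.

```python
import math
import random

def bs_call(S0, sigma, T, r, K):
    """Closed-form Black-Scholes call price (2.4)."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    return S0 * phi(d1) - math.exp(-r * T) * K * phi(d1 - sigma * math.sqrt(T))

def mc_call(S0, sigma, T, r, K, n, seed=0):
    """Monte Carlo estimate of E[e^{-rT} (S(T) - K)^+], sampling S(T) via (2.3)."""
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n):
        Z = rng.gauss(0.0, 1.0)                  # independent standard normal draw
        ST = S0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * Z)
        total += disc * max(ST - K, 0.0)         # discounted payoff on this path
    return total / n

if __name__ == "__main__":
    print(round(bs_call(100.0, 0.2, 1.0, 0.05, 100.0), 3))   # about 10.451
    print(round(mc_call(100.0, 0.2, 1.0, 0.05, 100.0, 100_000), 3))
```

With 100,000 samples the standard error of the estimator is roughly 0.05, so the two printed values agree to about a decimal place.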

2.2 Path-Dependent Example


The payoff of a standard European call option is determined by the terminal stock price S(T )
and does not otherwise depend on the evolution of S(t) between times 0 and T . In estimating
E [e−rT (S(T ) − K)+ ], we were able to jump directly from time 0 to time T using ( 2.3) to sample
values of S(T ). Each simulated “path” of the underlying asset thus consists of just the two points
S(0) and S(T ).
In valuing more complicated derivative securities using more complicated models of the dy-
namics of the underlying assets, it is often necessary to simulate paths over multiple intermediate
dates and not just at the initial and terminal dates. Two considerations may make this necessary:

• the payoff of a derivative security may depend explicitly on the values of underlying assets
at multiple dates;

• we may not know how to sample transitions of the underlying assets exactly and thus need
to divide a time interval [0, T ] into smaller subintervals to obtain a more accurate approx-
imation to sampling from the distribution at time T .

In many cases, both considerations apply.


2.2. PATH-DEPENDENT EXAMPLE 11

Before turning to a detailed example of the first case, we briefly illustrate the second. Consider a generalization of the basic model (2.1) in which the dynamics of the underlying asset S(t) are given by

dS(t) = rS(t) dt + σ(S(t))S(t) dW (t).   (2.5)

In other words, we now let the volatility σ depend on the current level of S. Except in very special
cases, this equation does not admit an explicit solution of the type in ( 2.2) and we do not have an
exact mechanism for sampling from the distribution of S(T ). In this setting, we might instead
partition [0, T ] into m subintervals of length ∆t = T /m and over each subinterval [t, t + ∆t]
simulate a transition using a discrete (Euler) approximation to ( 2.5) of the form


S(t + ∆t) = S(t) + rS(t)∆t + σ(S(t))S(t)√∆t Z,

with Z a standard normal random variable. This relies on the fact that W (t + ∆t) − W (t) has mean 0 and standard deviation √∆t. For each step, we would use an independent draw from
the normal distribution. Repeating this for m steps produces a value of S(T ) whose distribution
approximates the exact (unknown) distribution of S(T ) implied by (2.5). We expect that as m
becomes larger (so that ∆t becomes smaller) the approximating distribution of S(T ) draws closer
to the exact distribution.
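The Euler scheme above can be sketched as follows. The particular local volatility function σ(S) used in the demonstration is a hypothetical choice for illustration only; the model leaves it unspecified. As a sanity check, the sample mean of the simulated S(T ) should be close to S(0)e^{rT }, since the Euler recursion preserves the risk-neutral drift regardless of the volatility function.

```python
import math
import random

def euler_path(S0, r, T, m, local_vol, rng):
    """One Euler path of dS = rS dt + sigma(S) S dW over m steps of size T/m."""
    dt = T / m
    sqdt = math.sqrt(dt)
    S = S0
    for _ in range(m):
        Z = rng.gauss(0.0, 1.0)
        S = S + r * S * dt + local_vol(S) * S * sqdt * Z
        S = max(S, 0.0)                 # keep the discretized price nonnegative
    return S

if __name__ == "__main__":
    # hypothetical local-volatility function: volatility rises as the price falls
    local_vol = lambda S: 0.2 * (100.0 / max(S, 1e-8)) ** 0.5
    rng = random.Random(1)
    n = 20_000
    mean_ST = sum(euler_path(100.0, 0.05, 1.0, 50, local_vol, rng) for _ in range(n)) / n
    print(round(mean_ST, 2))            # should be near 100*exp(0.05), i.e. about 105.1
```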
Even if we assume the dynamics (2.1) of the Black-Scholes model, it may be necessary to
simulate paths of the underlying asset if the payoff of a derivative security depends on the value
of the underlying asset at intermediate dates and not just the terminal value. Asian options are
arguably the simplest path-dependent options for which Monte Carlo is a competitive compu-
tational tool. These are options with payoffs that depend on the average level of the underlying
asset. This includes, for example, the payoff (S̄ − K)+ with

S̄ = (1/m) ∑_{j=1}^{m} S(tj )   (2.6)

for some fixed set of dates 0 = t0 < t1 < ⋯ < tm = T , with T the date at which the payoff is
received.
To calculate the expected discounted payoff E [e−rT (S̄ − K)+ ], we need to be able to generate samples of the average S̄. The simplest way to do this is to simulate the path S(t1 ), . . . , S(tm ) and then compute the average along the path. We saw in (2.3) how to simulate S(T ) given S(0); simulating S(tj+1 ) from S(tj ) works the same way:

S(tj+1 ) = S(tj ) exp ([r − σ²/2](tj+1 − tj ) + σ√(tj+1 − tj ) Zj+1 )   (2.7)

where Z1 , . . . , Zm are independent standard normal random variables. Given a path of values, it is a simple matter to calculate S̄ and then the discounted payoff e−rT (S̄ − K)+ .
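Putting (2.6) and (2.7) together, a minimal sketch of the Asian-call estimator is the following (the parameter values, monthly averaging over one year, are illustrative choices, not prescribed above):

```python
import math
import random

def asian_call_mc(S0, sigma, T, r, K, m, n, seed=0):
    """Monte Carlo estimate of E[e^{-rT} (Sbar - K)^+] for a discretely
    monitored Asian call, simulating the path by the exact recursion (2.7)
    on equally spaced dates t_j = j*T/m."""
    rng = random.Random(seed)
    dt = T / m
    drift = (r - 0.5 * sigma**2) * dt
    vol = sigma * math.sqrt(dt)
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n):
        S = S0
        path_sum = 0.0
        for _ in range(m):
            S *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path_sum += S                    # accumulate S(t_1), ..., S(t_m)
        total += disc * max(path_sum / m - K, 0.0)
    return total / n

if __name__ == "__main__":
    # illustrative parameters: monthly averaging over one year
    print(round(asian_call_mc(100.0, 0.2, 1.0, 0.05, 100.0, m=12, n=50_000), 2))
```

Because averaging reduces the effective volatility of the payoff, the Asian call price comes out well below the corresponding European call price.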

2.3 Pricing and Replication


To further develop these ideas, we consider an economy with d assets whose prices Si (t), i =
1, . . . , d, are described by a system of SDEs

dSi (t)/Si (t) = µi (S(t), t) dt + σi (S(t), t)⊺ dW o (t),   (2.8)

with W o a k-dimensional Brownian motion, each σi taking values in Rk , and each µi scalar-valued. We assume that the µi and σi are deterministic functions of the current state S(t) =
(S1 (t), . . . , Sd (t))⊺ and time t, though the general theory allows these coefficients to depend on
past prices as well. Let
Σij = σi⊺ σj , i, j = 1, . . . , d; (2.9)
this may be interpreted as the covariance between the instantaneous returns on assets i and j.
A portfolio is characterized by a vector θ ∈ Rd with θi representing the number of units held
of the ith asset. Since each unit of the ith asset is worth Si (t) at time t, the value of the portfolio
at time t is
θ1 S1 (t) + ⋯ + θd Sd (t),
which we may write as θ⊺ S(t). A trading strategy is characterized by a stochastic process θ(t)
of portfolio vectors. To be consistent with the intuitive notion of a trading strategy, we need to
restrict θ(t) to depend only on information available at t; this is made precise through a measur-
ability condition (for example, that θ be predictable).
If we fix the portfolio holdings at θ(t) over the interval [t, t+h], then the change in value over
this interval of the holdings in the ith asset is given by θi (t)[Si (t + h) − Si (t)]; the change in the
value of the portfolio is given by θ(t)⊺ [S(t+h)−S(t)]. This suggests that in the continuous-time
limit we may describe the gains from trading over [0, t] through the stochastic integral
∫_0^t θ(u)⊺ dS(u),

subject to regularity conditions on S and θ. Notice that we allow trading of arbitrarily large or
small, positive or negative quantities of the underlying assets continuously in time; this is a con-
venient idealization that ignores constraints on real trading.
A trading strategy is self-financing if it satisfies
θ(t)⊺ S(t) − θ(0)⊺ S(0) = ∫_0^t θ(u)⊺ dS(u)   (2.10)

for all t. The left side of this equation is the change in portfolio value from time 0 to time t and
the right side gives the gains from trading over this interval. Thus, the self-financing condition
states that changes in portfolio value equal gains from trading: no gains are withdrawn from the
portfolio and no funds are added. By rewriting (2.10) as
θ(t)⊺ S(t) = θ(0)⊺ S(0) + ∫_0^t θ(u)⊺ dS(u),

we can interpret it as stating that from an initial investment of V (0) = θ(0)⊺ S(0) we can achieve
a portfolio value of V (t) = θ(t)⊺ S(t) by following the strategy θ over [0, t].
Consider, now, a derivative security with a payoff of f (S(T )) at time T ; this could be a
standard European call or put on one of the d assets, for example, but the payoff could also depend
on several of the underlying assets. Suppose that the value of this derivative at time t, 0 ≤ t ≤
T , is given by some function V (S(t), t). The fact that the dynamics in (2.8) depend only on
(S(t), t) makes it at least plausible that the same might be true of the derivative price. If we
further conjecture that V is a sufficiently smooth function of its arguments, Itô's formula gives

V (S(t), t) = V (S(0), 0) + ∑_{i=1}^{d} ∫_0^t (∂V (S(u), u)/∂Si ) dSi (u) + ∫_0^t [ ∂V (S(u), u)/∂u + (1/2) ∑_{i,j=1}^{d} Si (u)Sj (u) Σij (S(u), u) ∂²V (S(u), u)/∂Si ∂Sj ] du,   (2.11)
with Σ as in (2.9). If the value V (S(t), t) can be achieved from an initial wealth of V (S(0), 0)
through a self-financing trading strategy θ, then we also have
V (S(t), t) = V (S(0), 0) + ∑_{i=1}^{d} ∫_0^t θi (u) dSi (u).   (2.12)

Comparing terms in (2.11) and (2.12), we find that both equations hold if
θi (u) = ∂V (S(u), u)/∂Si ,   i = 1, . . . , d,   (2.13)
and
∂V (S, u)/∂u + (1/2) ∑_{i,j=1}^{d} Σij (S, u) Si Sj ∂²V (S, u)/∂Si ∂Sj = 0.   (2.14)
Since we also have V (S(t), t) = θ⊺ (t)S(t), (2.13) implies
V (S, t) = ∑_{i=1}^{d} Si ∂V (S, t)/∂Si .   (2.15)
Finally, at t = T we must have
V (S, T ) = f (S) (2.16)
if V is indeed to represent the value of the derivative security.

2.4 Black-Scholes Model


As an illustration of the general formulation in (2.14) and (2.15), we consider the pricing of Eu-
ropean options in the Black-Scholes model. The model contains two assets. The first (often in-
terpreted as a stock price) is risky and its dynamics are represented through the scalar SDE
dS(t)/S(t) = µ dt + σ dW o (t)
14 CHAPTER 2. MATHEMATICAL MODELS FOR OPTIONS PRICING

with W o a one-dimensional Brownian motion. The second asset (often called a savings account
or a money market account) is riskless and grows deterministically at a constant, continuously
compounded rate r; its dynamics are given by

dβ(t)/β(t) = r dt.

Clearly, β(t) = β(0)ert and we may assume the normalization β(0) = 1. We are interested in
pricing a derivative security with a payoff of f (S(T )) at time T . For example, a standard call
option pays (S(T ) − K)+ , with K a constant.
If we were to formulate this model in the notation of (2.8), Σ would be a 2 × 2 matrix with
only one nonzero entry, σ 2 . Making the appropriate substitutions, (2.14) thus becomes

∂V /∂t + (1/2)σ²S² ∂²V /∂S² = 0.   (2.17)
Equation (2.15) becomes
V (S, β, t) = (∂V /∂S) S + (∂V /∂β) β.   (2.18)
These equations and the boundary condition V (S, β, T ) = f (S) determine the price V .
This formulation describes the price V as a function of the three variables S, β, and t. Because
β depends deterministically on t, we are interested in values of V only at points (S, β, t) with β = ert . This allows us to eliminate one variable and write the price as Ṽ (S, t) = V (S, ert , t).
Making this substitution in (2.17) and (2.18), noting that

∂ Ṽ /∂t = rβ ∂V /∂β + ∂V /∂t

and simplifying yields


∂ Ṽ /∂t + rS ∂ Ṽ /∂S + (1/2)σ²S² ∂²Ṽ /∂S² − rṼ = 0.
This is the Black-Scholes PDE characterizing the price of a European derivative security. For the
special case of the boundary condition Ṽ (S, T ) = (S − K)+ , the solution is given by Ṽ (S, t) = BS(S, σ, T − t, r, K), the Black-Scholes formula in (2.4).
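As an illustration, this PDE can also be solved numerically and checked against the closed-form price. The sketch below uses a simple explicit finite-difference scheme on a truncated S-grid; the grid sizes, the truncation S_max = 300, and the boundary conditions are our own choices for the example (an explicit scheme is only stable for sufficiently small time steps), not prescribed by the text above.

```python
import math

def bs_call(S0, sigma, T, r, K):
    """Closed-form Black-Scholes call price (2.4)."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    return S0 * phi(d1) - math.exp(-r * T) * K * phi(d1 - sigma * math.sqrt(T))

def bs_pde_call(sigma, T, r, K, S_max=300.0, M=150, N=1000):
    """Explicit finite differences for V_t + rS V_S + (1/2) sigma^2 S^2 V_SS - rV = 0,
    marched forward in time-to-maturity tau = T - t starting from the payoff."""
    dS = S_max / M
    dtau = T / N
    S = [i * dS for i in range(M + 1)]
    V = [max(s - K, 0.0) for s in S]              # payoff at tau = 0
    for n in range(1, N + 1):
        tau = n * dtau
        new = V[:]
        for i in range(1, M):
            V_SS = (V[i + 1] - 2.0 * V[i] + V[i - 1]) / dS**2
            V_S = (V[i + 1] - V[i - 1]) / (2.0 * dS)
            new[i] = V[i] + dtau * (0.5 * sigma**2 * S[i]**2 * V_SS
                                    + r * S[i] * V_S - r * V[i])
        new[0] = 0.0                              # a call is worthless at S = 0
        new[M] = S_max - K * math.exp(-r * tau)   # deep in-the-money asymptote
        V = new
    return S, V

if __name__ == "__main__":
    S, V = bs_pde_call(0.2, 1.0, 0.05, 100.0)
    print(round(V[S.index(100.0)], 2), round(bs_call(100.0, 0.2, 1.0, 0.05, 100.0), 2))
```

With this grid the at-the-money finite-difference value agrees with the Black-Scholes formula to within a few cents.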

2.5 Parameters and calibration


The simplest possible assumption about the stochastic process that drives the underlying price
is that it is a geometric Brownian motion with constant volatility. This is the assumption that
underlies the Black–Scholes–Merton model, but it is not representative of the observed move-
ments in returns in the real world. Thus during the last decade there has been a huge amount
of research on numerous stochastic volatility option pricing models, some of which add jumps
in the price and/or the volatility equations. These models assume that the volatility parameter
is itself stochastic, so it becomes a latent (i.e. unobservable) risk factor. Sometimes we assume
the mean is a deterministic function of time, so the mean of the asset price distribution at some
fixed future point in time is just the risk free return from now until that time, less any benefits or
plus any costs associated with holding the underlying. But some pricing models assume that the
drift term has a mean-reverting component and/or that the interest rate or the dividend yields
are stochastic. Making a parameter into a risk factor introduces other parameters to the option
pricing model, because these are required to describe the distribution of this new risk factor. For
instance, the parameters of a typical stochastic volatility model include the spot volatility, the
long run volatility, the rate of volatility mean reversion, the volatility of volatility, and the price–
volatility correlation. In addition, we could include a volatility risk premium. When volatility
is assumed to be the only stochastic parameter there are two risk factors in the option pricing
model, the price and the volatility. To price the option we need to know their joint distribution at
every point in time, unless it is a standard European option in which case we only need to know
their joint distribution at the time of expiry. The easiest way to do this is to describe the evolution
of each risk factor in continuous time using a pair of stochastic differential equations. The model
parameters are the parameters in the price and volatility equations, and the price–volatility corre-
lation. The model parameters are calibrated by matching the model prices of standard European
options to the observable market prices of these options. There are usually very many standard
European options prices that are quoted in the market but an option pricing model typically has
only a handful of parameters. Hence, the model prices will not match every single market price
exactly. The calibration of parameters is therefore based on a numerical method to minimize a
root mean square error between the model price and the market price of every standard European
option for which there is a market price. More precisely, we can calibrate the model parameters
λ by applying a numerical method to the optimization problem:

min_λ ∑_{K,T} w(K, T ) (f m (K, T ) − f (K, T ∣λ))²

where f m (K, T ) and f (K, T ∣λ) denote the market and model prices of a European option with
strike K and expiry time T , w(K, T ) is a weighting function (for instance, we might take it to
be proportional to the option gamma, to place more weight on short term near at-the-money
options) and the sum is taken over all options with liquid market prices. Most of the trading is
on options that are near to at-the-money. The observed prices of deep in-the-money and out-of-
the-money options may be stale if the market has moved since they were last traded. Therefore
it is common to use only those options that are within 10% of the at-the-money to calibrate the
model. Once calibrated the option pricing model can be used to price exotic and path-dependent
options. Traders may also use the model to dynamically hedge the risks of standard European
options, although there is no clear evidence that this would be any better than using the Black–
Scholes–Merton model for hedging.
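A toy version of this calibration can be sketched as follows. To keep the example self-contained we take a "model" with a single parameter, a flat Black-Scholes volatility, generate synthetic market quotes from a known true value, and minimize the weighted squared pricing error with a golden-section search. The near-the-money weighting function below is a hypothetical choice for illustration, not one prescribed above; a real stochastic volatility calibration has several parameters and needs a multivariate optimizer.

```python
import math

def bs_call(S0, sigma, T, r, K):
    """Closed-form Black-Scholes call price."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    return S0 * phi(d1) - math.exp(-r * T) * K * phi(d1 - sigma * math.sqrt(T))

def calibrate_sigma(market, S0, r, weight, lo=0.01, hi=1.0, iters=60):
    """Minimize sum_{K,T} w(K,T) (f_m(K,T) - f(K,T|sigma))^2 over sigma
    by golden-section search (the toy model has a single parameter)."""
    def objective(sig):
        return sum(weight(K, T) * (p - bs_call(S0, sig, T, r, K)) ** 2
                   for (K, T), p in market.items())
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if objective(c) < objective(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

if __name__ == "__main__":
    S0, r, true_sigma = 100.0, 0.05, 0.25
    quotes = {(K, T): bs_call(S0, true_sigma, T, r, K)
              for K in (90.0, 100.0, 110.0) for T in (0.5, 1.0)}
    w = lambda K, T: 1.0 / (1.0 + abs(K - S0) / S0)   # hypothetical ATM weighting
    print(round(calibrate_sigma(quotes, S0, r, w), 4))   # recovers about 0.25
```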

2.6 Volatility
A set of standard European options of different strikes and maturities on the same underlying
has two related volatility surfaces, the implied volatility surface, and the local volatility surface.
When we use the market prices of options to derive these surfaces we call them the market implied

volatility surface and the market local volatility surface. When we use option prices based on a
stochastic volatility model we call them the model implied volatility surface and the model local
volatility surface. Implied volatility is a transformation of a standard European option price. It is
the volatility that, when input into the Black–Scholes–Merton (BSM) formula, yields the price of
the option. In other words, it is the constant volatility of the underlying process that is implicit in
the price of the option. For this reason some authors refer to implied volatility as implicit volatility.
The BSM model implied volatilities are constant. And if the assumptions of the BSM model were
valid then all options on the same underlying would have the same market implied volatility.
However, traders do not believe in these assumptions, hence the market prices of options yield a surface of market implied volatilities, by strike (or moneyness) and maturity of the option, that is not flat. In particular, the market implied volatility of all options of the same maturity but different strikes has a skewed smile shape when plotted as a function of the strike (or moneyness) of the options. This is called the (market) volatility smile. And the market implied volatility of all options of the same strike (or moneyness) but different maturities converges to the long term implied volatility when plotted as a function of maturity. This is called the (market) term structure
of implied volatility. An implied volatility is a deterministic function of the price of a standard
European option (the transformation that is implicitly specified by the BSM formula). Thus a
dynamic model of implied volatility is also a dynamic model of the option price and hedge ratios.
If we can forecast market implied volatility successfully then we can also forecast the market price
of the option and hedge the option accurately. Hence, practitioners spend considerable resources
on developing models of market implied volatility dynamics.

2.7 Implied Volatility


The market implied volatility of a standard European option, or just the implied volatility for
short, is the volatility of the underlying that gives the market price of the option when used in the
Black–Scholes–Merton formula. If put – call parity holds on market prices then a call and a put of
the same strike and maturity will have the same implied volatility. The presence of the volatility
smile has motivated the development of many new option pricing models, such as stochastic
volatility models and models based on jumps in prices and volatilities: these models are better
able to explain why options are transacted at the prices that we observe.
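Because the BSM call price is strictly increasing in volatility, the implied volatility can be recovered from a quoted price by a one-dimensional root search. A minimal bisection sketch (all parameter values are illustrative; in the demonstration the "market" price is itself generated from the BSM formula so the known volatility is recovered):

```python
import math

def bs_call(S0, sigma, T, r, K):
    """Black-Scholes-Merton call price."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    return S0 * phi(d1) - math.exp(-r * T) * K * phi(d1 - sigma * math.sqrt(T))

def implied_vol(price, S0, T, r, K, lo=1e-6, hi=5.0, tol=1e-8):
    """Invert the BSM formula by bisection: the call price is strictly
    increasing in sigma, so the root in [lo, hi] is unique."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S0, mid, T, r, K) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    market_price = bs_call(100.0, 0.3, 0.75, 0.02, 110.0)
    print(round(implied_vol(market_price, 100.0, 0.75, 0.02, 110.0), 4))  # 0.3
```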

2.8 Stochastic Volatility Models


The BSM model is the benchmark by which market prices of standard European options are mea-
sured. In currency markets the prices of options are even quoted in terms of their BSM volatility
rather than their price. The BSM model makes the assumption that the underlying price has
a lognormal distribution and constant volatility but, since the market implied volatility surface
is not flat, this assumption is clearly not believed by traders. Historically the returns on most
financial assets are not normally distributed, they are leptokurtic and skewed. Moreover, volatil-
ity is not constant. The log returns distribution changes randomly over time, so its volatility is
stochastic. There is an enormous literature on models that relax the assumptions made by the
BSM model so that the price process is no longer assumed to follow a geometric Brownian mo-
tion with constant volatility. The aim of these models is to price OTC options that may have
exotic and/or path-dependent pay-offs so that their prices are consistent with the market prices
of standard European options. This precludes the possibility of arbitrage between an exotic op-
tion and a replicating portfolio of standard calls and puts. Hence, the market prices of European
options are used to calibrate the model parameters. We may calibrate the model by changing the
parameters so that the model prices of European options are set equal to, or as close as possible to,
the market prices. Alternatively, the model’s parameters may be calibrated by fitting the model
implied volatilities to the market implied volatility smile.

min_λ ∑_{K,T} w(K, T ) (θm (K, T ) − θ(K, T ∣λ))²

where θm (K, T ) and θ(K, T ∣λ) denote the market and model implied volatilities of a European
option with strike K and expiry time T and w(K, T ) is some weighting function that ensures
that more weight is placed on the options with more reliable prices.

2.9 Stochastic Volatility PDE


Suppose a price S(t) is driven by a geometric Brownian motion but that volatility σ(t) is stochas-
tic and driven by its own Brownian motion that is correlated with the Brownian motion driving
the price process. We write

dS(t)/S(t) = (r − y)dt + σ(t)dW1 (t),
dσ(t) = α(σ, t)dt + β(σ, t)dW2 (t)   (2.19)

and, using ⟨⋅, ⋅⟩ to denote the quadratic covariation, we have

⟨dW1 (t), dW2 (t)⟩ = ρdt.   (2.20)

The property (2.20) implies a non-zero constant price–volatility correlation ρ between the log
returns on the underlying and the changes in volatility.
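A path of (2.19)–(2.20) can be simulated by building correlated normal increments from independent ones: dW2 = ρ dW1 + √(1 − ρ²) dW⊥ with dW⊥ independent of dW1. In the Euler sketch below the drift α and vol-of-vol β are hypothetical mean-reverting choices for illustration only; the model leaves their functional forms unspecified. As a check, the sample mean of S(T ) should stay near S(0)e^{(r−y)T }.

```python
import math
import random

def simulate_sv(S0, sig0, r, y, rho, alpha, beta, T, m, rng):
    """Euler discretization of (2.19); correlated increments are built as
    dW2 = rho*dW1 + sqrt(1 - rho^2)*dW_perp per (2.20)."""
    dt = T / m
    sqdt = math.sqrt(dt)
    S, sig = S0, sig0
    for _ in range(m):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        S *= 1.0 + (r - y) * dt + sig * sqdt * z1
        sig += alpha(sig) * dt + beta(sig) * sqdt * z2
        sig = max(sig, 1e-8)        # keep volatility positive after the Euler step
    return S, sig

if __name__ == "__main__":
    alpha = lambda s: 2.0 * (0.2 - s)   # hypothetical mean reversion toward 0.2
    beta = lambda s: 0.3 * s            # hypothetical vol-of-vol
    rng = random.Random(7)
    paths = [simulate_sv(100.0, 0.2, 0.05, 0.0, -0.7, alpha, beta, 1.0, 252, rng)
             for _ in range(2000)]
    mean_S = sum(p[0] for p in paths) / len(paths)
    print(round(mean_S, 1))             # near 100*exp(0.05), i.e. about 105
```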
When volatility is stochastic there are two sources of risk that must be hedged, the price risk
and the volatility risk. In this subsection we set up a risk free portfolio just as in the constant
volatility case, but now the portfolio contains three assets: the underlying asset with price S and
two options on S. The second option is needed to hedge the second source of uncertainty. In
the following we write the price of a general claim on S as g(S, σ), thus making explicit its de-
pendence on the volatility σ but ignoring the other variables and parameters that may affect the
option price.
We set up a portfolio with price Π consisting of a short position of one unit in an option with
price g1 (S, σ), a long position of δ1 units of the underlying with price S and a position of δ2 units

in another option on S. Denoting the price of the second option by g2 (S, σ), we may write down
the price of the portfolio as

Π(S, σ) = δ1 S + δ2 g2 (S, σ) − g1 (S, σ)   (2.21)

To eliminate both sources of uncertainty we must choose δ1 and δ2 in such a way that the portfolio price remains constant with respect to changes in both the price S and the volatility σ.
Differentiating (2.21) with respect to S and σ and setting the derivatives to zero gives two first order conditions that are easily solved, namely

ΠS (S, σ) = 0 ∶  g1S (S, σ) = δ1 + δ2 g2S (S, σ)  ⇒  δ1 = g1S (S, σ) − δ2 g2S (S, σ)

and

Πσ (S, σ) = 0 ∶  g1σ (S, σ) = δ2 g2σ (S, σ)  ⇒  δ2 = g1σ (S, σ)/g2σ (S, σ)   (2.22)
Substituting these values for δ1 and δ2 back into (2.21) yields

Π(S, σ) = S g1S (S, σ) − g1 (S, σ) − (g1σ (S, σ)/g2σ (S, σ)) (S g2S (S, σ) − g2 (S, σ))   (2.23)

This portfolio will be risk free so it must earn a net return equal to the risk free rate, r. Since holding the portfolio may earn income such as dividends or incur costs such as carry costs at the rate y, the portfolio's price must grow at the rate r − y. We assume both r and y are constant and set

dΠ(S, σ)/Π(S, σ) = (r − y)dt.   (2.24)

Now, just as in the constant volatility case, we apply Itô’s lemma to (2.23) using the approxi-
mation that δ1 and δ2 are unchanged over a small time interval dt. Then we use the resulting total
derivative in (2.23) and (2.24) to derive a PDE that must be satisfied by every claim. However, we
have only one condition (2.24) from which to derive the values of two unknowns, i.e. g1 (S, σ)
and g2 (S, σ). So there are infinitely many solutions.
We resolve this problem by introducing a parameter corresponding to a premium that in-
vestors demand for holding the risky asset called volatility. This is called the volatility risk pre-
mium or the market price of volatility risk and is here denoted λ. The resulting PDE for the price
of the claim may be written

gt + (r − y)SgS + (1/2)σ²S²gSS + (α − λβ)gσ + ρβσSgSσ + (1/2)β²gσσ = (r − y)g,   (2.25)

where g can be either g1 (S, σ) or g2 (S, σ) and we have dropped the dependence on variables
from our notation for simplicity. If volatility were constant the term

(α − λβ)gσ + ρβσSgSσ + (1/2)β²gσσ



would be zero and the PDE would reduce to the BSM PDE. When volatility is not constant the
general solution to (2.25) may be written in the form of the stochastic volatility model (2.19).
The presence of a volatility risk premium in the stochastic volatility PDE (2.25) indicates that
we are in an incomplete market. That is, it is not possible to replicate the value of every claim
with a self-financing portfolio. Then the price of a claim is not unique. Different investors have
different claim prices depending on their risk attitude. However, it is possible to complete the
market by adding all options on a tradable asset, so that we can observe the price of the second
option instead of having to solve for it. In this case, there should be no volatility risk premium.
We shall see below that the volatility risk premium appears as a parameter in the drift term
of the volatility or variance diffusion in parametric stochastic volatility models. For instance, in
the Heston model the volatility risk premium affects the rate of mean reversion. When the price–
volatility correlation is positive, most investors have a positive volatility risk premium and the
mean reversion speed is slow, and when the price–volatility correlation is negative, most investors
have a negative volatility risk premium and mean reversion is rapid. To see why this is intuitive,
suppose that we are modelling the equity index price and volatility processes with the Heston
model. The price–volatility correlation in equity indices is large and negative, so when the index
falls volatility is high. Suppose all investors have the same negative volatility risk premium – which
means they like to hold volatility. Then after a market fall investors will buy the index, because
of the high volatility it adds to their portfolio, so the index price rises again and volatility will
come down very quickly, i.e. the mean reversion rate will be rapid. But if investors had the same
positive volatility risk premium, they would not buy into high volatility and therefore it would
take longer to mean revert.
When we calibrate a stochastic volatility model to market data on option prices we often
assume the market has been completed (by adding all options on S to the tradable assets) and so
the volatility risk premium is zero. If we do not assume it is zero, we usually find that the volatility
risk premium is negative, i.e. that investors appear to like to hold volatility. This makes sense, for
the same reason as the variance risk premium that is calculated from the market prices of variance
swaps is usually negative. Investors are usually prepared to accept low or even negative returns for
holding volatility because the negative correlation between prices and volatility makes volatility
a wonderful diversification instrument. Hence, returns on volatility do not need to be high for
volatility to be an attractive asset.
Chapter 3

Machine learning methods

Machine learning (ML) is the study of computer algorithms that can improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning
algorithms build a model based on sample data, known as ”training data”, in order to make predic-
tions or decisions without being explicitly programmed to do so. Machine learning algorithms
are used in a wide variety of applications, such as in medicine, email filtering, speech recogni-
tion, and computer vision, where it is difficult or unfeasible to develop conventional algorithms
to perform the needed tasks.

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The
study of mathematical optimization delivers methods, theory and application domains to the field
of machine learning. Data mining is a related field of study, focusing on exploratory data analysis
through unsupervised learning. Some implementations of machine learning use data and neural
networks in a way that mimics the working of a biological brain. In its application across business
problems, machine learning is also referred to as predictive analytics.

3.1 Artificial neural networks

A feedforward neural network is an artificial neural network in which the connections between the nodes do not form a cycle. The feedforward neural network was the first and simplest type of artificial neural network devised. Back-propagation is the algorithm used to train such networks.


[Figure: schematic of a single artificial neuron. Inputs and a constant bias term are multiplied by synaptic weights, combined in a summing junction Σ, and passed through an activation function φ to produce the output.]

Figure 3.1: Rosenblatt's Perceptron

Feedforward neural networks consist of the following components:

• Input and output layers. The network can also have one or more hidden layers of neurons.

• Each neuron has its activation function. Usually activation functions are non-linear differ-
entiable functions.

• The layers of the network (input, hidden and output) are connected to one another, either fully or partially.
[Figure: a multilayer feedforward network with an input layer, a first and a second hidden layer, and an output layer; the data (input) flow through the hidden layers to the results (output).]

Figure 3.2: Multilayer neural network

Back-Propagation algorithm
Let us consider the following unconstrained optimization problem: find the matrix or vector w that minimizes a function E = E(w), known as the cost function or energy function. We are given a set of m input–target pairs (inp, tar) ∈ Rn × Rk , known in machine learning as the training data set; the first coordinate belongs to the input space and the second to the desired output space. Back-propagation is the most commonly used algorithm to train the neural network that represents the mapping Rn → Rk connecting the two spaces. During the training of the neural network we want to find a set of parameters, called weights w, that solve the minimization problem

E(w) = (1/m) ∑_{i=1}^{m} Ei (w)   (3.1)

Here Ei (w) is the total instantaneous error energy of the i-th training pair, given by Ei (w) = ∑_{j=1}^{k} Ei,j (w), where Ei,j (w) is the instantaneous error energy of the j-th output neuron for the i-th training pair, Ei,j (w) = (1/2) ei,j (w)², and ei,j (w) is the error signal of the j-th output neuron of the network, ei,j (w) = tari,j − outj (inpi , w), where inpi ∈ Rn is the input of the i-th training pair, tari,j is the desired output of the j-th neuron of the network for the i-th training pair, and outj is the output of the j-th neuron.
In each iteration of the algorithm the weights of the network are updated as wjk^{t+1} = wjk^t + ∆wjk^t , with the initial weights wjk^0 small random numbers and j the index of the neuron from which the signal originates; the signal is multiplied by the weight wjk and, after being added to the corresponding other products, enters neuron k of the next layer of the network. We take ∆w^t = ηd^t , with η the learning rate and d^t the search direction of the t-th epoch (the t-th iteration of the algorithm is called an epoch). The direction d^t uses the gradient of the energy function E^t (w), which is calculated with back-propagation as ∂E^t /∂wjk^t = δk outj , where δk is the local gradient given by:
jk



δk = −ek ϕ′(yk )   if neuron k is an output neuron,
δk = (∑_L δL wkL ) ϕ′(yk )   if neuron k is a hidden neuron,   (3.2)

where ϕ is the activation function and yk is the input of neuron k. Algorithms that use global network information, such as the direction of the full weight-update vector, are called global techniques. By contrast, local strategies rely on weight-specific information, such as the behavior of the partial derivative with respect to a given weight. The second category of methods is more closely related to the concept of the neural network as a distributed processor, in which calculations are done independently of each other. In addition, in many applications local strategies achieve faster and more reliable convergence than global methods, despite the fact that they use less information.
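For a single sigmoid output neuron the local gradient (3.2) reduces to δ = −e ϕ′(y), and steepest descent on the cost (3.1) becomes the classic delta rule. The sketch below trains one such neuron on the AND function; the learning rate, epoch count and initial weights are arbitrary illustrative choices.

```python
import math

def train_neuron(data, eta=1.0, epochs=10_000):
    """Steepest-descent training of a single sigmoid neuron on the cost (3.1).
    With one output neuron the back-propagated local gradient reduces to
    delta = -e * phi'(y), where phi'(y) = out * (1 - out) for the sigmoid."""
    phi = lambda y: 1.0 / (1.0 + math.exp(-y))
    w = [0.1, -0.1, 0.1]                    # two input weights and a bias (fixed init)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), tar in data:
            out = phi(w[0] * x1 + w[1] * x2 + w[2])
            e = tar - out                   # error signal
            delta = -e * out * (1.0 - out)  # local gradient of the output neuron
            for j, xj in enumerate((x1, x2, 1.0)):
                grad[j] += delta * xj / len(data)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    predict = lambda x1, x2: phi(w[0] * x1 + w[1] * x2 + w[2])
    return w, predict

if __name__ == "__main__":
    AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w, predict = train_neuron(AND)
    print([round(predict(a, b), 2) for (a, b), _ in AND])
```

After training, the output exceeds 0.5 only for the input (1, 1), i.e. the neuron has learned the AND function.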

Global training techniques


The most commonly used algorithms in this category are those of steepest descent, conjugate
gradient, and the Newton method:

• steepest descent algorithm: dt = −∇E(wt )



• conjugate gradient algorithm: d^t = −∇E(w^t) + β_{t−1} d^{t−1},

  where β_{t−1} = (∇E^t · ∇E^t) / (∇E^{t−1} · ∇E^{t−1}) (Fletcher and Reeves)

• Newton’s method: dt = −H(wt )−1 ∇E(wt ), where ∇E(wt ) is the gradient and H(wt ) is
the Hessian matrix of the vector function E(wt ).
The convergence properties of the previous algorithms depend on the properties of the first and/or second derivative of the function being optimized. For example, the steepest descent and conjugate gradient algorithms determine the search direction based on the first derivative, so their convergence rate depends indirectly on the properties of the second derivative. Accordingly, Newton's method requires both the first derivative and the Hessian matrix to determine the search direction.
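The search directions above can be made concrete on a toy quadratic energy. The sketch below (not the thesis code; the matrix A and vector b are invented for illustration) applies the steepest descent rule d^t = −∇E(w^t), Δw^t = η d^t:

```python
import numpy as np

# Toy quadratic energy E(w) = 0.5 * w^T A w - b^T w, so grad E(w) = A w - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_E(w):
    return A @ w - b

def steepest_descent(w0, eta=0.1, epochs=200):
    w = w0.copy()
    for _ in range(epochs):
        d = -grad_E(w)     # search direction: negative gradient
        w += eta * d       # weight update: w_{t+1} = w_t + eta * d_t
    return w

w_star = np.linalg.solve(A, b)           # exact minimizer for comparison
w_hat = steepest_descent(np.zeros(2))
```

The conjugate gradient variant would add the momentum-like term β_{t−1} d^{t−1} to d, typically reaching the minimum of a quadratic in far fewer epochs.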

Local training techniques


To improve the computational performance of neural network training, two completely different
methods belonging to the local methods category were proposed: Quickprop and Rprop.

Quickprop Algorithm
This method consists of a training algorithm developed by Fahlman and based in part on Newton’s
method. The Quickprop algorithm is widely used to train neural networks. The repeated renewal
of the weights is done on the basis of the estimation of the position of the minimum for each
weight. The weights are updated according to the following relationship:

    Δw_{jk}^t = [ (∂E^t / ∂w_{jk}^t) / (∂E^{t−1} / ∂w_{jk}^{t−1} − ∂E^t / ∂w_{jk}^t) ] Δw_{jk}^{t−1}    (3.3)
The computational cost of training is significantly improved compared to that of global methods.
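A minimal, self-contained illustration of the secant update (3.3) on a single weight (the quadratic error function is invented for illustration; on a quadratic, whose gradient is linear, the Quickprop step lands exactly on the minimum after one bootstrap gradient step):

```python
# Quickprop update (3.3) for one weight, on E(w) = (w - 3)^2 with g(w) = 2*(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w_prev, w = 0.0, 0.5      # bootstrap: first step taken by plain gradient descent
g_prev = grad(w_prev)
for _ in range(20):
    g = grad(w)
    if g == g_prev:       # converged (or flat): the secant step is undefined
        break
    dw = g / (g_prev - g) * (w - w_prev)   # secant step toward the parabola's minimum
    w_prev, g_prev = w, g
    w += dw
```

After the very first Quickprop step, w sits at the minimum w = 3, which is the behavior the estimated-minimum interpretation of (3.3) describes.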

Rprop Algorithm
Another local training algorithm developed by Riedmiller and Braun is the Rprop algorithm (ab-
breviation for Resilient backpropagation). The weights are updated in each iteration by the following relationship:

    Δw_{jk}^t = −η_{jk}^t sgn(∂E^t / ∂w_{jk}^t)    (3.4)
where

    η_{jk}^t = min(a η_{jk}^{t−1}, η_max)   if (∂E^{t−1}/∂w_{jk}^{t−1}) · (∂E^t/∂w_{jk}^t) > 0
    η_{jk}^t = max(β η_{jk}^{t−1}, η_min)   if (∂E^{t−1}/∂w_{jk}^{t−1}) · (∂E^t/∂w_{jk}^t) < 0
    η_{jk}^t = η_{jk}^{t−1}                 otherwise

with a = 1.2, β = 0.5, η_max = 50 and η_min = 0.001. It is noteworthy that, unlike other algorithms,
the Rprop method uses the sign rather than the absolute size of the partial derivatives. Whenever the partial derivative of a weight w_{jk} changes sign, which indicates that the previous change was too large and the algorithm jumped over a local minimum, the coefficient η_{jk} is decreased by the factor β. If the partial derivative retains its sign, the rate η_{jk} increases slightly in order to accelerate convergence. The coefficients of change η_{jk} are bounded to avoid numerical problems.
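The sign-based rule can be sketched for a single weight as follows (a toy error function and starting point, invented for illustration; the constants a, β, η_max, η_min are those quoted in the text):

```python
import numpy as np

# Rprop update (3.4) for one weight on E(w) = (w - 3)^2.
a, beta = 1.2, 0.5
eta_max, eta_min = 50.0, 0.001

def grad(w):
    return 2.0 * (w - 3.0)

w, eta, g_prev = -5.0, 0.1, 0.0
for _ in range(100):
    g = grad(w)
    if g * g_prev > 0:               # same sign: accelerate
        eta = min(a * eta, eta_max)
    elif g * g_prev < 0:             # sign flip: we overshot, back off
        eta = max(beta * eta, eta_min)
    w -= eta * np.sign(g)            # only the sign of the derivative is used
    g_prev = g
```

The step size grows geometrically while the gradient keeps its sign, then is halved on each sign flip, so the iterate oscillates into the minimum.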

3.2 Tree-Based Methods

3.2.1 Background

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model
(like a constant) in each one. They are conceptually simple yet powerful. We first describe a
popular method for tree-based regression and classification called CART, and later contrast it
with C4.5, a major competitor.

Let’s consider a regression problem with continuous response Y and inputs X1 and X2 , each
taking values in the unit interval. The top left panel of Figure (3.4) shows a partition of the feature
space by lines that are parallel to the coordinate axes. In each partition element we can model
Y with a different constant. However, there is a problem: although each partitioning line has a
simple description like X1 = c, some of the resulting regions are complicated to describe.

Figure 3.3

Figure 3.4: Partitions and CART. Top right panel shows a partition of a two-dimensional feature
space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel
shows a general partition that cannot be obtained from recursive binary splitting. Bottom left
panel shows the tree corresponding to the partition in the top right panel, and a perspective plot
of the prediction surface appears in the bottom right panel.

To simplify matters, we restrict attention to recursive binary partitions like that in the top
right panel of Figure (3.4). We first split the space into two regions, and model the response by
the mean of Y in each region. We choose the variable and split-point to achieve the best fit. Then
one or both of these regions are split into two more regions, and this process is continued, until
some stopping rule is applied. For example, in the top right panel of Figure (3.4), we first split at
X1 = t1. Then the region X1 ≤ t1 is split at X2 = t2 and the region X1 > t1 is split at X1 = t3.
Finally, the region X1 > t3 is split at X2 = t4 . The result of this process is a partition into the
five regions R1 , R2 , . . . , R5 shown in the figure. The corresponding regression model predicts Y
with a constant cm in region Rm , that is,

    f̂(X) = ∑_{m=1}^{5} c_m I{(X1, X2) ∈ R_m}.    (3.5)

This same model can be represented by the binary tree in the bottom left panel of Figure (3.4).
The full dataset sits at the top of the tree. Observations satisfying the condition at each junction
are assigned to the left branch, and the others to the right branch. The terminal nodes or leaves
of the tree correspond to the regions R1 , R2 , . . . , R5 . The bottom right panel of Figure (3.4) is a
perspective plot of the regression surface from this model. For illustration, we chose the node
means c1 = 5, c2 = 7, c3 = 0, c4 = 2, c5 = 4 to make this plot.
A key advantage of the recursive binary tree is its interpretability. The feature space partition
is fully described by a single tree. With more than two inputs, partitions like that in the top right
panel of Figure (3.4) are difficult to draw, but the binary tree representation works in the same

way. This representation is also popular among medical scientists, perhaps because it mimics the
way that a doctor thinks. The tree stratifies the population into strata of high and low outcome,
on the basis of patient characteristics.

3.2.2 Regression Trees


We now turn to the question of how to grow a regression tree. Our data consists of p inputs and a
response, for each of N observations: that is, (xi , yi ) for i = 1, 2, . . . , N , with xi = (xi1 , xi2 , . . . ,
xip ). The algorithm needs to automatically decide on the splitting variables and split points, and
also what topology (shape) the tree should have. Suppose first that we have a partition into M
regions R1 , R2 , . . . , RM , and we model the response as a constant cm in each region:

    f̂(x) = ∑_{m=1}^{M} c_m I(x ∈ R_m).    (3.6)

If we adopt as our criterion minimization of the sum of squares ∑(yi − f (xi ))2 , it is easy to
see that the best ĉm is just the average of yi in region Rm :

ĉm = ave(yi ∣xi ∈ Rm ). (3.7)
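The split search behind (3.6) and (3.7) can be written directly in numpy. This is an illustrative sketch on synthetic data (not the thesis code): for every variable and candidate split-point, predict each side by its mean (3.7) and keep the split with the smallest residual sum of squares:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive CART-style search for the (variable, split-point) pair that
    minimizes the total sum of squares when each side is predicted by its mean."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:       # exclude the max so both sides are non-empty
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

# Synthetic data with a true split at X1 <= 0.5 (feature index 0).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] <= 0.5, 1.0, 5.0) + rng.normal(scale=0.1, size=200)
j, s, rss = best_split(X, y)
```

Growing a full tree then amounts to applying the same search recursively to each resulting region until a stopping rule fires.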

3.3 Random forest


ML models generally suffer from three errors:

1. Bias: This error is caused by unrealistic assumptions. When bias is high, the ML algo-
rithm has failed to recognize important relations between features and outcomes. In this
situation, the algorithm is said to be “underfit.”

2. Variance: This error is caused by sensitivity to small changes in the training set. When variance is high, the algorithm has overfit the training set, which is why even minimal changes in the training set can produce wildly different predictions. Rather than modeling the general patterns in the training set, the algorithm has mistaken noise for signal.

3. Noise: This error is caused by the variance of the observed values, like unpredictable changes
or measurement errors. This is the irreducible error, which cannot be explained by any
model.

Consider a training set of observations {x_i}, i = 1, . . . , n, and real-valued outcomes {y_i}, i = 1, . . . , n. Suppose a function f[x] exists, such that y = f[x] + ε, where ε is white noise with E[ε_i] = 0 and E[ε_i²] = σ_ε². We would like to estimate the function f̂[x] that best fits f[x], in the sense of making the variance of the estimation error E[(y_i − f̂[x_i])²] minimal (the mean squared error cannot be zero, because of the noise represented by σ_ε²). This mean-squared error

can be decomposed as

    E[(y_i − f̂[x_i])²] = (E[f̂[x_i]] − f[x_i])² + V[f̂[x_i]] + σ_ε²

where the first term is the squared bias, the second term is the variance, and the third term is the irreducible noise.

An ensemble method is a method that combines a set of weak learners, all based on the same
learning algorithm, in order to create a (stronger) learner that performs better than any of the
individual ones. Ensemble methods help reduce bias and/or variance.

3.3.1 Bootstrap aggregation


Bootstrap aggregation (bagging) is an effective way of reducing the variance in forecasts. It works
as follows: First, generate N training datasets by random sampling with replacement. Second, fit
N estimators, one on each training set. These estimators are fit independently from each other,
hence the models can be fit in parallel. Third, the ensemble forecast is the simple average of the
individual forecasts from the N models. In the case of categorical variables, the probability that
an observation belongs to a class is given by the proportion of estimators that classify that obser-
vation as a member of that class (majority voting). When the base estimator can make forecasts
with a prediction probability, the bagging classifier may derive a mean of the probabilities.
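The three steps can be sketched with simple base learners (a minimal illustration on synthetic data, not the thesis code; here the weak learner is an ordinary least-squares line fitted with numpy):

```python
import numpy as np

# Bootstrap aggregation: fit N linear models on resampled data, average the forecasts.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + rng.normal(scale=0.3, size=100)      # noisy linear data (illustrative)

def bagged_fit_predict(x, y, x_new, n_estimators=50):
    preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(x), size=len(x))   # 1. sample with replacement
        coefs = np.polyfit(x[idx], y[idx], deg=1)    # 2. fit one estimator independently
        preds.append(np.polyval(coefs, x_new))
    return np.mean(preds, axis=0)                    # 3. ensemble = average forecast

y_hat = bagged_fit_predict(x, y, np.array([0.5]))
```

For classification, the final line would be replaced by majority voting (or by averaging predicted class probabilities when the base estimator provides them).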

3.3.2 Variance Reduction


Bagging’s main advantage is that it reduces forecasts’ variance, hence helping address overfitting.
The variance of the bagged prediction (ϕi [c]) is a function of the number of bagged estima-
tors (N ), the average variance of a single estimator’s prediction (σ̄), and the average correlation
among their forecasts (ρ̄):

    V[(1/N) ∑_{i=1}^{N} φ_i[c]] = (1/N²) ∑_{i=1}^{N} ∑_{j=1}^{N} σ_{i,j}
                                = (1/N²) ∑_{i=1}^{N} ( σ_i² + ∑_{j≠i} σ_i σ_j ρ_{i,j} )
                                = (1/N²) ∑_{i=1}^{N} ( σ̄² + (N − 1) σ̄² ρ̄ )                (3.8)
                                = ( σ̄² + (N − 1) σ̄² ρ̄ ) / N
                                = σ̄² ( ρ̄ + (1 − ρ̄)/N )

where σ_{i,j} is the covariance of predictions by estimators i and j; σ̄² is the average variance, defined by ∑_{i=1}^{N} σ_i² = N σ̄²; and ρ̄ is the average correlation, defined by ∑_{i=1}^{N} ∑_{j≠i} σ_i σ_j ρ_{i,j} = σ̄² ρ̄ N(N − 1).
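Equation (3.8) can be checked numerically: simulate N correlated forecasts with equal variances σ² and equal pairwise correlation ρ, and compare the variance of their average with σ̄²(ρ̄ + (1 − ρ̄)/N). The numbers below are invented for the check:

```python
import numpy as np

# N estimators with equal variance sigma^2 and equal pairwise correlation rho.
N, sigma, rho = 10, 2.0, 0.3
cov = sigma**2 * (rho * np.ones((N, N)) + (1.0 - rho) * np.eye(N))

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(N), cov, size=200_000)
empirical = draws.mean(axis=1).var()                 # variance of the bagged forecast
theoretical = sigma**2 * (rho + (1.0 - rho) / N)     # right-hand side of (3.8)
```

The formula also makes the limitation of bagging explicit: as N grows the variance approaches σ̄²ρ̄, so the reduction is bounded by the average correlation among the estimators.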

3.4 Cross validation


One of the purposes of ML is to learn the general structure of the data, so that we can produce
predictions on future, unseen features. When we test an ML algorithm on the same dataset as
was used for training, not surprisingly, we achieve spectacular results. When ML algorithms
are misused that way, they are no different from file lossy-compression algorithms: They can
summarize the data with extreme fidelity, yet with zero forecasting power.
CV splits observations drawn from an IID process into two sets: the training set and the testing set. Each observation in the complete dataset belongs to one, and only one, set. This is done so as to prevent leakage from one set into the other, since that would defeat the purpose
of testing on unseen data. Further details can be found in the books and articles listed in the
references section.
There are many alternative CV schemes, of which one of the most popular is k-fold CV. Figure (3.5) illustrates the k train/test splits carried out by a k-fold CV, where k = 5. In this scheme:

Figure 3.5: Train/test splits in a 5-fold CV scheme

a. The ML algorithm is trained on all subsets excluding subset i.

b. The fitted ML algorithm is tested on subset i.

The outcome from k-fold CV is a k×1 array of cross-validated performance metrics. For example, in a binary classifier, the model is deemed to have learned something if the cross-validated accuracy is over 1/2, since that is the accuracy we would achieve by tossing a fair coin.
In finance, CV is typically used in two settings: model development (like hyperparameter
tuning) and backtesting. Backtesting is a complex subject. In this section, we will focus on CV
for model development.
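The split mechanics can be sketched in a few lines of numpy (illustrative only; in practice a library routine such as scikit-learn's KFold would be used). Each observation lands in exactly one test fold, as the no-leakage requirement above demands:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint test folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(100, 5)
# For fold i: train on all other folds, test on fold i.
splits = [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i]) for i in range(5)]
```

Each `(train, test)` pair in `splits` partitions the full index set, so the k performance metrics are computed on non-overlapping test data.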
Chapter 4

Optimization algorithms

4.1 Genetic algorithm


A genetic algorithm (GA) is a meta-heuristic inspired by the process of natural selection that be-
longs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used
to generate high-quality solutions to optimization and search problems by relying on biologically
inspired operators such as mutation, crossover and selection.
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or
phenotypes) to an optimization problem is evolved toward better solutions. Each candidate so-
lution has a set of properties (its chromosomes or genotype) which can be mutated and altered;
traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are
also possible.
The evolution usually starts from a population of randomly generated individuals and is an
iterative process, with the population in each iteration called a generation. In each generation,
the fitness of every individual in the population is evaluated; the fitness is usually the value of
the objective function in the optimization problem being solved. The more fit individuals are
stochastically selected from the current population, and each individual’s genome is modified
(recombined and possibly randomly mutated) to form a new generation. The new generation of
candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm
terminates when either a maximum number of generations has been produced, or a satisfactory
fitness level has been reached for the population.
A typical genetic algorithm requires:
1. a genetic representation of the solution domain,

2. a fitness function to evaluate the solution domain.


A standard representation of each candidate solution is as an array of bits (also called bit set
or bit string).
This generational process is repeated until a termination condition has been reached. Com-
mon terminating conditions are:
• A solution is found that satisfies minimum criteria


• Fixed number of generations reached

• Allocated budget (computation time/money) reached

• The highest ranking solution’s fitness is reaching or has reached a plateau such that succes-
sive iterations no longer produce better results

• Manual inspection

• Combinations of the above
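A toy GA exercising the operators named above (tournament selection, single-point crossover, bit-flip mutation) can be sketched on the classic "OneMax" fitness, i.e., maximizing the number of 1s in a bit string; the problem and all constants are chosen purely for illustration:

```python
import random

random.seed(0)
N_BITS, POP, GENS = 30, 40, 60

def fitness(ind):                    # objective: number of 1s ("OneMax")
    return sum(ind)

def select(pop):                     # tournament selection: fittest of 3 random picks
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):                 # single-point crossover
    p = random.randrange(1, N_BITS)
    return a[:p] + b[p:]

def mutate(ind, rate=1.0 / N_BITS):  # independent bit-flip mutation
    return [bit ^ 1 if random.random() < rate else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):                # one iteration = one generation
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
best = max(pop, key=fitness)
```

With the seed fixed, the population converges close to the all-ones string well before the fixed generation budget is exhausted.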

4.2 Covariance matrix adaptation evolution

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for
numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for nu-
merical optimization of non-linear or non-convex continuous optimization problems. They be-
long to the class of evolutionary algorithms and evolutionary computation. An evolutionary
algorithm is broadly based on the principle of biological evolution, namely the repeated interplay
of variation (via recombination and mutation) and selection: in each generation (iteration) new
individuals (candidate solutions, denoted as x) are generated by variation, usually in a stochastic
way, of the current parental individuals. Then, some individuals are selected to become the par-
ents in the next generation based on their fitness or objective function value f (x). Like this, over
the generation sequence, individuals with better and better f -values are generated.
In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in Rⁿ. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise
dependencies between the variables in the distribution are represented by a covariance matrix.
The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this
distribution. This is particularly useful if the function f is ill-conditioned.
Adaptation of the covariance matrix amounts to learning a second order model of the un-
derlying objective function similar to the approximation of the inverse Hessian matrix in the
quasi-Newton method in classical optimization. In contrast to most classical methods, fewer as-
sumptions on the nature of the underlying objective function are made. Only the ranking between
candidate solutions is exploited for learning the sample distribution and neither derivatives nor
even the function values themselves are required by the method.

Figure 4.1: Concept of directional optimization in CMA-ES algorithm

4.2.1 Algorithm

In the following the most commonly used (µ/µw, λ)-CMA-ES is outlined, where in each iteration
step a weighted combination of the µ best out of λ new candidate solutions is used to update the
distribution parameters. The main loop consists of three main parts:

1. sampling of new solutions

2. re-ordering of the sampled solutions based on their fitness

3. update of the internal state variables based on the re-ordered samples

A pseudocode of the algorithm looks as follows.



Figure 4.2: Pseudocode of the algorithm
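The three-part loop can be sketched in numpy. This is a heavily simplified illustration, not a faithful CMA-ES implementation: it keeps the sampling, ranking, weighted recombination, and a rank-μ covariance update, but replaces the step-size control and evolution paths with a crude geometric decay of σ; the test function and all constants are invented:

```python
import numpy as np

def cma_es_sketch(f, m0, sigma0, iters=80, lam=20, seed=0):
    """Stripped-down (mu/mu_w, lambda) evolution strategy with a rank-mu
    covariance update (step-size adaptation and evolution paths omitted)."""
    rng = np.random.default_rng(seed)
    n, mu = len(m0), lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                   # recombination weights
    m, sigma, C = np.array(m0, float), sigma0, np.eye(n)
    for _ in range(iters):
        X = rng.multivariate_normal(m, sigma**2 * C, size=lam)  # 1. sample new solutions
        X = X[np.argsort([f(x) for x in X])]                    # 2. rank by fitness
        Z = (X[:mu] - m) / sigma                   # selected steps relative to the old mean
        m = w @ X[:mu]                             # 3a. weighted recombination of the mu best
        C = 0.7 * C + 0.3 * (Z.T * w) @ Z          # 3b. rank-mu covariance update
        C = (C + C.T) / 2.0                        # keep C symmetric for sampling
        sigma *= 0.96                              # crude decay instead of path-based control
    return m

m_best = cma_es_sketch(lambda x: np.sum((x - 1.0) ** 2), [4.0, -3.0], 1.0)
```

Only the ranking of the sampled candidates enters the updates, which mirrors the invariance property of the real algorithm described above.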

4.3 Bayesian optimization

Bayesian optimization is a sequential design strategy for global optimization of black-box func-
tions that does not assume any functional forms. It is usually employed to optimize expensive-
to-evaluate functions.

Bayesian optimization is typically used on problems of the form max_{x∈A} f(x), where A is a set of points whose membership can easily be evaluated. Bayesian optimization is particularly advantageous for problems where f(x) is difficult to evaluate, is a black box with some unknown structure, has fewer than about 20 dimensions, and where derivatives are not evaluated.

Since the objective function is unknown, the Bayesian strategy is to treat it as a random func-
tion and place a prior over it. The prior captures beliefs about the behavior of the function. After
gathering the function evaluations, which are treated as data, the prior is updated to form the
posterior distribution over the objective function. The posterior distribution, in turn, is used to
construct an acquisition function (often also referred to as infill sampling criteria) that determines
the next query point.

Algorithm Basic pseudo-code for Bayesian optimization


Place a Gaussian process prior on f
Observe f at n0 points according to an initial space-filling experimental design. Set n = n0 .
while n ≤ N do
Update the posterior probability distribution on f using all available data
Let xn be a maximizer of the acquisition function over x, where the acquisition function is computed using
the current posterior distribution.
Observe yn = f (xn ).
Increment n
end while
Return a solution: either the point evaluated with the largest f (x), or the point with the largest posterior
mean.

Figure 4.3: Basic pseudo-code for Bayesian optimization
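The pseudocode can be sketched end-to-end with a small hand-rolled Gaussian process in numpy and an upper-confidence-bound acquisition function (a common alternative to expected improvement). The kernel, length scale, test function, and grid-based acquisition maximization are all simplifications invented for this illustration:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and variance at query points Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs)) - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def bayes_opt(f, bounds, n_init=3, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, n_init)             # initial space-filling design
    y = np.array([f(x) for x in X])
    grid = np.linspace(*bounds, 200)
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, grid)       # update the posterior on f
        ucb = mu + kappa * np.sqrt(var)          # acquisition: upper confidence bound
        xn = grid[np.argmax(ucb)]                # next query point
        X, y = np.append(X, xn), np.append(y, f(xn))
    return X[np.argmax(y)], y.max()

xb, yb = bayes_opt(lambda x: -(x - 2.0) ** 2, (0.0, 5.0))
```

The UCB rule trades off exploitation (high posterior mean) against exploration (high posterior uncertainty) through the coefficient kappa.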


Chapter 5

Applications with real options data

5.1 Dataset creation


We have created a dataset using data from Yahoo Finance, which was one of the few sources with public data for options and stock prices. The dataset contains data about option prices, stock prices, and stock info. We have also created new features that contain information about the stock price (minimum, maximum, average, and standard deviation) at 2, 3, 5, 7, 14, 30, 60, 90, 180, 360, 540, and 720 days before the sale of the option. The tables below summarize the statistics of the datasets.

Figure 5.1: Statistics for the data set

Figure 5.2: Statistics for the data set


Our objective is to compare the different methods for option pricing. We split the above dataset into training and test datasets (with an 80%-20% ratio). One approach is to use the stochastic models described earlier (BSM and Heston model). Another approach is to use machine learning models (ANN and RF). We will compare the above methods by using the MSE (mean squared error) on the test dataset:
    MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²    (5.1)
where, yi is the real option price and ŷi the predicted option price.

5.2 Calibration of BSM and Heston model


Calibration of an option pricing model is, in general, a non-convex optimization problem. The function most widely used for the calibration, i.e., the minimization, is the mean-squared error (MSE) of the model option values given the market quotes of the options. Assume there are N relevant options with both model and market quotes. The problem of calibrating a financial model to the market quotes based on the MSE is then given by

    min_p (1/N) ∑_{n=1}^{N} (C_n* − C_n^mod(p))²    (5.2)

where C_n* and C_n^mod are the market price and the model price of the n-th option, respectively, and p is the parameter set provided as input to the option pricing model. We will use the methods of Chapter 4 to solve the above problem, since those algorithms can solve optimization problems with black-box objective functions. In our case we want to minimize the MSE on the training dataset.
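The calibration loop (5.2) can be sketched on a synthetic market (this is illustrative, not the thesis code): "market" quotes are generated from the closed-form BSM call price at σ = 0.2, the model is then treated as a black box, and σ is recovered by a brute-force parameter search standing in for GA / CMA-ES / Bayesian optimization:

```python
import math
import numpy as np

def bs_call(S0, K, T, r, sigma):
    """Black-Scholes-Merton European call price (closed form)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return S0 * N(d1) - K * math.exp(-r * T) * N(d2)

# Synthetic "market" quotes generated with sigma = 0.2 (a stand-in for real data).
S0, T, r = 100.0, 1.0, 0.05
strikes = [80.0, 90.0, 100.0, 110.0, 120.0]
market = [bs_call(S0, K, T, r, 0.2) for K in strikes]

def mse(sigma):
    """Objective (5.2): mean-squared error between model and market prices."""
    return np.mean([(c - bs_call(S0, K, T, r, sigma)) ** 2
                    for c, K in zip(market, strikes)])

# Treat the model as a black box: brute-force search over the parameter grid.
grid = np.arange(0.05, 0.60, 0.001)
sigma_hat = grid[np.argmin([mse(s) for s in grid])]
```

The global optimizers of Chapter 4 replace the grid search here; the objective they are handed is exactly this kind of black-box MSE.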

5.2.1 BSM
Consider now the Black-Scholes-Merton model in its dynamic form, as described by the stochastic differential equation (SDE) in (5.3), where Z_t is a standard Brownian motion. The SDE is called a geometric Brownian motion. The values of S_t are lognormally distributed and the (marginal) returns dS_t/S_t normally distributed.

    dS_t = r S_t dt + σ S_t dZ_t    (5.3)
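Since S_T under (5.3) is lognormal, a Monte Carlo price for a European call can be sketched in a few lines (parameter values invented for illustration; this is not the thesis code):

```python
import numpy as np

# Monte Carlo pricing of a European call under the GBM dynamics (5.3):
# S_T = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z), with Z standard normal.
S0, K, T, r, sigma = 100.0, 100.0, 1.0, 0.05, 0.2
rng = np.random.default_rng(0)
Z = rng.standard_normal(200_000)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
call_mc = np.exp(-r * T) * np.maximum(ST - K, 0.0).mean()   # discounted mean payoff
```

With these parameters the estimate agrees with the closed-form BSM value of roughly 10.45 up to Monte Carlo error.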


For the BSM model we have two parameters, r and σ.

Figure 5.3: Parameters, upper and lower bounds for the BSM model

We have found the following results



Figure 5.4: Calibration results for the BSM model

Figure 5.5: Results for the BSM model calibration with the GA algorithm

Figure 5.6: Convergence for the BSM model calibration with the GA algorithm

Figure 5.7: Results for the BSM model calibration with the CMAES algorithm

Figure 5.8: Convergence for the BSM model calibration with the CMAES algorithm

Figure 5.9: Results for the BSM model calibration with the bayesian optimization algorithm

5.2.2 Heston
One of the major simplifying assumptions of the Black-Scholes-Merton model is the constant
volatility. However, volatility in general is neither constant nor deterministic; it is stochastic.

Therefore, a major advancement with regard to financial modeling was achieved in the early 1990s
with the introduction of so-called stochastic volatility models. One of the most popular models
that fall into that category is that of Heston (1993), which is presented in (5.4)


    dS_t = r S_t dt + √ν_t S_t dZ_t¹
    dν_t = κ_ν (θ_ν − ν_t) dt + σ_ν √ν_t dZ_t²    (5.4)
    dZ_t¹ dZ_t² = ρ dt

The meaning of the single variables and parameters can now be inferred easily from the discussion of the geometric Brownian motion and the square-root diffusion. The parameter ρ represents the instantaneous correlation between the two standard Brownian motions Z_t¹, Z_t². This
allows us to account for a stylized fact called the leverage effect, which in essence states that volatil-
ity goes up in times of stress (declining markets) and goes down in times of a bull market (rising
markets).
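Simulating (5.4) can be sketched with an Euler scheme using "full truncation" (the variance is floored at zero inside the square roots) and correlated Brownian shocks. All parameter values here are invented for illustration; the thesis results use their own calibrated parameters:

```python
import numpy as np

# Euler discretization of the Heston dynamics (5.4) with full truncation.
S0, nu0, r = 100.0, 0.1, 0.05
kappa, theta, sigma_nu, rho = 2.0, 0.04, 0.3, -0.7
T, steps, paths = 5.0, 500, 5000
dt = T / steps

rng = np.random.default_rng(0)
S = np.full(paths, S0)
nu = np.full(paths, nu0)
for _ in range(steps):
    z1 = rng.standard_normal(paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(paths)  # corr(z1, z2) = rho
    nu_plus = np.maximum(nu, 0.0)                  # full truncation: floor variance at 0
    S *= np.exp((r - 0.5 * nu_plus) * dt + np.sqrt(nu_plus * dt) * z1)
    nu += kappa * (theta - nu_plus) * dt + sigma_nu * np.sqrt(nu_plus * dt) * z2
```

The negative ρ implements the leverage effect described above, and the mean reversion term pulls the simulated variance toward its long-run level θ_ν.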
For the Heston model we have six parameters V0 , κV , θV , σV , ρ, r

Figure 5.10: Parameters, upper and lower bounds for the heston model

We have the following results

Figure 5.11: Calibration results for the Heston model



Figure 5.12: Results for the Heston model calibration with the GA algorithm

Figure 5.13: Convergence for the Heston model calibration with the GA algorithm

Figure 5.14: Results for the heston model calibration with the CMAES algorithm

Figure 5.15: Convergence for the heston model calibration with the CMAES algorithm

Figure 5.16: Results for the Heston model calibration with the bayesian optimization algorithm

5.3 Machine Learning

For the machine learning models we can use more features than the three input features of BSM
and Heston model (S0 initial stock price, K strike, T time of maturity in years). For this reason
we have created the following extra features:

Figure 5.17: Data set features

We transform the categorical features ('country', 'sector', 'industry') into numerical ones by using label encoding. Another option would be to use one-hot encoding. We fill the null values with −1 (a negative value which cannot be a value of any feature). We will use grid search with five-fold cross-validation in order to estimate the parameters of the machine learning models which minimize the MSE on the training dataset.

5.3.1 Neural Network


We are using the Keras module with TensorFlow as a back end in order to implement the feed-forward neural network. Before we start the CV grid search we have to scale the data in order to avoid numerical instability. We will use min-max scaling (min-max normalization), the simplest method, which consists of re-scaling the features to the range [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max
of [0, 1] is given as:

    x′ = (x − min(x)) / (max(x) − min(x))

where x is an original value and x′ is the normalized value.
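The formula can be applied column-wise in a couple of lines (an illustrative sketch; in practice the min and max are computed on the training set only and then reused on the test set, as a library scaler would do):

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale features column-wise to the target range [lo, hi]."""
    x = np.asarray(x, dtype=float)
    x01 = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    return lo + x01 * (hi - lo)

# Toy feature matrix with very different column scales (invented values).
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 300.0]])
X_scaled = min_max_scale(X)
```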
We will train a feed-forward NN with one and two hidden layers, using the following optimization algorithms:

• Stochastic gradient descent (SGD)

• Root mean square propagation (RMSprop)

For the above six combinations we will use the following parameters in grid search:

• Number of epochs

• Learning rate

• Drop-out rate

• Number of neurons in the first hidden layer

• Number of neurons in the second hidden layer

• Activation functions

Figure 5.18: Parameters, upper and lower bounds for the NN model

We have the following results.

Figure 5.19: Grid search results for the NN model

5.3.2 Random Forest


We are using the sklearn Python implementation of random forest. In the grid search CV we are trying to optimize the following parameters:

• Max depth

• Number of estimators

• Max samples

• Max features

• Min samples split

Figure 5.20: Parameters, upper and lower bounds for the RF model

We have the below results:

Figure 5.21: Grid search results for the RF model


Results

• From the above we can see that Machine Learning Models have slightly better results.

• For the stochastic models we see that the BSM and Heston models gave roughly equivalent results. We would expect the Heston model to outperform the BSM model; the reason this does not happen is that we are using the same number of iterations for both models.

• We were forced to limit the iterations in this way due to the extremely heavy computational cost of the Heston model, which would normally need three times the number of iterations of the BSM model (since one has two parameters and the other has six).

• Regarding optimization algorithms (GA, CMA-ES and Bayesian optimization) the better
the results the heavier the computational cost.

• Neural Network performs better with extra features and two hidden layers.

• RMSprop outperforms SGD

• Random Forest performed better with three features (stock price, time of maturity, strike)

Random Forest gave the best results of the methods we used.

Appendix A

Python codes

Figure A.1: code yahoo finance


Figure A.2: code data creation 1

Figure A.3: code data creation 2



Figure A.4: code data creation 3



Figure A.5: code bsm

Figure A.6: code bsm GA



Figure A.7: code mc heston

Figure A.8: code heston analytical formula



Figure A.9: code heston GA

Figure A.10: code heston CMAES



Figure A.11: code heston bayes opt

Figure A.12: code data transformations



Figure A.13: code NN grid 1



Figure A.14: code NN grid 2



Figure A.15: code RF grid


References

1. Abate, J., Choudhury, G., and Whitt, W. (1999) An introduction to numerical transform in-
version and its application to probability models in Computational Probability, W.Grassman,
ed., Kluwer Publisher, Boston.

2. Abken, P.A. (2000) An empirical evaluation of value at risk by scenario simulation, Journal of Derivatives 7 (Summer): 12-30.

3. Acworth, P., Broadie, M., and Glasserman, P. (1998) A comparison of some Monte Carlo and quasi-Monte Carlo methods for option pricing, pp. 1-18 in Monte Carlo and Quasi-Monte Carlo Methods 1996, P. Hellekalek, G. Larcher, H. Niederreiter, and P. Zinterhof, eds., Springer-Verlag, Berlin.

4. Ahrens, J.H., and Dieter, U.(1974) Computer methods for sampling from the gamma, beta,
Poisson, and binomial distributions.

5. Akesson, F., and Lehoczky, J. (2000) Path generation for quasi-Monte Carlo simulation of mortgage-backed securities, Management Science.

6. Alexander, C. and Barbosa, A. (2008) Hedging exchange traded funds. Journal of Banking
and Finance 32(2), 326–337.

7. Alexander, C. and Nogueira, L. (2004) Hedging with stochastic and local volatility. ICMA
Discussion Papers in Finance 2004-11.

8. Alexander, C. and Nogueira, L. (2008) Stochastic local volatility. ICMA Discussion Papers
in Finance DP2008-02.

9. Andersen, L., and Broadie, M. (2001) A primal-dual simulation algorithm for pricing multi-dimensional American options, working paper, Columbia Business School, New York.

10. Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, Second Edition, Wiley, New York.

11. Bishop, Christopher M. (1995). Neural networks for pattern recognition. Clarendon Press.


12. Breiman L (2001). ”Random Forests”. Machine Learning. 45 (1): 5–32

13. Carol Alexander – Market Risk Analysis: Pricing, Hedging and Trading Financial Instruments.

14. Cecchetti, S.G., Cumby, R.E. and Figlewski, S. (1988) Estimation of optimal futures hedge.
Review of Economics and Statistics 70.

15. Claus Anderskov Madsen – Topics in Financial Engineering 2013

16. Cox, J., Ingersoll, J. and Ross, S. (1985) A theory of the term structure of interest rates.
Econometrica 53

17. Hansen, N. (2006), ”The CMA evolution strategy: a comparing review”, Towards a new
evolutionary computation. Advances on estimation of distribution algorithms, Springer,
pp. 1769–1776

18. Haykin, Simon S. (1999). Neural networks : a comprehensive foundation. Prentice Hall

19. Jonas Mockus: On Bayesian Methods for Seeking the Extremum. Optimization Techniques
1974: 400-404

20. Paul Glasserman – Monte Carlo Methods in Financial Engineering (Stochastic Modeling and Applied Probability, v. 53), Springer (2003).

21. Rudin, C. and K. L. Wagstaff (2014) “Machine learning for science and society.” Machine
Learning, Vol. 95, No. 1, pp. 1–9.

22. Schmitt, Lothar M. (2001). ”Theory of Genetic Algorithms”. Theoretical Computer Sci-
ence. 259 (1–2): 1–61

23. Shoshani, A. and D. Rotem (2010): “Scientific data management: Challenges, technology,
and deployment.” Chapman & Hall/CRC Computational Science Series. CRC Press.

24. Snir, M. et al. (1998): MPI: The Complete Reference. Volume 1, The MPI-1 Core. MIT
Press.

25. Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009) The Elements of Statistical Learn-
ing, Springer-Verlag
