
DYNAMIC PRICING WITH

APPLICATION TO INSURANCE

A thesis submitted to the University of Manchester


for the degree of Doctor of Philosophy
in the Faculty of Science and Engineering

2020

Yuqing Zhang
Department of Mathematics
Contents

Abstract 8

Declaration 9

Copyright Statement 10

Acknowledgements 11

Nomenclature 12

1 Introduction 15

2 Background material 19
2.1 Notation and terminology . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Example: GLMs in non-life insurance . . . . . . . . . . . . . . 29
2.3 Strong consistency of estimates . . . . . . . . . . . . . . . . . . . . . 31

3 Adaptive pricing in non-life insurance 38


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Generalized linear pricing model . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Model and assumptions . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Estimation of unknown parameters . . . . . . . . . . . . . . . 47

3.4.3 Adaptive GLM pricing policy . . . . . . . . . . . . . . . . . . 49
3.4.4 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Gaussian process pricing model . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 Model and assumptions . . . . . . . . . . . . . . . . . . . . . 54
3.5.2 Adaptive GP pricing policy . . . . . . . . . . . . . . . . . . . 56
3.5.3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 GLM and GP algorithms with unknown delays . . . . . . . . . . . . . 61
3.6.1 Model and assumptions . . . . . . . . . . . . . . . . . . . . . 62
3.6.2 Adaptive GLM pricing with unknown delays . . . . . . . . . . 63
3.6.3 Adaptive GP pricing with unknown delays . . . . . . . . . . . 66
3.7 Numerical examples in insurance . . . . . . . . . . . . . . . . . . . . 66
3.7.1 Adaptive GLM pricing without delays . . . . . . . . . . . . . 66
3.7.2 Adaptive GP pricing without delays . . . . . . . . . . . . . . . 67
3.7.3 Adaptive GLM and GP algorithms with delays . . . . . . . . 68
3.7.4 Comparison of GLM and GP algorithms with and without
delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.8 Conclusions and future directions . . . . . . . . . . . . . . . . . . . . 69
3.9 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4 Perturbed pricing 80
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1.2 Contributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Model and assumptions . . . . . . . . . . . . . . . . . . . . . 87
4.2.2 Perturbed certainty equivalent pricing . . . . . . . . . . . . . 91
4.3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.1 Additional notations . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.2 Key additional results . . . . . . . . . . . . . . . . . . . . . . 94
4.3.3 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . . 95
4.4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.5 Discussion, conclusions and future directions . . . . . . . . . . . . . . 100
4.6 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5 Reinforcement learning in insurance 118


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Markov decision processes . . . . . . . . . . . . . . . . . . . . . . . . 121
5.3.1 Temporal-difference algorithms . . . . . . . . . . . . . . . . . 124
5.3.2 Function approximation . . . . . . . . . . . . . . . . . . . . . 127
5.3.3 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3.4 Deep Q-Network . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.5 Exploration and exploitation trade-off . . . . . . . . . . . . . 129
5.3.6 Cross-entropy method in MDPs . . . . . . . . . . . . . . . . . 130
5.4 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.5 Numerical examples in insurance . . . . . . . . . . . . . . . . . . . . 134
5.5.1 Temporal-difference methods . . . . . . . . . . . . . . . . . . . 135
5.5.2 Cross-entropy method . . . . . . . . . . . . . . . . . . . . . . 138
5.5.3 Deep Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.6 Conclusions and future directions . . . . . . . . . . . . . . . . . . . . 140

List of Tables

2.1 Commonly used link functions . . . . . . . . . . . . . . . . . . . . . . 25

List of Figures

3.1 Price dispersion and convergence of parameter estimates obtained for the GLM pricing policy. . . . . . . . . . . . . . . . . . . . . . . . . 71

3.2 Cumulative regret and convergence rate for the GLM pricing algo-
rithm. GLM denotes the non-delayed case and D-GLM denotes the
delayed case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.3 Cumulative regret and convergence rate for the GP pricing algorithm.
GP denotes the non-delayed case and D-GP denotes the delayed case. 72

4.1 Convergence of parameter estimates and regret with linear demand function. Time period T = 2000 and α_t = t^{-1/4}. Problem parameters used are [p_l, p_h] = [0.5, 2], m = 15, true parameters of the price vector [1, −0.5]^⊤ ∈ R^2, and c_t and its true coefficients drawn i.i.d. from N(0, I) in R^{15}. . . . . . . . . . 100

4.2 Convergence of parameter estimates and regret with logistic regression for demand. Time period T = 2000 and α_t = t^{-1/4}. Problem parameters used are [p_l, p_h] = [0.5, 2], m = 15, true parameters of the price vector [1, −0.5]^⊤ ∈ R^2, and c_t and its true coefficients drawn i.i.d. from N(0, I) in R^{15}. . . . . . . . . . 100

5.1 The agent-environment interaction in an MDP. . . . . . . . . . . . . 122

5.2 A feedforward NN with n input units, m output units, and two hidden
layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.3 Performance of Q-learning, Sarsa and Expected Sarsa. The curves display the T-period regret for different step-sizes α = 0.01, 0.03, 0.1, 0.3, 1.0, where the periods T are 10^6, 10^6 and 5 × 10^5 respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1. . . . . . . . . . 136
5.4 A comparison of Q-learning, Sarsa and Expected Sarsa as a function of α. The three curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1. . . . . . . . . . 137
5.5 Performance of Q-learning, Sarsa and Expected Sarsa for different exploration rates ε = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0. The curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9 and α = 0.5. . . . . . . . . . 137
5.6 A comparison of the T-period regret of Q-learning, Sarsa and Expected Sarsa. The three curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9, ε = 0.1 and α = 0.5. . . . . . . . . . 138
5.7 Pricing Policy. The curve is the average reward for varying prices, and
the green area is the CE convergence zone. The problem parameters
are N = 100, T = 100 and ρ = 10%. . . . . . . . . . . . . . . . . . . . 139
5.8 A comparison of average reward for CE policies with 90%, 95% and
99% percentiles. In all cases, the problem parameters are N = 100
and T = 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.9 Performance of DQN using Keras and Q-learning. The curves display the T-period regret in period 20000, respectively. In both cases, the problem parameters used are constant γ = 0.9, α = 0.5 and a decaying ε. . . . . . . . . . 140

The University of Manchester
Yuqing Zhang
Doctor of Philosophy
Dynamic Pricing with Application to Insurance
December 15, 2020

E-commerce has grown explosively in the last decade and dynamic pricing is one
of the main drivers of this growth. Due to digitization and technology advances,
companies are able to gather information about a product’s features, particularly
in relation to pricing, and then dynamically improve their pricing decisions in or-
der to maximize revenue over time. However, when a company sells a new product
online, there is often little information about the demand. This thesis aims to inves-
tigate dynamic pricing with unknown demand, i.e., how can a company dynamically
learn the impact of prices or other context on product demand, and simultaneously
maximize long-run revenue?
We first focus on the non-life insurance industry. Compared with other financial
businesses, the insurance industry has been relatively slow to adopt new technologies,
and dynamic pricing for insurance has only rarely been consid-
ered before. We consider two adaptive models for demand and claims—a generalized
linear pricing model and a Gaussian process pricing model, based on the work of
den Boer & Zwart (2014b), Srinivas et al. (2012) and Joulani et al. (2016). Here,
neither demand nor claims are known to the company. In the real world, claims are
often delayed: they are only triggered when the insured events happen, and so are
not paid out immediately when an insurance product is purchased. We first show
how these methods can be applied in a simple insurance setting without any delays,
and then we extend them to the setting with delayed claims. Our study shows that
dynamic pricing is potentially applicable to the non-life insurance pricing problem.
We then propose a simple randomized rule for the optimization of prices in
revenue management with contextual information. A popular method—certainty
equivalent pricing—treats parameter estimates as certain and then separately opti-
mizes prices, but this is well-known to be sub-optimal. To overcome this problem,
we advocate a different approach: pricing according to a certainty equivalent rule
with a small random perturbation, and call this perturbed certainty equivalent pric-
ing or perturbed pricing. We show that if the magnitude of the perturbation is
chosen well, our new perturbed pricing performs comparably with the best pricing
strategies. Furthermore, we obtain a new lower bound on the eigenvalues of the design matrix.
Finally, we study the application of reinforcement learning to the insurance pric-
ing problem. Reinforcement learning focuses on learning how to make sequential
decisions in environments with unknown dynamics and it has been successfully ap-
plied to a wide range of problems in many areas. We extend the insurance model
from before to the case where the company pays dividends to shareholders and ruin
probability is involved. Starting by reviewing the basics of reinforcement learn-
ing, we model the insurance pricing problem as a Markov decision process and then apply
reinforcement-learning based techniques to solve the pricing problem. The numeri-
cal simulation shows that reinforcement learning could be a useful tool for solving
pricing problems in the insurance context, where available information is limited.

Declaration
No portion of the work referred to in the thesis has been
submitted in support of an application for another degree
or qualification of this or any other university or other
institute of learning.

Copyright Statement

i. The author of this thesis (including any appendices and/or schedules to this thesis)
owns certain copyright or related rights in it (the “Copyright”) and s/he has given
The University of Manchester certain rights to use such Copyright, including for
administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to time.
This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intel-
lectual property (the “Intellectual Property”) and any reproductions of copyright
works in the thesis, for example graphs and tables (“Reproductions”), which may
be described in this thesis, may not be owned by the author and may be owned
by third parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of the owner(s)
of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487),
in any relevant Thesis restriction declarations deposited in the University Library,
The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's Policy on Presentation of Theses.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my supervisor Neil
Walton, for his continued patience, guidance, advice, and support at every stage of
my doctoral studies. I have greatly benefited from our meetings and conversations,
and the results in this work would never have been obtained without him. I am
thankful to my financial supporters: to China Scholarship Council for my living
stipend and to the University of Manchester for covering my tuition fees.
I am deeply grateful for the support, care and encouragement of my friends in
the Alan Turing building, especially Bindu, Chen, and Davide, who were always
available on my tough days. Next, I wish to thank all those who I had the pleasure
to share offices with. Thank you Clement, Dan, Dalal, Helena, Gian Maria, Michael,
Monica for amazing discussions (on mathematics and everything else), and especially
Tom for reading my thesis draft and giving me helpful advice.
Finally, I would like to thank my family—mum, dad and my little brother—for
standing by me and for being always supportive throughout my life, even though
they don’t really know what I do.

Nomenclature

Acronyms

a.s. Almost surely

CE Cross-entropy

GLM Generalized linear model

GP Gaussian process

i.i.d. Independent and identically distributed

MDP Markov decision process

MLE Maximum likelihood estimator

MQLE Maximum quasi-likelihood estimator

RL Reinforcement learning

TD Temporal-difference

UCB Upper confidence bound

w.r.t. With respect to

Greek Symbols

α Step-size parameter

β, β Single parameter, vector of parameters

Γ Gamma function

γ Discount factor

γT Maximum information gain after T rounds

κ, K Covariance function (or kernel), covariance matrix

λmax , λmin Maximum, minimum eigenvalues

µ Mean function

Ω Lower bound

π Policy

σ Variance

Bν Modified Bessel function

Other Symbols

E Expectation

1 Indicator function

Z+ Set of positive integers

N Set of natural numbers

P Probability measures

R Set of real numbers

R+ Set of positive real numbers

Roman Symbols

A(s) Set of all actions available in state s

C Set of all contexts

F, H Filtration

N Normal (or Gaussian) distribution

R Set of all possible rewards

O Upper bound

Rg(T ) Cumulative regret over time horizon T

S Set of all possible states

Var Variance function

w Vector of weights

a An action

C(p) Claim function of price p

D(p) Demand function of price p

I Shannon mutual information

p, p? , P, P Price, optimal price, price matrix, set of prices

r Revenue

s, s0 States

SD Sum of all delays

Chapter 1

Introduction

Digital technologies are fundamentally changing how businesses operate across all
industries. The rapid growth of information technology and the internet allows
companies to gather information about a product’s features, particularly in relation
to pricing, and respond to the information very quickly and effectively. Quickly
adapting to real-time data from e.g., the Internet of Things (IoT) and social media,
combined with valuable historic data, enables companies to not only improve their
core operations but to launch entirely new products.

The objective of the company is to maximize long-term revenue over time. Pric-
ing correctly is the fastest and most effective way to achieve this objective (McKinsey
& Company, 2003). In a simple world with perfect knowledge, finding the optimal
price, i.e., the price that maximizes revenue, is straightforward. This can be achieved
by algebraic calculation. However, in the real world, companies do not always have
enough information about the underlying demand to make pricing decisions.

This thesis aims to solve the problem of how to optimally set prices that both
maximize long-run revenue and efficiently estimate the distribution of demand. To
address this question, we investigate dynamic pricing; i.e., the study of how demand
responds to prices in a changing environment. Dynamic pricing has been successfully
applied in many industries such as airline ticketing, hotel bookings, car rentals,
and fashion, for more details we refer to the textbooks Phillips (2005) and Talluri
& van Ryzin (2005). Nowadays many other industries have realized the benefits
of dynamic pricing including taxi services, sports complexes and even zoos (The

Wall Street Journal, 2015). Most notably, Amazon changes prices on its products
about every 10 minutes based on customers’ shopping patterns, competitors’ prices,
profit margins, inventory, and other factors (Business Insider, 2018). The success of
Amazon and other online sellers shows the advantages of using dynamic pricing in
e-commerce and even brick-and-mortar retail to optimize pricing and increase
revenues.
Specifically, we consider a monopolist who sells a single new product over a finite
time horizon. The company makes its pricing decisions based on the history of past
prices and demands. We also consider the context, in which the items are sold,
such as who is viewing and what search criteria they used. When the company
launches a new product online, there is often little information available about the
distribution of demand, or the relationship between demand and price, but this
can be learned through observations over time. A trade-off between exploration
(learning) and exploitation (earning) is natural in dynamic pricing decisions. The
company needs to balance exploitation—choosing prices that gave the best reward in
the past—and exploration—choosing prices that potentially yield higher reward in
the future. Online selling enables companies to observe demand for their products in
real time, and dynamically adjust the price of their products. This can help improve
pricing accuracy and efficiency, and make pricing more convenient and transparent.
This thesis investigates how a company in e-commerce can maximize revenue
by applying dynamic pricing. All chapters are self-contained and have detailed
introductions. The literature review in each chapter will provide a detailed survey
of work directly related to this thesis. We now describe the structure and results of
this thesis as follows.

Chapter 2. This chapter provides the necessary background. We give a short


overview of generalized linear models and two commonly used techniques to esti-
mate the unknown parameters of the model—maximum likelihood estimation and
maximum quasi-likelihood estimation. An example of the use of generalized linear
models in non-life insurance pricing is given afterwards. When parameters are un-
known, estimates are normally used instead. However this raises the question of the
consistency of these parameter estimates. We will review several previous
works which established the conditions under which strong consistency is assured in
generalized linear models.

Chapter 3. In this chapter, we consider the dynamic pricing problem in the non-
life insurance setting, based on the work of den Boer & Zwart (2014b), Srinivas
et al. (2012) and Joulani et al. (2016). The non-life insurance product consists of
demand and heavy-tail distributed claims, the functions of which are not known
to the company. We focus on two adaptive approaches—generalized linear models
and Gaussian process regression models. Parameter estimation is conducted by
maximum quasi-likelihood estimation, which is an extension of maximum-likelihood
estimation. In the real world, claims are only triggered when the insured events
happen so are not paid out immediately when an insurance product is purchased.
Therefore, we investigate pricing algorithms both with and without delayed claims.
Our objective is to choose prices that maximize the revenue. The main challenge
here is that the revenue is unknown. Thus we use regret to measure the performance
of policies, where regret is the expected revenue loss relative to the optimal policy.
The objective now is to minimize the regret. We derive asymptotic upper bounds
on the regret that hold for adaptive pricing policies. A simple example shows that
both the GLM and GP pricing polices perform well.

Chapter 4. In this chapter, we consider a company that sells new products on-
line. When a product is sold online, the demand, and thus prices, depend on the
context in which the items are sold, such as who is viewing and what search crite-
ria they used. The objective of the company remains the same: to choose prices that
maximize long-run revenue and efficiently estimate the distribution of demand. We
advocate a different approach: pricing according to a certainty equivalent rule with
a small random perturbation. We call this perturbed certainty equivalent pricing, or
perturbed pricing, for short. Estimation is then conducted according to a standard
maximum likelihood objective.
We show that the convergence of the perturbed certainty equivalent pricing is
optimal up to logarithmic factors. The key advantage of the perturbed certainty
equivalent pricing is its simplicity—we perturb the data not the optimization. If the

magnitude of the perturbation is chosen well, then our results suggest that perturbed
pricing performs comparably with the best pricing strategies. This policy is also
flexible in leveraging contextual information, which is an important best-practice in
many online market places and recommendation systems.

Chapter 5. In this chapter, we analyze the insurance pricing problem in the


reinforcement learning framework. In particular, we extend the model in Chap-
ter 3 to the case where the company pays dividends to shareholders. We start
by reviewing the main concepts of reinforcement learning, including Markov deci-
sion processes—a general problem formulation for reinforcement learning problems.
There are many different reinforcement learning mechanisms but we only consider
temporal-difference methods including the use of neural networks for function ap-
proximation, and cross-entropy search. We then investigate the application of the
these techniques in solving dynamic pricing problems in insurance. Our results show
that reinforcement-learning can be employed by the insurance company to learn op-
timal pricing and therefore maximize their revenue.

Chapter 2

Background material

This chapter serves three main purposes. First, we establish the notation and ter-
minology that will be used in the following chapters. Second, we briefly review the
main concepts that are needed for the work in Chapters 3 and 4. In the last section,
we introduce theoretical results that will be necessary later on and provide relevant
literature for the interested readers.

2.1 Notation and terminology

Sets and orders. The set of natural numbers is denoted by N, the set of positive
integers by Z+ , the set of real numbers by R, and the set of positive real numbers
by R+ . The set of real n-dimensional vectors is denoted by Rn , and the set of real
matrices with m rows and n columns is denoted by Rm×n .

It is often useful to talk about the rate at which some function changes as
its argument grows (or shrinks), without worrying too much about the detailed
form. The notation O(·) denotes “of the same order”, which implies that functions
g(n) and f (n) “grow (or shrink) at the same rate”, and o(·) denotes “ultimately
smaller than (or negligible compared to)”. For example, g(n) = O(f (n)) means
that |g(n)| ≤ c|f (n)| for some constant c for large n, while g(n) = o(f (n)) means
that |g(n)|/|f (n)| → 0 for large n.

Vectors and matrices. A matrix A ∈ R^{m×n} can be written as

\[
A = (a_{ij}) =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix} ,
\]

where a_{ij} (= (A)_{ij}) is the entry of A in the ith row and jth column. We denote by A^⊤ ∈ R^{n×m} the transpose of A, such that (A^⊤)_{ij} = (A)_{ji}. There are several special matrices. For example, a square matrix is called a diagonal matrix if a_{ij} = 0 when i ≠ j; a square matrix is called the identity matrix if a_{ij} = 0 when i ≠ j and a_{ij} = 1 when i = j; a matrix is called a block matrix if its elements are partitioned according to a block pattern and the blocks along the diagonal are square.
A vector norm is a function ‖·‖ : R^n → R which, for all vectors x, y ∈ R^n and α ∈ R, satisfies the following conditions:

1. ‖x‖ ≥ 0, with ‖x‖ = 0 if and only if x = 0.

2. ‖αx‖ = |α| ‖x‖.

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (the triangle inequality).

Examples of vector norms of a vector x ∈ R^n include the 2-norm or Euclidean norm, which is the most relevant to this thesis, defined as

\[
\|x\|_2 = \big( |x_1|^2 + \cdots + |x_n|^2 \big)^{1/2} = \Big( \sum_{i=1}^{n} |x_i|^2 \Big)^{1/2} ,
\]

the 1-norm, defined as

\[
\|x\|_1 = |x_1| + \cdots + |x_n| = \sum_{i=1}^{n} |x_i| ,
\]

and the supremum norm or infinity norm, defined as

\[
\|x\|_\infty = \max_{1 \le i \le n} |x_i| .
\]

A matrix norm is a function ‖·‖ : R^{m×n} → R which, for all A, B ∈ R^{m×n} and α ∈ R, satisfies the following conditions:

1. ‖A‖ ≥ 0, with ‖A‖ = 0 if and only if A = 0.

2. ‖αA‖ = |α| ‖A‖.

3. ‖A + B‖ ≤ ‖A‖ + ‖B‖.

Given a vector norm ‖·‖, the corresponding operator norm (often called the subordinate or induced matrix norm) on R^{m×n} is defined by

\[
\|A\| = \max_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|}{\|x\|} .
\]

Examples of operator norms are

\[
\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}| , \qquad
\|A\|_2 = \sqrt{\rho(A^\top A)} = \sigma_{\max}(A) , \qquad
\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}| ,
\]

where ‖A‖_2 is called the spectral norm, which is the most relevant to this thesis.
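As a quick numerical illustration (a sketch added here, not part of the thesis), the three operator-norm formulas above can be checked against NumPy's built-in norms; the matrix below is arbitrary.

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [0.5,  4.0, -1.0]])

# Column-sum, spectral and row-sum norms computed directly from the formulas above.
norm1   = np.abs(A).sum(axis=0).max()                  # max column sum
norm2   = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())   # largest singular value
norminf = np.abs(A).sum(axis=1).max()                  # max row sum

# They agree with numpy.linalg.norm for ord = 1, 2 and inf.
assert np.isclose(norm1,   np.linalg.norm(A, 1))
assert np.isclose(norm2,   np.linalg.norm(A, 2))
assert np.isclose(norminf, np.linalg.norm(A, np.inf))
```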

For a positive definite matrix A ∈ R^{n×n}, we write ‖x‖_A = (x^⊤ A x)^{1/2} for the corresponding weighted norm.
If A ∈ R^{n×n} and ‖·‖ is the Euclidean norm, the operator norm becomes

\[
\|A\| = \max_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|}{\|x\|}
= \max_{x \in S^{n-1}} \|Ax\|
= \max_{x, y \in S^{n-1}} x^\top A y ,
\]

where S^{n−1} is the unit sphere in R^n.


Let F be a field and let V be a vector space over F. An inner product is a function ⟨·, ·⟩ : V × V → F which, for all x, y, z ∈ V and α ∈ F, satisfies the following conditions:

1. ⟨x, x⟩ ≥ 0, with ⟨x, x⟩ = 0 if and only if x = 0.

2. ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.

3. ⟨αx, y⟩ = α⟨x, y⟩.

4. ⟨x, y⟩ = ⟨y, x⟩.

For example, the Euclidean inner product over V = R^n, which is a function ⟨·, ·⟩ : R^n × R^n → R, is defined by

\[
\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i = x^\top y = y^\top x .
\]

For all x, y ∈ V, by the Cauchy-Schwarz inequality we have

\[
|\langle x, y \rangle| \le \|x\| \cdot \|y\| .
\]

Moreover, equality holds if and only if x and y are linearly dependent. (This latter
statement is sometimes called the “converse of Cauchy-Schwarz.”)

Matrices and eigenvalues. A matrix A ∈ R^{n×n} is symmetric if A^⊤ = A. A symmetric matrix A is positive definite, denoted by A ≻ 0, if x^⊤Ax > 0 for all nonzero x ∈ R^n; it is positive semi-definite, denoted by A ⪰ 0, if x^⊤Ax ≥ 0 for all x ∈ R^n.
Let A ∈ Rn×n . If λ ∈ R and x ∈ Rn \{0} satisfy

Ax = λx ,

then λ is an eigenvalue of A and x is the corresponding eigenvector.


We let λ_min(A) and λ_max(A) denote the minimum and maximum eigenvalues of A. If A ⪰ 0 then

\[
\lambda_{\max}(A) = \|A\| , \qquad \lambda_{\min}(A) = \min_{w : \|w\| = 1} w^\top A w ,
\]

and

\[
\lambda_{\min}(a_1 A_1 + a_2 A_2) \ge a_1 \lambda_{\min}(A_1) + a_2 \lambda_{\min}(A_2) ,
\]

for a_1, a_2 ∈ R_+ and matrices A_1, A_2 ⪰ 0. Further, for a block diagonal matrix,

\[
\lambda_{\min} \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix} \ge \lambda_{\min}(A_1) \wedge \lambda_{\min}(A_2) .
\]

Here we define a ∧ b = min{a, b} and a ∨ b = max{a, b} for a, b ∈ R.
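These eigenvalue inequalities are easy to verify numerically. Below is a small sketch (added for illustration, with randomly generated positive semi-definite matrices; all names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(d):
    """Random symmetric positive semi-definite matrix of size d x d."""
    M = rng.normal(size=(d, d))
    return M @ M.T

def lam_min(A):
    return np.linalg.eigvalsh(A).min()

A1, A2 = random_psd(4), random_psd(4)
a1, a2 = 0.7, 2.5

# Superadditivity of the smallest eigenvalue for PSD matrices.
assert lam_min(a1 * A1 + a2 * A2) >= a1 * lam_min(A1) + a2 * lam_min(A2) - 1e-10

# Block-diagonal case: the smallest eigenvalue is min(lam_min(A1), lam_min(A2)).
B = np.block([[A1, np.zeros((4, 4))],
              [np.zeros((4, 4)), A2]])
assert np.isclose(lam_min(B), min(lam_min(A1), lam_min(A2)))
```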

Schur complements. Let A ∈ R^{n×n}; then A can be written as a 2 × 2 block matrix

\[
A = \begin{pmatrix} B & C \\ D & E \end{pmatrix} ,
\]

where B is a p × p matrix and E is a q × q matrix, with n = p + q (so that C is a p × q matrix and D is a q × p matrix). If we assume that A is symmetric, so that B, E are symmetric and D = C^⊤, then A can be expressed as

\[
A = \begin{pmatrix} B & C \\ C^\top & E \end{pmatrix}
  = \begin{pmatrix} I & C E^{-1} \\ 0 & I \end{pmatrix}
    \begin{pmatrix} B - C E^{-1} C^\top & 0 \\ 0 & E \end{pmatrix}
    \begin{pmatrix} I & C E^{-1} \\ 0 & I \end{pmatrix}^{\!\top} ,
\]

where the block B − CE^{-1}C^⊤ is the Schur complement of E in A.
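The factorization can be checked numerically. The following is a minimal sketch (added here, not from the thesis) that builds a random symmetric matrix, splits it into blocks and verifies the decomposition with NumPy; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3
n = p + q

# Build a symmetric positive definite matrix and split it into blocks.
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)
B, C, E = A[:p, :p], A[:p, p:], A[p:, p:]

# Schur-complement factorization: A = L diag(B - C E^{-1} C^T, E) L^T,
# with L = [[I, C E^{-1}], [0, I]].
Einv = np.linalg.inv(E)
S = B - C @ Einv @ C.T                        # Schur complement of E in A
L = np.block([[np.eye(p), C @ Einv],
              [np.zeros((q, p)), np.eye(q)]])
D = np.block([[S, np.zeros((p, q))],
              [np.zeros((q, p)), E]])

assert np.allclose(A, L @ D @ L.T)            # reconstructs A
```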

Trace and determinant. The trace of A ∈ R^{n×n}, denoted by tr(A), is the sum of its diagonal elements, i.e.,

\[
\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} .
\]

Two key properties are that tr(A) is the sum of the eigenvalues of A and that tr(AB) = tr(BA) for all A ∈ R^{m×n} and B ∈ R^{n×m}.

The determinant of a matrix A ∈ R^{n×n}, denoted by det(A) or |A|, is the product of the eigenvalues of A. Moreover, it can be shown that |αA| = α^n |A| for all α ∈ R, and that the determinant is multiplicative, i.e., for any A, B ∈ R^{n×n}, one has that |AB| = |A||B|.

2.2 Generalized linear models


In this section, we provide a brief introduction to generalized linear models (GLMs).
GLMs were introduced by Nelder & Wedderburn (1972) and are an extension of
classical linear models. For a detailed overview of GLMs, we refer to McCullagh &
Nelder (1989). Compared with classical linear models, there are two distinguishing
features of GLMs. The first is that, in GLMs, the distribution of the response is
chosen from an exponential family. The exponential family of distributions is a wide
class of distributions, and many well-known distributions belong to this family, such

as the Poisson, Normal and Binomial distributions. Thus the distribution of the
response need not be Normal. The second is that GLMs use a function, called
the link function, to connect the linear predictor with the mean of the dependent
variables. The link function can be an identity, a log or a power function. When it
is the identity function, GLMs reduce to classical linear models.
In GLMs, a commonly used technique for finding the parameters of the model is
maximum likelihood estimation. This method is simple, consistent and normally
works well. However, maximum likelihood estimation requires complete infor-
mation, which is not always possible. To address this problem, Wedderburn (1974)
proposed quasi-likelihood estimation, an extension of likelihood estimation for which
only the first two moments of the observations are needed.

2.2.1 Overview

GLMs are popular statistical models that are often used in the framework of an
exponential family. There are three elements in GLMs. Assume y1 , . . . , yn are
realizations of the random variables Y1 , . . . , Yn . The first element is that the response
y_i belongs to the exponential family of distributions if its density function can be written as

\[
f(y_i ; \theta_i, \phi) = \exp\Big( \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \Big) , \tag{2.2.1}
\]

where θ_i is the canonical (or scale) parameter, which changes with i, and φ > 0 is the dispersion parameter. The functions a(·), b(·), c(·) are fixed and known for all i = 1, ..., n, and b(·) is assumed twice continuously differentiable with invertible first derivative. Usually, we have that a(φ) = φ or a(φ) = φ/w_i for a known weight w_i. The mean and variance of y_i are

\[
\mu_i = E[y_i] = b'(\theta_i) , \qquad \mathrm{Var}(y_i) = a(\phi)\, b''(\theta_i) , \tag{2.2.2}
\]

where b′(θ_i), b″(θ_i) are the first and second derivatives of b(θ_i) w.r.t. θ_i. Since µ_i depends on θ_i, we may write the variance as

\[
\mathrm{Var}(y_i) = a(\phi)\, V(\mu_i) ,
\]

where V(·) is called the variance function. This function captures the relationship between the mean and variance of y_i.
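As a concrete illustration (a standard example added here for clarity, not taken from the thesis), consider the Poisson distribution with mean µ_i:

\[
f(y_i ; \mu_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}
= \exp\big( y_i \log \mu_i - \mu_i - \log y_i! \big) ,
\]

so that θ_i = log µ_i, b(θ_i) = e^{θ_i}, a(φ) = 1 and c(y_i, φ) = −log y_i!. Then µ_i = b′(θ_i) = e^{θ_i} and Var(y_i) = b″(θ_i) = µ_i, so the variance function is V(µ_i) = µ_i, and the log link is the canonical link in this case.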

Table 2.1: Commonly used link functions

Distribution | Link function | g(µ_i)              | µ_i = g^{-1}(η_i)
-------------|---------------|---------------------|--------------------------
Normal       | identity      | µ_i                 | η_i
Poisson      | log           | log(µ_i)            | exp(η_i)
Binomial     | logit         | log(µ_i/(1 − µ_i))  | exp(η_i)/(1 + exp(η_i))
Exponential  | log           | log(µ_i)            | exp(η_i)

The second element is the linear regression function or linear predictor. We denote the linear predictor by η_i and write it in the form

\[
\eta_i = x_i^\top \beta = \sum_{j=1}^{p} x_{ij} \beta_j ,
\]

where x_i = (x_{i1}, ..., x_{ip})^⊤ is a p × 1 vector of explanatory variables, the ith column of the design matrix X, and β = (β_1, ..., β_p)^⊤ is a p × 1 vector of unknown parameters.
The third element is the link function, g(µi ), which describes how the mean
response µi is related to the linear predictor ηi , such that

g(µi ) = ηi ,

where g is a monotonic, differentiable function (such as a log or a square root). The link function is often chosen based on knowledge of the response distribution, which makes GLMs flexible. We assume g is invertible, so once x_i^⊤β is determined then so are µ_i and θ_i. Note that the classic linear model is a special case of a GLM in which the response variable is assumed to be normally distributed and the link is the identity function, i.e., g(µ_i) = µ_i.

Some commonly used distributions and link functions are summarized in Table 2.1. Among these, the log and logit link functions are the ones normally used in insurance, because they give a multiplicative model structure and ensure that the predictions are always positive. If the link is such that g(µ_i) = θ_i (equivalently, η_i = θ_i), then g is called the canonical link function. The canonical link is often used because it simplifies the mathematical analysis (e.g. see Wedderburn (1974)).

2.2.2 Parameter estimation

Likelihood estimation

A commonly used technique in GLMs to estimate the regression parameters is the


maximum likelihood estimation method. The idea is to find an estimate for β which
maximizes the likelihood function given the observed data, and this is equivalent to
maximizing the logarithm of the likelihood function. The log-likelihood function, denoted by ℓ(θ_i, φ; y_i), is a function of θ_i, φ and y_i of the form

\[
\ell(\theta_i, \phi ; y_i) = \log f(y_i ; \theta_i, \phi) .
\]

We consider the log-likelihood as a function of β, rather than θ, which is obtained by inverting the relation µ_i = b′(θ_i) and the link function g(µ_i) = η_i = x_i^⊤β. Since the dispersion parameter φ does not affect the maximization of ℓ, we can write

\[
\ell(\beta ; y) = \sum_{i=1}^{n} \Big( \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \Big) .
\]

The following are examples of the log-likelihood functions for some common distributions. Under the normal model, y_i ∼ N(µ_i, σ²), the log-likelihood function is

\[
\ell(\beta ; y) = \sum_{i=1}^{n} \Big( \frac{y_i \mu_i - \mu_i^2/2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2} \log\big( 2\pi\sigma^2 \big) \Big) .
\]

If y_i ∼ Poisson(µ_i), then

\[
\ell(\beta ; y) = \sum_{i=1}^{n} \Big( y_i \log(\mu_i) - \mu_i - \log(y_i!) \Big) .
\]

If y_i ∼ n^{-1} Binomial(n, π), then

\[
\ell(\beta ; y) = \sum_{i=1}^{n} \frac{ y_i \log\big( \frac{\pi}{1-\pi} \big) + \log(1 - \pi) }{ 1/n } + c ,
\]

where c is independent of π.
The maximum likelihood estimators (MLEs) of β, denoted by β̂, are derived by maximizing the log-likelihood function ℓ, i.e.,

\[
\hat{\beta} = \arg\max_{\beta}\ \ell(\beta ; y) .
\]

To find the maximum, we differentiate ℓ(β) w.r.t. each β_j, j = 1, ..., p, and set all of these partial derivatives equal to zero,

\[
\frac{\partial \ell}{\partial \beta_j} = 0 .
\]

Applying the chain rule gives

\[
\frac{\partial \ell}{\partial \beta_j}
= \sum_{i=1}^{n} \frac{\partial \ell}{\partial \theta_i} \cdot \frac{\partial \theta_i}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \beta_j} ,
\]

and together with Var(y_i) = a(φ)V(µ_i) and

\[
\frac{\partial \ell}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)} , \qquad
\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b''(\theta_i)} = \frac{a(\phi)}{\mathrm{Var}(y_i)} , \qquad
\frac{\partial \mu_i}{\partial \eta_i} = \frac{1}{g'(\mu_i)} , \qquad
\frac{\partial \eta_i}{\partial \beta_j} = x_{ij} ,
\]

we obtain

\[
\sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi) V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)}
= \sum_{i=1}^{n} \frac{y_i - g^{-1}(x_i^\top \beta)}{a(\phi)\, V\big( g^{-1}(x_i^\top \beta) \big)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0 ,
\]

since µ_i = g^{-1}(x_i^⊤β). The condition for a maximum is that the matrix of second partial derivatives (the Hessian) is negative definite at the solution, i.e., that

\[
\Big( \frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} \Big) \prec 0 .
\]
Maximum likelihood estimation has several desirable mathematical properties. For
example, the MLE has minimum variance and is asymptotically unbiased as the
sample size increases, and is therefore the most precise estimate possible. The
method is statistically well founded—several popular statistical software packages
support maximum likelihood estimation for many of the commonly used distribu-
tions. Furthermore, MLE is consistent, which means that it can be developed for a
large variety of estimation situations. However, there are also several disadvantages
of MLE. For example, the MLE can be heavily biased for small samples, and thus
the optimality properties may not apply for small samples. In insurance, the sample
size is generally very large and small-sample bias is usually not an issue. In addi-
tion, MLEs are not always in closed form. In this case, parameters can be estimated
numerically using either the Newton–Raphson method or Fisher’s scoring method.
We will not discuss this here; for details we refer readers to Wedderburn (1974).
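To make the Newton–Raphson/Fisher scoring idea concrete, here is a minimal sketch (added here, not code from the thesis) of iteratively reweighted least squares for a Poisson GLM with log link, which solves the score equations above on simulated data; all variable names and the simulated setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated design and Poisson responses with log link: E[y] = exp(X beta).
n, p = 5000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -0.3, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

# Fisher scoring / IRLS: beta <- (X^T W X)^{-1} X^T W z,
# where W = diag(mu) and z = eta + (y - mu) / mu for the Poisson log link.
beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                  # working response
    W = mu                                   # working weights (diagonal)
    XtWX = X.T @ (W[:, None] * X)
    XtWz = X.T @ (W * z)
    beta_new = np.linalg.solve(XtWX, XtWz)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print("estimate:", beta)                     # should be close to beta_true
```

For the canonical (log) link of the Poisson model, Newton–Raphson and Fisher scoring coincide, which is why a single weighted least squares update suffices at each iteration.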

Quasi-likelihood estimation

In order to construct a likelihood function, a specific probability model is usually


required, for example, modeling the number of claims by a Poisson distribution.
Sufficient information to specify a full likelihood equation is sometimes lacking in
statistical inference. Furthermore, the response yi is not always in the exponential
family. To solve the above problems, the concept of quasi-likelihood estimation was
introduced by Wedderburn (1974), and further developed by McCullagh & Nelder
(1989). The key difference is that only the first two moments are required, so that
maximum quasi-likelihood estimation is a much more flexible approach than MLE.
We can express the mean and variance of yi in (2.2.2) in terms of regression
equations of β,

µi = E[yi ] = µi (β) , Var(yi ) = a(φ)V (µi (β)) ,

where µ_i(β) are functions of an unknown p-dimensional parameter vector β. Instead of taking the first derivative of the log-likelihood function w.r.t. β, we take the first derivative w.r.t. the mean function, µ(β), with the following analogous properties:

\[
E\Big[ \frac{\partial \ell(\beta)}{\partial \mu_i} \Big] = 0 , \qquad
-E\Big[ \frac{\partial^2 \ell(\beta)}{\partial \mu_i^2} \Big] = \frac{1}{a(\phi)\, V(\mu_i)} .
\]

Assuming all y_i are of the form (2.2.1), we can then write

\[
\frac{\partial \ell}{\partial \mu_i} = \frac{y_i - \mu_i}{a(\phi)\, V(\mu_i)} ,
\]

which has several properties in common with the log-likelihood derivative. Then we define the quasi-likelihood for each y_i as the integral

\[
q(\mu_i ; y_i) = \int_{y_i}^{\mu_i} \frac{y_i - \nu}{a(\phi)\, V(\nu)}\, d\nu ,
\]

which, if it exists, behaves like a log-likelihood. Since Y_1, ..., Y_n are independent, the quasi-likelihood for Y_1, ..., Y_n is the sum of the individual contributions,

\[
q(\mu ; y) = \sum_{i=1}^{n} q(\mu_i ; y_i) .
\]

The function q(µ; y) is called the quasi-likelihood, or more correctly the log quasi-likelihood. Similarly, the maximum quasi-likelihood estimators (MQLEs) of β, denoted by β̂, are derived by maximizing the quasi-likelihood q. This is equivalent to solving

\[
\frac{\partial q}{\partial \beta_j} = 0 .
\]
Applying the chain rule gives

\[
\sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi)\, V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \beta_j}
= \sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi)\, V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0 ,
\]

where j = 1, ..., p. In the special case where g(µ) is the canonical link function for a GLM, the above equation simplifies to

\[
\sum_{i=1}^{n} \frac{y_i - \mu_i(\beta)}{a(\phi)} \cdot x_{ij} = 0 .
\]

Maximum quasi-likelihood estimation retains many of the properties of maximum likelihood estimation, and has the same asymptotic distribution as the MLE of a GLM (e.g., see Chapter 9 in McCullagh & Nelder (1989)). Instead of specifying a particular distribution belonging to the exponential family, all MQLE requires are the first two moments µ_i(β) and V(µ_i(β)) and the link function (i.e., φ and V(µ) do not need to correspond to an exponential family response). For example, we can obtain quasi-Poisson regression by specifying µ_i(β) = exp(x_i^⊤β) and V(µ_i(β)) = µ_i(β), together with a log-link function.
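For instance (a standard worked example added here for clarity), for the Poisson-type variance function V(µ) = µ with a(φ) = 1, the defining integral can be evaluated explicitly:

\[
q(\mu ; y) = \int_{y}^{\mu} \frac{y - \nu}{\nu}\, d\nu = y \log\frac{\mu}{y} - (\mu - y) ,
\]

which, up to terms not involving µ, coincides with the Poisson log-likelihood y log µ − µ − log y!. This illustrates why quasi-Poisson estimates of β reproduce the Poisson MLE while leaving the dispersion free.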

2.2.3 Example: GLMs in non-life insurance

A non-life insurance product is a promise to make certain payments (i.e., the in-
surance claims) for unpredictable losses under certain conditions (where by law an
element of chance/uncertainty must play a role) to the client—the policyholder—
during a time period (typically one year), in exchange for a certain fee (i.e., the
premium) from the client at the start of the contract.
In non-life insurance, the total claims can be expressed as the claim frequency
times the severity. The claim frequency is the number of claims during a specific time
period (e.g., one year) and the claim severity is the size of individual claims. We can
model the claim frequency and the claim severity separately. A GLM with a Poisson

distribution is usually used to model claim frequency, and a gamma distribution is
usually used to model severity. Maximum likelihood estimation is used to find model
parameters.

Estimation model of claim frequency

The number of claims that occur for insurance policy i during a policy period, denoted by N_i, is often assumed to follow a Poisson distribution with mean λ_i (e.g. see Dionne & Vanasse (1989); de Jong & Heller (2008); Antonio & Valdez (2012); Cameron & Trivedi (2013)), given by N_i ∼ Poisson(λ_i). The claim frequency λ_i is "explained" by a set of observable variables/features, x_i, via a link function of the form

\[
g(\lambda_i) = x_i^\top \beta .
\]

Using a Poisson distribution implies that the mean is equal to the variance, so that E[N_i] = Var(N_i) = λ_i. A popular choice for g(λ_i) is the log-link function, log(λ_i) = x_i^⊤β, which guarantees that λ_i = exp(x_i^⊤β) is positive and gives the model an additive form on the log scale. In this case, the log-likelihood function ℓ is

\[
\ell(\beta) = \sum_{i=1}^{n} \Big( N_i\, x_i^\top \beta - \exp( x_i^\top \beta ) - \log(N_i!) \Big) .
\]

The MLE of β is found by setting the partial derivatives of the log-likelihood function to zero,

\[
\frac{\partial \ell}{\partial \beta_j}
= \sum_{i=1}^{n} x_{ij} \big( N_i - \lambda_i \big)
= \sum_{i=1}^{n} x_{ij} \big( N_i - \exp( x_i^\top \beta ) \big) = 0 ,
\]

for j = 1, ..., p.

Estimation model of claim size

In insurance, the size of claims is non-negative and generally has a long tail to
the right. The gamma distribution is often used to model the claim size, because
gamma random variables are continuous, non-negative and skewed to the right, with
the possibility of large values in the upper tail.
We let N_i be the number of claims on policy i, and c_{i1}, c_{i2}, ..., c_{iN_i} be the individual claim sizes for the N_i observed claims on policy i, for i = 1, ..., n. The individual claim sizes are assumed to be independently gamma distributed, with probability density function

\[
f(c_i) = \frac{1}{\Gamma(\nu)} \Big( \frac{\nu c_i}{\mu_i} \Big)^{\nu} \exp\Big( -\frac{\nu c_i}{\mu_i} \Big) \frac{1}{c_i} ,
\]

with mean E[c_i] = µ_i and variance Var(c_i) = µ_i²/ν. Then the log-likelihood function is

\[
\ell(\beta)
= \log \prod_{i=1}^{n} \prod_{k=1}^{N_i} \frac{1}{\Gamma(\nu)} \Big( \frac{\nu c_{ik}}{\mu_i} \Big)^{\nu} \exp\Big( -\frac{\nu c_{ik}}{\mu_i} \Big) \frac{1}{c_{ik}}
= \sum_{i=1}^{n} \sum_{k=1}^{N_i} \Big( \nu \ln \nu - \ln \Gamma(\nu) + (\nu - 1) \ln c_{ik} - \nu \ln \mu_i - \frac{\nu c_{ik}}{\mu_i} \Big) .
\]

The MLE of β is found by setting the partial derivatives of the log-likelihood to zero,

\[
\frac{\partial \ell(\beta)}{\partial \beta_j}
= \frac{\partial}{\partial \beta_j} \sum_{i=1}^{n} \sum_{k=1}^{N_i} \Big( -\nu \ln \mu_i - \frac{\nu c_{ik}}{\mu_i} \Big) = 0 .
\]

We again choose the log-link function, g(µ_i) = log(µ_i) = x_i^⊤β; then, for j = 1, ..., p,

\[
\frac{\partial \ell(\beta)}{\partial \beta_j}
= \sum_{i=1}^{n} \sum_{k=1}^{N_i} \Big( \frac{\nu c_{ik}}{\mu_i} - \nu \Big) x_{ij} = 0 .
\]

For more details, we refer readers to Chapter 2 in Ohlsson & Johansson (2010) and
Sections 7.2.1 and 7.3.2 in Wüthrich (2017). The typical data consists of claim
information for a number of policies over a number of years, and a number of ex-
planatory variables such as policyholder specific features (e.g. age, gender, etc.),
insured object specific features (e.g. in the case of a car: type, power, age, etc.), and
insurance contract specific features (e.g. claims experience). A natural question is
how much premium the insurer should require for an insurance product. There are a
number of pricing principles for calculating premiums, see Wüthrich (2017, Chapter
6). To avoid ruin, the insurance company needs to charge an expected premium
that exceeds the expected total claims.
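As an illustration of how the two GLMs combine into a premium, the following sketch (added here; not code or data from the thesis) simulates a small portfolio, estimates the frequency and severity regressions by solving the score equations above with SciPy, and prices each policy at its expected total claim (the pure premium). The parameter values and variable names are illustrative assumptions, and a real tariff would add loadings on top of the pure premium.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(7)

# Simulated portfolio: intercept plus one rating factor (e.g. standardized driver age).
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_freq_true = np.array([-1.5, 0.4])     # log claim frequency
beta_sev_true = np.array([6.0, -0.2])      # log expected claim size
nu = 2.0                                   # gamma shape parameter

lam = np.exp(X @ beta_freq_true)
N = rng.poisson(lam)                                        # claim counts
mu = np.exp(X @ beta_sev_true)
# Average claim size per policy with at least one claim (gamma severities).
sev = np.where(N > 0,
               rng.gamma(nu * np.maximum(N, 1), mu / nu) / np.maximum(N, 1),
               0.0)
has_claim = N > 0

# Score equations for the log link, as derived in the text.
def freq_score(b):
    return X.T @ (N - np.exp(X @ b))

def sev_score(b):
    Xc = X[has_claim]
    return Xc.T @ (sev[has_claim] / np.exp(Xc @ b) - 1.0)

beta_freq = root(freq_score, np.array([np.log(N.mean()), 0.0])).x
beta_sev = root(sev_score, np.array([np.log(sev[has_claim].mean()), 0.0])).x

# Pure premium per policy: expected claim count times expected claim size.
pure_premium = np.exp(X @ beta_freq) * np.exp(X @ beta_sev)
print(beta_freq, beta_sev, pure_premium[:3].round(2))
```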

2.3 Strong consistency of estimates


Much attention has been paid to ensuring the consistency of estimates. For example, Fahrmeir & Kaufmann (1985) studied the asymptotic existence, consistency and asymptotic normality of the MLE of the true parameter under some general conditions. A parameter estimate, β̂, is said to be (weakly) consistent if it converges, in probability, to the "true" unknown parameter β_0 as the sample size n → ∞. We say β̂ is strongly consistent if, with probability 1 (or almost surely), it converges to β_0 as the sample size n → ∞.

In the classical linear regression model with i.i.d. errors, a necessary and sufficient condition for weak (Drygas, 1976) and strong (Lai et al., 1979) consistency of the least squares estimator is

\[
\lambda_{\min}\Big( \sum_{i=1}^{n} x_i x_i^\top \Big) \to \infty .
\]

We consider the following regression problem in Lai et al. (1979) and Lai & Wei (1982),

\[
y_n = \beta_1 x_{n1} + \cdots + \beta_p x_{np} + \varepsilon_n ,
\]

for n = 1, 2, .... Here ε_n are unobservable random errors, β_1, ..., β_p are unknown parameters, and y_n is the observed response corresponding to the design levels x_{n1}, ..., x_{np}. Write the inputs as a vector, x_n = (x_{n1}, ..., x_{np})^⊤. We let X_n = (x_{ij} : i = 1, ..., n, j = 1, ..., p) be the matrix of inputs and y_n = (y_1, ..., y_n)^⊤ be the vector of outputs. Further, we let b_n be the least squares estimate of β = (β_1, ..., β_p)^⊤ given X_n and y_n, i.e.,

\[
b_n := (X_n^\top X_n)^{-1} X_n^\top y_n , \qquad \beta = (X_n^\top X_n)^{-1} X_n^\top (y_n - \varepsilon_n) ,
\]

where ε_n = (ε_1, ..., ε_n)^⊤. Assume the errors ε_i are a martingale difference sequence w.r.t. the filtration generated by {x_j, y_{j−1} : j ≤ i}. Let F_i be an increasing filtration such that y_i ∈ F_i and x_i ∈ F_{i−1}; then ε_i is F_i-measurable, E[ε_i | F_{i−1}] = 0, and sup_{i∈N} E[ε_i²] < ∞.

We define a fixed design as one where the design vectors x_i are fixed, non-random p-dimensional vectors, and an adaptive design as one where the x_i are sequentially determined random vectors.
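The quantities in this setup are easy to simulate. The sketch below (added here, with illustrative assumptions: standard normal errors and covariates and a simple adaptive design where the new input depends on the current estimate) computes the least squares estimate b_n and the eigenvalues λ_min(n), λ_max(n) of X_n^⊤X_n that drive the consistency results discussed next.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
beta = np.array([1.0, -0.5, 2.0])

rows, ys = [], []
b = np.zeros(p)
for n in range(1, 2001):
    # Adaptive design: x_n may depend on the current estimate b (F_{n-1}-measurable),
    # here a mild perturbation of b plus fresh noise.
    x = 0.1 * b + rng.normal(size=p)
    y = x @ beta + rng.normal()              # martingale-difference error
    rows.append(x)
    ys.append(y)
    Xn, yn = np.array(rows), np.array(ys)
    if n >= p:
        G = Xn.T @ Xn                        # design matrix X_n^T X_n
        b = np.linalg.solve(G, Xn.T @ yn)    # least squares estimate b_n

eig = np.linalg.eigvalsh(G)
lam_min, lam_max = eig.min(), eig.max()
print("||b_n - beta|| =", np.linalg.norm(b - beta))
print("sqrt(log(lam_max)/lam_min) =", np.sqrt(np.log(lam_max) / lam_min))
```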

Review of related results. Lai et al. (1979) proved that the least squares estimator for linear regression models is strongly consistent under the following conditions:

C1. (X_n^⊤ X_n)^{-1} → 0 a.s. as n → ∞.

C2. Σ_{i=1}^∞ c_i ε_i converges a.s. for any sequence of constants {c_i} satisfying Σ_{i=1}^∞ c_i² < ∞.

Here condition C1 is a minimum requirement for b_n to converge to β, and it is equivalent to

C1'. λ_min(Σ_{i=1}^n x_i x_i^⊤) → ∞ a.s. as n → ∞.

Anderson & Taylor (1976) showed that when εi are i.i.d. with mean zero and variance
σ 2 > 0, condition C1 implies the strong consistency of bn . Drygas (1976) also proved
that condition C1 is a necessary and sufficient condition for weak consistency in a
classical linear regression model with i.i.d. errors.
However, we now want to consider a stochastic setting where we sequentially choose inputs x_i and then observe responses y_i. In this case, condition C1 (or C1') is not sufficient for the strong consistency of b_n. To address this, Lai & Wei (1982) made the slightly stronger assumption that λ_min(Σ_{i=1}^n x_i x_i^⊤) grows to infinity faster than log(λ_max(Σ_{i=1}^n x_i x_i^⊤)). The following result gives a condition on the eigenvalues of the design matrix X_n^⊤ X_n for b_n to converge to β and also gives a rate of convergence.

Theorem 2.3.1. Let λ_min(n) and λ_max(n) be, respectively, the minimum and maximum eigenvalues of the design matrix X_n^⊤ X_n. Under the following conditions:

C3. λ_min(n) → ∞ a.s., and log(λ_max(n))/λ_min(n) → 0 a.s.,

C4. sup_{i∈N} E[|ε_i|^γ] < ∞ a.s. for some γ > 2,

the estimate b_n converges to β a.s., and

\[
\| b_n - \beta \| = O\!\left( \left( \frac{\log(\lambda_{\max}(n))}{\lambda_{\min}(n)} \right)^{1/2} \right) \quad \text{a.s.} \tag{2.3.1}
\]
Proof. We give a brief proof here; for more details see Lai & Wei (1982). To prove this theorem we will require Lemma 2.3.1, a restatement of the Sherman–Morrison formula (stated and proved after the proof). Note that

\[
\| b_n - \beta \|^2
= \big\| (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n \big\|^2
\le \big\| (X_n^\top X_n)^{-1/2} \big\|^2 \big\| (X_n^\top X_n)^{-1/2} X_n^\top \varepsilon_n \big\|^2
= \lambda_{\min}(n)^{-1}\, \varepsilon_n^\top X_n (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n ,
\]

by the Cauchy-Schwarz inequality. Then we define

\[
Q_n = \varepsilon_n^\top X_n (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n .
\]

We bound Q_n using Lemma 2.3.1. Let V_n = (X_n^⊤ X_n)^{-1} = (Σ_{i=1}^n x_i x_i^⊤)^{-1}, and by Lemma 2.3.1,

\[
V_n = (X_n^\top X_n)^{-1}
= (X_{n-1}^\top X_{n-1} + x_n x_n^\top)^{-1}
= V_{n-1} - \frac{V_{n-1} x_n x_n^\top V_{n-1}}{1 + x_n^\top V_{n-1} x_n} .
\]

Now we obtain a recursive form for Q_n. Define N = inf{n : X_n^⊤ X_n is nonsingular} < ∞, and for k > N,

\begin{align*}
Q_k &= \Big(\sum_{i=1}^{k} x_i \varepsilon_i\Big)^{\!\top} V_k \Big(\sum_{i=1}^{k} x_i \varepsilon_i\Big) \\
&= \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big)^{\!\top} V_k \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big)
 + x_k^\top V_k x_k\, \varepsilon_k^2
 + 2\, x_k^\top V_k \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big) \varepsilon_k \\
&= Q_{k-1}
 - \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big)^{\!\top} \frac{V_{k-1} x_k x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big)
 + x_k^\top V_k x_k\, \varepsilon_k^2
 + 2\, \frac{x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big) \varepsilon_k \\
&= Q_{k-1}
 - \frac{\big( x_k^\top V_{k-1} \sum_{i=1}^{k-1} x_i \varepsilon_i \big)^2}{1 + x_k^\top V_{k-1} x_k}
 + x_k^\top V_k x_k\, \varepsilon_k^2
 + 2\, \frac{x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big) \varepsilon_k .
\end{align*}

Let

\[
\gamma_k = \frac{\big( x_k^\top V_{k-1} \sum_{i=1}^{k-1} x_i \varepsilon_i \big)^2}{1 + x_k^\top V_{k-1} x_k} , \qquad
\theta_k = x_k^\top V_k x_k\, \varepsilon_k^2 , \qquad
\omega_{k-1} = 2\, \frac{x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \Big(\sum_{i=1}^{k-1} x_i \varepsilon_i\Big) ,
\]

where γ_k ≥ 0, θ_k ≥ 0, and ω_{k−1} is F_{k−1}-measurable, since x_k and ε_1, ..., ε_{k−1} are all F_{k−1}-measurable. Summing these terms, we have for n > N,

\[
Q_n = Q_N - \sum_{k=N+1}^{n} \gamma_k + \sum_{k=N+1}^{n} \theta_k + \sum_{k=N+1}^{n} \omega_{k-1} \varepsilon_k .
\]

Here, Q_n ≥ 0 is an extended stochastic Lyapunov function if it is F_n-measurable (Lai, 2003). Notice that the final term is a sum of martingale differences and the third term is a sum of quadratic forms. By the strong laws for martingales, for any α > 0,

\[
\max\Big( Q_n ,\ \sum_{k=N+1}^{n} \gamma_k \Big)
= O\bigg( \sum_{k=N+1}^{n} \theta_k + \Big( \sum_{k=N+1}^{n} \omega_{k-1}^2 \Big)^{\frac{1}{2}+\alpha} \bigg) \quad \text{a.s.}
\]

The local martingale convergence theorem and the strong law of large numbers (Chow, 1965) imply that Σ_{k=N+1}^{n} ω_{k−1}² ≤ 4 Σ_{k=N+1}^{n} γ_k. To bound the term Σ_{k=N+1}^{n} θ_k, we require Lemma 2.3.2 (stated and proved after the proof). Then we have that

\[
\sum_{k=N+1}^{n} \theta_k
= \sum_{k=N+1}^{n} x_k^\top V_k x_k\, \varepsilon_k^2
= O\Big( \sum_{k=N+1}^{n} x_k^\top V_k x_k \Big)
= O\big( \log \lambda_{\max}(n) \big) .
\]

Thus, when lim_{n→∞} λ_max(n) = ∞,

\[
Q_n = O\big( \log \lambda_{\max}(n) \big) .
\]

Finally, we obtain the result given in (2.3.1).

Lemma 2.3.1 (Sherman–Morrison Formula). For an invertible matrix A and two vectors u and v,

\[
(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} .
\]

Proof. Recall that the outer product of two vectors, wv^⊤, is the matrix (w_i v_j)_{i=1,j=1}^{n,n}. We first show the following identity:

\[
(I + w v^\top)^{-1} = I - \frac{w v^\top}{1 + v^\top w} .
\]

Multiplying both sides by (I + wv^⊤), the right hand side becomes I and the left hand side is

\begin{align*}
\Big( I - \frac{w v^\top}{1 + v^\top w} \Big) \big( I + w v^\top \big)
&= I + w v^\top - \frac{w v^\top + w v^\top w v^\top}{1 + v^\top w} \\
&= I + w v^\top - \frac{w (1 + v^\top w) v^\top}{1 + v^\top w} \\
&= I .
\end{align*}

We next let u = Aw, so that A + uv^⊤ = A(I + wv^⊤); then

\[
(A + u v^\top)^{-1} = (I + w v^\top)^{-1} A^{-1}
= \Big( I - \frac{w v^\top}{1 + v^\top w} \Big) A^{-1}
= A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} ,
\]

as required.
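A quick numerical sanity check of the formula (a sketch with arbitrary inputs, added here and not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.normal(size=(n, n)) + n * np.eye(n)     # well-conditioned invertible matrix
u, v = rng.normal(size=n), rng.normal(size=n)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)

assert np.allclose(lhs, rhs)
```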
Lemma 2.3.2. If w_1, w_2, ... is a sequence of vectors in R^d and we let A_n = Σ_{k=1}^n w_k w_k^⊤, with s = inf{n : A_n is nonsingular}, then

\[
\sum_{k=s}^{n} w_k^\top A_k^{-1} w_k = O\big( \log \lambda_{\max}(A_n) \big) .
\]

Proof. Write A = B + ww^⊤; then by the Sherman–Morrison formula,

\[
|B| = |A - w w^\top| = |A| \big( 1 - w^\top A^{-1} w \big) ,
\]

where |·| denotes the determinant. Thus

\[
w^\top A^{-1} w = \frac{|A| - |B|}{|A|} .
\]

Applying this identity with A = A_k, B = A_{k−1} and w = w_k, bounding the k = s term by 1, and using the concavity of the logarithm (so that (x − y)/x ≤ log x − log y for x ≥ y > 0) gives

\[
\sum_{k=s}^{n} w_k^\top A_k^{-1} w_k
\le 1 + \sum_{k=s+1}^{n} \frac{|A_k| - |A_{k-1}|}{|A_k|}
\le 1 + \sum_{k=s+1}^{n} \big( \log|A_k| - \log|A_{k-1}| \big)
= 1 + \log|A_n| - \log|A_s| .
\]

Let λ_max(n) and λ_min(n) denote the maximum and minimum eigenvalues of A_n, respectively. Since A_n − A_{n−1} = w_n w_n^⊤ is nonnegative definite, λ_max(n) ≥ λ_max(n−1), λ_min(n) ≥ λ_min(n−1), and A_n is nonsingular for n ≥ s. Furthermore, |A_n| is the product of the eigenvalues of A_n, so for n ≥ s, λ_min(s)^{d−1} λ_max(n) ≤ |A_n| ≤ λ_max(n)^{d}, where d is the dimension of the vectors w_k. So we see that

\[
\sum_{k=s}^{n} w_k^\top A_k^{-1} w_k = O\big( \log |A_n| \big) = O\big( \log \lambda_{\max}(n) \big) .
\]
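The logarithmic growth in Lemma 2.3.2 can also be seen empirically. The following sketch (added here, with the illustrative assumption of i.i.d. Gaussian vectors) compares the partial sums with log λ_max(A_n).

```python
import numpy as np

rng = np.random.default_rng(11)
d = 3
A = np.zeros((d, d))
total, s_found = 0.0, False

for k in range(1, 5001):
    w = rng.normal(size=d)
    A += np.outer(w, w)
    if not s_found and np.linalg.matrix_rank(A) == d:
        s_found = True                        # first index s with A_k nonsingular
    if s_found:
        total += w @ np.linalg.solve(A, w)    # w_k^T A_k^{-1} w_k

lam_max = np.linalg.eigvalsh(A).max()
print("sum of w_k^T A_k^{-1} w_k =", round(total, 2))
print("d * log(lam_max(A_n))    =", round(d * np.log(lam_max), 2))
```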

Under similar conditions to Lai et al. (1979) and Lai & Wei (1982), Chen et al. (1999) studied strong consistency results for the MQLE in GLMs, but they considered GLMs with a canonical link and mean function µ. Assume that in the fixed design case,

\[
E[y_i] = \mu\big( x_i^\top \beta \big) ,
\]

and in the adaptive design case,

\[
E[y_i \mid \mathcal{F}_{i-1}] = \mu\big( x_i^\top \beta \big) .
\]

Define the errors

\[
\varepsilon_i = y_i - \mu\big( x_i^\top \beta \big) ,
\]

where ε_i forms a martingale difference sequence w.r.t. F_i, i = 1, ..., n. As discussed in Section 2.2.2, the MQLE, denoted by b_n, is the solution to

\[
\sum_{i=1}^{n} \big( y_i - \mu( x_i^\top b_n ) \big)\, x_i = 0 .
\]

Chen et al. (1999) derived strong consistency results for GLMs that, respectively, parallel those of Lai et al. (1979) for fixed designs and Lai & Wei (1982) for adaptive designs. However, the proof contains a mistake; see Zhang & Liao (1999).

Chang (1999) studied consistency in GLMs under general link functions with the additional assumption:

C5. λ_min(n) ≥ εn^α, for some α > 1/2 and ε > 0.

Similarly, in the adaptive design,

\[
E[y_i \mid \mathcal{F}_{i-1}] = \mu\big( x_i^\top \beta \big) , \qquad
\mathrm{Var}[y_i \mid \mathcal{F}_{i-1}] = \sigma^2 V(\mu_i) ,
\]

and define the errors

\[
\varepsilon_i = y_i - E[y_i \mid \mathcal{F}_{i-1}] ,
\]

where ε_i forms a martingale difference sequence w.r.t. F_i, i = 1, ..., n. Moreover, the MQLE b_n is the solution to

\[
\sum_{i=1}^{n} \frac{\mu'( x_i^\top b_n )}{\sigma^2 V(\mu_i)} \big( y_i - \mu( x_i^\top b_n ) \big)\, x_i = 0 ,
\]

where µ′(·) is the first derivative of µ(·).


Chang (1999) achieved the same result as that in Lai & Wei (1982) under conditions
C3, C4 and C5. However, there was a flaw in Chang's original proof, which was
later amended by den Boer & Zwart (2014a).

Chapter 3

Adaptive pricing in non-life


insurance

3.1 Introduction

In this chapter, we consider dynamic learning and pricing problems in a non-life
insurance context. An insurance product is an agreement between an insurance
company and the policyholder. The idea is that the policyholders pay a fee, the
premium, at the beginning of the contract, to protect themselves against unforeseeable
random events that might cause (financial) damage, during a certain time period. If
the insured events happen within the insurance period, the company makes certain
payments to cover damages to the policyholders. Such payments caused by insured
events are called insurance claims. Through the insurance contracts, the loss is
transferred from the policyholder to the insurer. The loss of the insurance company is
the sum of a large number of small individual losses, which is much more predictable.
By the law of large numbers, the loss should not be too far from its expected
value, and thus we can say that the premium is based on the expected future costs.
Typically, at the start of the contract, the insurer’s income (premium) is known.
However, the outcome (total claims) depends on future events and is unknown to
the insurer.
Non-life insurance refers to insurance for anything other than loss of life—car
insurance, fire insurance, property insurance, earthquake insurance, accident and

health insurance (Ohlsson & Johansson, 2010; Wüthrich, 2017). As one of the oldest
financial businesses, insurance has been relatively slow to adopt new technologies
because it is traditionally cautious, heavily regulated, relies on legacy systems, and
has low customer engagement (Institute of International Finance, 2016; McKinsey
& Company, 2018; Nasdaq MarketInsite, 2019). With the rise of digitization and
dynamic pricing, the insurance industry is entering a new era, diversifying into
e-commerce and offering insurance products online.

Dynamic pricing is the study of how demand responds to prices in a changing


environment and this has been successfully applied in a variety of industries such
as airline ticketing, hotel bookings, car rentals, and fashion (Talluri & van Ryzin,
2005, Chapter 10). This enables companies to automatically collect up-to-date in-
formation, and then make pricing decisions. Insurance companies are also faced with
the dynamic pricing problem—the company must decide on real-time premiums for
their customers. The company can learn by dynamically adjusting premiums and
thereby maximize long-run revenue. Moreover, online learning techniques can
facilitate claims process automation, which reduces the time and cost to process a
claim. This can help improve pricing accuracy and efficiency, and make insurance
pricing more convenient and transparent. Therefore, it is natural to investigate the
application of dynamic pricing and learning techniques to insurance pricing prob-
lems. The main objective of this chapter is to achieve the maximal revenue for the
company by real-time premium and claims learning and pricing strategies.

We address the insurance pricing problem with demand and heavy-tailed
claims using an adaptive generalized linear model (GLM) and an adaptive
Gaussian process (GP) regression model. Here demand and claims are both un-
known to the company. A GLM is a parametric model with unknown parameters
and it is widely used in insurance. We model demand and total claims by GLMs
and then adaptively apply maximum likelihood estimation to infer the unknown pa-
rameters. This method is inspired by den Boer & Zwart (2014b), who considered
a setting in which the expected demand is a generalized linear function of the price of a single prod-
probability distribution and models random variables with stochastic processes. We

39
sample demand and total claims from GPs and then choose the optimal price that
gives the highest upper bound on its confidence interval. This is based on Srinivas
et al. (2012), who obtained a finite regret bound for GP method. However, in the
real world, claims are only triggered when the insured events happen so are not paid
out immediately when an insurance product is purchased. We consider the delayed
situation as a delayed feedback problem and apply a method from Joulani et al.
(2016), who considered a full information case. Therefore, we investigate pricing
algorithms both with and without delayed claims and our results show that they
both achieve asymptotic upper bounds on regret.

Our first main contribution compared to previous work is that we study the
dynamic pricing problem in an insurance context, where the company observes not
only demand but also claims. Second, our pricing problem involves optimization
and bandit learning of an additive function, since the revenue has two components—
premiums and claims. Third, in the delayed case, we adapt a previous result for the
full information setting to a bandit case. We show that the GLM and GP mechanisms
are both simple, easy to implement and lead to good performance in the insurance
pricing problem. This suggests that online learning is a promising avenue for future
investigation and application in insurance.

The remainder of this chapter is structured as follows. In the next section we


provide a review of literature that is most closely related to this work. Then in Sec-
tion 3.3, we give a brief description of the insurance pricing problem and formulate
the problem. In Section 3.4, we discuss an adaptive generalized linear pricing model
and in Section 3.5, we describe an adaptive Gaussian process model. In Section 3.6,
we extend both models with unknown delayed-claims. In Section 3.7, we present
numerical results via a specific example. Proofs of additional results are deferred to
appendices at the end of this chapter.

3.2 Related literature

In this section, we provide a brief review of insurance pricing, dynamic pricing


and online learning. We also highlight related work on two statistical models—
generalized linear models and Gaussian processes. In addition, we discuss the exist-
ing literature on revenue management with uncertainty and bandit problems with
delayed feedback.

Dynamic pricing with uncertainty. In recent decades, interest in dynamic pric-


ing with demand uncertainty has grown rapidly. Early work by, for example, Gallego
& van Ryzin (1994a) modeled the unknown demand as a Poisson process whose
intensity at each time is determined by the price. However, there is no demand learning
in their work due to the assumption of Poisson arrivals.

One popular approach to addressing demand uncertainty is to consider a parametric
setting, that is, to model the demand as a fixed parametric function of the price
whose parameters are unknown to the company. Bertsekas & Perakis (2006)
considered a linear demand model and applied the least squares approach to estimate
the demand parameters. Besbes & Zeevi (2009) proposed a parametric pricing policy
based on maximum likelihood estimation and derived lower bounds on the regret
for any admissible pricing policy. Broder & Rusmevichientong (2012) constructed
a forced-exploration policy based on a maximum likelihood model and achieved
a lower bound on the worst-case regret. den Boer & Zwart (2014b) proposed a
controlled variance pricing policy, in which they created taboo intervals around the
average of previously chosen prices to ensure sufficient price dispersion. This policy
was the first to treat the parametric model with unknown demand via maximum
quasi-likelihood estimation. They obtained an asymptotic upper bound on the
T-period regret of O(T^{1/2+δ}), where δ > 0 can be chosen arbitrarily small. The survey
by den Boer (2015) provides an excellent overview of dynamic pricing with learning.

Insurance pricing. In insurance, the company makes pricing decisions based on
its own historical data on pricing policies and claims. Here, the premium is the
expected income that the insurance company gains and the claims are the expected
amount that the insurance company pays out. Thus, the wealth of the insurance
company increases with premiums but decreases whenever claims occur.
A popular approach to insurance pricing is generalized linear models (GLMs).
The use of GLMs in actuarial work was first proposed in the 1980s by British ac-
tuaries including Brockman & Wright (1992) and Haberman & Renshaw (1996).
McCullagh & Nelder (1989) applied GLMs to insurance ratemaking, fitting a GLM
to different types of data, including average claim costs for a motor insurance port-
folio and claims frequency for marine insurance. GLMs have now become a well-
established, standard industry practice across a wide range of insurance pricing
applications, from health insurance (de Jong & Heller, 2008) to vehicle insurance
(Ohlsson & Johansson, 2010; Wüthrich & Buser, 2017). We refer to Haberman &
Renshaw (1996) for an overview of applications of GLMs in insurance. An
introduction to the use of GLMs specific to insurance data can be found in de Jong &
Heller (2008), and for a tariff analysis and some useful extensions of standard GLM
theory see Ohlsson & Johansson (2010). Stochastic control theory, which deals with
the more realistic situation of dynamic strategies, is also applied in insurance; for
more details see Schmidli (2008), Asmussen & Albrecher (2010) and Koller (2012).
Machine learning techniques have become increasingly popular in insurance
applications for enhancing and supplementing GLM analysis, and GPs are widely used
in machine learning. Wüthrich & Buser (2017) provided an overview of and insight
into classical methods, including GLMs, and machine learning methods in non-life
insurance pricing.

Multi-armed bandit problems. We model our insurance pricing problem as
a multi-armed bandit problem, which refers to a broad class of sequential decision
making problems. At each time step, based on previous observations, the algorithm
chooses an “arm” amongst a set of arms and receives a reward (or payoff), and
then improves its arm-selection strategy with the new observations (i.e., rewards).
There is a trade-off between exploration—estimating the distribution of rewards of
all arms from past observations—and exploitation—choosing the arm with the highest
expected reward. The upper confidence bound (UCB) algorithm is commonly used to
balance exploration and exploitation: it estimates the mean reward of each arm and a
corresponding confidence interval, and then selects the arm that achieves the highest
upper confidence bound (Lai & Robbins, 1985; Auer et al., 2002). For an overview of
multi-armed bandit problems see Bubeck & Cesa-Bianchi (2012). In the insurance
pricing problem, a learning algorithm sequentially selects prices based on observed
information, while simultaneously adapting its price-selection strategy to maximize
its revenue.

Adaptive generalized linear models. Generalized linear models (GLMs) are


a class of statistical methods and a natural generalization of classical linear models.
GLMs have proven to be a powerful tool for the analysis of non-normal data. There
are many widely used software packages, such as SAS and R, that implement GLMs.
The basics of GLMs were discussed in Section 2.2, and we refer the reader there for
more details.

Quasi-likelihood estimation is a commonly used technique to estimate the
parameters, where only the first and second moments of the responses are required
instead of a full distributional assumption on the responses. However, the strong
consistency of maximum quasi-likelihood estimates in GLMs is a delicate issue; see
the counter-example in ? and the discussion in den Boer (2013). To deal with this
problem, certain conditions are usually imposed; for a more detailed discussion, see Lai
(2003). Specifically, Lai & Wei (1982) generalized such conditions to multiple regression
models with errors given by a martingale difference sequence and proved that a
minimal requirement is that the ratio of the minimum eigenvalue to the logarithm
of the maximum eigenvalue goes to infinity. For GLMs under canonical link
functions, Chen et al. (1999) derived the same strong consistency results under conditions
similar to those of Lai & Wei (1982). For GLMs under general link functions, Chang
(1999) obtained strong consistency under slightly stronger conditions than
those of Lai & Wei (1982).

Bayesian optimization and Gaussian processes. Bayesian optimization is
another efficient approach to global optimization problems, especially when
objective functions are unknown or expensive to evaluate. There are two significant
stages in Bayesian optimization. The first stage is to learn the objective function
from available samples, which typically works by assuming the unknown function is
sampled from a GP. The second stage is to determine the next sampling points; the
UCB criterion is a popular approach for this. The Bayesian optimization problem
can be cast as a multi-armed bandit problem with a continuous set of arms. More
recently, the Gaussian process upper confidence bound (GP-UCB) algorithm was
proposed as a Bayesian optimization method by Srinivas et al. (2012). This is noteworthy
because it achieves sublinear cumulative regret. For a comprehensive overview of
Bayesian optimization and its applications, we refer readers to Brochu et al. (2010).

Online learning problem with delayed feedback. Traditionally, delays are


considered as fixed constants. Under this setting, Chapelle & Li (2012) showed the
influence of delayed feedback for contextual bandits in news article recommendation.
Pike-Burke et al. (2018) discussed the case with delayed, aggregated anonymous
feedback, where the expected delay is known and only the sum of rewards is available
while individual rewards are unknown. More generally, delays can be modeled as
stochastic processes. Agarwal & Duchi (2011) analyzed stochastic gradient-based
optimization algorithms when delays are randomly distributed. Desautels et al. (2014)
studied parallel experiments with a bounded delay between an experiment and its
observation in GP bandit problems. For a systematic study of online learning with
delayed feedback and the effects of delay on regret, see Joulani et al. (2013). Most
notably, they showed that delays additively increase the regret in stochastic problems,
without requiring knowledge of the delay distributions. Later, Joulani et al. (2016)
developed and analyzed delayed online learning in a full information setting, where
they reduced the delayed-feedback problem to standard (i.e., non-delayed) online
learning.

3.3 Problem formulation
We consider an insurance company that sells a new product over a finite selling
horizon T > 0. At the beginning of each time period t = 1, . . . , T , the selling
price pt is determined. We define the set of acceptable prices by P = [pl , ph ], where
0 < pl < ph are the minimum and maximum selling prices.
We assume that dynamic pricing is only associated with past prices. We denote
demand and total claims functions by D(·) and C(·). Given a determined selling
price pt ∈ P at time t, the insurance company observes random functions D(pt ) and
C(pt ) under the chosen price pt . Here C(pt ) denotes the logarithm of total claims.
We treat the price as a design variable, and demand and total claims as random
variables that depend only on the price. Typically, total claims increase as the price
increases, and we assume that demand and total claims respond to price changes
simultaneously.
In insurance, the premium is the expected income that the company earns, and
the claims are the amount that the company pays out. If the selling price is known,
the expected revenue at time t is
\[
  r(p_t) := E\big[\, p_t D(p_t) - C(p_t) \,\big] . \tag{3.3.1}
\]
The goal is to find an optimal pricing policy that generates the highest revenue over
T periods. We use the regret to evaluate the company's pricing policy by comparing
its expected revenue to the best possible expected revenue. Define the optimal price
p⋆ by
\[
  p^\star := \arg\max_{p_t \in \mathcal{P}} r(p_t) . \tag{3.3.2}
\]
We can write the cumulative regret over the time horizon T as
\[
  \mathrm{Rg}(T) := E\Big[ \sum_{t=1}^{T} \big( r(p^\star) - r(p_t) \big) \Big] . \tag{3.3.3}
\]
Notice that the objective of the company now is to minimize the cumulative regret.
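To make the regret bookkeeping concrete, the following minimal Python sketch (not part of the thesis; the helper names are ours) evaluates the cumulative regret (3.3.3) of an arbitrary price sequence against the best price found by a grid search over P = [p_l, p_h]; the expected revenue function is supplied as a placeholder and would come from the models of the later sections.

```python
import numpy as np

def cumulative_regret(expected_revenue, prices, p_low=1.0, p_high=10.0, n_grid=1000):
    """Cumulative regret (3.3.3) of a price sequence against the optimal price.

    expected_revenue: callable p -> r(p), the (here assumed known) expected revenue.
    prices: sequence of prices p_1, ..., p_T chosen by a pricing policy.
    """
    grid = np.linspace(p_low, p_high, n_grid)
    r_star = max(expected_revenue(p) for p in grid)        # r(p*) by grid search, cf. (3.3.2)
    return sum(r_star - expected_revenue(p) for p in prices)

# Example with a toy concave revenue function (illustration only).
toy_revenue = lambda p: p * (11 - 0.8 * p) - (3 + 0.25 * p)
print(cumulative_regret(toy_revenue, [2.0, 5.0, 7.0]))
```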

3.4 Generalized linear pricing model


In this section, we consider the pricing problem in a GLM setting. However, the
expected revenue and the regret cannot be calculated directly because they depend on
unknown parameters. These parameters can be inferred: we apply maximum quasi-
likelihood estimation to estimate the unknown parameters of the model.
We introduce the model and assumptions in Section 3.4.1, and explain how to
estimate the unknown parameters in Section 3.4.2. In Section 3.4.3, we discuss our
pricing algorithm. In Section 3.4.4, we give the main result, Theorem 3.4.1, in this
chapter. Proofs of this section are collected in Appendix A.
We do not require the insurance company to have complete knowledge of the demand
distribution; only the first two moments of demand, as functions of the selling price,
are required. Commonly used demand models include normally, Poisson, negative
binomial, Bernoulli and logistically distributed demand. Once the demand
distribution is chosen, the variance function is determined.
Much attention has been paid to the distribution of claims. The most popular
approach is to assume that the distribution is heavy-tailed (Ramsay, 2003; Albrecher
& Kortschak, 2009). This is because most claims are relatively small, but occasionally
a large claim occurs, which leads to a long right tail. For example, we may assume
that the total claims follow a lognormal distribution. The lognormal distribution
shows a good fit, especially for large claims in insurance, due to its skewness.
Notice that large claims can even cause ruin. A comprehensive introduction to
extremal events can be found in Embrechts et al. (1997).

3.4.1 Model and assumptions

We assume that the company knows the forms of the first two moments of demand
and total claims. In addition, we assume that the demand model at time t is
\[
  E[D(p_t)] = h_1(a_0 + a_1 p_t) , \qquad \operatorname{Var}(D(p_t)) = \sigma_1^2\, v_1\big( E[D(p_t)] \big) .
\]
Similarly, we assume that the expectation and variance of the logarithm of total
claims are
\[
  E[C(p_t)] = h_2(b_0 + b_1 p_t) , \qquad \operatorname{Var}(C(p_t)) = \sigma_2^2\, v_2\big( E[C(p_t)] \big) .
\]

Here the parameters a_0, a_1 and b_0, b_1 are all unknown, but the functions
h_1(·), h_2(·), v_1(·), v_2(·) : R → R are known. Further, h_1(·), h_2(·) are called link
functions. Recall that a link function is called canonical when ḣ(x) = v(h(x)); otherwise
it is called a general link function. The variances of the randomly distributed demand
and log of total claims are given by constants σ_1, σ_2 > 0 and variance functions
v_1(·), v_2(·) of the expected demand and expected log of total claims. Both h(·) and
v(·) are continuously differentiable, with first derivatives denoted by ḣ(·) and v̇(·).
We write a = (a_0, a_1)^⊤ and b = (b_0, b_1)^⊤; then the expected revenue in
(3.3.1) can be expressed as a function of p, a, b,
\[
  r(p_t, a, b) = p_t\, h_1(a_0 + a_1 p_t) - h_2(b_0 + b_1 p_t) .
\]
Moreover, the cumulative regret in (3.3.3) becomes
\[
  \mathrm{Rg}(T) = E\Big[ \sum_{t=1}^{T} \big( r(p^\star, a, b) - r(p_t, a, b) \big) \Big] .
\]
Here the optimal price p⋆ is defined in (3.3.2).
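As an illustration (not from the thesis), the sketch below instantiates this parametric model with identity link functions h_1(x) = h_2(x) = x and Gaussian noise, the setting also used in the numerical example of Section 3.7.1; it evaluates r(p, a, b) and draws noisy demand and log-claim observations at a given price. The function names and the noise levels are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def glm_revenue(p, a, b):
    """Expected revenue r(p, a, b) = p*h1(a0 + a1*p) - h2(b0 + b1*p), identity links assumed."""
    a0, a1 = a
    b0, b1 = b
    return p * (a0 + a1 * p) - (b0 + b1 * p)

def sample_observation(p, a, b, sigma1=0.05, sigma2=0.05):
    """Noisy demand and log-total-claims at price p (Gaussian noise is an assumption here)."""
    demand = a[0] + a[1] * p + sigma1 * rng.standard_normal()
    log_claims = b[0] + b[1] * p + sigma2 * rng.standard_normal()
    return demand, log_claims

# Example usage with the parameters of Section 3.7.1.
d, c = sample_observation(5.0, a=(11.0, -0.8), b=(3.0, 0.25))
print(glm_revenue(5.0, (11.0, -0.8), (3.0, 0.25)), d, c)
```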

3.4.2 Estimation of unknown parameters

The regret cannot be calculated directly because it depends on the unknown
parameters a and b, but these parameters can be estimated. We apply maximum
quasi-likelihood estimation (MQLE), an extension of maximum likelihood estimation.
Maximum likelihood estimation is widely used for GLMs, but it requires a complete
specification of the underlying distributions. In our case, we do not know the
underlying distributions of demand and total claims; the information we have is the
first and second moments of these distributions, which is exactly what MQLE requires.
To simplify notation, we define a parameter matrix β := (a, b), and use β_0 to
denote the true values of the unknown parameters. We also define price vectors
p_i := (1, p_i)^⊤ for i = 1, . . . , t. The maximum quasi-likelihood estimators (MQLEs),
denoted by β̂_t, are solutions to the equation
\[
  \ell_t(\hat\beta_t)
  = \sum_{i=1}^{t} \frac{\dot h(p_i^\top \hat\beta_t)}{\sigma^2\, v\big( h(p_i^\top \hat\beta_t) \big)}\;
    p_i \Big( y_i - h\big( p_i^\top \hat\beta_t \big) \Big) = 0 . \tag{3.4.1}
\]

Let the filtration (F_t)_{t∈N} be generated by {p_i, d_i, c_i : i = 1, . . . , t} for each t.
Write the error terms as η_i = y_i − h(p_i^⊤ β_0). They form a martingale difference
sequence w.r.t. F_t, i.e., η_i is F_i-measurable and E[η_i | F_{i−1}] = 0. We also define
the design matrix
\[
  P(t) = \sum_{i=1}^{t} p_i p_i^\top , \tag{3.4.2}
\]
which is the sum of outer products of the price vectors. Here P(t) is a symmetric
positive definite matrix.


However, the MQLE β̂_t may not converge to the true parameters β_0; we explain
this in more detail in the next section. Assumptions are needed to guarantee the
strong consistency of β̂_t, i.e., to guarantee that β̂_t converges to β_0 with probability
1 as t grows. That is, as t → ∞,
\[
  \hat\beta_t \to \beta_0 \quad \text{a.s.}
\]
Following Theorem 1 of Lai & Wei (1982), we assume

A1. E[η_i | F_{i−1}] = 0 and sup_{i≥1} E[|η_i|^γ | F_{i−1}] < ∞ a.s. for some γ > 2.

A2. lim_{t→∞} λ_min(t)/log λ_max(t) = ∞ a.s., where λ_max(t) and λ_min(t) are the
maximum and minimum eigenvalues of P(t) = \sum_{i=1}^{t} p_i p_i^⊤.

For GLMs with canonical link functions, Chen et al. (1999) proved strong consistency
and convergence of the MQLEs; their proof contains a mistake, see Zhang & Liao
(1999). For GLMs under general link functions, Chang (1999) obtained strong
consistency of the MQLE by using a last-time random variable under the additional
assumption

A3. λ_min(t) ≥ c t^γ a.s. for some c > 0 and γ > 1/2 independent of t.

However, there is also a flaw in this proof, see den Boer & Zwart (2014a). Under
assumptions A1, A2 and A3, den Boer & Zwart (2014a) obtained the mean square
convergence rate, in the case of a canonical link function, as t → ∞,
\[
  E\Big[ \big\| \hat\beta_t - \beta_0 \big\|^2 \Big]
  = O\!\left( \frac{\log \lambda_{\max}(t)}{\lambda_{\min}(t)} \right) \quad \text{a.s.} \tag{3.4.3}
\]
In our case, we consider λ_min(t) ≥ c\sqrt{t \log(t)} with c > 0, so it is sufficient for us
to use (3.4.3) directly.
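For intuition, consider the Gaussian case with identity (canonical) link, where ḣ = v(h) = 1 and the score equation (3.4.1) reduces to the normal equations. The following hedged sketch (an illustration, not the thesis code) computes the resulting estimator β̂_t = P(t)^{-1} Σ_i p_i y_i; it would be applied separately to the demand and log-claims observations.

```python
import numpy as np

def mqle_identity_link(prices, responses):
    """MQLE for a GLM with identity link and Gaussian noise.

    In this special case the score equation (3.4.1) becomes
    sum_i p_i (y_i - p_i^T beta) = 0, i.e. the normal equations.
    prices: array of shape (t,) of chosen prices p_1, ..., p_t.
    responses: array of shape (t,) of observations (demand or log total claims).
    """
    P = np.column_stack([np.ones_like(prices), prices])    # rows are p_i^T = (1, p_i)
    design = P.T @ P                                        # P(t) as in (3.4.2)
    return np.linalg.solve(design, P.T @ responses)         # beta_hat = P(t)^{-1} sum_i p_i y_i

# Example: recover (a0, a1) from noisy linear demand observations.
rng = np.random.default_rng(1)
p = rng.uniform(1, 10, size=200)
d = 11 - 0.8 * p + 0.05 * rng.standard_normal(200)
print(mqle_identity_link(p, d))    # approximately (11, -0.8)
```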

3.4.3 Adaptive GLM pricing policy

A popular pricing policy is certainty equivalent pricing, which was first applied
in a simple linear regression model (Anderson & Taylor, 1976) and later developed
in a more general multiple regression model (Anderson & Taylor, 1979). We define
the certainty equivalent price, denoted by p_{ce}(β), as
\[
  p_{ce}(\beta) = \arg\max_{p \in \mathcal{P}} r(p, \beta) ,
\]
where r(p, β) = r(p, a, b).


Certainty equivalent pricing is popular, essentially because it separates the sta-
tistical problem from the problem of optimizing revenue and reward. This policy
is efficient provided that current parameter estimates are correct. Given MQLE β̂t
(3.4.1), which is our current estimate of β0 , we maintain the update rule

pt = pce (β̂t ) .

However, it is possible that parameter estimates converge to incorrect values because
the price sequence settles down too quickly and generates too little information.
Lai & Robbins (1982) showed that such inconsistency may occur for a linear demand
function with constant variance under an iterative least-squares policy. To solve this
problem, Lai & Wei (1982) proposed the condition that the minimum eigenvalue of a
design matrix derived from the problem tends to infinity, and proved that it is sufficient
for strong consistency of the estimators. den Boer & Zwart (2014b) further relaxed
the assumptions on the values of the prices and proposed a variant of certainty
equivalent pricing, which they called controlled variance pricing. By creating taboo
intervals around the mean of the prices chosen so far, they choose the next price
outside these intervals to obtain more information.
We propose a dynamic pricing policy with an additional requirement on price
dispersion. Sufficient price dispersion guarantees enough information for the strong
consistency of the parameter estimates. Since the variation in p_1, . . . , p_t is controlled
by λ_min(t), setting a lower bound on λ_min(t) gives sufficient price dispersion and
in turn guarantees the strong convergence of the MQLE β̂_t to β_0. However, there is
no simple explicit relation between λ_min(t) and λ_min(t+1). Recall that the design
matrix P(t) defined in (3.4.2) is
\[
  P(t) = \sum_{i=1}^{t} p_i p_i^\top .
\]
Since P(t) is a positive definite matrix, we have
\[
  \operatorname{tr}\big( P(t)^{-1} \big)^{-1} \le \lambda_{\min}(P(t)) \le n \operatorname{tr}\big( P(t)^{-1} \big)^{-1} ,
\]
where tr(P(t)) denotes the trace of the design matrix. Let L be a class of positive,
differentiable, monotone increasing functions. Choose a function L(t) ∈ L such that
L(t) → ∞ and t ↦ 1/L(t) is convex. If we require tr(P(t)^{-1})^{-1} ≥ L(t) for all t ≥ 2,
then
\[
  \lambda_{\min}(t) \ge L(t) .
\]
A proper choice of the function L(t) ensures sufficient price dispersion, and therefore
guarantees the convergence of the parameter estimates to the true parameters, as well
as the asymptotic convergence of our price sequence to the optimal price.
We consider a general case here. Based on the work of den Boer & Zwart (2014b),
for t > n + 1 we let v_1, . . . , v_{n+1} be the unit eigenvectors of P(t), which form an
orthonormal basis of R^{n+1}. The price vector under the current estimate β̂_t can be
written as a linear combination of these unit eigenvectors, p(β̂_t) = \sum_{i=1}^{n+1} α_i v_i.

Pseudocode for the GLM pricing algorithm is given in Algorithm 1. The pricing step
works in two stages. At the very beginning, we choose linearly independent initial
price vectors and calculate the estimate β̂_t by (3.4.1). As stated in condition (I), if
the MQLE β̂_t does not exist or the constraint tr(P(t)^{-1})^{-1} ≥ L(t) is not met, we
repeat prices that have already been chosen until the MQLE β̂_t exists and the price
dispersion requirement is satisfied; the condition tr(P(t+j)^{-1})^{-1} ≥ L(t+j) will
eventually be met because j is always finite. As described in condition (II), if the
MQLE β̂_t exists and the constraint tr(P(t)^{-1})^{-1} ≥ L(t) is satisfied, we let
p_{ce} = p(β̂_t) and set the next price p_{t+1} = p_{ce}. We then check the requirement in
condition (II); if it does not hold, we choose the next price in the direction of v_2,
which has the form p_{t+1} = p(β̂_t) + φ_t with
\[
  \phi_t := k \sqrt{\dot L(t)}\, \big( v_{2,1}\, p(\hat\beta_t) - v_2 \big) , \tag{3.4.4}
\]

Algorithm 1: GLM Pricing Algorithm
Initialisation:
  Choose L ∈ L.
  Choose linearly independent initial price vectors p(1), p(2) in P.
For all t ≥ 3:
  Estimation:
    Calculate β̂_t using (3.4.1).
  Pricing:
    (I) If β̂_t does not exist or tr(P(t)^{-1})^{-1} < L(t), then set p_{t+1} = p_1, . . . ,
        p_{t+j} = p_j, where j is the smallest integer satisfying
        tr(P(t+j)^{-1})^{-1} ≥ L(t+j).
    (II) If β̂_t exists and tr(P(t)^{-1})^{-1} ≥ L(t), then set p_{t+1} = p_{ce}, where
        p_{ce} = p(β̂_t), and check whether
          tr( (P(t) + p_{t+1} p_{t+1}^⊤)^{-1} )^{-1} ≥ L(t+1) .
        If this does not hold, instead set p_{t+1} = p(β̂_t) + φ_t, where φ_t is defined
        in (3.4.4). Here we can choose ‖φ_t‖² = L̇(t) (1 + max_{p∈P} ‖p‖²) so that the
        above requirement is satisfied.

where v_{2,1} is the first component of v_2 and there exists a constant k > 0 such that
k\sqrt{\dot L(t)} ≤ 1.
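The following minimal sketch (an illustration of the pricing step of Algorithm 1, simplified to the two-dimensional canonical-link setting and not the thesis implementation) checks the dispersion condition via the trace criterion and, if the check fails, perturbs the price vector in the direction of the eigenvector belonging to the smallest eigenvalue, as in (3.4.4). The constant k and the arguments are our own illustrative choices.

```python
import numpy as np

def next_price_vector(P_design, p_ce_vec, L_next, Ldot, k=0.1):
    """One pricing step: dispersion check plus perturbation (3.4.4) if needed.

    P_design: current design matrix P(t) = sum_i p_i p_i^T (2x2 here).
    p_ce_vec: certainty equivalent price vector (1, p_ce)^T from the current MQLE.
    L_next:   required lower bound L(t+1); Ldot: derivative of L at t.
    """
    def disp(P):
        # tr(P^{-1})^{-1}, the algorithm's surrogate lower bound on lambda_min
        return 1.0 / np.trace(np.linalg.inv(P))

    candidate = p_ce_vec
    if disp(P_design + np.outer(candidate, candidate)) < L_next:
        # dispersion would be violated: perturb along the eigenvector of the
        # smallest eigenvalue of P(t), as in (3.4.4)
        eigvals, eigvecs = np.linalg.eigh(P_design)
        v2 = eigvecs[:, 0]                       # eigh sorts eigenvalues ascending
        phi = k * np.sqrt(Ldot) * (v2[0] * p_ce_vec - v2)
        candidate = p_ce_vec + phi
    return candidate
```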
The following proposition guarantees that when prices are chosen the price dis-
persion condition (3.4.5) is satisfied.

Proposition 3.4.1. Assume the MQLE β̂_t exists and tr(P(t)^{-1})^{-1} ≥ L(t). If we set
the next price to be p_{t+1} = p_{ce} + φ_t, where φ_t is defined in (3.4.4), then for t ≥ 3,
\[
  \operatorname{tr}\Big( \big( P(t) + p_{t+1} p_{t+1}^\top \big)^{-1} \Big)^{-1} \ge L(t+1) . \tag{3.4.5}
\]

3.4.4 Main results

Our main result in the GLM setting is shown in Theorem 3.4.1, proof of which is
also provided in this section. We state below the result for the canonical case under
consideration.

Theorem 3.4.1. Suppose there exists t_0 ∈ N and choose L(t) ∈ L for all t ≥ t_0 such
that the MQLE β̂_t is strongly consistent. If the following conditions are satisfied:

(i) λ_min(t) ≥ L(t) a.s., for all t ≥ t_0,

(ii) \sum_{t=1}^{T} \| p_t - p(\hat\beta_{t-1}) \|^2 \le k_0 L(T) a.s., for all T ≥ t_0 and some k_0 > 0,

then the regret over the time horizon T is
\[
  \mathrm{Rg}(T) = O\!\left( L(T) + \sum_{t=1}^{T} \frac{\log(t)}{L(t)} \right) .
\]
Choosing L(t) = c\sqrt{t \log(t)} for some c > 0 gives
\[
  \mathrm{Rg}(T) = O\big( \sqrt{T \log(T)} \big) .
\]

We thus obtain an upper bound on the regret of order O(\sqrt{T \log(T)}) when
L(t) = c\sqrt{t \log(t)} under canonical link functions. den Boer & Zwart (2014b) and
Keskin & Zeevi (2014) provided the same order of upper bound on the regret.
To prove Theorem 3.4.1, we require Proposition 3.4.2, which shows that studying
the regret is equivalent to studying the sum of squared errors of the estimated
parameters. We write the price vector as a function in terms of β, denoted by p(β),
and use p for short.

Proposition 3.4.2. Assume that there is an open, bounded neighborhood V ⊂ R^{2×3}
around the true value β_0 such that, for all β ∈ V, there is a unique optimal price
p⋆ that maximizes the revenue function r(p, β). Given p(β_0) ∈ P, ∂r(p⋆, β_0)/∂p = 0
and ∂²r(p, β_0)/∂p² < 0, we have
\[
  \big| r(p, \beta_0) - r(p^\star, \beta_0) \big| = O\big( \| p - p^\star \|^2 \big) .
\]
Further, if we assume there exists t_0 ∈ N such that β̂_t ∈ V for all t ≥ t_0, then
\[
  \big\| p(\hat\beta_t) - p(\beta_0) \big\|^2 = O\Big( \big\| \hat\beta_t - \beta_0 \big\|^2 \Big) .
\]

Proposition 3.4.2 shows that the cumulative regret is of the same order as the
expected sum of the squared price deviations ‖p − p⋆‖², which in turn is of the same
order as the expected sum of the squared differences between the estimated and true
parameters, ‖β̂_t − β_0‖².
We focus on the simple case where the link functions are canonical, i.e.,
ḣ(·) = v(h(·)); then (3.4.1) becomes
\[
  \ell_t(\hat\beta_t) = \sum_{i=1}^{t} \frac{1}{\sigma^2}\, p_i \Big( y_i - h\big( p_i^\top \hat\beta_t \big) \Big) = 0 .
\]

In our case, we consider the GLM pricing problem under assumptions A1, A2 and
A3, so we can apply (3.4.3) directly. Recall that
\[
  E\Big[ \big\| \hat\beta_t - \beta_0 \big\|^2 \Big]
  = O\!\left( \frac{\log \lambda_{\max}(t)}{\lambda_{\min}(t)} \right) \quad \text{a.s.}
\]
The necessary and sufficient condition for the consistency of β̂_t is λ_min(t) → ∞ as
t → ∞, since log λ_max(t) = O(log(t)).

Proof of Theorem 3.4.1. By Proposition 3.4.2, the regret over the selling horizon T
can be expressed in terms of ‖β̂_{t−1} − β_0‖²: there exists some k_1 such that
\[
  E\big[ | r(p_t, \beta_0) - r(p^\star, \beta_0) | \big]
  \le E\big[ k_1 \| p_t - p^\star \|^2 \big]
  \le E\Big[ 2 k_1 \big\| p_t - p(\hat\beta_{t-1}) \big\|^2 \Big]
   + E\Big[ 2 k_1 \big\| p(\hat\beta_{t-1}) - p^\star \big\|^2 \Big] , \tag{3.4.6}
\]
for all t ≥ t_0. The first inequality is an immediate consequence of Proposition 3.4.2.
The second inequality follows from the fact that (a+b)² ≤ 2a² + 2b² for all a, b > 0.
Summing over t = t_0, . . . , T, condition (ii) gives the upper bound
\[
  2 E\Big[ \sum_{t=t_0}^{T} \big\| p_t - p(\hat\beta_{t-1}) \big\|^2 \Big] \le 2 k_0 L(T) . \tag{3.4.7}
\]

Furthermore, there exist some k_2 and k_3 such that applying (3.4.3) gives
\[
  2 E\Big[ \sum_{t=t_0}^{T} \big\| p(\hat\beta_{t-1}) - p^\star \big\|^2 \Big]
  \le 2 k_2\, E\Big[ \sum_{t=t_0}^{T} \big\| \hat\beta_{t-1} - \beta_0 \big\|^2 \Big]
  \le 2 k_2 k_3 \sum_{t=t_0}^{T} \frac{\log \lambda_{\max}(t)}{\lambda_{\min}(t)} . \tag{3.4.8}
\]

Finally, applying (3.4.7) and (3.4.8) to (3.4.6) gives
\[
  \mathrm{Rg}(T) = O\!\left( L(T) + \sum_{t=1}^{T} \frac{\log(t)}{L(t)} \right) .
\]
Letting L(t) = c\sqrt{t \log(t)} with c > 0 gives
\[
  \mathrm{Rg}(T) = O\big( \sqrt{T \log(T)} \big) .
\]

We now demonstrate that our GLM Pricing Algorithm 1 satisfies the conditions of
Theorem 3.4.1. Condition (i) ensures sufficient price dispersion: at each time t we
choose a function L(t) such that tr(P(t)^{-1})^{-1} ≥ L(t) for all t ≥ 2, and since
λ_min(t) ≥ tr(P(t)^{-1})^{-1}, condition (i) is satisfied. Condition (ii) follows from
Proposition 3.4.1 and dictates how to define the next price. Recall that if β̂_{t−1}
exists and condition (i) is satisfied, we set p_{t+1} = p(β̂_{t−1}); if the dispersion check
in condition (II) would fail, we instead set p_{t+1} = p(β̂_{t−1}) + φ_t, where φ_t is defined
in (3.4.4). By choosing ‖φ_t‖² = L̇(t) (1 + max_{p∈P} ‖p‖²), where max_{p∈P} ‖p‖ is a
constant, and using \sum_{t=1}^{T} L̇(t) ≤ L(T), condition (ii) is satisfied.

3.5 Gaussian process pricing model

In this section, we construct a non-parametric model by sampling the expected


demand and expected log of total claims from Gaussian processes (GPs). Our pricing
model uses an alternative upper confidence bound (UCB) rule adapted for the field
of insurance, where demand and claims are considered. Section 3.5.1 discusses the
GP model and Section 3.5.2 describes the GP pricing algorithm. We establish the
regret bound in Theorem 3.5.1 in Section 3.5.3. Proofs of this section are collected
in Appendix B.

3.5.1 Model and assumptions

Gaussian process. First, we give a brief introduction to GP regression; more


details can be found in Rasmussen & Williams (2006). A GP is a collection of
random variables, any finite number of which have a joint Gaussian distribution. We
let x ∈ R^d denote a d-dimensional input vector and y denote a scalar output. The
standard regression model with Gaussian noise can be written as y = f(x) + ε,
where ε ∼ N(0, σ²) is i.i.d. Gaussian noise with variance σ².

A GP is completely specified by its mean function and covariance function (or


kernel). We define the mean function, μ(x), and the covariance function, κ(x, x′), of
the process f(x) by
\[
  \mu(x) = E[f(x)] , \qquad
  \kappa(x, x') = E\big[ (f(x) - \mu(x))\, (f(x') - \mu(x')) \big] ,
\]
and write the GP as
\[
  f(x) \sim \mathcal{GP}\big( \mu(x), \kappa(x, x') \big) .
\]

The kernel determines how observations influence the prediction at nearby or similar
inputs. Two commonly used kernels are the squared exponential kernel, denoted by
κ_{σ,ℓ}, and the Matérn kernel, denoted by κ_{ν,ℓ}, which have the forms
\[
  \kappa_{\sigma,\ell}(x, x') = \exp\!\left( -\frac{1}{2\ell^2} |x - x'|^2 \right) ,
\]
\[
  \kappa_{\nu,\ell}(x, x') = \frac{1}{2^{\nu-1}\Gamma(\nu)}
  \left( \frac{\sqrt{2\nu}\,|x - x'|}{\ell} \right)^{\!\nu}
  B_\nu\!\left( \frac{\sqrt{2\nu}\,|x - x'|}{\ell} \right) ,
\]
where ℓ is the length-scale, ν controls the smoothness of the process, Γ(·) is the
Gamma function and B_ν(·) is the modified Bessel function of the second kind of
order ν. Note that when ν = 1/2 the Matérn kernel reduces to the exponential kernel,
κ_{ν,ℓ}(r) = exp(−r/ℓ), and when ν → ∞ it reduces to the squared exponential kernel.
The mean function, in our case, may depend on the prices chosen, for instance
through a linear or polynomial function of the price. A non-constant mean function
provides the Gaussian process with a global structure, but it does not affect the final
results. Moreover, a zero mean can always be obtained by subtracting the sample
mean. Without loss of generality, we assume that the mean is a constant (e.g. zero)
and the covariance function is strictly bounded.

GP pricing model. We are interested in the noisy samples y_t = [y_1, . . . , y_t]^⊤
obtained at a collection of input prices {p_1, . . . , p_t}. We define p_i as the i-th sample
and y_i = f(p_i) + ε_i, where ε_i ∼ N(0, σ²). As a GP describes a distribution over
functions, we use GP(μ(p), κ(p, p′)) as the prior distribution over f. The posterior
over f is also a GP, with mean μ_t(p) and covariance function κ_t(p, p′) at time t,
which are defined by
\[
  \mu_t(p) = \kappa_t(p)^\top \big( K_t + \sigma^2 I \big)^{-1} y_t ,
  \qquad
  \kappa_t(p, p') = \kappa(p, p') - \kappa_t(p)^\top \big( K_t + \sigma^2 I \big)^{-1} \kappa_t(p') , \tag{3.5.1}
\]
where κ_t(p) = [κ(p_1, p), . . . , κ(p_t, p)]^⊤ and the covariance matrix K_t is the positive
definite matrix with entries (K_t)_{i,j} = κ(p_i, p_j) for i, j = 1, . . . , t.
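A minimal numpy sketch of the posterior update (3.5.1), illustrative only and using a squared exponential kernel as an assumption (any of the kernels above could be substituted):

```python
import numpy as np

def sq_exp_kernel(x, y, ell=1.0):
    """Squared exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 ell^2))."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

def gp_posterior(p_train, y_train, p_test, sigma=0.05, ell=1.0):
    """Posterior mean and standard deviation of f at p_test, as in (3.5.1)."""
    K = sq_exp_kernel(p_train, p_train, ell) + sigma ** 2 * np.eye(len(p_train))
    K_star = sq_exp_kernel(p_test, p_train, ell)               # rows are k_t(p)^T
    alpha = np.linalg.solve(K, y_train)
    mean = K_star @ alpha
    cov = sq_exp_kernel(p_test, p_test, ell) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Example usage on a toy demand curve.
rng = np.random.default_rng(2)
p_obs = rng.uniform(1, 10, size=20)
d_obs = 11 - 0.8 * p_obs + 0.05 * rng.standard_normal(20)
mu, sd = gp_posterior(p_obs, d_obs, np.linspace(1, 10, 5))
```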
We denote the function of expected demand at price pt by fd (pt ), and the function
of expected log of total claims by fc (pt ). Functions fd (·), fc (·) are independently
sampled from GPs with known means µd , µc and kernels κd (p, p0 ), κc (p, p0 ). We can
write fd ∼ GP(µd , κd ) and fc ∼ GP(µc , κc ). The posteriors over fd and fc are GPs
and also follow the GP posterior update in (3.5.1).
The expected revenue function, or revenue function for short, given a determined
price pt at time t, can be written as

r(pt ) = pt · fd (pt ) − fc (pt ) + εrt ,

where the noise term ε^r ∼ N(0, σ_r²) is a combination of the demand noise and the
claims noise. Since the sum of GPs is a GP and the kernel has the form of a direct
sum, the revenue function also follows a GP with an additive kernel. We can write
r ∼ GP(μ_r, κ_r), where μ_r = p · μ_d − μ_c and κ_r = p² · κ_d + κ_c are known.
The cumulative regret in (3.3.3) over the time horizon T can be expressed as
\[
  \mathrm{Rg}(T) = E\Big[ \sum_{t=1}^{T} \big( p^\star f_d(p^\star) - f_c(p^\star) \big)
                 - \big( p_t f_d(p_t) - f_c(p_t) \big) \Big] .
\]

Note that our work is slightly different from Srinivas et al. (2012), who considered
only one function f . In our case, we treat the revenue function r(·) as an additive
function, i.e., r(·) contains two components p · fd (·) and fc (·) and is sampled with
an additive kernel. Furthermore we assume that the samplings of fd (·) and fc (·) are
independent of each other.

3.5.2 Adaptive GP pricing policy

In the GP setting, we determine our pricing policy by the UCB rule: the next price is
\[
  p_t = \arg\max_{p \in \mathcal{P}} \; \mu^r_{t-1}(p) + \varphi_t\, \sigma^r_{t-1}(p) ,
\]
where μ^r_{t−1} is the additive posterior mean and σ^r_{t−1}(p) is the additive posterior
standard deviation, given by
\[
  \mu^r_{t-1} = p_{t-1}\, \mu^d_{t-1} - \mu^c_{t-1} ,
  \qquad
  \sigma^r_{t-1}(p) = p_{t-1}\, \sigma^d_{t-1}(p) + \sigma^c_{t-1}(p) . \tag{3.5.2}
\]

Algorithm 2: GP Pricing Algorithm
Input: GP prior μ_0, σ_0, κ
for t = 1, 2, . . . do
  Select price: p_t ← arg max_{p∈P} μ^r_{t−1}(p) + ϕ_t σ^r_{t−1}(p);
  Sample the revenue function: r_t ← p_t · f_d(p_t) − f_c(p_t) + ε^r_t;
  Update estimates:
    • μ^d_t, μ^c_t, σ^d_t, and σ^c_t by performing the GP posterior update (3.5.1);
    • obtain μ^r_t and σ^r_t.
end

The tuning parameter ϕ_t balances the trade-off between exploitation and
exploration. At time t, an exploitation action chooses the price that maximizes the
posterior mean μ^r_{t−1}, and an exploration action chooses the one that maximizes
the posterior standard deviation σ^r_{t−1} after t − 1 observations. We simply choose
the price that maximizes a weighted sum of the posterior mean and standard deviation.
The pseudocode of the GP pricing algorithm is given in Algorithm 2. At time t,
the selling price p_t is determined by the history F_{t−1} = {p_i, d_i, c_i : i = 1, . . . , t−1}
and the UCB rule. That is, at time t − 1 the algorithm generates the price that
maximizes the UCB function and we let the next sampling price p_t be that maximizer.
Given p_t, we sample f_d(p_t) and f_c(p_t) from their GPs individually and obtain the
sampled revenue, which is a linear combination of the sampled functions f_d(p_t) and
f_c(p_t) plus noise. Next we perform the GP posterior update (3.5.1), which gives the
posterior distributions of demand and log total claims, f_d and f_c, given
F_t = F_{t−1} ∪ {p_t, d_t, c_t}. Finally, we obtain the mean μ^r_t and standard deviation
σ^r_t as defined in (3.5.2), and use them to update the UCB function.
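A hedged sketch of the price-selection step of Algorithm 2 over a finite price grid, reusing the gp_posterior helper sketched in Section 3.5.1 (passed in as an argument here). It combines the demand and claims posteriors into an additive mean and standard deviation in the spirit of (3.5.2); weighting the demand component by the candidate price p, rather than the previous price, is an assumption of this illustration.

```python
import numpy as np

def select_price_ucb(gp_posterior, p_hist, d_hist, c_hist, phi_t,
                     p_low=1.0, p_high=10.0, n_grid=200):
    """One GP-UCB pricing step: maximize mu_r(p) + phi_t * sigma_r(p) over a price grid."""
    grid = np.linspace(p_low, p_high, n_grid)
    mu_d, sd_d = gp_posterior(p_hist, d_hist, grid)   # posterior of expected demand f_d
    mu_c, sd_c = gp_posterior(p_hist, c_hist, grid)   # posterior of expected log total claims f_c
    mu_r = grid * mu_d - mu_c                         # additive revenue mean
    sd_r = grid * sd_d + sd_c                         # additive (conservative) deviation
    ucb = mu_r + phi_t * sd_r
    return grid[np.argmax(ucb)]
```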

3.5.3 Main results

Our main result in the GP setting is shown in Theorem 3.5.1. It follows the work
of Srinivas et al. (2012), but we extend their model to an additive form since the
insurance product has demand and claims.

Theorem 3.5.1. Pick δ ∈ (0, 1) and choose
\[
  \varphi_t = 2 \log\big( 2 t^2 \pi^2 / (3\delta) \big) + 2 \log\big( t^2 \big) .
\]




With probability greater than 1 − δ, we obtain
\[
  \mathrm{Rg}(T) = O\big( \sqrt{\gamma_T\, T \log T} \big) ,
\]
where γ_T := γ_T(κ_d + κ_c) is the maximum information gain.

Theorem 3.5.1 shows that the regret is of order O(\sqrt{\gamma_T T \log T}). Here γ_T
describes the maximum amount of information that the GP pricing algorithm can
learn about the demand and the log of total claims, which in our case is governed
by the additive kernel, as shown in Lemma 3.5.3. Note that the key to bounding the
regret is to bound the information gain.
We start by defining the information gain. Suppose there is a finite subset
D = {p1 , . . . , pT } ⊂ P. We denote the observations by yD = f (pD ) + εD and the
function values at these points by fD = f (pD ). The Shannon Mutual Information,
denoted by I(·), is defined as

\[
  I(y_D ; f_D) = H(y_D) - H(y_D \mid f) ,
\]
where H(·) is the entropy. After T rounds, the maximum information gain between
y_D and f_D is
\[
  \gamma_T(\kappa) := \max_{D \subset \mathcal{P} : |D| = T} I(y_D ; f_D) . \tag{3.5.3}
\]
In the multivariate Gaussian case, H(N(μ, Σ)) = ½ log|2πeΣ|, so that
I(y_D; f_D) = ½ log|I + σ^{-2} K_D|, where K_D = [κ(p, p′)]_{p, p′ ∈ D} is the Gram matrix
of κ evaluated on D. Thus, bounds on γ_T(κ) depend on the chosen kernel. For
example, γ_T(κ) = O(d log T) under a linear kernel, γ_T(κ) = O((log T)^{d+1}) under a
squared exponential kernel, and γ_T(κ) = O( T^{d(d+1)/(2ν + d(d+1))} \log T ) under a
Matérn kernel with ν > 1. We refer readers to Srinivas et al. (2012) for more details.
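For concreteness, the mutual information of a chosen design can be evaluated directly from the Gram matrix; a small illustrative sketch (our own, not from the thesis):

```python
import numpy as np

def information_gain(K, sigma=0.05):
    """Shannon mutual information I(y_D; f_D) = 0.5 * log det(I + sigma^{-2} K_D).

    K: Gram matrix of the kernel evaluated on the chosen prices D.
    """
    T = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + K / sigma ** 2)
    return 0.5 * logdet

# Example: information gain of 10 equally spaced prices, squared exponential kernel.
p = np.linspace(1, 10, 10)
K = np.exp(-np.subtract.outer(p, p) ** 2 / 2)
print(information_gain(K))
```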
To bound γT (κd +κc ), we use the following two results from Srinivas et al. (2012).
Lemma 3.5.1 gives the expression of information gain for the points selected in terms
of the predictive variances. Lemma 3.5.2 provides the finite bound on γT (κ) via the
eigendecay of the kernel, which will be useful to bound the regrets. For the proofs
of both lemmas we refer readers to Theorem 8 in Srinivas et al. (2012).

Lemma 3.5.1. In a Gaussian process, the information gain on f_T = (f(p_t)) ∈ R^T for
the points selected by the algorithm can be written as
\[
  I(y_T ; f_T) = \frac{1}{2} \sum_{t=1}^{T} \log\big( 1 + \sigma^{-2} \sigma_{t-1}(p_t)^2 \big) ,
\]

where σ 2 is the variance of Gaussian noise and σt−1 (pt )2 is the posterior variance
after t − 1 observations.

To derive the regret bound, we also need Assumption 3.5.1; we assume that both the
demand and total claims functions f_d, f_c satisfy it.

Assumption 3.5.1. Let f be sampled from a GP with kernel κ(p, p′) and let f be
almost surely continuously differentiable. Assume that the partial derivatives ∂f/∂p_j
of this sample path, for j = 1, . . . , d, satisfy the following condition with high
probability: for some constants a, b > 0,
\[
  \mathbb{P}\Big( \sup_{p \in \mathcal{P}} \Big| \frac{\partial f}{\partial p_j} \Big| > J \Big)
  \le a e^{-(J/b)^2} .
\]

Lemma 3.5.2. Assume that Assumption 3.5.1 holds. Choose n_T = c T^τ log T, where
c is a constant and τ > 0. For any T_* ∈ {1, . . . , min(T, n_T)}, let
B_κ(T_*) = \sum_{s > T_*} λ_s, where λ_s is the s-th eigenvalue of the kernel κ(p, p′)
w.r.t. the uniform distribution over P. Then the bound on γ_T is
\[
  \gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{r \in \{1, \dots, T\}}
  \Big( T_* \log\big( r\, n_T / \sigma^2 \big)
  + c\, \sigma^2 (1 - r/T) (\log T) \big( T^{\tau+1} B_\kappa(T_*) + 1 \big) \Big)
  + O\big( T^{1 - \tau/d} \big) .
\]

Lemma 3.5.3 shows the bound on γT (κd + κc ).

Lemma 3.5.3. Let κd = κd (p, p0 ) and κc = κc (p, p0 ) be kernel functions for demand
and claims, respectively. Then

γT (κd + κc ) ≤ γT (κd ) + γT (κc ) .

This lemma shows that γT (κd + κc ) has an additive form. We can find bounds on
γT (κd ) and γT (κc ) separately, and then sum them to obtain the bound on γT (κd +κc ).
Proposition 3.5.1 shows the bound on the single-period regret at each time t.
The cumulative regret for the additive combination kernel is the sum of single-period
regrets over time horizon T .

Proposition 3.5.1. Pick δ ∈ (0, 1) and choose
\[
  \varphi_t = 2 \log\big( 2 t^2 \pi^2 / (3\delta) \big) + 2 \log\big( t^2 \big) .
\]
With probability greater than 1 − δ, for all t ≥ 1, we obtain the bound
\[
  r(p^\star) - r(p_t) \le 2 \sqrt{\varphi_t}\, \sigma^r_{t-1}(p_t)
  + \frac{\sqrt{u b \log(2a/\delta)}}{t^2} ,
\]
where p_t = \arg\max_{p \in \mathcal{P}} \mu^r_{t-1}(p) + \varphi_t \sigma^r_{t-1}(p) and a, b, u > 0 are constants.

Proof of Theorem 3.5.1. For any price p ∈ P, we consider the functions of demand
and log of total claims, f_d, f_c : P → R, sampled from GPs and satisfying
Assumption 3.5.1. The cumulative regret is
\[
  \mathrm{Rg}(T) = E\Big[ \sum_{t=1}^{T} r(p^\star) - r(p_t) \Big] .
\]
We know that the price set P is bounded, and the expected demand and claims
functions f_d, f_c are independently multivariate Gaussian distributed with bounded
variances. Without loss of generality, we assume that κ_r ≤ 1. By Proposition 3.5.1
we have
\[
  r(p^\star) - r(p_t) \le 2 \sqrt{\varphi_t}\, \sigma^r_{t-1}(p_t)
  + \frac{\sqrt{u b \log(2a/\delta)}}{t^2} , \tag{3.5.4}
\]
where σ^r_{t−1}(p) = p_{t−1} σ^d_{t−1}(p) + σ^c_{t−1}(p). By the convexity of the logarithm
function, we know that w² ≤ v² log(1 + w²)/log(1 + v²) for w ≤ v. Since
σ^r_{t−1}(p_t)² ≤ κ(p_t, p_t) = 1 and σ^{−2} σ^r_{t−1}(p_t)² ≤ σ^{−2}, letting
w² = σ^{−2} σ^r_{t−1}(p_t)² and v² = σ^{−2} gives
\[
  \sigma^r_{t-1}(p_t)^2 \le \frac{1}{\log(1 + \sigma^{-2})}
  \log\big( 1 + \sigma^{-2} \sigma^r_{t-1}(p_t)^2 \big) .
\]

By Jensen’s inequality and the fact that ϕ_t is nondecreasing, we have
\[
  \sum_{t=1}^{T} 2 \sqrt{\varphi_t}\, \sigma^r_{t-1}(p_t)
  \le 2 \sqrt{\varphi_T} \sqrt{ T \sum_{t=1}^{T} \sigma^r_{t-1}(p_t)^2 }
  \le 2 \sqrt{\varphi_T} \sqrt{ T \sum_{t=1}^{T}
      \frac{\log\big( 1 + \sigma^{-2} \sigma^r_{t-1}(p_t)^2 \big)}{\log(1 + \sigma^{-2})} }
  \le \sqrt{ T \varphi_T\, \frac{8\, I(y_T ; f_T)}{\log(1 + \sigma^{-2})} }
  \le c_1 \sqrt{ T \varphi_T \gamma_T } , \tag{3.5.5}
\]

where c_1 = \sqrt{8 / \log(1 + \sigma^{-2})} and γ_T = γ_T(κ_d + κ_c). The last inequality
follows from Lemma 3.5.1 and the definition of γ_T in (3.5.3).
Summing (3.5.4) over t = 1, . . . , T and using (3.5.5), we have, with probability
greater than 1 − δ,
\[
  \sum_{t=1}^{T} r(p^\star) - r(p_t)
  \le \sum_{t=1}^{T} \Big( 2 \sqrt{\varphi_t}\, \sigma^r_{t-1}(p_t)
      + \frac{\sqrt{u b \log(2a/\delta)}}{t^2} \Big)
  = \sum_{t=1}^{T} 2 \sqrt{\varphi_t}\, \sigma^r_{t-1}(p_t)
    + \sum_{t=1}^{T} \frac{\sqrt{u b \log(2a/\delta)}}{t^2}
  \le c_1 \sqrt{T \varphi_T \gamma_T} + c_2 ,
\]
where c_2 = \sum_{t=1}^{T} \sqrt{u b \log(2a/\delta)}/t^2 is a constant (since a, b, u, δ are
constants and \sum_t 1/t^2 = \pi^2/6). Hence, we obtain the stated regret bound.

3.6 GLM and GP algorithms with unknown delays
In previous sections we assumed that the insurance premium and total claims were
paid at the beginning of each insurance period. This was to illustrate how these
methods can be applied in an insurance setting. However, in the real world, claims
are only triggered when the insured events happen so are not paid out immediately
when an insurance product is purchased. It also takes time to process claims.
A pricing decision during the delay may lead to an incorrect optimal price, and
thus increase the regret. As a consequence, we incorporate delayed claims here and
treat our pricing problem as a delayed online learning problem. Our work is inspired
by Joulani et al. (2016), who considered adversarial delays and proposed an
algorithm that reduces the delayed-feedback problem to a non-delayed online learning
problem in the full information setting. They showed that the regret under the
delayed setting has two components: the regret in a non-delayed setting and the sum
of all delays. In the full information setting, the regret is deterministic, since the
action is chosen by an adversary and the feedback is independent of the action. In
our work, we apply their result to a bandit case, where we choose the selling price
based on previous observations and then receive feedback.

3.6.1 Model and assumptions

We assume that demand is generated and observed immediately when a price is


chosen. However, claims may be delayed, and consequently, revenue may also be
delayed. Suppose a price is chosen at time t and claims are observed at time t + τt ,
where τt is the delayed time. Assume that τt , for t = 1, . . . , T , is an i.i.d. random
sequence and is independent of prices and claims. When delay occurs, the company
observes revenue rt at time t + τt . Note that it is possible that τt = 0 if there is
no delay. We assume that there exists a maximum delay time m, and the delayed
claims can only be observed by time t + m if τt > m or t + τt if τt ≤ m. Without
loss of generality, we assume that all information (i.e., demand, claims and revenue)
can be received by the end of time horizon T .
The price pt is chosen based on history Ht , which is the past information of prices
and demand {pi , di : i = 1, . . . , t − 1} and delayed claims {ci : i + τi ≤ t − 1}. If
τt = 0 for all t, we have Ht = Ft .
Consider that the company makes a pricing decision at time t0 , and then observes
claims at time t, where t = t0 + τt0 and τt0 is the delay time. We let ct0 +τt0 be claims
at time t0 + τt0 and DC t = {t0 : t0 = t − τt0 } the set of starting time for delayed claims
by the end of time t. Here DC t can be empty, for instance when τt0 = 0 or τt0 > m
for all t0 . The corresponding observed revenue at time t is rt = pt0 dt0 − ct0 +τt0 . It
is worth noting that there is less information available when choosing next price in
the delayed case than in the non-delayed case.
We use ρ(s) to denote the time when the insurance company observes the s-th
claim and rρ(s) to denote the corresponding revenue for s = 1, . . . , T . Let r̃s = rρ(s) .
The insurance company determines the next selling price p̃s+1 after observing r̃s .

We let N(t) = \sum_{i=1}^{t-1} \mathbf{1}\{ i + \tau_i < t \} be the number of claims observed
by the end of time t − 1, for t = 1, . . . , T, where 1{·} denotes the indicator function.
Since the company determines the next selling price using the latest observed claim,
we have p_t = p̃_{N(t)+1}.
Let τ̃_s = s − 1 − N(ρ(s)), where s − 1 is the number of claims that could have been
observed if there were no delay and N(ρ(s)) is the actual number of claims observed
by time ρ(s) − 1. Therefore, τ̃_s is the number of delayed claims that have not yet
been received between choosing the price p_{ρ(s)} and observing the corresponding
revenue r_{ρ(s)}. This gives p_{ρ(s)} = p̃_{s−τ̃_s}. The regret can be expressed as
\[
  \sum_{t=1}^{T} \sum_{t' \in DC_t} \big( r(p^\star) - r(p_{t'}) \big)
  = \sum_{s=1}^{T} r_{\rho(s)}(p^\star) - r_{\rho(s)}(p_{\rho(s)})
  = \sum_{s=1}^{T} \tilde r_s(p^\star) - \tilde r_s(\tilde p_{s-\tilde\tau_s})
  = \sum_{s=1}^{T} \big( \tilde r_s(p^\star) - \tilde r_s(\tilde p_s) \big)
   + \sum_{s=1}^{T} \big( \tilde r_s(\tilde p_s) - \tilde r_s(\tilde p_{s-\tilde\tau_s}) \big) .
\]

We can see that the regret with delayed claims has two components: the non-delayed
regret and an additional regret caused by delays. We are going to bound each of
these two terms separately.
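A minimal sketch of the delayed-feedback bookkeeping described above (illustrative, not the thesis code): a claim generated by the decision at time t′ arrives at time t′ + τ_{t′}, and the quantities N(t) and the arrival sets DC_t are maintained as the simulation advances.

```python
from collections import defaultdict

def simulate_delay_bookkeeping(delays):
    """Track arrival times of delayed claims.

    delays: list [tau_1, ..., tau_T] of non-negative integer delays.
    Returns N(t) for each t (claims observed before time t) and the sets DC_t.
    """
    T = len(delays)
    DC = defaultdict(list)                    # DC_t: decision times whose claims arrive at t
    for t_prime, tau in enumerate(delays, start=1):
        DC[t_prime + tau].append(t_prime)
    N = []
    observed = 0
    for t in range(1, T + 1):
        N.append(observed)                    # N(t) = claims arrived at times 1, ..., t-1
        observed += len(DC[t])                # claims arriving at time t
    return N, dict(DC)

# Example: three rounds with delays 2, 0, 0.
print(simulate_delay_bookkeeping([2, 0, 0]))
```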

3.6.2 Adaptive GLM pricing with unknown delays

In this section, we present the main result in Theorem 3.6.1. In the GLM setting
with delays, the parameters are again unknown. We use maximum quasi-likelihood
estimation to infer them, similarly to the non-delayed case. Specifically, the MQLEs,
denoted by β̂_t^D (D for delay), are solutions to the equation
\[
  \ell_t(\hat\beta_t^D)
  = \sum_{i=1}^{t} \sum_{i' \in DC_i} \frac{1}{\sigma^2}\,
    p_{i'} \Big( c_{i'+\tau_{i'}} - h\big( p_{i'}^\top \hat\beta_t^D \big) \Big) = 0 .
\]

The following result is an extension of Theorem 3.4.1 to delayed cases.

Theorem 3.6.1. Suppose there exists t_0 ∈ N and choose L(t) ∈ L for all t ≥ t_0 such
that the MQLE β̂_t^D is strongly consistent. Denote the sum of all delays by
S_D := \sum_{t=1}^{T} τ_t. If the following conditions are satisfied:

(i) λ_min(t) ≥ L(t) a.s., for all t ≥ t_0,

(ii) \sum_{t=1}^{T} \| p_t - p(\hat\beta^D_{t-1}) \|^2 \le k_4 L(T) a.s., for all T ≥ t_0 and some k_4 > 0,

then the regret bound is
\[
  \mathrm{Rg}(T) = O\!\left( (S_D + 1)\, L(T) + \sum_{t=1}^{T} \frac{\log(t)}{L(t)} \right) .
\]
Choosing L(t) = c\sqrt{t \log(t)} for some c > 0 gives
\[
  \mathrm{Rg}(T) = O\big( \sqrt{T \log(T)} \big) .
\]

Theorem 3.6.1 gives the regret bound for the GLM pricing algorithm with unknown
delays. The bound O\big( (S_D + 1) L(T) + \sum_{t=1}^{T} \log(t)/L(t) \big) contains an
additional regret term due to the delayed claims, where S_D denotes the accumulated
delay. Letting L(t) = c\sqrt{t \log(t)} gives the regret bound O(\sqrt{T \log(T)}), which is
consistent with prior work.
We now prove the main result of this section; the proof is similar to that of
Theorem 3.4.1.

Proof of Theorem 3.6.1. We consider the cumulative regret over the selling horizon T
in terms of ‖β̂^D_{t−1} − β_0‖²: there exists some k_5 such that
\[
  E\Big[ \sum_{t=1}^{T} \big| r(p_t, \beta_0) - r(p^\star, \beta_0) \big| \Big]
  \le E\Big[ \sum_{t=1}^{T} k_5 \| p_t - p^\star \|^2 \Big]
  \le E\Big[ 2 k_5 \sum_{t=1}^{T} \big\| p_t - p(\hat\beta^D_{t-1}) \big\|^2 \Big]
   + E\Big[ 2 k_5 \sum_{t=1}^{T} \big\| p(\hat\beta^D_{t-1}) - p^\star \big\|^2 \Big] , \tag{3.6.1}
\]
for all t ≥ t_0, where t_0 is the smallest such natural number. The first inequality is an
immediate consequence of Proposition 3.4.2. The second inequality is obtained from
the fact that (a+b)² ≤ 2a² + 2b² for all a, b > 0, that is,
\[
  \| p_t - p^\star \|^2
  = \big\| p_t - p(\hat\beta^D_{t-1}) + p(\hat\beta^D_{t-1}) - p^\star \big\|^2
  \le 2 \big\| p_t - p(\hat\beta^D_{t-1}) \big\|^2
   + 2 \big\| p(\hat\beta^D_{t-1}) - p^\star \big\|^2 .
\]

For the first term in (3.6.1), taking expectations and summing over t = t_0, . . . , T
gives
\[
  E\Big[ \sum_{t=t_0}^{T} \big\| p_t - p(\hat\beta^D_{t-1}) \big\|^2 \Big]
  \le E\Big[ \sum_{s=t_0}^{T} \big\| \tilde p_{s-\tilde\tau_s} - \tilde p_s
        + \tilde p_s - p(\hat\beta^D_{s-1}) \big\|^2 \Big]
  \le 2 E\Big[ \sum_{s=t_0}^{T} \| \tilde p_{s-\tilde\tau_s} - \tilde p_s \|^2 \Big]
   + 2 E\Big[ \sum_{s=t_0}^{T} \big\| \tilde p_s - p(\hat\beta^D_{s-1}) \big\|^2 \Big] .
\]
The last inequality is obtained since
\[
  \| \tilde p_{s-\tilde\tau_s} - \tilde p_s \|^2
  \le \sum_{j=0}^{\tilde\tau_s - 1}
      \| \tilde p_{s-\tilde\tau_s+j} - \tilde p_{s-\tilde\tau_s+j+1} \|^2
  \le k_6\, \tilde\tau_s\, \dot L(s) ,
\]

due to the fact that ‖p̃_s − p̃_{s+1}‖² ≤ k_6 L̇(s), where k_6 = 1 + max_{p∈P} ‖p‖² > 0
is a constant. Since L(s) is an increasing concave function, by the Mean Value
Theorem we have L(s+1) − L(s) = L̇(s̃) for some s̃ ∈ (s, s+1). This implies
L(s+1) − L(s) ≥ L̇(s+1) and
\[
  \sum_{s=t_0}^{T-1} \dot L(s+1)
  \le \sum_{s=t_0}^{T-1} \big( L(s+1) - L(s) \big)
  = L(T) - L(t_0) \le L(T) .
\]

According to condition (ii), we know that
\sum_{s=t_0}^{T} \| \tilde p_s - p(\hat\beta^D_{s-1}) \|^2 \le k_4 L(T). Using the identity
\sum_{s=1}^{T} \tilde\tau_s = \sum_{t=1}^{T} \tau_t from Lemma 3.6.1, which is stated
immediately after this theorem, we obtain the bound on the first term
\[
  2 E\Big[ \sum_{t=t_0}^{T} \big\| p_t - p(\hat\beta^D_{t-1}) \big\|^2 \Big]
  \le 2 \big( k_6 S_D + k_4 \big) L(T) . \tag{3.6.2}
\]

For the second term in (3.6.1), using (3.4.3), there exists some k_7 such that
\[
  2 E\Big[ \sum_{t=t_0}^{T} \big\| \hat\beta^D_{t-1} - \beta_0 \big\|^2 \Big]
  \le 2 k_7 \sum_{t=t_0}^{T} \frac{\log(t)}{L(t)} . \tag{3.6.3}
\]

Applying (3.6.2) and (3.6.3) to (3.6.1) gives the regret
\[
  \mathrm{Rg}(T) \le 2 k_5 \big( k_6 S_D + k_4 \big) L(T)
  + 2 k_5 k_7 \sum_{t=t_0}^{T} \frac{\log(t)}{L(t)} .
\]
Letting L(t) = c\sqrt{t \log(t)} for some c > 0, we obtain
\[
  \mathrm{Rg}(T) \le 2 k_5 \big( k_6 S_D + k_4 \big) \sqrt{T \log(T)}
  + 2 k_5 k_7 \sqrt{T \log(T)} .
\]
Finally, we obtain the result.

Lemma 3.6.1. With τ̃_s, s = 1, . . . , T, defined as above, we have
\[
  \sum_{s=1}^{T} \tilde\tau_s = \sum_{t=1}^{T} \tau_t .
\]
This lemma shows that the sum of the τ̃_s equals the sum of all delays τ_t over the
horizon T.

Algorithm 3: GP Pricing Algorithm with Delayed Claims
Input: GP Pricing Algorithm 2
for t = 1, 2, . . . do
  Collect the set of decision times DC_t whose claims are received at time t;
  for t′ ∈ DC_t do
    Select price: p_{t′} ← GP Pricing Algorithm 2;
    Update the GP with p_{t′}, f^d_{t′}(p_{t′}) and f^c_{t′+τ_{t′}}(p_{t′});
  end
end

3.6.3 Adaptive GP pricing with unknown delays

Currently there are no proven regret bounds for GPs with delays. In this section, we
present the implementation of the GP pricing algorithm in the delayed-claims case.
The pseudocode for pricing a newly released insurance product with delays via the
GP algorithm is shown in Algorithm 3. At each time t, we set a price by the GP
Pricing Algorithm 2 and collect the set of decision times for the delayed claims,
DC_t = {t′ : t′ = t − τ_{t′}}. For each t′ ∈ DC_t, the premium is observed at time t′,
while the claims are received at time t = t′ + τ_{t′}. We let the premium and the log of
the delayed claims be p_{t′} · f^d_{t′}(p_{t′}) and f^c_{t′+τ_{t′}}(p_{t′}), respectively, where f^d and
f^c denote the demand and claims functions. After receiving the delayed claims, we
update the GP for each claim and its revenue; this then determines the next selling
price.

3.7 Numerical examples in insurance

In this section, we show simulation results for the algorithms discussed in the previous
sections: (i) the GLM policy defined in Algorithm 1; (ii) the GP-UCB policy defined
in Algorithm 2; (iii) the D-GLM and D-GP-UCB policies designed for the case with
unknown delays, where D stands for delay. The performance of these policies is
measured by the regret.

3.7.1 Adaptive GLM pricing without delays

We consider a simple instance of our problem, where both the demand and the log of
total claims are normally distributed with constant noise level (σ = 0.05) and
expectations that are linear functions of the price.
are linear functions of price.
We assume the true parameters are β0 = ((11, −0.8), (3, 0.25)), but they are
not known to the company. We let initial price vectors be p(1) = (1, 3)> and
p(2) = (1, 3.3)> and feasible price set to be [p` , ph ] = [1, 10]. Then we write demand
and total claims models as

E[D(p)] = 11 − 0.8p ,

E[C(p)] = 3 + 0.25p .

In the expected demand function we choose a negative slope, −0.8, because demand
is strictly decreasing in price, while in the expected log of total claims function we
choose a positive slope, 0.25, because claims normally increase with price.
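For this linear example the optimal price can be computed in closed form; a short illustrative check (our own, not from the thesis) confirms where the expected revenue r(p) = p(11 − 0.8p) − (3 + 0.25p), taken as p·E[D(p)] − E[C(p)] with C the log of total claims as in (3.3.1), is maximized on the feasible set [1, 10].

```python
import numpy as np

a0, a1, b0, b1 = 11.0, -0.8, 3.0, 0.25
revenue = lambda p: p * (a0 + a1 * p) - (b0 + b1 * p)   # r(p, a, b) with identity links

# First-order condition: d/dp [a0*p + a1*p^2 - b0 - b1*p] = a0 + 2*a1*p - b1 = 0
p_star = (b1 - a0) / (2 * a1)            # = 6.71875, inside the feasible set [1, 10]
grid = np.linspace(1, 10, 10_000)
assert abs(grid[np.argmax(revenue(grid))] - p_star) < 1e-2
print(p_star, revenue(p_star))
```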
We let T = 2000. Figure 3.1 shows the price dispersion and the convergence of the
parameter estimates. As shown in Figure 3.1a, λ_min(t)/\sqrt{t \log(t)} is asymptotically
bounded around 0.01. This means that λ_min(t) grows at the rate \sqrt{t \log(t)};
recall that this is a key condition in Theorem 3.4.1. Thus we set
L(t) = 0.01\sqrt{t \log(t)}, which guarantees the price dispersion in our pricing problem.
Figure 3.1b shows the value of ‖β̂_t − β_0‖², the squared norm of the difference
between the parameter estimates β̂_t and the true parameters β_0. We can see that the
difference converges to zero as t becomes large, i.e., β̂_t → β_0 as t → ∞. This shows
the strong consistency of the parameter estimates.

3.7.2 Adaptive GP pricing without delays

In the GP setting, we sample the demand and claims functions, f_d and f_c, from GPs
with Matérn kernels k_{3/2}(r) and k_{5/2}(r) respectively,
\[
  k_{3/2}(r) = \Big( 1 + \frac{\sqrt{3}\, r}{\ell} \Big) \exp\Big( -\frac{\sqrt{3}\, r}{\ell} \Big) ,
  \qquad
  k_{5/2}(r) = \Big( 1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2} \Big)
               \exp\Big( -\frac{\sqrt{5}\, r}{\ell} \Big) ,
\]
with length-scale parameter ℓ = 1. In addition, we set the sample noise to σ = 0.05,
i.e., σ² = 0.0025. As in the GLM setting, we take the feasible price set to be
[p_ℓ, p_h] = [1, 10]. We run the algorithm for T = 100 with δ = 0.1.
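A small sketch of these two Matérn kernels (illustrative; equivalent forms are available in standard GP libraries):

```python
import numpy as np

def matern_3_2(r, ell=1.0):
    """Matern kernel with nu = 3/2: (1 + sqrt(3) r / ell) * exp(-sqrt(3) r / ell)."""
    s = np.sqrt(3.0) * np.abs(r) / ell
    return (1.0 + s) * np.exp(-s)

def matern_5_2(r, ell=1.0):
    """Matern kernel with nu = 5/2: (1 + sqrt(5) r / ell + 5 r^2 / (3 ell^2)) * exp(-sqrt(5) r / ell)."""
    s = np.sqrt(5.0) * np.abs(r) / ell
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

# Gram matrices on a price grid, used as GP prior covariances for demand and claims.
p = np.linspace(1, 10, 5)
K_d = matern_3_2(np.subtract.outer(p, p))
K_c = matern_5_2(np.subtract.outer(p, p))
```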

3.7.3 Adaptive GLM and GP algorithms with delays

In order to observe the effect of delays, we consider the same settings as in the
non-delayed examples. We assume that the delays τ_t are unknown non-negative
integers generated randomly on the interval [0, m] with m = 10. The last few delays
are slightly modified to ensure that all claims and revenues are received by time T.

3.7.4 Comparison of GLM and GP algorithms with and without delays

The performance of the four pricing policies is shown in Figures 3.2 and 3.3.
Figure 3.2a plots the cumulative regret incurred by the GLM pricing algorithms with
and without delays. We can see that the regret increases with t in both cases. To
make the results more precise, we re-scale the regret to Rg(T)/\sqrt{T \log(T)}.
Figure 3.2b shows that Rg(T)/\sqrt{T \log(T)} is roughly constant in both cases, which
means the regret is of order \sqrt{T \log(T)}. In addition, we observe a gap between the
GLM and D-GLM pricing algorithms due to the delays: the delays cause a larger
regret, which means that the company suffers an extra loss.

Figure 3.3 shows the regret obtained by the GP pricing algorithms with and without
delays. Figure 3.3a shows that the regret quickly converges to a fixed value in both
cases; moreover, the regret in the delayed setting is larger than in the non-delayed
setting. Figure 3.3b shows that Rg(T)/\sqrt{T \log(T)} converges to a constant, which
verifies that the order of the regret is \sqrt{T \log(T)}. The gap between the GP and
D-GP pricing algorithms is due to the delays, and the company suffers an extra loss.
Comparing Figures 3.2a and 3.3a, we see that the regret obtained by the GP pricing
algorithm converges more quickly and to a smaller value. Comparing Figures 3.2b
and 3.3b, we see that both algorithms achieve the same order of regret bound.
Overall, the GP pricing algorithm outperforms the GLM pricing algorithm.

3.8 Conclusions and future directions

In this chapter, we considered an insurance company that launches a new non-


life insurance product online, but both demand and claims are unknown to the
company. The goal of the company is to select prices that maximize the long-term
revenue. We investigated two approaches—an adaptive GLM and an adaptive GP
regression model—for the insurance pricing problem. In the real world, claims occur
only when the insured events happen, so claim amounts cannot be observed
immediately due to delays. We started with a simple case in which claims were assumed
to be paid without delay and then discussed the case with delayed claims. The
example above showed that both the GLM and GP pricing algorithms perform well.
In the delayed case, both algorithms obtain higher regret due to the extra losses. By
comparison, the GP pricing algorithm converges more quickly. However, GLMs have
a long history of good performance in insurance, so it is important to also consider
the GLM setting as we have done here. More broadly, our findings suggest that the
GP regression model is applicable in insurance.
There are several directions for future investigation. So far we have focused on
the single-product dynamic pricing problem. A natural extension would be to
consider an insurance company that sells multiple products, or multiple companies in
the market, whereas much of the literature treats the single-product setting.
Furthermore, in this chapter we assumed that changes in demand and claims depend
only on prices. However, the company also receives contextual information that can
be used to predict demand changes. The contextual information may contain
competitors' prices for similar products or personal features such as customer
demographics and purchase history. We consider an extension of dynamic pricing with
contextual information in Chapter 4. Another extension would be to incorporate
contextual information in Chapter 4. Another extension would be to incorporate
the ruin probability, a measure of the risk when making decisions. Ruin does not
necessarily mean that the company is bankrupt, but that the capital reserved (or
revenue) is not sufficient. In the insurance market, ruin might happen as the result
of one large claim or a collection of small claims. The probability of ruin will be
incorporated in Chapter 5. In addition, we are interested in applying reinforcement

learning to solve online revenue management problems, which will be discussed in
Chapter 5. After all, bandit problems, as considered in this chapter, are simple
reinforcement learning routines. However, more complex future market interactions
might be considered instead and this problem may be modeled as a Markov decision
process.

[Figure 3.1: Price dispersion (a) and convergence rate of the parameter estimates (b) for the GLM pricing policy.]

[Figure 3.2: Cumulative regret (a) and rate of convergence (b) for the GLM pricing algorithm. GLM denotes the non-delayed case and D-GLM the delayed case.]

(a) Cumulative regret (b) Rate of convergence

Figure 3.3: Cumulative regret and convergence rate for the GP pricing algorithm. GP
denotes the non-delayed case and D-GP denotes the delayed case.

72
3.9 Appendices

Appendix A: Proofs of the results in Section 3.4

Proof of Proposition 3.4.1. If β̂t exists and tr(P (t)−1 )−1 ≥ L(t), we first set the next
price to be pt+1 = pce , and check the condition (3.4.5). If this condition does not
hold, which is equivalent to
 −1 −1
tr P (t) + pce p>
ce < L(t + 1) , (3.9.1)

we then let the next price be p0 = pce + φt . We will show that there exists φt , such
that  −1 −1
0 0>
tr P (t) + p p ≥ L(t + 1) . (3.9.2)

We apply the Sherman–Morrison formula (Bartlett, 1951), which is


A−1 uv > A−1
(A + uv > )−1 = A−1 − .
1 + v > A−1 u
We let A = P (t) and u = v = p0 , then

0 0>
−1
−1 P (t)−1 p0 p0 > P (t)−1
P (t) + p p = P (t) − ,
1 + p0 > P (t)−1 p0
which gives
!
P (t)−1 p0 p0 > P (t)−1
 −1 
0 0> −1
tr P (t) + p p = tr P (t) −
1 + p0 > P (t)−1 p0
!
−1
P (t)−1 p0 p0 > P (t)−1
= tr P (t) − tr
1 + p0 > P (t)−1 p0
 
−1 0 0 > −1
 tr P (t) p p P (t)
= tr P (t)−1 −
1 + p0 > P (t)−1 p0
kP (t)−1 p0 k
= tr P (t)−1 −

,
1 + p0 > P (t)−1 p0
1 1
where k · k is denote the Euclidean norm. We know t → L(t)
is convex and L(t)
+
 
d 1 1
dt L(t)
≤ L(t+1) . If we can prove that

kP (t)−1 p0 k
 
−1
 1 d 1
tr P (t) − 0 > −1 0
≤ + ,
1 + p P (t) p L(t) dt L(t)
then we can say that (3.9.2) is satisfied. Moreover, given tr (P (t)−1 ) ≤ 1
L(t)
, our aim
is to show
kP (t)−1 p0 k
 
d 1
0 > −1 0
≥− ,
1 + p P (t) p dt L(t)

73
Here we discuss a general case. Let t > n + 1 and λ1 ≥ . . . ≥ λn+1 > 0 be the
eigenvalues of P (t) ⊂ Rn+1 . Assume that v1 , . . . , vn+1 are associated eigenvectors,
which form an orthonormal basis of Rn+1 . Define the optimal price as a linear
combination of these unit eigenvectors, given by pce = n+1
P
i=1 αi vi . Let the next price

be p0 = pce +  (vn+1,1 pce − vn+1 ) for some , where vn+1,1 is the first component of
vn+1 . We know that kvi k = 1 and |vn+1,i | ≤ 1 for all i. Then we have

 
0 2 2 2 2 2
kp − pce k =  kvn+1,1 pce − vn+1 k ≤  1 + max kpk .
p∈P

This demonstrates that

 
2 2 2
kφt k ≤  1 + max kpk .
p∈P

We choose || < 1 such that

 ≥ 0 if αn+1 ≤ 0 ,  < 0 if αn+1 > 0 , (3.9.3)

and the first derivative of L(t), denoted by L̇(t), is

 −1
2 −2 −1 2 (3.9.4)
L̇(t) ≤  (n + 1) 1 + L(n + 1) max kpk .
p∈P

Write kp0 k2P (t)−1 = p0 > P (t)−1 p0 . Since λmax (P (t)−1 ) = λmin (P (t))−1 and λmin (P (t)) ≥
L(t), for t > n + 1 we have

1 + kp0 k2P (t)−1 ≤ 1 + λmax P (t)−1 kp0 k2




= 1 + λmin (P (t))−1 kp0 k2


(3.9.5)
≤ 1 + L(t)−1 kp0 k2

≤ 1 + L(n + 1)−1 max kpk2 .


p∈P

74
Moreover,
n+1 n+1
! !! 2
−1 0 2
X X
P (t) p = P (t)−1 αi vi +  vn+1,1 αi vi − vn+1
i=1 i=1
n+1 n
! !! 2
X X
−1
= P (t) αi vi +  vn+1,1 αi vi + αn+1 vn+1 − vn+1
i=1 i=1
n
! 2
X
= P (t)−1 (αn+1 +  (vn+1,1 αn+1 − 1)) vn+1 + (1 +  vn+1,1 ) αi vi
i=1
n 2
X
= (αn+1 +  (vn+1,1 αn+1 − 1)) λ−1
n+1 vn+1 + (1 +  vn+1,1 ) λ−1
i αi vi
i=1
n
2 X 2
= (αn+1 +  (vn+1,1 αn+1 − 1)) λ−1
n+1 kvn+1 k + 2
(1 +  vn+1,1 ) λ−1
i αi kvi k2
i=1

≥ ((1 + vn+1,1 ) αn+1 − )2 λ−2


n+1 kvn+1 k
2

≥ 2 λ−2
n+1 .

(3.9.6)

Since || ≤ 1, then 1 + vn+1,1 ≥ 0. Together with (3.9.3), we can obtain

((1 + vn+1,1 ) αn+1 − )2 ≥ (1 + vn+1,1 )2 αn+1


2
+ 2 ≥ 2 .

Since P (t) is a symmetric positive definite matrix, we have

tr(P (t)−1 )−1 ≤ λmin (P (t)) ≤ n tr(P (t)−1 )−1 .

Given (3.9.1) and by the Sherman–Morrison formula, we have


 −1 −1
λn+1 ≤ (n + 1) tr(P (t)−1 )−1 ≤ (n + 1) tr P (t) + pce p>
ce < (n + 1)L(t + 1) .
(3.9.7)
By (3.9.6) and (3.9.7), we obtain
2
P (t)−1 p0 ≥ 2 (n + 1)−2 L(t + 1)−2 . (3.9.8)

Together with (3.9.5) and (3.9.8), we have


kP (t)−1 p0 k2 2 (n + 1)−2 L(t + 1)−2
≥ .
1 + kp0 k2P (t)−1 1 + L(n + 1)−1 maxp∈P kpk2
q
Consider the right hand side of the above inequality, we choose  = ±k L̇(t) and
let  −1
2 2 −2 −1 2
K = k (n + 1) 1 + L(n + 1) max kpk ≥ 1.
p∈P

75
The last inequality is obtained by (3.9.4). Then we obtain

kP (t)−1 p0 k2 K 2 L̇(t)
≥ .
1 + kp0 k2P (t)−1 L(t + 1)2

1
Due to the convexity of L(t)
, there exists K > 1, such that KL(t) ≥ L(t + 1), and

kP (t)−1 p0 k2 K 2 L̇(t) L̇(t)


2
≥ ≥ .
1 + kp0 kP (t)−1 L(t + 1)2 L(t)2

This gives the result.

∂r(p? ,β0 )
Proof of Proposition 3.4.2. Since p(β0 ) ∈ P and ∂pi
= 0, by the Taylor series
expansion, for all p ∈ P,

|r(p, β0 ) − r(p? , β0 )| ≤ k1 kp − p? k2 ,

∂ 2 r(p,β0 )
where k1 := supp∈P p2i
< ∞. By the implicit function theorem in Duistermaat
& Kolk (2004), V can be chosen such that the function β → p(β) is continuously
differentiable with bounded derivatives. Thus for all β ∈ V and some non-random
constant k2 > 0, by the Taylor expansion,

kp(β) − p(β0 ) k≤ k2 k β − β0 k .

Assume there exits β̂t ∈ V for all t, such that

p(β̂t ) − p(β0 ) ≤ k2 β̂t − β0 a.s.

Appendix B: Proofs of the results in Section 3.5

Proof of Lemma 3.5.3. By the definition that

log I + σ −2 (Kd + Kc ) ≤ log I + σ −2 Kd + log I + σ −2 Kc

≤ γT (κd ) + γT (κc ) ,

where Kd and Kc are the Gram matrices for κd (p, p0 ) and κc (p, p0 ), respectively.

76

Proof of Proposition 3.5.1. By the definition that pt = arg maxp∈P µrt−1 (p) + r
ϕt σt−1 (p),
we have
√ √
µrt−1 (pt ) + r
ϕt σt−1 (pt ) ≥ µrt−1 ([p∗ ]t ) + r
ϕt σt−1 ([p∗ ]t ) .

To bound the right hand side of the above inequation, we need Lemma 3.9.1 (stated
and proven below), which implies
√ p
µrt−1 ([p∗ ]t ) + r
ϕt σt−1 ([p∗ ]t ) ≥ r(p∗ ) − ub log(2a/δ)/t2 .

Therefore
√ p
µrt−1 (pt ) + r
ϕt σt−1 (pt ) ≥ r(p∗ ) − ub log(2a/δ)/t2 .

Together with (3.9.9), we obtain


p
√ ub
log(2a/δ)
r(p? ) − r(pt ) ≤ µrt−1 (pt ) + ϕt σt−1r
(pt ) + − r(pt )
p t2
√ r ub log(2a/δ)
≤ 2 ϕt σt−1 (pt ) + .
t2

Lemma 3.9.1 establishes a confidence bound on a general case of P ⊂ Rd com-


pact. To prove this Lemma 3.9.1, we apply Lemma 3.9.2 (stated and proven below),
which provides a confidence bound on a finite decision set |P| < ∞, where all
decisions are chosen. Let πt > 0 be a sequence such that t πt−1 = 1.
P

Lemma 3.9.1. Pick δ ∈ (0, 1) and set

ϕt = 2 log(2πt2 t2 /3δ) + 2 log t2 .




With probability greater than 1 − δ/2, for any p ∈ P and t ≥ 1, we have


p
r 1/2 r ub log(2a/δ)
r(p) − µt−1 ([p]t ) ≤ ϕt σt−1 ([p]t ) + .
t2
Proof. For all p ∈ P with probability greater than 1 − δ/2, we can write

r(p) − µrt−1 ([p]t ) ≤ |r(p) − r([p]t )| + r([p]t ) − µrt−1 ([p]t ) .

The first term in the above inequality is bounded as below. By assumption 3.5.1
2
and the union bound, we have P supp∈P |∂f /∂pj | > J ≤ ae−(J/b) . Therefore, there


exist constants a, b > 0, such that


 
∂r 2
P sup > J ≤ ae−(J/b) .
p∈P ∂pj

77
2
Then, with probability greater than 1 − ae−(J/b) , for all p ∈ P, we have

|r(p) − r(p0 )| ≤ J kp − p0 k .

Consider a sequence of democratization Pt of cardinality ζt that satisfies

u
kp − [p]t k ≤ ,
ζt
where constant u = ph − p` is the length of the price set and [p]t is the closest price
2
to p in Pt . Let δ/2 = ae−(J/b) and we have with probability greater than 1 − δ/2,
p
|r(p) − r([p]t )| ≤ b log(2a/δ) kp − [p]t k
p
≤ ub log(2a/δ)/ζt .

Choosing ζt = t2 yields
p
|r(p) − r([p]t )| ≤ ub log(2a/δ)/t2 .

The second term is obtained by substituting [p] for p in Lemma 3.9.2,

1/2
r([p]t ) − µrt−1 ([p]t ) ≤ ϕt σt−1
r
([p]t ) .

Then we obtain the result.

Lemma 3.9.2. Pick δ ∈ (0, 1) and set ϕt = 2 log(|P| πt /δ). Then with probability
greater than 1 − δ, for any p ∈ P and t ≥ 1, we have


r(p) − µrt−1 (p) ≤ r
ϕt σt−1 (pt ) , (3.9.9)

r d c
where σt−1 (p) = pt−1 σt−1 (p) + σt−1 (p).

Proof. Conditioned on yt−1 = (y1 , . . . , yt−1 ), {p1 , . . . , pt−1 } are deterministic and the
2
marginals follow f (p) ∼ N (µt−1 (p), σt−1 (p)) for any fixed t ≥ 1 and p ∈ P. By the
2 /2
following tail bound, we know P(z > c) ≤ (1/2)e−c for c > 0 if z ∼ N (0, 1). Let
1/2
z = (f (p) − µt−1 (p)) /σt−1 (p) and c = ϕt , then
 
|f (p) − µt−1 (p)| 1/2
P > ϕt ≤ e−ϕt /2 .
σt−1 (p)

With probability greater than 1 − |P| e−ϕt /2 , we have

1/2
|f (p) − µt−1 (p)| ≤ ϕt σt−1 (p) .

78
Given r(p) = p · fd (p) − fc (p), with probability greater than 1 − |P| e−ϕt /2 , we have

r(p) − µrt−1 (p) = (pfd (p) − fc (p)) − pµd − µc




≤ pfd (p) − pµd + |(fc (p) − µc )|




d 1/2 c 1/2
≤ pϕt σt−1 (p) + ϕt σt−1 (p) .

r
Let σt−1 d
(p) = pt−1 σt−1 c
(p) + σt−1 (p) and choose |P| e−ϕt /2 = δ/πt . By the union
bound on all t, we obtain the results.

Appendix C: Proof of the result in Section 3.6

Proof of Lemma 3.6.1. By the definition of τ̃s = s − 1 − N (ρ(s)),


T
X T
X T
X T
X
τ̃s = (s − 1 − N (ρ(s))) = (t − 1) − N (ρ(s))
s=1 s=1 t=1 s=1
XT T
X XT
= (t − 1) − N (t) = τt .
t=1 t=1 t=1

The third equality is obtained because {ρ(s) : s = 1, . . . , T } is a permutation of

i=1 1{i +
{1, . . . , T }. The fourth equality derives from the definition that N (t) = t−1
P

τi ≤ t − 1}.

79
Chapter 4

Perturbed pricing

4.1 Introduction

When a company sells a new product, the objective is to select prices that maximize
revenue. However, there is often little information about demand, so over time the
company must choose prices that maximize long-run revenue and efficiently estimate
the distribution of demand. Moreover, when the product is sold online, the demand,
and thus prices, depend on the context in which the items are sold, such as who is
viewing and what search criteria they used.
A widely-used pricing strategy is the certainty equivalent pricing rule or myopic
pricing (Broder & Rusmevichientong, 2012; Keskin & Zeevi, 2014). Here, the man-
ager statistically estimates the demand over the set of prices (and contexts) and then
selects the revenue optimal price for this estimated demand, i.e., prices are chosen
as if statistical estimates are the true underlying parameters. This approach is very
appealing as it cleanly separates the statistical objective of estimating demand from
the managerial objective of maximizing revenue. However, it is well-known in both
statistics and management science literature that a certainty equivalent rule may
not explore prices with sufficient variability. This leads to inconsistent parameter
estimation and thus sub-optimal revenue.
Overcoming this difficulty is an important research challenge for which numerous
different methods have been proposed. We review these shortly, but, to summarize,
there are two principal approaches: to modify the statistical objective, or to modify

80
the revenue objective. For instance, if the statistical objective is a maximum likeli-
hood objective, then likelihood function can be modified via regularization to induce
consistency. For the revenue objective, confidence intervals and taboo regions can
be used to restrict the set of available prices and thus ensure exploration. These
modifications involve additional bespoke calculations that couple the statistical and
revenue objectives.
In this article, we advocate a different approach: pricing according to a certainty
equivalent rule with a decreasingly small random perturbation. Estimation can then
be conducted according to a standard maximum likelihood estimation. We call this
perturbed certainty equivalent pricing, or perturbed pricing, for short. From a man-
agerial perspective, the key advantage is its simplicity—we perturb the data not
the optimization. The statistical objective and revenue objective remain unchanged
and can be treated separately. Thus parameters can be estimated using conven-
tional statistical tools, and prices can be chosen with standard revenue optimization
techniques. If the magnitude of the perturbation is chosen well, our results prove
that perturbed pricing performs comparably with the best pricing strategies.

4.1.1 Overview

We present a brief summary of our model, pricing strategy, results and contributions.
A formal description and mathematical results are given in Sections 4.2 and 4.3,
respectively.

Model. We let r(p, c ; β0 ) be the revenue for a set of products at prices p under
context c given parameters β0 . We assume that context c is i.i.d. bounded random
variables and the covariance matrix of c by Σc = E cc> > 0. The revenue objective
 

is to find the revenue optimal price for each context:

p? (c) ∈ arg max r(p, c ; β0 ) .


p

The parameters β0 are unknown; however, these parameters can be inferred. In


particular, we receive a demand response y to the price-context input x = (p, c) as
a generalized linear model:
y = µ β0> x + ε .


81
Here µ is an increasing function and ε has mean zero. The response y can be
interpreted as the number (or volume) of items sold given the prices and context.
Given data ((xs , ys ) : s = 1, ..., t), the statistical objective is a maximum likelihood
optimization:
t
X
β̂t ∈ arg max ys β > xs − m(β > xs ) , (4.1.1)
β s=1

where m0 (z) = µ(z).

Pricing. Given an estimate β̂, the certainty equivalent price is

pce (c ; β̂) ∈ arg max r(p, c ; β̂) . (4.1.2)


p

Notice p? (c) = pce (c ; β0 ). For the maximum likelihood estimate at time t − 1, β̂t−1 ,
and new context, ct , the perturbed certainty equivalent price is

pt = pce (ct ; β̂t−1 ) + αt ut , (4.1.3)

where ut is an independent, bounded, mean zero random variable and αt is a positive


real number. The random variables ut and αt can be selected in advance. For
instance, we may choose ut ∼ Uniform([−1, 1]d ) or ut ∼ Uniform({±ei : i = 1, ...d})
1
where ei is the i-th unit vector. We will advocate taking αt = t− 4 .

Results. The regret measures the performance of the perturbed pricing strategy
compared to the revenue optimal strategy:
T
X
Rg(T ) = r(p? (ct ), ct ; β0 ) − r(pt , ct ; β0 ) .
t=1

It is known that Rg(T ) = Ω( T ) for any asymptotically consistent pricing policy.
1
In Theorem 4.3.1 we prove that for αt = t− 4 the regret of the perturbed pricing
satisfies
√ 
Rg(T ) = O T log(T ) .

4.1.2 Contributions.

In Theorem 4.3.1 we show that the convergence of the perturbed certainty equivalent
pricing is optimal up to logarithmic factors. The speed of convergence is competitive
with the best existing schemes. However, as we have already discussed, the main

82
contribution is the simplicity of the policy. Current schemes often require additional
matrix inversions, eigenvector calculations, or introduce non-convex constraints to
the pricing optimization. Furthermore the scheme is flexible in leveraging contextual
information, which is an important best-practice in many online market places and
recommendation systems.
In forming perturbed certainty equivalent prices, the manager can estimate pa-
rameters as a standard maximum likelihood estimator (4.1.1), and can price accord-
ing the revenue maximization objective (4.1.2). The only change is to introduce a
small amount of randomization (4.1.3). This is appealing as the statistical optimiza-
tion and revenue optimization remain unchanged and the perturbation of prices is
simple, intuitive, and requires negligible computational overhead.
In addition to this managerial insight, there are a number of mathematical con-
tributions in this paper. The eigenvalue lower bound used in Proposition 4.3.2 is
new and also crucial to our analysis of design matrices. We build on the work of Lai
& Wei (1982) to clarify results on the rate of convergence and strong consistency
of generalized linear models in Proposition 4.3.1. Further, as we will review, much
of the current revenue maximization literature on pricing applies to revenue opti-
mization of a single item in a non-contextual setting. To this end, we include the
important generalizations of contextual information.

4.1.3 Related literature

In this section, we provide a brief review of revenue management with unknown


demand, contextual multi-armed bandits, feature-based pricing, and strong consis-
tency of generalized linear models.

Revenue Management with Unknown Demand. Optimizing revenue over


prices is a central theme in revenue management. Texts such as Phillips (2005) and
Talluri & van Ryzin (2005) provide an excellent overview. In classical works, the
demand for each price is known to the decision maker. However both in practice
and in more recent academic literature, the demand function must be estimated.
This is clearly articulated in the seminal work of Besbes & Zeevi (2009). Here the

83
classical revenue management problem of Gallego & van Ryzin (1994b) is recast as a
statistical learning problem and shortfall in revenue induced by statistical estimation
is analyzed. Subsequently there have been a variety of studies jointly applying
optimization and estimation techniques in revenue management, see den Boer (2015)
for a survey of this literature.

We consider a parametric statistical model. The distribution of demand is fixed


with unknown parameters. Here a popular policy is the certainty equivalent pricing
rule. It was first introduced by Anderson & Taylor (1976) for linear models in
econometrics. However, their greedy iterated least squares algorithm is known to be
sub-optimal; see Lai & Robbins (1982) for a counter-example and den Boer (2013)
for a counter-example in the context of revenue management. Subsequent works
have developed mechanisms to overcome this issue. Broder & Rusmevichientong
(2012) introduced a maximum-likelihood model with a single unknown parameter.
They achieved an upper bound with a O(log T ) regret in the “well-separated” case,
where all prices are informative. den Boer & Zwart (2014b) proposed a controlled
variance pricing policy, in which they created taboo intervals around the average of
previously chosen prices. They obtained an asymptotic upper-bound on T -period

regret of O T 1/2+δ for arbitrarily small δ > 0. Keskin & Zeevi (2014) considered
the expected demand as a linear function of price and provided general sufficient
conditions for greedy iterated least squares. In particular, their constrained iterated
least squares algorithm follows a similarly appealing rationale to our perturbed
certain equivalent price. However, the algorithm requires the notion of a time-
averaged price which may not be obtainable in a contextual setting. Both controlled
variance pricing policy and, under appropriate assumptions, constrained iterated
least squared, can ensure sufficient price dispersion for low regret outcomes. The
perturbed pricing policy considered in this paper was first proposed for the pricing
of a single product in Lobo & Boyd (2003). The policy is analyzed via simulation
(without supporting theoretical analysis) and they noted that randomization can
significantly improves result. This work substantially generalizes their setting and
is the first paper to provide a theoretical basis for their positive findings.

84
The revenue management literature discussed so far does not incorporate contex-
tual information on the customer, the query and the product. To this end, we first
discuss the contextual multi-arm bandit problem and then feature-based pricing.

Contextual Multi-armed Bandits. A multi-arm bandit problem is a broad


class of sequential decision making problems where an algorithm must jointly es-
timate and optimize rewards. For an overview on multi-armed bandit problems,
see Bubeck & Cesa-Bianchi (2012) and Lattimore & Szepesvári (2017). Our work is
related to literature on contextual multi-armed bandits. Here the algorithm receives
additional information which can be used to inform the algorithm’s decision. This
has been applied to many problems, such as clinical trials (Woodroofe, 1979) and
online personalized recommendation systems (Li et al., 2010).
Auer (2003) studied multi-armed bandits, where actions are selected from a set
of finite features and the expected reward is linear in these features. Following
from Auer (2003), algorithms based on confidence ellipsoids were described in Dani
et al. (2008), Rusmevichientong & Tsitsiklis (2010), Abbasi-Yadkori et al. (2011) and

Chu et al. (2011). Abbasi-Yadkori et al. (2011) proved an upper-bound of O(d T )
regret after T time periods with d-dimensional feature vectors. Generalizing previous
work on this linear stochastic bandit problem, Filippi et al. (2010) introduced a
generalized linear model with the upper confidence bound algorithm and achieved
√ √
O(d
e T ) regret. Li et al. (2017) improved the work of Filippi et al. (2010) by a d
factor.

Feature-based Pricing. Following from the above work on contextual bandits,


there is a growing literature on dynamic pricing with features (or covariates). The
feature information may help the decision maker to improve estimation and seg-
ment distinct market places. Amin et al. (2014) studied a linear model, where fea-
tures are stochastically drawn from an unknown i.i.d. distribution. They proposed
a pricing strategy based on stochastic gradient descent, which achieved sub-linear
e 2/3 ). Cohen et al. (2016) considered a problem similar to Amin et al.
regret O(T
(2014). They assumed that the feature vectors are adversarially selected and intro-
duced an ellipsoid-based algorithm, which obtained regret of O(d2 log(T /d)). Qiang

85
& Bayati (2016) assumed the demand follows a linear function of the prices and
covariates, and applied a myopic policy based on least-square estimations which
achieved a regret of O(log(T )). Javanmard & Nazerzadeh (2019) considered a
structured high-dimensional model with binary choices and proposed a regular-
ized maximum-likelihood policy which achieves regret O(s0 log(d) log(T )). These
prior models achieve a logarithmic rather than square root regret because demand
feedback is a deterministic function of some unknown parameter. See Kleinberg
& Leighton (2003) for an early discussion on this distinction and lower-bounds in
both settings. Ban & Keskin (2020) were the first to introduce random feature-

dependent price sensitivity and achieved the expected regret of O(s T log(T )) and

O(s T (log(d)+log(T ))) in linear and generalized linear demand models. Chen et al.
(2015) considered statistical learning and generalizations of feature-based pricing.

Maximum Quasi-likelihood Estimation for GLMs. We model demand with a


generalized linear model (GLM). McCullagh & Nelder (1989) is a classical reference.
GLMs are widely used for ratings systems and pricing decisions, see Ohlsson &
Johansson (2010). In the GLM framework, a commonly used technique to estimate
the parameters is maximum likelihood estimation. Wedderburn (1974) proposed
maximum quasi-likelihood estimation, as extension of likelihood estimations.
Following our discussion on the counter-example of Lai & Robbins (1982), the
consistency of parameter estimators is of great concern. To deal with this problem,
conditions are usually imposed, see Lai (2003) for a more detailed discussion. When
the regression model is linear, Lai & Wei (1982) proved the strong consistency
of estimation when the ratio of the minimum eigenvalue to the logarithm of the
maximum eigenvalue goes to infinity. For GLMs with canonical link functions, Chen
et al. (1999) derived similar strong consistency results under similar conditions to Lai
& Wei (1982). For GLMs with general link functions, Chang (1999) obtained strong
consistency via a last-time variable based on a sequence of martingale differences
and an additional assumption. den Boer & Zwart (2014a) reported on mathematical
errors in the works of Chang (1999) and Chen et al. (1999). Consequently some
alternative derivations have been developed by den Boer & Zwart (2014a) and Li
et al. (2017). Based on these, we develop our own strong consistency result in this

86
paper. Also, as discussed, bounds on the design matrix of our GLM are required for
convergence. Given the Schur Decomposition and Ostroswki’s Theorem (see Horn
& Johnson (2012)), we develop a new eigenvalue bound, Proposition 3, that when
combined with covariance estimation results from Vershynin (2018) enables us to
incorporate random contextual information.
The remainder of this chapter is structured as follows. In Section 4.2, we in-
troduce the model and formulate the problem. In Section 4.3 and specifically in
Theorem 4.3.1, we prove the main result of this chapter, the convergence of the
perturbed certainty equivalent pricing. We show the numerical results in Section
4.4. In Section 4.5, we present some concluding remarks and discussion of the work.
Proofs of additional results are deferred to appendices at the end of this chapter.

4.2 Problem formulation


We formally describe the revenue objective, the statistical objective, and the per-
turbed pricing policy along with assumptions required for our theoretical analysis.

4.2.1 Model and assumptions

Revenue Optimization. We choose prices p ∈ P. Here P is a bounded open convex


subset of Rm . When choosing a price, we receive a context c ∈ C. This summarizes
information about the item (or items) sold and the buyer (or buyers). Here, C is a
bounded subset of Rd−m , with d ≥ m. We let x := (p, c) ∈ X , where X := P × C ⊆
Rd .
We receive rewards for different prices chosen for different contexts over time.
Further, there are unknown parameters β0 ∈ B ⊂ Rd that influence these rewards.
In particular, we let r (p, c ; β0 ) be the expected real-valued reward for prices p
under context c given parameters β0 . Since x = (p, c), we also define r (x ; β0 ) :=
r (p, c ; β0 ). We assume that (x, β) 7→ r(x ; β) is a twice continuous differentiable
function. Given the context c ∈ C, an objective is to choose prices that maximize
reward:
p? (c) ∈ argmax r (p, c ; β0 ) . (4.2.1)
p∈P

87
The solution p? (c) is the optimal price for context c. Given β0 , we assume there is
a unique optimal price p? (c) for each c ∈ C. We place one of two assumptions on
the reward function and the set of contexts,

A1a). The set C is finite and the Hessian of p 7→ r(p, c ; β0 ) is positive definite at
p? (c) for each c ∈ C.

A1b). The set C is convex and p 7→ r(p, c ; β0 ) is α-strongly concave for some α > 0.

Learning Model. The parameter β0 is unknown. However, we may learn it through


statistical estimation. In particular, we receive responses that are a generalized
linear model with parameter β0 . Given x ∈ X , we receive a response y which is a
real-valued random variable such that

y = µ β0> x + ε .


Here ε is a bounded random variable with mean zero. Further, µ : R → R, which is


called the link function, is a strictly increasing, continuously differentiable, Lipschitz
function.
Given data (xs , ys ) for s = 1, . . . , t, we must estimate unknown parameters
β0 . We let β̂t denote our estimate of β0 . A popular method for estimating β0 is
maximum (quasi-)likelihood estimation. Here, maximum quasi-likelihood estimators
β̂t are the solutions to the equations
t
X   
xs ys − µ β̂t> xs = 0. (4.2.2)
s=1

When it exists, the solution to this equation is unique (since µ is strictly increasing).
When the distribution of ys given xs , for s = 1, ..., t, are independent and each
belongs to the family of exponential distributions with mean µ β̂0> xs , then (4.2.2)


is the condition on β for maximizing the log-likelihood. In this case, β̂t is the
maximum likelihood estimator. However in our case, we don’t assume (yt | xt ) is
from an exponential family, so instead, as in Wedderburn (1974), we refer to β̂t as
maximum quasi-likelihood estimators.

88
Typically β̂t can be found with standard software packages using Newton meth-
ods such as Iteratively Reweighted Least Squares. For better time complexity, in
the case of linear regression, the Sherman–Morrison formula can be applied to yield
an online algorithm with O(td2 ) complexity.
A sequence of estimators β̂, is said to be strongly consistent if, as t → ∞ with
probability 1,
β̂t → β0 .

For adaptive designs it is often possible to prove even stronger results, specifically
that with probability 1,
 
2 log(t)
β̂t − β0 =O . (4.2.3)
λmin (t)
Pt
where λmin (t) is the minimum eigenvalues of the design matrix s=1 xs x>
s and k·k

denotes the Euclidean norm.


Our main result, Theorem 4.3.1, holds for any generalized linear model such
that (4.2.3) holds. We prove, in Proposition 4.3.1, that (4.2.3) holds under the
assumption

A2.
0< min µ̇(β > x) .
x,β:kxk≤xmax ,
kβk≤βmax

Here xmax and βmax are the largest values of kxk and kβk for t ≥ 1.
The above assumption holds for linear regressions and also for any model where
the parameters β̂t remain bounded. We note that boundedness can be enforced
through projection, for instance Filippi et al. (2010) took this approach. For the
convergence rate (4.2.3), there are several alternative proofs on the rate of conver-
gence of adaptive GLMs designs. These are discussed in the literature review. Here,
any convergence result of the form (4.2.3) can be used in place of Assumption A2
and Proposition 4.3.1. We provide our own proof under Assumption A2 in order to
present a short self-contained treatment.

Time and Regret. Our goal to optimize revenue, (4.2.1), remains. However, re-
grettably, we will always fall short of this goal. This is because the parameters

89
β0 are unknown to us. Instead, we must simultaneously estimate β0 and choose
prices that converge on p? (c) for each c ∈ C. The variability in x = (p, c) required
to estimate β0 will inevitably be detrimental to convergence towards p? (c), while,
rapid convergence in prices may inhibit estimation and lead to convergence to sub-
optimal prices. Stated more generally, there is a trade-off between exploration and
exploitation which is well-known for bandit problems.
We let T ∈ N be the time horizon of our model. For each time t = 1, . . . , T , we
receive a vector of context ct ∈ C. Then for xt = (pt , ct ), we are given response

yt = µ β0> xt + εt ,

(4.2.4)

and receive reward rt (xt , yt ; β0 ) where

r(xt ; β0 ) = E[rt (xt , yt ; β0 ) | xt ] .

We let F = {Ft : t ∈ Z+ } be the filtration where Ft is the σ-field generated by


random variables (xs : s = 1, . . . , t). Notice that xt is Ft−1 measurable. We assume
that t defined in (4.2.4) is a martingale difference sequence w.r.t. the filtration Ft−1 .
That is E [t | Ft−1 ] = 0 and for some constant σ > 0 and γ > 2, almost surely,

sup E[2t | Ft−1 ] = σ 2 < ∞ , sup E[| t |γ | Ft−1 ] < ∞ .


t t

The regret at time T , which we denote by Rg(T ), is defined by

T
X
Rg(T ) := rt (x?t , yt? ; β0 ) − rt (xt , yt ; β0 ) .
t=1

where x?t := (p? (ct ), ct ) and yt? is the response under given x? . Recall that p? (ct )
is the optimal price for context ct given parameter β0 is known, see (4.2.1). The
regret is the expected revenue loss from applying prices pt rather than the optimal
price p? when β0 is known. Thus as β0 is unknown, we instead look to make the
regret as small as possible. As we discussed in the literature review, the best possible

bounds on the regret for this class of problem are of the order O( T ) (see Kleinberg
& Leighton (2003)), and any policy that achieves regret o(T ) can be considered to
have learned the optimal revenue.

90
4.2.2 Perturbed certainty equivalent pricing

The certainty equivalent price is the price that treats all estimated parameters as
if they were the true parameters of interest, and then chooses the optimal price
given those estimates. Specifically, for parameter estimates β̂ ∈ Rd , we define the
certainty equivalent price pce (c ; β̂) to be

pce (c ; β̂) ∈ argmax r(p, c ; β̂) .


p∈P

Notice this is exactly our optimization objective (4.2.1), with estimate β̂ in place of
the true parameter β0 .
For some control problems, the certainty equivalent rule can be optimal, e.g.
linear quadratic control, and the certainty equivalent rule is a widely used scheme
in such as model predictive control. However, in general, it will lead to inconsis-
tent learning and thus sub-optimal rewards and revenue (Lai & Robbins, 1982).
Nonetheless, many companies will opt to use a certainty equivalent rule as it cleanly
separates the problem of model learning from revenue optimization.
With this in mind, we propose a simple implementable variant that maintains
this separation. For parameter estimates β, we choose prices

p = pce (c ; β) + αu ,

where u is an independent, bounded, mean zero random variable in Rm and α is


a positive real number. We call this the perturbed certainty equivalent price. Here,
we simply add random noise to the certainty equivalent price in order to encourage
exploration. Over time, we maintain the update rule

pt = pce (ct ; β̂t−1 ) + αt ut , (4.2.5)

where ut are i.i.d. bounded, mean zero random vectors, β̂t−1 is our current maxi-
mum (quasi-)likelihood estimate (4.2.2). Moreover, αt is a deterministic decreasing
1
sequence. Shortly we will argue that taking αt = t− 4 is a good choice for achieving
statistical consistency while achieving a good regret bounds.
Pseudo-code for the perturbed pricing algorithm is given in Algorithm 4, below.
Here we split the procedure into 4 steps: context, where we receive the context

91
of the query and items to be sold; price, where we select the perturbed certainty
equivalent price; response, where we receive information about the items sold and
their revenue; and, estimate, where we update our MQLE parameter. As discussed
in the introduction, a key advantage of this scheme is its simplicity. Conventional
algorithms involve deterministic penalties or confidence ellipsoids for choices close
to the optimum. This in turn requires additional calculations such as matrix inver-
sions and eigenvalue decomposition which modify the task of maximizing revenue
and finding maximum likelihood estimators in a potentially non-trivial way. The
proposed approach is appealing that the proposed algorithm maintains the statisti-
cal maximum likelihood objective and the revenue objective and the randomization
added is a minor adjustment to the certainty equivalent rule.

Algorithm 4: Perturbed Pricing


Initialize: η > 0 and β̂0 ∈ B
for t=1,...,T do
Context:
Receive context ct .
Price:
Choose price

pt = pce (ct ; β̂t−1 ) + αt ut ,

where pce (c ; β̂) ∈ argmaxp∈P r(p, c ; β̂), αt = t−η , ut is i.i.d. mean zero
& covariance Σu  0.
Response:
For input xt = (pt , ct ),
receive response yt ,
receive reward rt (xt , yt ; β0 ).
Estimate:
Calculate the MQLE β̂t :
t
X   
xs ys − µ β̂t> xs = 0.
s=1

end

92
4.3 Main results
In this section, we present our main result, in Theorem 4.3.1, an upper bound on
the regret under our policy. Its proof is provided in Section 4.3.3.

Theorem 4.3.1. If αt = t−η for η ∈ [1/4, 1/2) then, with probability 1, the regret
over time horizon T is
T  !
p 1−2η
X log(t)
Rg(T ) = O T log T + T + .
t=1
t1−2η
1
Choosing η = 4
gives
√ 
Rg(T ) = O T log(T ) .

The order of the regret bound above is consistent with prior results such as den
Boer & Zwart (2014b) and Keskin & Zeevi (2014) which achieve a bound of the
same order.
We now describe the results that are required to prove Theorem 4.3.1, along with
some notation.

4.3.1 Additional notations

For a vector x ∈ Rd , we let kxk denote its Euclidean norm and kxk∞ denote its
supremum norm. Because we wish to consider the design matrix ts=1 xs x>
P
s , we

add some notation needed to re-express the effect of perturbation on the price.
Specifically, we re-express the perturbed certainty equivalent rule (price) in terms
of the full input vector x = (p, c) rather than just in terms of p. Specifically, we let

xt = x̂t + αt zt ,

where
   
x̂t = pce ct ; β̂t , ct ,

and
zt = (ut , 0) ∈ Rd .

Given our boundedness assumptions, we apply the notation pmax , cmax ∈ R+ where
kpt k∞ ≤ pmax and kct k∞ ≤ cmax for all t ∈ Z+ . Recall that ut are vectors of i.i.d.

93
bounded random variables, and therefore so are zt . We assume that kut k∞ ≤ umax
for all t ∈ Z+ where umax ∈ R+ , thus there exists zmax ∈ R+ such that kzt k∞ ≤ zmax
for all t ∈ Z+ . We denote the covariance matrix of zt by Σz = E zt zt> . Further-
 

more, we let λmax (t) and λmin (t) denote the maximum and minimum eigenvalues of
Pt >
s=1 xs xs .

4.3.2 Key additional results

To prove Theorem 4.3.1, we require additional results: Lemma 4.3.1, Proposition


4.3.1 and Proposition 4.3.2, which are stated below, and their proofs are given in
Appendix B.
Lemma 4.3.1 shows that the performance of the perturbed policy depends on
how it learns the unknown parameter β.

Lemma 4.3.1. For p? ∈ argmaxp∈P r (p, c ; β0 ), there exists K0 > 0 such that, for
all p ∈ P and c ∈ C

|r (p, c ; β0 ) − r (p? , c ; β0 )| ≤ K0 kp − p? k2 . (4.3.1)

If either Assumption A1a) or A1b) holds, then there exists K1 > 0 such that

sup kp? (c ; β) − p? (c)k ≤ K1 kβ − β0 k . (4.3.2)


c∈C

This lemma establishes the continuity of revenue as a function of parameters β.


The regret is shown to be the same order of the squared feature vectors kp − p∗ k2 ,
which is equal to that of the squared error of the estimated parameters kβ̂t − β0 k2 .
This result applies a combination of the Lipschitz continuity of r(x ; β0 ) and the
Implicit Function Theorem (Theorem 9.28 of Rudin (1976)).
Proposition 4.3.1 gives the strong consistency result for GLMs.

Proposition 4.3.1. Under Assumption A2, as t → ∞ with probability 1,


 
2 log(t)
β̂t − β0 = O .
λmin (t)

Along with Lemma 4.3.1, Proposition 4.3.1 establishes that the key quantity to
determining the regret is λmin (t). As discussed in Section 4.1.3, under Assumption

94
A2, there are a number of similar results which can be used in place of Proposition
4.3.1. The argument presented initially takes the approach of Chen et al. (1999)
and Li et al. (2017) and then applies Lemma 1iii) of Lai & Wei (1982).

We now construct a lower bound for the smallest eigenvalue λmin (t) as follows.

Proposition 4.3.2. If αs = s−η where s = 1, . . . , t, for η ∈ [0, 1/2) then

t
!
X
λmin (t) = Ω αs2 .
s=1

The above proposition applies a new bound on minimum eigenvalue of the design
matrix ts=1 xs x>
P
s , which is critical to our proof. This algebraic eigenvalue bound is

given in Proposition 4.6.1, and proof of which is in Appendix C. This new eigenvalue
bound is obtained based on decomposition according to the Schur Complement. A
related eigenvalue bounds is Ostrowski’s Theorem, see Horn & Johnson (2012).

4.3.3 Proof of Theorem 4.3.1

We now present a proof of Theorem 4.3.1. The main results required are Lemma
4.3.1, Proposition 4.3.1 and Proposition 4.3.2 as stated in the body of the paper.
We also require two more standard lemmas, Lemma 4.3.2 and Lemma 4.3.3, which
are stated and proved immediately after the proof of Theorem 4.3.1.

Proof of Theorem 4.3.1. Using Lemma 4.3.1 and Lemma 4.3.2, we can derive the

95
2
regret over selling horizon T in terms of β̂t − β0 , that is

T
X
Rg(T ) = rt (x?t , yt? ; β0 ) − rt (xt , yt ; β0 )
t=1
XT
= r (p? (ct ), ct ; β0 ) − r (pt , ct ; β0 )
t=1
XT T
X
+ r (pt , ct ; β0 ) − rt (xt , yt ; β0 ) + rt (x?t , yt? ; β0 ) − r (p?t (c?t ), ct ; β0 )
t=1 t=1
XT
≤ |r (pt , ct ; β0 ) − r (p? (ct ), ct ; β0 )|
t=1
p
+8 2rmax T log T
T
X
kpt − p? (ct )k2
p
≤ 8 2rmax T log T + K0
t=1
T T
p X 2 X 2
?
≤ 8 2rmax T log T + K0 2 pt − p (ct , β̂t ) + K0 2 p? (ct , β̂t ) − p? (ct )
t=1 t=1
T T
p X X 2
≤ 8 2rmax T log T + 2K0 u2max αt2 + 2K0 K1 β̂t − β0 . (4.3.3)
t=1 t=1

The first inequality above is immediate from Lemma 4.3.2. The second inequality
applies (4.3.1) proven in Lemma 4.3.1. The third inequality follows (a + b)2 ≤
2a2 + 2b2 for any a, b ∈ R. The last inequality follows by the definition of the
perturb price in (4.2.5) and also from Lemma 4.3.1.
Applying Proposition 4.3.1, we obtain, for some constant K2 ,

2 log(t)
β̂t − β0 ≤ K2 .
λmin (t)

By Proposition 4.3.2, we have, for some K3 and some the constant T0 > 0, for any
t ≥ T0 ,

t
X
λmin (t) ≥ K3 αs2 . (4.3.4)
s=1

Applying (4.3.4) to (4.3.3) gives that

T T
p 2
X
2 2K0 K1 K2 X log(t)
Rg(T ) ≤ 8 2rmax T log T + 2K0 umax αt + 2K0 K1 βmax T0 + Pt ,
t=1
K 3 t=1 s=1 α 2
s

96
Thus from the form of the above, we can see that there exist constants A0 , A1 and
A2 , such that
T T
p X X log(t)
Rg(T ) ≤ A0 T log T + A1 αt2 + A2 Pt 2
.
t=1 t=2 s=1 αs

Finally notice that we take αt = t−η for η ∈ [1/4, 1/2).


By a simple calculation shown in Lemma 4.3.3, we can derive a specific upper
bound on regret,
T
p T 1−2η − η X log(t)
Rg(T ) ≤ A0 T log T + A1 + A2 (1 − 2η)
1 − 2η t=2
t1−2η − 1
p 
=O T log T + T 1−2η + T 2η log(T ) .
1
Choosing η = 4
gives
√ 
Rg(T ) = O T log(T ) .

Thus, we obtain the upper-bound on the regret.

We now cover the additional lemmas required above. Lemma 4.3.2 applied above
is an application of the Azuma-Hoeffding Inequality.

Lemma 4.3.2. With probability 1, for any sequence xt , it eventually holds that
T
X p
rt (xt , yt ; β0 ) − r(xt ; β0 ) ≤ 4 2rmax T log(T )
t=1

Proof. Since r(xt ; β0 ) is the conditional distribution of rt (xt , yt ; β0 ), the summands


of
T
X
MT = rt (xt , yt ; β0 ) − r(xt ; β0 )
t=1

form a martingale difference sequence. Thus by the Azzuma–Hoeffding Inequality


p
2 t log t
(see (Williams, 1991, E14.2)) with zt = 4 2rmax
z2
 − 2t 2
P |Mt | ≥ zt ≤ 2e 8trmax = 4.
t
Thus
∞ ∞
X X 2
P(|Mt | ≥ zt ) ≤ < ∞.
t=1 t=1
t4

Thus by the Borel-Cantelli Lemma (see (Williams, 1991, S2.7)), with probability 1,
it holds that eventually |Mt | ≤ zt , which gives the result.

97
Lemma 4.3.3 is a standard integral test result.

Lemma 4.3.3. If we let αt = t−γ for γ > 0, we have


t
t1−γ − 1 X −γ t1−γ − 1
≤ s ≤1+ .
1−γ s=1
1−γ

Proof. This can be obtained by simple calculations. Since s−γ is decreasing for
γ > 0, then
t t
t1−γ − 1
X Z
−γ
s ≤1+ s−γ ds = 1 + ,
s=1 1 1−γ

and
t t
t1−γ − 1
X Z
−γ
s ≥ s−γ ds = .
s=1 1 1−γ

Thus, we obtain the result.

4.4 Numerical experiments


In this section, we examine the performance of the perturbed certainty equivalent
pricing developed in the cases of linear and logistic demand models. This short
numerical study confirms the behavior proved above. We consider moderate sized
problems with 17 explanatory variables. We note that problems of this size can be
achieved from large practical problems through dimension reduction. We do not
investigate such reductions here, instead we refer the reader to the seminal work
Li et al. (2010) where a recommendation system with several thousand features is
reduced to a problem of 36 explanatory variables and a Linear UCB model first
applied.
In our simulation experiments, we consider a company that sells a single product
and would like to make pricing decisions based on previous selling prices and 15
other contextual variables. At each time t, we select a selling price pt ∈ [pl , ph ],
and receive a context vector ct . We set the feasible price [pl , ph ] = [0.5, 5]. The
context vector ct are multivariate normal random variables with dimensions m = 15,
means equal to a vector of zeros and covariance matrix Σc = I15 , where I15 is a

98
15 × 15 identity matrix. Then, we obtain the feature vector xt = (pt , ct ) ∈ R17 ,
where pt = (1, pt ). We assume that the true values of unknown parameters of price
vector are (β0 , β1 ) = (1, −0.5)> ∈ R2 , and ct and true coefficients β2 , ..., β16 are
drawn i.i.d. from a multivariate Gaussian distribution N (0, I) ∈ R15 . We model
the revenue as the price times the demand responses, i.e. rt (xt , yt ) = pt yt and
r(x̂t ; β) = E[rt (xt , yt ) | xt ]. We simulate our policy for T = 2000 with αt = t−1/4
and measure the performance of the policy by regret.
We first consider that demand follows a linear function, i.e. a GLM with an
identity link function µ(x) = x. Figure 4.1a shows the value of kβ̂t − β0 k2 , which is
the squared norm of the difference between the parameter estimates β̂t and the true
parameters β0 . We can see that the difference converges to zero as t becomes large,
which demonstrates the strong consistency of parameter estimates. To show the

order of regret, we re-scale it to Rg(T )/ T log(T ) in Figure 4.1b. It shows that the

Rg(T )/ T log(T ) converges to a constant value as T becomes large, which verifies

that the regret has, at most, an order of T log(T ) under our policy. The constant
in this case is quite small close to 0.14. In practice, the response that seller receives
may be a zero-one response corresponding to the item being unsold or sold. Here
a logistic regression model is appropriate, with link function µ(x) = 1/(1 + e−x ).
Figures 4.2a shows that kβ̂t − β0 k2 converges to zero as t becomes sufficiently large

and Figure 4.2b shows that regret is of order T log(T ). Again the constant in the
regret is small, around 0.01, suggesting good dependence on the size of this problem.

99

(a) kβ̂t − β0 k2 (b) Rg(T )/ T log(T )

Figure 4.1: Convergence of parameter estimates and regret with linear demand function.
Time period T = 2000 and αt = t−1/4 . Problem parameters used are [pl , ph ] = [0.5, 2],
m = 15, true parameters of price vector [1, −0.5]> ∈ R2 , ct and its true coefficients drawn
i.i.d. from N (0, I) ∈ R15 .


(a) kβ̂t − β0 k2 (b) Rg(T )/ T log(T )

Figure 4.2: Convergence of parameter estimates and regret with logistic regression for
demand. Time period T = 2000 and αt = t−1/4 . Problem parameters used are [pl , ph ] =
[0.5, 2], m = 15, true parameters of price vector [1, −0.5]> ∈ R2 , ct and its true coefficients
drawn i.i.d. from N (0, I) ∈ R15 .

4.5 Discussion, conclusions and future directions

We considered a dynamic learning and contextual pricing problem with unknown


demand. This work acts to provide a theoretical foundation to positive empirical
findings in Lobo & Boyd (2003) and finds the correct magnitude of perturbation for
the best regret scaling. We allow for contextual information in the pricing decision.

We focus on a maximum likelihood estimation approach. However, for high di-


mensional problems, where dimension reduction is not possible,a stochastic gradient
descent approach must instead be used for parameter estimation. Thus one direction

100
of future research is to consider the setting where parameter estimates β̂t and prices
pt are updated online according to a Robbins–Monro rule. We focus on i.i.d. con-
texts, but it should be clear from the proofs that we may allow contexts to evolve in
a more general adaptive manner, so long variability in contextual information dom-
inates variability in prices. Also, we focus on i.i.d. perturbation, but other forms
of perturbation could be considered. For instance, Quasi-Monte-Carlo methods can
reduce variance while more systematically exploring the space perturbations. We
allow the size of perturbations to decrease uniformly over each context. However, in
practice one might want to let this decrease depending on the number of times that
a context has occurred. Thus the asynchronous implementation of the approach is
an important consideration. We consider a single retailer selling one item over mul-
tiple contexts. However, adversarial competition is often important. For instance
the interplay between large sets of contexts and a variety of sellers occurs in online
advertising auctions. Here an understanding of regret in both the stochastic and
adversarial setting would be an important research direction.
Finally to summarize, certainty equivalent pricing is commonly applied in man-
agement science. Random perturbation around the certainty equivalent price is sim-
ple and when combined with standard statistical parameter estimation it achieves
revenue comparable with more technically sophisticated learning algorithms.

101
4.6 Appendices

In this section, we prove the supporting results Lemma 4.3.1, Proposition 4.3.1 and
Proposition 4.3.2. Lemma 4.3.1 is an application of the Implicit Function Theorem.
Proposition 1 develops of the results of Lai & Wei (1982) to the case of GLMs
and follows similar lines to Chen et al. (1999) and Li et al. (2017). The proof
of Proposition 4.3.2 is a little more involved and requires an additional eigenvalue
bound Proposition 4.6.1, which is proven using a Schur Decomposition. We can
then use Proposition 4.6.1 along with random matrix theory bounds from Vershynin
(2018) to prove Proposition 2.

Additional Notation. For a positive definite matrix A ∈ Rd×d , we have kxkA =


√ Pt >
x> Ax. We denote by Vt = s=1 xs xs . For ρ > 0 we denote the closed ball

around β0 and its boundary by

Bρ (β0 ) := {β : kβ − β0 k ≤ ρ} and ∂Bρ (β0 ) = {β : kβ − β0 k = ρ} .

Let a ∧ b = min{a, b} and a ∨ b = max{a, b} for a, b ∈ R. For a d × d symmetric


matrix M , we define the operator norm (or spectral norm) by

kM xk
kM kop = max = max kM xk = max x> M y ,
x∈Rd kxk x∈S d−1 x,y∈S d−1
x6=0

where S d−1 is the unit sphere in Rd . We recall by the spectral theorem, a symmetric
matrix has real valued eigenvalues and that its eigenvectors can be chosen as an
orthogonal basis of Rd . For a square matrix M , we let λmin (M ) and λmax (M )
denote the minimum and maximum eigenvalues of M . We use notations M  0 to
say that M is positive definite and M  0 to say that M is positive semi-definite.
If M is a positive semi-definite matrix then

λmax (M ) = kM kop , λmin (M ) = min w> M w ,


w:kwk=1

102
and thus

λmin (a1 M1 + a2 M2 ) ≥ a1 λmin (M1 ) + a2 λmin (M2 ) ,

λmax (a1 M1 + a2 M2 ) ≤ a1 λmax (M1 ) + a2 λmax (M2 ) , (4.6.1)


 
M1 0 
λmin   ≥ λmin (M1 ) ∧ λmin (M2 ) .
 
 
0 M2

for a1 , a2 ∈ R+ and M1 , M2 positive definite matrices.

Appendix A: Proof of Lemma 4.3.1.

We now prove Lemma 4.3.1 which requires the Implicit Function Theorem (Theorem
9.28 of Rudin (1976)).

Proof of Lemma 4.3.1. We first prove (4.3.1). Since ∇p r (p? (c), c ; β0 ) = 0. A Tay-
lor expansion w.r.t. p gives

1
r (p, c ; β0 ) − r (p? , c ; β0 ) ≤ (p − p? )> ∇2 r (p̃, c ; β0 ) (p − p? )> ,
2

for p̃ on the line segment between p and p∗ . Thus

1 2
|r (p, c ; β0 ) − r (p? , c ; β0 )| ≤ max max ∇2 r (p̃, c ; β0 ) op
p − p? .
2 c∈C p̃∈P

Define
K0 := max max ∇2 r (p̃, c ; β0 ) op
.
c∈C p̃∈P

Notice K0 < ∞, since r (p, c ; β0 ) is twice continuously differentiable and the sets C
and P are compact. The result in (4.3.1) is proved.
We now apply the Implicit Function Theorem to bound kp − p? k2 . We first
consider the case under Assumption A1a), and then under Assumption A1b).
Under Assumption A1a), the Implicit Function Theorem implies that for each
c ∈ C, there exists a neighborhood Vc ⊆ Rd such that p? (c ; β) is uniquely defined
and is continuously differentiable in B. Thus taking V = ∩c∈C Vc and applying the
Taylor expansion, for all β ∈ V give

kp? (c ; β) − p? (c)k ≤ k1 kβ − β0 k (4.6.2)

103
where k1 := supc∈C supβ∈V |∇β p? (c ; β)|. The constant k1 is finite since ∇β p? (c ; β)
is continuous, C is finite and V can be chosen to be contained in a compact set.
Under Assumption A1b), we know by strict concavity that for each B and C,
p? (c ; β) is unique. Further by Assumption A1b), ∇2p r (p, c ; β0 ) is invertible. So
the Implicit Function Theorem applies to ∇p r (p, c ; β0 ). For all c ∈ C, β ∈ B, again
applying the Taylor expansion gives

kp? (c ; β) − p? (c)k ≤ sup sup k∇β p? (c ; β)kop kβ − β0 k .


c∈C β∈B

Also by the Implicit Function Theorem, p? = p? (c ; β) satisfies,

−1
∇β p? = − ∇2p r (p? , c ; β) ∇p,β r (p? , c ; β) .


Thus, by the definition of the operator norm

k∇β p? k ≤ ∇2p r (p? , c ; β)−1 op


k∇p,β r (p? , c ; β)kop

≤ α−1 sup k∇p,β r (p? , c ; β)kop ,


p,c,β

where we use the fact that r (p, c ; β) is twice continuously differentiable and

min ∇2p r (p? , c ; β) op


≥ α,
p,c,β

since r(p, c ; β) is α-strongly concave. Further k2 := supp,c,β k∇p,β r (p? , c ; β)kop <
∞ because r is continuously differentiable and because of the compactness of P, C
and B. Thus

kp? (c ; β) − p? (c)k ≤ K1 kβ − β0 k (4.6.3)

for K1 = k2 /α.
We have both (4.6.2) and (4.6.3) holding. In other-words, under both Assump-
tions A1a) and A1b), we have that

sup kp? (c ; β) − p? (c)k ≤ K1 kβ − β0 k ,


c∈C

for all β ∈ V in a neighborhood of β.

104
Appendix B: Proof of Proposition 4.3.1

Proposition 4.3.1 develops of the results of Lai & Wei (1982) to the case of GLMs.
The proof follows similar lines to Chen et al. (1999) and Li et al. (2017) and then
applies Lemma 1iii) of Lai & Wei (1982). We note there are some reported errors
in the proofs of Chen et al. (1999). So like Li et al. (2017), we must take some care
to work around these issues and make sure the proof method is applicable in our
setting.

Pt
Proof of Proposition 4.3.1. We define Zt = s=1 s xs , where s = ys − µ(β0> xs ).
Our proof proceeds by bounding ||Zt ||Vt−1 above and below.
The bound in Proposition 4.3.1 is trivial if λmin (t) = 0. Thus we assume that
the increasing function λmin (t) is positive at t. We define
t
X
xs µ β > xs − µ β0> xs .
 
Gt (β) :=
s=1

Clearly, Gt (β0 ) = 0. Further since β̂t is defined to solve


t
X t
X
µ(β̂t xs )xs = ys xs .
s=1 s=1

Thus holds that Gt (β̂t ) = Zt because


t
X   t
X
β̂t> xs xs µ β0> xs

Gt (β̂t ) = xs µ −
s=1 s=1
t
X t
X t
X
β0> xs

= ys xs − xs µ = xs s = Zt .
s=1 s=1 s=1

The function µ(·) is continuously differentiable and strictly increasing. Thus by the
Mean Value Theorem, there exists β̃ on the line segment between β̂t and β0 (β̃
must depend on xs ) such that
t   
X 
Gt (β̂t ) − Gt (β0 ) = xs µ β̂t> xs − µ β0> xs
s=1
Xt      
= µ̇ β̃ > xs xs x>
s β̂ t − β 0 = ∇Gt (β̃) β̂ t − β 0 . (4.6.4)
s=1

where ∇Gt (β) is the derivative of Gt .

105
Given Assumption A2, we define

κ= min µ̇(β > x) .


x,β:kxk≤xmax ,
kβ−β0 k≤βmax

We know that, by assumption, κ > 0. Thus from (4.6.4), we know that

∇Gt (β̃)  κVt  κλmin (t)I ,

and thus Vt−1  κ∇Gt (β̃)−1 . This implies that

kGt (β̂t )k2V −1 = kGt (β̂t ) − Gt (β0 )k2V −1


t t
 >  
= β̂t − β0 ∇Gt (β̃)Vt−1 ∇Gt (β̃) β̂t − β0
 >  
≥ κ β̂t − β0 ∇Gt (β̃) β̂t − β0

≥ κ2 λmin (Vt )kβ̂t − β0 k2 .

Since Zt = Gt (β̂t ),
2
kZt k2Vt−1 ≥ κ2 λmin (Vt ) β̂t − β0 .

To obtain the upper bound on kZt k2Vt−1 , we use Lemma 4.6.1 (stated below), that
almost surely

kZt k2Vt−1 = O (log λmax (t)) .

Combining the two bounds above yields


 
log λmax (t)
||Zt ||2V −1 =O .
t λmin (t)
Finally with the observation that λmax (t) ≤ tx2max (which follows from (4.6.1)), we
then obtain the result.

The following lemma is a restatement of Lemma 1iii) in Lai & Wei (1982). For
its proof, we refer to Lai & Wei (1982).

Lemma 4.6.1 (Lai & Wei (1982)). For

Qt = Zt> Vt−1 Zt ,

on the event {limt→∞ λmax (t) = ∞} it holds that almost surely

Qt = O (log λmax (t)) .

106
Appendix C: Proof of Proposition 4.3.2

To prove Proposition 4.3.2, we derive a new eigenvalue bound that is critical to our
proof. This algebraic eigenvalue bound is given in Proposition 4.6.1.
 
 A B
Proposition 4.6.1. For any symmetric matrix of the form M =  , we
 
 
B> C
have
λmin (C)2  −1 >

λmin (M ) ≥ λ min A − BC B ∧ λmin (C) .
(kBkop + λmin (C))2 + λmin (C)2
To prove Proposition 4.3.2, we apply Proposition 4.6.1 along with a random
matrix bound from Vershynin (2018). We state and prove this result (as Lemma
4.6.3) after the proof of Proposition 4.3.2. Also we require Lemma 4.6.2 which is a
standard eigenvalue bound stated and the concentration bound Lemma 4.6.3 after
the proof of Proposition 4.3.2. Now, we give the proof of Proposition 4.3.2.
 
Proof of Proposition 4.3.2. Applying the shorthand ps = pce cs , β̂s . We expand
the design matrix as follows
t
X t
X
xs x>
s = (ps + αs zs , cs ) (ps + αs zs , cs )>
s=1 s=1
Xt
= (ps , cs ) (ps , cs )> + αs2 zs zs> + αs zs (ps , cs )> + αs (ps , cs ) zs> .
s=1

By Lemma 4.6.2, we have


t
! t
!
X X
λmin xs x>
s ≥ λmin (ps , cs ) (ps , cs )> + αs2 zs zs>
s=1 s=1
t
X
− αs zs (ps , cs )> + αs (ps , cs ) zs> . (4.6.5)
s=1 op

By Proposition 4.6.1 we can bound the first term in (4.6.5)


t
!
X > 2 >
λmin (ps , cs ) (ps , cs ) + αs zs zs
s=1
> 2
Pt 
λmin s=1 cs cs
≥ P 2 2
t >
Pt >
Pt >
p c
s=1 s s op + λmin c c
s=1 s s + λmin s=1 cs cs
( t
! t
!)
X X
> 2 > >
× λmin ps ps + αs zs zs ∧ λmin cs cs . (4.6.6)
s=1 s=1

107
We now provide lower-bounds on the various terms given above. First, since sets P
and C are bounded, we have

t
X
ps c >
s ≤ t pmax cmax . (4.6.7)
s=1 op

Second, by Lemma 4.6.2 and Lemma 4.6.3, we have

t
! t
X X
cs c> c
cs c> c

λmin s ≥ tλmin (Σ ) − s −Σ , (4.6.8)
s=1 s=1 op

where

t
r  2
X
cs c> c

s −Σ ≤ 16t log(t) c2max + kΣc kop ,
s=1 op

and also

t
! t
!
X X
λmin c s c>
s ≤ λmax cs c>
s ≤ tc2max . (4.6.9)
s=1 s=1

Similarly, for the last term, by Lemmas 4.6.2 and 4.6.3,

t
!
X
ps p> 2 >
≥ λmin αs2 zs zs>

λmin s + αs zs zs
s=1
t
X t
X
αs2 λmin (Σz ) αs2 zs zs> − Σz

≥ −
s=1 s=1 op
v
t
X
u
u  t
2 X
2 z
≥ α λmin (Σ ) − t16 log(t) z 2
s
z
max + kΣ kop αs4 .
s=1 s=1

(4.6.10)

Pt
The first inequality is obtained since s=1 ps p>
s is positive definite matrix and

t
!
X
λmin ps p>
s > 0.
s=1

108
Applying (4.6.7), (4.6.8), (4.6.9) and (4.6.10) to (4.6.6) gives
\[
\lambda_{\min}\left(\sum_{s=1}^t (p_s, c_s)(p_s, c_s)^\top + \alpha_s^2 z_s z_s^\top\right)
\geq \frac{\left(\lambda_{\min}(\Sigma^c) - \sqrt{\log(t)/t}\right)^2}{\left(p_{\max} c_{\max} + c_{\max}^2\right)^2 + c_{\max}^2}
\times \left[\left\{\lambda_{\min}(\Sigma^z) \sum_{s=1}^t \alpha_s^2 - \sqrt{16 \log(t)\left(z_{\max}^2 + \|\Sigma^z\|_{op}\right)^2 \sum_{s=1}^t \alpha_s^4}\right\}
\wedge \left\{t \lambda_{\min}(\Sigma^c) - \sqrt{16\, t \log(t)\left(c_{\max}^2 + \|\Sigma^c\|_{op}\right)^2}\right\}\right] .
\]
We can simplify this as follows. There exists a constant time $T_0$, depending on $\Sigma^c$ and $\Sigma^z$, such that
\[
\left\{\lambda_{\min}(\Sigma^z) \sum_{s=1}^t \alpha_s^2 - \sqrt{16 \log(t)\left(z_{\max}^2 + \|\Sigma^z\|_{op}\right)^2 \sum_{s=1}^t \alpha_s^4}\right\}
\wedge \left\{t \lambda_{\min}(\Sigma^c) - \sqrt{16\, t \log(t)\left(c_{\max}^2 + \|\Sigma^c\|_{op}\right)^2}\right\}
\geq \frac{\lambda_{\min}(\Sigma^z)}{2} \sum_{s=1}^t \alpha_s^2 ,
\]
for all $t \geq T_0$.
Now consider the second term in (4.6.5). By the triangle inequality, we have
\[
\left\|\sum_{s=1}^t \alpha_s z_s (p_s, c_s)^\top + \alpha_s (p_s, c_s) z_s^\top\right\|_{op}
\leq \left\|\sum_{s=1}^t \alpha_s z_s (p_s, c_s)^\top\right\|_{op} + \left\|\sum_{s=1}^t \alpha_s (p_s, c_s) z_s^\top\right\|_{op}
\leq 2 \left\|\sum_{s=1}^t \alpha_s z_s (p_s, c_s)^\top\right\|_{op} .
\]
Using Lemma 4.6.4, eventually it holds that
\[
\left\|\sum_{s=1}^t \alpha_s z_s (p_s, c_s)^\top\right\|_{op} \leq \sqrt{16\, (z_{\max} x_{\max})^2 \log(t) \sum_{s=1}^t \alpha_s^2} .
\]
Combining the above two inequalities, we obtain that for sufficiently large $t$,
\[
\lambda_{\min}\left(\sum_{s=1}^t x_s x_s^\top\right) \geq \frac{\lambda_{\min}(\Sigma^z)}{2} \sum_{s=1}^t \alpha_s^2 - \sqrt{16 \log(t)\, (z_{\max} x_{\max})^2 \sum_{s=1}^t \alpha_s^2} .
\]
Notice that, for $\alpha_t \propto 1/t^{\eta}$ with $\eta < 1/2$, the term $\sqrt{\log(t) \sum_{s=1}^t \alpha_s^2}$ above is dominated by $\sum_{s=1}^t \alpha_s^2$, and thus we obtain the required result.

The following is a well-known eigenvalue bound.

Lemma 4.6.2. Let A, B be symmetric positive definite matrices of size d×d. Then,

λmin (A) ≥ λmin (B) − ||A − B||op .

Proof. Recall that
\[
\lambda_{\min}(A) = \min_{x : \|x\| = 1} x^\top A x .
\]
We can write
\[
x^\top A x = x^\top (A - B) x + x^\top B x .
\]
For the first term, by the Cauchy–Schwarz inequality we have
\[
- x^\top (A - B) x \leq |\langle x, (A - B) x\rangle| \leq \|x\|\, \|(A - B) x\| \leq \|A - B\|_{op} .
\]
For the second term, we know that
\[
x^\top B x \geq \lambda_{\min}(B) .
\]
Then we have
\[
x^\top A x = x^\top (A - B) x + x^\top B x \geq -\|A - B\|_{op} + \lambda_{\min}(B) .
\]
Therefore,
\[
\lambda_{\min}(A) \geq \lambda_{\min}(B) - \|A - B\|_{op} ,
\]
and we obtain the result.

The following result gives a concentration bound on covariance matrices. The proof can be found in Vershynin (2018) (see also Vershynin (2012)).

Lemma 4.6.3 (Vershynin (2018)). Let $x_s \in \mathbb{R}^d$, $s \geq 1$, be such that
\[
E[x_s \mid \mathcal{F}_{s-1}] = 0 \quad \text{and} \quad E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big] = \Sigma^x_s .
\]
Assume that there exists $x_{\max} \in \mathbb{R}_+$ such that $\|x_s\|_\infty \leq x_{\max}$ with probability 1, and that $\Sigma^x_s$ is positive definite. Then,
\[
P\left(\left\|\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right\|_{op} \geq \varepsilon\right)
\leq 2 \cdot 9^{2d} \exp\left\{-\frac{(\varepsilon/2)^2}{2 \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2}\right\} .
\]
Proof. We show this argument in two steps: we first control $\sum_{s=1}^t x_s x_s^\top$ over an $\varepsilon$-net, and then extend the bound from the net to the full operator norm by a continuity argument.

Using Lemma 4.6.5 (stated below) and choosing $\varepsilon = \frac14$, we can find an $\varepsilon$-net $N$ of the unit sphere $S^{d-1}$ with cardinality
\[
|N| \leq 9^d .
\]
By Lemma 4.6.6 (stated below), the operator norm of the centred sum can be bounded on $N$, that is,
\[
\left\|\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right\|_{op}
\leq 2 \max_{v, w \in N} \left|\left\langle \left(\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right) v,\, w \right\rangle\right|
\leq 2 \max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right) w \right| . \tag{4.6.11}
\]

We first fix $v, w \in N$. Since $\|x_s\|_\infty \leq x_{\max}$, we have
\[
\left| v^\top \left(x_s x_s^\top - E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big]\right) w \right| \leq x_{\max}^2 + \|\Sigma^x_s\|_{op} ,
\]
so the summands form a bounded martingale difference sequence. By the Azuma–Hoeffding inequality,\footnote{We note that Vershynin applies a Hoeffding bound. This is the only substantive difference in the proof here.} for any $\varepsilon > 0$ we can state that
\[
P\left(\left| v^\top \left(\sum_{s=1}^t \left(x_s x_s^\top - E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big]\right)\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq 2 \exp\left\{-\frac{(\varepsilon/2)^2}{2 \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2}\right\} .
\]
Next, we unfix $v, w \in N$ using a union bound. Since $N$ has cardinality bounded by $9^d$, we obtain
\[
P\left(\max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \left(x_s x_s^\top - E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big]\right)\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq \sum_{v, w \in N} P\left(\left| v^\top \left(\sum_{s=1}^t \left(x_s x_s^\top - E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big]\right)\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq |N|^2 \cdot 2 \exp\left\{-\frac{(\varepsilon/2)^2}{2 \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2}\right\}
\leq 9^{2d} \cdot 2 \exp\left\{-\frac{\varepsilon^2}{8 \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2}\right\} .
\]
Together with (4.6.11), we have
\[
P\left(\left\|\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right\|_{op} \geq \varepsilon\right)
\leq P\left(2 \max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \left(x_s x_s^\top - E\big[x_s x_s^\top \mid \mathcal{F}_{s-1}\big]\right)\right) w \right| \geq \varepsilon\right)
\leq 2 \cdot 9^{2d} \exp\left\{-\frac{\varepsilon^2}{8 \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2}\right\} .
\]
Thus, we obtain the result.

A straightforward consequence of this result is the following.

Corollary 4.6.1. With probability 1, eventually in $t$ it holds that
\[
\left\|\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right\|_{op} \leq \sqrt{16 \log(t) \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2} .
\]
Proof of Corollary 4.6.1. Notice that if we set
\[
\varepsilon_t = \sqrt{16 \log(t) \sum_{s=1}^t \left(x_{\max}^2 + \|\Sigma^x_s\|_{op}\right)^2} ,
\]
then
\[
P\left(\left\|\sum_{s=1}^t \left(x_s x_s^\top - \Sigma^x_s\right)\right\|_{op} \geq \varepsilon_t\right) \leq \frac{2 \cdot 9^{2d}}{t^2} .
\]
By the Borel–Cantelli Lemma, the result holds.

Lemma 4.6.4. For any $\varepsilon > 0$,
\[
P\left(\left\|\sum_{s=1}^t \alpha_s z_s x_s^\top\right\|_{op} \geq \varepsilon\right)
\leq 2 \cdot 9^{2d} \exp\left\{-\frac{(\varepsilon/2)^2}{2 (z_{\max} x_{\max})^2 \sum_{s=1}^t \alpha_s^2}\right\} ,
\]
and thus, with probability 1, eventually it holds that
\[
\left\|\sum_{s=1}^t \alpha_s z_s x_s^\top\right\|_{op} \leq \sqrt{16\, z_{\max}^2 x_{\max}^2 \log(t) \sum_{s=1}^t \alpha_s^2} . \tag{4.6.12}
\]

Proof. Similar to the proof of Lemma 4.6.3, we show this argument in two steps: we first control $\sum_{s=1}^t \alpha_s z_s x_s^\top$ over an $\varepsilon$-net, and then extend the bound from the net to the full operator norm by a continuity argument. Notice that the summands of $\sum_{s=1}^t \alpha_s z_s x_s^\top$ form a bounded martingale difference sequence.

Using Lemma 4.6.5 (stated below) and choosing $\varepsilon = \frac14$, we can find an $\varepsilon$-net $N$ of the unit sphere $S^{d-1}$ with cardinality
\[
|N| \leq 9^d .
\]
By Lemma 4.6.6 (stated below), the operator norm can be bounded by terms on $N$, that is,
\[
\left\|\sum_{s=1}^t \alpha_s z_s x_s^\top\right\|_{op}
\leq 2 \max_{v, w \in N} \left|\left\langle \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) v,\, w \right\rangle\right|
\leq 2 \max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) w \right| . \tag{4.6.13}
\]

We first fix $v, w \in N$, and by the Azuma–Hoeffding inequality, for any $\varepsilon > 0$ we can state that
\[
P\left(\left| v^\top \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq 2 \exp\left\{-\frac{(\varepsilon/2)^2}{2 \sum_{s=1}^t (\alpha_s z_{\max} x_{\max})^2}\right\} .
\]
Next, we unfix $v, w \in N$ using a union bound. Since $N$ has cardinality bounded by $9^d$, we obtain
\[
P\left(\max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq \sum_{v, w \in N} P\left(\left| v^\top \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) w \right| \geq \frac{\varepsilon}{2}\right)
\leq |N|^2 \cdot 2 \exp\left\{-\frac{(\varepsilon/2)^2}{2 \sum_{s=1}^t (\alpha_s z_{\max} x_{\max})^2}\right\}
\leq 9^{2d} \cdot 2 \exp\left\{-\frac{\varepsilon^2}{8 \sum_{s=1}^t (\alpha_s z_{\max} x_{\max})^2}\right\} .
\]

Together with (4.6.13), we have
\[
P\left(\left\|\sum_{s=1}^t \alpha_s z_s x_s^\top\right\|_{op} \geq \varepsilon\right)
\leq P\left(2 \max_{v, w \in N} \left| v^\top \left(\sum_{s=1}^t \alpha_s z_s x_s^\top\right) w \right| \geq \varepsilon\right)
\leq 2 \cdot 9^{2d} \exp\left\{-\frac{\varepsilon^2}{8 (z_{\max} x_{\max})^2 \sum_{s=1}^t \alpha_s^2}\right\} .
\]
Thus, we obtain the result. For (4.6.12), the argument follows in an identical manner to Corollary 4.6.1.

Lemmas 4.6.5 and 4.6.6 are stated below. For their proofs, we refer to Section 4 in Vershynin (2018).

Lemma 4.6.5 (Covering Numbers of the Euclidean Ball). The covering numbers of the unit Euclidean ball are such that, for any $\varepsilon > 0$,
\[
\left(\frac{1}{\varepsilon}\right)^d \leq N \leq \left(\frac{2}{\varepsilon} + 1\right)^d .
\]
The same upper bound is true for the unit Euclidean sphere $S^{d-1}$.

Lemma 4.6.6 (Quadratic Form on a Net). Let $A$ be an $m \times n$ matrix and $\varepsilon \in [0, 1/2)$. For any $\varepsilon$-net $N$ of the sphere $S^{n-1}$ and any $\varepsilon$-net $M$ of the sphere $S^{m-1}$, we have
\[
\sup_{x \in N,\, y \in M} \langle Ax, y\rangle \leq \|A\|_{op} \leq \frac{1}{1 - 2\varepsilon} \sup_{x \in N,\, y \in M} \langle Ax, y\rangle . \tag{4.6.14}
\]
Moreover, if $m = n$ and $A$ is symmetric, then
\[
\sup_{x \in N} |\langle Ax, x\rangle| \leq \|A\|_{op} \leq \frac{1}{1 - 2\varepsilon} \sup_{x \in N} |\langle Ax, x\rangle| .
\]

Appendix D: Proof of Proposition 4.6.1.

Proposition 4.6.1 gives a new eigenvalue bound. It essentially allows us to isolate


the randomness caused by contextual information. We can then subsequently apply
random matrix concentration bounds to this. Proposition 4.6.1 requires two lemmas,
Lemma 4.6.7 which is a standard result on the Schur Complement, and Lemma
4.6.8 which is a straightforward calculus argument. These are stated and proven
immediately after the proof of Proposition 4.6.1.

Proof of Proposition 4.6.1. We want to lower-bound the minimum eigenvalue of M ,


where M is a positive semi-definite matrix, and thus have a bound for the “only if”
direction of Lemma 4.6.7. We show that a positive semi-definite matrix containing a
positive definite sub-matrix can be made positive definite by adding an appropriate
sub-matrix.

Applying Lemma 4.6.7 gives
\[
\lambda_{\min}(M) = \min_{w : \|w\| = 1} w^\top M w
= \min_{w : \|w\| = 1} w^\top \begin{pmatrix} I & BC^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} I & BC^{-1} \\ 0 & I \end{pmatrix}^\top w
= \min_{w : \|w\| = 1} \big(w_1,\, w_2 + C^{-1}B^\top w_1\big) \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 + C^{-1}B^\top w_1 \end{pmatrix} , \tag{4.6.15}
\]
where we write $w = (w_1, w_2)$ and use that $\big(\begin{smallmatrix} I & BC^{-1} \\ 0 & I \end{smallmatrix}\big)^\top w = \big(w_1,\, w_2 + C^{-1}B^\top w_1\big)$.

Given the vector applied above, we now lower bound the norm of the vector $\big(w_1,\, w_2 + C^{-1}B^\top w_1\big)$:
\[
\left\|\big(w_1,\, w_2 + C^{-1}B^\top w_1\big)\right\|^2
= \|w_1\|^2 + \left\|w_2 + C^{-1}B^\top w_1\right\|^2
\geq \|w_1\|^2 + \Big(\big(\|w_2\| - \|C^{-1}B^\top w_1\|\big) \vee 0\Big)^2
\geq \|w_1\|^2 + \Big(\big(\|w_2\| - \|BC^{-1}\|_{op}\, \|w_1\|\big) \vee 0\Big)^2 ,
\]
where we use that $\|C^{-1}B^\top\|_{op} = \|BC^{-1}\|_{op}$. Since $\|w\|^2 = \|w_1\|^2 + \|w_2\|^2 = 1$, we can write the above bound in the form of
\[
f(p) = p + \big(\big(\sqrt{1 - p} - b\sqrt{p}\,\big) \vee 0\big)^2 ,
\]
where $p = \|w_1\|^2$ and $b = \|BC^{-1}\|_{op}$. By Lemma 4.6.8, we have
\[
\left\|\big(w_1,\, w_2 + C^{-1}B^\top w_1\big)\right\|^2 \geq \frac{1}{\big(\|BC^{-1}\|_{op} + 1\big)^2 + 1} . \tag{4.6.16}
\]

For any $c \geq 0$,
\[
\min_{v : \|v\|^2 = c} v^\top \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} v
= c \left\{\lambda_{\min}\big(A - BC^{-1}B^\top\big) \wedge \lambda_{\min}(C)\right\} . \tag{4.6.17}
\]

Thus, applying (4.6.16) and (4.6.17) to (4.6.15) gives
\[
\lambda_{\min}(M)
= \min_{w : \|w\| = 1} \big(w_1,\, w_2 + C^{-1}B^\top w_1\big) \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 + C^{-1}B^\top w_1 \end{pmatrix}
\geq \min_{v : \|v\|^2 \geq \frac{1}{(\|BC^{-1}\|_{op} + 1)^2 + 1}} v^\top \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} v
\geq \frac{\lambda_{\min}\big(A - BC^{-1}B^\top\big) \wedge \lambda_{\min}(C)}{\big(\|BC^{-1}\|_{op} + 1\big)^2 + 1} .
\]
Substituting the bound
\[
\big\|BC^{-1}\big\|_{op} \leq \frac{\|B\|_{op}}{\lambda_{\min}(C)} ,
\]
and noting that the last expression above is decreasing in $\|BC^{-1}\|_{op}$, gives the required result.

The following is a well-known lemma for matrices under a Schur decomposition.

Lemma 4.6.7. For any symmetric matrix $M$ of the form
\[
M = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} ,
\]
if $C$ is invertible then $M \succeq 0$ if and only if $C \succeq 0$ and $A - BC^{-1}B^\top \succeq 0$.

Proof. The Schur complement of $C$ in $M$ is given by
\[
A - BC^{-1}B^\top ,
\]
which is clearly symmetric. Notice that we can use it to decompose $M$ as
\[
M = \begin{pmatrix} I & BC^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} I & BC^{-1} \\ 0 & I \end{pmatrix}^\top .
\]
A block diagonal matrix is positive semi-definite if and only if each diagonal block is positive semi-definite, and congruence by the invertible matrix $\big(\begin{smallmatrix} I & BC^{-1} \\ 0 & I \end{smallmatrix}\big)$ preserves positive semi-definiteness, which concludes the proof.

Lemma 4.6.8. For the function $f(p) = p + \big(\big(\sqrt{1 - p} - b\sqrt{p}\,\big) \vee 0\big)^2$ with $b \geq 0$, we have
\[
\min_{p \in [0, 1]} f(p) \geq \frac{1}{(b + 1)^2 + 1} .
\]
Proof. Since $\sqrt{1 - p} \geq 1 - \sqrt{p}$, we have
\[
f(p) \geq p + \big(\big(1 - (b + 1)\sqrt{p}\,\big) \vee 0\big)^2 .
\]
We define
\[
g(x) = x^2 + \big((1 - (b + 1)x) \vee 0\big)^2 ,
\]
so that the above bound states
\[
f(p) \geq g(\sqrt{p}\,) .
\]
To obtain the result, we prove the claim that, for $x \in [0, 1]$,
\[
g(x) \geq \frac{1}{(b + 1)^2 + 1} .
\]
We obtain this by finding the minimizer $x^\star$ of $g(x)$ and showing that it satisfies $1 - (1 + b)x^\star > 0$. Note that the function $g(\cdot)$ is convex with $g(0) = g(1) = 1$, so finding a local minimum is sufficient. Setting $g'(x) = 0$ on the region where $1 - (b + 1)x > 0$, that is,
\[
\frac{d}{dx}\left[x^2 + (1 - (b + 1)x)^2\right] = \frac{d}{dx}\left[\big((b + 1)^2 + 1\big)x^2 - 2(b + 1)x + 1\right] = 0 ,
\]
implies
\[
x^\star = \frac{b + 1}{(b + 1)^2 + 1} ,
\]
and
\[
(x^\star)^2 + \big(1 - (b + 1)x^\star\big)^2 = \frac{1}{(b + 1)^2 + 1} .
\]
Further notice that $1 - (1 + b)x^\star > 0$, so this point is a minimum of the function $g(x)$. Thus, we have
\[
\min_{p \in [0, 1]} f(p) \geq \min_{x \in [0, 1]} g(x) \geq \frac{1}{(b + 1)^2 + 1} ,
\]
and the result is obtained.

Chapter 5

Reinforcement learning in
insurance

5.1 Introduction

Reinforcement learning (RL) is the study of optimal decision-making and focuses on


learning how to make sequential decisions in environments with unknown dynam-
ics. In the RL framework, an agent (i.e., a learner or decision maker) continually
interacts with an environment (i.e., everything outside the agent). The key idea is
learning through interaction. The agent observes the state of the environment and
then takes some action; the environment responds to this action in the form of a
reward and presents new situations (i.e., the next state) to the agent. The goal of
RL is to improve the agent’s future reward given its past experience.
Markov decision processes (MDPs) describe the environment in RL when the environment is fully observable. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent states and, through those, future rewards. Thus a trade-off between immediate and delayed rewards is needed in MDPs.
We study an insurance pricing problem, in which the insurance company earns premiums, but also has to pay claims to policyholders if the insured events happen. We assume that the insurer decides to pay a dividend to shareholders. A dividend can only be paid when the revenue at that time is positive. If the
revenue becomes negative, we say that the company is ruined and has to stop its
business. The objective of the company is to set prices so as to maximize the
expected discounted dividend payout over a given time horizon (or until ruin).

In this chapter, we investigate the use of RL techniques for solving insurance pricing problems in a dynamic environment. The pricing problem is a sequential decision problem, and consequently we can formulate the insurance pricing problem as an MDP. There are several different types of RL, which can be roughly divided
into the following three classes: model-based, model-free, and policy search. In the
model-based approach, the agent first constructs a model of the environment, and
then uses it to derive the optimal policy. Here, it is assumed that the dynamics of
the MDP are known. In the model-free approach, the agent learns directly from
experience, instead of learning the model. In the policy search approach, the agent
learns a parameterized policy by searching directly in (a subspace of) the policy
space.

The main contributions of this chapter are summarized as follows. We consider


an insurance company which sells a new product online and we set up an MDP
model to solve the dynamic pricing problem. In this model, the company chooses
prices (i.e., actions), observes revenues (i.e., states) and then decides dividend pay-
out (i.e., rewards) based on observations. We then use RL based policies including
Q-learning, Sarsa, Expected Sarsa, deep Q-learning with a neural network, and
cross-entropy search to learn the optimal pricing policy. Our results show that these
policies work well without prior knowledge and are potentially applicable in the real
market.

The remainder of this chapter is structured as follows. In the next section we


provide a review of literature that is most closely related to our work. In Section 5.3,
we provide background on RL. In Section 5.4, we formulate the insurance pricing
problem as an MDP. In Section 5.5, we present an application of RL to a simple
example.

5.2 Related literature

Our interest is specifically in reinforcement learning based models so we provide a


review of relevant work in this section.

Reinforcement learning. RL has enjoyed a surge in popularity in the last few


years and has been applied to a wide range of problems in many areas. For example,
in robotics and control, we refer readers to Kober et al. (2013) for a survey of the
many noteworthy applications of RL. In human computer interaction, it has been
used to automatically optimize a fielded dialogue system with human users (Singh
et al., 2002). Most famously, RL was the basis for DeepMind’s game-playing agents
for Atari games (Mnih et al., 1995) and the classic board game Go (Silver et al.,
2016). Sutton & Barto (2018) provided an excellent introduction to the core ideas
and algorithms of RL.

Reinforcement learning in dynamic pricing. MDPs are a very powerful mod-


eling technique in decision-making problems, where they are used to find the optimal
policies that either maximize rewards or minimize costs. MDPs have a long history
in dynamic control problems. Miller (1968) studied continuous-time MDPs with
finite states and actions. Rothstein (1968) proposed a Markovian decision model
to address the hotel overbooking problem, and Qiu & Pedram (1999) introduced a
continuous-time, controllable MDP to solve a dynamic power management problem.
Using MDPs to solve the dynamic pricing problems in revenue management is
relatively recent. Brooks et al. (1999) compared machine learning based parametric
pricing models in the electronic goods market. Gosavi et al. (2002) applied RL to
solve the airline seat allocation and overbooking problem, in which they considered
the problem as a semi-MDP over an infinite time horizon. Gupta et al. (2002)
considered an online multi-unit Dutch auction problem in RL framework and used
one-step RL methods to learn optimal price decisions. Raju et al. (2006) used
RL techniques to determine dynamic prices in an electronic retail market. They
considered an infinite horizon learning problem with customer segmentation, with
the aim to optimize either long-term discounted profit or long-run average profit

per unit time. Collins & Thomas (2012) compared three different RL approaches
for the airline pricing game and showed that learning policies eventually converge
on the expected solution. Rana & Oliveira (2014) applied RL, with and without eligibility traces, to learn and optimize the pricing of perishable products facing real-time demand. We also refer to Könönen (2006), Han et al. (2008) and Kutschinski
et al. (2013) for dynamic pricing by multi-agent RL.

Cross-entropy method for policy search. A variety of successful policy search


methods have been introduced such as a direct search in the policy space (Rosenstein
& Barto, 2001) and policy gradient methods (Sutton et al., 2000; Baxter et al., 2001).
Mannor et al. (2003) introduced a policy search approach based on the cross-
entropy (CE) method and showed that this approach converges quickly and has good
performance in the context of RL. The CE method was originally designed for the
simulation of rare events in complex stochastic networks (e.g. see Rubinstein (1997),
Homem-de Mello (2007)), and then used to address continuous multi-extremal op-
timization (e.g. see Rubinstein (1999)). Mannor et al. (2003) used the CE method
for both rare event estimation and combinatorial optimization. For the proof of the
convergence of the CE method, we refer to Homem-de Mello & Rubinstein (2002).
Furthermore, Rubinstein & Kroese (2004) and Margolin (2005) proved the asymp-
totic convergence of the modified CE method under certain assumptions. The proof
of Margolin (2005) is similar to the ant algorithm in Gutjahr (2000). For a tutorial
on the CE method and its applications, we refer to de Boer et al. (2003); and for
an extensive list of recent work we refer to de Boer et al. (2003) and Botev et al.
(2013).

5.3 Markov decision processes


We consider the interaction of an agent with an environment over a discrete series
of time steps, t = 1, . . . , T . Extensions to the continuous case are possible, see
Bertsekas (1995) and Bertsekas & Tsitsiklis (1996). Figure 5.1 illustrates how this
interaction works. At each step t, the agent observes the current state st ∈ S from
its environment, where S is the set of all possible states, and takes some action

Figure 5.1: The agent-environment interaction in an MDP.

at ∈ A(st ), where A(st ) is the set of all available actions that the agent may take in
state st . Then at the next time step t + 1, the agent receives a reward rt+1 from the
environment and moves to a new state st+1 .
We let $G_t$ denote the return, the sum of discounted rewards from time $t$ onwards. The goal of the agent is to select actions to maximize the expected return over the time horizon $T$ (which need not be finite),
\[
G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1}\, r_k ,
\]
where $\gamma \in [0, 1]$ is called the discount factor. Clearly, if $\gamma = 0$ then we simply choose $a_t$ to maximize the immediate reward $r_{t+1}$. If $\gamma < 1$ and the reward sequence $r_t$ is bounded, then $G_t$ is finite. As $\gamma \to 1$, we take future rewards more heavily into account. Normally $\gamma$ is chosen such that it is only slightly smaller than 1. Introducing a discount factor can often make problems more tractable analytically and is very common in practice. For a non-discounted problem, we can simply take $\gamma = 1$. The sequence of time steps from 1 to $T$ is called an episode.
The process of the agent observing the environment, receiving a reward and
moving to the next state is called a Markov decision process or an MDP for short.
In an MDP, the current observation st summarizes all information about previous
states and is described as the Markov state. Thus the environment’s response st+1
at time step t + 1 depends only on the state st and action at at time step t, not on
the history of the process so far. We consider a finite MDP, where the sets of states,
actions, and rewards (S, A and R) all have a finite number of elements.

Value functions and optimality

We use a value function to measure the performance of an agent in a given state


in terms of expected return. However, the expected return depends on the action-selection rule, called a policy. Essentially, a policy, denoted by π, is a mapping from states to probabilities over actions, π(a | s) = P(at = a | st = s). In general, policies may
be either deterministic or stochastic. The value function of state st = s under a
policy π, denoted by vπ (s), is the expected return when starting in s and following
π thereafter, defined by

vπ (s) := Eπ[Gt | st = s]

= Eπ[rt+1 + γvπ (st+1 ) | st = s] ,

for all s ∈ S and Eπ denotes the expected value of a random variable given that the
agent follows policy π over episodes. The last equality is the Bellman equation for vπ ,
which shows that the value function can be decomposed into two parts: immediate
reward rt+1 and the discounted value of its possible successor state γvπ (st+1 ). Note
that the value of the terminal state, if any, is always zero.
Similarly, we can define the value of taking a particular action in a given state.
Starting from a state s, if we take action a, and thereafter under a policy π, we have

qπ (s, a) := Eπ[Gt | st = s, at = a]

= Eπ[rt+1 + γqπ (st+1 , at+1 ) | st = s, at = a] .

Here qπ is called the action-value function, or sometimes called the Q-function.


Solving an MDP means finding a policy which achieves the maximum reward in the long run. It has been shown that, for any finite MDP, there exists at least one optimal policy $\pi^\star$ that is better than or equal to all other policies (Bertsekas, 1995), i.e., $\pi^\star = \arg\max_{\pi} v_\pi(s)$ for all $s \in S$. We typically use $v^\star(s)$ to denote the optimal value function, which is the value function following $\pi^\star$, and define it by
\[
v^\star(s) := \max_{\pi} v_\pi(s)
= \max_{a} E\left[r_{t+1} + \gamma v^\star(s_{t+1}) \mid s_t = s,\, a_t = a\right] .
\]

Similarly, the optimal action-value function, denoted by $q^\star(s, a)$, is defined by
\[
q^\star(s, a) := \max_{\pi} q_\pi(s, a)
= E\left[r_{t+1} + \gamma \max_{a'} q^\star(s_{t+1}, a') \,\middle|\, s_t = s,\, a_t = a\right] .
\]

5.3.1 Temporal-difference algorithms

Temporal-difference (TD) learning is a model-free, on-line learning approach, where


the agent learns directly from episodes of experience without knowing MDP transi-
tions/rewards and the final outcome. TD learning updates the current value v(st )
toward the estimated return rt+1 + γv(st+1 ) instead of the actual return,

v(st ) ← v(st ) + α [rt+1 + γv(st+1 ) − v(st )] .

Here rt+1 + γv(st+1) is called the TD target. The term δt := rt+1 + γv(st+1) − v(st) is called the TD error, which is the difference between the current value of state st and the reward rt+1 accumulated along the way plus the discounted value of the subsequent state st+1. The error decreases as the value function is repeatedly updated. The step-size parameter α ∈ (0, 1] controls the learning rate.
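As an illustration, the following minimal Python sketch (not taken from the implementation used later in this chapter) applies the TD(0) update to a dictionary-based value table; the environment step function `env_step` and the episode loop are hypothetical placeholders.

```python
from collections import defaultdict

def td0_update(V, s, r_next, s_next, alpha=0.5, gamma=0.9):
    """One TD(0) update: move V(s) toward the TD target r + gamma * V(s')."""
    td_target = r_next + gamma * V[s_next]
    td_error = td_target - V[s]          # delta_t in the text
    V[s] += alpha * td_error
    return td_error

# Hypothetical usage with a value table initialized to zero:
V = defaultdict(float)
# s_next, r_next = env_step(s, a)       # placeholder for an environment step
# td0_update(V, s, r_next, s_next)
```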
There are two main classes of TD methods: off-policy and on-policy. In on-
policy methods, the agent learns the optimal policy and behaves using the same
policy. Sarsa is an example of an on-policy method. In off-policy methods, the
agent learns about the optimal policy by using two policies, one that is used for
updating (the update policy) and another that is used for exploration (the behavior
policy). Q-learning is an example of an off-policy method. Sarsa and Q-learning are discussed in the following subsections.

Sarsa

Sarsa was introduced by Rummery & Niranjan (1994) and named by Sutton (1996)
because it uses all of the values st , at , rt+1 , st+1 , at+1 for each update. The algorithm
tries to learn the action-value function. More specifically, at every time step, we
estimate qπ (s, a) for the same behavior policy π and for all states s and actions a.
We write Q(s, a) to refer to its approximation.

Algorithm 5: Sarsa Algorithm
Initialize: Discount factor γ, step size α ∈ (0, 1], small ε > 0;
Q(s, a), ∀s ∈ S, a ∈ A(s) arbitrarily, and Q(terminal state, ·) = 0;
Repeat
    Initialise s;
    Choose action a from s using ε-greedy policy derived from Q;
    Repeat
        Take action a and observe r and next state s′;
        Choose action a′ from s′ using ε-greedy policy derived from Q;
        Q(s, a) ← Q(s, a) + α (r + γQ(s′, a′) − Q(s, a));
        s ← s′, a ← a′;
    Until s is a terminal state;
Until all episodes are observed;

In Sarsa, the update rule for the action-value function is

Q(st , at ) ← Q(st , at ) + α (rt+1 + γQ(st+1 , at+1 ) − Q(st , at )) .

If st+1 is a terminal state, then we let Q(st+1, at+1) = 0. A complete description of Sarsa
is given in Algorithm 5. If all states are visited infinitely often and with appropriate
choice of step-size α, Sarsa converges to the optimal policy and action-value function
with probability 1 (Singh et al., 2000).
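For concreteness, a minimal tabular Sarsa episode in Python might look as follows. The environment interface (`env.reset`, `env.step`), the action set and the ε-greedy helper are hypothetical stand-ins for the insurance environment described in Section 5.4, not the exact implementation used for the experiments.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=0.9, eps=0.1):
    """Run one episode and update Q in place with the Sarsa rule."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)                    # hypothetical interface
        a_next = epsilon_greedy(Q, s_next, actions, eps)
        target = r if done else r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # Sarsa update
        s, a = s_next, a_next

# Q = defaultdict(float)   # Q(s, a) initialized to zero
```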

Q-learning

Unlike Sarsa, which chooses all actions according to the same behavior policy, Q-learning (Watkins, 1989) updates using another policy, namely the action that maximizes the Q-value at the new state st+1. Q-learning's update rule is given by
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right) .
\]

The Q-function is learned by approximating the optimal action-value function q ?


directly. A description of Q-learning for a finite MDP is given in Algorithm 6. If all
state-action pairs are visited an infinite number of times and with proper choice of α,
Q-learning control converges to the optimal action-value function with probability
1 (Watkins & Dayan, 1992).

Algorithm 6: Q-learning Algorithm
Initialize: Discount factor γ, step size α ∈ (0, 1], small ε > 0;
Q(s, a), ∀s ∈ S, a ∈ A(s) arbitrarily, and Q(terminal state, ·) = 0;
Repeat
    Initialise s;
    Repeat
        Choose action a from s using ε-greedy policy derived from Q;
        Take action a and observe r and next state s′;
        Q(s, a) ← Q(s, a) + α (r + γ maxa Q(s′, a) − Q(s, a));
        s ← s′;
    Until s is a terminal state;
Until all episodes are observed;

Algorithm 7: Expected Sarsa Algorithm
Initialize: Discount factor γ, step size α ∈ (0, 1], small ε > 0;
Action-value function Q(s, a), ∀s ≠ terminal state ∈ S, a ∈ A(s) arbitrarily, and Q(terminal state, ·) = 0;
Repeat
    Initialise s;
    Repeat
        Choose action a from s using ε-greedy policy derived from Q;
        Take action a and observe r and next state s′;
        Q(s, a) ← Q(s, a) + α (r + γ Σu π(u | s′)Q(s′, u) − Q(s, a));
        s ← s′;
    Until s is a terminal state;
Until all episodes are observed;

Expected Sarsa

Expected Sarsa can be viewed as an on-policy version of Q-learning (van Seijen


et al., 2009). Expected Sarsa is similar to Q-learning, but uses the expected value
of the next state–action values instead of the maximum values. The algorithm takes
into account how likely each action is to be taken under the current policy and
eliminates the variance caused by the random selection of actions. The update rule
of the algorithm is

\begin{align*}
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma E_\pi\left[Q(s_{t+1}, a_{t+1}) \mid s_{t+1}\right] - Q(s_t, a_t)\right) \\
&= Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \sum_{a} \pi(a \mid s_{t+1})\, Q(s_{t+1}, a) - Q(s_t, a_t)\right) .
\end{align*}

Expected Sarsa is described in Algorithm 7.
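To highlight the difference between the three update rules, the following illustrative Python sketch computes the three TD targets from a row of next-state action values, assuming an ε-greedy behavior policy with m available actions; only the treatment of the next state changes between the methods.

```python
import numpy as np

def td_targets(r, q_next, a_next, eps=0.1, gamma=0.9):
    """TD targets for Sarsa, Q-learning and Expected Sarsa.

    r       : reward r_{t+1}
    q_next  : array of Q(s_{t+1}, a) over all m actions
    a_next  : action actually selected in s_{t+1} (used only by Sarsa)
    """
    m = len(q_next)
    # epsilon-greedy probabilities of the behavior policy in s_{t+1}
    probs = np.full(m, eps / m)
    probs[np.argmax(q_next)] += 1.0 - eps

    sarsa = r + gamma * q_next[a_next]                   # on-policy sample
    q_learning = r + gamma * np.max(q_next)              # greedy (off-policy) target
    expected_sarsa = r + gamma * np.dot(probs, q_next)   # expectation under pi
    return sarsa, q_learning, expected_sarsa

# Example: print(td_targets(r=1.0, q_next=np.array([0.2, 0.5, 0.1]), a_next=1))
```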

5.3.2 Function approximation

For large MDPs, there may be too many states and/or actions to store in memory, and it is also very slow to learn the value of each state individually. Therefore a function approximator is often used to estimate the true action-value function, $\hat q(s, a, w) \approx q(s, a)$, using a weight vector $w$. The goal is to find weights $w$ that minimize the mean-squared error $L(w)$ defined by
\[
L(w) := E\left[\left(q(s, a) - \hat q(s, a, w)\right)^2\right] .
\]
In practice it is often sufficient to follow a sampled (stochastic) gradient of $L(w)$,
\[
\Delta w = -\frac{1}{2}\alpha \nabla_w L(w) = \alpha\, E_\pi\left[\left(q(s, a) - \hat q(s, a, w)\right) \nabla_w \hat q(s, a, w)\right] ,
\]
where α is a step-size parameter. Typically linear function approximators are used, but nonlinear approximators—such as neural networks—are also common; we refer to Sutton & Barto (2018) for more details.
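As a small illustration, a semi-gradient update for a linear approximator $\hat q(s,a,w) = w^\top \phi(s,a)$ could be sketched in Python as below; the feature vector `phi_sa` and the target construction are hypothetical, and in practice the unknown $q(s,a)$ is replaced by a bootstrapped TD target as in the previous subsections.

```python
import numpy as np

def semi_gradient_update(w, phi_sa, td_target, alpha=0.1):
    """One stochastic semi-gradient step for a linear q-hat(s, a, w) = w . phi(s, a).

    The true q(s, a) is unknown, so a TD target (e.g. r + gamma * max_a q-hat(s', a, w))
    is used in its place.
    """
    q_hat = np.dot(w, phi_sa)                    # current estimate
    w += alpha * (td_target - q_hat) * phi_sa    # gradient of q-hat w.r.t. w is phi(s, a)
    return w

# Hypothetical usage with a 4-dimensional feature vector:
w = np.zeros(4)
phi_sa = np.array([1.0, 0.5, -0.2, 0.3])         # phi(s, a), assumed given
w = semi_gradient_update(w, phi_sa, td_target=1.2)
```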

5.3.3 Neural networks

A neural network (NN) is a biologically inspired model which consists of a network


architecture composed of artificial neurons (also called units or processing elements
etc.) (e.g. see McCulloch & Pitts (1943), Widrow & Hoff (1960) and Rumelhart
et al. (1986)). One of the important ingredients of a NN architecture is the choice
of a non-linear activation function, which describes the weights on the connections
between input and output. The most popular choices of activation functions are the
sigmoid/logistic function,
\[
f(x) = \frac{1}{1 + e^{-x}} ,
\]
or the hyperbolic tangent function

f (x) = tanh(x) ,

or the rectified linear unit (ReLU) function

f (x) = max(0, x) .

Figure 5.2: A feedforward NN with n input units, m output units, and two hidden layers.

NNs consist of an input layer, an output layer and often one or more “hidden”
layers between the two. Those with one hidden layer are called shallow networks,
and those with more than one hidden layer are called deep networks. A feedforward
neural network is a popular NN, which is a directed acyclic process. In this net-
work, the information moves from the input layer, goes through the hidden layer(s),
and finally leaves through the output layer. If the network has loops, it is called a
recurrent NN. A specialized kind of feedforward NN is the convolutional neural network, which has been tremendously successful in practical applications such as facial recognition (Lawrence et al., 1997), image classification (Krizhevsky et al., 2012) and speech recognition (Abdel-Hamid et al., 2014; Amodei et al., 2016). Figure
5.2 illustrates a feed-forward NN architecture, which has an input layer with n input
units, two hidden layers and m output units.

A popular method to train a deep NN is backpropagation (Rumelhart et al.,


1986), which calculates the gradient of the loss function w.r.t. the weights of the

network. The gradient information is propagated backwards through the network. The
use of deep NNs for function approximation in RL is known as deep reinforcement
learning, and has achieved impressive success in recent years. In deep RL, NNs
can use TD errors to learn value functions. For example, DeepMind’s Atari game-
playing agent used a deep convolutional network to approximate Q-values (Mnih
et al., 2015).

5.3.4 Deep Q-Network

Any NN used to approximate Q-values is called a deep Q-network, or DQN for short, such that
\[
q(s, a, w) \approx q^\star(s, a) .
\]
The network is updated according to a loss function on the Q-values. We use the mean-squared error $L(\cdot)$ to compute the loss, which is based on the difference between the approximated and true Q-values. In practice, the true value function is unknown. Thus, we substitute the optimal target $r + \gamma \max_{a'} q^\star(s', a')$ with the approximate target $r + \gamma \max_{a'} q(s', a', w)$. At each iteration $i$, we have
\[
L(w_i) = E_\pi\left[\left(r + \gamma \max_{a'} q(s', a', w_i) - q(s, a, w_i)\right)^2\right] .
\]

Here L(wi ) can be optimized by stochastic gradient descent. However, RL with a


nonlinear function approximator such as a NN is not stable and can even diverge.
This issue could be addressed by experience replay, first proposed by Lin (1992).
The idea of experience replay is to store the agent’s experiences (st , at , rt , st+1 ) into
a replay memory at each time step t and sample a random batch from this memory.
Then we compute the target value for the sample states and use stochastic gradient
descent to update the network weights. Experience replay has several advantages.
For example, it removes the correlations from prior experience by randomizing over
the data, and avoids oscillations or divergence by averaging the data distribution.
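A minimal replay memory could be sketched as follows; the buffer size and batch size are illustrative choices, and the sampled batch would then be used to compute targets and perform a stochastic gradient step on the network weights.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random batch breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# memory = ReplayMemory()
# memory.push(s, a, r, s_next, done)   # store experience at each time step
# batch = memory.sample(32)            # once enough transitions are collected
```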

5.3.5 Exploration and exploitation trade-off

All TD algorithms face a trade-off between exploitation and exploration, i.e., choosing the best actions based on the current knowledge (i.e., exploitation), and choosing actions that potentially yield higher rewards in the future (i.e., exploration).

An ε-greedy policy is one of the most popular action-selection rules for balancing exploration with exploitation. This action-selection rule is to choose the best action with probability 1 − ε or choose a random action with probability ε, where ε is called the exploration rate. The best action is the action (or one of the actions) that gives the highest estimated action value so far, and is called the greedy action. We normally let ε be small, which guarantees that we take the best prices most of the time. Assume there are m available actions; then the ε-greedy policy can be written as
\[
\pi(a \mid s) =
\begin{cases}
\varepsilon/m + 1 - \varepsilon , & \text{if } a = a^\star = \arg\max_{a \in A} q(s, a) , \\
\varepsilon/m , & \text{otherwise} .
\end{cases}
\]

Another approach is a decaying ε-greedy policy, where ε slowly decays over time. A simple way to obtain the decay is to multiply ε by a real number less than 1, called the decay rate, i.e., ε ← ε × decay rate. The agent tends to explore more when it does not have enough information about the environment at the beginning, and eventually the agent gains more of an understanding of the environment in order to exploit more. A typical implementation may start with ε = 0.1, and set a decay rate of 0.995 or 0.999 and a minimum ε = 0.01.
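A minimal sketch of ε-greedy action selection with decay, assuming a Q-table indexed by (state, action) pairs and a finite list of candidate prices, is given below; the parameter values mirror the typical settings mentioned above.

```python
import random

def select_action(Q, s, actions, eps):
    """Epsilon-greedy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decay_epsilon(eps, decay_rate=0.995, eps_min=0.01):
    """Multiply eps by the decay rate, but never go below eps_min."""
    return max(eps_min, eps * decay_rate)

# Hypothetical usage over episodes:
# eps = 0.1
# for episode in range(num_episodes):
#     ...                     # run one episode, selecting actions with select_action
#     eps = decay_epsilon(eps)
```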

5.3.6 Cross-entropy method in MDPs

The cross-entropy (CE) method is a simple and versatile technique for optimization, based on CE minimization, which has asymptotic convergence properties. For a more detailed introduction to the foundations and various other applications of the CE method, we refer to de Boer et al. (2003). Let $\mathcal{X}$ be a finite set of states and $f(x)$ be a performance function over all $x \in \mathcal{X}$. We want to find the maximum of $f$ over $\mathcal{X}$, and we denote the maximum value by $\ell^*$, such that
\[
\ell^* = \max_{x \in \mathcal{X}} f(x) .
\]
We set thresholds or levels $\ell \in \mathbb{R}$ and define an indicator function $\mathbb{1}_{\{f(X) \geq \ell\}}$. We wish to estimate
\[
\gamma(\ell) = P(f(X) \geq \ell) = E\big[\mathbb{1}_{\{f(X) \geq \ell\}}\big] ,
\]
i.e., the probability that the value of $f$ exceeds some fixed $\ell$.
Recall that an MDP is defined by a tuple (S, A, P, R), where S = {1, . . . , n} is
a finite set of states; A = {1, . . . , m} is a finite set of possible actions; P is a state
transition probability matrix with elements P (s0 | s, a) representing the transition
probability from state s to state s0 when action a is chosen; and R is the set of all
possible rewards, r(s, a), received by choosing action a in state s.
Assume there exists a stopping time τ, at which the process terminates. We consider a policy that does not depend on the time t and denote the policy as a vector x = (x1, . . . , xn) with xi ∈ A being the action taken in state i. Then we can write the total reward as
\[
f(x) = E_x\left[\sum_{t=1}^{\tau} r(s_t, a_t)\right] ,
\]
starting from some fixed state s1. Here r(st, at) is the reward obtained by choosing action at in state st at time t, and Ex denotes the expectation under the policy x. We
assume that the process starts from a specific initial state s1 = sstart and ends at an
absorbing state (or terminal state), denoted by ster , with zero reward. Furthermore,
we assume that the terminal state is always reached at time τ . We want to calculate
the associated sample reward function f at each state.
Now define an auxiliary n × m probability matrix P = {Psa} with elements Psa, where s = 1, . . . , n and a = 1, . . . , m. Note that Psa is short for π(a | s) or P(at = a | st = s), which denotes the probability of taking action a in state s, and $\sum_{a=1}^{m} P_{sa} = 1$ for any s. Assume that the matrix P is initialized to a uniform matrix, Psa = 1/m. Once this matrix P is defined, the CE algorithm comprises the following two phases at each iteration t:

1. Generating N random trajectories (samples) X1, . . . , XN using the auxiliary policy matrix P and simultaneously calculating the reward of trajectory i, denoted by f(Xi), for i = 1, . . . , N. The trajectory generation for the MDP is given in Algorithm 8.

2. Updating parameters of the matrix P on the basis of the data collected at the
first phase.

Algorithm 8: CE Method in MDPs
Input: Auxiliary policy matrix P.
For (i = 1, . . . , N):
    Start from some given initial state s1 = sstart and set t = 1 (iteration counter).
    Repeat
        Generate an action at by Psa in (5.3.1);
        Observe a reward rt = r(st, at) and a new state st+1;
        Set t = t + 1;
    Until st = ster;
    Obtain a trajectory
        Xi = {s1^(i), a1^(i), r1^(i), . . . , sτ−1^(i), aτ−1^(i), rτ−1^(i), sτ^(i)},
    and calculate the reward of the trajectory
        f(Xi) = Σ_{t=1}^{τ} r(st^(i), at^(i)).
Output: f.

Given a trajectory $X_i = \{s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, \ldots, s_{\tau-1}^{(i)}, a_{\tau-1}^{(i)}, r_{\tau-1}^{(i)}, s_\tau^{(i)}\}$, we calculate the cumulative reward from each state until termination. For each state $s = s_j$, the reward is $f_{s_j}(X_i) = \sum_{t=j}^{\tau} r\big(s_t^{(i)}, a_t^{(i)}\big)$. Each state $s$ is updated separately according to the performance $f_s(X_i)$ obtained from state $s$ onwards. Thus, at iteration $t$, the parameter matrix, denoted by $P_{t,sa}$, is updated by
\[
P_{t,sa} = \frac{\sum_{i=1}^{N} \mathbb{1}_{\{f_s(X_i) \geq \ell_{t,s}\}}\, \mathbb{1}_{\{X_i \in \mathcal{X}_{sa}\}}}{\sum_{i=1}^{N} \mathbb{1}_{\{f_s(X_i) \geq \ell_{t,s}\}}\, \mathbb{1}_{\{X_i \in \mathcal{X}_{s}\}}} , \tag{5.3.1}
\]
where the space set $\mathcal{X}_s$ contains all trajectories that visit state $s$ and the space set $\mathcal{X}_{sa}$ contains all trajectories in which action $a$ is taken in state $s$. We consider the $(1-\rho)100\%$ quantile of the rewards, denoted by $\ell_{t,s}$, which separates the best-performing rewards at each iteration $t$. This can be obtained by ordering the rewards $f_s(X_1), \ldots, f_s(X_N)$ from smallest to largest, $f_{s(1)} \leq \cdots \leq f_{s(N)}$, and letting
\[
\ell_{t,s} = f_{s(\lceil (1-\rho)N \rceil)} .
\]
Here $\rho$ is chosen to be not very small, say $\rho \geq 10^{-2}$, such that the probability of the event $\{f_s(X) \geq \ell_{t,s}\}$ is not too small. Note that a different threshold parameter $\ell_{t,s}$ is used for each state $s$ at each iteration $t$.
state s at each iteration t.
The CE method is an iterative method, and the update of the parameters here
is based on the best-performing samples. In the setting of MDPs, the CE method takes advantage of the Markov property, i.e., it only considers the reward from the
visit to that state onward. The choice of action in a given state only affects the
reward from the state onward, not the past. This enables the algorithm to speed up
and reduce bias.
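A compact sketch of the CE policy-matrix update in (5.3.1), written in Python with NumPy, is given below; the trajectory simulator is assumed to be available elsewhere (e.g. following Algorithm 8), and here each sample is summarized by, for every visited (state, action) pair, the reward-to-go from that state.

```python
import numpy as np

def ce_update(P, samples, rho=0.1):
    """One CE update of the policy matrix P (n states x m actions), cf. (5.3.1).

    Each element of `samples` summarizes one trajectory as a dict mapping a
    visited state s to a pair (reward_to_go_from_s, action_taken_in_s).
    """
    n, m = P.shape
    for s in range(n):
        visits = [summary[s] for summary in samples if s in summary]
        if not visits:
            continue                                  # state s was never visited
        rewards = np.array([f for f, _ in visits])
        level = np.quantile(rewards, 1.0 - rho)       # elite threshold, cf. l_{t,s}
        counts = np.zeros(m)
        for f, a in visits:
            if f >= level:
                counts[a] += 1                        # elite samples taking action a in s
        P[s] = counts / counts.sum()                  # empirical elite action frequencies
    return P

# P = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform initialization
# P = ce_update(P, samples)                              # samples from Algorithm 8
```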

5.4 Problem formulation

In this section, we formulate the insurance pricing problem in an RL framework, where


the company is a natural learning agent. The company chooses a price (i.e., an action), receives observations from the market (i.e., the environment), and receives a reward. Thus it is natural to apply RL here.
We consider an MDP with one-dimensional state space and action space. The
state space S is the set of all possible revenues rt for t = 1, . . . , T . For each state,
the action space A is the set of all possible prices pt for t = 1, . . . , T . We denote the
dividend pay-out by Divt and assume Divt ∈ [0, rt ]. The next state, i.e., revenue at
the beginning of time t + 1, is

rt+1 = rt + zt − Divt , (5.4.1)

where zt = pt dt − ct is the difference between premium and claims at time t. If zt


are all non-positive, then the company makes no profit and the suggestion would be
to stop the business and pay all the money to shareholders.
At the beginning of each selling period, the insurer decides a dividend pay-out,
and it is only paid when the revenue is above a certain level ` > 0. We set the
dividend at time t to be proportional to the excess of rt − `, i.e., Divt = c(rt − `)
if rt > ` or Divt = 0 if rt ≤ ` where c ∈ [0, 1]. The reward function is the dividend
pay-out, given by

\[
\mathrm{Div}_t =
\begin{cases}
c\,(r_t - \ell) , & \text{if } r_t \geq \ell , \\
0 , & \text{if } r_t < \ell .
\end{cases}
\tag{5.4.2}
\]

We define the ruin time τ by

τ := inf{t ∈ N : rt < 0} .

133
The performance of a policy, measured by expected reward, is defined by
\[
E_\pi\left[\sum_{t=1}^{T \wedge \tau} \gamma^{t-1}\, \mathrm{Div}_t\right] , \tag{5.4.3}
\]
where $\gamma \in [0, 1]$ is the discount factor and $T \wedge \tau$ is the stopping time. If the revenue is always positive before the selling horizon $T$, the company stops at time $T$; otherwise, it stops at the ruin time $\tau$.

5.5 Numerical examples in insurance


Similar to Section 3.7, we consider demand and log total claims models given by

E[D(p)] = 11 − 0.8p ,

E[C(p)] = 3 + 0.25p ,

and the feasible price set is [p` , ph ] = [1, 10]. At time t, the insurer has to make
decisions on paying dividends to shareholders based on their revenue. Without loss
of generality, we set c = 10% and ℓ = 1 in (5.4.2); then the dividend that the insurer pays is
\[
\mathrm{Div}_t =
\begin{cases}
0.1 \cdot (r_t - 1) , & \text{if } r_t > 1 , \\
0 , & \text{if } r_t \leq 1 .
\end{cases}
\tag{5.5.1}
\]

That is, if rt > 1, the insurer pays shareholders 10% of the excess of rt − 1, otherwise
nothing. Thus, at time t, the insurer chooses a selling price pt , and then observes
demand dt and total claims ct . Claims are only paid when insured events happen.
However insured events do not always occur, so neither do claims. We introduce
a parameter wt and set it to be 0 or 1. Furthermore we assume that there is only
a 50% chance that the insured events will happen. We generate a random number
and if it is less than or equal to 0.5, we let wt = 1, which means that the company needs to pay claims; and if it is greater than 0.5, we let wt = 0, which implies that the
company does not pay claims. Then the revenue at time t + 1 is

rt+1 = rt + (pt dt − wt ct) − 10% · max{0, rt − ℓ} .

Our goal is to maximize the expected reward given in (5.4.3).
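To make the simulation concrete, a minimal sketch of one step of this insurance environment is given below, using the demand and claims models and the dividend rule stated above. The Gaussian noise around the expected demand and claims is an illustrative assumption (the exact distributions follow the set-up of Section 3.7), and the class interface is a hypothetical convenience rather than the code used for the experiments.

```python
import numpy as np

class InsuranceEnv:
    """One-product insurance pricing MDP with dividend pay-outs (Sections 5.4-5.5)."""

    def __init__(self, r0=1.0, c=0.1, ell=1.0, seed=0):
        self.r0, self.c, self.ell = r0, c, ell
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.r = self.r0                                   # initial revenue (state)
        return self.r

    def step(self, p):
        # Expected demand and claims at price p; the noise is an assumption.
        demand = max(0.0, 11.0 - 0.8 * p + self.rng.normal(0.0, 1.0))
        claims = max(0.0, 3.0 + 0.25 * p + self.rng.normal(0.0, 1.0))
        w = 1.0 if self.rng.random() <= 0.5 else 0.0       # claims occur with prob. 1/2
        dividend = self.c * max(0.0, self.r - self.ell)    # reward, cf. (5.5.1)
        self.r = self.r + (p * demand - w * claims) - dividend
        done = self.r < 0.0                                # ruin terminates the episode
        return self.r, dividend, done

# env = InsuranceEnv(); s = env.reset(); s_next, reward, done = env.step(p=7)
```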

5.5.1 Temporal-difference methods

We first investigate three TD methods—Q-learning, Sarsa and Expected Sarsa. We


choose prices based on the ε-greedy policy. Recall that Q-learning, Sarsa and Expected Sarsa depend on the choice of the discount factor γ, the step-size (i.e., learning rate) α and the exploration rate ε. Unless otherwise specified, we use the default parameter settings α = 0.5, ε = 0.1 and γ = 0.9, chosen based on the experiments that we explain shortly.
Figure 5.3 shows the results of these three algorithms for a variety of different
step-sizes α = 0.01, 0.03, 0.1, 0.3 and 1.0. Performance, measured by the average
reward, is a function of the number of episodes T . The time steps can be time units
in hours or minutes or seconds. We let Q-learning and Sarsa learn for 1,000,000 episodes but used 500,000 episodes for Expected Sarsa because it converges quickly. We independently ran each algorithm 10 times and averaged the reward
over 100 episodes. We can see that when T is large enough, the Q-learning, Sarsa
and Expected Sarsa algorithms converge to the optimal policy. However, when
α = 1, Sarsa performs badly, as can be seen from Figure 5.4, which shows how the
performance of Sarsa varies with the choice of α. Moreover, it can be seen that
Expected Sarsa is more stable. This is because by taking expectations, Expected
Sarsa eliminates the variances caused by the random selection of the next action.
Figure 5.4 compares the performance of the TD methods as a function of α. We ran this experiment with 100,000 episodes, repeated twice, and then averaged the reward over 100 episodes. For this problem, Expected Sarsa performs better than Q-learning, and Q-
learning performs better than Sarsa. In addition, Expected Sarsa shows a significant
improvement over Sarsa over a wide range of α. We can see that the optimal α for
Q-learning and Expected Sarsa is 1 but it is smaller than 1 for Sarsa. Sarsa only
performs well for a small range of α values and it collapses entirely for large ones.
Sarsa's policy is still improved over the initial random policy through learning, although divergence causes it to get worse in the long run. Moreover, the
performance of Q-learning more closely resembles Expected Sarsa as α increases.
Figure 5.3: Performance of Q-learning, Sarsa and Expected Sarsa (panels (a)–(c)). The curves display the $T$-period regret for different step-sizes $\alpha = 0.01, 0.03, 0.1, 0.3, 1.0$, where the periods $T$ are $10^6$, $10^6$ and $5 \times 10^5$ respectively. In all cases, the problem parameters used are $\gamma = 0.9$ and $\varepsilon = 0.1$.

Figure 5.5 shows the results of the three TD methods for a range of exploration rates ε = 0.1, 0.2, 0.4, 0.6, 0.8 and 1.0. We can see that the performance of these three TD algorithms decreases with increasing levels of exploration. It is worth noting that the TD algorithms perform poorly with only exploratory moves, i.e., ε = 1. Similarly, if we set ε = 0, we do not do any exploration and the performance is likely to be bad; without any exploration, the price selection is purely deterministic. Thus we set ε > 0 for robust learning.
Figure 5.6 compares the performance of Q-learning, Sarsa and Expected Sarsa
over 100,000 episodes. Here rewards were averaged over 100 episodes. As expected,
the performance of Expected Sarsa is better than that of Q-learning, and Q-learning
converges to Expected Sarsa when the number of episodes becomes large. Moreover,
Sarsa’s performance is the worst but monotonically increasing. Overall, all of the
TD algorithms asymptotically converge to the same optimal policy, and Expected
Sarsa outperforms Q-learning and Sarsa.

Figure 5.4: A comparison of Q-learning, Sarsa and Expected Sarsa as a function of α. The three curves display the $T$-period regret for $T = 10^5$. In all cases, the problem parameters used are $\gamma = 0.9$ and $\varepsilon = 0.1$.

Figure 5.5: Performance of Q-learning, Sarsa and Expected Sarsa (panels (a)–(c)) for different exploration rates $\varepsilon = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0$. The curves display the $T$-period regret for $T = 10^5$. In all cases, the problem parameters used are $\gamma = 0.9$ and $\alpha = 0.5$.

Figure 5.6: A comparison of the $T$-period regret of Q-learning, Sarsa and Expected Sarsa. The three curves display the $T$-period regret for $T = 10^5$. In all cases, the problem parameters used are $\gamma = 0.9$, $\varepsilon = 0.1$ and $\alpha = 0.5$.

5.5.2 Cross-entropy method

Since the insurance pricing problem can be viewed as a multi-armed bandit problem, we can use the cross-entropy (CE) method to search for the optimal policy.

We generated N = 100 random trajectories and ran T = 100 iterations with


ρ = 10%. Each experiment was repeated 50 times. We set discount factor γ = 0.9.
Figure 5.7 shows the reward and the convergence interval obtained by the CE pricing
policy. The blue curve is the average reward for each price, and we can see that
the optimal price is 7 and optimal expected reward is around 1350. The green area
shows the confidence interval or convergence zone, which contains the optimal value that the algorithm converges to.

In Figure 5.8, we plot the average reward obtained by the CE algorithm with
ρ = 1%, 5% and 10%. We can see that CE with ρ = 10% converges faster than
that with ρ = 5%, and both are faster than that with ρ = 1%. This is because
we considered a (1 − ρ)-quantile of the performances `t,s and counted how many of
the samples X1 , . . . , XN have a performance greater than or equal to `t,s in state s
at iteration t. A larger `t,s is closer to the best of the performances in state s at
iteration t, which leads to a quicker convergence to the optimal performance. The
value of ρ does not affect the optimal policy, and CE for different ρ all converges to

Figure 5.7: Pricing Policy. The curve is the average reward for varying prices, and the
green area is the CE convergence zone. The problem parameters are N = 100, T = 100
and ρ = 10%.

Figure 5.8: A comparison of average reward for CE policies with 90%, 95% and 99%
percentiles. In all cases, the problem parameters are N = 100 and T = 100.

the optimal expected value as shown in Figure 5.8.

5.5.3 Deep Q-learning

The insurance pricing problem can be viewed as a simple MDP with one-dimensional
state space, and each revenue is a distinct state. Therefore, we can apply deep Q-learning with a neural network (DQN). In our case, the input layer receives one piece of information, i.e., there is only one input node. There are two hidden layers, each with 24 nodes
and a sigmoid activation. In the output layer, there is a separate output unit for each possible price; thus there are 10 nodes. The model is compiled using a mean-squared error loss function and the Adam optimizer with step-size α. We used Keras to implement the DQN.

Figure 5.9: Performance of DQN (using Keras) and Q-learning. The curves display the $T$-period regret over $T = 20{,}000$ episodes. In both cases, the problem parameters used are a constant $\gamma = 0.9$, $\alpha = 0.5$ and a decaying $\varepsilon$.
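A sketch of such a network in Keras is shown below, matching the architecture described above (one input node, two hidden layers of 24 sigmoid units, 10 linear outputs, mean-squared error loss and the Adam optimizer); the learning-rate value passed to Adam is an illustrative assumption rather than the exact setting used in the experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn(n_prices=10, learning_rate=0.001):
    """DQN approximating Q(r, p) for a scalar revenue state and 10 candidate prices."""
    model = keras.Sequential([
        layers.Dense(24, activation="sigmoid", input_shape=(1,)),  # state: current revenue
        layers.Dense(24, activation="sigmoid"),
        layers.Dense(n_prices, activation="linear"),               # one Q-value per price
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model

# model = build_dqn()
# q_values = model.predict(revenue_batch)   # shape (batch, 10): one Q-value per price
```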
We set the parameters α = 0.5 and γ = 0.9. Prices were selected using a decaying ε-greedy policy, with ε decaying from 1.0 to 0.01 at a rate of 0.995 and fixed at 0.01 thereafter. Figure 5.9 compares the performance of Q-learning with and without function approximation using the same parameters. We performed 20,000 episodes on 10 different occasions and averaged the reward over 100 episodes. We can see that tabular Q-learning converges quicker and shows more stable learning, while the DQN achieves a higher average reward but its learning process is unstable.

5.6 Conclusions and future directions

In this chapter, we showed that the insurance pricing problem can be modeled as
an MDP. We described all the elements of an MDP: the state (i.e., revenue) and
action (i.e., price) spaces, and the reward (i.e., dividend pay-out). We presented
the different RL-based methodologies, the basic algorithms and their modifications,
and discussed their applications to the insurance pricing problem. Our

results showed that RL techniques could be useful for solving pricing problems in
the insurance context, where available information is limited.
There are several directions for future work. First, we showed the potential
application of RL in insurance, but it is still not well understood theoretically. For example, the convergence of the methods mentioned above deserves further study, since the numerical results do not provide guidance on convergence rates. Thus the next step might be to find bounds on convergence rates, at least in some particular cases. The lack of theoretical support, together with heavy regulation and government oversight of insurance practices, will affect the use of RL in the insurance industry.
When we apply RL to a real-world business, there might be hundreds of products
to consider, and thousands of factors on how to price them. We have considered
a single monopolistic company that sells one product. Therefore, the next imme-
diate step to consider would be a company that sells a large number of products, or a large number of companies in the market. In addition, we may consider a contextual case by introducing features. Possible features are the age and gender of the customer, consumer behavior, geography, market size, etc. The insurance industry is different from other financial services industries because insurance companies must price their products without knowing the actual costs. Given the uncertainty
of pricing insurance products, companies need to predict both demand and claims
for their products in the future in order to estimate premiums at the beginning.
Adding more information can not only improve the demand and claims estimation, but also enable insurers to know their customers and provide personalized insurance that
matches each individual customer’s requirements.

Bibliography

Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algo-


rithms for linear stochastic bandits. In Proceedings of the 24th In-
ternational Conference on Neural Information Processing Systems (NIPS)
(pp. 2312–2320). URL: http://david.palenica.com/papers/linear-bandit/
linear-bandits-NIPS2011-camera-ready.pdf.

Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014).
Convolutional neural networks for speech recognition. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 22 , 1533–1545. doi:10.1109/TASLP.
2014.2339736.

Agarwal, A., & Duchi, J. C. (2011). Distributed delayed stochastic optimization.


In Proceedings of the 24th International Conference on Neural Information Pro-
cessing Systems (NIPS) NIPS’11 (pp. 2312–2320). USA: Curran Associates Inc.
URL: https://arxiv.org/pdf/1104.5525.pdf.

Albrecher, H., & Kortschak, D. (2009). On ruin probability and aggregate claim
representations for pareto claim size distributions. Insurance Mathematics and
Economics, 45 , 362–373. doi:10.1016/j.insmatheco.2009.08.005.

Amin, K., Rostamizadeh, A., & Syed, U. (2014). Repeated contextual


auctions with strategic buyers. In Advances in Neural Information Pro-
cessing Systems 27 (NIPS 2014). URL: http://amin.kareemx.com/pubs/
AminRostamizadehSyedNIPS2014.pdf.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case,
C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen,

Z., Chrzanowski, M., Coates, A., Diamos, G., & et al (2016). Deep speech 2 :
End-to-end speech recognition in english and mandarin. In Proceedings of The
33rd International Conference on Machine Learning (pp. 173–182). New York,
New York, USA: PMLR volume 48 of Proceedings of Machine Learning Research.
URL: http://proceedings.mlr.press/v48/amodei16.pdf.

Anderson, T. W., & Taylor, J. B. (1976). Strong consistency of least squares


estimates in normal linear regression. The Annals of Statistics, 4 , 788–790.
doi:10.1214/aos/1176343552.

Anderson, T. W., & Taylor, J. B. (1979). Strong consistency of least squares es-
timates in dynamic models. The Annals of Statistics, 7 , 484–489. doi:10.1214/
aos/1176344670.

Antonio, K., & Valdez, E. A. (2012). Statistical concepts of a priori and a posteriori
risk classification in insurance. AStA Advances in Statistical Analysis, 96 , 187–
224. doi:10.1007/s10182-011-0152-7.

Asmussen, S., & Albrecher, H. (2010). Ruin Probabilities volume 14 of Ad-


vanced Statistical Science & Applied Probability. (2nd ed.). Singapore; Hack-
ensack, N.J.: World Scientific. URL: https://ebookcentral.proquest.com/
lib/manchester/detail.action?docID=731122.

Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs.


Journal of Machine Learning Research, 3 , 397–422. doi:10.1109/SFCS.2000.
892116.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the
multiarmed bandit problem. Machine Learning, 47 , 235–256. doi:10.1023/A:
1013689704352.

Ban, G.-Y., & Keskin, N. B. (2020). Personalized dynamic pricing with ma-
chine learning: High dimensional features and heterogeneous elasticity. Manage-
ment Science, . URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_
id=2972985. Forthcoming.

Bartlett, M. S. (1951). An inverse matrix adjustment arising in discriminant analysis.
Annals of Mathematical Statistics, 22 , 107–111. doi:10.1214/aoms/1177729698.

Baxter, J., Bartlett, P. L., & Weaver, L. (2001). Experiments with infinite-horizon,
policy-gradient estimation. Journal of Artificial Intelligence Research, 15 , 351–
381. doi:10.1613/jair.807.

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, Vols. I and


II . (4th ed.). Athena Scientific. URL: http://www.athenasc.com/dpbook.html.

Bertsekas, D. P., & Perakis, G. (2006). Dynamic pricing: A learning approach. In


S. Lawphongpanich, D. Hearn, & M. Smith (Eds.), Mathematical and Computa-
tional Models for Congestion Charging (pp. 45–79). Boston, MA: Springer volume
101 of Applied Optimization. doi:10.1007/0-387-29645-X_3.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific. URL: http://athenasc.com/ndpbook.html.

Besbes, O., & Zeevi, A. (2009). Dynamic pricing without knowing the demand
function: Risk bounds and near-optimal algorithms. Operations Research, 57 ,
1407–1420. URL: http://www.jstor.org/stable/25614853.

den Boer, A. (2013). Dynamic Pricing and Learning. Ph.D. thesis VU University of
Amsterdam. URL: http://dare.ubvu.vu.nl/bitstream/handle/1871/39660/
dissertation.pdf.

den Boer, A. (2015). Dynamic pricing and learning: Historical origins, current
research, and new directions. Surveys in operations research and management
science, 20 , 1–18. doi:10.1016/j.sorms.2015.03.001.

den Boer, A., & Zwart, B. (2014a). Mean square convergence rates for maximum quasi-likelihood estimators. Stochastic Systems, 4, 375–403. doi:10.1287/12-SSY086.

den Boer, A., & Zwart, B. (2014b). Simultaneously learning and optimizing using
controlled variance pricing. Management Science, 60 , 770–783. doi:10.1287/
mnsc.2013.1788.

144
de Boer, P.-T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2003). A tutorial on the cross-entropy method. Annals of Operations Research, 134, 19–67. doi:10.1007/s10479-005-5724-z.

Botev, Z. I., Kroese, D. P., Rubinstein, R. Y., & L'Ecuyer, P. (2013). The cross-entropy method for optimization. In Handbook of Statistics – Machine Learning: Theory and Applications (pp. 35–59). Elsevier volume 31. doi:10.1016/B978-0-444-53859-8.00003-5.

Brochu, E., Cora, V. M., & de Freitas, N. (2010). A tutorial on bayesian op-
timization of expensive cost functions, with application to active user mod-
eling and hierarchical reinforcement learning. CoRR, abs/1012.2599 . URL:
http://arxiv.org/abs/1012.2599.

Brockman, M., & Wright, T. (1992). Statistical motor rating: making efficient use
of your data. Journal of the Institute of Actuaries, 119 , 457–543. doi:10.1017/
S0020268100019995.

Broder, J., & Rusmevichientong, P. (2012). Dynamic pricing under a general para-
metric choice model. Operations Research, 60 , 965–980. doi:10.1287/opre.1120.
1057.

Brooks, C. H., Fay, S. A., Das, R., MacKie-Mason, J. K., Kephart, J. O., & Durfee, E. H. (1999). Automated strategy searches in an electronic goods market: learning and complex price schedules. In Proceedings of the First ACM Conference on Electronic Commerce (EC-99), Denver, CO, USA (pp. 31–40). doi:10.1145/336992.337000.

Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 1–122. doi:10.1561/2200000024.

Business Insider (2018). Amazon changes prices on its products about every 10
minutes—here’s how and why they do it. https://www.businessinsider.com/
amazon-price-changes-2018-8.

Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data. Econometric Society Monographs (2nd ed.). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139013567.

Chang, Y.-c. I. (1999). Strong consistency of maximum quasi-likelihood estimate in generalized linear models via a last time. Statistics & Probability Letters, 45, 237–246. doi:10.1016/S0167-7152(99)00063-2.

Chapelle, O., & Li, L. (2012). An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24 (pp. 2249–2257). Curran Associates, Inc. URL: http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.

Chen, K., Hu, I., & Ying, Z. (1999). Strong consistency of maximum quasi-likelihood
estimators in generalized linear models with fixed and adaptive designs. The
Annals of Statistics, 27 , 1155–1163. doi:10.1214/aos/1017938919.

Chen, X., Owen, Z., Pixton, C., & Simchi-Levi, D. (2015). A statistical learning approach to personalization in revenue management. Available at SSRN 2579462.

Chow, Y. S. (1965). Local convergence of martingales and the law of large num-
bers. The Annals of Mathematical Statistics, 36 , 552–558. doi:10.1214/aoms/
1177700166.

Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear
payoff functions. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics (pp. 208–214). volume 15. URL: http://
proceedings.mlr.press/v15/chu11a.html.

Cohen, M., Lobel, I., & Paes Leme, R. (2016). Feature-based dynamic pricing.
ACM Conference on Economics & Computation (EC), 15 , 40–44. URL: http:
//dx.doi.org/10.2139/ssrn.2737045.

Collins, A., & Thomas, L. (2012). Comparing reinforcement learning approaches for
solving game theoretic models: a dynamic airline pricing game example. Journal
of the Operational Research Society, 63 , 1165–1173. doi:10.1057/jors.2011.94.

Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization
under bandit feedback. In 21st Annual Conference on Learning Theory (COLT)
(pp. 355–366). URL: http://colt2008.cs.helsinki.fi/papers/80-Dani.pdf.

Desautels, T., Krause, A., & Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15, 4053–4103. URL: http://jmlr.org/papers/volume15/desautels14a/desautels14a.pdf.

Dionne, G., & Vanasse, C. (1989). A generalization of automobile insurance rating models: The negative binomial distribution with a regression component. ASTIN Bulletin, 19, 199–212. doi:10.2143/AST.19.2.2014909.

Drygas, H. (1976). Weak and strong consistency of the least squares estimators
in regression models. Z. Wahrscheinlichkeitstheorie verw Gebiete, 34 , 119–127.
doi:10.1007/BF00535679.

Duistermaat, J. J., & Kolk, J. A. C. (2004). Multidimensional Real Analysis I: Differentiation. Cambridge Studies in Advanced Mathematics. Cambridge: Cambridge University Press.

Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling extremal events
for insurance and finance. Applications of mathematics, 33. Berlin ; London:
Springer.

Fahrmeir, L., & Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics, 13, 342–368. doi:10.1214/aos/1176346597.

Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS 2010) (pp. 586–594). URL: https://sites.ualberta.ca/~szepesva/papers/GenLinBandits-NIPS2010.pdf.

Gallego, G., & van Ryzin, G. (1994a). Optimal dynamic pricing of inventories with
stochastic demand over finite horizons. Management Science, 40 , 999–1020. URL:
http://www.jstor.org.manchester.idm.oclc.org/stable/2633090.

Gallego, G., & van Ryzin, G. (1994b). Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Science, 40, 999–1020.

Gosavi, A., Bandla, N., & Das, T. K. (2002). A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34, 729–742. doi:10.1023/A:1015583703449.

Gupta, M., Ravikumar, K., & Kumar, M. (2002). Adaptive strategies for price
markdown in a multi-unit descending price auction: a comparative study. In
IEEE International Conference on Systems, Man and Cybernetics (pp. 373–378).
volume 1. doi:10.1109/ICSMC.2002.1168003.

Gutjahr, W. J. (2000). A graph-based ant system and its convergence. Future Generation Computer Systems, 16, 873–888. doi:10.1016/S0167-739X(00)00044-3.

Haberman, S., & Renshaw, A. E. (1996). Generalized linear models and actuarial science. Journal of the Royal Statistical Society. Series D (The Statistician), 45, 407–436. doi:10.2307/2988543.

Han, W., Liu, L., & Zheng, H. (2008). Dynamic pricing by multiagent reinforce-
ment learning. In Proceedings of the 2008 International Symposium on Electronic
Commerce and Security ISECS ’08 (pp. 226–229). Washington, DC, USA: IEEE
Computer Society. doi:10.1109/ISECS.2008.179.

Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Institute of International Finance (2016). Innovation in insurance: How technology is changing the industry. https://www.iif.com/portals/0/Files/private/32370132_insurance_innovation_report_2016.pdf.

Javanmard, A., & Nazerzadeh, H. (2019). Dynamic pricing in high-dimensions. Jour-
nal of Machine Learning Research, 20 , 1–49. URL: http://jmlr.org/papers/
v20/17-357.html.

de Jong, P., & Heller, G. Z. (2008). Generalized Linear Models for Insurance Data.
Cambridge: Cambridge University Press.

Joulani, P., György, A., & Szepesvári, C. (2013). Online learning under de-
layed feedback. In Proceedings of the 30th International Conference on Ma-
chine Learning (ICML) (pp. 1453–1461). Atlanta, Georgia, USA. URL: http:
//proceedings.mlr.press/v28/joulani13.pdf.

Joulani, P., György, A., & Szepesvári, C. (2016). Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) (pp. 1744–1750). Phoenix, Arizona, USA. URL: https://sites.ualberta.ca/~pooria/publications/AAAI16-Extended.pdf.

Keskin, N. B., & Zeevi, A. (2014). Dynamic pricing with an unknown demand
model: Asymptotically optimal semi-myopic policies. Operations Research, 62 ,
1142–1167. doi:10.1287/opre.2014.1294.

Kleinberg, R., & Leighton, T. (2003). The value of knowing a demand curve: Bounds
on regret for online posted-price auctions. In 44th Annual IEEE Symposium on
Foundations of Computer Science, 2003. Proceedings. (pp. 594–605). IEEE.

Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A
survey. International Journal of Robotics Research, 32 , 1238–1274. doi:10.1177/
0278364913495721.

Koller, M. (2012). Stochastic Models in Life Insurance. EAA Series. Berlin, Heidel-
berg: Springer Berlin Heidelberg.

Könönen, V. (2006). Dynamic pricing based on asymmetric multiagent reinforcement learning: Research articles. International Journal of Intelligent Systems, 21, 73–98. doi:10.1002/int.v21:1.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc.

Kutschinski, E., Uthmann, T., & Polani, D. (2003). Learning competitive pricing strategies by multi-agent reinforcement learning. Journal of Economic Dynamics and Control, 27, 2207–2218. doi:10.1016/S0165-1889(02)00122-7.

Lai, T. (2003). Stochastic approximation: invited paper. The Annals of Statistics, 31, 391–406. doi:10.1214/aos/1051027873.

Lai, T., & Robbins, H. (1982). Iterated least squares in multiperiod control. Ad-
vances in Applied Mathematics, 3 , 50–73. doi:10.1016/S0196-8858(82)80005-5.

Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6 , 4–22. URL: http://dx.doi.org/10.1016/
0196-8858(85)90002-8.

Lai, T. L., Robbins, H., & Wei, C. Z. (1979). Strong consistency of least squares
estimates in multiple regression. Journal of Multivariate Analysis, 9 , 343–361.
doi:10.1016/0047-259X(79)90093-9.

Lai, T. L., & Wei, C. Z. (1982). Least squares estimates in stochastic regression
models with applications to identification and control of dynamic systems. The
Annals of Statistics, 10 , 154–166. doi:10.1214/aos/1176345697.

Lattimore, T., & Szepesvári, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20–22 April 2017, Fort Lauderdale, FL, USA (pp. 728–737). URL: http://proceedings.mlr.press/v54/lattimore17a.html.

Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks, 8, 98–113. doi:10.1109/72.554195.

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit ap-
proach to personalized news article recommendation. In Proceedings of the 19th
International Conference on World Wide Web WWW’10 (p. 661–670). Associa-
tion for Computing Machinery. doi:10.1145/1772690.1772758.

Li, L., Lu, Y., & Zhou, D. (2017). Provably optimal algorithms for generalized
linear contextual bandits. In Proceedings of the 34th International Conference
on Machine Learning (pp. 2071–2080). volume 70. URL: http://proceedings.
mlr.press/v70/li17c.html.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293–321. doi:10.1007/BF00992699.

Lobo, M. S., & Boyd, S. (2003). Pricing and learning with uncertain demand. In
INFORMS Revenue Management Conference.

Mannor, S., Rubinstein, R. Y., & Gat, Y. (2003). The cross entropy method for
fast policy search. In Proceedings of the Twentieth International Conference on
International Conference on Machine Learning ICML’03 (pp. 512–519). AAAI
Press. URL: https://www.aaai.org/Papers/ICML/2003/ICML03-068.pdf.

Margolin, L. (2005). On the convergence of the cross-entropy method. Annals of Operations Research, 134, 201–214. doi:10.1007/s10479-005-5731-0.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. (2nd ed.).
London: Chapman and Hall.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. doi:10.1007/BF02478259.

McKinsey & Company (2003). The power of pricing. https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/the-power-of-pricing.

McKinsey & Company (2018). Digital insurance in 2018: Driv-
ing real impact with digital and analytics. https://www.
mckinsey.com/industries/financial-services/our-insights/
digital-insurance-in-2018-driving-real-impact-with-digital-and-analytics.

Homem-de Mello, T. (2007). A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing, 19, 313–484. doi:10.1287/ijoc.1060.0176.

Homem-de Mello, T., & Rubinstein, R. Y. (2002). Rare event estimation for static
models via cross-entropy and importance sampling. In Winter Simulation Con-
ference (pp. 1–35). San Diego, CA, USA: IEEE. URL: http://citeseerx.ist.
psu.edu/viewdoc/download?doi=10.1.1.15.6923&rep=rep1&type=pdf.

Miller, B. L. (1968). Finite state continuous time Markov decision processes with an infinite planning horizon. Journal of Mathematical Analysis and Applications, 22, 552–569. doi:10.1016/0022-247X(68)90194-7.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. doi:10.1038/nature14236.

Nasdaq MarketInsite (2019). Technology adoption in the insurance industry, Part 1: Dawn of the revolution. https://www.nasdaq.com/articles/technology-adoption-in-the-insurance-industry-part-1-dawn-of-the-revolution-2019-06-27

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135, 370–384. doi:10.2307/2344614.

Ohlsson, E., & Johansson, B. (2010). Non-Life Insurance Pricing with Generalized
Linear Models. Berlin, Heidelberg: Springer.

Phillips, R. L. (2005). Pricing and revenue optimization. Stanford, Calif.: Stanford
Business Books.

Pike-Burke, C., Agrawal, S., Szepesvári, C., & Grunewalder, S. (2018). Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning (ICML) (pp. 4105–4113). volume 80. URL: http://proceedings.mlr.press/v80/pike-burke18a/pike-burke18a.pdf.

Qiang, S., & Bayati, M. (2016). Dynamic Pricing with Demand Covariates. Working
paper Stanford University Graduate School of Business Stanford, CA. URL:
http://web.stanford.edu/~bayati/papers/dpdc.pdf.

Qiu, Q., & Pedram, M. (1999). Dynamic power management based on continuous-time Markov decision processes. In Proceedings 1999 Design Automation Conference (Cat. No. 99CH36361) (pp. 555–561). doi:10.1109/DAC.1999.781377.

Raju, C., Narahari, Y., & Ravikumar, K. (2006). Learning dynamic prices in elec-
tronic retail markets with customer segmentation. Annals of Operations Research,
143 , 59–75. doi:10.1007/s10479-006-7372-3.

Ramsay, C. M. (2003). A solution to the ruin problem for Pareto distributions. Insurance: Mathematics and Economics, 33, 109–116. doi:10.1016/S0167-6687(03)00147-1.

Rana, R., & Oliveira, S. F. (2014). Real-time dynamic pricing in a non-stationary environment using model-free reinforcement learning. Omega, 47, 116–126. doi:10.1016/j.omega.2013.10.004.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press.

Rosenstein, M. T., & Barto, A. G. (2001). Robot weightlifting by direct policy search. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (pp. 839–846). URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.5890&rep=rep1&type=pdf.

Rothstein, M. (1974). Hotel overbooking as a Markovian sequential decision process. Decision Sciences, 5, 389–404. doi:10.1111/j.1540-5915.1974.tb00624.x.

Rubinstein, R. Y. (1997). Optimization of computer simulation models with rare events. European Journal of Operational Research, 99, 89–112. doi:10.1016/S0377-2217(96)00385-2.

Rubinstein, R. Y. (1999). Rare-event simulation via cross-entropy and importance sampling. Second Workshop on Rare Event Simulation, 99, 1–17.

Rubinstein, R. Y., & Kroese, D. P. (2004). The Cross-Entropy Method. Information Science and Statistics. New York, NY: Springer-Verlag New York. doi:10.1007/978-1-4757-4321-0.

Rudin, W. (1976). Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill. URL: https://books.google.co.uk/books?id=kwqzPAAACAAJ.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. doi:10.1038/323533a0.

Rummery, G. A., & Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England. URL: http://mi.eng.cam.ac.uk/reports/svr-ftp/auto-pdf/rummery_tr166.pdf.

Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35, 395–411. doi:10.1287/moor.1100.0446.

Schmidli, H. (2008). Stochastic Control in Insurance. Probability and Its Applications, 1431-7028. London: Springer London.

van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 177–184). doi:10.1109/ADPRL.2009.4927542.

Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. doi:10.1038/nature16961.

Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287–308. doi:10.1023/A:1007678930559.

Singh, S., Litman, D., Kearns, M., & Walker, M. (2002). Optimizing dialogue
management with reinforcement learning: Experiments with the njfun system.
Journal of Artificial Intelligence Research, 16 , 105–133. doi:10.1613/jair.859.

Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58, 389–434. doi:10.1109/TIT.2011.2182033.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8 (pp. 1038–1044). MIT Press. URL: http://incompleteideas.net/papers/sutton-96.pdf.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning (2nd ed.). Cambridge, MA: MIT Press. URL: http://incompleteideas.net/book/RLbook2018.pdf.

Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) (pp. 1057–1063). URL: https://homes.cs.washington.edu/~todorov/courses/amath579/reading/PolicyGradient.pdf.

Talluri, K., & van Ryzin, G. (2005). The theory and practice of revenue management.
Boston, MA: Springer. doi:10.1007/b139000.

The Wall Street Journal (2015). Now prices can change
from minute to minute. https://www.wsj.com/articles/
now-prices-can-change-from-minute-to-minute-1450057990.

Vershynin, R. (2012). How close is the sample covariance matrix to the actual
covariance matrix? Journal of Theoretical Probability, 25 , 655–686.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press. URL: https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf.

Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292. doi:10.1007/BF00992698.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King’s College, Cambridge, UK. URL: http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61, 439–447. doi:10.2307/2334725.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 IRE WESCON Convention Record, (pp. 96–104). doi:10.5555/65669.104390.

Williams, D. (1991). Probability with Martingales. Cambridge Mathematical Textbooks. Cambridge University Press. doi:10.1017/CBO9780511813658.

Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74, 799–806. doi:10.1080/01621459.1979.10481033.

Wüthrich, M. V. (2017). Non-life insurance: Mathematics & statistics. URL: https://ssrn.com/abstract=2319328.

Wüthrich, M. V., & Buser, C. (2017). Data analytics for non-life insurance pricing. Swiss Finance Institute Research Paper No. 16-68. URL: https://ssrn.com/abstract=2870308.

Zhang, S. G., & Liao, Y. (1999). On some problems of weak consistency of quasi-
maximum likelihood estimates in generalized linear models. Science in China
Series A: Mathematics, 51 , 1287–1296. doi:10.1007/s11425-007-0172-7.
