DYNAMIC PRICING WITH APPLICATION TO INSURANCE
2020
Yuqing Zhang
Department of Mathematics
Contents

Abstract
Declaration
Copyright Statement
Acknowledgements
Nomenclature

1 Introduction

2 Background material
  2.1 Notation and terminology
  2.2 Generalized linear models
    2.2.1 Overview
    2.2.2 Parameter estimation
    2.2.3 Example: GLMs in non-life insurance
  2.3 Strong consistency of estimates

  3.4.3 Adaptive GLM pricing policy
  3.4.4 Main results
  3.5 Gaussian process pricing model
    3.5.1 Model and assumptions
    3.5.2 Adaptive GP pricing policy
    3.5.3 Main results
  3.6 GLM and GP algorithms with unknown delays
    3.6.1 Model and assumptions
    3.6.2 Adaptive GLM pricing with unknown delays
    3.6.3 Adaptive GP pricing with unknown delays
  3.7 Numerical examples in insurance
    3.7.1 Adaptive GLM pricing without delays
    3.7.2 Adaptive GP pricing without delays
    3.7.3 Adaptive GLM and GP algorithms with delays
    3.7.4 Comparison of GLM and GP algorithms with and without delays
  3.8 Conclusions and future directions
  3.9 Appendices

4 Perturbed pricing
  4.1 Introduction
    4.1.1 Overview
    4.1.2 Contributions
    4.1.3 Related literature
  4.2 Problem formulation
    4.2.1 Model and assumptions
    4.2.2 Perturbed certainty equivalent pricing
  4.3 Main results
    4.3.1 Additional notations
    4.3.2 Key additional results
    4.3.3 Proof of Theorem 4.3.1
  4.4 Numerical experiments
  4.5 Discussion, conclusions and future directions
  4.6 Appendices
List of Tables

2.1 Commonly used link functions
List of Figures

3.2 Cumulative regret and convergence rate for the GLM pricing algorithm. GLM denotes the non-delayed case and D-GLM denotes the delayed case.
3.3 Cumulative regret and convergence rate for the GP pricing algorithm. GP denotes the non-delayed case and D-GP denotes the delayed case.
5.2 A feedforward NN with n input units, m output units, and two hidden layers.
5.3 Performance of Q-learning, Sarsa and Expected Sarsa. The curves display the T-period regret for different step-sizes α = 0.01, 0.03, 0.1, 0.3, 1.0, where the periods T are 10^6, 10^6 and 5 × 10^5 respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1.
5.4 A comparison of Q-learning, Sarsa and Expected Sarsa as a function of α. The three curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1.
5.5 Performance of Q-learning, Sarsa and Expected Sarsa for different exploration rates ε = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0. The curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9 and α = 0.5.
5.6 A comparison of T-period regret of Q-learning, Sarsa and Expected Sarsa. The three curves display the T-period regret in period 10^5, respectively. In all cases, the problem parameters used are γ = 0.9, ε = 0.1 and α = 0.5.
5.7 Pricing policy. The curve is the average reward for varying prices, and the green area is the CE convergence zone. The problem parameters are N = 100, T = 100 and ρ = 10%.
5.8 A comparison of average reward for CE policies with 90%, 95% and 99% percentiles. In all cases, the problem parameters are N = 100 and T = 100.
5.9 Performance of DQN using Keras and Q-learning. The curves display the T-period regret in period 20000, respectively. In both cases, the problem parameters used are constant γ = 0.9, α = 0.5 and a decaying ε.
The University of Manchester
Yuqing Zhang
Doctor of Philosophy
Dynamic Pricing with Application to Insurance
December 15, 2020
E-commerce has grown explosively in the last decade and dynamic pricing is one
of the main drivers of this growth. Due to digitization and technology advances,
companies are able to gather information about a product’s features, particularly
in relation to pricing, and then dynamically improve their pricing decisions in or-
der to maximize revenue over time. However, when a company sells a new product
online, there is often little information about the demand. This thesis aims to inves-
tigate dynamic pricing with unknown demand, i.e., how can a company dynamically
learn the impact of prices or other context on product demand, and simultaneously
maximize long-run revenue?
We first focus on the non-life insurance industry. Compared with other financial businesses, the insurance industry has been relatively slow to adopt new technologies, and dynamic pricing for insurance problems has only rarely been considered before. We consider two adaptive models for demand and claims—a generalized
linear pricing model and a Gaussian process pricing model, based on the work of
den Boer & Zwart (2014b), Srinivas et al. (2012) and Joulani et al. (2016). Here, neither demand nor claims are known to the company. In the real world, claims are often delayed: they are triggered only when the insured events happen, and so are not paid out immediately when an insurance product is purchased. We first show
how these methods can be applied in a simple insurance setting without any delays,
and then we extend them to the setting with delayed claims. Our study shows that
dynamic pricing is potentially applicable to the non-life insurance pricing problem.
We then propose a simple randomized rule for the optimization of prices in
revenue management with contextual information. A popular method—certainty
equivalent pricing—treats parameter estimates as certain and then separately opti-
mizes prices, but this is well-known to be sub-optimal. To overcome this problem,
we advocate a different approach: pricing according to a certainty equivalent rule
with a small random perturbation, and call this perturbed certainty equivalent pric-
ing or perturbed pricing. We show that if the magnitude of the perturbation is
chosen well, our new perturbed pricing performs comparably with the best pricing
strategies. Furthermore, we establish a new lower bound on the eigenvalues of the design matrix.
Finally, we study the application of reinforcement learning to the insurance pric-
ing problem. Reinforcement learning focuses on learning how to make sequential
decisions in environments with unknown dynamics and it has been successfully ap-
plied to a wide range of problems in many areas. We extend the insurance model
from before to the case where the company pays dividends to shareholders and ruin
probability is involved. After reviewing the basics of reinforcement learning, we model the insurance pricing problem as a Markov decision process and then apply
reinforcement-learning based techniques to solve the pricing problem. The numeri-
cal simulation shows that reinforcement learning could be a useful tool for solving
pricing problems in the insurance context, where available information is limited.
Declaration
No portion of the work referred to in the thesis has been
submitted in support of an application for another degree
or qualification of this or any other university or other
institute of learning.
Copyright Statement
i. The author of this thesis (including any appendices and/or schedules to this thesis)
owns certain copyright or related rights in it (the “Copyright”) and s/he has given
The University of Manchester certain rights to use such Copyright, including for
administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to time.
This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other intel-
lectual property (the “Intellectual Property”) and any reproductions of copyright
works in the thesis, for example graphs and tables (“Reproductions”), which may
be described in this thesis, may not be owned by the author and may be owned
by third parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of the owner(s)
of the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487),
in any relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s Policy on Presentation of Theses.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my supervisor Neil
Walton, for his continued patience, guidance, advice, and support at every stage of
my doctoral studies. I have greatly benefited from our meetings and conversations,
and the results in this work would never have been obtained without him. I am
thankful to my financial supporters: to China Scholarship Council for my living
stipend and to the University of Manchester for covering my tuition fees.
I am deeply grateful for the support, care and encouragement of my friends in
the Alan Turing building, especially Bindu, Chen, and Davide, who were always
available on my tough days. Next, I wish to thank all those who I had the pleasure
to share offices with. Thank you Clement, Dan, Dalal, Helena, Gian Maria, Michael,
Monica for amazing discussions (on mathematics and everything else), and especially
Tom for reading my thesis draft and giving me helpful advice.
Finally, I would like to thank my family—mum, dad and my little brother—for
standing by me and for being always supportive throughout my life, even though
they don’t really know what I do.
Nomenclature
Acronyms
CE Cross-entropy
GP Gaussian process
RL Reinforcement learning
TD Temporal-difference
Greek Symbols
α Step-size parameter
Γ Gamma function
γ Discount factor
µ Mean function
Ω Lower bound
π Policy
σ Variance
Other Symbols
E Expectation
1 Indicator function
P Probability measures
Roman Symbols
F, H Filtration
R Set of all possible rewards
O Upper bound
w Vector of weights
a An action
r Revenue
s, s′ States
Chapter 1
Introduction
Digital technologies are fundamentally changing how businesses operate across all
industries. The rapid growth of information technology and the internet allows
companies to gather information about a product’s features, particularly in relation
to pricing, and respond to the information very quickly and effectively. Quickly
adapting to real-time data from e.g., the Internet of Things (IoT) and social media,
combined with valuable historic data, enables companies to not only improve their
core operations but to launch entirely new products.
The objective of the company is to maximize revenue over time. Pricing correctly is the fastest and most effective way to achieve this objective (McKinsey
& Company, 2003). In a simple world with perfect knowledge, finding the optimal
price, i.e., the price that maximizes revenue, is straightforward. This can be achieved
by algebraic calculation. However, in the real world, companies do not always have
enough information about the underlying demand to make pricing decisions.
This thesis aims to solve the following problem: how can a company set prices that both maximize long-run revenue and efficiently estimate the distribution of demand? To
address this question, we investigate dynamic pricing; i.e., the study of how demand
responds to prices in a changing environment. Dynamic pricing has been successfully
applied in many industries such as airline ticketing, hotel bookings, car rentals, and fashion; for more details, we refer to the textbooks of Phillips (2005) and Talluri
& van Ryzin (2005). Nowadays many other industries have realized the benefits
of dynamic pricing including taxi services, sports complexes and even zoos (The
Wall Street Journal, 2015). Most notably, Amazon changes prices on its products
about every 10 minutes based on customers’ shopping patterns, competitors’ prices,
profit margins, inventory, and other factors (Business Insider, 2018). The success of
Amazon and other online sellers shows the advantages of using dynamic pricing in
e-commerce and even brick-and-mortar retail to optimize pricing and increase
revenues.
Specifically, we consider a monopolist who sells a single new product over a finite
time horizon. The company makes its pricing decisions based on the history of past
prices and demands. We also consider the context, in which the items are sold,
such as who is viewing and what search criteria they used. When the company
launches a new product online, there is often little information available about the
distribution of demand, or the relationship between demand and price, but this
can be learned through observations over time. A trade-off between exploration
(learning) and exploitation (earning) is natural in dynamic pricing decisions. The
company needs to balance exploitation—choosing prices that gave the best reward in
the past—and exploration—choosing prices that potentially yield higher reward in
the future. Online selling enables companies to observe demand for their products in
real time, and dynamically adjust the price of their products. This can help improve
pricing accuracy and efficiency, and make pricing more convenient and transparent.
This thesis investigates how a company in e-commerce can maximize revenue
by applying dynamic pricing. All chapters are self-contained and have detailed
introductions. The literature review in each chapter will provide a detailed survey
of work directly related to this thesis. We now describe the structure and results of
this thesis as follows.
Chapter 2. This chapter collects the background material used throughout the thesis: notation and terminology, an overview of generalized linear models and their parameter estimation, and a review of the work which established the conditions under which strong consistency is assured in generalized linear models.
Chapter 3. In this chapter, we consider the dynamic pricing problem in the non-
life insurance setting, based on the work of den Boer & Zwart (2014b), Srinivas
et al. (2012) and Joulani et al. (2016). The non-life insurance product consists of
demand and heavy-tailed claims, the functions of which are not known
to the company. We focus on two adaptive approaches—generalized linear models
and Gaussian process regression models. Parameter estimation is conducted by
maximum quasi-likelihood estimation, which is an extension of maximum-likelihood
estimation. In the real world, claims are only triggered when the insured events
happen so are not paid out immediately when an insurance product is purchased.
Therefore, we investigate pricing algorithms both with and without delayed claims.
Our objective is to choose prices that maximize the revenue. The main challenge
here is that the revenue is unknown. Thus we use regret to measure the performance
of policies, where regret is the expected revenue loss relative to the optimal policy.
The objective now is to minimize the regret. We derive asymptotic upper bounds
on the regret that hold for adaptive pricing policies. A simple example shows that
both the GLM and GP pricing policies perform well.
Chapter 4. In this chapter, we consider a company that sells new products on-
line. When a product is sold online, the demand, and thus prices, depend on the
context in which the items are sold, such as who is viewing and what search crite-
ria they used. The objective of the company remains the same: to choose prices that
maximize long-run revenue and efficiently estimate the distribution of demand. We
advocate a different approach: pricing according to a certainty equivalent rule with
a small random perturbation. We call this perturbed certainty equivalent pricing, or
perturbed pricing, for short. Estimation is then conducted according to a standard
maximum likelihood objective.
We show that the convergence of the perturbed certainty equivalent pricing is
optimal up to logarithmic factors. The key advantage of the perturbed certainty
equivalent pricing is its simplicity—we perturb the data not the optimization. If the
17
magnitude of the perturbation is chosen well, then our results suggest that perturbed
pricing performs comparably with the best pricing strategies. This policy is also
flexible in leveraging contextual information, which is an important best-practice in
many online marketplaces and recommendation systems.
Chapter 2
Background material
This chapter serves three main purposes. First, we establish the notation and ter-
minology that will be used in the following chapters. Second, we briefly review the
main concepts that are needed for the work in Chapters 3 and 4. In the last section,
we introduce theoretical results that will be necessary later on and provide relevant
literature for interested readers.
Sets and orders. The set of natural numbers is denoted by N, the set of positive
integers by Z+ , the set of real numbers by R, and the set of positive real numbers
by R+ . The set of real n-dimensional vectors is denoted by Rn , and the set of real
matrices with m rows and n columns is denoted by Rm×n .
It is often useful to talk about the rate at which some function changes as its argument grows (or shrinks), without worrying too much about the detailed form. The notation $O(\cdot)$ denotes "of the same order or smaller", while $o(\cdot)$ denotes "ultimately smaller than (or negligible compared to)". For example, $g(n) = O(f(n))$ means that $|g(n)| \leq c|f(n)|$ for some constant $c$ and all sufficiently large $n$, while $g(n) = o(f(n))$ means that $|g(n)|/|f(n)| \to 0$ as $n \to \infty$.
Vectors and matrices. A matrix $A \in \mathbb{R}^{m \times n}$ can be written as
$$A = (a_{ij}) = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$$
where $a_{ij}$ $(= (A)_{ij})$ is the entry of $A$ in the $i$th row and $j$th column. We denote by $A^\top \in \mathbb{R}^{n \times m}$ the transpose of $A$, such that $(A^\top)_{ij} = (A)_{ji}$. There are several special matrices. For example, a square matrix is called a diagonal matrix if $a_{ij} = 0$ when $i \neq j$; a square matrix is called the identity matrix if $a_{ij} = 0$ when $i \neq j$ and $a_{ij} = 1$ when $i = j$; a matrix is called a block matrix if its elements are partitioned according to a block pattern and the blocks along the diagonal are square.
A vector norm is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ which, for all vectors $x, y \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$, satisfies the following conditions:

1. $\|x\| \geq 0$, with $\|x\| = 0$ if and only if $x = 0$.
2. $\|\alpha x\| = |\alpha| \|x\|$.
3. $\|x + y\| \leq \|x\| + \|y\|$.

Similarly, a matrix norm is a function $\|\cdot\| : \mathbb{R}^{m \times n} \to \mathbb{R}$ which, for all matrices $A, B$ and $\alpha \in \mathbb{R}$, satisfies:

1. $\|A\| \geq 0$, with $\|A\| = 0$ if and only if $A = 0$.
2. $\|\alpha A\| = |\alpha| \|A\|$.
3. $\|A + B\| \leq \|A\| + \|B\|$.
Given a vector norm $\|\cdot\|$, the corresponding operator norm (often called the subordinate or induced matrix norm) on $\mathbb{R}^{m \times n}$ is defined by
$$\|A\| = \max_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|}{\|x\|}.$$
When the underlying vector norm is the Euclidean norm, the induced norm $\|A\|_2$ is called the spectral norm, which is the most relevant to this thesis. For a positive definite matrix $A \in \mathbb{R}^{n \times n}$, we have $\|x\|_A = \sqrt{x^\top A x}$. If $A \in \mathbb{R}^{n \times n}$, the operator norm becomes
$$\|A\| = \max_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|}{\|x\|} = \max_{x \in S^{n-1}} \|Ax\| = \max_{x, y \in S^{n-1}} x^\top A y,$$
where $S^{n-1}$ denotes the unit sphere in $\mathbb{R}^n$.
For example, the Euclidean inner product over $V = \mathbb{R}^n$, which is a function $\langle \cdot, \cdot \rangle : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, is defined by
$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i = x^\top y = y^\top x.$$
The Cauchy–Schwarz inequality states that $|\langle x, y \rangle| \leq \|x\| \, \|y\|$. Moreover, equality holds if and only if $x$ and $y$ are linearly dependent. (This latter statement is sometimes called the "converse of Cauchy–Schwarz.")
Eigenvalues. A scalar $\lambda$ is an eigenvalue of $A \in \mathbb{R}^{n \times n}$, with associated eigenvector $x \neq 0$, if
$$Ax = \lambda x,$$
and we write $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ for the minimum and maximum eigenvalues of a symmetric matrix $A$. For symmetric matrices $A_1, A_2$ and scalars $a_1, a_2 \geq 0$, the minimum eigenvalue satisfies
$$\lambda_{\min}(a_1 A_1 + a_2 A_2) \geq a_1 \lambda_{\min}(A_1) + a_2 \lambda_{\min}(A_2).$$
Schur complements. Let $A \in \mathbb{R}^{n \times n}$ be written as the $2 \times 2$ block matrix
$$A = \begin{pmatrix} B & C \\ D & E \end{pmatrix},$$
where $B$ and $E$ are square. If $B$ is invertible, the Schur complement of the block $B$ in $A$ is $E - D B^{-1} C$.
Trace and determinant. The trace of $A \in \mathbb{R}^{n \times n}$, denoted by $\operatorname{tr}(A)$, is the sum of its diagonal elements, i.e.,
$$\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}.$$
Two key properties are that $\operatorname{tr}(A)$ is the sum of the eigenvalues of $A$ and that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for all $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$. The determinant of a matrix $A \in \mathbb{R}^{n \times n}$, denoted by $\det(A)$ or $|A|$, is the product of the eigenvalues of $A$. Moreover, it can be shown that $|\alpha A| = \alpha^n |A|$ for all $\alpha \in \mathbb{R}$, and that the determinant is multiplicative, i.e., for any $A, B \in \mathbb{R}^{n \times n}$, one has that $|AB| = |A||B|$.
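As a quick numerical illustration of these identities, the following sketch (a minimal check using NumPy; the matrix sizes and random seed are arbitrary choices, not from the thesis) verifies the trace and determinant properties:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # A in R^{3x4}
B = rng.normal(size=(4, 3))   # B in R^{4x3}

# tr(AB) = tr(BA) for rectangular matrices of compatible shapes
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

C = rng.normal(size=(3, 3))
D = rng.normal(size=(3, 3))
alpha = 2.5

# det(alpha C) = alpha^n det(C), and the determinant is multiplicative
assert np.isclose(np.linalg.det(alpha * C), alpha**3 * np.linalg.det(C))
assert np.isclose(np.linalg.det(C @ D), np.linalg.det(C) * np.linalg.det(D))
```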
GLMs extend the classic linear model in two ways. The first is that the response may follow any distribution in the exponential family, such as the Poisson, Normal and Binomial distributions; thus the distribution of the response need not be Normal. The second is that GLMs use a function, called the link function, to connect the linear predictor with the mean of the dependent variables. The link function can be an identity, a log or a power function. When it is an identity function, GLMs become classic linear models.

In GLMs, a commonly used technique to find the parameters of the model is maximum likelihood estimation. This method is simple, consistent and normally works well enough. However, maximum likelihood estimation requires complete information, which is not always available. To address this problem, Wedderburn (1974) proposed quasi-likelihood estimation, an extension of likelihood estimation for which only the first two moments of the observations are needed.
2.2.1 Overview

GLMs are popular statistical models that are often used in the framework of an exponential family. There are three elements in GLMs. Assume $y_1, \ldots, y_n$ are realizations of the random variables $Y_1, \ldots, Y_n$. The first element is that the response $y_i$ belongs to the exponential family of distributions if its density function can be written as
$$f(y_i; \theta_i, \phi) = \exp\left( \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right), \qquad (2.2.1)$$
where $\theta_i$ is the canonical (or scale) parameter which changes with $i$, and $\phi > 0$ is the dispersion parameter. Functions $a(\cdot), b(\cdot), c(\cdot)$ are fixed and known for all $i = 1, \ldots, n$, and $b(\cdot)$ is assumed twice continuously differentiable with invertible first derivative. Usually, we have that $a(\phi) = \phi$ or $a(\phi) = \phi/w_i$ for a known weight $w_i$. The mean and variance of $y_i$ are
$$\mathbb{E}[y_i] = \mu_i = b'(\theta_i), \qquad \operatorname{Var}(y_i) = a(\phi) \, b''(\theta_i),$$
where $b'(\theta_i), b''(\theta_i)$ are the first and second derivatives of $b(\theta_i)$ w.r.t. $\theta_i$. Since $\mu_i$ depends on $\theta_i$, we may write the variance as
$$\operatorname{Var}(y_i) = a(\phi) \, V(\mu_i),$$
where $V(\cdot)$ is called the variance function. This function captures the relationship between the mean and variance of $y_i$.
Table 2.1: Commonly used link functions

Distribution | Link     | $\eta_i = g(\mu_i)$                         | $\mu_i = g^{-1}(\eta_i)$
Normal       | identity | $\mu_i$                                     | $\eta_i$
Poisson      | log      | $\log(\mu_i)$                               | $\exp(\eta_i)$
Binomial     | logit    | $\log\big(\frac{\mu_i}{1-\mu_i}\big)$       | $\frac{\exp(\eta_i)}{1+\exp(\eta_i)}$

The second element is the linear predictor $\eta_i = x_i^\top \beta$, and the third is the link function $g$, which connects the mean of the response with the linear predictor via
$$g(\mu_i) = \eta_i.$$
Once $\eta_i$ is determined, then so is $\mu_i$ and also $\theta_i$. Note that the classic linear model is a special case of GLMs where the response variable is typically assumed to be normally distributed with an identity link function, i.e., $g(\mu_i) = \mu_i$.

Some commonly used distributions and link functions are summarized in Table 2.1. Among those, the log- and logit-link functions are normally used for insurance, because they give a multiplicative model structure and ensure the predictions are always positive. If the link satisfies $g(\mu_i) = \theta_i$, so that $\eta_i = \theta_i$, then $g$ is called the canonical link function. The canonical link function is often used because it simplifies the mathematical analysis (e.g. see Wedderburn (1974)).
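To make the link-function machinery concrete, here is a minimal, self-contained sketch (assuming the NumPy and statsmodels packages are available; the simulated data and coefficients are purely illustrative, not from the thesis) that fits a Poisson GLM with the canonical log link:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 2, size=n)
X = sm.add_constant(x)                 # design matrix with intercept
beta_true = np.array([0.3, -0.8])      # illustrative coefficients

# Poisson response with canonical log link: E[y] = exp(X beta)
mu = np.exp(X @ beta_true)
y = rng.poisson(mu)

# Fit the GLM by maximum likelihood
model = sm.GLM(y, X, family=sm.families.Poisson())
result = model.fit()
print(result.params)   # estimates should be close to beta_true
```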
2.2.2 Parameter estimation

Likelihood estimation

Let $\ell(\beta; y)$ denote the log-likelihood of $\beta$ given observations $y = (y_1, \ldots, y_n)^\top$. Since the dispersion parameter $\phi$ does not affect the maximization of $\ell$, we can write
$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right].$$
The following are examples of the log-likelihood functions for some common distributions. Under the normal model, $y_i \sim N(\mu_i, \sigma^2)$, the log-likelihood function is
$$\ell(\beta; y) = \sum_{i=1}^{n} \left( \frac{y_i \mu_i - \mu_i^2/2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2} \log 2\pi\sigma^2 \right).$$
If $y_i \sim \text{Poisson}(\mu_i)$ then
$$\ell(\beta; y) = \sum_{i=1}^{n} \left( y_i \log(\mu_i) - \mu_i - \log(y_i!) \right).$$
If $y_i \sim \text{Binomial}(m_i, \pi_i)$ then
$$\ell(\beta; y) = \sum_{i=1}^{n} \left( y_i \log \frac{\pi_i}{1 - \pi_i} + m_i \log(1 - \pi_i) \right) + c,$$
where $c$ is independent of $\pi$.

The maximum likelihood estimators (MLEs) of $\beta$, denoted by $\hat{\beta}$, are derived by maximizing the log-likelihood function $\ell$, defined by
$$\hat{\beta} = \arg\max_{\beta} \ell(\beta; y).$$
To find the maximum, we can differentiate $\ell(\beta)$ w.r.t. each $\beta_j$, $j = 1, \ldots, p$, and set all these partial derivatives equal to zero,
$$\frac{\partial \ell}{\partial \beta_j} = 0.$$
Applying the chain rule, we obtain
$$\sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi) V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = \sum_{i=1}^{n} \frac{y_i - g^{-1}(x_i^\top \beta)}{a(\phi) V(g^{-1}(x_i^\top \beta))} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0,$$
since $\mu_i = g^{-1}(x_i^\top \beta)$. The condition for a maximum is that the matrix of second partial derivatives is negative definite.
Quasi-likelihood estimation

Wedderburn (1974) observed that the derivative of the log-likelihood w.r.t. the mean satisfies
$$\mathbb{E}\left[ \frac{\partial \ell(\beta)}{\partial \mu_i} \right] = 0, \qquad -\mathbb{E}\left[ \frac{\partial^2 \ell(\beta)}{\partial \mu_i^2} \right] = \frac{1}{a(\phi) V(\mu_i)}.$$
This motivates working with the quasi-score
$$\frac{\partial \ell}{\partial \mu_i} = \frac{y_i - \mu_i}{a(\phi) V(\mu_i)},$$
which has several common properties with the log-likelihood derivative. Then we define the quasi-likelihood for each $y_i$ as the integral
$$q(\mu_i; y_i) = \int_{y_i}^{\mu_i} \frac{y_i - \nu}{a(\phi) V(\nu)} \, d\nu,$$
and for the whole sample
$$q(\mu; y) = \sum_{i=1}^{n} q(\mu_i; y_i).$$
The function $q(\mu; y)$ is called the quasi-likelihood, or more correctly the log quasi-likelihood. Similarly, the maximum quasi-likelihood estimators (MQLEs) of $\beta$, denoted by $\hat{\beta}$, are derived by maximizing the quasi-likelihood $q$. This is equivalent to solving
$$\frac{\partial q}{\partial \beta_j} = 0.$$
Applying the chain rule gives
$$\sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi) V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi) V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0,$$
where $j = 1, \ldots, p$. In the special case where $g(\mu)$ is the canonical link function for a GLM, the above equation simplifies to
$$\sum_{i=1}^{n} \frac{y_i - \mu_i(\beta)}{a(\phi)} \cdot x_{ij} = 0.$$
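For intuition, the canonical-link score equations above can be solved numerically by Newton's method. The following is a minimal sketch (NumPy only; the data and iteration count are illustrative assumptions) for the Poisson/log-link case, where $\mu_i(\beta) = \exp(x_i^\top \beta)$:

```python
import numpy as np

def poisson_score(beta, X, y):
    """Score equations: sum_i (y_i - exp(x_i' beta)) x_i."""
    mu = np.exp(X @ beta)
    return X.T @ (y - mu)

def poisson_neg_hessian(beta, X):
    """Negative Hessian of the log-likelihood: X' diag(mu) X."""
    mu = np.exp(X @ beta)
    return X.T @ (X * mu[:, None])

def newton_mle(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        step = np.linalg.solve(poisson_neg_hessian(beta, X),
                               poisson_score(beta, X, y))
        beta = beta + step
    return beta

# Illustrative data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
print(newton_mle(X, y))   # close to (0.5, 1.0)
```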
2.2.3 Example: GLMs in non-life insurance

A non-life insurance product is a promise to make certain payments (i.e., the insurance claims) for unpredictable losses under certain conditions (where by law an element of chance/uncertainty must play a role) to the client—the policyholder—during a time period (typically one year), in exchange for a certain fee (i.e., the premium) paid by the client at the start of the contract.
In non-life insurance, the total claims can be expressed as the claim frequency
times the severity. The claim frequency is the number of claims during a specific time
period (e.g., one year) and the claim severity is the size of individual claims. We can
model the claim frequency and the claim severity separately. A GLM with a Poisson
distribution is usually used to model claim frequency, and a gamma distribution is
usually used to model severity. Maximum likelihood estimation is used to find model
parameters.
The number of claims that occur for insurance policy i during a policy period,
denoted by Ni , is often assumed to follow a Poisson distribution with mean λi (e.g.
see Dionne & Vanasse (1989); de Jong & Heller (2008); Antonio & Valdez (2012);
Cameron & Trivedi (2013)), given by Ni ∼ Poisson(λi ). The claim frequency λi is
“explained” by a set of observable variables/features, xi , via a link function of the
form
$$g(\lambda_i) = x_i^\top \beta.$$
Using a Poisson distribution implies that the mean is equal to the variance so that
$\mathbb{E}[N_i] = \operatorname{Var}(N_i) = \lambda_i$. A popular choice for $g(\lambda_i)$ is the log-link function, $\log(\lambda_i) = x_i^\top \beta$, which guarantees that $\lambda_i = \exp(x_i^\top \beta)$ is positive and turns the multiplicative model into an additive form.
The MLE of $\beta$ is obtained by setting the partial derivatives of the log-likelihood function to zero,
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} x_{ij} (y_i - \lambda_i) = \sum_{i=1}^{n} x_{ij} \left( y_i - \exp(x_i^\top \beta) \right) = 0,$$
for $j = 1, \ldots, p$.
In insurance, the size of claims is non-negative and generally has a long tail to
the right. The gamma distribution is often used to model the claim size, because
gamma random variables are continuous, non-negative and skewed to the right, with
the possibility of large values in the upper tail.
We let $N_i$ be the number of claims on policy $i$, and $c_{i1}, c_{i2}, \ldots, c_{iN_i}$ be the individual claim sizes for the $N_i$ observed claims on policy $i$, for $i = 1, \ldots, n$. The
individual claim size is assumed to be independently gamma distributed. The probability density function is given by
$$f(c_i) = \frac{1}{\Gamma(\nu)} \left( \frac{\nu c_i}{\mu_i} \right)^{\nu} \exp\left( -\frac{\nu c_i}{\mu_i} \right) \frac{1}{c_i},$$
with mean $\mathbb{E}[c_i] = \mu_i$ and variance $\operatorname{Var}(c_i) = \mu_i^2/\nu$. Then the log-likelihood function is
$$\ell(\beta) = \log \prod_{i=1}^{n} \prod_{k=1}^{N_i} \frac{1}{\Gamma(\nu)} \left( \frac{\nu c_{ik}}{\mu_i} \right)^{\nu} \exp\left( -\frac{\nu c_{ik}}{\mu_i} \right) \frac{1}{c_{ik}}
= n(\nu \ln \nu - \ln \Gamma(\nu)) + \sum_{i=1}^{n} \sum_{k=1}^{N_i} \left( (\nu - 1) \ln c_{ik} + \frac{-c_{ik}\,\mu_i^{-1} - \ln \mu_i}{\nu^{-1}} \right).$$
Under the log link, $\mu_i = \exp(x_i^\top \beta)$, setting the partial derivatives to zero gives
$$\frac{\partial \ell(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} \nu \left( \frac{c_i}{\mu_i} - 1 \right) x_{ij} = 0.$$
For more details, we refer readers to Chapter 2 in Ohlsson & Johansson (2010) and
Sections 7.2.1 and 7.3.2 in Wüthrich (2017). The typical data consists of claim
information for a number of policies over a number of years, and a number of ex-
planatory variables such as policyholder-specific features (e.g. age, gender, etc.),
insured object specific features (e.g. in the case of a car: type, power, age, etc.), and
insurance contract specific features (e.g. claims experience). A natural question is
how much premium the insurer should require for an insurance product. There are a
number of pricing principles for calculating premiums, see Wüthrich (2017, Chapter
6). To avoid ruin, the insurance company needs to charge an expected premium
that exceeds the expected total claims.
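As an illustration of how the frequency and severity GLMs combine in practice, the sketch below (assuming NumPy/statsmodels; the data, the single feature and the 20% safety loading are invented for illustration) fits a Poisson frequency model and a gamma severity model, then prices each policy as expected total claims plus a loading:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(18, 70, n)
X = sm.add_constant((age - 40) / 10)        # one standardized feature

# Simulate claim counts and average claim sizes -- illustrative only
counts = rng.poisson(np.exp(-1.0 + 0.2 * X[:, 1]))     # frequency
mu_sev = np.exp(6.0 - 0.1 * X[:, 1])                   # mean severity
has_claim = counts > 0
sev = rng.gamma(shape=2.0, scale=mu_sev[has_claim] / 2.0)

freq = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
sev_glm = sm.GLM(sev, X[has_claim],
                 family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Pure premium = E[N] * E[C]; add a 20% safety loading against ruin
expected_claims = freq.predict(X) * sev_glm.predict(X)
premium = 1.2 * expected_claims
print(premium[:5])
```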
2.3 Strong consistency of estimates

Classical results in this area establish the existence, consistency and asymptotic normality of the MLE of the true parameter under some general conditions. A parameter estimate, $\hat{\beta}$, is said to be (weakly) consistent if it converges, in
probability, to the "true" unknown parameter $\beta_0$ as the sample size $n \to \infty$. We say $\hat{\beta}$ is strongly consistent if, with probability 1 (or almost surely), it converges to $\beta_0$
as sample size n → ∞.
In the classical linear regression model with i.i.d. errors, a necessary and sufficient
condition for weak (Drygas, 1976) and strong (Lai et al., 1979) consistency of the
least squares estimator is
$$\lambda_{\min}\left( \sum_{i=1}^{n} x_i x_i^\top \right) \to \infty.$$
We consider the following regression problem in Lai et al. (1979) and Lai & Wei (1982),
$$y_n = \beta_1 x_{n1} + \cdots + \beta_p x_{np} + \varepsilon_n,$$
for $n = 1, 2, \ldots$. Here $\varepsilon_n$ are unobservable random errors, $\beta_1, \ldots, \beta_p$ are unknown parameters, and $y_n$ is the observed response corresponding to the design levels $x_{n1}, \ldots, x_{np}$. Write the inputs as a vector, $x_n = (x_{n1}, \ldots, x_{np})^\top$. We let $X_n = (x_{ij} : i = 1, \ldots, n,\ j = 1, \ldots, p)$ be the matrix of inputs and $y_n = (y_1, \ldots, y_n)^\top$ be the vector of outputs. Further, we let $b_n$ be the least squares estimate of $\beta = (\beta_1, \ldots, \beta_p)^\top$ given $X_n$ and $y_n$, i.e.,
$$b_n = (X_n^\top X_n)^{-1} X_n^\top y_n = \beta + (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n,$$
where $\varepsilon_n = (\varepsilon_1, \ldots, \varepsilon_n)^\top$. Assume the errors $\varepsilon_i$ form a martingale difference sequence w.r.t. the filtration generated by $\{x_j, y_{j-1} : j \leq i\}$. That is, let $(\mathcal{F}_i)$ be an increasing sequence of $\sigma$-algebras such that $y_i \in \mathcal{F}_i$ and $x_i \in \mathcal{F}_{i-1}$; then $\varepsilon_i$ is $\mathcal{F}_i$-measurable with $\mathbb{E}[\varepsilon_i \mid \mathcal{F}_{i-1}] = 0$ and $\sup_{i \in \mathbb{N}} \mathbb{E}[\varepsilon_i^2] < \infty$.
We define a fixed design where the design vectors xi are fixed, non-random
p-dimensional vectors, and an adaptive design where the xi are sequentially deter-
mined random vectors.
Review of related results. Lai et al. (1979) proved that the least squares estimator for linear regression models is strongly consistent under the following conditions:

C1. $(X_n^\top X_n)^{-1} \to 0$ a.s. as $n \to \infty$.

C2. $\sum_{i=1}^{\infty} c_i \varepsilon_i$ converges a.s. for any sequence of constants $\{c_i\}$ satisfying $\sum_{i=1}^{\infty} c_i^2 < \infty$.

Note that condition C1 is equivalent to:

C1'. $\lambda_{\min}\left( \sum_{i=1}^{n} x_i x_i^\top \right) \to \infty$ a.s. as $n \to \infty$.
Anderson & Taylor (1976) showed that when εi are i.i.d. with mean zero and variance
σ 2 > 0, condition C1 implies the strong consistency of bn . Drygas (1976) also proved
that condition C1 is a necessary and sufficient condition for weak consistency in a
classical linear regression model with i.i.d. errors.
However, we now want to consider a stochastic setting where we sequentially choose inputs $x_i$ and then obtain observations $y_i$. In this case, condition C1 (or C1') is not sufficient for the strong consistency of $b_n$. To solve this problem, Lai & Wei (1982) made the slightly stronger assumption that $\lambda_{\min}\left( \sum_{i=1}^{n} x_i x_i^\top \right)$ goes to infinity faster than $\log \lambda_{\max}\left( \sum_{i=1}^{n} x_i x_i^\top \right)$. The following result gives a condition on the eigenvalues of the design matrix $X_n^\top X_n$ for $b_n$ to converge to $\beta$ and also gives a rate of convergence.
Theorem 2.3.1. Let $\lambda_{\min}(n)$ and $\lambda_{\max}(n)$ be, respectively, the minimum and maximum eigenvalues of the design matrix $X_n^\top X_n$, and suppose the following condition holds:

C3. $\lambda_{\min}(n) \to \infty$ a.s., and $\frac{\log(\lambda_{\max}(n))}{\lambda_{\min}(n)} \to 0$ a.s.

Then $b_n$ is strongly consistent, with
$$\|b_n - \beta\|^2 = O\left( \frac{\log \lambda_{\max}(n)}{\lambda_{\min}(n)} \right) \quad \text{a.s.}$$
Proof. We give a brief proof here, for more details see Lai & Wei (1982). To prove
this theorem we will require Lemma 2.3.1, a restatement of Sherman–Morrison for-
mula (stated and proved after the proof). Note that
$$\|b_n - \beta\|^2 = \left\| (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n \right\|^2 \leq \left\| (X_n^\top X_n)^{-1/2} \right\|^2 \left\| (X_n^\top X_n)^{-1/2} X_n^\top \varepsilon_n \right\|^2 \leq \frac{Q_n}{\lambda_{\min}(n)},$$
where $Q_n := \varepsilon_n^\top X_n (X_n^\top X_n)^{-1} X_n^\top \varepsilon_n$. We bound $Q_n$ using Lemma 2.3.1. Let $V_n = (X_n^\top X_n)^{-1} = \left( \sum_{i=1}^{n} x_i x_i^\top \right)^{-1}$; then by Lemma 2.3.1,
$$V_n = \left( X_{n-1}^\top X_{n-1} + x_n x_n^\top \right)^{-1} = V_{n-1} - \frac{V_{n-1} x_n x_n^\top V_{n-1}}{1 + x_n^\top V_{n-1} x_n}.$$
Now we obtain the recursive form of $Q_n$. Define $N = \inf\{n : X_n^\top X_n \text{ is nonsingular}\} < \infty$, and for $k > N$,
$$Q_k = \left( \sum_{i=1}^{k} x_i \varepsilon_i \right)^{\!\top} V_k \left( \sum_{i=1}^{k} x_i \varepsilon_i \right)
= \left( \sum_{i=1}^{k-1} x_i \varepsilon_i \right)^{\!\top} V_k \left( \sum_{i=1}^{k-1} x_i \varepsilon_i \right) + x_k^\top V_k x_k \varepsilon_k^2 + 2 x_k^\top V_k \left( \sum_{i=1}^{k-1} x_i \varepsilon_i \right) \varepsilon_k$$
$$= Q_{k-1} - \frac{\left( x_k^\top V_{k-1} \sum_{i=1}^{k-1} x_i \varepsilon_i \right)^2}{1 + x_k^\top V_{k-1} x_k} + x_k^\top V_k x_k \varepsilon_k^2 + 2 \, \frac{x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \left( \sum_{i=1}^{k-1} x_i \varepsilon_i \right) \varepsilon_k,$$
where the last equality substitutes the Sherman–Morrison expression for $V_k$. Let
$$\gamma_k = \frac{\left( x_k^\top V_{k-1} \sum_{i=1}^{k-1} x_i \varepsilon_i \right)^2}{1 + x_k^\top V_{k-1} x_k}, \qquad \theta_k = x_k^\top V_k x_k \varepsilon_k^2, \qquad \omega_{k-1} = 2 \, \frac{x_k^\top V_{k-1}}{1 + x_k^\top V_{k-1} x_k} \left( \sum_{i=1}^{k-1} x_i \varepsilon_i \right),$$
where $\gamma_k \geq 0$, $\theta_k \geq 0$, and $\omega_{k-1}$ is $\mathcal{F}_{k-1}$-measurable because $\varepsilon_k$ is a martingale difference sequence w.r.t. $\mathcal{F}_{k-1}$. Summing these terms, we have for $n > N$,
$$Q_n = Q_N - \sum_{k=N+1}^{n} \gamma_k + \sum_{k=N+1}^{n} \theta_k + \sum_{k=N+1}^{n} \omega_{k-1} \varepsilon_k.$$
Since the last sum is a martingale transform, standard martingale arguments give, for any $\alpha > 0$,
$$\max\left( Q_n, \sum_{k=N+1}^{n} \gamma_k \right) = O\left( \left( \sum_{k=N+1}^{n} \theta_k + \sum_{k=N+1}^{n} \omega_{k-1}^2 \right)^{\frac{1}{2} + \alpha} \right) \quad \text{a.s.}$$
The local martingale convergence theorem and the strong law of large numbers (Chow, 1965) imply $\sum_{k=N+1}^{n} \omega_{k-1}^2 \leq 4 \sum_{k=N+1}^{n} \gamma_k$. To bound the term $\sum_{k=N+1}^{n} \theta_k$, we require Lemma 2.3.2 (stated and proved after the proof). Then we have that
$$\sum_{k=N+1}^{n} \theta_k = \sum_{k=N+1}^{n} x_k^\top V_k x_k \varepsilon_k^2 = O\left( \sum_{k=N+1}^{n} x_k^\top V_k x_k \right) = O(\log \lambda_{\max}(n)) \quad \text{a.s.},$$
and combining the bounds above yields $\|b_n - \beta\|^2 = O\left( \log \lambda_{\max}(n) / \lambda_{\min}(n) \right)$ a.s.
Lemma 2.3.1 (Sherman–Morrison). For vectors $w, v \in \mathbb{R}^n$ with $1 + v^\top w \neq 0$,
$$(I + wv^\top)^{-1} = I - \frac{wv^\top}{1 + v^\top w}.$$

Proof. Recall that the outer product of two vectors, $wv^\top$, is the matrix $(w_i v_j)_{i,j=1}^{n}$. Multiplying both sides by $(I + wv^\top)$, the right-hand side becomes $I$ and the left-hand side is
$$\left( I - \frac{wv^\top}{1 + v^\top w} \right) (I + wv^\top) = I + wv^\top - \frac{wv^\top + wv^\top wv^\top}{1 + v^\top w} = I + wv^\top - \frac{w(1 + v^\top w)v^\top}{1 + v^\top w} = I.$$
Lemma 2.3.2. Let $A_n = \sum_{k=1}^{n} w_k w_k^\top$ for vectors $w_k \in \mathbb{R}^d$, and let $s = \inf\{n : A_n \text{ is nonsingular}\}$. Then
$$\sum_{k=1}^{n} w_k^\top A_k^{-1} w_k = O(\log \lambda_{\max}(n)).$$

Proof. Let $\lambda_{\max}(n)$ and $\lambda_{\min}(n)$ denote the maximum and minimum eigenvalues of $A_n$, respectively. Since $A_n - A_{n-1} = w_n w_n^\top$ is nonnegative definite, $\lambda_{\max}(n) \geq \lambda_{\max}(n-1)$, $\lambda_{\min}(n) \geq \lambda_{\min}(n-1)$, and $A_n$ is nonsingular for $n \geq s$. Furthermore, $|A_n|$ is the product of all the eigenvalues and, for $n \geq s$, $|A_n| \geq \lambda_{\min}(s)^{d-1} \lambda_{\max}(n)$. So we see that
$$\sum_{k=1}^{n} w_k^\top A_k^{-1} w_k = O(\log |A_n|) = O(\log \lambda_{\max}(n)).$$
Under similar conditions to Lai et al. (1979) and Lai & Wei (1982), Chen et al.
(1999) studied the strong consistency results for MQLE in GLMs but they considered
the GLM with a canonical link function µ. Assume that in the fixed design case,
$$\mathbb{E}[y_i] = \mu(x_i^\top \beta).$$
Chen et al. (1999) derived strong consistency results for GLMs that parallel, respectively, those of Lai et al. (1979) for fixed designs and Lai & Wei (1982) for adaptive designs. However, the proof contains a mistake, see Zhang & Liao (1999).
Chang (1999) studied the consistency in GLMs under general link functions with an additional assumption:

C5. $\lambda_{\min}(n) \geq \varepsilon n^{\alpha}$, for some $\alpha > \frac{1}{2}$ and $\varepsilon > 0$.
Chapter 3
3.1 Introduction
health insurance (Ohlsson & Johansson, 2010; Wüthrich, 2017). As one of the oldest
financial businesses, insurance has been relatively slow to adopt new technologies
because it is traditionally cautious, heavily regulated, relies on legacy systems, and
has low customer engagement (Institute of International Finance, 2016; McKinsey
& Company, 2018; Nasdaq MarketInsite, 2019). With the rise of digitization and
dynamic pricing, the insurance industry is moving into a new era, diversifying into e-commerce and offering insurance products online.
We address the insurance pricing problem with demand and heavy-tail dis-
tributed claims using an adaptive generalized linear model (GLM) and an adaptive
Gaussian process (GP) regression model. Here demand and claims are both un-
known to the company. A GLM is a parametric model with unknown parameters
and it is widely used in insurance. We model demand and total claims by GLMs
and then adaptively apply maximum likelihood estimation to infer the unknown pa-
rameters. This method is inspired by den Boer & Zwart (2014b), who considered the case where the expected demand is a generalized linear function of the price of a single product. A GP is a Bayesian non-parametric model, which generalizes the Gaussian
probability distribution and models random variables with stochastic processes. We
sample demand and total claims from GPs and then choose the optimal price that
gives the highest upper bound on its confidence interval. This is based on Srinivas
et al. (2012), who obtained a finite regret bound for the GP method. However, in the
real world, claims are only triggered when the insured events happen so are not paid
out immediately when an insurance product is purchased. We consider the delayed
situation as a delayed feedback problem and apply a method from Joulani et al.
(2016), who considered a full information case. Therefore, we investigate pricing
algorithms both with and without delayed claims and our results show that they
both achieve asymptotic upper bounds on regret.
One of our main contributions compared to previous work is that we study the
dynamic pricing problem in an insurance context, where the company observes not
only the demands but also claims. Second, our pricing problem is an optimization
and bandit learning of an additive function, since the revenue has two components—
premiums and claims. Third, in the delayed case, we adapt the previous result of a
full information setting to a bandit case. We show that GLM and GP mechanisms
are both simple, easy to implement and lead to good performance in the insurance
pricing problem. This suggests that online learning is a promising avenue for future
investigations and applications in insurance.
3.2 Related literature
Insurance pricing. In insurance, the company makes pricing decisions based on
their own historical data on pricing policies and claims. Here, premium is the ex-
pected income that the insurance company gains and claims are the expected out-
come that the insurance company loses. Thus, the wealth of the insurance company
increases with premiums but decreases whenever claims occur.
A popular approach to insurance pricing is generalized linear models (GLMs).
The use of GLMs in actuarial work was first proposed in the 1980s by British ac-
tuaries including Brockman & Wright (1992) and Haberman & Renshaw (1996).
McCullagh & Nelder (1989) applied GLMs to insurance ratemaking, fitting a GLM
to different types of data, including average claim costs for a motor insurance port-
folio and claims frequency for marine insurance. GLMs have now become a well-
established and standard industry practice across a wide range of insurance pricing,
from health insurance (de Jong & Heller, 2008) to vehicle insurance (Ohlsson & Jo-
hansson, 2010; Wüthrich & Buser, 2017). We refer to Haberman & Renshaw (1996)
for an overview of applications of GLMs in insurance. An introduction to the use of
GLMs for insurance problems specific to insurance data can be found in de Jong &
Heller (2008), and for a tariff analysis and some useful extensions of standard GLM
theory see Ohlsson & Johansson (2010). Stochastic control theory is also applied
in insurance, which deals with a more realistic situation of dynamic strategies, for
more details see Schmidli (2008), Asmussen & Albrecher (2010) and Koller (2012).
Machine learning techniques have become increasingly popular in insurance appli-
cations for enhancing and supplementing GLM analysis, and GPs are widely used
in machine learning. Wüthrich & Buser (2017) provided an overview and insight
into classical methods including GLMs and machine learning methods in non-life
insurance pricing.
all arms in the past, and exploitation—choosing the arm with higher expected re-
ward. The upper confidence bound (UCB) algorithm is commonly used to balance
exploration and exploitation, which estimates the mean reward of each arm and a
corresponding confidence interval, and then selects the arm that achieves a highest
upper confidence bound (Lai & Robbins, 1985; Auer et al., 2002). For an overview on
multi-armed bandit problems see Bubeck & Cesa-Bianchi (2012). In the insurance
pricing problem, a learning algorithm sequentially selects prices based on observed
information, while simultaneously adapting its price-selection strategy to maximize
its revenue.
Bayesian optimization and gaussian process. Bayesian optimization is an-
other efficient approach to addressing global optimization problems, especially when
objective functions are unknown or expensive to evaluate. There are two significant
stages in Bayesian optimization. The first stage is to learn the objective function
from available samples, which typically works by assuming the unknown function is
sampled from a GP. The second stage is to determine the next sampling points. The
UCB criterion is a popular approach for this. The Bayesian optimization problem
can be cast as a multi-armed bandit problem with a continuous set of arms. More
recently, the Gaussian process upper confidence bound (GP-UCB) algorithm was proposed as a Bayesian optimization method by Srinivas et al. (2012). This is noteworthy
because it achieved sublinear cumulative regret. For a comprehensive overview of
Bayesian optimization and its applications, we refer readers to Brochu et al. (2010).
3.3 Problem formulation
We consider an insurance company that sells a new product over a finite selling
horizon T > 0. At the beginning of each time period t = 1, . . . , T , the selling
price pt is determined. We define the set of acceptable prices by P = [pl , ph ], where
0 < pl < ph are the minimum and maximum selling prices.
We assume that dynamic pricing is only associated with past prices. We denote
demand and total claims functions by D(·) and C(·). Given a determined selling
price pt ∈ P at time t, the insurance company observes random functions D(pt ) and
C(pt ) under the chosen price pt . Here C(pt ) denotes the logarithm of total claims.
We consider price to be a design variable, and demand and total claims to be random variables that depend only on the price. Normally, total claims increase with an
increase in prices, so we assume that demand and total claims respond to changes
in prices simultaneously.
In insurance, the premium is the expected income that the company earns, and claims are the outcome that the company pays out. If the selling price is known, the expected revenue at time $t$ is
$$r(p_t) = p_t \cdot \mathbb{E}[D(p_t)] - \mathbb{E}[C(p_t)]. \qquad (3.3.1)$$
The goal is to find an optimal pricing policy that generates the highest revenue over $T$ periods. We use regret to evaluate the company's pricing policy by comparing its
expected revenue to the best possible expected revenue. Define the optimal price $p^{\star}$ at time $t$ by
$$p^{\star} := \arg\max_{p_t \in \mathcal{P}} r(p_t). \qquad (3.3.2)$$
We can write the cumulative regret over the time horizon $T$ as
$$\text{Rg}(T) := \mathbb{E}\left[ \sum_{t=1}^{T} \left( r(p^{\star}) - r(p_t) \right) \right]. \qquad (3.3.3)$$
Notice that the objective of the company now is to minimize the cumulative regret.
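To make the regret criterion concrete, here is a small sketch (NumPy; the demand/claims curves and the candidate fixed-price policy are invented for illustration) that evaluates the cumulative regret of a policy against the optimal benchmark:

```python
import numpy as np

prices = np.linspace(0.5, 2.0, 200)          # acceptable price set P

def expected_revenue(p):
    """Illustrative r(p) = p * E[D(p)] - E[C(p)]."""
    demand = np.exp(1.0 - 0.8 * p)           # decreasing in price
    claims = 0.5 * demand                    # claims follow demand
    return p * demand - claims

r_star = expected_revenue(prices).max()      # optimal benchmark r(p*)

def cumulative_regret(policy_prices):
    return np.sum(r_star - expected_revenue(np.asarray(policy_prices)))

# A naive policy that always charges 1.0 accumulates linear regret:
T = 1000
print(cumulative_regret([1.0] * T))
```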
unknown parameters, which can be inferred from data. We apply maximum quasi-likelihood estimation to estimate the unknown parameters in the model.
We introduce the model and assumptions in Section 3.4.1, and explain how to
estimate the unknown parameters in Section 3.4.2. In Section 3.4.3, we discuss our
pricing algorithm. In Section 3.4.4, we give the main result, Theorem 3.4.1, in this
chapter. Proofs of this section are collected in Appendix A.
We do not require that the insurance company has complete knowledge of the demand distribution; only the first two moments of demand and the selling price are required. Commonly used demand models are normally, Poisson, negative binomial, Bernoulli and logistically distributed demand. Once the demand distribution is chosen,
the variance function is then determined.
Much attention has been paid to the distribution of claims. The most popular
approach is to assume that the distribution is heavy-tailed (Ramsay, 2003; Albrecher
& Kortschak, 2009). This is because most claims are relatively small, but occasion-
ally a large claim occurs that leads to a long right tail. For example, we may assume
that the expected total claims follow a lognormal distribution. The lognormal distri-
bution shows a good fit especially for large claims in insurance due to its skewness.
Notice that large claims can even cause ruin. A comprehensive introduction to extremal events can be found in Embrechts et al. (1997).
We assume that the company knows the forms of the first two moments of demand and total claims. In particular, we assume that the model for the demand distribution at time $t$ is
$$\mathbb{E}[D(p_t)] = h_1(a_0 + a_1 p_t), \qquad \operatorname{Var}[D(p_t)] = \sigma_1^2 \, v_1\big(h_1(a_0 + a_1 p_t)\big).$$
Similarly, we assume the expectation and variance of the logarithm of total claims are
$$\mathbb{E}[C(p_t)] = h_2(b_0 + b_1 p_t), \qquad \operatorname{Var}[C(p_t)] = \sigma_2^2 \, v_2\big(h_2(b_0 + b_1 p_t)\big).$$
Here parameters a0 , a1 and b0 , b1 are all unknown but functions h1 (·), h2 (·), v1 (·), v2 (·) :
R → R, are known. Further, h1 (·), h2 (·) are called link functions. Recall that the
link function is called the canonical link function when $\dot{h}(x) = v(h(x))$; otherwise it
is called a general link function. Variances of the randomly distributed demand
and log of total claims are functions of constants σ1 , σ2 > 0 and variance functions
v1 (·), v2 (·) of the expected demand and log of total claims. Both h(·) and v(·) are
continuously differentiable with first derivatives denoted by ḣ(·) and v̇(·).
We denote $a = (a_0, a_1)^\top$ and $b = (b_0, b_1)^\top$; the expected revenue in (3.3.1) can then be expressed as a function of $p$, $a$ and $b$, written $r(p, a, b)$.
Let the filtration $(\mathcal{F}_t)_{t \in \mathbb{N}}$ be generated by $\{p_i, d_i, c_i : i = 1, \ldots, t\}$ for each $t$. Write the error terms as $\eta_i = y_i - h(p_i^\top \beta_0)$. These form a martingale difference sequence w.r.t. $\mathcal{F}_t$, i.e., $\eta_i$ is $\mathcal{F}_i$-measurable and $\mathbb{E}[\eta_i \mid \mathcal{F}_{i-1}] = 0$. We also define a design matrix $P(t)$, the sum of the outer products of the price vectors, given by
$$P(t) = \sum_{i=1}^{t} p_i p_i^\top. \qquad (3.4.2)$$
Strong consistency of the MQLE, $\hat{\beta}_t \to \beta_0$ a.s., holds under the following assumptions:

A1. $\mathbb{E}[\eta_i \mid \mathcal{F}_{i-1}] = 0$ and $\sup_{i \geq 1} \mathbb{E}[|\eta_i|^{\gamma} \mid \mathcal{F}_{i-1}] < \infty$ a.s. for some $\gamma > 2$.

A2. $\lim_{t \to \infty} \lambda_{\min}(t)/\log \lambda_{\max}(t) = \infty$ a.s., where $\lambda_{\max}(t)$ and $\lambda_{\min}(t)$ are the maximum and minimum eigenvalues of $\sum_{i=1}^{t} p_i p_i^\top$.
In GLMs with the canonical link functions, Chen et al. (1999) proved strong consis-
tency and convergence of MQLEs. Their proof contains a mistake, see Zhang & Liao
(1999). In GLMs under general link functions, Chang (1999) obtained the strong
consistency for the MQLE by using a last-time random variable under an additional
assumption
A3. $\lambda_{\min}(t) \geq c t^{\gamma}$ a.s. for some $c > 0$ and $\gamma > \frac{1}{2}$ independent of $t$.
However there is also a flaw in this proof, see den Boer & Zwart (2014a). Under
assumptions A1, A2 and A3, den Boer & Zwart (2014a) obtained mean square
convergence rates in case of a canonical link function such that as t → ∞
2 log λmax (t)
E β̂t − β0 =O a.s. (3.4.3)
λmin (t)
p
In our case, we consider λmin (t) ≥ c t log(t) where c > 0, so it is sufficient for us
to use (3.4.3) directly.
3.4.3 Adaptive GLM pricing policy
A popular pricing policy is certainty equivalent pricing, which was first applied in a simple linear regression model (Anderson & Taylor, 1976) and later developed in a more general multiple regression model (Anderson & Taylor, 1979). We define the certainty equivalent price, denoted by $p^{ce}(\beta)$, to be the revenue-maximizing price under parameter $\beta$,
$$p^{ce}(\beta) := \arg\max_{p \in \mathcal{P}} r(p, \beta),$$
and the certainty equivalent pricing policy sets
$$p_t = p^{ce}(\hat{\beta}_t).$$
$P(t)$ defined in (3.4.2) is
$$P(t) = \sum_{i=1}^{t} p_i p_i^\top,$$
and $\operatorname{tr}(P(t))$ is the trace of the design matrix. Let $\mathcal{L}$ be a class of positive, differentiable, monotone increasing functions. Choose a function $L(t) \in \mathcal{L}$ such that $L(t) \to \infty$ and $t \mapsto \frac{1}{L(t)}$ is convex. We require $\operatorname{tr}(P(t)^{-1})^{-1} \geq L(t)$ for all $t \geq 2$; then
$$\lambda_{\min}(t) \geq \operatorname{tr}(P(t)^{-1})^{-1} \geq L(t).$$
A proper choice of the function $L(t)$ ensures sufficient price dispersion, and therefore guarantees the convergence of the parameter estimates to the true parameters, as well as the asymptotic convergence of our price sequence to the optimal price.
We consider a general case here. Based on the work of den Boer & Zwart (2014b), we let $v_1, \ldots, v_{n+1}$ be the unit eigenvectors of the design matrix, which form an orthonormal basis of $\mathbb{R}^{n+1}$, for $t > n + 1$. The price given the uncertain estimate $\hat{\beta}_t$ can be written as a linear combination of the unit eigenvectors, $p(\hat{\beta}_t) = \sum_{i=1}^{n+1} \alpha_i v_i$.
Algorithm 1: GLM Pricing Algorithm
Initialisation:
  Choose $L \in \mathcal{L}$.
  Choose linearly independent initial price vectors $p(1), p(2)$ in $\mathcal{P}$.
For all $t \geq 3$:
  Estimation:
    Calculate $\hat{\beta}_t$ using (3.4.1).
  Pricing:
    (I) If $\hat{\beta}_t$ does not exist or $\operatorname{tr}(P(t)^{-1})^{-1} < L(t)$, then set $p_{t+1} = p_1, \ldots, p_{t+j} = p_j$, where $j$ is the smallest integer satisfying $\operatorname{tr}(P(t+j)^{-1})^{-1} \geq L(t+j)$.
    (II) If $\hat{\beta}_t$ exists and $\operatorname{tr}(P(t)^{-1})^{-1} \geq L(t)$, then set $p_{t+1} = p^{ce} + \phi_t$, where $p^{ce} = p(\hat{\beta}_t)$, to ensure
    $$\operatorname{tr}\left( \left( P(t) + p_{t+1} p_{t+1}^\top \right)^{-1} \right)^{-1} \geq L(t+1).$$
Here the perturbation $\phi_t$ is defined in (3.4.4), where $v_{2,1}$ is the first component of $v_2$ and there exists a constant $k > 0$ such that $k\sqrt{\dot{L}(t)} \leq 1$.
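The following is a compact, self-contained sketch of the pricing loop of Algorithm 1 (Python/NumPy; the demand model, the choice $L(t) = \sqrt{t}$, the least-squares estimator standing in for the MQLE, and the simple perturbation rule are all simplified illustrations of the ideas above, not the thesis's exact construction):

```python
import numpy as np

rng = np.random.default_rng(6)
P_LOW, P_HIGH = 0.5, 2.0
A_TRUE = np.array([1.0, -0.8])                # illustrative true demand parameters

def demand(p):
    """Poisson demand with log link: E[D(p)] = exp(a0 + a1 p)."""
    return rng.poisson(np.exp(A_TRUE[0] + A_TRUE[1] * p))

def ce_price(a):
    """Certainty equivalent price: maximize p * exp(a0 + a1 p) over a grid."""
    grid = np.linspace(P_LOW, P_HIGH, 400)
    return grid[np.argmax(grid * np.exp(a[0] + a[1] * grid))]

prices = [1.0, 1.5]                           # linearly independent starting prices
sales = [demand(p) for p in prices]

for t in range(2, 500):
    X = np.column_stack([np.ones(len(prices)), prices])
    # Crude estimate of (a0, a1) via least squares on log(1 + sales);
    # the thesis uses the maximum quasi-likelihood estimator instead.
    a_hat = np.linalg.lstsq(X, np.log1p(sales), rcond=None)[0]
    p_next = ce_price(a_hat)
    # Price-dispersion safeguard in the spirit of tr(P(t)^-1)^-1 >= L(t):
    design = X.T @ X
    if 1.0 / np.trace(np.linalg.inv(design)) < np.sqrt(t):
        p_next += rng.choice([-1.0, 1.0]) * 0.5 * t ** -0.25
    p_next = float(np.clip(p_next, P_LOW, P_HIGH))
    prices.append(p_next)
    sales.append(demand(p_next))

print("final price:", prices[-1], "CE price under true params:", ce_price(A_TRUE))
```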
The following proposition guarantees that when prices are chosen in this way, the price dispersion condition (3.4.5) is satisfied.

Proposition 3.4.1. Assume the MQLE $\hat{\beta}_t$ exists and $\operatorname{tr}(P(t)^{-1})^{-1} \geq L(t)$. If we set the next price to be $p_{t+1} = p^{ce} + \phi_t$, where $\phi_t$ is defined in (3.4.4), then for $t \geq 3$,
$$\operatorname{tr}\left( \left( P(t) + p_{t+1} p_{t+1}^\top \right)^{-1} \right)^{-1} \geq L(t+1). \qquad (3.4.5)$$
Our main result in the GLM setting is shown in Theorem 3.4.1, the proof of which is also provided in this section. We state below the result for the canonical case under consideration.

Theorem 3.4.1. Suppose there exists $t_0 \in \mathbb{N}$ and a choice of $L(t) \in \mathcal{L}$ for all $t \geq t_0$, such that the MQLE $\hat{\beta}_t$ is strongly consistent. If the following conditions are satisfied:

(i) $\lambda_{\min}(t) \geq L(t)$ a.s., for all $t \geq t_0$,

(ii) $\sum_{t=1}^{T} \left\| p_t - p(\hat{\beta}_{t-1}) \right\|^2 \leq k_0 L(T)$ a.s., for all $T \geq t_0$ and some $k_0 > 0$,

then the cumulative regret satisfies
$$\text{Rg}(T) = O\left( L(T) + \sum_{t=t_0}^{T} \frac{\log t}{L(t)} \right) \quad \text{a.s.}$$

The proof relies on the following smoothness property of the expected revenue.

Proposition 3.4.2. For all $p \in \mathcal{P}$,
$$|r(p, \beta_0) - r(p^{\star}, \beta_0)| = O\left( \|p - p^{\star}\|^2 \right).$$

Proposition 3.4.2 shows that the cumulative regret is of the same order as the expected value of the sum of the squared price deviations $\|p_t - p^{\star}\|^2$. This term is in turn of the order of the expected value of the sum of the squared differences between the estimated and true parameters, $\|\hat{\beta}_t - \beta_0\|^2$.
We focus on a simple case where the link functions are canonical, i.e., $\dot{h}(\cdot) = v(h(\cdot))$; then (3.4.1) becomes
$$\ell_t(\hat{\beta}_t) = \sum_{i=1}^{t} \frac{1}{\sigma^2} \, p_i \left( y_i - h(p_i^\top \hat{\beta}_t) \right) = 0.$$
In our case, we consider the GLM pricing problem under assumptions A1, A2 and A3, thus we can apply (3.4.3) directly. Recall that
$$\mathbb{E}\left\| \hat{\beta}_t - \beta_0 \right\|^2 = O\left( \frac{\log \lambda_{\max}(t)}{\lambda_{\min}(t)} \right) \quad \text{a.s.}$$
The necessary and sufficient condition for the consistency of $\hat{\beta}_t$ is $\lambda_{\min}(t) \to \infty$ as $t \to \infty$, since $\log \lambda_{\max}(t) = O(\log(t))$.
Proof of Theorem 3.4.1. By Proposition 3.4.2, the regret over the selling horizon $T$ can be expressed in terms of $\|\hat{\beta}_{t-1} - \beta_0\|^2$: there exists some $k_1$ such that
$$r(p^{\star}, \beta_0) - r(p_t, \beta_0) \leq k_1 \|p_t - p^{\star}\|^2 \leq 2 k_1 \left\| p_t - p(\hat{\beta}_{t-1}) \right\|^2 + 2 k_1 \left\| p(\hat{\beta}_{t-1}) - p^{\star} \right\|^2 \qquad (3.4.6)$$
for all $t \geq t_0$. The first inequality is an immediate consequence of Proposition 3.4.2. The second inequality is obtained from the fact that $(a + b)^2 \leq 2a^2 + 2b^2$ for all $a, b > 0$. Summing over $t = t_0, \ldots, T$, by condition (ii) we have the upper bound
$$2 \, \mathbb{E}\left[ \sum_{t=t_0}^{T} \left\| p_t - p(\hat{\beta}_{t-1}) \right\|^2 \right] \leq 2 k_0 L(T). \qquad (3.4.7)$$
We now demonstrate that our GLM Pricing Algorithm 1 satisfies the conditions of Theorem 3.4.1. Condition (i) is necessary because it provides a sufficient condition for price dispersion. At each time $t$, we choose a function $L(t)$ such that $\operatorname{tr}(P(t)^{-1})^{-1} \geq L(t)$ for all $t \geq 2$. Since $\lambda_{\min}(t) \geq \operatorname{tr}(P(t)^{-1})^{-1}$, condition (i) is satisfied. Furthermore, condition (ii) follows from Proposition 3.4.1 and dictates how to define the next price. Recall that if $\hat{\beta}_{t-1}$ exists and condition (i) is satisfied, we set $p_{t+1} = p(\hat{\beta}_{t-1})$. Otherwise, we set $p_{t+1} = p(\hat{\beta}_{t-1}) + \phi_t$, where $\phi_t$ is defined in (3.4.4). By choosing $\|\phi_t\|^2 = \dot{L}(t) \left( 1 + \max_{p \in \mathcal{P}} \|p\|^2 \right)$, where $\max_{p \in \mathcal{P}} \|p\|$ is a constant, and using $\sum_{t=1}^{T} \dot{L}(t) = L(T)$, condition (ii) is satisfied.
A Gaussian process is fully specified by its mean function $\mu(x)$ and covariance (kernel) function $\kappa(x, x')$, and we denote a Gaussian process $f(x)$ by
$$f(x) \sim GP\big(\mu(x), \kappa(x, x')\big).$$
The kernel determines how observations influence the prediction at nearby or similar inputs. There are two commonly used kernels: the squared exponential kernel, denoted by $\kappa_{\sigma,\ell}$, and the Matérn kernel, denoted by $\kappa_{\nu,\ell}$, which have the forms
$$\kappa_{\sigma,\ell}(x, x') = \exp\left( -\frac{1}{2\ell^2} |x - x'|^2 \right),$$
$$\kappa_{\nu,\ell}(x, x') = \frac{1}{2^{\nu-1} \Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, |x - x'|}{\ell} \right)^{\nu} B_{\nu}\left( \frac{\sqrt{2\nu}\, |x - x'|}{\ell} \right),$$
where the parameter $\ell$ is the length-scale, $\nu$ controls the smoothness of the process, $\Gamma(\cdot)$ is the Gamma function and $B_{\nu}(\cdot)$ is the modified Bessel function of the second kind of order $\nu$. Note that when $\nu = 1/2$ the Matérn kernel reduces to the exponential kernel, $\kappa_{\nu,\ell}(r) = \exp(-r/\ell)$, and when $\nu \to \infty$ it reduces to the squared exponential kernel. The mean function, in fact, depends on the prices chosen in our case, for instance a linear or polynomial function of prices. Non-constant mean functions provide the Gaussian process with a global structure, but this does not affect the final results. Moreover, a zero mean is always possible by subtracting the sample mean. Without loss of generality, we assume that the mean is a constant (e.g. zero) and the covariance function is strictly bounded.
55
where kt (p) = [k(p1 , p), . . . , k(pt , p)]> and the covariance matrix Kt is the positive
definite matrix whose entries are Ki,j = k(pi , pj ) for i, j = 1, . . . , t.
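A minimal NumPy sketch of this posterior update (the squared exponential kernel, length-scale, noise level and data below are illustrative choices):

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.3):
    """Squared exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_posterior(p_train, y_train, p_test, noise_var=0.1):
    """Posterior mean and variance as in (3.5.1)."""
    K = sq_exp_kernel(p_train, p_train) + noise_var * np.eye(len(p_train))
    k_star = sq_exp_kernel(p_train, p_test)            # k_t(p) for each test p
    mean = k_star.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

p_obs = np.array([0.6, 1.0, 1.4, 1.8])
y_obs = np.array([0.9, 1.2, 1.0, 0.5])                  # illustrative revenues
mu, var = gp_posterior(p_obs, y_obs, np.linspace(0.5, 2.0, 5))
print(mu, var)
```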
We denote the function of expected demand at price pt by fd (pt ), and the function
of expected log of total claims by fc (pt ). Functions fd (·), fc (·) are independently
sampled from GPs with known means µd , µc and kernels κd (p, p0 ), κc (p, p0 ). We can
write fd ∼ GP(µd , κd ) and fc ∼ GP(µc , κc ). The posteriors over fd and fc are GPs
and also follow the GP posterior update in (3.5.1).
The expected revenue function, or revenue function for short, given a determined price $p_t$ at time $t$, can be written as
$$r_t = p_t \cdot f_d(p_t) - f_c(p_t) + \varepsilon_r,$$
where the noise term $\varepsilon_r \sim N(0, \sigma_r^2)$ is a combination of demand noise and claims
noise. Since the sum of GPs is a GP and the kernel has the form of a direct
sum, the revenue function also follows a GP with an additive kernel. We can write
r ∼ GP(µr , κr ), where µr = p · µd − µc and κr = p2 · κd + κc are known.
The cumulative regret in (3.3.3) over the time horizon $T$ can be expressed as
$$\text{Rg}(T) = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \big(p^{\star} \cdot f_d(p^{\star}) - f_c(p^{\star})\big) - \big(p_t \cdot f_d(p_t) - f_c(p_t)\big) \right) \right].$$
Note that our work is slightly different from Srinivas et al. (2012), who considered
only one function f . In our case, we treat the revenue function r(·) as an additive
function, i.e., r(·) contains two components p · fd (·) and fc (·) and is sampled with
an additive kernel. Furthermore we assume that the samplings of fd (·) and fc (·) are
independent of each other.
In the GP setting, we determine our pricing policy by the UCB rule. We determine the next price by
$$p_t = \arg\max_{p \in \mathcal{P}} \ \mu_{t-1}^r(p) + \sqrt{\varphi_t} \, \sigma_{t-1}^r(p),$$
where the additive mean $\mu_{t-1}^r$ and additive standard deviation $\sigma_{t-1}^r$ are given by
$$\mu_{t-1}^r(p) = p \cdot \mu_{t-1}^d(p) - \mu_{t-1}^c(p), \qquad \sigma_{t-1}^r(p) = p \cdot \sigma_{t-1}^d(p) + \sigma_{t-1}^c(p).$$
Algorithm 2: GP Pricing Algorithm
Input: GP priors $\mu_0$, $\sigma_0$, $k$
for $t = 1, 2, \ldots$ do
  Select price: $p_t \leftarrow \arg\max_{p \in \mathcal{P}} \mu_{t-1}^r(p) + \sqrt{\varphi_t} \, \sigma_{t-1}^r(p)$;
  Sample the revenue function: $r_t \leftarrow p_t \cdot f_d(p_t) - f_c(p_t) + \varepsilon_t^r$;
  Update estimates:
    • $\mu_t^d$, $\mu_t^c$, $\sigma_t^d$, and $\sigma_t^c$ by performing the GP posterior update (3.5.1);
    • obtain $\mu_t^r$ and $\sigma_t^r$.
end
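A compact, self-contained sketch of the GP-UCB price-selection loop (NumPy; for simplicity the GP is placed directly on the revenue rather than on demand and claims separately, and the confidence width $\varphi_t$ and the revenue simulator are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
GRID = np.linspace(0.5, 2.0, 100)

def kernel(a, b, l=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

def posterior(p_obs, y_obs, noise=0.05):
    """GP posterior mean and std over GRID, as in (3.5.1)."""
    K = kernel(np.array(p_obs), np.array(p_obs)) + noise * np.eye(len(p_obs))
    ks = kernel(np.array(p_obs), GRID)
    mean = ks.T @ np.linalg.solve(K, np.array(y_obs))
    var = 1.0 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def revenue(p):
    """Illustrative noisy revenue p * f_d(p) - f_c(p) + noise."""
    return p * np.exp(0.5 - 0.6 * p) - 0.3 * np.exp(-0.4 * p) + rng.normal(0, 0.05)

p_hist, r_hist = [1.0], [revenue(1.0)]
for t in range(2, 100):
    mu, sd = posterior(p_hist, r_hist)
    phi_t = 2 * np.log(len(GRID) * t**2)          # illustrative confidence width
    p_next = GRID[np.argmax(mu + np.sqrt(phi_t) * sd)]
    p_hist.append(p_next)
    r_hist.append(revenue(p_next))
print("last prices:", np.round(p_hist[-5:], 3))
```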
Our main result in the GP setting is shown in Theorem 3.5.1. It follows the work
of Srinivas et al. (2012), but we extend their model to an additive form since the
insurance product has demand and claims.
Theorem 3.5.1. Suppose $f_d$ and $f_c$ satisfy Assumption 3.5.1 below and pick $\delta \in (0, 1)$. With probability greater than $1 - \delta$, we obtain
$$\text{Rg}(T) = O\left( \sqrt{\gamma_T \, T \log T} \right).$$

Theorem 3.5.1 shows that the regret is of the order $O\left( \sqrt{\gamma_T T \log T} \right)$. Here $\gamma_T$ describes the maximum amount of information that the GP pricing algorithm can learn from the demand and log of total claims, which in our case is governed by the combination of additive kernels, as shown in Lemma 3.5.3. Note that the key to bounding the regret is to bound the information gain.
We start by defining the information gain. Suppose there is a finite subset
D = {p1 , . . . , pT } ⊂ P. We denote the observations by yD = f (pD ) + εD and the
function values at these points by fD = f (pD ). The Shannon Mutual Information,
denoted by I(·), is defined as
where H(·) is the entropy. After T rounds, the maximum information gain between
yD and fD is
In the multivariate Gaussian case, H(N (µ, Σ)) = 21 log |2πeΣ|, so that I (yD ; fD ) =
1
2
log |I + σ −2 KD |, where KD = [κ(p, p0 )]p,p0 ∈D is the Gram matrix of k evalu-
ated on D. Thus, bounds on γT (κ) depend on the chosen kernels. For example,
γT (κ) = O(d log T ) under a linear regression, γT (κ) = O((log T )d+1 ) under a squared
exponential kernel, and γT (κ) = O T d(d+1)/(2ν+d(d+1)) (log T ) under a Matérn kernel
with ν > 1. We refer readers to Srinivas et al. (2012) for more details.
To bound γT (κd +κc ), we use the following two results from Srinivas et al. (2012).
Lemma 3.5.1 gives the expression of information gain for the points selected in terms
of the predictive variances. Lemma 3.5.2 provides the finite bound on γT (κ) via the
eigendecay of the kernel, which will be useful to bound the regrets. For the proofs
of both lemmas we refer readers to Theorem 8 in Srinivas et al. (2012).
58
Lemma 3.5.1. In a Gaussian process, the information gain on fT = (f (pt )) ∈ R>
the points selected by the algorithm can be written as
T
1X
log 1 + σ −2 σt−1 (pt )2 ,
I (yT ; fT ) =
2 t=1
where σ 2 is the variance of Gaussian noise and σt−1 (pt )2 is the posterior variance
after t − 1 observations.
To derive the regret bound, we also need Assumption 3.5.1. Assume that both
demand and total claims functions fd , fc satisfy the following assumption.
Assumption 3.5.1. Let f be sampled from a GP with kernel k(p, p0 ) and function f
be almost surely continuously differentiable. Assume that partial derivatives ∂f /∂pj
of this sample path for j = 1, . . . , T satisfying the following condition with high
probability. For some constants a, b > 0, we have
∂f 2
P sup > J ≤ ae−(J/b) .
p∈P ∂pj
+ O(T 1−t/d ) .
Lemma 3.5.3. Let κd = κd (p, p0 ) and κc = κc (p, p0 ) be kernel functions for demand
and claims, respectively. Then
This lemma shows that γT (κd + κc ) has an additive form. We can find bounds on
γT (κd ) and γT (κc ) separately, and then sum them to obtain the bound on γT (κd +κc ).
Proposition 3.5.1 shows the bound on the single-period regret at each time t.
The cumulative regret for the additive combination kernel is the sum of single-period
regrets over time horizon T .
59
Proposition 3.5.1. Pick δ ∈ (0, 1) and choose
Proof of Theorem 3.5.1. For any price p ∈ P, we consider functions of demand and
log of total claims, fd , fc : P → R, are sampled from GPs and satisfy Assump-
tion 3.5.1. The cumulative regret is
" T
#
X
Rg(T ) = E r(p? ) − r(pt ) .
t=1
We know that the price set P is bounded, and expected demand and claims functions
fd , fc are independently multivariate Gaussian distributed with bounded variances.
Without loss of generality, we assume that κr ≤ 1. By Proposition 3.5.1 we have
p
? √ r ub log(2a/δ) (3.5.4)
r(p ) − r(pt ) ≤ 2 ϕt σt−1 (pt ) + ,
t2
r d c
where σt−1 (p) = pt−1 σt−1 (p)+σt−1 (p). By the convexity of the logarithm function, we
know that w2 ≤ v 2 log (1 + w2 )/ log (1 + v 2 ) for w ≤ v. Since σt−1
r
(pt )2 ≤ k(pt , pt ) =
1 and σ −2 σt−1
r
(pt )2 ≤ σ −2 , letting w2 = σ −2 σt−1
r
(pt )2 and v 2 = σ −2 gives
1
r
(pt )2 ≤ log 1 + σ −2 σt−1
r
(pt )2 .
σt−1 −2
log (1 + σ )
60
where c1 = 8/ log (1 + σ −2 ) and γT = γT (κd + κc ). The last inequality is obtained
by Lemma 3.5.1 and definition of γT in (3.5.3).
Summing (3.5.4) over t = 1, . . . , T and using (3.5.5), we have with probability
greater than 1 − δ,
T T p !
X X √ ub log(2a/δ)
r(p? ) − r(pt ) ≤ 2 r
ϕt σt−1 (pt ) +
t=1 t=1
t2
T T p !
X √ r
X ub log(2a/δ)
= 2 ϕt σt−1 (pt ) +
t=1 t=1
t2
p
≤ c1 T ϕT γT + c2 ,
P p
where c2 = Tt=1 ub log(2a/δ)/t2 is a constant (since a, b, u, δ are constants) and
P 2
1/t = π 2 /6. Hence, we obtain the stated regret bound.
61
3.6.1 Model and assumptions
time t − 1 for t = 1, . . . , T , where 1{·} denotes the indicator function. Since the
company determines the next selling price by the latest observed claim, we have
pt = p̃N (t)+1 .
Let τ̃s = s − 1 − N (ρ(s)), now s − 1 is the number of claims that can be observed
if there is no delay and N (ρ(s)) is the actual number of claims observed by time
ρ(s) − 1. Therefore, τ̃s is the number of delayed claims that have not been updated
62
between choosing price pρ(s) and observing corresponding revenue rρ(s) . This gives
pρ(s) = p̃s−τ̃s . The regret can be expressed as
T
X X T
X
?
r(p ) − r(pt ) = rρ(s) (p? ) − rρ(s) (pρ(s) )
t=1 t0 ∈DC t s=1
T
X
= r̃s (p? ) − r̃s (p̃s−τ̃s )
s=1
XT T
X
?
= (r̃s (p ) − r̃s (p̃s )) + (r̃s (p̃s ) − r̃s (p̃s−τ̃s )) .
s=1 s=1
We can see that the regret with delayed claims has two components: the non-delayed
regret and an additional regret caused by delays. We are going to bound each of
these two terms separately.
In this section, we show the main result in Theorem 3.6.1. In the GLM setting
with delays, parameters are also unknown. We use maximum quasi-likelihood esti-
mation to infer the unknown parameters, which is similar to the non-delayed case.
Specifically, the MQLEs, denoted by β̂tD , D for delay, are solutions to the equation
t X
X 1
> D
`t (β̂tD ) = pi ci0 +τi0 − h pi β̂t = 0.
i=1 i0 ∈C
σ2
i
Theorem 3.6.1. Suppose there exists t0 ∈ N and choose L(t) ∈ L for all t ≥ t0 ,
such that the MQLE β̂tD is strongly consistent. We denote the sum of all delays by
SD := Tt=1 τt . If the following conditions are satisfied
P
63
Theorem 3.6.1 shows the regret bound for the GLM pricing algorithm with un-
PT log(t)
known delays. We can see that the obtained O (SD + 1) L(T ) + t=1 L(t) suffers
an additional regret with delayed claims, where SD denotes the accumulated de-
p p
lays. Letting L(t) = c t log(t) gives the regret bound O T log(T ) , which is
consistent to prior work.
We now prove the main result of this section, the proof of which is similar to
that of Theorem 3.4.1.
Proof of Theorem 3.6.1. We consider the cumulative regret over selling horizon T
2
D
in terms of β̂t−1 − β0 , there exists some k5 , such that
" T
# " T
#
X X
? 2
E r(p, β0 ) − r(p? , β0 ) ≤ E k5 kpt − p k
t=1 t=1
" T
#
X 2
D
≤ E 2k5 pt − p(β̂t−1 ) (3.6.1)
t=1
" T
#
X 2
D
+ E 2k5 p(β̂t−1 ) − p? ,
t=1
for all t ≥ t0 and t0 is the smallest natural number. The first inequality is an
immediate result from Proposition 3.4.2. The second inequality is obtained by the
fact that (a + b)2 ≤ 2a2 + 2b2 for all a, b > 0, that is
2
kpt − p? k2 = pt − p(β̂t−1
D D
) + p(β̂t−1 ) − p?
2 2
D D
≤ 2 pt − p(β̂t−1 ) + 2 p(β̂t−1 ) − p? .
The first term in (3.6.1) is taking expectation of the first term in the above equation
and summing over t = t0 , . . . , T , which gives
" T # " T #
X 2 X 2
D D
E pt − p(β̂t−1 ) ≤E p̃s−τ̃s − p̃s + p̃s − p(β̂s−1 )
t=t0 s=t0
" T
# " T
#
X X 2
≤ 2E kp̃s−τ̃s − p̃s k2 + 2E D
p̃s − p(β̂s−1 ) .
s=t0 s=t0
64
due to the fact that kp̃s − p̃s+1 k2 ≤ k6 L̇(s) where k6 = 1 + maxp∈P kpk2 > 0
is a constant. Since L(s) is an increasing concave function, by the Mean Value
Theorem we have L(s + 1) − L(s) = L̇(s̃) for any s̃ ∈ (s, s + 1). This implies
L(s + 1) − L(s) ≥ L̇(s + 1) and
T −1
X T −1
X
L̇(s + 1) ≤ L(s + 1) − L(s) = L(T ) − L(t0 ) ≤ L(T ) .
s=t0 s=t0
2
According to the condition (ii), we know that the term Ts=t0 p̃s − p̃(β̂s−1
D
P
) ≤
k4 L(T ). Using the result that Ts=1 τ̃s = Tt=1 τt in Lemma 3.6.1, which is stated
P P
immediately after this theorem, we obtain the bound on the first term
" T
#
X 2
D
2E pt − p(β̂t−1 ) ≤ 2 (k6 SD + k4 ) L(T ) . (3.6.2)
t=t0
For the second term in the (3.6.1), using (3.4.3), there exists some k7
" T
# T
X
D
2 X log(t)
2E β̂t−1 − β0 = 2k7 , (3.6.3)
t=t0 t=t0
L(t)
T
X log(t)
Rg(T ) ≤ 2k5 (k6 SD + k4 ) L(T ) + 2k5 k7 .
t=t0
L(t)
p
Let L(t) = c t log(t) for some c > 0, we obtain
p p
Rg(T ) ≤ 2k5 (k6 SD + k4 ) T log(T ) + 2k5 k7 T log(T ) .
T
X T
X
τ̃s = τt .
s=1 t=1
This lemma shows that the sum of τ̃s equals the sum of all delays τt over time
T.
65
Algorithm 3: GP Pricing Algorithm with Delayed Claims
Input: GP Pricing Algorithm 2
for t = 1, 2, . . . do
Collect the set of delay times DC t received by time t ;
for t0 ∈ DC t do
Select price: pt0 ← GP Pricing Algorithm 2 ;
Update GP with pt0 , ftd0 (pt0 ) and ftc0 +τt0 (pt0 );
end
end
Currently there are no proven bounds for GP with delays. In this section, we
present the implementation of the GP pricing algorithm in the delayed-claim case.
The pseudocode for pricing a new released insurance product with delays via the
GP algorithm is shown in Algorithm 3. At each time t, we set a price by the GP
Pricing Algorithm 2 and collect the set of start times for delayed claims DC t = {t0 :
t0 = t − τt0 }. For each time t0 ∈ DC t , the premium is observed at time t0 , while
claims are received at time t = t0 + τt0 . We let the premium and the log of delayed
claims be pt0 · ftd0 (pt0 ) and ftc0 +τt0 (pt0 ) respectively, where f d and f c denote demand
and claims functions. After receiving the delayed claims, we update the GP for each
claim and its revenue. This then determines the next selling price.
In this section, we show simulation results for algorithms discussed in the previous
sections. i.) The GLM policy defined in Algorithm 1; ii.) The GP-UCB policy
defined in Algorithm 2; iii.) The D-GLM and D-GP-UCB policies are designed for
cases with unknown delays, where D for delay. The performance of these policies is
measured by the regret.
We consider a simple example of our problem, where both demand and log of total
claims are normally distributed with constant variance (σ = 0.05) and expectations
66
are linear functions of price.
We assume the true parameters are β0 = ((11, −0.8), (3, 0.25)), but they are
not known to the company. We let initial price vectors be p(1) = (1, 3)> and
p(2) = (1, 3.3)> and feasible price set to be [p` , ph ] = [1, 10]. Then we write demand
and total claims models as
E[D(p)] = 11 − 0.8p ,
E[C(p)] = 3 + 0.25p .
In the expected demand function, we choose a negative parameter, e.g., −0.8, be-
cause demand is strictly decreasing in price. While in the expected log of total
claims function, we choose a positive parameter, e.g., 0.25, because claims normally
increase in price.
We let T = 2000. Figure 3.1 shows price dispersion and convergence of parameter
p
estimates. As shown in Figure 3.1a, λmin (t)/ t log(t) is asymptotically bounded
p
around 0.01. This means that λmin (t) grows at the rate of t log(t). Recall that
p
this is a key condition in Theorem 3.4.1. Thus we set L(t) = 0.01 t log(t). This can
guarantee the price dispersion in our pricing problem. Figure 3.1b shows the value
2
of β̂t − β0 , which is the squared norm of the difference between the parameter
estimates β̂t and the true parameters β0 . We can see that the difference converges
to zero as t becomes large enough, i.e., β̂t → β0 as t → ∞. This shows the strong
consistency of parameter estimates.
In the GP setting, we sample demand and claims functions, fd and fc , from GPs
with Matérn kernels k3/2 (r) and k5/2 (r) respectively,
√ ! √ !
3r 3r
k3/2 (r) = 1 + exp − ,
` `
√ √ 2! √ !
5r 5r 5r
k5/2 (r) = 1 + + 2
exp − ,
` 3` `
and we let the length scale parameter be ` = 1. In addition, we set the sample noise
to be σ = 0.05 or σ 2 = 0.0025. Similar to the GLM setting, we consider the feasible
price set is [p` , ph ] = [1, 10]. We run the algorithm for T = 100 with δ = 0.1.
67
3.7.3 Adaptive GLM and GP algorithms with delays
In order to observe the effect of delays, we consider the same settings as in the non-
delayed examples. We assume that delays, τt , are unknown non-negative integers,
that are generated randomly on interval [0, m], where m = 10. The last few delays
are slightly modified to ensure that all claims and revenues are received at time T .
The performance of the four pricing policies is shown in Figure 3.2 and Figure 3.3.
Figure 3.2a plots the cumulative regret incurred by GLM pricing algorithms with
and without delays. We can see that regret increases with t in both cases. To make
p
results more precisely, we re-scale regret to Rg(T )/ T log(T ). Figure 3.2b shows
p
that Rg(T )/ T log(T ) is around a constant in both cases. This means the regret
p
has an order of T log(T ). In addition, we observe that there is a gap between
GLM and D-GLM pricing algorithms due to the delays. The delays cause larger
regret, which means that the company suffers from an extra loss.
Figure 3.3 shows the regret obtained by GP pricing algorithms with and without
delays. Figure 3.3a shows regret quickly converges to fixed values in both cases.
Moreover, the regret in the delayed setting is larger than that in the non-delayed
p
setting. Figure 3.3b shows Rg(T )/ T log(T ) is converging to a constant, which
p
verifies that the order of regret is T log(T ). The gap between GP and D-GP
pricing algorithms is due to the delays, and the company suffers from an extra loss.
Comparing Figure 3.2a and Figure 3.3a, we see that the regret obtained by the
GP pricing algorithm converges quicker and has a smaller regret. Comparing Figure
3.2b and Figure 3.3b, we see that both algorithms achieve the same order of regret
bound. Overall, the GP pricing algorithm outperforms the GLM pricing algorithm.
68
3.8 Conclusions and future directions
69
learning to solve online revenue management problems, which will be discussed in
Chapter 5. After all, bandit problems, as considered in this chapter, are simple
reinforcement learning routines. However, more complex future market interactions
might be considered instead and this problem may be modeled as a Markov decision
process.
70
(a) Price dispersion
Figure 3.1: Price dispersion and convergence of parameter estimates obtained for the GLM
pricing policy.
71
(a) Cumulative regret (b) Rate of convergence
Figure 3.2: Cumulative regret and convergence rate for the GLM pricing algorithm. GLM
denotes the non-delayed case and D-GLM denotes the delayed case.
Figure 3.3: Cumulative regret and convergence rate for the GP pricing algorithm. GP
denotes the non-delayed case and D-GP denotes the delayed case.
72
3.9 Appendices
Proof of Proposition 3.4.1. If β̂t exists and tr(P (t)−1 )−1 ≥ L(t), we first set the next
price to be pt+1 = pce , and check the condition (3.4.5). If this condition does not
hold, which is equivalent to
−1 −1
tr P (t) + pce p>
ce < L(t + 1) , (3.9.1)
we then let the next price be p0 = pce + φt . We will show that there exists φt , such
that −1 −1
0 0>
tr P (t) + p p ≥ L(t + 1) . (3.9.2)
kP (t)−1 p0 k
−1
1 d 1
tr P (t) − 0 > −1 0
≤ + ,
1 + p P (t) p L(t) dt L(t)
then we can say that (3.9.2) is satisfied. Moreover, given tr (P (t)−1 ) ≤ 1
L(t)
, our aim
is to show
kP (t)−1 p0 k
d 1
0 > −1 0
≥− ,
1 + p P (t) p dt L(t)
73
Here we discuss a general case. Let t > n + 1 and λ1 ≥ . . . ≥ λn+1 > 0 be the
eigenvalues of P (t) ⊂ Rn+1 . Assume that v1 , . . . , vn+1 are associated eigenvectors,
which form an orthonormal basis of Rn+1 . Define the optimal price as a linear
combination of these unit eigenvectors, given by pce = n+1
P
i=1 αi vi . Let the next price
be p0 = pce + (vn+1,1 pce − vn+1 ) for some , where vn+1,1 is the first component of
vn+1 . We know that kvi k = 1 and |vn+1,i | ≤ 1 for all i. Then we have
0 2 2 2 2 2
kp − pce k = kvn+1,1 pce − vn+1 k ≤ 1 + max kpk .
p∈P
2 2 2
kφt k ≤ 1 + max kpk .
p∈P
−1
2 −2 −1 2 (3.9.4)
L̇(t) ≤ (n + 1) 1 + L(n + 1) max kpk .
p∈P
Write kp0 k2P (t)−1 = p0 > P (t)−1 p0 . Since λmax (P (t)−1 ) = λmin (P (t))−1 and λmin (P (t)) ≥
L(t), for t > n + 1 we have
74
Moreover,
n+1 n+1
! !! 2
−1 0 2
X X
P (t) p = P (t)−1 αi vi + vn+1,1 αi vi − vn+1
i=1 i=1
n+1 n
! !! 2
X X
−1
= P (t) αi vi + vn+1,1 αi vi + αn+1 vn+1 − vn+1
i=1 i=1
n
! 2
X
= P (t)−1 (αn+1 + (vn+1,1 αn+1 − 1)) vn+1 + (1 + vn+1,1 ) αi vi
i=1
n 2
X
= (αn+1 + (vn+1,1 αn+1 − 1)) λ−1
n+1 vn+1 + (1 + vn+1,1 ) λ−1
i αi vi
i=1
n
2 X 2
= (αn+1 + (vn+1,1 αn+1 − 1)) λ−1
n+1 kvn+1 k + 2
(1 + vn+1,1 ) λ−1
i αi kvi k2
i=1
≥ 2 λ−2
n+1 .
(3.9.6)
75
The last inequality is obtained by (3.9.4). Then we obtain
kP (t)−1 p0 k2 K 2 L̇(t)
≥ .
1 + kp0 k2P (t)−1 L(t + 1)2
1
Due to the convexity of L(t)
, there exists K > 1, such that KL(t) ≥ L(t + 1), and
∂r(p? ,β0 )
Proof of Proposition 3.4.2. Since p(β0 ) ∈ P and ∂pi
= 0, by the Taylor series
expansion, for all p ∈ P,
|r(p, β0 ) − r(p? , β0 )| ≤ k1 kp − p? k2 ,
∂ 2 r(p,β0 )
where k1 := supp∈P p2i
< ∞. By the implicit function theorem in Duistermaat
& Kolk (2004), V can be chosen such that the function β → p(β) is continuously
differentiable with bounded derivatives. Thus for all β ∈ V and some non-random
constant k2 > 0, by the Taylor expansion,
kp(β) − p(β0 ) k≤ k2 k β − β0 k .
≤ γT (κd ) + γT (κc ) ,
where Kd and Kc are the Gram matrices for κd (p, p0 ) and κc (p, p0 ), respectively.
76
√
Proof of Proposition 3.5.1. By the definition that pt = arg maxp∈P µrt−1 (p) + r
ϕt σt−1 (p),
we have
√ √
µrt−1 (pt ) + r
ϕt σt−1 (pt ) ≥ µrt−1 ([p∗ ]t ) + r
ϕt σt−1 ([p∗ ]t ) .
To bound the right hand side of the above inequation, we need Lemma 3.9.1 (stated
and proven below), which implies
√ p
µrt−1 ([p∗ ]t ) + r
ϕt σt−1 ([p∗ ]t ) ≥ r(p∗ ) − ub log(2a/δ)/t2 .
Therefore
√ p
µrt−1 (pt ) + r
ϕt σt−1 (pt ) ≥ r(p∗ ) − ub log(2a/δ)/t2 .
The first term in the above inequality is bounded as below. By assumption 3.5.1
2
and the union bound, we have P supp∈P |∂f /∂pj | > J ≤ ae−(J/b) . Therefore, there
77
2
Then, with probability greater than 1 − ae−(J/b) , for all p ∈ P, we have
|r(p) − r(p0 )| ≤ J kp − p0 k .
u
kp − [p]t k ≤ ,
ζt
where constant u = ph − p` is the length of the price set and [p]t is the closest price
2
to p in Pt . Let δ/2 = ae−(J/b) and we have with probability greater than 1 − δ/2,
p
|r(p) − r([p]t )| ≤ b log(2a/δ) kp − [p]t k
p
≤ ub log(2a/δ)/ζt .
Choosing ζt = t2 yields
p
|r(p) − r([p]t )| ≤ ub log(2a/δ)/t2 .
1/2
r([p]t ) − µrt−1 ([p]t ) ≤ ϕt σt−1
r
([p]t ) .
Lemma 3.9.2. Pick δ ∈ (0, 1) and set ϕt = 2 log(|P| πt /δ). Then with probability
greater than 1 − δ, for any p ∈ P and t ≥ 1, we have
√
r(p) − µrt−1 (p) ≤ r
ϕt σt−1 (pt ) , (3.9.9)
r d c
where σt−1 (p) = pt−1 σt−1 (p) + σt−1 (p).
Proof. Conditioned on yt−1 = (y1 , . . . , yt−1 ), {p1 , . . . , pt−1 } are deterministic and the
2
marginals follow f (p) ∼ N (µt−1 (p), σt−1 (p)) for any fixed t ≥ 1 and p ∈ P. By the
2 /2
following tail bound, we know P(z > c) ≤ (1/2)e−c for c > 0 if z ∼ N (0, 1). Let
1/2
z = (f (p) − µt−1 (p)) /σt−1 (p) and c = ϕt , then
|f (p) − µt−1 (p)| 1/2
P > ϕt ≤ e−ϕt /2 .
σt−1 (p)
1/2
|f (p) − µt−1 (p)| ≤ ϕt σt−1 (p) .
78
Given r(p) = p · fd (p) − fc (p), with probability greater than 1 − |P| e−ϕt /2 , we have
d 1/2 c 1/2
≤ pϕt σt−1 (p) + ϕt σt−1 (p) .
r
Let σt−1 d
(p) = pt−1 σt−1 c
(p) + σt−1 (p) and choose |P| e−ϕt /2 = δ/πt . By the union
bound on all t, we obtain the results.
i=1 1{i +
{1, . . . , T }. The fourth equality derives from the definition that N (t) = t−1
P
τi ≤ t − 1}.
79
Chapter 4
Perturbed pricing
4.1 Introduction
When a company sells a new product, the objective is to select prices that maximize
revenue. However, there is often little information about demand, so over time the
company must choose prices that maximize long-run revenue and efficiently estimate
the distribution of demand. Moreover, when the product is sold online, the demand,
and thus prices, depend on the context in which the items are sold, such as who is
viewing and what search criteria they used.
A widely-used pricing strategy is the certainty equivalent pricing rule or myopic
pricing (Broder & Rusmevichientong, 2012; Keskin & Zeevi, 2014). Here, the man-
ager statistically estimates the demand over the set of prices (and contexts) and then
selects the revenue optimal price for this estimated demand, i.e., prices are chosen
as if statistical estimates are the true underlying parameters. This approach is very
appealing as it cleanly separates the statistical objective of estimating demand from
the managerial objective of maximizing revenue. However, it is well-known in both
statistics and management science literature that a certainty equivalent rule may
not explore prices with sufficient variability. This leads to inconsistent parameter
estimation and thus sub-optimal revenue.
Overcoming this difficulty is an important research challenge for which numerous
different methods have been proposed. We review these shortly, but, to summarize,
there are two principal approaches: to modify the statistical objective, or to modify
80
the revenue objective. For instance, if the statistical objective is a maximum likeli-
hood objective, then likelihood function can be modified via regularization to induce
consistency. For the revenue objective, confidence intervals and taboo regions can
be used to restrict the set of available prices and thus ensure exploration. These
modifications involve additional bespoke calculations that couple the statistical and
revenue objectives.
In this article, we advocate a different approach: pricing according to a certainty
equivalent rule with a decreasingly small random perturbation. Estimation can then
be conducted according to a standard maximum likelihood estimation. We call this
perturbed certainty equivalent pricing, or perturbed pricing, for short. From a man-
agerial perspective, the key advantage is its simplicity—we perturb the data not
the optimization. The statistical objective and revenue objective remain unchanged
and can be treated separately. Thus parameters can be estimated using conven-
tional statistical tools, and prices can be chosen with standard revenue optimization
techniques. If the magnitude of the perturbation is chosen well, our results prove
that perturbed pricing performs comparably with the best pricing strategies.
4.1.1 Overview
We present a brief summary of our model, pricing strategy, results and contributions.
A formal description and mathematical results are given in Sections 4.2 and 4.3,
respectively.
Model. We let r(p, c ; β0 ) be the revenue for a set of products at prices p under
context c given parameters β0 . We assume that context c is i.i.d. bounded random
variables and the covariance matrix of c by Σc = E cc> > 0. The revenue objective
81
Here µ is an increasing function and ε has mean zero. The response y can be
interpreted as the number (or volume) of items sold given the prices and context.
Given data ((xs , ys ) : s = 1, ..., t), the statistical objective is a maximum likelihood
optimization:
t
X
β̂t ∈ arg max ys β > xs − m(β > xs ) , (4.1.1)
β s=1
Notice p? (c) = pce (c ; β0 ). For the maximum likelihood estimate at time t − 1, β̂t−1 ,
and new context, ct , the perturbed certainty equivalent price is
Results. The regret measures the performance of the perturbed pricing strategy
compared to the revenue optimal strategy:
T
X
Rg(T ) = r(p? (ct ), ct ; β0 ) − r(pt , ct ; β0 ) .
t=1
√
It is known that Rg(T ) = Ω( T ) for any asymptotically consistent pricing policy.
1
In Theorem 4.3.1 we prove that for αt = t− 4 the regret of the perturbed pricing
satisfies
√
Rg(T ) = O T log(T ) .
4.1.2 Contributions.
In Theorem 4.3.1 we show that the convergence of the perturbed certainty equivalent
pricing is optimal up to logarithmic factors. The speed of convergence is competitive
with the best existing schemes. However, as we have already discussed, the main
82
contribution is the simplicity of the policy. Current schemes often require additional
matrix inversions, eigenvector calculations, or introduce non-convex constraints to
the pricing optimization. Furthermore the scheme is flexible in leveraging contextual
information, which is an important best-practice in many online market places and
recommendation systems.
In forming perturbed certainty equivalent prices, the manager can estimate pa-
rameters as a standard maximum likelihood estimator (4.1.1), and can price accord-
ing the revenue maximization objective (4.1.2). The only change is to introduce a
small amount of randomization (4.1.3). This is appealing as the statistical optimiza-
tion and revenue optimization remain unchanged and the perturbation of prices is
simple, intuitive, and requires negligible computational overhead.
In addition to this managerial insight, there are a number of mathematical con-
tributions in this paper. The eigenvalue lower bound used in Proposition 4.3.2 is
new and also crucial to our analysis of design matrices. We build on the work of Lai
& Wei (1982) to clarify results on the rate of convergence and strong consistency
of generalized linear models in Proposition 4.3.1. Further, as we will review, much
of the current revenue maximization literature on pricing applies to revenue opti-
mization of a single item in a non-contextual setting. To this end, we include the
important generalizations of contextual information.
83
classical revenue management problem of Gallego & van Ryzin (1994b) is recast as a
statistical learning problem and shortfall in revenue induced by statistical estimation
is analyzed. Subsequently there have been a variety of studies jointly applying
optimization and estimation techniques in revenue management, see den Boer (2015)
for a survey of this literature.
84
The revenue management literature discussed so far does not incorporate contex-
tual information on the customer, the query and the product. To this end, we first
discuss the contextual multi-arm bandit problem and then feature-based pricing.
85
& Bayati (2016) assumed the demand follows a linear function of the prices and
covariates, and applied a myopic policy based on least-square estimations which
achieved a regret of O(log(T )). Javanmard & Nazerzadeh (2019) considered a
structured high-dimensional model with binary choices and proposed a regular-
ized maximum-likelihood policy which achieves regret O(s0 log(d) log(T )). These
prior models achieve a logarithmic rather than square root regret because demand
feedback is a deterministic function of some unknown parameter. See Kleinberg
& Leighton (2003) for an early discussion on this distinction and lower-bounds in
both settings. Ban & Keskin (2020) were the first to introduce random feature-
√
dependent price sensitivity and achieved the expected regret of O(s T log(T )) and
√
O(s T (log(d)+log(T ))) in linear and generalized linear demand models. Chen et al.
(2015) considered statistical learning and generalizations of feature-based pricing.
86
paper. Also, as discussed, bounds on the design matrix of our GLM are required for
convergence. Given the Schur Decomposition and Ostroswki’s Theorem (see Horn
& Johnson (2012)), we develop a new eigenvalue bound, Proposition 3, that when
combined with covariance estimation results from Vershynin (2018) enables us to
incorporate random contextual information.
The remainder of this chapter is structured as follows. In Section 4.2, we in-
troduce the model and formulate the problem. In Section 4.3 and specifically in
Theorem 4.3.1, we prove the main result of this chapter, the convergence of the
perturbed certainty equivalent pricing. We show the numerical results in Section
4.4. In Section 4.5, we present some concluding remarks and discussion of the work.
Proofs of additional results are deferred to appendices at the end of this chapter.
87
The solution p? (c) is the optimal price for context c. Given β0 , we assume there is
a unique optimal price p? (c) for each c ∈ C. We place one of two assumptions on
the reward function and the set of contexts,
A1a). The set C is finite and the Hessian of p 7→ r(p, c ; β0 ) is positive definite at
p? (c) for each c ∈ C.
A1b). The set C is convex and p 7→ r(p, c ; β0 ) is α-strongly concave for some α > 0.
y = µ β0> x + ε .
When it exists, the solution to this equation is unique (since µ is strictly increasing).
When the distribution of ys given xs , for s = 1, ..., t, are independent and each
belongs to the family of exponential distributions with mean µ β̂0> xs , then (4.2.2)
is the condition on β for maximizing the log-likelihood. In this case, β̂t is the
maximum likelihood estimator. However in our case, we don’t assume (yt | xt ) is
from an exponential family, so instead, as in Wedderburn (1974), we refer to β̂t as
maximum quasi-likelihood estimators.
88
Typically β̂t can be found with standard software packages using Newton meth-
ods such as Iteratively Reweighted Least Squares. For better time complexity, in
the case of linear regression, the Sherman–Morrison formula can be applied to yield
an online algorithm with O(td2 ) complexity.
A sequence of estimators β̂, is said to be strongly consistent if, as t → ∞ with
probability 1,
β̂t → β0 .
For adaptive designs it is often possible to prove even stronger results, specifically
that with probability 1,
2 log(t)
β̂t − β0 =O . (4.2.3)
λmin (t)
Pt
where λmin (t) is the minimum eigenvalues of the design matrix s=1 xs x>
s and k·k
A2.
0< min µ̇(β > x) .
x,β:kxk≤xmax ,
kβk≤βmax
Here xmax and βmax are the largest values of kxk and kβk for t ≥ 1.
The above assumption holds for linear regressions and also for any model where
the parameters β̂t remain bounded. We note that boundedness can be enforced
through projection, for instance Filippi et al. (2010) took this approach. For the
convergence rate (4.2.3), there are several alternative proofs on the rate of conver-
gence of adaptive GLMs designs. These are discussed in the literature review. Here,
any convergence result of the form (4.2.3) can be used in place of Assumption A2
and Proposition 4.3.1. We provide our own proof under Assumption A2 in order to
present a short self-contained treatment.
Time and Regret. Our goal to optimize revenue, (4.2.1), remains. However, re-
grettably, we will always fall short of this goal. This is because the parameters
89
β0 are unknown to us. Instead, we must simultaneously estimate β0 and choose
prices that converge on p? (c) for each c ∈ C. The variability in x = (p, c) required
to estimate β0 will inevitably be detrimental to convergence towards p? (c), while,
rapid convergence in prices may inhibit estimation and lead to convergence to sub-
optimal prices. Stated more generally, there is a trade-off between exploration and
exploitation which is well-known for bandit problems.
We let T ∈ N be the time horizon of our model. For each time t = 1, . . . , T , we
receive a vector of context ct ∈ C. Then for xt = (pt , ct ), we are given response
yt = µ β0> xt + εt ,
(4.2.4)
T
X
Rg(T ) := rt (x?t , yt? ; β0 ) − rt (xt , yt ; β0 ) .
t=1
where x?t := (p? (ct ), ct ) and yt? is the response under given x? . Recall that p? (ct )
is the optimal price for context ct given parameter β0 is known, see (4.2.1). The
regret is the expected revenue loss from applying prices pt rather than the optimal
price p? when β0 is known. Thus as β0 is unknown, we instead look to make the
regret as small as possible. As we discussed in the literature review, the best possible
√
bounds on the regret for this class of problem are of the order O( T ) (see Kleinberg
& Leighton (2003)), and any policy that achieves regret o(T ) can be considered to
have learned the optimal revenue.
90
4.2.2 Perturbed certainty equivalent pricing
The certainty equivalent price is the price that treats all estimated parameters as
if they were the true parameters of interest, and then chooses the optimal price
given those estimates. Specifically, for parameter estimates β̂ ∈ Rd , we define the
certainty equivalent price pce (c ; β̂) to be
Notice this is exactly our optimization objective (4.2.1), with estimate β̂ in place of
the true parameter β0 .
For some control problems, the certainty equivalent rule can be optimal, e.g.
linear quadratic control, and the certainty equivalent rule is a widely used scheme
in such as model predictive control. However, in general, it will lead to inconsis-
tent learning and thus sub-optimal rewards and revenue (Lai & Robbins, 1982).
Nonetheless, many companies will opt to use a certainty equivalent rule as it cleanly
separates the problem of model learning from revenue optimization.
With this in mind, we propose a simple implementable variant that maintains
this separation. For parameter estimates β, we choose prices
p = pce (c ; β) + αu ,
where ut are i.i.d. bounded, mean zero random vectors, β̂t−1 is our current maxi-
mum (quasi-)likelihood estimate (4.2.2). Moreover, αt is a deterministic decreasing
1
sequence. Shortly we will argue that taking αt = t− 4 is a good choice for achieving
statistical consistency while achieving a good regret bounds.
Pseudo-code for the perturbed pricing algorithm is given in Algorithm 4, below.
Here we split the procedure into 4 steps: context, where we receive the context
91
of the query and items to be sold; price, where we select the perturbed certainty
equivalent price; response, where we receive information about the items sold and
their revenue; and, estimate, where we update our MQLE parameter. As discussed
in the introduction, a key advantage of this scheme is its simplicity. Conventional
algorithms involve deterministic penalties or confidence ellipsoids for choices close
to the optimum. This in turn requires additional calculations such as matrix inver-
sions and eigenvalue decomposition which modify the task of maximizing revenue
and finding maximum likelihood estimators in a potentially non-trivial way. The
proposed approach is appealing that the proposed algorithm maintains the statisti-
cal maximum likelihood objective and the revenue objective and the randomization
added is a minor adjustment to the certainty equivalent rule.
where pce (c ; β̂) ∈ argmaxp∈P r(p, c ; β̂), αt = t−η , ut is i.i.d. mean zero
& covariance Σu 0.
Response:
For input xt = (pt , ct ),
receive response yt ,
receive reward rt (xt , yt ; β0 ).
Estimate:
Calculate the MQLE β̂t :
t
X
xs ys − µ β̂t> xs = 0.
s=1
end
92
4.3 Main results
In this section, we present our main result, in Theorem 4.3.1, an upper bound on
the regret under our policy. Its proof is provided in Section 4.3.3.
Theorem 4.3.1. If αt = t−η for η ∈ [1/4, 1/2) then, with probability 1, the regret
over time horizon T is
T !
p 1−2η
X log(t)
Rg(T ) = O T log T + T + .
t=1
t1−2η
1
Choosing η = 4
gives
√
Rg(T ) = O T log(T ) .
The order of the regret bound above is consistent with prior results such as den
Boer & Zwart (2014b) and Keskin & Zeevi (2014) which achieve a bound of the
same order.
We now describe the results that are required to prove Theorem 4.3.1, along with
some notation.
For a vector x ∈ Rd , we let kxk denote its Euclidean norm and kxk∞ denote its
supremum norm. Because we wish to consider the design matrix ts=1 xs x>
P
s , we
add some notation needed to re-express the effect of perturbation on the price.
Specifically, we re-express the perturbed certainty equivalent rule (price) in terms
of the full input vector x = (p, c) rather than just in terms of p. Specifically, we let
xt = x̂t + αt zt ,
where
x̂t = pce ct ; β̂t , ct ,
and
zt = (ut , 0) ∈ Rd .
Given our boundedness assumptions, we apply the notation pmax , cmax ∈ R+ where
kpt k∞ ≤ pmax and kct k∞ ≤ cmax for all t ∈ Z+ . Recall that ut are vectors of i.i.d.
93
bounded random variables, and therefore so are zt . We assume that kut k∞ ≤ umax
for all t ∈ Z+ where umax ∈ R+ , thus there exists zmax ∈ R+ such that kzt k∞ ≤ zmax
for all t ∈ Z+ . We denote the covariance matrix of zt by Σz = E zt zt> . Further-
more, we let λmax (t) and λmin (t) denote the maximum and minimum eigenvalues of
Pt >
s=1 xs xs .
Lemma 4.3.1. For p? ∈ argmaxp∈P r (p, c ; β0 ), there exists K0 > 0 such that, for
all p ∈ P and c ∈ C
If either Assumption A1a) or A1b) holds, then there exists K1 > 0 such that
Along with Lemma 4.3.1, Proposition 4.3.1 establishes that the key quantity to
determining the regret is λmin (t). As discussed in Section 4.1.3, under Assumption
94
A2, there are a number of similar results which can be used in place of Proposition
4.3.1. The argument presented initially takes the approach of Chen et al. (1999)
and Li et al. (2017) and then applies Lemma 1iii) of Lai & Wei (1982).
We now construct a lower bound for the smallest eigenvalue λmin (t) as follows.
t
!
X
λmin (t) = Ω αs2 .
s=1
The above proposition applies a new bound on minimum eigenvalue of the design
matrix ts=1 xs x>
P
s , which is critical to our proof. This algebraic eigenvalue bound is
given in Proposition 4.6.1, and proof of which is in Appendix C. This new eigenvalue
bound is obtained based on decomposition according to the Schur Complement. A
related eigenvalue bounds is Ostrowski’s Theorem, see Horn & Johnson (2012).
We now present a proof of Theorem 4.3.1. The main results required are Lemma
4.3.1, Proposition 4.3.1 and Proposition 4.3.2 as stated in the body of the paper.
We also require two more standard lemmas, Lemma 4.3.2 and Lemma 4.3.3, which
are stated and proved immediately after the proof of Theorem 4.3.1.
Proof of Theorem 4.3.1. Using Lemma 4.3.1 and Lemma 4.3.2, we can derive the
95
2
regret over selling horizon T in terms of β̂t − β0 , that is
T
X
Rg(T ) = rt (x?t , yt? ; β0 ) − rt (xt , yt ; β0 )
t=1
XT
= r (p? (ct ), ct ; β0 ) − r (pt , ct ; β0 )
t=1
XT T
X
+ r (pt , ct ; β0 ) − rt (xt , yt ; β0 ) + rt (x?t , yt? ; β0 ) − r (p?t (c?t ), ct ; β0 )
t=1 t=1
XT
≤ |r (pt , ct ; β0 ) − r (p? (ct ), ct ; β0 )|
t=1
p
+8 2rmax T log T
T
X
kpt − p? (ct )k2
p
≤ 8 2rmax T log T + K0
t=1
T T
p X 2 X 2
?
≤ 8 2rmax T log T + K0 2 pt − p (ct , β̂t ) + K0 2 p? (ct , β̂t ) − p? (ct )
t=1 t=1
T T
p X X 2
≤ 8 2rmax T log T + 2K0 u2max αt2 + 2K0 K1 β̂t − β0 . (4.3.3)
t=1 t=1
The first inequality above is immediate from Lemma 4.3.2. The second inequality
applies (4.3.1) proven in Lemma 4.3.1. The third inequality follows (a + b)2 ≤
2a2 + 2b2 for any a, b ∈ R. The last inequality follows by the definition of the
perturb price in (4.2.5) and also from Lemma 4.3.1.
Applying Proposition 4.3.1, we obtain, for some constant K2 ,
2 log(t)
β̂t − β0 ≤ K2 .
λmin (t)
By Proposition 4.3.2, we have, for some K3 and some the constant T0 > 0, for any
t ≥ T0 ,
t
X
λmin (t) ≥ K3 αs2 . (4.3.4)
s=1
T T
p 2
X
2 2K0 K1 K2 X log(t)
Rg(T ) ≤ 8 2rmax T log T + 2K0 umax αt + 2K0 K1 βmax T0 + Pt ,
t=1
K 3 t=1 s=1 α 2
s
96
Thus from the form of the above, we can see that there exist constants A0 , A1 and
A2 , such that
T T
p X X log(t)
Rg(T ) ≤ A0 T log T + A1 αt2 + A2 Pt 2
.
t=1 t=2 s=1 αs
We now cover the additional lemmas required above. Lemma 4.3.2 applied above
is an application of the Azuma-Hoeffding Inequality.
Lemma 4.3.2. With probability 1, for any sequence xt , it eventually holds that
T
X p
rt (xt , yt ; β0 ) − r(xt ; β0 ) ≤ 4 2rmax T log(T )
t=1
Thus by the Borel-Cantelli Lemma (see (Williams, 1991, S2.7)), with probability 1,
it holds that eventually |Mt | ≤ zt , which gives the result.
97
Lemma 4.3.3 is a standard integral test result.
Proof. This can be obtained by simple calculations. Since s−γ is decreasing for
γ > 0, then
t t
t1−γ − 1
X Z
−γ
s ≤1+ s−γ ds = 1 + ,
s=1 1 1−γ
and
t t
t1−γ − 1
X Z
−γ
s ≥ s−γ ds = .
s=1 1 1−γ
98
15 × 15 identity matrix. Then, we obtain the feature vector xt = (pt , ct ) ∈ R17 ,
where pt = (1, pt ). We assume that the true values of unknown parameters of price
vector are (β0 , β1 ) = (1, −0.5)> ∈ R2 , and ct and true coefficients β2 , ..., β16 are
drawn i.i.d. from a multivariate Gaussian distribution N (0, I) ∈ R15 . We model
the revenue as the price times the demand responses, i.e. rt (xt , yt ) = pt yt and
r(x̂t ; β) = E[rt (xt , yt ) | xt ]. We simulate our policy for T = 2000 with αt = t−1/4
and measure the performance of the policy by regret.
We first consider that demand follows a linear function, i.e. a GLM with an
identity link function µ(x) = x. Figure 4.1a shows the value of kβ̂t − β0 k2 , which is
the squared norm of the difference between the parameter estimates β̂t and the true
parameters β0 . We can see that the difference converges to zero as t becomes large,
which demonstrates the strong consistency of parameter estimates. To show the
√
order of regret, we re-scale it to Rg(T )/ T log(T ) in Figure 4.1b. It shows that the
√
Rg(T )/ T log(T ) converges to a constant value as T becomes large, which verifies
√
that the regret has, at most, an order of T log(T ) under our policy. The constant
in this case is quite small close to 0.14. In practice, the response that seller receives
may be a zero-one response corresponding to the item being unsold or sold. Here
a logistic regression model is appropriate, with link function µ(x) = 1/(1 + e−x ).
Figures 4.2a shows that kβ̂t − β0 k2 converges to zero as t becomes sufficiently large
√
and Figure 4.2b shows that regret is of order T log(T ). Again the constant in the
regret is small, around 0.01, suggesting good dependence on the size of this problem.
99
√
(a) kβ̂t − β0 k2 (b) Rg(T )/ T log(T )
Figure 4.1: Convergence of parameter estimates and regret with linear demand function.
Time period T = 2000 and αt = t−1/4 . Problem parameters used are [pl , ph ] = [0.5, 2],
m = 15, true parameters of price vector [1, −0.5]> ∈ R2 , ct and its true coefficients drawn
i.i.d. from N (0, I) ∈ R15 .
√
(a) kβ̂t − β0 k2 (b) Rg(T )/ T log(T )
Figure 4.2: Convergence of parameter estimates and regret with logistic regression for
demand. Time period T = 2000 and αt = t−1/4 . Problem parameters used are [pl , ph ] =
[0.5, 2], m = 15, true parameters of price vector [1, −0.5]> ∈ R2 , ct and its true coefficients
drawn i.i.d. from N (0, I) ∈ R15 .
100
of future research is to consider the setting where parameter estimates β̂t and prices
pt are updated online according to a Robbins–Monro rule. We focus on i.i.d. con-
texts, but it should be clear from the proofs that we may allow contexts to evolve in
a more general adaptive manner, so long variability in contextual information dom-
inates variability in prices. Also, we focus on i.i.d. perturbation, but other forms
of perturbation could be considered. For instance, Quasi-Monte-Carlo methods can
reduce variance while more systematically exploring the space perturbations. We
allow the size of perturbations to decrease uniformly over each context. However, in
practice one might want to let this decrease depending on the number of times that
a context has occurred. Thus the asynchronous implementation of the approach is
an important consideration. We consider a single retailer selling one item over mul-
tiple contexts. However, adversarial competition is often important. For instance
the interplay between large sets of contexts and a variety of sellers occurs in online
advertising auctions. Here an understanding of regret in both the stochastic and
adversarial setting would be an important research direction.
Finally to summarize, certainty equivalent pricing is commonly applied in man-
agement science. Random perturbation around the certainty equivalent price is sim-
ple and when combined with standard statistical parameter estimation it achieves
revenue comparable with more technically sophisticated learning algorithms.
101
4.6 Appendices
In this section, we prove the supporting results Lemma 4.3.1, Proposition 4.3.1 and
Proposition 4.3.2. Lemma 4.3.1 is an application of the Implicit Function Theorem.
Proposition 1 develops of the results of Lai & Wei (1982) to the case of GLMs
and follows similar lines to Chen et al. (1999) and Li et al. (2017). The proof
of Proposition 4.3.2 is a little more involved and requires an additional eigenvalue
bound Proposition 4.6.1, which is proven using a Schur Decomposition. We can
then use Proposition 4.6.1 along with random matrix theory bounds from Vershynin
(2018) to prove Proposition 2.
kM xk
kM kop = max = max kM xk = max x> M y ,
x∈Rd kxk x∈S d−1 x,y∈S d−1
x6=0
where S d−1 is the unit sphere in Rd . We recall by the spectral theorem, a symmetric
matrix has real valued eigenvalues and that its eigenvectors can be chosen as an
orthogonal basis of Rd . For a square matrix M , we let λmin (M ) and λmax (M )
denote the minimum and maximum eigenvalues of M . We use notations M 0 to
say that M is positive definite and M 0 to say that M is positive semi-definite.
If M is a positive semi-definite matrix then
102
and thus
We now prove Lemma 4.3.1 which requires the Implicit Function Theorem (Theorem
9.28 of Rudin (1976)).
Proof of Lemma 4.3.1. We first prove (4.3.1). Since ∇p r (p? (c), c ; β0 ) = 0. A Tay-
lor expansion w.r.t. p gives
1
r (p, c ; β0 ) − r (p? , c ; β0 ) ≤ (p − p? )> ∇2 r (p̃, c ; β0 ) (p − p? )> ,
2
1 2
|r (p, c ; β0 ) − r (p? , c ; β0 )| ≤ max max ∇2 r (p̃, c ; β0 ) op
p − p? .
2 c∈C p̃∈P
Define
K0 := max max ∇2 r (p̃, c ; β0 ) op
.
c∈C p̃∈P
Notice K0 < ∞, since r (p, c ; β0 ) is twice continuously differentiable and the sets C
and P are compact. The result in (4.3.1) is proved.
We now apply the Implicit Function Theorem to bound kp − p? k2 . We first
consider the case under Assumption A1a), and then under Assumption A1b).
Under Assumption A1a), the Implicit Function Theorem implies that for each
c ∈ C, there exists a neighborhood Vc ⊆ Rd such that p? (c ; β) is uniquely defined
and is continuously differentiable in B. Thus taking V = ∩c∈C Vc and applying the
Taylor expansion, for all β ∈ V give
103
where k1 := supc∈C supβ∈V |∇β p? (c ; β)|. The constant k1 is finite since ∇β p? (c ; β)
is continuous, C is finite and V can be chosen to be contained in a compact set.
Under Assumption A1b), we know by strict concavity that for each B and C,
p? (c ; β) is unique. Further by Assumption A1b), ∇2p r (p, c ; β0 ) is invertible. So
the Implicit Function Theorem applies to ∇p r (p, c ; β0 ). For all c ∈ C, β ∈ B, again
applying the Taylor expansion gives
−1
∇β p? = − ∇2p r (p? , c ; β) ∇p,β r (p? , c ; β) .
where we use the fact that r (p, c ; β) is twice continuously differentiable and
since r(p, c ; β) is α-strongly concave. Further k2 := supp,c,β k∇p,β r (p? , c ; β)kop <
∞ because r is continuously differentiable and because of the compactness of P, C
and B. Thus
for K1 = k2 /α.
We have both (4.6.2) and (4.6.3) holding. In other-words, under both Assump-
tions A1a) and A1b), we have that
104
Appendix B: Proof of Proposition 4.3.1
Proposition 4.3.1 develops of the results of Lai & Wei (1982) to the case of GLMs.
The proof follows similar lines to Chen et al. (1999) and Li et al. (2017) and then
applies Lemma 1iii) of Lai & Wei (1982). We note there are some reported errors
in the proofs of Chen et al. (1999). So like Li et al. (2017), we must take some care
to work around these issues and make sure the proof method is applicable in our
setting.
Pt
Proof of Proposition 4.3.1. We define Zt = s=1 s xs , where s = ys − µ(β0> xs ).
Our proof proceeds by bounding ||Zt ||Vt−1 above and below.
The bound in Proposition 4.3.1 is trivial if λmin (t) = 0. Thus we assume that
the increasing function λmin (t) is positive at t. We define
t
X
xs µ β > xs − µ β0> xs .
Gt (β) :=
s=1
The function µ(·) is continuously differentiable and strictly increasing. Thus by the
Mean Value Theorem, there exists β̃ on the line segment between β̂t and β0 (β̃
must depend on xs ) such that
t
X
Gt (β̂t ) − Gt (β0 ) = xs µ β̂t> xs − µ β0> xs
s=1
Xt
= µ̇ β̃ > xs xs x>
s β̂ t − β 0 = ∇Gt (β̃) β̂ t − β 0 . (4.6.4)
s=1
105
Given Assumption A2, we define
Since Zt = Gt (β̂t ),
2
kZt k2Vt−1 ≥ κ2 λmin (Vt ) β̂t − β0 .
To obtain the upper bound on kZt k2Vt−1 , we use Lemma 4.6.1 (stated below), that
almost surely
The following lemma is a restatement of Lemma 1iii) in Lai & Wei (1982). For
its proof, we refer to Lai & Wei (1982).
Qt = Zt> Vt−1 Zt ,
106
Appendix C: Proof of Proposition 4.3.2
To prove Proposition 4.3.2, we derive a new eigenvalue bound that is critical to our
proof. This algebraic eigenvalue bound is given in Proposition 4.6.1.
A B
Proposition 4.6.1. For any symmetric matrix of the form M = , we
B> C
have
λmin (C)2 −1 >
λmin (M ) ≥ λ min A − BC B ∧ λmin (C) .
(kBkop + λmin (C))2 + λmin (C)2
To prove Proposition 4.3.2, we apply Proposition 4.6.1 along with a random
matrix bound from Vershynin (2018). We state and prove this result (as Lemma
4.6.3) after the proof of Proposition 4.3.2. Also we require Lemma 4.6.2 which is a
standard eigenvalue bound stated and the concentration bound Lemma 4.6.3 after
the proof of Proposition 4.3.2. Now, we give the proof of Proposition 4.3.2.
Proof of Proposition 4.3.2. Applying the shorthand ps = pce cs , β̂s . We expand
the design matrix as follows
t
X t
X
xs x>
s = (ps + αs zs , cs ) (ps + αs zs , cs )>
s=1 s=1
Xt
= (ps , cs ) (ps , cs )> + αs2 zs zs> + αs zs (ps , cs )> + αs (ps , cs ) zs> .
s=1
107
We now provide lower-bounds on the various terms given above. First, since sets P
and C are bounded, we have
t
X
ps c >
s ≤ t pmax cmax . (4.6.7)
s=1 op
t
! t
X X
cs c> c
cs c> c
λmin s ≥ tλmin (Σ ) − s −Σ , (4.6.8)
s=1 s=1 op
where
t
r 2
X
cs c> c
s −Σ ≤ 16t log(t) c2max + kΣc kop ,
s=1 op
and also
t
! t
!
X X
λmin c s c>
s ≤ λmax cs c>
s ≤ tc2max . (4.6.9)
s=1 s=1
t
!
X
ps p> 2 >
≥ λmin αs2 zs zs>
λmin s + αs zs zs
s=1
t
X t
X
αs2 λmin (Σz ) αs2 zs zs> − Σz
≥ −
s=1 s=1 op
v
t
X
u
u t
2 X
2 z
≥ α λmin (Σ ) − t16 log(t) z 2
s
z
max + kΣ kop αs4 .
s=1 s=1
(4.6.10)
Pt
The first inequality is obtained since s=1 ps p>
s is positive definite matrix and
t
!
X
λmin ps p>
s > 0.
s=1
108
Applying (4.6.7), (4.6.8), (4.6.9) and (4.6.10) to (4.6.6) gives the result
t
!
X
λmin (ps , cs ) (ps , cs )> + αs2 zs zs>
s=1
p 2
λmin (Σc ) − log(t)/t
≥
(pmax cmax + c2max )2 + c2max
v
Xt u
u 2 Xt
z 2 2 z 4
× λmin (Σ ) αs − 16 log(t) zmax + kΣ kop
t αs
s=1 s=1
( r )
2
∧ tλmin (Σc ) − 16t log(t) c2max + kΣc kop .
t
X λmin (Σz )
≥ αs2 ,
s=1
2
for all t ≥ T0 .
Now for the 2nd term in (4.6.5). By triangle inequality, we have
t
X t
X t
X
> >
αs zs (ps , cs ) + αs (ps , cs ) zs> ≤ αs zs (ps , cs ) + αs (ps , cs ) zs>
s=1 op s=1 op s=1 op
t
X
≤2 αs zs (ps , cs )> .
s=1 op
Combining the above two inequalities, we obtain that for sufficiently large t,
! v
Xt t
X λmin (Σz ) u
u
Xt
> 2 2
λmin xs xs ≥ αs − 16 log(t) (zmax xmax )
t αs2 .
s=1 s=1
2 s=1
q
Notice that, for αt ∝ 1/t for η < 1/2, the term log(t) ts=1 αs2 , above, is domi-
η
P
109
The following is a well-known eigenvalue bound.
Lemma 4.6.2. Let A, B be symmetric positive definite matrices of size d×d. Then,
We can write
Then we have
Therefore,
λmin (A) ≥ λmin (B) − kA − Bkop .
Lemma 4.6.3 (Vershynin (2018)). If xs ∈ Rd are bounded such that, for all s ≥ 1
E xs x> x
E [xs | Fs−1 ] = 0 and s | Fs−1 = Σs .
We assume that there exists xmax ∈ R+ , such that bound kxs k∞ ≤ xmax with proba-
bility 1 and Σxs is positive definite. Then,
t
(ε/2)2
X
> x 2d
P xs xs − Σs ≥ ε ≤ 2 · 9 exp − P 2 .
s=1 op
2 t
x2max + kΣxs kop
s=1
110
PT
Proof. We show this argument in two steps. We first control t=1 xt x>
t over a ε-net,
and then extend the bound to the full supremum norm by a continuity argument.
1
Using Lemma 4.6.5 (stated below) and choosing ε = 4
and, we can find an ε-net
N of the unit sphere S d−1 with cardinality
|N | ≤ 9d .
that is
t
* t
! +
X X
xs x> x
xs x> >
s − Σs ≤ 2 max s − E xs xs v, w
v,w∈N
s=1 op s=1
t
!
X
≤ 2 max v > xs x> >
s − E xs xs w . (4.6.11)
v,w∈N
s=1
1
We note that Vershynin applies a Hoeffding bound. This is the only substantive difference in
the proof here.
111
Together with (4.6.11), we have
! !
X t t
X
xs x> x
≥ ε = P 2 max v > xs x> >
P s − Σs s − E xs xs | Fs−1 w ≥ε
v,w∈N
s=1 op s=1
2
ε
≤ 2 · 92d exp − P 2 .
8 t
x 2 + kΣ xk
s=1 max s op
then,
t
X 2 · 92d
xs x> x
P s − Σs ≥ εt ≤ .
s=1
t2
op
Lemma 4.6.4.
( )
t
X (ε/2)2
P αs zs x>
s
2d
≥ ε ≤ 2 · 9 exp − .
2 (zmax xmax )2 ts=1 αs2
P
s=1 op
Proof. Similar to the proof of Lemma 4.6.3, we show this argument in two steps. We
first control ts=1 αs zs x>
P
s over a ε-net, and then extend the bound to the full supre-
mum norm by a continuity argument. Notice that the summands of ts=1 αs zs x>
P
s
112
1
Using Lemma 4.6.5 (stated below) and choosing ε = 4
and, we can find an ε-net
N of the unit sphere S d−1 with cardinality
|N | ≤ 9d .
By Lemma 4.6.6 (stated below), the operator norm can be bounded by terms on N ,
that is
t
* t
! +
X X
αs zs x>
s ≤ 2 max αs zs x>
s v, w
v,w∈N
s=1 op s=1
t
!
X
≤ 2 max v > αs zs x>
s w . (4.6.13)
v,w∈N
s=1
Next, we unfix v, w ∈ N using a union bound. Since that N has cardinality bounded
by 9d , we obtain
t
! ! t
! !
>
X ε X X ε
P max v αs zs x>
s w ≥ ≤ P v >
αs zs x>
s w ≥
v,w∈N
s=1
2 v,w∈N s=1
2
( )
2
(ε/2)
≤ |N |2 · 2 exp − Pt
2 s=1 (αs zmax xmax )2
( )
2
ε
≤ 92d · 2 exp − Pt .
8 s=1 (αs zmax xmax )2
Thus, we obtain the result. For (4.6.12), the argument follows in an identical manner
to Corollary 4.6.1.
113
Lemmas 4.6.5 and 4.6.6 are stated below. For the proofs of Lemmas 4.6.5 and
4.6.6, we refer to Section 4 in Vershynin (2018).
Lemma 4.6.5 (Covering Numbers of the Euclidean ball). The covering numbers of
the unit Euclidean ball is such that, for any ε > 0
d d
1 2
≤N ≤ +1 .
ε ε
The same upper bound is true for the unit Euclidean sphere S d−1 .
1
sup |hAx, xi| ≤ kAkop ≤ sup |hAx, xi| .
x∈N 1 − 2ε x∈N
114
Applying Lemma 4.6.7 gives
(4.6.15)
Given the vector applied above, we now lower bound of the vector (w1 , w1 + BC −1 w2 ).
2 2
w1 , w2 + BC −1 w1 = kw1 k2 + w2 + BC −1 w1
2
≥ kw1 k2 + kw2 k − BC −1 w1 ∨ 0
2
2 −1
≥ kw1 k + kw2 k − BC op
kw1 k ∨ 0 .
Since kwk2 = kw1 k2 + kw2 k2 = 1, we can write the above bound in the form of
p √ 2
f (p) = p + 1−p−b p ∨0 ,
2 1
w1 , w1 + BC −1 w2
≥ 2 . (4.6.16)
kBC −1 kop + 1 +1
For any c ≥ 0,
A − BC −1 B > 0
>
v = cλmin A − BC −1 B > ∧ λmin (C) .
min v (4.6.17)
v:kvk=c
0 C
115
Thus, applying (4.6.16) and (4.6.17) to (4.6.15) gives
−1
A − BC B> 0 w1
−1
λmin (M ) = min w1 , w1 + BC w2
w:kwk=1
0 C w1 + BC −1 w2
A − BC −1 B > 0
>
≥ min v v
v:kvk≥c
0 C
λmin A − BC −1 B > ∧ λmin (C)
≥ 2 .
−1
kBC kop + 1 + 1
Substituting,
kBkop
BC −1 op
= ,
λmin (C)
gives the required result.
The following is a well known lemma for matrices under a Schur decomposition.
A − BC −1 B > ,
obviously the Schur Complement is symmetric. Notice that we can use this to
decompose M as
>
I BC −1 A − BC −1 B > 0 I BC −1
M = .
0 I 0 C 0 I
A block diagonal matrix is positive definite if and only if each diagonal block is
positive definite, which concludes the proof.
116
√ √ 2
Lemma 4.6.8. For a function f (p) = p + 1 − p − b p ∨ 0 , we have
1
min f (p) ≥ .
p∈[0,1] (b + 1)2 + 1
√ √
Proof. Since 1−p≥1− p, and then
√
f (p) = p + ((1 − (b + 1) p) ∨ 0)2 .
We define
g(x) = x2 + ((1 − (b + 1)x) ∨ 0)2 .
1
g(x) ≥ .
(b + 1)2 + 1
We can obtain this by finding the minimum of g(x) and showing that x? is the
minimum value such that 1 − (1 + b)x > 0. Note that function g(·) is convex with
g(0) = g(1) = 1, so finding a local minimum is sufficient. Now, we let g 0 (x) = 0,
that is
d 2 d
x + (1 − (b + 1)x)2 = ((b + 1)2 + 1)x2 − 2 ((b + 1)x) + 1 = 0 .
dx dx
This implies
b+1
x? = ,
(b + 1)2 + 1
and
1
(x? )2 + (1 − (b + 1)x? )2 = .
(b + 1)2 + 1
Further notice that 1 − (1 + b)x? > 0, so this point is a minimum of the function
g(x). Thus, we have
1
min f (p) ≥ min g(x) ≥ .
p∈[0,1] x∈[0,1] (b + 1)2 + 1
117
Chapter 5
Reinforcement learning in
insurance
5.1 Introduction
118
revenue becomes negative, we say that the company is ruined and has to stop its
business. The objective of the company is to set prices so as to maximize the
expected discounted dividend payout over a given time horizon (or until ruin).
119
5.2 Related literature
120
per unit time. Collins & Thomas (2012) compared three different RL approaches
for the airline pricing game and showed that learning policies eventually converge
on the expected solution. Rana & Oliveira (2014) applied RL with and without
eligibility traces to learn and optimize pricing perishable products with real-time
selling demand. We also refer to Könönen (2006), Han et al. (2008) and Kutschinski
et al. (2013) for dynamic pricing by multi-agent RL.
121
Agent
st+1 Environment
at ∈ A(st ), where A(st ) is the set of all available actions that the agent may take in
state st . Then at the next time step t + 1, the agent receives a reward rt+1 from the
environment and moves to a new state st+1 .
We let Gt be the expected return, which is the sum of discounted rewards, at
time t. The goal of agent is to select actions to maximize the expected return over
time horizon T (which needs not be finite)
T
X
Gt = γ t−1 rt ,
t=1
122
Value functions and optimality
vπ (s) := Eπ[Gt | st = s]
for all s ∈ S and Eπ denotes the expected value of a random variable given that the
agent follows policy π over episodes. The last equality is the Bellman equation for vπ ,
which shows that the value function can be decomposed into two parts: immediate
reward rt+1 and the discounted value of its possible successor state γvπ (st+1 ). Note
that the value of the terminal state, if any, is always zero.
Similarly, we can define the value of taking a particular action in a given state. Starting from a state s, if we take action a and thereafter follow the policy π, we have

$$q_\pi(s, a) := \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a].$$
Similarly, the optimal action-value function, denoted by q⋆(s, a), is defined by

$$q^\star(s, a) := \max_{\pi} q_\pi(s, a).$$

Temporal-difference (TD) methods learn value functions by bootstrapping from the current estimates; in the simplest case, the state-value estimate is updated according to

$$v(s_t) \leftarrow v(s_t) + \alpha \left( r_{t+1} + \gamma v(s_{t+1}) - v(s_t) \right).$$

Here rt+1 + γv(st+1) is called the TD target. The term δt := rt+1 + γv(st+1) − v(st) is called the TD error, which is the difference between the value at state st and the value at the subsequent state st+1 plus the reward rt+1 accumulated along the way. The error decreases as the value functions are repeatedly updated. The step-size parameter α ∈ (0, 1] controls the learning rate.
There are two main classes of TD methods: on-policy and off-policy. In on-policy methods, the agent learns the optimal policy and behaves using that same policy; Sarsa is an example of an on-policy method. In off-policy methods, the agent learns about the optimal policy using two policies: one that is being learned and updated (the update policy) and another that is used for exploration (the behavior policy); Q-learning is an example of an off-policy method. Sarsa and Q-learning are discussed in the next section.
Sarsa
Sarsa was introduced by Rummery & Niranjan (1994) and named by Sutton (1996)
because it uses all of the values st , at , rt+1 , st+1 , at+1 for each update. The algorithm
tries to learn the action-value function. More specifically, at every time step, we
estimate qπ (s, a) for the same behavior policy π and for all states s and actions a.
We write Q(s, a) to refer to its approximation. In Sarsa, the update rule for the
action-value function is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right).$$

If st+1 is the terminal state, then we let Q(st+1, at+1) = 0. A complete description of Sarsa is given in Algorithm 5. If all states are visited infinitely often and with an appropriate choice of step-size α, Sarsa converges to the optimal policy and action-value function with probability 1 (Singh et al., 2000).

Algorithm 5: Sarsa Algorithm
    Initialize: discount factor γ, step size α ∈ (0, 1], small ε > 0;
    Q(s, a) arbitrarily for all s ∈ S, a ∈ A(s), and Q(terminal state, ·) = 0;
    Repeat
        Initialize s;
        Choose action a from s using the ε-greedy policy derived from Q;
        Repeat
            Take action a and observe r and next state s′;
            Choose action a′ from s′ using the ε-greedy policy derived from Q;
            Q(s, a) ← Q(s, a) + α (r + γQ(s′, a′) − Q(s, a));
            s ← s′, a ← a′;
        Until s is a terminal state;
    Until all episodes are observed;
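For concreteness, a minimal tabular implementation of Algorithm 5 in Python might look as follows. This is a sketch rather than the code used in this thesis: the environment interface env.reset() → s and env.step(a) → (s′, r, done), and all default parameter values, are illustrative assumptions.

    import numpy as np

    def sarsa(env, n_states, n_actions, episodes=5000,
              alpha=0.5, gamma=0.9, eps=0.1, seed=0):
        """Tabular Sarsa with an eps-greedy behavior policy."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))

        def eps_greedy(s):
            if rng.random() < eps:                      # explore
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))                 # exploit: greedy action

        for _ in range(episodes):
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                target = r if done else r + gamma * Q[s_next, a_next]
                Q[s, a] += alpha * (target - Q[s, a])   # Sarsa update
                s, a = s_next, a_next
        return Q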
Q-learning

Unlike Sarsa, which chooses all actions according to the same policy, Q-learning (Watkins, 1989) updates towards a different policy, namely the greedy one: its target uses the action that maximizes the Q-value at the new state st+1. Q-learning's update rule is given by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right).$$
Algorithm 6: Q-learning Algorithm
    Initialize: discount factor γ, step size α ∈ (0, 1], small ε > 0;
    Q(s, a) arbitrarily for all s ∈ S, a ∈ A(s), and Q(terminal state, ·) = 0;
    Repeat
        Initialize s;
        Repeat
            Choose action a from s using the ε-greedy policy derived from Q;
            Take action a and observe r and next state s′;
            Q(s, a) ← Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a));
            s ← s′;
        Until s is a terminal state;
    Until all episodes are observed;
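Reusing the names from the Sarsa sketch above, the only change in a tabular Q-learning implementation is the target inside the episode loop, which bootstraps from the greedy action rather than from the action actually taken next:

    while not done:
        a = eps_greedy(s)                               # behave eps-greedily
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * np.max(Q[s_next])  # greedy bootstrap
        Q[s, a] += alpha * (target - Q[s, a])           # Q-learning update
        s = s_next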
Expected Sarsa

Expected Sarsa (van Seijen et al., 2009) follows the same scheme but replaces the maximum in the Q-learning target with the expected value of Q(st+1, ·) under the current policy, i.e., it updates towards the target $r_{t+1} + \gamma \sum_a \pi(a \mid s_{t+1}) Q(s_{t+1}, a)$.
5.3.2 Function approximation

For large MDPs, there may be too many states and/or actions to store in memory, and it is also very slow to learn the value of each state individually. Therefore a function approximator is often used to estimate the true action-value function, q̂(s, a, w) ≈ qπ(s, a), using a weight vector w. The goal is to find weights w that minimize the mean-squared error

$$L(w) = \frac{1}{2}\, \mathbb{E}_\pi\!\left[ \left( q_\pi(s, a) - \hat{q}(s, a, w) \right)^2 \right],$$

for which gradient descent updates the weights by

$$\Delta w = -\alpha \nabla_w L(w) = \alpha\, \mathbb{E}_\pi\!\left[ \left( q_\pi(s, a) - \hat{q}(s, a, w) \right) \nabla_w \hat{q}(s, a, w) \right].$$

A popular class of function approximators is neural networks (NNs), whose units apply a nonlinear activation function to a weighted sum of their inputs; common choices are the hyperbolic tangent, f(x) = tanh(x), and the rectifier, f(x) = max(0, x).
Figure 5.2: A feedforward NN with n input units, m output units, and two hidden layers.
NNs consist of an input layer, an output layer, and often one or more "hidden" layers between the two. Those with one hidden layer are called shallow networks, and those with more than one hidden layer are called deep networks. A feedforward neural network is a popular NN architecture in which information flows along a directed acyclic graph: it enters at the input layer, passes through the hidden layer(s), and finally leaves through the output layer. If the network has loops, it is called a recurrent NN. A specialized kind of feedforward NN is the convolutional neural network, which has been tremendously successful in practical applications such as facial recognition (Lawrence et al., 1997), image classification (Krizhevsky et al., 2012) and speech recognition (Abdel-Hamid et al., 2014; Amodei et al., 2016). Figure 5.2 illustrates a feedforward NN architecture with an input layer of n input units, two hidden layers and m output units.
network. The error information is then sent backwards through the network. The use of deep NNs for function approximation in RL is known as deep reinforcement learning, and has achieved impressive success in recent years. In deep RL, NNs can use TD errors to learn value functions. For example, DeepMind's Atari game-playing agent used a deep convolutional network to approximate Q-values (Mnih et al., 2015).
A deep NN used to approximate Q-values is called a deep Q-network, or DQN for short, so that

$$q(s, a, w) \approx q^\star(s, a).$$

The network is updated according to a loss function on the Q-values. We use the mean-squared error L(·) to compute the loss, which is the difference between the approximated and true Q-values. In practice, the true value function is unknown, so we substitute the optimal target r + γ max_{a′} q⋆(s′, a′) with the approximate target r + γ max_{a′} q(s′, a′, w). At each iteration i, we have

$$L(w_i) = \mathbb{E}_\pi\left[ \left( r + \gamma \max_{a'} q(s', a', w_i) - q(s, a, w_i) \right)^2 \right].$$
All TD algorithms face a trade-off between exploitation and exploration: choosing the best actions based on the current knowledge (exploitation) versus choosing actions that potentially yield higher rewards in the future (exploration).
An ε-greedy policy is one of the most popular action-selection rules for balancing exploration with exploitation. This rule chooses the best action with probability 1 − ε and a uniformly random action with probability ε, where ε is called the exploration rate. The best action, called the greedy action, is the action (or one of the actions) that gives the highest estimated action value so far. We normally let ε be small, which guarantees that we take the best prices most of the time. Assuming there are m available actions, the ε-greedy policy can be written as

$$\pi(a \mid s) = \begin{cases} \varepsilon/m + 1 - \varepsilon, & \text{if } a = a^\star = \arg\max_{a' \in \mathcal{A}} q(s, a'), \\ \varepsilon/m, & \text{otherwise}. \end{cases}$$

Another approach is a decaying ε-greedy policy, where ε slowly decays over time. A simple way to obtain the decay is to multiply ε by a real number less than 1, called the decay rate, so that ε ← ε × decay rate. The agent tends to explore more at the beginning, when it does not have enough information about the environment, and to exploit more as it gains an understanding of the environment. A typical implementation may start with ε = 0.1, and set a decay rate of 0.995 or 0.999 and a minimum ε of 0.01.
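A minimal sketch of this schedule, using the values quoted in the text (the per-episode placement of the decay step is one common choice, not prescribed above):

    n_episodes = 1000                                # illustrative
    eps, eps_min, decay_rate = 0.1, 0.01, 0.995
    for episode in range(n_episodes):
        # ... run one episode, choosing actions eps-greedily with the current eps ...
        eps = max(eps_min, eps * decay_rate)         # eps <- eps x decay rate, floored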
The cross-entropy (CE) method is a simple and versatile technique for optimization, based on CE minimization, which has asymptotic convergence properties. For a more detailed introduction to the foundations and various other applications of the CE method, we refer to de Boer et al. (2003). Let X be a finite set of states and f(x) be a performance function over all x ∈ X. We want to find the maximum of f over X, and we denote the maximum value by ℓ⋆, such that

$$\ell^\star = \max_{x \in \mathcal{X}} f(x).$$

The CE method associates with this optimization problem the rare-event estimation problem P(f(X) ≥ ℓ), i.e., the probability that the value f exceeds some fixed ℓ.
Recall that an MDP is defined by a tuple (S, A, P, R), where S = {1, . . . , n} is
a finite set of states; A = {1, . . . , m} is a finite set of possible actions; P is a state
transition probability matrix with elements P (s0 | s, a) representing the transition
probability from state s to state s0 when action a is chosen; and R is the set of all
possible rewards, r(s, a), received by choosing action a in state s.
Assume there exists a stopping time τ at which the process terminates. We consider a policy that does not depend on the time t and denote the policy as a vector x = (x1, . . . , xn), with xi ∈ A being the action taken in state i. Then we can write the total reward as

$$f(x) = \mathbb{E}_x\left[ \sum_{t=1}^{\tau} r(s_t, a_t) \right],$$

starting from some fixed state s1. Here r(st, at) is the reward obtained by choosing action at in state st at time t, and Ex denotes the expectation under the policy x. We assume that the process starts from a specific initial state s1 = sstart and ends at an absorbing state (or terminal state), denoted by ster, with zero reward. Furthermore, we assume that the terminal state is always reached by time τ. We want to calculate the associated sample reward function f at each state.
Now define an auxiliary n × m probability matrix P = {Psa} with elements Psa, where s = 1, . . . , n and a = 1, . . . , m. Note that Psa is short for π(a | s) or P(at = a | st = s), which denotes the probability of taking action a in state s, and $\sum_{a=1}^{m} P_{sa} = 1$ for any s. Assume that the matrix P is initialized to a uniform matrix, Psa = 1/m. Once this matrix P is defined, the CE algorithm comprises the following two phases at each iteration t:

1. Generating N sample trajectories of the MDP under the current matrix P, together with the sample reward collected from each visited state.

2. Updating the parameters of the matrix P on the basis of the data collected in the first phase.
Algorithm 8: CE Method in MDPs
    Input: auxiliary policy matrix P.
    For i = 1, . . . , N:
        Start from the given initial state s1 = sstart and set t = 1 (step counter);
        Repeat
            Generate an action at from Psa in (5.3.1);
            Observe a reward rt = r(st, at) and a new state st+1;
            Set t = t + 1;
        Until st = ster;
        Obtain a trajectory $X_i = \{s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, \ldots, s_{\tau-1}^{(i)}, a_{\tau-1}^{(i)}, r_{\tau-1}^{(i)}, s_\tau^{(i)}\}$;
    Output: f.
For each trajectory we record the cumulative reward from each state until termination: for each state s = sj, the reward is $f_{s_j}(X_i) = \sum_{t=j}^{\tau} r(s_t^{(i)}, a_t^{(i)})$. Each state s is updated separately, according to the performance fs(Xi) obtained from state s onwards. Thus, at iteration t, the parameter matrix, denoted by Pt,sa, is updated by

$$P_{t,sa} = \frac{\sum_{i=1}^{N} \mathbb{1}_{\{f_s(X_i) \geq \ell_{t,s}\}}\, \mathbb{1}_{\{X_i \in \mathcal{X}_{sa}\}}}{\sum_{i=1}^{N} \mathbb{1}_{\{f_s(X_i) \geq \ell_{t,s}\}}\, \mathbb{1}_{\{X_i \in \mathcal{X}_{s}\}}}, \tag{5.3.1}$$

where the set Xs contains all trajectories that visit state s, and the set Xsa contains all trajectories in which action a is taken at state s. We take ℓt,s to be the (1 − ρ)100% quantile of the rewards, which separates off the best rewards in each iteration t. It can be obtained by ordering the rewards fs(X1), . . . , fs(XN) from smallest to largest, fs(1) ≤ · · · ≤ fs(N), and letting

$$\ell_{t,s} = f_{s(\lceil (1-\rho)N \rceil)}.$$

Here ρ is chosen to be not too small, say ρ ≥ $10^{-2}$, so that the event {fs(X) ≥ ℓt,s} is not too rare. Note that a different threshold parameter ℓt,s is used for each state s at each iteration t.
The CE method is iterative, and the update of the parameters is based on the best-performing samples. In the setting of MDPs, the CE method takes advantage of the Markov property by only considering the reward from the visit to a state onward: the choice of action in a given state only affects the reward from that state onward, not the past. This speeds up the algorithm and reduces bias.
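A sketch of the update (5.3.1) in Python is given below; the representation of an episode as a list of (s, a, r) triples and the helper names are our own assumptions.

    import numpy as np

    def ce_update(P, episodes, rho=0.1):
        """One CE iteration for an n x m policy matrix P; implements (5.3.1).

        episodes: list of trajectories, each a list of (s, a, r) triples.
        For each state s, keep the episodes whose reward-to-go from the first
        visit to s reaches the (1 - rho)-quantile level, and set P[s, a] to the
        fraction of those elite episodes that took action a at s.
        """
        n, m = P.shape
        P_new = P.copy()
        for s in range(n):
            rewards, actions = [], []
            for ep in episodes:
                for j, (sj, aj, _) in enumerate(ep):
                    if sj == s:                          # first visit to s
                        rewards.append(sum(r for _, _, r in ep[j:]))
                        actions.append(aj)
                        break
            if not rewards:
                continue                                 # state never visited
            rewards = np.asarray(rewards)
            level = np.sort(rewards)[int(np.ceil((1 - rho) * len(rewards))) - 1]
            elite = rewards >= level                     # 1{f_s(X_i) >= l_{t,s}}
            counts = np.zeros(m)
            for a, keep in zip(actions, elite):
                counts[a] += float(keep)
            if counts.sum() > 0:
                P_new[s] = counts / counts.sum()
        return P_new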
τ := inf{t ∈ N : rt < 0} .
The performance of a policy, measured by expected reward, is defined by

$$\mathbb{E}_\pi\left[ \sum_{t=1}^{T \wedge \tau} \gamma^{t-1} \mathrm{Div}_t \right], \tag{5.4.3}$$

where γ ∈ [0, 1] is the discount factor and T ∧ τ is the effective stopping time: if the revenue stays positive up to the selling horizon T, the company stops at time T; otherwise, it stops at the ruin time τ.
In our numerical example, the expected demand and the expected total claims at price p are

$$\mathbb{E}[D(p)] = 11 - 0.8p, \qquad \mathbb{E}[C(p)] = 3 + 0.25p,$$

and the feasible price set is [pℓ, ph] = [1, 10]. At time t, the insurer has to make decisions on paying dividends to shareholders based on its revenue. Without loss of generality, we set c = 10% and ℓ = 1 in (5.4.2); then the dividend that the insurer pays is

$$\mathrm{Div}_t = \begin{cases} 0.1\,(r_t - 1), & \text{if } r_t > 1, \\ 0, & \text{if } r_t \leq 1. \end{cases} \tag{5.5.1}$$

That is, if rt > 1, the insurer pays shareholders 10% of the excess rt − 1; otherwise it pays nothing. Thus, at time t, the insurer chooses a selling price pt, and then observes demand dt and total claims ct. Claims are only paid when insured events happen; since insured events do not always occur, neither do claims. We therefore introduce a parameter wt ∈ {0, 1} and assume that there is a 50% chance that the insured events happen: we generate a uniform random number and, if it is less than or equal to 0.5, we let wt = 1, which means that the company needs to pay claims; if it is greater than 0.5, we let wt = 0, which implies that the
company does not pay claims. Then the revenue at time t + 1 is

$$r_{t+1} = r_t + p_t d_t - w_t c_t - \mathrm{Div}_t.$$
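Putting the pieces together, the environment can be sketched as follows. This is our own encoding of the model: demand and claims are taken at their expected values, the initial revenue r0 and the revenue recursion above are assumptions, and the per-step reward is the dividend Divt.

    import numpy as np

    class InsuranceEnv:
        """Sketch of the insurance pricing MDP described above."""

        def __init__(self, r0=1.0, seed=0):
            self.rng = np.random.default_rng(seed)
            self.r0 = r0
            self.prices = np.arange(1, 11)               # feasible prices [1, 10]

        def reset(self):
            self.r = self.r0                             # initial revenue (assumed)
            return self.r

        def step(self, action):
            p = self.prices[action]
            d = 11.0 - 0.8 * p                           # expected demand E[D(p)]
            c = 3.0 + 0.25 * p                           # expected claims E[C(p)]
            w = 1.0 if self.rng.random() <= 0.5 else 0.0 # insured event occurs?
            div = 0.1 * (self.r - 1.0) if self.r > 1.0 else 0.0  # dividend (5.5.1)
            self.r = self.r + p * d - w * c - div        # assumed revenue recursion
            done = self.r < 0                            # ruin: stop the business
            return self.r, div, done                     # reward = dividend paid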
5.5.1 Temporal-difference methods
(a) Q-learning (b) Sarsa

Figure 5.3: Performance of Q-learning, Sarsa and Expected Sarsa. The curves display the T-period regret for different step-sizes α = 0.01, 0.03, 0.1, 0.3, 1.0, where the periods T are $10^6$, $10^6$ and $5 \times 10^5$, respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1.
Figure 5.4: A comparison of Q-learning, Sarsa and Expected Sarsa as a function of α. The three curves display the T-period regret at period $10^5$, respectively. In all cases, the problem parameters used are γ = 0.9 and ε = 0.1.
Figure 5.5: Performance of Q-learning, Sarsa and Expected Sarsa for different exploration rates ε = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0. The curves display the T-period regret at period $10^5$, respectively. In all cases, the problem parameters used are γ = 0.9 and α = 0.5.
Figure 5.6: A comparison of the T-period regret of Q-learning, Sarsa and Expected Sarsa. The three curves display the T-period regret at period $10^5$, respectively. In all cases, the problem parameters used are γ = 0.9, ε = 0.1 and α = 0.5.
Since the insurance pricing problem is a multi-armed bandit problem, we can use the cross-entropy (CE) method to search for the optimal policy.
In Figure 5.8, we plot the average reward obtained by the CE algorithm with ρ = 1%, 5% and 10%. We can see that CE with ρ = 10% converges faster than CE with ρ = 5%, and both are faster than CE with ρ = 1%. This is because we considered a (1 − ρ)-quantile of the performances, ℓt,s, and counted how many of the samples X1, . . . , XN have a performance greater than or equal to ℓt,s in state s at iteration t. A larger ℓt,s is closer to the best of the performances in state s at iteration t, which leads to quicker convergence to the optimal performance. The value of ρ does not affect the optimal policy, and CE for the different values of ρ all converge to the same optimal policy.
Figure 5.7: Pricing Policy. The curve is the average reward for varying prices, and the
green area is the CE convergence zone. The problem parameters are N = 100, T = 100
and ρ = 10%.
Figure 5.8: A comparison of average reward for CE policies with 90%, 95% and 99%
percentiles. In all cases, the problem parameters are N = 100 and T = 100.
The insurance pricing problem can be viewed as a simple MDP with a one-dimensional state space, in which each revenue level is a distinct state. Therefore, we can apply deep Q-learning with a neural network (DQN). In our case, the input layer receives one piece of information, i.e., there is only one input node. There are two hidden layers, each with 24 nodes and a sigmoid activation. In the output layer, there is a separate output unit for each possible price, so there are 10 output nodes. The model is compiled using a mean-squared error loss function and the Adam optimizer with step-size α. We used Keras to implement the DQN.

Figure 5.9: Performance of DQN using Keras and Q-learning. The curves display the T-period regret over 20,000 periods, respectively. In both cases, the problem parameters used are a constant γ = 0.9, α = 0.5 and a decaying ε.
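A sketch of this architecture in Keras is given below. The function names are ours; the network shape, activations, loss and optimizer follow the description above, and the one-step update mirrors the Q-learning target.

    import numpy as np
    from tensorflow import keras

    def build_q_network(alpha=0.5):
        # 1 input (the revenue), two hidden layers of 24 sigmoid units,
        # 10 linear outputs: one Q-value per feasible price.
        model = keras.Sequential([
            keras.layers.Input(shape=(1,)),
            keras.layers.Dense(24, activation="sigmoid"),
            keras.layers.Dense(24, activation="sigmoid"),
            keras.layers.Dense(10, activation="linear"),
        ])
        model.compile(loss="mse",
                      optimizer=keras.optimizers.Adam(learning_rate=alpha))
        return model

    def dqn_update(model, s, a, r, s_next, done, gamma=0.9):
        # One gradient step towards the target r + gamma * max_a' Q(s', a').
        q = model.predict(np.array([[s]]), verbose=0)
        q_next = model.predict(np.array([[s_next]]), verbose=0)
        q[0, a] = r if done else r + gamma * float(np.max(q_next[0]))
        model.fit(np.array([[s]]), q, epochs=1, verbose=0)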
We set the parameters α = 0.5 and γ = 0.9. Prices were selected using a decaying ε-greedy policy, with ε decaying from 1.0 to 0.01 at a rate of 0.995 and fixed at 0.01 thereafter. Figure 5.9 compares the performance of Q-learning with and without function approximation under the same parameters. We performed 20,000 episodes on 10 different occasions and averaged the reward over 100 episodes. We can see that tabular Q-learning converges more quickly and learns more stably, while the DQN achieves a higher average reward but its learning process is less stable.
In this chapter, we showed that the insurance pricing problem can be modeled as an MDP. We described all the elements of an MDP: the state (i.e., revenue) and action (i.e., price) spaces, and the reward (i.e., the dividend pay-out). We presented the different RL-based methodologies, the basic algorithms and their modifications, and discussed their application to the insurance pricing problem. Our results showed that RL techniques can be useful for solving pricing problems in the insurance context, where available information is limited.
There are several directions for future work. First, we showed the potential of applying RL in insurance, but it is still not well understood theoretically. For example, the convergence of the methods mentioned above deserves further study, since the numerical results do not provide guidance on convergence rates. Thus the next step might be to establish bounds on convergence rates, at least in some particular cases. The lack of theoretical support, together with the heavy regulation and government oversight of insurance practices, will affect the adoption of RL in the insurance industry.
When we apply RL to a real-world business, there might be hundreds of products to consider, and thousands of factors affecting how to price them. We have considered a single monopolistic company that sells one product. Therefore, the next immediate step would be to consider a company that sells a large number of products, or a large number of companies in the market. In addition, we may consider a contextual case by introducing features. Possible features are the age and gender of the customer, consumer behavior, geography, market size, and so on. The insurance industry differs from other financial services industries because insurance companies must price their products without knowing the actual costs. Given this uncertainty, companies need to predict both future demand and future claims for their products in order to estimate premiums at the outset. Adding more information can not only improve the demand and claims estimation, but also enable insurers to know their customers and provide personalized insurance that matches each individual customer's requirements.
Bibliography
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533–1545. doi:10.1109/TASLP.2014.2339736.
Albrecher, H., & Kortschak, D. (2009). On ruin probability and aggregate claim representations for Pareto claim size distributions. Insurance: Mathematics and Economics, 45, 362–373. doi:10.1016/j.insmatheco.2009.08.005.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., et al. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning (pp. 173–182). New York, NY, USA: PMLR, volume 48 of Proceedings of Machine Learning Research. URL: http://proceedings.mlr.press/v48/amodei16.pdf.
Anderson, T. W., & Taylor, J. B. (1979). Strong consistency of least squares es-
timates in dynamic models. The Annals of Statistics, 7 , 484–489. doi:10.1214/
aos/1176344670.
Antonio, K., & Valdez, E. A. (2012). Statistical concepts of a priori and a posteriori
risk classification in insurance. AStA Advances in Statistical Analysis, 96 , 187–
224. doi:10.1007/s10182-011-0152-7.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the
multiarmed bandit problem. Machine Learning, 47 , 235–256. doi:10.1023/A:
1013689704352.
Ban, G.-Y., & Keskin, N. B. (2020). Personalized dynamic pricing with machine learning: High dimensional features and heterogeneous elasticity. Management Science. Forthcoming. URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2972985.
Bartlett, M. S. (1951). An inverse matrix adjustment arising in discriminant analysis.
Annals of Mathematical Statistics, 22 , 107–111. doi:10.1214/aoms/1177729698.
Baxter, J., Bartlett, P. L., & Weaver, L. (2001). Experiments with infinite-horizon,
policy-gradient estimation. Journal of Artificial Intelligence Research, 15 , 351–
381. doi:10.1613/jair.807.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. (4th ed.). Athena Scientific. URL: http://athenasc.com/ndpbook.html.
Besbes, O., & Zeevi, A. (2009). Dynamic pricing without knowing the demand
function: Risk bounds and near-optimal algorithms. Operations Research, 57 ,
1407–1420. URL: http://www.jstor.org/stable/25614853.
den Boer, A. (2013). Dynamic Pricing and Learning. Ph.D. thesis VU University of
Amsterdam. URL: http://dare.ubvu.vu.nl/bitstream/handle/1871/39660/
dissertation.pdf.
den Boer, A. (2015). Dynamic pricing and learning: Historical origins, current
research, and new directions. Surveys in operations research and management
science, 20 , 1–18. doi:10.1016/j.sorms.2015.03.001.
den Boer, A., & Zwart, B. (2014a). Mean square convergence rates for maximum quasi-likelihood estimators. Stochastic Systems, 4, 375–403. doi:10.1287/12-SSY086.
den Boer, A., & Zwart, B. (2014b). Simultaneously learning and optimizing using
controlled variance pricing. Management Science, 60 , 770–783. doi:10.1287/
mnsc.2013.1788.
de Boer, P.-T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2003). A tutorial on the cross-entropy method. Annals of Operations Research, 134, 19–67. doi:10.1007/s10479-005-5724-z.
Botev, Z. I., Kroese, D. P., Rubinstein, R. Y., & L'Ecuyer, P. (2013). The cross-entropy method for optimization. In Handbook of Statistics – Machine Learning: Theory and Applications (pp. 35–59). Elsevier, volume 31. doi:10.1016/B978-0-444-53859-8.00003-5.
Brochu, E., Cora, V. M., & de Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599. URL: http://arxiv.org/abs/1012.2599.
Brockman, M., & Wright, T. (1992). Statistical motor rating: making efficient use
of your data. Journal of the Institute of Actuaries, 119 , 457–543. doi:10.1017/
S0020268100019995.
Broder, J., & Rusmevichientong, P. (2012). Dynamic pricing under a general para-
metric choice model. Operations Research, 60 , 965–980. doi:10.1287/opre.1120.
1057.
Brooks, C. H., Fay, S. A., Das, R., MacKie-Mason, J. K., Kephart, J. O., & Durfee,
E. H. (1999). Automated strategy searches in an electronic goods market: learning
and complex price schedules. In Proceedings of the First ACM Conference on
Electronic Commerce (EC-99), Denver, CO, USA (pp. 31–40). URL: https:
//doi.org/10.1145/336992.337000. doi:10.1145/336992.337000.
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 1–122. doi:10.1561/2200000024.
Business Insider (2018). Amazon changes prices on its products about every 10
minutes—here’s how and why they do it. https://www.businessinsider.com/
amazon-price-changes-2018-8.
Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data. Econometric Society Monographs (2nd ed.). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139013567.
Chen, K., Hu, I., & Ying, Z. (1999). Strong consistency of maximum quasi-likelihood
estimators in generalized linear models with fixed and adaptive designs. The
Annals of Statistics, 27 , 1155–1163. doi:10.1214/aos/1017938919.
Chen, X., Owen, Z., Pixton, C., & Simchi-Levi, D. (2015). A statistical learning approach to personalization in revenue management. Available at SSRN 2579462.
Chow, Y. S. (1965). Local convergence of martingales and the law of large num-
bers. The Annals of Mathematical Statistics, 36 , 552–558. doi:10.1214/aoms/
1177700166.
Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear
payoff functions. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics (pp. 208–214). volume 15. URL: http://
proceedings.mlr.press/v15/chu11a.html.
Cohen, M., Lobel, I., & Paes Leme, R. (2016). Feature-based dynamic pricing.
ACM Conference on Economics & Computation (EC), 15 , 40–44. URL: http:
//dx.doi.org/10.2139/ssrn.2737045.
Collins, A., & Thomas, L. (2012). Comparing reinforcement learning approaches for
solving game theoretic models: a dynamic airline pricing game example. Journal
of the Operational Research Society, 63 , 1165–1173. doi:10.1057/jors.2011.94.
Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization
under bandit feedback. In 21st Annual Conference on Learning Theory (COLT)
(pp. 355–366). URL: http://colt2008.cs.helsinki.fi/papers/80-Dani.pdf.
Drygas, H. (1976). Weak and strong consistency of the least squares estimators
in regression models. Z. Wahrscheinlichkeitstheorie verw Gebiete, 34 , 119–127.
doi:10.1007/BF00535679.
Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling extremal events
for insurance and finance. Applications of mathematics, 33. Berlin ; London:
Springer.
Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS 2010) (pp. 586–594). URL: https://sites.ualberta.ca/~szepesva/papers/GenLinBandits-NIPS2010.pdf.
Gallego, G., & van Ryzin, G. (1994a). Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Science, 40, 999–1020. URL: http://www.jstor.org/stable/2633090.
Gallego, G., & van Ryzin, G. (1994b). Optimal dynamic pricing of inventories with
stochastic demand over finite horizons. Management science, 40 , 999–1020.
Gosavi, A., Bandla, N., & Das, K. T. (2002). A reinforcement learning approach to
a single leg airline revenue management problem with multiple fare classes and
overbooking. IIE Transactions, 34 , 729–742. doi:10.1023/A:1015583703449.
Gupta, M., Ravikumar, K., & Kumar, M. (2002). Adaptive strategies for price
markdown in a multi-unit descending price auction: a comparative study. In
IEEE International Conference on Systems, Man and Cybernetics (pp. 373–378).
volume 1. doi:10.1109/ICSMC.2002.1168003.
Haberman, S., & Renshaw, A. E. (1996). Generalized linear models and actuarial science. Journal of the Royal Statistical Society. Series D (The Statistician), 45, 407–436. doi:10.2307/2988543.
Han, W., Liu, L., & Zheng, H. (2008). Dynamic pricing by multiagent reinforce-
ment learning. In Proceedings of the 2008 International Symposium on Electronic
Commerce and Security ISECS ’08 (pp. 226–229). Washington, DC, USA: IEEE
Computer Society. doi:10.1109/ISECS.2008.179.
Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.
Javanmard, A., & Nazerzadeh, H. (2019). Dynamic pricing in high-dimensions. Jour-
nal of Machine Learning Research, 20 , 1–49. URL: http://jmlr.org/papers/
v20/17-357.html.
de Jong, P., & Heller, G. Z. (2008). Generalized Linear Models for Insurance Data.
Cambridge: Cambridge University Press.
Joulani, P., György, A., & Szepesvári, C. (2013). Online learning under de-
layed feedback. In Proceedings of the 30th International Conference on Ma-
chine Learning (ICML) (pp. 1453–1461). Atlanta, Georgia, USA. URL: http:
//proceedings.mlr.press/v28/joulani13.pdf.
Joulani, P., György, A., & Szepesvári, C. (2016). Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) (pp. 1744–1750). Phoenix, Arizona, USA. URL: https://sites.ualberta.ca/~pooria/publications/AAAI16-Extended.pdf.
Keskin, N. B., & Zeevi, A. (2014). Dynamic pricing with an unknown demand
model: Asymptotically optimal semi-myopic policies. Operations Research, 62 ,
1142–1167. doi:10.1287/opre.2014.1294.
Kleinberg, R., & Leighton, T. (2003). The value of knowing a demand curve: Bounds
on regret for online posted-price auctions. In 44th Annual IEEE Symposium on
Foundations of Computer Science, 2003. Proceedings. (pp. 594–605). IEEE.
Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A
survey. International Journal of Robotics Research, 32 , 1238–1274. doi:10.1177/
0278364913495721.
Koller, M. (2012). Stochastic Models in Life Insurance. EAA Series. Berlin, Heidel-
berg: Springer Berlin Heidelberg.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc.
Kutschinski, E., Uthmann, T., & Polani, D. (2013). Learning competitive pricing
strategies by multi-agent reinforcement learning. Journal of Economic Dynamics
and Control , 27 , 2207–2218. doi:10.1016/S0165-1889(02)00122-7.
Lai, T., & Robbins, H. (1982). Iterated least squares in multiperiod control. Ad-
vances in Applied Mathematics, 3 , 50–73. doi:10.1016/S0196-8858(82)80005-5.
Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6 , 4–22. URL: http://dx.doi.org/10.1016/
0196-8858(85)90002-8.
Lai, T. L., Robbins, H., & Wei, C. Z. (1979). Strong consistency of least squares
estimates in multiple regression. Journal of Multivariate Analysis, 9 , 343–361.
doi:10.1016/0047-259X(79)90093-9.
Lai, T. L., & Wei, C. Z. (1982). Least squares estimates in stochastic regression
models with applications to identification and control of dynamic systems. The
Annals of Statistics, 10 , 154–166. doi:10.1214/aos/1176345697.
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks, 8, 98–113. doi:10.1109/72.554195.
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit ap-
proach to personalized news article recommendation. In Proceedings of the 19th
International Conference on World Wide Web WWW’10 (p. 661–670). Associa-
tion for Computing Machinery. doi:10.1145/1772690.1772758.
Li, L., Lu, Y., & Zhou, D. (2017). Provably optimal algorithms for generalized
linear contextual bandits. In Proceedings of the 34th International Conference
on Machine Learning (pp. 2071–2080). volume 70. URL: http://proceedings.
mlr.press/v70/li17c.html.
Lobo, M. S., & Boyd, S. (2003). Pricing and learning with uncertain demand. In
INFORMS Revenue Management Conference.
Mannor, S., Rubinstein, R. Y., & Gat, Y. (2003). The cross entropy method for
fast policy search. In Proceedings of the Twentieth International Conference on
International Conference on Machine Learning ICML’03 (pp. 512–519). AAAI
Press. URL: https://www.aaai.org/Papers/ICML/2003/ICML03-068.pdf.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. (2nd ed.).
London: Chapman and Hall.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. doi:10.1007/BF02478259.
McKinsey & Company (2018). Digital insurance in 2018: Driv-
ing real impact with digital and analytics. https://www.
mckinsey.com/industries/financial-services/our-insights/
digital-insurance-in-2018-driving-real-impact-with-digital-and-analytics.
Homem-de Mello, T., & Rubinstein, R. Y. (2002). Rare event estimation for static
models via cross-entropy and importance sampling. In Winter Simulation Con-
ference (pp. 1–35). San Diego, CA, USA: IEEE. URL: http://citeseerx.ist.
psu.edu/viewdoc/download?doi=10.1.1.15.6923&rep=rep1&type=pdf.
Miller, B. L. (1968). Finite state continuous time Markov decision processes with an infinite planning horizon. Journal of Mathematical Analysis and Applications, 22, 552–569. doi:10.1016/0022-247X(68)90194-7.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. doi:10.1038/nature14236.
Ohlsson, E., & Johansson, B. (2010). Non-Life Insurance Pricing with Generalized
Linear Models. Berlin, Heidelberg: Springer.
Phillips, R. L. (2005). Pricing and revenue optimization. Stanford, Calif.: Stanford
Business Books.
Pike-Burke, C., Agrawal, S., Szepesvári, C., & Grunewalder, S. (2018). Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning (ICML) (pp. 4105–4113), volume 80. URL: http://proceedings.mlr.press/v80/pike-burke18a/pike-burke18a.pdf.
Qiang, S., & Bayati, M. (2016). Dynamic Pricing with Demand Covariates. Working
paper Stanford University Graduate School of Business Stanford, CA. URL:
http://web.stanford.edu/~bayati/papers/dpdc.pdf.
Qiu, Q., & Pedram, M. (1999). Dynamic power management based on continuous-time Markov decision processes. In Proceedings 1999 Design Automation Conference (Cat. No. 99CH36361) (pp. 555–561). doi:10.1109/DAC.1999.781377.
Raju, C., Narahari, Y., & Ravikumar, K. (2006). Learning dynamic prices in elec-
tronic retail markets with customer segmentation. Annals of Operations Research,
143 , 59–75. doi:10.1007/s10479-006-7372-3.
Ramsay, C. M. (2003). A solution to the ruin problem for Pareto distributions. Insurance: Mathematics and Economics, 33, 109–116. doi:10.1016/S0167-6687(03)00147-1.
Rothstein, M. (1974). Hotel overbooking as a Markovian sequential decision process. Decision Sciences, 5, 389–404. doi:10.1111/j.1540-5915.1974.tb00624.x.
Rubinstein, R. Y., & Kroese, D. P. (2004). The Cross-Entropy Method. Information Science and Statistics. New York, NY: Springer-Verlag New York. doi:10.1007/978-1-4757-4321-0.
van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 177–184). doi:10.1109/ADPRL.2009.4927542.
Silver, D., Huang, A., Maddison, C., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. doi:10.1038/nature16961.
Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287–308. doi:10.1023/A:1007678930559.
Singh, S., Litman, D., Kearns, M., & Walker, M. (2002). Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16, 105–133. doi:10.1613/jair.859.
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58, 389–434. doi:10.1109/TIT.2011.2182033.
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) (pp. 1057–1063). URL: https://homes.cs.washington.edu/~todorov/courses/amath579/reading/PolicyGradient.pdf.
Talluri, K., & van Ryzin, G. (2005). The theory and practice of revenue management.
Boston, MA: Springer. doi:10.1007/b139000.
The Wall Street Journal (2015). Now prices can change
from minute to minute. https://www.wsj.com/articles/
now-prices-can-change-from-minute-to-minute-1450057990.
Vershynin, R. (2012). How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25, 655–686.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 IRE WESCON Convention Record, (pp. 96–104). doi:10.5555/65669.104390.
Wüthrich, M. V., & Buser, C. (2017). Data analytics for non-life insurance pricing. Swiss Finance Institute Research Paper No. 16-68. URL: https://ssrn.com/abstract=2870308.
Zhang, S. G., & Liao, Y. (1999). On some problems of weak consistency of quasi-
maximum likelihood estimates in generalized linear models. Science in China
Series A: Mathematics, 51 , 1287–1296. doi:10.1007/s11425-007-0172-7.