
2010 Second International Conference on Computer Modeling and Simulation

A Dynamic Pricing Algorithm by Bayesian Q-Learning
Wei Han
Information Engineering College
Nanjing University of Finance and Economics, 210046
Nanjing, China
dallashw@163.com
Abstract: In electronic marketplaces, automated and dynamic pricing is becoming increasingly popular. Agents that perform this task can improve themselves by learning from past observations, possibly using reinforcement learning techniques. Several papers have studied the use of Q-learning for modeling the problem of dynamic pricing in electronic marketplaces. However, the extension of reinforcement learning (RL) to large state spaces inevitably runs into the curse of dimensionality, and improving the learning efficiency of the agent is critical to the practical application of RL. To address the problem of dynamic pricing, we take a Bayesian model-based approach, framing the transition function and reward function of the MDP as distributions, and use a sampling technique for action selection. The Bayesian approach accounts for the general problem of the exploration vs. exploitation tradeoff. Simulations show that our dynamic pricing algorithm improves profits compared with other pricing strategies based on the same pricing model.

Keywords: Q-learning; Electronic Marketplaces; Dynamic Pricing

Sponsored by the National Science Fund, No. 70802025.
I. INTRODUCTION

Electronic marketplaces provide the possibility of exploring new and more complex pricing schemes, due to the accessibility of more information about sellers and buyers. The dynamic nature of electronic marketplaces requires dynamic pricing strategies and tools. Several papers have proposed algorithms using Reinforcement Learning [1],[2],[3], which was originally proposed as a framework that allows agents to learn in an online fashion as they interact with their environment. In principle, RL techniques allow an agent to become competent simply by exploring its environment and observing the resulting percepts and rewards, gradually converging on estimates of the value of actions or states that allow it to behave optimally. Because of its accessibility, RL is widely used in many areas. Tesauro and Waltman respectively report the convergence of the Q-learning algorithm when applied to the problem of pricing [4],[5].

In addition to ensuring more robust behavior across the state space, exploration is crucial in allowing the agent to discover the reward structure of the environment and to determine the optimal policy. Without sufficient incentive to explore, the agent may quickly settle on a policy of low utility simply because it looks better than leaping into the unknown. On the other hand, the agent should not keep exploring options that it already has good reason to believe are suboptimal. Thus, a good exploration method should balance the expected gains from exploration against the cost of trying possibly suboptimal actions when better ones are available to be exploited.

Model-free RL algorithms, which directly learn an optimal policy, tend to have slow convergence and require too many trials to be used for online learning. Because of the amount of costly exploration they need, they are mostly used offline, in simulation. This shortcoming prohibits their effective use in dynamic pricing applications.

In contrast, model-based approaches, especially Bayesian ones, can optimally trade off exploration and exploitation. Bayesian model-based RL treats Q values as distributions over every state and every action, and uses Bayesian methods for representing, updating, and propagating probability distributions. This approach effectively resolves the problem of exploration vs. exploitation, that is: exploiting what one knows about the effects of actions and their rewards by executing the action that, given current knowledge, appears best; and exploring to gain further information about actions and rewards that has the potential to change the action that appears best.

Section II describes the pricing model in electronic marketplaces. Section III introduces the classical Q-learning method. Section IV presents our Bayesian Q-learning pricing algorithm.

II. THE PRICING MODEL

Electronic marketplaces are essentially multiagent systems: self-interested seller agents interact with each other through the market environment, which varies according to supply and demand. The decision factors in a seller's pricing include cost, capacity, the market demand function, and the pricing policies of the other sellers.

Definition 1. The market demand function is defined as a linear function of the market average price level $p$, that is, $D(p) = \max\{0, q - hp\}$ with $q, h > 0$.

Definition 2. A seller agent is a 3-tuple $Seller_i = (p_i, c_i, k_i)$, where $p_i$ is the price of a certain product, $c_i$ is the cost, and $k_i$ is the production capacity.

Definition 3. An electronic marketplace is described as $(n, s, A_{1 \ldots n}, T, U_{1 \ldots n})$, where $n$ stands for the number of sellers who sell a certain kind of homogeneous product; $s = (p, c, k, q, h)$, with $p = (p_1, \ldots, p_n)$, $c = (c_1, \ldots, c_n)$, $k = (k_1, \ldots, k_n)$, and $q, h$ the parameters of the market demand function; $A_i = \{a_1, \ldots, a_m\}$ is the set of all possible prices of agent $i$; $A_{-i} = A_1 \times \ldots \times A_{i-1} \times A_{i+1} \times \ldots \times A_n$ is the set of joint actions of the other sellers; $T: S \times A \times S \to [0,1]$ is the transition function of the market; and $U_i: S \times A_i \to [0, +\infty)$ is the utility of agent $i$ at the present price.

Definition 4. The pricing policy of seller agent $i$ is a function $\pi_i: S \times A_i \to [0,1]$, which lets the agent choose a price stochastically according to the market and its opponents.

Definition 5. The utility function of seller $i$ is defined as $U_i(d, p_i, c_i, k_i) = (p_i - c_i) Y_i^{p_i}(d)$, where $d$ is the demand of the present market and $Y_i^{p_i}(d)$ is the quantity of products that agent $i$ sells at price $p_i$. We assume here that the products with lower prices are sold first, that is, buyers prefer cheaper products. So $Y_i^{p_i}(d)$ is influenced by the prices of the other agents.

Though we model electronic marketplaces as multiagent systems, we treat the pricing strategy as a single-agent learning algorithm. In fact, the influence of the other agents is reflected in the quantity of products sold by agent $i$. By this single-agent view, we eliminate the complexity introduced by stochastic games [6],[7].
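As an illustration of Definitions 1 and 5, the following Python sketch simulates one pricing period under the stated assumption that cheaper products sell first. The function and variable names, and the fact that equal prices are left to the sort order, are our own illustrative choices, not part of the model.

def market_round(prices, costs, capacities, q, h):
    """One pricing period: returns each seller's utility (p_i - c_i) * Y_i^{p_i}(d)."""
    avg_price = sum(prices) / len(prices)
    demand = max(0.0, q - h * avg_price)              # Definition 1: D(p) = max{0, q - hp}
    sold = [0.0] * len(prices)
    # Buyers prefer cheaper products: fill capacity in ascending price order.
    for i in sorted(range(len(prices)), key=lambda j: prices[j]):
        sold[i] = min(capacities[i], demand)          # Y_i^{p_i}(d)
        demand -= sold[i]
    return [(prices[i] - costs[i]) * sold[i] for i in range(len(prices))]   # Definition 5

# Example: three sellers, demand parameters q = 100, h = 2
print(market_round([10.0, 12.0, 11.0], [6.0, 5.0, 7.0], [30.0, 30.0, 30.0], 100.0, 2.0))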
III. MODEL-FREE Q-LEARNING

Q-learning is derived from MDP theory. An MDP is a 4-tuple $\langle S, A, T, R \rangle$, where $S$ is a set of states, $A$ is a set of actions, $T(s, a, s') = pr(s' \mid s, a)$ is a transition model that captures the probability of reaching state $s'$ after we execute action $a$ at state $s$, and $R(s, a, s')$ is a reward model that captures the reward obtained when executing action $a$ at state $s$ and reaching $s'$. A policy $\pi: S \to A$ is a mapping from states to actions.

The problem of RL consists of finding an optimal policy for an MDP with a partially or completely unknown transition function. In this paper, we analytically derive a simple parameterization of the optimal value function for the Bayesian model-based approach. Bayesian learning proceeds as follows. Pick a prior distribution encoding the learner's initial belief about the possible values of each unknown parameter. Then, whenever a sampled realization of the unknown parameter is observed, update the belief to reflect the observed data. In the context of reinforcement learning, each unknown transition probability $T(s, a, s')$ is an unknown parameter $\theta_a^{s,s'}$. Since these are probabilities, the parameters $\theta_a^{s,s'}$ take values in the interval $[0, 1]$.

The agent's aim is to maximize the expected discounted cumulative reward. Letting $V^*(s)$ denote the optimal expected discounted reward achievable from state $s$ and $Q^*(s, a)$ denote the value of executing $a$ at $s$, we have the standard Bellman equations

$V^*(s) = \max_a Q^*(s, a)$    (1)

$Q^*(s, a) = \sum_{s'} p(s' \mid s, a) [R(s, a, s') + \gamma V^*(s')]$    (2)
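As a concrete illustration of the backups in Eqs. (1) and (2), the following sketch performs standard value iteration for a fully known MDP; this is classical background material rather than the Bayesian method of Section IV, and the array layout, discount factor, and stopping tolerance are illustrative choices.

import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """T[s, a, s'] = pr(s'|s, a); R[s, a, s'] = reward. Returns V*(s) and Q*(s, a)."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)   # Eq. (2): expected backup over s'
        V_new = Q.max(axis=1)                            # Eq. (1): greedy value
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q
        V = V_new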
IV. BAYESIAN REINFORCEMENT LEARNING

A. Distribution over transitions

In our Bayesian RL model, we view the function $T(s, a, s')$ as a conditional distribution of a Bayesian network. Thus, we can learn the transition model by belief updating. At each time $t$, the conditional distribution over all states and actions is updated based on the observed transition $D_t = (s, a, s')$ using Bayes' theorem:

$p(\theta \mid D_t) \propto p(\theta) \, pr(s' \mid \theta, s, a)$    (3)

In practice, this updating is easily performed when the prior and posterior distributions belong to the same family. The Dirichlet distribution has this property and is usually adopted. A Dirichlet distribution $Dir(p; n) = k \prod_i p_i^{n_i - 1}$ is parameterized by positive numbers $n_i$, such that $n_i - 1$ can be viewed as the number of times that the $p_i$-probability event has been observed. The $n_i$ are also called hyper-parameters or virtual samples. Letting $n_a^s$ be a vector of hyper-parameters $n_a^{s,s'}$, the prior can be written as

$p(\theta) = \prod_{s,a} Dir(\theta_a^s; n_a^s)$    (4)

and the posterior after transition $D_t = (\hat{s}, \hat{a}, \hat{s}')$ is

$p(\theta \mid D_t) = k \, \theta_{\hat{a}}^{\hat{s},\hat{s}'} \prod_{s,a} Dir(\theta_a^s; n_a^s) = \prod_{s,a} Dir(\theta_a^s; n_a^s + \delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s'))$    (5)

where $\delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s')$ is a Kronecker delta that returns 1 when $s = \hat{s}$, $a = \hat{a}$, $s' = \hat{s}'$, and 0 otherwise.
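As an illustration of Eqs. (4) and (5), the sketch below stores the Dirichlet hyper-parameters as counts and simply increments the count of each observed transition; the class name, the symmetric prior count, and the posterior-mean accessor are our own illustrative choices.

from collections import defaultdict

class DirichletTransitionModel:
    def __init__(self, prior_count=1.0):
        # n[s][a][s'] holds the hyper-parameters n_a^{s,s'} (symmetric prior assumed).
        self.n = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: prior_count)))

    def update(self, s, a, s_next):
        # Eq. (5): the posterior adds the Kronecker delta of the observed transition.
        self.n[s][a][s_next] += 1.0

    def expected_prob(self, s, a, s_next, states):
        # Posterior mean of theta_a^{s,s'} given the current virtual-sample counts.
        total = sum(self.n[s][a][sp] for sp in states)
        return self.n[s][a][s_next] / total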
B. Distribution over rewards

In the Bayesian framework, we need to consider prior distributions over Q-values, and then update these priors based on the agent's experiences. Formally, let $R_{s,a}$ be a random variable that denotes the discounted cumulative reward received when action $a$ is executed in state $s$ and an optimal policy is followed thereafter. Appealing to the central limit theorem, we assume $R_{s,a}$ has a normal distribution; to model our uncertainty about the distribution of $R_{s,a}$, it suffices to model a distribution over the mean $\mu_{s,a}$ and the variance $\sigma_{s,a}^2$. Dearden [6] uses a normal-gamma distribution $p(\mu, \tau) \sim NG(\mu_0, \lambda, \alpha, \beta)$, where $\tau = 1/\sigma_{s,a}^2$ and $\mu_0, \lambda, \alpha, \beta$ are hyper-parameters:

$p(\mu, \tau) \propto \tau^{1/2} e^{-\lambda \tau (\mu - \mu_0)^2 / 2} \, \tau^{\alpha - 1} e^{-\beta \tau}$    (6)

Theorem 1 [6]: Let $p(\mu, \tau) \sim NG(\mu_0, \lambda, \alpha, \beta)$ be a prior distribution over the unknown parameters of a normally distributed variable $R$, and let $r_1, \ldots, r_n$ be $n$ independent samples of $R$ with $M_1 = \frac{1}{n} \sum_i r_i$ and $M_2 = \frac{1}{n} \sum_i r_i^2$. Then $p(\mu, \tau \mid r_1, \ldots, r_n) \sim NG(\mu_0', \lambda', \alpha', \beta')$, where

$\mu_0' = \frac{\lambda \mu_0 + n M_1}{\lambda + n}$, $\lambda' = \lambda + n$, $\alpha' = \alpha + \frac{n}{2}$,

$\beta' = \beta + \frac{1}{2} n (M_2 - M_1^2) + \frac{\lambda n (M_1 - \mu_0)^2}{2(\lambda + n)}$.

That is, given a single normal-gamma prior, the posterior after any sequence of independent observations is also a normal-gamma distribution.
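Theorem 1 translates directly into an update routine. The sketch below assumes the hyper-parameter ordering $NG(\mu_0, \lambda, \alpha, \beta)$ used above and is meant only as an illustration.

def normal_gamma_update(mu0, lam, alpha, beta, rewards):
    """Return (mu0', lam', alpha', beta') after n i.i.d. samples of R (Theorem 1)."""
    n = len(rewards)
    if n == 0:
        return mu0, lam, alpha, beta
    m1 = sum(rewards) / n                    # M1: sample mean
    m2 = sum(r * r for r in rewards) / n     # M2: sample second moment
    mu0_new = (lam * mu0 + n * m1) / (lam + n)
    lam_new = lam + n
    alpha_new = alpha + n / 2.0
    beta_new = beta + 0.5 * n * (m2 - m1 ** 2) + lam * n * (m1 - mu0) ** 2 / (2.0 * (lam + n))
    return mu0_new, lam_new, alpha_new, beta_new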
C. Tradeoff between exploration and exploitation

In model-based RL, a policy $\pi$ maps belief states $b$ to actions. The value $V^\pi$ of a given policy $\pi$ is measured by the expected discounted cumulative reward obtained when executing it:

$V^\pi(b) = \sum_{t=0}^{\infty} \gamma^t R(b_t, \pi(b_t), b_{t+1})$    (7)

According to Bellman's equation, in our Bayesian RL model the optimal pricing policy has the highest value in all belief states $\theta$:

$V_s^*(\theta) = \max_a \sum_{s'} p(s' \mid \theta, s, a) [R(s, a, s') + \gamma V^*(\theta')]$    (8)

$Q^*(\theta, a) = \sum_{s'} p(s' \mid \theta, s, a) [R(s, a, s') + \gamma V^*(\theta')]$

Here, $\theta'$ is the revised conditional distribution of the Bayesian network, updated according to Eq. (5).

Comparing Eq. (8) with Eq. (2), we can see why model-based RL eliminates the dilemma of exploration and exploitation. Eq. (8) achieves this because all possible updated belief states $\theta_a^{s,s'}$ are considered, with probabilities corresponding to the likelihood of reaching $s'$. Hence, it optimizes the sum of the reward that can be derived from the information available in the present belief (i.e., exploitation) as well as the information gained in the future by observing the outcomes of the selected actions (i.e., exploration).
D. Action selection by sampling

Q-value sampling was first described by Wyatt [7] for exploration in multi-armed bandit problems. The idea is to select actions stochastically, based on our current subjective belief that they are optimal. That is, action $a$ is performed with probability $pr(a = \arg\max_{a'} Q(s, a'))$.

From Sections IV.A and IV.B we have obtained the distributions of the transition function and the reward function of the MDP, so we can use this knowledge to simulate a series of samples from those two distributions. At each state $s$, the sampling algorithm randomly samples $n$ values of $Q(s, a)$ for each action, each value computed by looking forward $t$ steps. Although sampling approaches tend to be more complicated and computationally demanding, in pricing situations performing actions is more expensive than computation time.
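The following sketch illustrates Q-value sampling in this setting: it draws one transition model from the Dirichlet posteriors and one mean reward per state-action pair from the normal-gamma posteriors, evaluates each action by a depth-limited lookahead in the sampled model, and acts greedily on the result, which selects action $a$ with exactly the probability that it appears optimal under the current belief. The data-structure layout, the lookahead depth, and the treatment of the sampled mean as a one-step reward are our own assumptions about how Steps 4-5 might be realized.

import numpy as np

def sample_q_values(dirichlet_counts, ng_params, s, actions, states, gamma=0.95, depth=3):
    """dirichlet_counts[(s, a)]: count array over `states`; ng_params[(s, a)] = (mu0, lam, alpha, beta)."""
    # Draw one transition model and one mean reward for every (s, a) from the posteriors.
    theta = {sa: np.random.dirichlet(c) for sa, c in dirichlet_counts.items()}
    mu = {}
    for sa, (mu0, lam, alpha, beta) in ng_params.items():
        tau = np.random.gamma(alpha, 1.0 / beta)               # precision ~ Gamma(alpha, beta)
        mu[sa] = np.random.normal(mu0, 1.0 / np.sqrt(lam * tau))

    def q(state, action, d):
        # Depth-limited lookahead in the sampled model.
        if d == 0:
            return mu[(state, action)]
        future = sum(theta[(state, action)][j] * max(q(sp, ap, d - 1) for ap in actions)
                     for j, sp in enumerate(states))
        return mu[(state, action)] + gamma * future

    sampled = {a: q(s, a, depth) for a in actions}
    return max(sampled, key=sampled.get)    # greedy in the sampled model: pr(a = argmax_a' Q(s, a'))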
E. The pricing algorithm

Step 1. Initialize the hyper-parameters of the normal-gamma distribution and the Dirichlet distribution according to the domain knowledge of the pricing problem.

Step 2. Observe a market (demand) transition (s, a, s') and reward R(s, a, s').

Step 3. Update the normal-gamma distribution according to Theorem 1 and the Dirichlet distribution according to Eq. (5).

Step 4. Sample n Q-values for each action at the given state s, each value obtained by looking t steps ahead.

Step 5. Select an action a stochastically, based on the sampled values Q(s, a).

Step 6. Go to Step 2.
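Putting the pieces together, the skeleton below follows Steps 1-6, reusing the normal_gamma_update and sample_q_values sketches given above (with the Dirichlet counts kept as a flat dictionary of arrays, equivalent to the counts in the class sketch). The observe_transition(s, a) hook that stands in for Step 2, i.e., observing one market (demand) transition and its reward, is a hypothetical simulator interface, and the prior hyper-parameter values are placeholders to be set from domain knowledge.

import numpy as np

def bayesian_pricing_loop(states, actions, observe_transition, s0, periods=1000):
    # Step 1: initialize Dirichlet and normal-gamma hyper-parameters (placeholder priors).
    dirichlet_counts = {(st, ac): np.ones(len(states)) for st in states for ac in actions}
    prior = (0.0, 1.0, 2.0, 1.0)                                  # (mu0, lam, alpha, beta)
    ng_params = {(st, ac): prior for st in states for ac in actions}
    rewards_seen = {(st, ac): [] for st in states for ac in actions}
    s = s0
    for _ in range(periods):
        # Steps 4-5: sample Q values by lookahead and pick the action believed optimal.
        a = sample_q_values(dirichlet_counts, ng_params, s, actions, states)
        # Step 2: observe a market (demand) transition and its reward.
        s_next, r = observe_transition(s, a)
        # Step 3: update the Dirichlet (Eq. 5) and normal-gamma (Theorem 1) posteriors.
        dirichlet_counts[(s, a)][states.index(s_next)] += 1.0
        rewards_seen[(s, a)].append(r)
        ng_params[(s, a)] = normal_gamma_update(*prior, rewards_seen[(s, a)])
        s = s_next                                                # Step 6: repeat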
V. SIMULATIONS

To test our algorithm, we developed a simulation platform using SQL Server and C++ Builder. Because, in the pricing model described in Section II, the utility of an agent in each pricing period depends on the parameters of the buyer agents, such as purchase quantities and buyers' reserve prices, the simulation platform randomly generates sample buyer agents according to a multivariate normal distribution. We tested our algorithm and other pricing strategies in certain kinds of situations.

Han [8] proposed a pricing algorithm under the same model as described in Section II, which is based on Multiagent Reinforcement Learning (MARL). We compared our Bayesian RL (BRL) algorithm with MARL on the simulation platform, and the results indicate that BRL achieves better pricing profits than MARL. This is mainly due to the prior domain knowledge as well as the better use of present beliefs.
REFERENCES
[1] V. Kononen. Dynamic pricing based on asymmetric multiagent reinforcement learning. International Journal of Intelligent Systems, 2006, 21(1):73-98.
[2] Erich K., Thomas U., Daniel P. Learning competitive pricing strategies by multi-agent reinforcement learning. Journal of Economic Dynamics & Control, 2003, 27:2207-2218.
[3] G. Tesauro, J. O. Kephart. Pricing in Agent Economies Using Multi-Agent Q-Learning. Autonomous Agents and Multi-Agent Systems, 2002, 5(3):289-304.
[4] G. Tesauro. Pricing in agent economies using neural networks and multi-agent Q-learning.
[5] L. Waltman, U. Kaymak. Q-learning agents in a Cournot oligopoly model.
[6] M. H. DeGroot. Probability and Statistics. Addison-Wesley, 1986.
[7] J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.
[8] Wei Han, Lingbo Liu, Huaili Zheng. Dynamic Pricing by Multiagent Reinforcement Learning. Proceedings of the International Symposium on Electronic Commerce and Security (ISECS), 2008, Guangzhou, China.
