Bayesian Q Learning Algo PDF
II. THE PRICING MODEL

Electronic marketplaces are essentially multiagent systems: the self-interested seller agents interact with each other through the market [6],[7] environments, which vary according to supply and demand. The decision factors in a seller's pricing include cost, capacity, the market demand function, and the pricing policies of the other sellers.

Definition 1. The market demand function is defined as a linear function of the market average price level, that is, D(p) = max{0, (q - hp)}, with q, h > 0.

Definition 2. A seller agent is a 3-tuple Seller_i = (p_i, c_i, k_i), where p_i is the price of a certain product, c_i is the cost, and k_i is the production capacity.

Definition 3. An electronic marketplace is described as (n, s, A_{1...n}, T, U_{1...n}), where n stands for the number of sellers who sell a certain kind of homogeneous product; s = (p, c, k, q, h), with p = (p_1, ..., p_n), c = (c_1, ..., c_n), k = (k_1, ..., k_n), and q, h the parameters of the market demand function; A_i = {a_1, ..., a_m} is the set of all possible prices of agent i; A_{-i} = A_1 × A_2 × ... × A_{i-1} × A_{i+1} × ... × A_n is the set of joint actions of the other sellers; and T : S × A × S → [0,1] is the transfer function of the environment. Each seller can thus be treated as running a single learning algorithm: in fact, the influence of the other agents can be reflected by the quantity of products sold by agent i, and by this single-agent view we eliminate the complexity introduced by stochastic games.

III. MODEL-FREE Q-LEARNING

Q-learning is derived from MDP theory. An MDP is a 4-tuple <S, A, T, R>, where S is a set of states, A is a set of actions, T(s, a, s') = Pr(s' | s, a) is a transition model that captures the probability of reaching state s' after we execute action a at state s, and R(s, a, s') is a reward model that captures the probability of getting a reward when executing action a at state s. A policy π : S → A is a mapping from states to actions.

The problem of RL consists of finding an optimal policy for an MDP with a partially or completely unknown transition function. In this paper, we analytically derive a simple parameterization of the optimal value function for the Bayesian model-based approach.
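As a concrete illustration of the model-free setting described above, a minimal tabular Q-learning update can be sketched as follows. This is only a sketch, not the paper's implementation: the two demand states, the candidate prices, the unit cost of 7, and the toy reward signal are all illustrative assumptions.

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Standard tabular update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy setting: one seller choosing among a few candidate prices.
actions = [8.0, 9.0, 10.0]             # candidate prices (illustrative)
states = ["low_demand", "high_demand"]
Q = {(s, a): 0.0 for s in states for a in actions}

random.seed(0)
for _ in range(1000):
    s = random.choice(states)
    a = random.choice(actions)         # purely exploratory action choice
    # Illustrative profit-like reward: margin times quantity sold,
    # assuming a unit cost of 7 and a linear demand response.
    demand = 5.0 if s == "high_demand" else 2.0
    r = (a - 7.0) * max(0.0, demand - 0.3 * a)
    s_next = random.choice(states)     # toy transition model
    q_learning_update(Q, s, a, r, s_next, actions)

print(max(actions, key=lambda a: Q[("high_demand", a)]))  # -> 10.0
```

Because the toy reward strictly favors the highest margin when demand is high, the learned Q-table ranks the price 10.0 best in the high-demand state.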
Bayesian learning proceeds as follows. Pick a prior distribution encoding the learner's initial belief about the possible values of each unknown parameter. Then, whenever a sampled realization of an unknown parameter is observed, update the belief to reflect the observed data. In the context of reinforcement learning, each unknown transition probability T(s, a, s') is an unknown parameter θ^a_{s,s'}. Q*(s, a) denotes the value achievable from state s by executing action a at s. Thus, we can learn the transition model by belief updating: at each time t, the conditional distribution over all states and actions is updated based on the observed transition D_t = (s, a, s').

The hyper-parameter n^a_{s,s'} counts how often the transition (s, a, s') has been observed; such counts are also called virtual samples. Let n be the vector of hyper-parameters n^a_{s,s'}; then the prior can be presented as

p(θ) = ∏_{s,a} Dir(θ^a_s ; n^a_s)    (4)

and the belief update increments the count of the observed transition, via an indicator that returns 1 for the observed triple (s, a, s') and 0 otherwise.

To model our uncertainty about the distribution of R_{s,a}, it suffices to model a distribution over the mean μ_{s,a} and the variance σ²_{s,a}; Dearden [6] uses a conjugate (normal-gamma) prior for this purpose.
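The Dirichlet belief update behind (4) amounts to keeping one count per (s, a, s') triple and incrementing the count of whichever transition is observed. A minimal sketch, with illustrative state and action names:

```python
from collections import defaultdict

class DirichletTransitionBelief:
    """Belief over unknown transition probabilities theta^a_{s,s'},
    represented by Dirichlet hyper-parameters (virtual sample counts)."""

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        # n[(s, a)][s'] : hyper-parameter for triple (s, a, s').
        self.n = defaultdict(lambda: {s2: prior_count for s2 in self.states})

    def update(self, s, a, s_next):
        # Observing D_t = (s, a, s') increments exactly one count:
        # the indicator is 1 for the observed triple, 0 otherwise.
        self.n[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean of theta: normalized counts.
        counts = self.n[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()}

belief = DirichletTransitionBelief(["low", "high"])
for _ in range(8):
    belief.update("low", 9.0, "high")   # repeatedly observe low -> high
belief.update("low", 9.0, "low")
print(belief.mean("low", 9.0))
```

With a uniform prior count of 1 and nine observations, the posterior mean assigns 9/11 to the frequently observed successor, illustrating how virtual samples and real observations combine.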
Let r_1, ..., r_n be n independent samples of R. In model-based RL, a policy π is a mapping from belief states b to actions. The value V of a given policy is measured by the expected discounted cumulative reward obtained when executing it:

V(b) = Σ_{t=0}^{∞} γ^t R(b_t, π(b_t), b_{t+1}).    (7)

According to Bellman's equation, in our Bayesian RL model the optimal pricing policy has the highest value over all belief states.

The information available in the present belief (e.g., the distributions of the transition function and the reward function of the MDP) lets us simulate a series of samples from those two distributions. At each state s, the sampling algorithm randomly samples n values of Q(s, a) for each action, each value computed by looking forward t steps. Though sampling approaches tend to be more complicated and computationally intensive, performing actions is more expensive than computation time when addressing pricing situations.
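The sampling procedure described above, drawing a model from the current belief and scoring Q(s, a) by a t-step lookahead, can be sketched roughly as follows. The tiny two-state MDP, the prior counts, and the (mean, variance) reward summaries are illustrative assumptions, not the paper's experimental setup:

```python
import random

# Illustrative belief: Dirichlet counts for transitions and a
# (mean, variance) summary for each reward distribution.
states = [0, 1]
actions = [0, 1]
trans_counts = {(s, a): [1.0 + 2.0 * (s == a), 1.0] for s in states for a in actions}
reward_belief = {(s, a): (float(s + a), 1.0) for s in states for a in actions}

def sample_dirichlet(counts, rng):
    # Draw a categorical distribution from a Dirichlet via Gamma draws.
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_q(s, a, t, rng, gamma=0.9):
    """One sampled value of Q(s, a): draw an MDP from the belief,
    then compute a t-step lookahead value on that sampled model."""
    T = {(s2, a2): sample_dirichlet(trans_counts[(s2, a2)], rng)
         for s2 in states for a2 in actions}
    R = {(s2, a2): rng.gauss(mu, sigma2 ** 0.5)
         for (s2, a2), (mu, sigma2) in reward_belief.items()}
    V = {s2: 0.0 for s2 in states}
    for _ in range(t):  # t sweeps of value iteration = t-step lookahead
        V = {s2: max(R[(s2, a2)] + gamma * sum(p * V[s3] for p, s3 in zip(T[(s2, a2)], states))
                     for a2 in actions)
             for s2 in states}
    return R[(s, a)] + gamma * sum(p * V[s3] for p, s3 in zip(T[(s, a)], states))

rng = random.Random(1)
samples = [sample_q(0, 1, t=5, rng=rng) for _ in range(50)]
print(sum(samples) / len(samples))
```

Averaging many such sampled Q values approximates the Bayesian value estimate; the spread of the samples reflects the remaining uncertainty in the belief, which is what makes the approach useful for pricing without costly real-world exploration.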
V. SIMULATIONS