Reinforcement Learning

Unit - 5

1. N-step returns
2. TD(λ) algorithm
3. Need for generalization in practice
4. Linear function approximation and geometric view
5. Linear TD(λ)
6. Tile coding
7. Control with function approximation
8. Policy search
9. Policy gradient methods
10. Experience replay
11. Fitted Q iteration
12. Case studies

N-step returns

● In standard TD(0) learning, only one time step is considered for each update, i.e., the update is based on the difference between the current state's value estimate and the one-step target (the immediate reward plus the discounted value of the next state).
● N-step returns extend this idea by considering a sequence of N time
steps. The N-step return is the sum of rewards received over N time
steps, plus an estimate of the value of the state N time steps ahead.

N-step return = R_t+1 + γ * R_t+2 + ... + γ^(N-1) * R_t+N + γ^N * V(S_t+N)

● R_t+1, R_t+2, ..., R_t+N are the rewards received at time steps t+1 to
t+N.
● V(S_t+N) is the estimated value of the state S_t+N (the state N time
steps ahead).
● γ is the discount factor, which determines the importance of future
rewards.
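
For illustration, here is a small Python sketch that computes the N-step return defined above; the reward list, value array, and numbers used in the example are hypothetical.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """Compute R_{t+1} + gamma*R_{t+2} + ... + gamma^(N-1)*R_{t+N} + gamma^N * V(S_{t+N}).

    rewards[k] is R_{k+1} (the reward received on the step from S_k to S_{k+1});
    values[k] is the current estimate V(S_k).
    """
    G = 0.0
    for i in range(n):                     # accumulate discounted rewards R_{t+1} .. R_{t+N}
        G += (gamma ** i) * rewards[t + i]
    G += (gamma ** n) * values[t + n]      # bootstrap from the estimated value of S_{t+N}
    return G

# Example usage with made-up numbers:
rewards = [1.0, 0.0, 2.0, 1.0, 0.0]
values = np.zeros(6)                       # V(S_0) .. V(S_5), all initialised to 0 here
print(n_step_return(rewards, values, t=0, n=3, gamma=0.9))   # 1.0 + 0 + 0.81*2 = 2.62
```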
Scope of N-step Returns:
● N-step returns can provide a balance between short-term and
long-term rewards. They allow for a deeper look into the
consequences of actions taken in the environment.
● They can help improve learning by trading off the bias of one-step bootstrapping against the variance of full-episode returns, often providing a more accurate estimation of the value of states or state-action pairs.

Choosing N:
● The choice of N in N-step returns is a hyperparameter. Selecting the
right value for N depends on the specific problem and the
characteristics of the environment.
● Smaller values of N (e.g., N=1 or N=2) are more similar to TD(0)
methods and focus on shorter-term rewards.
● Larger values of N (e.g., N=3 or more) consider a more extended sequence of rewards, behaving more like Monte Carlo methods; intermediate values of N often give the best trade-off between the bias of bootstrapping and the variance of sampled returns.

TD(λ) algorithm

TD(λ) improves over the offline λ-return algorithm in three ways. First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner. Second, its computations are equally distributed in time rather than all at the end of the episode. And third, it can be applied to continuing problems rather than just to episodic problems. In this section we present the semi-gradient version of TD(λ) with function approximation.

With function approximation, the eligibility trace is a vector with the same number of components as the weight vector w_t. Whereas the weight vector is a long-term memory, accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time than the length of an episode. Eligibility traces assist in the learning process; their only consequence is that they affect the weight vector, and then the weight vector determines the estimated value.
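
As a minimal sketch (not the textbook's pseudocode), the per-step semi-gradient TD(λ) update described above can be written as follows; the argument names are illustrative.

```python
import numpy as np

def td_lambda_step(w, z, grad_v, delta, alpha, gamma, lam):
    """One semi-gradient TD(lambda) update.

    w      : weight vector (long-term memory)
    z      : eligibility trace vector (short-term memory), same shape as w
    grad_v : gradient of v_hat(S_t, w) with respect to w
    delta  : TD error  R_{t+1} + gamma * v_hat(S_{t+1}, w) - v_hat(S_t, w)
    """
    z = gamma * lam * z + grad_v        # decay the trace, then add the current gradient
    w = w + alpha * delta * z           # move the weights along the trace, scaled by the TD error
    return w, z
```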

Need for generalization in practice


Generalization is a fundamental concept in machine learning and
artificial intelligence, and it plays a crucial role in various practical
applications for several reasons:

1. Adaptability to New Data: Generalization enables machine learning models to make predictions or decisions on unseen or future data. In practice, it is often impossible to train models on all possible data points. Generalization allows models to learn from the training data and apply that knowledge to new, previously unseen examples.

2. Efficiency: Generalization allows models to process and make inferences on a wide range of data efficiently. Without generalization, models would have to memorize every data point they encounter, which would be computationally infeasible and result in overfitting.

3. Robustness: A well-generalized model is less susceptible to noise or minor variations in the data. It can filter out irrelevant details and focus on the underlying patterns. This robustness is crucial when dealing with real-world data, which is often noisy or imperfect.

4. Scalability: Generalization is essential for scaling machine learning systems. Models that generalize effectively can handle larger and more diverse datasets, making them applicable to a wide range of domains and problems.

5. Transfer Learning: Generalization allows models to transfer knowledge from one task or domain to another. Pre-trained models can be fine-tuned for specific tasks, saving time and resources and often achieving better performance than training from scratch.

6. Privacy and Data Limitations: In some scenarios, only sensitive or limited data may be available. Generalization can help in creating models that respect privacy (by not memorizing specific data points) and can still provide valuable insights or services.

7. Model Interpretability: A well-generalized model is often more interpretable. It captures the essential features and relationships in the data, making it easier to understand why it makes specific predictions or decisions.

8. Resource Efficiency: Generalization often leads to smaller, more resource-efficient models. In applications with limited computational resources, such as mobile devices or edge computing, generalization is crucial.

9. Adaptation to Changing Environments: In dynamic or evolving environments, models must adapt to new data and changes in the underlying data distribution. Generalization allows models to adapt and continue to make accurate predictions as conditions change.

10. Reduction in Overfitting: One of the primary goals of generalization is to avoid overfitting, where a model fits the training data too closely and performs poorly on new data. Well-generalized models strike a balance between fitting the data and avoiding overfitting.

In practice, the ability to generalize effectively is a key measure of a machine learning model's quality. Achieving the right level of generalization is often a challenging problem, and it depends on various factors, including the choice of algorithms, hyperparameters, data preprocessing, and the amount and quality of training data.

Linear function approximation and geometric view

Linear Function Approximation (LFA) is a technique used to approximate the value function or action-value function in a more general and scalable way compared to tabular methods. Combining LFA with n-step methods, such as TD(n), can be a powerful approach to estimate the value of states or state-action pairs in Markov Decision Processes (MDPs). Here, we'll discuss how Linear Function Approximation can be applied to TD(n) learning.

Linear Function Approximation (LFA):


In LFA, we approximate the value function (V(s)) or action-value function
(Q(s, a)) using a linear combination of features:

- V(s) ≈ θᵀ * φ(s)
- Q(s, a) ≈ θᵀ * φ(s, a)

Where:
- V(s) is the estimated value of state s.
- Q(s, a) is the estimated value of taking action a in state s.
- θ is a weight vector.
- φ(s) and φ(s, a) are feature vectors that represent state s or state-action
pair (s, a).

TD(n) with LFA:


Now, when applying TD(n) with LFA, you are updating the weight vector θ
using n-step bootstrapping. The general update rule for TD(n) with LFA is
as follows:

- For state s and its successor state s', you can update the weight vector
θ using the n-step return:

θ ← θ + α * (Gₙ - V(s)) * ∇θ V(s)

Where:
- Gₙ is the n-step return, which is a sum of rewards and n-step estimates
of the value function.
- V(s) is the current estimate of the value of state s.
- α is the learning rate.
- ∇θ V(s) is the gradient of the value function with respect to the weight vector θ; for a linear approximation V(s) ≈ θᵀ * φ(s), this gradient is simply the feature vector φ(s).

In practice, you will compute the n-step return Gₙ as:

- Gₙ = Rₜ₊₁ + γ * Rₜ₊₂ + ... + γⁿ⁻¹ * Rₜ₊ₙ + γⁿ * V(Sₜ₊ₙ)

The update rule involves computing the temporal difference error (TD
error) as (Gₙ - V(s)), and then using this error to update the weight vector
θ. The update of θ is proportional to the TD error and the gradient of the
value function with respect to the weights.
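
To make the update concrete, here is a hedged sketch of an n-step semi-gradient TD update with a linear value function; the feature function phi and the trajectory layout (rewards[k] = R_{k+1}, states[k] = S_k) are assumptions for illustration.

```python
import numpy as np

def n_step_td_update(theta, phi, rewards, states, t, n, gamma, alpha, done=False):
    """One n-step semi-gradient TD update for a linear value function V(s) = theta @ phi(s).

    phi     : function mapping a state to its feature vector (problem-specific, assumed)
    rewards : rewards[k] is R_{k+1}; states[k] is S_k
    """
    # n-step return: discounted rewards plus a bootstrapped tail value
    G = sum((gamma ** i) * rewards[t + i] for i in range(n))
    if not done:                                  # bootstrap only if S_{t+n} is not terminal
        G += (gamma ** n) * (theta @ phi(states[t + n]))
    v_s = theta @ phi(states[t])                  # current estimate V(S_t)
    grad = phi(states[t])                         # gradient of V(S_t) w.r.t. theta is phi(S_t)
    return theta + alpha * (G - v_s) * grad       # theta <- theta + alpha * (G_n - V(S_t)) * grad
```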

Applying Linear Function Approximation to TD(n) allows you to estimate value functions in environments with a large state or state-action space more efficiently than tabular methods. However, it's important to be mindful of potential issues like function approximation errors, stability, and the choice of features and learning rates. Careful tuning and empirical evaluation are often required to make TD(n) with LFA effective in practice.

Geometric View

The geometric view of TD(n) learning with function approximation provides an intuitive way to understand how the learning algorithm operates in a high-dimensional space. In TD(n), we aim to approximate the value function (V(s) or Q(s, a)) using linear function approximation and n-step returns. The geometric perspective helps us see this process in terms of vectors and projections in a high-dimensional space.

Here's a simplified explanation of the geometric view of TD(n) with linear function approximation:

1. Vector Space Representation:


- Think of each state (s) or state-action pair (s, a) as a point in a
high-dimensional vector space. The dimensionality of this space is
determined by the number of features used for function approximation.

2. Basis Vectors:
- Consider each basis function or feature as a basis vector in this
high-dimensional space.
- Each feature can be thought of as a dimension in the space.

3. Linear Combination of Basis Vectors:


- The value function is approximated as a linear combination of these
basis vectors with associated weights.
- In vector space terms, this is like forming a vector by scaling and
summing the basis vectors according to the weight vector.

4. TD Error as Vector Difference:
- The TD error (temporal difference error) represents the difference between the current value estimate (the current vector) and the n-step return (the target vector).
- This difference can be thought of as a vector in the same high-dimensional space.

5. Weight Update as Projection:


- The weight update is equivalent to adjusting the weight vector to
minimize the TD error (vector) between the current estimate (vector)
and the n-step return (vector).
- This is akin to projecting the target vector onto the current vector to
minimize the error.

6. Optimal Weights as Projection:


- The optimal weight vector is the one that minimizes the TD error
(vector).
- This weight vector aligns the current estimate (vector) and the
n-step return (vector) as closely as possible in the high-dimensional
space.

In this geometric view, TD(n) with linear function approximation involves adjusting the weights to align the vectors that represent the current value estimate and the n-step return. The learning process is similar to finding the best projection of the target vector onto the space spanned by the feature vectors, ensuring that the approximation moves closer to the true value function.
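
To make the projection picture concrete, the toy example below (all numbers are hypothetical) projects a target value vector onto the span of a feature matrix via least squares, which is what the best linear approximation amounts to geometrically.

```python
import numpy as np

# Feature matrix Phi: one row per state, one column per feature (hypothetical numbers)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])           # 3 states, 2 features

v_target = np.array([0.5, 2.0, 2.9])   # a target value vector, e.g. built from n-step returns

# Best weights: the least-squares solution, i.e. the projection of v_target onto span(Phi)
theta, *_ = np.linalg.lstsq(Phi, v_target, rcond=None)

v_approx = Phi @ theta                 # the projected value vector
error = v_target - v_approx            # the component orthogonal to the feature space
print(theta, v_approx, np.round(Phi.T @ error, 10))   # Phi^T error ~ 0: the error is orthogonal
```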

Linear TD(λ)
Linear TD(λ), often referred to as TD(lambda) or Temporal
Difference lambda, is a reinforcement learning algorithm that
combines linear function approximation with eligibility traces to
estimate the value function (V(s) or Q(s, a)) in Markov Decision
Processes (MDPs). It is an extension of the TD(0) algorithm, which
uses a one-step TD error, and it offers a way to balance between
one-step and multi-step updates. Here's an overview of Linear TD(λ):

Key Components:

1. Linear Function Approximation:


- Linear TD(λ) approximates the value function using linear
combinations of features, similar to other linear function
approximation methods.
- The value function is represented as V(s) ≈ θᵀ * φ(s) or Q(s, a) ≈ θᵀ * φ(s, a), where θ is the weight vector and φ(s) or φ(s, a) are feature vectors.

2. TD Error:
- The TD error for Linear TD(λ) at time step t is computed as:
δ_t = R_{t+1} + γ * V(S_{t+1}) - V(S_t)
- It represents the difference between the one-step target (the reward received at time step t+1 plus the discounted value of the next state) and the estimated value of the current state S_t.

3. Eligibility Traces:
- Linear TD(λ) uses eligibility traces to keep track of the eligibility of
each feature for updates. Eligibility traces are a way of assigning credit
to features that contributed to the TD error.

- At each time step t, an eligibility trace vector z_t is updated as follows:
z_t = γ * λ * z_{t-1} + ∇θ V(S_t)

4. Weight Updates:
- The weight vector θ is updated based on the eligibility trace and TD error at each time step:
θ ← θ + α * δ_t * z_t
where α is the learning rate.
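
Putting the pieces above together, here is a minimal sketch of linear TD(λ) for a single episode; the Gym-like environment interface, the feature function phi, the random placeholder policy, and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

def linear_td_lambda_episode(env, phi, theta, alpha=0.05, gamma=0.99, lam=0.9):
    """Run one episode of linear TD(lambda) with accumulating eligibility traces.

    phi(s) returns the feature vector for state s; V(s) = theta @ phi(s).
    env is assumed to follow a Gym-like interface: reset() -> state,
    step(action) -> (next_state, reward, done, info).
    """
    z = np.zeros_like(theta)                 # eligibility trace (short-term memory)
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()   # placeholder policy: random actions
        next_state, reward, done, _ = env.step(action)

        v_s = theta @ phi(state)
        v_next = 0.0 if done else theta @ phi(next_state)
        delta = reward + gamma * v_next - v_s       # TD error delta_t

        z = gamma * lam * z + phi(state)            # z_t = gamma * lambda * z_{t-1} + grad V(S_t)
        theta = theta + alpha * delta * z           # theta <- theta + alpha * delta_t * z_t

        state = next_state
    return theta
```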

Lambda (λ) Parameter:


- The lambda parameter, often referred to as the eligibility trace decay
rate, controls the balance between one-step updates (TD(0)) and
multi-step updates (TD(n)).
- When λ = 0, the algorithm is equivalent to TD(0), meaning it only
considers one-step updates.
- When λ = 1, the algorithm considers multi-step returns and updates
based on the full trajectory of observed rewards.
- Intermediate values of λ allow for a mix of one-step and multi-step
updates.

Advantages of Linear TD(λ):

- Linear TD(λ) combines the advantages of both TD(0) and multi-step TD methods, providing a flexible approach for value function estimation.
- It offers improved sample efficiency compared to TD(0) when using function approximation.
- The lambda parameter allows you to adjust the trade-off between bootstrapping and sampling in the updates.

NOTE:

- The choice of the lambda parameter can impact the algorithm's performance, and finding the optimal value often requires experimentation.
- Linear TD(λ) can be sensitive to the choice of features and the function approximation method used.

Linear TD(λ) is a valuable reinforcement learning algorithm, especially when working with linear function approximation. It strikes a balance between exploring a range of n-step returns and making efficient updates to the value function. The choice of lambda and other hyperparameters can significantly affect its performance in practical applications.

Tile coding

Tile coding, a form of coarse coding, is a technique used in reinforcement learning and function approximation to discretize and represent continuous state spaces. It's a method for mapping continuous-valued features into a sparse, discrete binary representation, making it easier for algorithms to learn and generalize from data. Tile coding is particularly useful in applications where the state space is continuous and needs to be converted into a format suitable for tabular methods or linear function approximation.

Overview

1. Tiling: Tile coding divides the continuous state space into a set of
overlapping regions or "tiles." Each tile represents a discrete region of the
continuous state space.

2. Tiles and Tilings: A single set of tiles is referred to as a "tiling." Multiple tilings can be used, where each tiling provides a different discretization of the state space.

3. Coding Scheme: In each tiling, a coding scheme is used to determine which tiles are "active" or "on" for a given state. The coding scheme is usually based on the values of the state variables.

4. Combining Tilings: The active tiles from multiple tilings are often combined
to create a binary feature vector that represents the state. This feature vector
is used in learning algorithms.
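
The sketch below is a deliberately simplified tile coder for a single continuous variable; the number of tilings, tile counts, range, and offsets are illustrative assumptions, and practical implementations usually rely on dedicated tile-coding utilities.

```python
import numpy as np

def active_tiles(x, num_tilings=4, tiles_per_tiling=10, low=0.0, high=1.0):
    """Return the index of the active tile in each tiling for a scalar input x.

    Each tiling covers [low, high] with tiles_per_tiling tiles and is offset by a
    fraction of the tile width, so nearby x values share many active tiles.
    """
    tile_width = (high - low) / tiles_per_tiling
    indices = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings             # shift each tiling slightly
        idx = int((x - low + offset) // tile_width)
        idx = min(max(idx, 0), tiles_per_tiling)           # clamp (offset can push past the edge)
        indices.append(t * (tiles_per_tiling + 1) + idx)   # unique index per (tiling, tile)
    return indices

def features(x, num_features=4 * 11):
    """Binary feature vector with one component per tile; active tiles are set to 1."""
    phi = np.zeros(num_features)
    phi[active_tiles(x)] = 1.0
    return phi

print(active_tiles(0.32))   # e.g. [3, 14, 25, 36]: one active tile per tiling
```

Nearby inputs such as 0.32 and 0.34 activate mostly the same tiles, which is what gives tile coding its generalization across similar states.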

Advantages of Tile Coding:

1. Discretization of Continuous State Spaces: Tile coding allows you to discretize a continuous state space, which is useful for tabular reinforcement learning algorithms like Q-learning.

2. Generalization: Tile coding provides a form of generalization. Similar states in the continuous space will activate the same or overlapping tiles, allowing the learning algorithm to generalize across these states.

3. Memory Efficiency: It can be more memory-efficient compared to representing the state space explicitly as a large table.

4. Suitable for Linear Function Approximation: When using linear function approximation methods, such as linear regression, tile coding can convert the state space into a format suitable for linear models.

Considerations:

1. Tiling Parameters: Designing tile coding involves setting parameters such as the number of tiles, tile width, and overlap between tiles. The choice of parameters can significantly impact the performance of the algorithm.

2. Tile Coding Library: There are libraries and tools available that can assist in creating and managing tile codings, helping to set up the tiling parameters.

3. Interactions and Features: Tile coding may need to be combined with feature engineering or other techniques to capture complex interactions between state variables.

4. Information Loss: Because tile coding discretizes the continuous state space, some information about that space may be lost. Careful design and tuning of the tiling parameters are necessary to minimize this loss.

Tile coding is a versatile technique used in various reinforcement learning applications, especially when dealing with environments that have continuous state spaces. It provides a way to balance the benefits of discretization and generalization, allowing learning algorithms to work effectively in these types of settings.

Control with function approximation

Control with function approximation is a technique used in reinforcement learning to estimate and optimize policies for decision-making tasks in environments with large or continuous state and action spaces. Traditional tabular methods, which represent the entire state-action space explicitly, become impractical in such settings. Function approximation allows value functions or policies to be represented by parameterized functions, such as neural networks or linear models. Here's an overview of control with function approximation:

● Parameterized Function: In control with function approximation, we use a parameterized function (e.g., a neural network or linear model) to represent the value function (V(s), Q(s, a)) or policy (π(a|s)).

● Continuous or Large State and Action Spaces: This approach is particularly valuable when dealing with continuous state and action spaces, where it is infeasible to maintain a table for each state or state-action pair.

● Approximation Quality: The quality of the function approximation is essential. It should provide a reasonably accurate estimate of the value function or policy, enabling effective decision-making.

Control with function approximation is a critical area of research in reinforcement learning and has led to breakthroughs in tackling complex decision-making problems, including playing complex games, robotic control, and autonomous driving. However, the application of function approximation in control tasks requires careful consideration of algorithm design, stability, and robustness.
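
As one concrete instance (not the only one), here is a hedged sketch of episodic semi-gradient Sarsa with a linear action-value function; the feature function phi(s, a), the discrete action set, the epsilon-greedy helper, and the Gym-like environment interface are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(theta, phi, state, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one under Q = theta @ phi."""
    if np.random.rand() < eps:
        return np.random.choice(actions)
    q_values = [theta @ phi(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def semi_gradient_sarsa_episode(env, phi, theta, actions, alpha=0.05, gamma=0.99, eps=0.1):
    """One episode of semi-gradient Sarsa with a linear Q(s, a) = theta @ phi(s, a)."""
    state = env.reset()
    action = epsilon_greedy(theta, phi, state, actions, eps)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        q_sa = theta @ phi(state, action)
        if done:
            target = reward                          # no bootstrapping from a terminal state
        else:
            next_action = epsilon_greedy(theta, phi, next_state, actions, eps)
            target = reward + gamma * (theta @ phi(next_state, next_action))
        theta = theta + alpha * (target - q_sa) * phi(state, action)   # semi-gradient update
        if not done:
            state, action = next_state, next_action
    return theta
```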

Policy search
Policy search is a class of reinforcement learning methods that focuses on directly
optimizing a parameterized policy (policy being a mapping from states to actions) in
order to find an optimal or near-optimal policy for a given task. Unlike value-based
methods that estimate the value function and derive policies from it, policy search
methods work by iteratively adjusting policy parameters to maximize expected rewards.
Policy search methods are particularly useful in situations where the state and action
spaces are large, continuous, or unknown, making it challenging to use traditional
value-based methods like Q-learning. Here are some key concepts and approaches
related to policy search:

1. Policy Parameterization:
- In policy search, the policy is often parameterized by a set of learnable parameters,
usually denoted as θ. These parameters determine the behavior of the policy.

2. Objective Function:
- The goal of policy search is to find the optimal policy parameters (θ) that maximize
the expected return or cumulative reward. This is typically formulated as an objective
function, such as the expected return or the expected sum of rewards. The objective is to
maximize this function.

3. Policy Optimization Methods:


- There are various techniques to optimize the policy parameters to maximize the
objective function. Common methods include gradient ascent, natural policy gradients,
and evolutionary strategies.
- Many policy search methods use stochastic policies, meaning the policy generates a
probability distribution over actions, and the optimization aims to increase the likelihood
of good actions.

4. Exploration vs. Exploitation:


- Policy search methods need to balance exploration (trying different policies to
discover the best one) and exploitation (making use of the current best policy).
Techniques like adding noise to policy parameters can encourage exploration.

5. Batch vs. Online Methods:


- Policy search can be done in batch mode, where data is collected and used for policy
updates, or online, where the policy is updated after each interaction with the
environment.

6. Sample Efficiency:
- One of the challenges with policy search is sample efficiency. It can require a
significant number of samples (interactions with the environment) to find a good policy,
making it impractical in situations where data collection is costly or risky.

7. Types of Policies:
- Policy search methods can be used with various types of policies, including
deterministic policies (output a single action) and stochastic policies (output a distribution
over actions).

8. Applications:
- Policy search has been applied in various domains, including robotics, autonomous
vehicles, game playing, and natural language processing. It's particularly useful in
situations where the agent has a lot of degrees of freedom, complex actions, or an
unknown environment model.

Each of these methods has its own strengths and weaknesses, and the choice of
algorithm depends on the specific requirements of the task and the available resources.
Policy search methods continue to be an active area of research in reinforcement
learning and artificial intelligence.
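
To make the idea concrete, below is a minimal sketch of one very simple policy search strategy, random-perturbation hill climbing on the policy parameters; the deterministic linear policy, the environment interface, and all constants are assumptions for illustration.

```python
import numpy as np

def episode_return(env, theta, horizon=200):
    """Run one episode with a simple deterministic linear policy and return the total reward."""
    state = env.reset()
    total = 0.0
    for _ in range(horizon):
        action = int(theta @ np.asarray(state) > 0)          # maps to a binary action space
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

def hill_climb_policy_search(env, dim, iterations=100, noise_scale=0.1):
    """Random-perturbation hill climbing: keep a parameter perturbation only if it improves the return."""
    theta = np.zeros(dim)
    best_return = episode_return(env, theta)
    for _ in range(iterations):
        candidate = theta + noise_scale * np.random.randn(dim)   # explore by perturbing parameters
        candidate_return = episode_return(env, candidate)
        if candidate_return > best_return:                        # exploit: keep improvements
            theta, best_return = candidate, candidate_return
    return theta, best_return
```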

Policy gradient methods


Policy gradient methods are a class of reinforcement learning
techniques that focus on directly optimizing a parameterized policy to find
an optimal or near-optimal policy for a given task.
Unlike traditional value-based methods that estimate the value
function and derive policies from it, policy gradient methods work by
iteratively adjusting policy parameters to maximize expected returns. They
are particularly useful in situations where the state and action spaces are
large, continuous, or unknown, making it challenging to use traditional
value-based methods like Q-learning.

Methods:

1. Policy Parameterization:
- In policy gradient methods, the policy is often parameterized by a set of
learnable parameters, typically denoted as θ. These parameters determine
the behavior of the policy.

2. Objective Function:
- The primary goal of policy gradient methods is to find the optimal policy
parameters (θ) that maximize the expected return or cumulative reward.
This is typically formulated as the expected sum of rewards (or return), and
the objective is to maximize this function.

3. Stochastic Policies:
- Many policy gradient methods use stochastic policies, which means the
policy generates a probability distribution over actions. The optimization
aims to increase the likelihood of selecting actions that lead to higher
expected returns.

4. Policy Optimization:
- Policy optimization is the process of finding the optimal policy
parameters θ. Common optimization techniques include gradient ascent,
where the policy parameters are updated in the direction of the gradient of
the expected return with respect to θ.

5. Exploration vs. Exploitation:


- Balancing exploration (trying different policies to discover the best one)
and exploitation (making use of the current best policy) is a key challenge.
Some methods incorporate exploration strategies, such as adding noise to
policy parameters.

6. Sample Efficiency:
- One of the challenges with policy gradient methods is sample efficiency. They often require a significant number of samples (interactions with the environment) to find a good policy, which can be a limitation in domains where data collection is expensive or risky.

7. Baseline and Advantage Estimation:


- Policy gradient methods often use baselines to reduce the variance of
the gradient estimates. A common choice is to subtract a baseline (e.g., the
estimated value function) from the returns. This is known as advantage
estimation.

8. Types of Policies:
- Policy gradient methods can be used with various types of policies,
including deterministic policies (output a single action) and stochastic
policies (output a distribution over actions).

Policy gradient methods are widely used in a range of applications, including robotics, natural language processing, and game playing, where complex policies are required, and traditional value-based methods face challenges. The choice of a specific policy gradient algorithm depends on the task's requirements, the available resources, and the desired balance between exploration and exploitation.
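
For illustration, here is a hedged sketch of REINFORCE, the simplest policy gradient method, using a softmax policy over discrete actions and an optional constant baseline; the feature function phi(s, a), the environment interface, and the hyperparameters are assumptions.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Return action probabilities pi(a|s) proportional to exp(theta @ phi(s, a))."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                         # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_episode(env, phi, theta, actions, alpha=0.01, gamma=0.99, baseline=0.0):
    """One REINFORCE update: collect an episode, then ascend the policy gradient."""
    states, chosen, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        probs = softmax_policy(theta, phi, state, actions)
        a_idx = np.random.choice(len(actions), p=probs)
        next_state, reward, done, _ = env.step(actions[a_idx])
        states.append(state)
        chosen.append(a_idx)
        rewards.append(reward)
        state = next_state

    # Discounted returns G_t, computed backwards through the episode
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for s, a_idx, G_t in zip(states, chosen, returns):
        probs = softmax_policy(theta, phi, s, actions)
        # grad log pi(a|s) for a linear softmax policy: phi(s, a) - sum_b pi(b|s) * phi(s, b)
        grad_log = phi(s, actions[a_idx]) - sum(p * phi(s, b) for p, b in zip(probs, actions))
        theta = theta + alpha * (G_t - baseline) * grad_log
    return theta
```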

Experience replay
Experience replay is a fundamental technique used in reinforcement
learning, particularly in the context of deep reinforcement learning, to
improve the stability, efficiency, and effectiveness of training algorithms.
Experience replay involves storing and randomly sampling experiences from
an agent's interactions with the environment. These experiences, often
represented as tuples of (state, action, reward, next state), are stored in a
replay buffer, and they are sampled and replayed during training. The main
idea behind experience replay is to break the temporal correlations between
consecutive samples and make the learning process more data-efficient and
stable.

How Experience Replay Works:

1. Data Collection: During interactions with the environment, the agent collects experiences, such as (s, a, r, s').

2. Replay Buffer Storage: These experiences are stored in the replay buffer.

3. Training: At regular intervals during training, the agent samples a mini-batch of experiences from the replay buffer.

4. Learning: The agent updates its policy or value function using the experiences in the mini-batch, typically through gradient-based methods.
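
A minimal replay buffer sketch; the capacity and batch size shown are illustrative choices, not prescribed values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Uniformly sample a mini-batch, breaking temporal correlations between samples."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```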

Popular algorithms that incorporate experience replay include Deep Q-Networks (DQN), where it plays a crucial role in training deep neural networks to approximate the Q-function efficiently.

The experience replay mechanism has become a standard component in deep reinforcement learning frameworks, contributing to their stability and success in a wide range of applications, including game playing, robotics, and autonomous systems.

Fitted Q iteration
Fitted Q-Iteration (FQI) is an iterative reinforcement learning algorithm that is used for
approximating the optimal Q-function in a model-free manner. The algorithm is designed
for environments with large or continuous state-action spaces where traditional tabular
methods are impractical. Fitted Q-Iteration leverages function approximation to estimate
the Q-values more efficiently.

● Fitted Q-Iteration is particularly well-suited for high-dimensional and continuous state and action spaces, where traditional tabular methods are infeasible.
● The choice of function approximation method, such as neural networks, can affect the success and stability of FQI. Deep Q-Networks (DQNs) are a popular choice for this purpose.
● Like many reinforcement learning methods, Fitted Q-Iteration requires careful tuning of hyperparameters, such as the learning rate and discount factor.
● FQI can be computationally expensive because it involves solving optimization problems in each iteration, which can be a challenge for very large datasets.
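
A hedged sketch of the core Fitted Q-Iteration loop, using a scikit-learn regressor as the function approximator; the transition format, the choice of random forests, and the iteration count are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(transitions, actions, n_iterations=20, gamma=0.99):
    """Approximate Q(s, a) from a fixed batch of transitions (s, a, r, s', done).

    transitions: list of tuples (state_vector, action_index, reward, next_state_vector, done)
    actions: list of discrete action indices
    """
    states = np.array([t[0] for t in transitions])
    acts = np.array([[t[1]] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    X = np.hstack([states, acts])            # regressor input: state features plus the action index
    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            targets = rewards                # first iteration: Q_1 approximates the immediate reward
        else:
            # Bellman targets: r + gamma * max_a' Q_k(s', a'), with no bootstrap at terminal states
            next_q = np.column_stack([
                q_model.predict(np.hstack([next_states,
                                           np.full((len(next_states), 1), a)]))
                for a in actions
            ])
            targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
        q_model = RandomForestRegressor(n_estimators=50).fit(X, targets)
    return q_model
```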

Fitted Q-Iteration is a foundational algorithm in deep reinforcement learning and has paved the way for more advanced methods like Deep Q-Networks (DQNs) and Double Deep Q-Networks (DDQNs). It's often used as a building block for solving complex tasks in reinforcement learning, such as playing video games, robotic control, and autonomous systems.

Case studies
Case 1: Watson’s Daily-Double Wagering

IBM Watson is the system developed by a team of IBM researchers to play the popular TV quiz show Jeopardy!. It gained fame in 2011 by winning first prize in an exhibition match against human champions. Although the main technical achievement demonstrated by Watson was its ability to quickly and accurately answer natural language questions over broad areas of general knowledge, its winning Jeopardy! performance also relied on sophisticated decision-making strategies for critical parts of the game. Tesauro, Gondek, Lechner, Fan, and Prager (2012, 2013) adapted Tesauro's TD-Gammon system described above to create the strategy used by Watson in "Daily-Double" (DD) wagering in its celebrated winning performance against human champions. These authors report that the effectiveness of this wagering strategy went well beyond what human players are able to do in live game play, and that it, along with other advanced strategies, was an important contributor to Watson's impressive winning performance. Here we focus only on DD wagering because it is the component of Watson that owes the most to reinforcement learning.

Jeopardy! is played by three contestants who face a board showing 30 squares, each of which hides a clue and has a dollar value. The squares are arranged in six columns, each corresponding to a different category. A
contestant selects a square, the host reads the square’s clue, and each
contestant may choose to respond to the clue by sounding a buzzer
(“buzzing in”). The first contestant to buzz in gets to try responding to the
clue. If this contestant’s response is correct, their score increases by the
dollar value of the square; if their response is not correct, or if they do not
respond within five seconds, their score decreases by that amount, and the
other contestants get a chance to buzz in to respond to the same clue. One
or two squares (depending on the game's current round) are special DD squares. A contestant who selects one of these gets an exclusive opportunity to respond to the square's clue and has to decide—before the
clue is revealed—on how much to wager, or bet. The bet has to be greater
than five dollars but not greater than the contestant’s current score. If the
contestant responds correctly to the DD clue, their score increases by the
bet amount; otherwise it decreases by the bet amount. At the end of each
game is a “Final Jeopardy” (FJ) round in which each contestant writes down
a sealed bet and then writes an answer after the clue is read. The
contestant with the highest score after three rounds of play (where a round
consists of revealing all 30 clues) is the winner. The game has many other
details, but these are enough to appreciate the importance of DD wagering.
Winning or losing often depends on a contestant’s DD wagering strategy.

Whenever Watson selected a DD square, it chose its bet by comparing action values, q̂(s, bet), that estimated the probability of a win from the current game state, s, for each round-dollar legal bet. Except for some risk-abatement measures described below, Watson selected the bet with the maximum action value. Action values were computed whenever a betting decision was needed by using two types of estimates that were learned before any live game play took place. The first were estimated values of the afterstates (Section 6.8) that would result from selecting each legal bet. These estimates were obtained from a state-value function, v̂(·, w), defined by parameters w, that gave estimates of the probability of a win for Watson from any game state. The second estimates used to compute action values gave the "in-category DD confidence," p_DD, which estimated the likelihood that Watson would respond correctly to the as-yet unrevealed DD clue.
Tesauro et al. used the reinforcement learning approach of TD-Gammon described above to learn v̂(·, w): a straightforward combination of nonlinear TD(λ) using a multilayer ANN with weights w trained by backpropagating TD errors during many simulated games. States were represented to the network by feature vectors specifically designed for Jeopardy!. Features included the current scores of the three players, how many DDs remained, the total dollar value of the remaining clues, and other information related to the amount of play left in the game. Unlike TD-Gammon, which learned by self-play, Watson's v̂ was learned over millions of simulated games against carefully-crafted models of human players. In-category confidence estimates were conditioned on the number of right responses r and wrong responses w that Watson gave in previously-played clues in the current category. The dependencies on (r, w) were estimated from Watson's actual accuracies over many thousands of historical categories.

Why was the TD-Gammon method of self-play not used to learn the
critical value function vˆ? Learning from self-play in Jeopardy! would not
have worked very well because Watson was so different from any human
contestant. Self-play would have led to exploration of state space regions
that are not typical for play against human opponents, particularly human
champions. In addition, unlike backgammon, Jeopardy! is a game of
imperfect information because contestants do not have access to all the
information influencing their opponents’ play. In particular, Jeopardy!
contestants do not know how much confidence their opponents have for
responding to clues in the various categories. Self-play would have been
something like playing poker with someone who is holding the same cards
that you hold.

As a result of these complications, much of the effort in developing Watson's DD-wagering strategy was devoted to creating good models of human opponents. The models did not address the natural language aspect
human opponents. The models did not address the natural language aspect
of the game, but were instead stochastic process models of events that can
occur during play. Statistics were extracted from an extensive fan-created
archive of game information from the beginning of the show to the present
day. The archive includes information such as the ordering of the clues, right
and wrong contestant answers, DD locations, and DD and FJ bets for nearly
300,000 clues. Three models were constructed: an Average Contestant
model (based on all the data), a Champion model (based on statistics from
games with the 100 best players), and a Grand Champion model (based on
statistics from games with the 10 best players). In addition to serving as
opponents during learning, the models were used to assess the benefits
produced by the learned DD-wagering strategy. Watson’s win rate in
simulation when it used a baseline heuristic DD-wagering strategy was
61%; when it used the learned values and a default confidence value, its
win rate increased to 64%; and with live in-category confidence, it was 67%.
Tesauro et al. regarded this as a significant improvement, given that the DD
wagering was needed only about 1.5 to 2 times in each game.

Because Watson had only a few seconds to bet, as well as to select squares and decide whether or not to buzz in, the computation time needed to make these decisions was a critical factor. The ANN implementation of v̂
allowed DD bets to be made quickly enough to meet the time constraints of
live play. However, once games could be simulated fast enough through
improvements in the simulation software, near the end of a game it was
feasible to estimate the value of bets by averaging over many Monte-Carlo
trials in which the consequence of each bet was determined by simulating
play to the game’s end. Selecting endgame DD bets in live play based on
Monte-Carlo trials instead of the ANN significantly improved Watson’s
performance because errors in value estimates in endgames could seriously
affect its chances of winning. Making all the decisions via Monte-Carlo trials
might have led to better wagering decisions, but this was simply impossible
given the complexity of the game and the time constraints of live play.

Although its ability to quickly and accurately answer natural language questions stands out as Watson's major achievement, all of its sophisticated decision strategies contributed to its impressive defeat of human champions. According to Tesauro et al. (2012):

It is plainly evident that our strategy algorithms achieve a level of quantitative precision and real-time performance that exceeds human capabilities. This is particularly true in the cases of DD wagering and endgame buzzing, where humans simply cannot come close to matching the precise equity and confidence estimates and complex decision calculations performed by Watson.

Refer to Chapter 16 of the textbook for further case studies.
