
Reinforcement Learning

Introduction

In which we examine how an agent can learn from success and failure, reward and punishment.


4/28/2012

Introduction

**Learning to ride a bicycle:**

The goal given to the reinforcement learning system is simply to ride the bicycle without falling over. The system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right.

Photo: http://www.roanoke.com/outdoors/bikepages/bikerattler.html

Introduction

**Learning to ride a bicycle:**

The RL system turns the handlebars to the LEFT. Result: CRASH!!! It receives negative reinforcement.
The RL system turns the handlebars to the RIGHT. Result: CRASH!!! It receives negative reinforcement.


Introduction

**Learning to ride a bicycle:**

The RL system has learned that the "state" of being tilted 45 degrees to the right is bad.
It repeats the trial, this time reaching a tilt of 40 degrees to the right.
By performing enough of these trial-and-error interactions with the environment, the RL system ultimately learns how to prevent the bicycle from ever falling over.

Reinforcement Learning

A trial-and-error learning paradigm

Rewards and Punishments

**Not just an algorithm, but a new paradigm in itself**

Learn about a system, and control its behaviour, from minimal feedback

Inspired by behavioural psychology


Intro to RL


RL Framework

[Diagram: the Agent selects an Action; the Environment returns a new State and a scalar evaluation.]

Learn from close interaction
Stochastic environment
Noisy, delayed scalar evaluation
Maximize a measure of long-term performance

**Not Supervised Learning!**

[Diagram: Input → Agent → Output; in supervised learning, an Error is computed against a Target.]

Very sparse "supervision"
No target output provided
No error gradient information available
Action chooses the next state
Explore to estimate the gradient: trial-and-error learning


Not Unsupervised Learning!

[Diagram: Input → Agent → Activation, with an external Evaluation signal.]

Sparse "supervision" is available
Pattern detection is not the primary goal

The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
At each step t, the agent:
observes state s_t ∈ S
produces action a_t ∈ A(s_t)
gets resulting reward r_{t+1} ∈ ℝ and resulting next state s_{t+1}

This produces a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...

The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities; π_t(s, a) = probability that a_t = a when s_t = s.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.

Goals and Rewards

Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns

Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize? In general, we want to maximize the expected return E{R_t} for each step t.
Episodic tasks: interaction breaks naturally into episodes, e.g. plays of a game, trips through a maze.
R_t = r_{t+1} + r_{t+2} + ... + r_T, where T is a final time step at which a terminal state is reached, ending an episode.

Passive Learning in a Known Environment

Passive learner: a passive learner simply watches the world going by, and tries to learn the utility of being in various states.
Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.
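The episodic return above, and its expected value estimated over several episodes, can be sketched in Python (a minimal illustration; the reward lists are made up):

```python
def episodic_return(rewards):
    # Undiscounted episodic return: R_t = r_{t+1} + r_{t+2} + ... + r_T
    return sum(rewards)

def expected_return(episodes):
    # Monte Carlo estimate of E{R_t} from the start state
    return sum(episodic_return(e) for e in episodes) / len(episodes)

# A maze episode: zero reward per step, +1 on reaching the goal
print(episodic_return([0, 0, 0, 1]))        # -> 1
print(expected_return([[0, 0, 1], [0, -1]]))  # (1 + (-1)) / 2 -> 0.0
```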

Passive Learning in a Known Environment

In passive learning, the environment generates state transitions and the agent perceives them.
Consider an agent trying to learn the utilities of the states in a simple grid world.
The agent can move {North, South, East, West}, and terminates on reaching [4,2] or [4,3].

Passive Learning in a Known Environment

The agent is provided with a model M_ij giving the probability of a transition from state i to state j.
The object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.
Utilities can be learned using three approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)

Passive Learning in a Known Environment: LMS

The agent makes random runs (sequences of random moves) through the environment, for example:
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1
Collect statistics on the final payoff for each state (e.g. when in [2,3], how often was +1 reached vs -1?).
The learner computes the average for each state.
This provably converges to the true expected values (utilities). (Algorithm on page 602, Figure 20.3.)
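The LMS idea, averaging the final payoff over every run that visits a state, can be sketched as follows (a toy illustration using the two runs above; counting each visited state once per run is an assumption about one reasonable implementation):

```python
from collections import defaultdict

def lms_utilities(episodes):
    """Estimate U(s) as the average final payoff over all runs that visit s.
    Each episode is (list_of_visited_states, final_payoff)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, payoff in episodes:
        for s in set(states):       # count each state once per run
            totals[s] += payoff
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

episodes = [
    ([(1,1), (1,2), (1,3), (2,3), (3,3), (4,3)], +1),  # the +1 run
    ([(1,1), (2,1), (3,1), (3,2), (4,2)], -1),          # the -1 run
]
print(lms_utilities(episodes)[(1,1)])  # visited by both runs -> 0.0
print(lms_utilities(episodes)[(2,3)])  # only on the +1 run -> 1.0
```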

Passive Learning in a Known Environment: LMS

Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

Passive Learning in a Known Environment: ADP (Adaptive Dynamic Programming)

Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model.

Passive Learning in a Known Environment: ADP

In general: U(i) = R(i) + Σ_j M_ij U(j)
R(i) is the reward of being in state i (often nonzero for only a few end states).
M_ij is the probability of a transition from state i to j.

Consider U(3,3), with each neighbour reached with probability 1/3 (rounded to 0.33 on the slide):
U(3,3) = (1/3) × U(4,3) + (1/3) × U(2,3) + (1/3) × U(3,2)
       = (1/3) × 1.0 + (1/3) × 0.0886 + (1/3) × (-0.4430)
       = 0.2152
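The slide's arithmetic for U(3,3) can be checked directly (treating the 0.33 transition probabilities as a rounded 1/3):

```python
# Neighbour utilities taken from the slide's example
u_43, u_23, u_32 = 1.0, 0.0886, -0.4430

# U(3,3) = (1/3) * U(4,3) + (1/3) * U(2,3) + (1/3) * U(3,2)
u_33 = (u_43 + u_23 + u_32) / 3
print(round(u_33, 4))  # -> 0.2152
```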

Passive Learning in a Known Environment: ADP

ADP makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment.
It is somewhat intractable for large state spaces.

Passive Learning in a Known Environment: TD (Temporal Difference Learning)

The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.
Suppose we observe a transition from state i to state j, where U(i) = -0.5 and U(j) = +0.5.
This suggests that we should increase U(i) to make it agree better with its successor.
This can be achieved using an update rule of the form:
U(i) ← U(i) + α (R(i) + U(j) - U(i))

TD Learning performance:
Runs are "noisier" than LMS, but with smaller error.
TD deals only with the states observed during sample runs (not all states, unlike ADP).
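The TD adjustment for the observed transition can be sketched as follows (the step size α and the zero step reward are illustrative assumptions):

```python
def td_update(U, i, j, reward_i, alpha=0.1):
    # One passive TD adjustment after observing transition i -> j:
    # U(i) <- U(i) + alpha * (R(i) + U(j) - U(i))
    U[i] = U[i] + alpha * (reward_i + U[j] - U[i])
    return U

U = {'i': -0.5, 'j': +0.5}
td_update(U, 'i', 'j', reward_i=0.0, alpha=0.5)
print(U['i'])  # moved toward its successor: -0.5 + 0.5*(0 + 0.5 + 0.5) = 0.0
```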

Passive Learning in an Unknown Environment

The least mean squares (LMS) approach and the temporal-difference (TD) approach operate unchanged in an initially unknown environment.
The adaptive dynamic programming (ADP) approach adds a step that updates an estimated model of the environment.

ADP Approach
The environment model is learned by direct observation of transitions.
The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbors.

Passive Learning in an Unknown Environment: ADP & TD Approaches

The ADP approach and the TD approach are closely related.
Both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors.

Minor differences:
TD adjusts a state to agree with its observed successor.
ADP adjusts the state to agree with all of the successors.

Important differences:
TD makes a single adjustment per observed transition.
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M.

Passive Learning in an Unknown Environment

To make ADP more efficient:
directly approximate the algorithm for value iteration or policy iteration;
the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates.
Advantages of approximate ADP:
efficient in terms of computation;
eliminates the long value iterations that occur in the early stages.

The Markov Property

"The state" at step t means whatever information is available to the agent at step t about its environment.
The state can include immediate "sensations", highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e. it should have the Markov property:
Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0 } = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }
for all s', r, and histories s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0.

Markov Decision Processes

If a reinforcement learning task has the Markov property, it is basically a Markov decision process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give M = <S, A, P, R>:
state and action sets;
one-step "dynamics" defined by transition probabilities:
P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a } for all s, s' ∈ S, a ∈ A(s);
reward expectations:
R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' } for all s, s' ∈ S, a ∈ A(s).

An Example Finite MDP: Recycling Robot

At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected.

Recycling Robot MDP

S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
R^search = expected number of cans while searching
R^wait = expected number of cans while waiting, with R^search > R^wait
[Transition diagram: from high, search stays high with probability α and drops to low with probability 1-α (reward R^search); from low, search stays low with probability β (reward R^search), and with probability 1-β the robot runs out of power and is rescued (reward -3, returned to high); wait keeps the current level (reward R^wait); recharge moves low to high (reward 0).]

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{T} r_{t+k+1} | s_t = s }
The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following π.
Action-value function for policy π:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{T} r_{t+k+1} | s_t = s, a_t = a }
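The recycling-robot dynamics can be written down as a small transition table. The numeric values of α, β, and the rewards below are illustrative assumptions, not taken from the slides:

```python
# P[s][a] is a list of (probability, next_state, reward) triples.
alpha, beta = 0.9, 0.6          # assumed battery-survival probabilities
R_search, R_wait = 2.0, 1.0     # assumed expected can counts, R_search > R_wait

P = {
    'high': {
        'search': [(alpha, 'high', R_search), (1 - alpha, 'low', R_search)],
        'wait':   [(1.0, 'high', R_wait)],
    },
    'low': {
        'search': [(beta, 'low', R_search), (1 - beta, 'high', -3.0)],  # rescued
        'wait':   [(1.0, 'low', R_wait)],
        'recharge': [(1.0, 'high', 0.0)],
    },
}

# Sanity check: outgoing probabilities for each (state, action) sum to 1
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _, _ in P[s][a]) - 1.0) < 1e-9
print(sorted(P['low']))  # -> ['recharge', 'search', 'wait']
```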

Bellman Equation for a Policy π

The basic idea:
R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... = r_{t+1} + R_{t+1}
So:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + V^π(s_{t+1}) | s_t = s }
Or, without the expectation operator:
V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V^π(s') ]

Optimal Value Functions

For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S.
There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S

Bellman Optimality Equation

The value of a state under an optimal policy must equal the expected return for the best action from that state:
V*(s) = max_{a ∈ A(s)} E{ r_{t+1} + V*(s_{t+1}) | s_t = s, a_t = a }
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V*(s') ]
V* is the unique solution of this system of nonlinear equations.

Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:
Q*(s, a) = E{ r_{t+1} + max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
         = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + max_{a'} Q*(s', a') ]
Q* is the unique solution of this system of nonlinear equations.

Dynamic Programming

DP is the solution method of choice for MDPs.
It requires complete knowledge of the system dynamics (P and R).
Expensive and often not practical (curse of dimensionality), but guaranteed to converge!
RL methods: online approximate dynamic programming.
No knowledge of P and R; sample trajectories through state space; some theoretical convergence analysis available.

Policy Evaluation

Policy evaluation: for a given policy π, compute the state-value function V^π.
Recall the state-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} r_{t+k+1} | s_t = s }
Bellman equation for V^π:
V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V^π(s') ]
This is a system of |S| simultaneous linear equations; solve iteratively.
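Iterative policy evaluation can be sketched as follows. Note that the discount factor gamma is an assumption added here so the iteration converges on a non-terminating toy example; the slides themselves use undiscounted episodic returns:

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-10):
    # P[s][a] = list of (prob, next_state, reward); policy[s] = {a: prob}
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: V(s) = sum_a pi(s,a) sum_s' P [r + gamma V(s')]
            v = sum(pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a, pi_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Toy two-state chain: staying in A earns 1 per step, staying in B earns 0.
P = {'A': {'stay': [(1.0, 'A', 1.0)]}, 'B': {'stay': [(1.0, 'B', 0.0)]}}
policy = {'A': {'stay': 1.0}, 'B': {'stay': 1.0}}
V = policy_evaluation(P, policy)
print(round(V['A'], 3))  # geometric series 1/(1 - 0.9) -> 10.0
```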

Policy Improvement

Suppose we have computed V^π for a deterministic policy π.
For a given state s, would it be better to do an action a ≠ π(s)?
The value of doing a in state s is:
Q^π(s, a) = E_π{ r_{t+1} + V^π(s_{t+1}) | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V^π(s') ]
It is better to switch to action a for state s if and only if Q^π(s, a) > V^π(s).

Policy Improvement (cont.)

Do this for all states to get a new policy π' that is greedy with respect to V^π:
π'(s) = argmax_a Q^π(s, a) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V^π(s') ]
Then V^π' ≥ V^π.

Policy Improvement (cont.)

What if V^π' = V^π? Then for all s ∈ S:
V^π'(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V^π'(s') ]
But this is the Bellman optimality equation.
So V^π' = V* and both π and π' are optimal policies.

Policy Iteration

π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*
alternating policy evaluation and policy improvement ("greedification")

Value Iteration

Recall the Bellman optimality equation:
V*(s) = max_{a ∈ A(s)} E{ r_{t+1} + V*(s_{t+1}) | s_t = s, a_t = a } = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V*(s') ]
We can convert it to a full value iteration backup:
V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V_k(s') ]
Iterate until "convergence".

Generalized Policy Iteration

Generalized policy iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
[Diagram: evaluation V → V^π and improvement π → greedy(V), spiraling in to V* and π*.]
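A minimal value iteration sketch on a toy two-state MDP (again with an assumed discount factor gamma added for convergence; the slides use undiscounted returns):

```python
def value_iteration(P, gamma=0.9, theta=1e-10):
    # P[s][a] = list of (prob, next_state, reward)
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # V_{k+1}(s) <- max_a sum_s' P [r + gamma V_k(s')]
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Read off a greedy policy from the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
    return V, pi

P = {'A': {'left': [(1.0, 'A', 0.0)], 'right': [(1.0, 'B', 1.0)]},
     'B': {'left': [(1.0, 'A', 0.0)], 'right': [(1.0, 'B', 1.0)]}}
V, pi = value_iteration(P)
print(pi['A'])  # -> 'right' (moving toward the rewarding state)
```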

Dynamic Programming Backup

V(s_t) ← E_π{ r_{t+1} + V(s_{t+1}) }
[Backup diagram: a full backup, averaging over all possible next states s_{t+1}.]

Simplest TD Method

V(s_t) ← V(s_t) + α [ r_{t+1} + V(s_{t+1}) - V(s_t) ]
[Backup diagram: a sample backup along the single observed transition s_t → s_{t+1}.]

RL Algorithms: Prediction

Policy evaluation (the prediction problem): for a given policy, compute the state-value function.
No knowledge of P and R, but access to the real system, or a "sample" model, is assumed.
Uses "bootstrapping" and sampling.
The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [ r_{t+1} + V(s_{t+1}) - V(s_t) ]

Advantages of TD

TD methods do not require a model of the environment, only experience.
TD methods can be fully incremental:
you can learn before knowing the final outcome (less memory, less peak computation);
you can learn without the final outcome (from incomplete sequences).

RL Algorithms: Control

SARSA:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Do this after every transition from a nonterminal state s_t.

One-step Q-learning:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

Actor-Critic Methods

[Diagram: the Actor (policy) selects actions; the Critic (value function) turns rewards into a TD error that trains both, inside the environment loop.]
Explicit representation of the policy as well as the value function.
Minimal computation to select actions.
Can learn an explicit stochastic policy.
Can put constraints on policies.
Appealing as psychological and neural models.
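The one-step Q-learning backup can be sketched as follows (undiscounted, as in the slides; the toy transition and step size are assumptions):

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s2, actions, alpha=0.1):
    # Q(s,a) <- Q(s,a) + alpha [r + max_a' Q(s',a') - Q(s,a)];
    # the max over a' is taken as 0 when s' is terminal (empty action set).
    target = r + (max(Q[(s2, a2)] for a2 in actions) if actions else 0.0)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
# Repeatedly observe a transition into a terminal state with reward +1:
for _ in range(100):
    q_learning_step(Q, 's0', 'go', 1.0, 'terminal', actions=[], alpha=0.5)
print(round(Q[('s0', 'go')], 3))  # converges toward the reward -> 1.0
```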

Actor-Critic Details

The TD error is used to evaluate actions:
δ_t = r_{t+1} + V(s_{t+1}) - V(s_t)
If actions are determined by preferences p(s, a), as follows:
π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}
then you can update the preferences like this:
p(s_t, a_t) ← p(s_t, a_t) + β δ_t

Active Learning in an Unknown Environment

An active agent must consider:
what actions to take;
what their outcomes may be;
how they will affect the rewards received.
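The softmax (Gibbs) policy over preferences, and the paired critic/actor updates, can be sketched as follows (state names, rewards, and step sizes are made up; the undiscounted TD error follows the slides):

```python
import math

def softmax_policy(prefs, state, actions):
    # pi(s,a) = e^{p(s,a)} / sum_b e^{p(s,b)}
    exps = {a: math.exp(prefs.get((state, a), 0.0)) for a in actions}
    z = sum(exps.values())
    return {a: exps[a] / z for a in exps}

def actor_critic_update(prefs, V, s, a, r, s2, alpha=0.1, beta=0.1):
    # Critic: delta_t = r + V(s') - V(s); Actor: p(s,a) <- p(s,a) + beta*delta
    delta = r + V.get(s2, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    prefs[(s, a)] = prefs.get((s, a), 0.0) + beta * delta
    return delta

prefs, V = {}, {}
actor_critic_update(prefs, V, 's0', 'left', r=1.0, s2='s1')
pi = softmax_policy(prefs, 's0', ['left', 'right'])
print(pi['left'] > pi['right'])  # the rewarded action became more probable -> True
```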

Active Learning in an Unknown Environment

Minor changes to the passive learning agent:
the environment model now incorporates the probabilities of transitions to other states given a particular action;
the agent must maximize its expected utility;
the agent needs a performance element to choose an action at each step.

Active ADP Approach

Need to learn the probability M^a_ij of a transition, instead of M_ij.
The input to the function will include the action taken.

Active Learning in an Unknown Environment: Active TD Approach

The model acquisition problem for the TD agent is identical to that for the ADP agent.
The update rule remains unchanged.
The TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.

Exploration

Learning also involves the exploration of unknown areas.
Photo: http://www.duke.edu/~icheese/cgeorge.html

An agent can benefit from actions in two ways:
immediate rewards;
the percepts received.

Wacky approach vs. greedy approach
[Figure: grid of learned utility values (-0.038, -0.165, 0.089, 0.215, -0.443, -0.418, -0.544, -0.772) contrasting the two approaches.]

Exploration: The Bandit Problem

[Photos: www.freetravel.net]

Exploration: The Exploration Function

A simple example:
u = expected utility (greed)
n = number of times the action has been tried (wacky)
R+ = best possible reward
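One common concrete choice for an exploration function in this spirit is the optimistic rule f(u, n) = R+ while an action is under-tried, and u afterwards (the cutoff N_e and the value of R+ below are assumptions):

```python
R_PLUS = 2.0   # assumed best reward obtainable in any state
N_E = 5        # assumed minimum number of tries per action

def exploration_fn(u, n):
    # Optimistic while under-explored, greedy once well-explored
    return R_PLUS if n < N_E else u

print(exploration_fn(0.3, 2))   # under-tried: optimistic -> 2.0
print(exploration_fn(0.3, 10))  # well-tried: the estimate -> 0.3
```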

Learning an Action-Value Function

What are Q-values? Q(a, i) denotes the value of doing action a in state i; utilities are related by U(i) = max_a Q(a, i).
The Q-values formula: an equilibrium equation relating each Q-value to the rewards and Q-values of successor states (shown as an image in the original slides).
Its application is just an adaptation of the active learning equation.

The TD Q-learning update equation:
Q(a, i) ← Q(a, i) + α ( R(i) + max_{a'} Q(a', j) - Q(a, i) )
It requires no model, and is calculated after each transition from state i to j.

The TD Q-learning update equation in practice: the TD-Gammon system (Tesauro), successor to the Neurogammon program, attempted to learn from self-play using an implicit representation.

Generalization in Reinforcement Learning: Explicit Representation

We have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form.
An explicit representation involves one output value for each input tuple.
This is good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger.
It may be possible to handle 10,000 states or more; this suffices for 2-dimensional, maze-like environments.
Problem: more realistic worlds are out of the question.
E.g. chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states.
So it would be absurd to suppose that one must visit all these states in order to learn how to play the game.

Generalization in Reinforcement Learning: Implicit Representation

Overcomes the explicit problem: a form that allows one to calculate the output for any input, but that is much more compact than the tabular form.
For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, ..., f_n:
U(i) = w_1 f_1(i) + w_2 f_2(i) + ... + w_n f_n(i)
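A sketch of such a weighted linear evaluation (the two board features and their weights are hypothetical, purely for illustration):

```python
def linear_utility(weights, features, board):
    # U(i) = w1*f1(i) + w2*f2(i) + ... + wn*fn(i)
    return sum(w * f(board) for w, f in zip(weights, features))

# Hypothetical features for a board position: material balance and mobility
features = [lambda b: b['material'], lambda b: b['mobility']]
weights = [1.0, 0.1]
print(linear_utility(weights, features, {'material': 3, 'mobility': 20}))  # -> 5.0
```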

Generalization in Reinforcement Learning: Implicit Representation

The utility function is characterized by the n weights.
A typical chess evaluation function might only have 10 weights, so this is enormous compression.
The enormous compression achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited.
The most important aspect: it allows for inductive generalization over input states.
Therefore, such methods are said to perform input generalization.

Game Playing: Galapagos

Mendel is a four-legged, spider-like creature.
He has goals and desires, rather than instructions; through trial and error, he programs himself to satisfy those desires.
He is born not even knowing how to walk, and he has to learn to identify all of the deadly things in his environment.
He has two basic drives: move, and avoid pain (negative reinforcement).

The player has no direct control over Mendel.
The player turns various objects on and off and activates devices in order to guide him.
The player has to let Mendel die a few times, otherwise he'll never learn.
Each death proves to be a valuable lesson, as the more experienced Mendel begins to avoid the things that cause him pain.

Generalization in Reinforcement Learning: Input Generalization

The cart-pole problem: balancing a long pole upright on the top of a moving cart.
The cart can be jerked left or right by a controller that observes x, x', θ, and θ'.
The earliest work on learning for this problem was carried out by Michie and Chambers (1968).
Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.

Generalization in Reinforcement Learning: Input Generalization

The algorithm first discretized the 4-dimensional state space into boxes, hence the name.
It then ran trials until the pole fell over or the cart hit the end of the track.
Negative reinforcement was associated with the final action in the final box, and was then propagated back through the sequence.
The discretization causes some problems when the apparatus is initialized in a different position.
Improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward.
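The BOXES-style discretization step can be sketched as mapping the continuous 4-dimensional state into one box of a fixed grid (the bin edges below are arbitrary assumptions):

```python
def to_box(x, x_dot, theta, theta_dot):
    # Map each continuous variable to one of three bins: low / middle / high
    def bin3(v, lo, hi):
        if v < lo:
            return 0
        if v > hi:
            return 2
        return 1
    return (bin3(x, -1.0, 1.0), bin3(x_dot, -0.5, 0.5),
            bin3(theta, -0.1, 0.1), bin3(theta_dot, -0.5, 0.5))

# A slightly tilted pole swinging left lands in this box:
print(to_box(0.0, 0.0, 0.05, -0.8))  # -> (1, 1, 1, 0)
```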

Genetic Algorithms and Evolutionary Programming

A genetic algorithm starts with a set of one or more individuals that are successful, as measured by a fitness function.
Several choices for the individuals exist, such as:
entire agent functions, where the fitness function is a performance measure or reward function; here the analogy to natural selection is greatest.

A genetic algorithm simply searches directly in the space of individuals, with the goal of finding one that maximizes the fitness function.
The search is parallel, because each individual in the population can be seen as a separate search.

Genetic Algorithms and Evolutionary Programming

Individuals can also be component functions of an agent, where the fitness function is the critic; or they can be anything at all that can be framed as an optimization problem.
Evolutionary process: learn an agent function based on occasional rewards as supplied by the selection function; it can be seen as a form of reinforcement learning.

Before we can apply a genetic algorithm to a problem, we need to answer four questions:
1. What is the fitness function?
2. How is an individual represented?
3. How are individuals selected?
4. How do individuals reproduce?

Genetic Algorithms and Evolutionary Programming

What is the fitness function?
It depends on the problem, but it is a function that takes an individual as input and returns a real number as output.

How is an individual represented?
In the classic genetic algorithm, an individual is represented as a string over a finite alphabet.
Each element of the string is called a gene.
In genetic algorithms, we usually use the binary alphabet (1, 0) to represent DNA.

Genetic Algorithms and Evolutionary Programming

How are individuals selected?
The selection strategy is usually randomized, with the probability of selection proportional to fitness.
For example, if an individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y.
Selection is done with replacement.

How do individuals reproduce?
By cross-over and mutation.
All the individuals that have been selected for reproduction are randomly paired.
For each pair, a cross-over point is randomly chosen; the cross-over point is a number in the range 1 to N.
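The cross-over and mutation steps can be sketched as follows (the gene length, cross-over point, and mutation rate are arbitrary illustrative choices):

```python
import random

def crossover(parent1, parent2, point):
    # Swap tails at the cross-over point, producing two offspring
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(genes, rate=0.01):
    # Flip each binary gene independently with probability `rate`
    return [1 - g if random.random() < rate else g for g in genes]

p1, p2 = [1] * 20, [0] * 20
c1, c2 = crossover(p1, p2, point=10)
print(c1)  # genes 1-10 from the first parent, the rest from the second
```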

Genetic Algorithms and Evolutionary Programming

How do individuals reproduce?
For a cross-over point of 10, one offspring will get genes 1 through 10 from the first parent, and the rest from the second parent.
The second offspring will get genes 1 through 10 from the second parent, and the rest from the first.
However, each gene can be altered by random mutation to a different value.

Conclusion

Passive Learning in a Known Environment
Passive Learning in an Unknown Environment
Active Learning in an Unknown Environment
Exploration
Learning an Action-Value Function
Generalization in Reinforcement Learning
Genetic Algorithms and Evolutionary Programming

Resources and Glossary

Information source: Russell, S. and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

- Solution of Differential Games
- Document 2
- code-en
- FCAT1000-N
- Annam E-Brochure LR (1)
- Field_Fail_Pro.pdf
- Introduction_to_System_Protection-_Protection_Basics.pdf
- LL_SGDA
- Impulse Generator Haefly
- Transformer in and Out
- 213079142 ABB Power Transformer Testing Manual
- 213079142 ABB Power Transformer Testing Manual
- Toshiba Generator
- High Voltage Circuit Breaker Displacement Curves - Zensol
- HV Capacitor Bank Switching
- Dakshin Nirikshan IInd Issue (OCT-DeC 2013)
- PDF Laying of Cables Brochure
- MULTIAGENT SYSTEMS Algorithmic, Game-Theoretic, and Logical Foundations
- Fundementals of Multiagent Systems

- Equation Journal - 2011 - Accelerated Version
- Reinforcement Learning
- Math Study Guide - Transitive Property
- Complex Numbers
- hexagonal more.pdf
- Exponents Discovery Activity
- SIM-PRIM
- Function
- Lec16p10
- Oldroyd Model
- Coordinate Geometry - SAT 2 Mathematics Level 1 -Tutorial and Worksheet
- Handout 1
- Maths Analysis
- Basic Maths 23 May
- MA3220 AY08-09 sem1
- PC
- Part-1 Summary F12
- Test 1, EE 2401 ... Answers
- Homework 1 - QMH
- EN(327)
- imo-2007-q3
- Velocity & Accleration
- 1
- Statistics Operations and Notations
- 11 Difference Equations
- Topic Sum of Products
- Simplex method
- ergodic
- Basic Probability Theory_20121114025247943
- RP1 Introduction
