
FFM-223 SPRING 2023

Optimization

Lecture Notes
Nandana N
Class Notes for Optimization, Plaksha University
Starting from January 2023
Contents

1 Introduction to Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1 Optimization 9

2 Kelly’s Criterion and Gambler’s Ruin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


2.1 Kelly’s Criterion 11
2.2 Derivation 11
2.3 Finding Plot for k = 1 12
2.4 Kelly Versus Gambler Simulation 13

3 Expected Utility and Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


3.1 Expected Utility 15
3.2 Zero Risk Bias 16

4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Conditioning 17
4.2 Methods 18
4.2.1 Explore Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.2 Exploit Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.3 ε Greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5 Let’s Banditize Arms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21


5.1 Terms to know 21
5.2 Action-Value Method 21
5.3 Multi-Armed Bandit example 22
5.4 Simulation 23
5.4.1 Pseudocode of the simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4.2 Reinforcement learning code - Discrete reward example with ε = 0 . . . . . . . . . 24
5.4.3 Discrete reward example with ε = 0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4.4 Continuous probability reward example with ε = 0 . . . . . . . . . . . . . . . . . . . . . . . 30
5.4.5 Continuous probability reward example with ε = 0.1 . . . . . . . . . . . . . . . . . . . . . 34
5.4.6 Average performance of each armed bandit for each trial . . . . . . . . . . . . . . . 35

6 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.0.2 Upper Confidence Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.0.3 Soft-max Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.0.4 Importance of Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7 SARSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.0.2 Algorithm working . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.0.3 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

8 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.0.1 Bellman’s optimality condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.0.2 Quality of a state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.0.3 An iterative example of Q-table updation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

9 Deep q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

10 Introduction to calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


10.1 Finding the shortest path optimally 53
10.2 Brachistochrone problem 53
10.3 Derivation of Euler-Lagrange Equation 54

11 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.1 History- Interesting 57
11.2 Production planning 58
11.3 The brachistochrone problem 58
11.4 Investment strategy 59
11.5 investment planning 60
12 Fermat’s Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
12.1 Euler-Lagrange Equation 61
12.2 distance minimisation 61
12.3 Light 62
12.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
12.3.2 principle of least time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
12.3.3 derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

13 Variational Constraint Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


13.1 Introduction 67
13.2 Derivation 67

14 Probability Distribution using Max. Entropy . . . . . . . . . . . . . . . . . . . . . . 73


14.1 Lagrange Multipliers 73
14.2 Principle of Maximum Entropy 73
14.2.1 Shannon Entropy Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
14.3 Derivation 1 - general 74
14.4 Derivation 2 75
14.5 Derivation 3 - mean, variance 76
14.6 Derivation 4 - mean 77
14.7 Applications 78
14.7.1 Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

15 Max Entropy and Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 79


15.1 Maximum Entropy Principle 80
15.2 Applying Entropy to Classical Reinforcement Learning 80
15.3 Applications 81
15.3.1 An alternative method to look at the changes in the geometric structure of the
price series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
15.3.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
15.3.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
15.3.4 Gaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
15.3.5 Recommendation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
15.3.6 Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
15.3.7 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
15.3.8 Autonomous Driving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

16 Geodesics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
16.0.1 Geodesic on a sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
16.0.2 Intuitive Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
16.1 Applications 86
16.1.1 Gravitational Lensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
16.1.2 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
16.1.3 Molecular Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
16.1.4 Supply Chain Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
16.1.5 Network Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

17 Derivation of n parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
17.1 introduction to multi-variable 87
17.1.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
17.1.2 Solving a Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

18 Euler-Lagrange’s Take on Higher Variables . . . . . . . . . . . . . . . . . . . . . . 91


18.1 Calculus v/s Calculus of Variation 91
18.2 Higher Dimensions 91
18.3 Euler-Lagrange in Higher Variables 92
18.4 Proof 92
18.5 Area Under the Curve 93
18.6 Applications 94

19 Multi-Independent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
19.1 What are Multi-Independent Variables? 97
19.2 Derivation of the E-L Equation 97
19.3 Principle of Least Action 98
19.4 Derivation of the Wave Equation 99

20 The Hamiltonian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


20.1 Comparing The Hamiltonian, Lagrange & Schrödinger Equations 102
20.1.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
20.2 Applications 103

21 Hamilton-Jacobi Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


21.1 Recap 105
21.2 Introduction 105
21.3 Derivation 106
21.4 Harmonic Oscillator 107

21.5 Applications 107


21.5.1 Predicting Traffic Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
21.5.2 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
21.5.3 Options Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
21.5.4 Mimicking HJE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
1. Introduction to Optimization

1.1 Optimization
Optimization is the process of finding the best solution among a set of possible solutions, by adjust-
ing the values of one or more variables. It is a method used in many fields, including engineering,
mathematics, computer science, and economics, to maximize or minimize a given objective function.
The goal of optimization is to find the values of the variables that result in the best possible outcome,
based on a set of constraints and assumptions.

The following questions can be answered using principles of optimization.


Support Vector Machine: The goal is to find the best line that separates different classes of data.
The optimization problem is to find the values of the weights and the bias that maximizes the margin
between the classes.
Portfolio optimization in finance: The goal is to find the best combination of investments that
maximizes returns while minimizing risk. The optimization problem is to find the values of the
weights that minimize the risk of the portfolio while maximizing the expected returns.
Traveling salesman problem: The goal is to find the shortest route that visits a set of cities and
returns to the starting city. The optimization problem is to find the permutation of cities that mini-
mizes the total distance traveled.
Airfoil optimization: The goal is to find the best airfoil shape that maximizes lift and minimizes
drag. The optimization problem is to find the shape of the airfoil that maximizes the lift-to-drag
ratio.
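Each of these problems amounts to adjusting variable values to minimize or maximize an objective, possibly under constraints. A minimal sketch using scipy.optimize; the quadratic objective here is a toy assumption for illustration, not one of the problems above:

from scipy.optimize import minimize

# toy objective: find (x0, x1) minimizing a simple quadratic "cost"
def objective(x):
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

result = minimize(objective, x0=[0.0, 0.0])   # start the search from (0, 0)
print(result.x)                               # approximately [1, -2], the best values of the variables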
2. Kelly’s Criterion and Gambler’s Ruin

2.1 Kelly’s Criterion


Kelly’s criterion determines the optimal fraction of a bankroll to bet in a series of bets in order to
maximize the long-term growth rate of the bankroll.

Original Paper : Link


Reference : Link

In the above paper, Kelly aims to find the optimal fraction f of one's wealth that one should
gamble, with probability p of winning the bet and probability q = (1 − p) of losing it. Kelly's
criterion can be viewed as an optimization problem, where the goal is to maximize the long-term
growth rate of the bankroll by determining the optimal fraction of the bankroll to bet in each round.
But, this is also under the assumption of a known probability of winning and a known payout. Also,
the assumption like the infinite horizon and no transaction cost, may not be always true in real-world
scenarios.

The optimal fraction is f = (pk − q)/k, where f is the optimal fraction of the bankroll to bet, k is
the net odds received on the bet, p is the probability of winning, and q is the probability of losing.
This criterion has been widely used in the field of gambling and investment management.

2.2 Derivation
Let X0 be the initial capital before gambling. A person (Tim) bets a fraction f of his capital in each round, with
a winning probability of p and a losing probability of q = (1 − p). Let k be the winnings per unit wager, i.e., the
amount he wins per unit bet; if he loses, the loss per unit wager is a.
If Tim wins the bet, the total amount becomes X1 = X0 + k f X0 = X0 (1 + k f ).
If he then loses the next bet, X2 = X1 − a f X1 = X0 (1 + k f )(1 − a f ), where X2 is the new total amount.
Typically a = 1.

Therefore, with a = 1, X2 = X0 (1 + k f )(1 − f ).
Continue for n iterations with l loses and w wins (n = w + l). The total amount at each iteration will
be of the form,

Xn = X0 (1 + k f )w (1 − f )l
Now we aim to increase the ratio of the final amount to the initial amount, i.e., Xn/X0.

Xn/X0 = e^{ n · log[(Xn/X0)^{1/n}] }
The log term measures the exponential rate of increase per trial.
Kelly now chooses to maximize the expected value of the growth coefficient G( f ), where

G( f ) = E[ log (Xn/X0)^{1/n} ]
       = E[ log ((1 + k f )^w (1 − f )^l)^{1/n} ]
       = E[ (w/n) log(1 + k f ) + (l/n) log(1 − f ) ]
       = p log(1 + k f ) + q log(1 − f )

p = w/n : true probability of winning a bet

q = l/n : true probability of losing a bet
Now, let’s talk about maximizing G( f ). We are seeking a value of f that maximizes the growth rate
of return per bet.
dG/df = pk/(1 + k f ) − q/(1 − f ) = 0

f∗ = (pk − q)/k
This is known as Kelly's fraction. To check whether G( f∗ ) is a maximum or a minimum, we find the second
derivative (take k = 1):

G″( f ) = [ −f² + 2 f (p − q) − 1 ] / (1 − f²)²

Here, f ∈ [0, 1) and p − q ∈ [0, 1),
∴ G″( f ) < 0 ∀ f ∈ [0, 1).

2.3 Finding Plot for k = 1


Note: k = 1 means that winning a bet of Rs. 10 will give you Rs. 20 from the house: your winnings
plus the betting amount.

If k were 2, then you would win 2 times your bet over the betting amount, i.e., winning a bet of Rs. 10
would give you a total of Rs. 30 from the house.

G(0) = 0,
G′(0) = p/(1 + 0) − q/(1 − 0) = p − q > 0,
G″( f ) < 0 ∀ f ∈ [0, 1),
which means the graph will be concave.

Conclusions:
For f → 1, G( f ) → −∞.
After fc, G( f ) < 0, i.e., one will lose money in the long run.
The further f is beyond fc, the faster you will lose most of your money.
The closer f is to f∗, the faster you will gain in the long run.
G( f ) = 0 for 2 values of f : f0 and fc.
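These conclusions can be visualised directly. A minimal plotting sketch, assuming p = 0.6, q = 0.4 and k = 1 (the numbers are illustrative, not from the lecture):

import numpy as np
import matplotlib.pyplot as plt

p, q = 0.6, 0.4
f = np.linspace(0, 0.99, 500)
G = p * np.log(1 + f) + q * np.log(1 - f)   # growth rate for k = 1

plt.plot(f, G)
plt.axhline(0, color="gray", linewidth=0.5)  # G(f) = 0 line: crossings at f0 = 0 and fc
plt.xlabel("fraction f")
plt.ylabel("G(f)")
plt.title("Growth rate G(f) for k = 1")
plt.show()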

Homework:
To derive: fc = (1 − e^{q/p}) / (1 + e^{q/p})

2.4 Kelly Versus Gambler Simulation


Simulation : Link

The above simulation depicts multiple simulations (days) of the same person across a number
of bets.

With a general loss fraction a, the optimal fraction is f = (pk − qa)/(ak).
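A minimal Monte Carlo sketch in the spirit of the linked simulation; the values p = 0.6, k = a = 1, the initial capital and the number of bets are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
p, k, a = 0.6, 1.0, 1.0
f_kelly = (p * k - (1 - p) * a) / (a * k)           # optimal fraction from the formula above

def simulate(f, n_bets=1000, capital=100.0):
    for _ in range(n_bets):
        if rng.random() < p:
            capital *= (1 + k * f)                   # win: gain k*f of current capital
        else:
            capital *= (1 - a * f)                   # lose: give up a*f of current capital
    return capital

print("Kelly fraction:", f_kelly)
print("final capital betting f = f*: ", simulate(f_kelly))
print("final capital betting f = 0.9:", simulate(0.9))   # over-betting typically ruins the bankroll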

Applications:
Long-Term Growth Rate: Finding the optimal dividing fraction for our cells if some cells are infected
by bacteria or any virus.

Comparing the geometric and arithmetic mean

If I maximise the expected square root of wealth and you maximise the expected log of wealth
then after 10 years you will be richer 90% of the time. But so what, because I will be much richer
the remaining 10% of the time. After 20 years, you will be richer 99% of the time, but I will be
fantastically richer the remaining 1% of the time.
3. Expected Utility and Risk

3.1 Expected Utility

Consider a lottery (a game of chance) wherein several outcomes are possible with defined proba-
bilities. Typically, outcomes in a lottery consist of monetary prizes. Returning to our dice let’s say
that when a six-faced die is rolled, the payoffs associated with the outcomes are Rs.1 if a 1 turns up,
Rs.2 for a 2, . . . , and Rs.6 for a 6. Now if this game is played once, one and only one amount can be
won—Rs.1, Rs.2, and so on. However, if the same game is played many times, what is the amount
that one can expect to win?
Mathematically, the answer to any such question is very straightforward and is given by the expected
value of the game.
In a game of chance, if W1, W2, ..., WN are the N possible outcomes with probabilities π1, π2, ..., πN, then the
expected value is E = ∑_{i=1}^{N} πi Wi. The computation can be extended to expected values of any uncertain
situation, say losses, provided we know the outcome numbers and their associated probabilities. The probabilities sum to 1.

The St. Petersburg paradox lies in a proposed game wherein a coin is tossed until a head comes up; that is
when the game ends. The payoff from the game is the following: if the head appears on the first toss,
then Rs. 2 is paid to the player, if it appears on the second toss then Rs. 4 is paid, if it appears on the
third toss, then Rs. 8, and so on, so that if the head appears on the nth toss then the payout is Rs. 2^n. The
question is how much an individual would pay to play this game.

Let us try and apply the fair value principle to this game, so that the cost an individual is willing to
bear should equal the fair value of the game. The expected value of the game E(G) is calculated below.

E(G) = (1/2)·2 + (1/4)·4 + .... = ∞
It is evident that while the expected value of the game is infinite, not even the Bill Gates and Warren
Buffets of the world will give even a thousand dollars to play this game, let alone billions.
Bernoulli used U(W ) = ln(W ) to represent the utility that this lottery provides to an individual where
W is the payoff associated with each event H, TH, TTH, and so on, then the expected utility from the

game is given by,



E(U) = ∑_{i=1}^{∞} πi U(Wi)
     = (1/2) ln(2) + (1/4) ln(4) + ....
     = ∑_{i=1}^{∞} (1/2^i) ln(2^i)

which can be shown to equal 1.39 after some algebraic manipulation. Since the expected utility that
this lottery provides is finite (even if the expected wealth is infinite), individuals will be willing to
pay only a finite cost for playing this lottery.
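The value 1.39 can be checked numerically. A minimal sketch of the partial sum:

import math

# partial sum of sum_{i>=1} (1/2**i) * ln(2**i), which converges to 2*ln(2)
total = sum((1 / 2**i) * math.log(2**i) for i in range(1, 60))
print(total, 2 * math.log(2))   # both approximately 1.386, the "1.39" quoted above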

3.2 Zero Risk Bias


Zero risk bias relates to our preference for absolute certainty. We tend to opt for situations where we
can completely eliminate risk, seeking solace in the figure of 0%, over alternatives that may actually
offer greater risk reduction.

Why it happens: Choices with zero risk offer certainty, which the brain seeks to maximize in
order to reduce cognitive strain. Additionally, losses loom larger than gains, and the prospect of a
loss drives us to eschew that possibility even if it leads to a suboptimal choice.

Example 1 - Allais Paradox: Maurice Allais presented participants with two choices: the first
option was a guaranteed jackpot while the second, though it had a greater expected value, carried a 1
percent chance of not winning anything at all. When presented with this choice set, people tend to opt for
the first option with zero risk, despite it having a lower expected value.

Example 2 - The money back guarantee: This all too familiar marketing hook plays to con-
sumer worries over the risk of not being satisfied with a product. By eliminating such a risk,
consumers are more likely to buy.

How to avoid it?: It can help to review your goal in making a decision. Is your goal overall
risk reduction? Or does completely eliminating risk carry more value? Monitoring your emotion to
see if it’s leading you astray from your decision-making goals can sometimes help mitigate the zero
risk bias.
4. Reinforcement Learning

Reinforcement learning is maximizing expected utility using reinforcers, by trial and error in learning. It's an
area of machine learning inspired by operant conditioning.
The expected utility is E(U) = ∑_{i=1}^{n} P(Xi) U(Xi), where
n: number of events
P: the probability of an event
U: utility associated with the event
Here, we knew the utilities associated with the events. Now, we will look into how we can calculate
the utilities without prior knowledge.

4.1 Conditioning
The reaction (response) to an object or event (stimulus) by a person can be modified by 'learning' or
conditioning. This is of two types: classical and operant conditioning.

Classical conditioning is creating a relationship between a stimulus and a desired response.

Operant conditioning works by giving rewards for a positive response and punishments for a negative
response. The punishment can be taking back the reward given.
MAIN GOAL: Choose actions to maximize rewards. This is an optimization problem.

Elements of Reinforcement Learning: Beyond the agent and the environment, one can identify four main
subelements of a reinforcement learning system: a policy, a reward signal, a value function, and,
optionally, a model of the environment. The elements of reinforcement learning are therefore the agent,
the environment, the action and the reinforcer.

Multi-Armed Bandit Problem: A problem of this kind is Timmy having 3 options and 300 days to explore
and exploit the available options (La Pinoz, Taco Bell, MCD). The problem is to maximize happiness.
As said, the only two approaches are exploitation and exploration.
Exploitation is choosing the maximum utility/reward option every time, whereas the exploration
approach is to explore all the options with equal probability.

4.2 Methods
4.2.1 Explore Only
The explore-only option is going to all restaurants with equal probability. Since there are 3 restaurants
here, Timmy will go to each with probability 1/3: La Pinoz for 100 days, Taco Bell for 100 days, and
likewise for MCD.

Optimal: 3000
Explore Only method: 2300

4.2.2 Exploit Only


This method checks out each of the 3 restaurants once and goes to the restaurant with the maximum
utility for the remaining 297 days. In this case, it is so obvious that Taco Bell will win. But this
is when we don’t consider the noise. When noise matters, the Exploit only method will give a
maximum happiness of 2396.

4.2.3 ε Greedy
This is a combination of exploration and exploitation.
For ε = 0.1 (10%), we explore 10% of the time and exploit the rest. This will give a happiness value of
2930.
More exploration leads to more accurate exploitation, but at an opportunity cost: every turn we use in
exploring, we take away from exploiting. Epsilon helps find a balance.
Multi-Armed Bandit, Mathematically:

vi = v(ai) = E[Uk | Ak = ai]

The expected or mean reward of the action we take is referred to as the value of that action: the
expectation of the utility (reward) given an action.
If exploration reveals a more profitable option, we switch options. ε helps ensure that we continue to
explore.
What was the reinforcer for the ε-greedy method? The happiness/rewards offered by each restaurant.
This reward helped reinforce our decision on the best choice of restaurant and arrive at a more optimal
combination. Hence, reinforcement learning is a method that uses TRIAL and ERROR (striking the
right balance between exploration and exploitation) to MAXIMIZE EXPECTED UTILITY and
make the OPTIMAL DECISION.
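A minimal sketch of the ε-greedy method for Timmy's problem; the per-day happiness values (10, 8 and 5, plus noise) are assumptions for illustration, not the numbers used above:

import numpy as np

rng = np.random.default_rng(0)
true_happiness = np.array([10.0, 8.0, 5.0])   # assumed mean happiness per restaurant per day
eps, days = 0.1, 300

Q = np.zeros(3)                 # estimated happiness per restaurant
N = np.zeros(3)                 # visit counts
total = 0.0

for _ in range(days):
    if rng.random() < eps:
        a = rng.integers(3)                     # explore
    else:
        a = int(np.argmax(Q))                   # exploit
    reward = true_happiness[a] + rng.normal(0, 1)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]              # incremental sample-average update
    total += reward

print("total happiness:", round(total, 1), "out of an optimal of about", 10 * days)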
5. Let’s Banditize Arms

Simulations - Multi Arm Bandit Problem.

5.1 Terms to know


Policy: It’s a mapping from the states of the environment to the actions to be taken when in those
states.
Agent: The learner who is a decision-maker.
Environment: The things Agent interacts with, comprising everything outside the agent.
Reward: The number sent by the environment after every action that the agent takes.
Value: The total amount of reward an agent can expect to accumulate over the future, starting from
that state.
Model of the Environment: A model that mimics the behaviour of the environment.

5.2 Action-Value Method


The Action-Value Method is a popular solution to the Multi-Armed Bandit Problem. It is a model-
based approach that involves maintaining an estimate of the expected reward for each machine,
called the "action-value." The action-value is updated as the gambler tries out each machine and
gathers more information about its payout probability.

The key idea behind the Action-Value Method is to balance exploration and exploitation by choosing
the machine that has the highest estimated action-value with some probability (i.e., exploiting the
best option), and choosing a machine at random with some other probability (i.e., exploring other
options). This way, the gambler can continuously improve their estimate of the action-values as they
gather more information and make more choices, leading to a better overall reward.

At : Action selected at time step t



Rt : Corresponding reward of the selected action

Value of an action is the expected or mean reward of that action given that the action is selected.
True value of an action is represented by q∗ (a).

q∗ (a) = E[Rt |At = a]

Estimated value Qt(a) of an action a at time step t is given by

Qt(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
      = ∑_{i=1}^{t−1} Ri · 1_{Ai=a} / ∑_{i=1}^{t−1} 1_{Ai=a}

Action selection decision: At = argmax_a Qt(a)


Incremental implementation
→ Ri : reward received after the i-th selection of the action.
→ Qn : estimate of the action value after it has been selected n − 1 times.
The estimate is then updated incrementally as Qn+1 = Qn + (1/n)[Rn − Qn], i.e.,
NewEstimate ← OldEstimate + StepSize · [Target − OldEstimate].

5.3 Multi-Armed Bandit example


Given below link has an example of the Multi-Armed Bandit problem. Refer to Slides 12 to 21.
Link

5.4 Simulation
In the real world, we don't know the value of each action that we are going to take.
This is also the reason why we can't use Kelly's criterion in real life: we don't know the winning or
losing probability beforehand.

This is where reinforcement learning comes to the rescue. We don't know the correct action
for every state beforehand. Here, we receive rewards based on a series of actions, and it is only after
running the iteration or cycle multiple times that we are able to get some optimal results.

The idea of the k-armed bandit problem is to maximize reward. But to achieve this goal we
need to take the best actions. And to find the best actions, we need to explore or exploit.

The need to balance exploration and exploitation is a distinctive challenge that arises in
reinforcement learning.

5.4.1 Pseudocode of the simulation


fix the value of epsilon
initialise the allMachineQValueMatrix matrix to 0

for run in runs:

    create an empty array (counterArray) to keep track of the number of times each action was taken

    for trial in trials:

        Step 1: generate a random number between 0 and 1 using a random number generator

        Step 2: use that value for action selection
            if the random number is less than or equal to 1 - epsilon:
                (exploit) select the action which has the max expected value
            else:
                (explore) choose any action with equal probability

            store the action that was taken
            increment the counter array for that particular action

        Step 3: reward allocation
            store the reward for the particular action that was taken

        Step 4: action value updation
            update the allMachineQValueMatrix values for the next trial using the incremental rule
                NewEstimate <- OldEstimate + stepSize[Target - OldEstimate]

Step 5: plot the allMachineQValueMatrix by taking the mean column-wise for each bandit machine

5.4.2 Reinforcement learning code - Discrete reward example with ε = 0

"Discrete reward example"

import numpy as np
import matplotlib.pyplot as plt
import random


def epsilonGreedyMethod(e, numberOfBanditArms, runs, trials):

    allMachineQValueMatrix = np.zeros((numberOfBanditArms, runs, trials))

    for run in range(runs):

        # starting a fresh game
        actionSelectedCounter = np.zeros(numberOfBanditArms)

        for trial in range(trials):

            # np.random.uniform(a, b) generates a random float between a (inclusive) and b (exclusive)
            rand_num = np.random.uniform(0, 1)

            '''
            Step 1: Action selection:
            e = epsilon, e.g. 0.1
            probability of exploration  = e      (e.g. 0.1)
            probability of exploitation = 1 - e  (e.g. 0.9)

            R is the reward
            A is the action
            '''
            if rand_num <= 1 - e:
                # exploit

                # storing the expected Q value of each arm for the current trial
                QvalueOfBanditArms = np.zeros(numberOfBanditArms)
                for arm in range(numberOfBanditArms):
                    QvalueOfBanditArms[arm] = allMachineQValueMatrix[arm][run][trial]

                # finding the max expected value of an action (Q value)
                maxExpectedValue = max(QvalueOfBanditArms)

                # finding which actions give this maxExpectedValue
                shortListedActions = [action for action, estimateValue in enumerate(QvalueOfBanditArms)
                                      if estimateValue == maxExpectedValue]

                # breaking the tie randomly if there is more than one selected action
                if len(shortListedActions) > 1:
                    # choosing an action randomly from the shortListedActions
                    finalAction = random.choice(shortListedActions)
                else:
                    finalAction = shortListedActions[0]

                # incrementing the counter array for that particular action
                actionSelectedCounter[finalAction] += 1

                A = finalAction

            else:
                # explore

                # selecting randomly from among all the actions with equal probability
                randActionIndex = np.random.randint(0, numberOfBanditArms)

                # incrementing the counter array for that particular action
                actionSelectedCounter[randActionIndex] += 1

                A = randActionIndex

            '''
            Step 2: Reward Allocation
            '''
            rand_num = np.random.uniform(0, 1)
            if A == 0:
                R = 2000
            else:
                if rand_num <= 0.6:
                    R = 5000
                else:
                    R = 0

            '''
            Step 3: Action Value Updation:
            NewEstimate <- OldEstimate + stepSize[Target - OldEstimate]

            Matrix updation for performance measurement
            A is the actual arm lever that was pulled down
            '''
            # if condition to prevent index out of bound error
            if trial + 1 != trials:
                for arm in range(numberOfBanditArms):
                    if arm == A:
                        # updating the expected Q value of the arm that we pulled down
                        update = (R - allMachineQValueMatrix[arm][run][trial]) / actionSelectedCounter[arm]
                        allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial] + update
                    else:
                        # carrying the Q value of the arms that were not pulled forward to the next trial
                        if trial != 0:
                            allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial]

    return allMachineQValueMatrix


runs = 2000
trials = 1000

# 1 run = 1000 trials
# we are running this for 2000 * 1000 trials!

numberOfBanditArms = 2
allMachineQValueMatrix = epsilonGreedyMethod(e=0, numberOfBanditArms=numberOfBanditArms,
                                             runs=runs, trials=trials)

column_wise_mean = []
for banditArmIndex in range(numberOfBanditArms):
    column_wise_mean.append(np.mean(allMachineQValueMatrix[banditArmIndex], axis=0))

timeStepArray = np.arange(1, trials + 1)

for i in range(numberOfBanditArms):
    plt.plot(timeStepArray, column_wise_mean[i], label="Bandit Arm: " + str(i))

plt.legend(loc="best")
plt.title("Average performance of each armed bandit for each trial")
plt.xlabel("Trial")
plt.ylabel("Average reward")

5.4.3 Discrete reward example with ε = 0.1


allMachineQValueMatrix = epsilonGreedyMethod(e=0.1, numberOfBanditArms=numberOfBanditArms,
                                             runs=runs, trials=trials)

column_wise_mean = []
for banditArmIndex in range(numberOfBanditArms):
    column_wise_mean.append(np.mean(allMachineQValueMatrix[banditArmIndex], axis=0))

timeStepArray = np.arange(1, trials + 1)

for i in range(numberOfBanditArms):
    plt.plot(timeStepArray, column_wise_mean[i], label="Bandit Arm: " + str(i))

plt.legend(loc="best")
plt.title("Average performance of each armed bandit for each trial")
plt.xlabel("Trial")
plt.ylabel("Average reward")

Visualising the Continuous Reward Distribution


import seaborn as sns

# qValuesOfBanditArms (the true action values of the arms) is assumed to be defined already,
# e.g. qValuesOfBanditArms = np.random.normal(0, 1, 10)
dataPoint = []
arms = len(qValuesOfBanditArms)
for x in range(arms):
    dataPoint.append(np.random.normal(qValuesOfBanditArms[x], 1, 10000))

plt.title("K-armed Bandit Problem with k = " + str(arms))
plt.grid(True)
sns.violinplot(data=dataPoint, split=True, orient='v', width=0.8, scale='count')
plt.xlabel("Actions")
plt.ylabel("Reward Distribution")
plt.show()

5.4.4 Continuous probability reward example with ε = 0


import numpy as np
import matplotlib.pyplot as plt
import random


def epsilonGreedyMethod(e, numberOfBanditArms, qValuesOfBanditArms, runs, trials):

    allMachineQValueMatrix = np.zeros((numberOfBanditArms, runs, trials))

    for run in range(runs):

        # starting a fresh game
        actionSelectedCounter = np.zeros(numberOfBanditArms)

        for trial in range(trials):

            rand_num = np.random.uniform(0, 1)

            '''
            Step 1: Action selection:
            e = epsilon
            probability of exploration = e
            probability of exploitation = 1 - e

            R is the reward
            A is the action
            '''
            if rand_num <= 1 - e:
                # exploit

                # storing the expected Q value of each arm for the current trial
                QvalueOfBanditArms = np.zeros(numberOfBanditArms)
                for arm in range(numberOfBanditArms):
                    QvalueOfBanditArms[arm] = allMachineQValueMatrix[arm][run][trial]

                # finding the max expected value of an action (Q value)
                maxExpectedValue = max(QvalueOfBanditArms)

                # finding which actions give this maxExpectedValue
                shortListedActions = [action for action, estimateValue in enumerate(QvalueOfBanditArms)
                                      if estimateValue == maxExpectedValue]

                # breaking the tie randomly if there is more than one selected action
                if len(shortListedActions) > 1:
                    finalAction = random.choice(shortListedActions)
                else:
                    finalAction = shortListedActions[0]

                actionSelectedCounter[finalAction] += 1
                A = finalAction

            else:
                # explore

                # selecting randomly from among all the actions with equal probability
                randActionIndex = np.random.randint(0, numberOfBanditArms)
                actionSelectedCounter[randActionIndex] += 1
                A = randActionIndex

            '''
            Step 2: Reward Allocation
            When the learning method applied to the k-armed bandit problem selects action A at time step t,
            the actual reward R is drawn from a normal distribution with mean q(A) and variance 1.
            '''
            R = np.random.normal(qValuesOfBanditArms[A], 1)

            '''
            Step 3: Action Value Updation:
            NewEstimate <- OldEstimate + stepSize[Target - OldEstimate]

            Matrix updation for performance measurement
            A is the actual arm lever that was pulled down
            '''
            # if condition to prevent index out of bound error
            if trial + 1 != trials:
                for arm in range(numberOfBanditArms):
                    if arm == A:
                        # updating the expected Q value of the arm that we pulled down
                        update = (R - allMachineQValueMatrix[arm][run][trial]) / actionSelectedCounter[arm]
                        allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial] + update
                    else:
                        # carrying the Q value of the arms that were not pulled forward to the next trial
                        if trial != 0:
                            allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial]

    return allMachineQValueMatrix


allMachineQValueMatrix = epsilonGreedyMethod(e=0, numberOfBanditArms=numberOfBanditArms,
                                             qValuesOfBanditArms=qValuesOfBanditArms,
                                             runs=runs, trials=trials)

column_wise_mean = []
for banditArmIndex in range(numberOfBanditArms):
    column_wise_mean.append(np.mean(allMachineQValueMatrix[banditArmIndex], axis=0))

timeStepArray = np.arange(1, trials + 1)

for i in range(numberOfBanditArms):
    plt.plot(timeStepArray, column_wise_mean[i], label="Bandit Arm: " + str(i))

plt.legend(loc="best")
plt.title("Average performance of each armed bandit for each trial")
plt.xlabel("Trial")
plt.ylabel("Average reward")
plt.grid(True)
plt.show()

5.4.5 Continuous probability reward example with ε = 0.1


allMachineQValueMatrix = epsilonGreedyMethod(e=0.1, numberOfBanditArms=numberOfBanditArms,
                                             qValuesOfBanditArms=qValuesOfBanditArms,
                                             runs=runs, trials=trials)

column_wise_mean = []
for banditArmIndex in range(numberOfBanditArms):
    column_wise_mean.append(np.mean(allMachineQValueMatrix[banditArmIndex], axis=0))

timeStepArray = np.arange(1, trials + 1)

for i in range(numberOfBanditArms):
    plt.plot(timeStepArray, column_wise_mean[i], label="Bandit Arm: " + str(i))

plt.legend(loc="best")
plt.title("Average performance of each armed bandit for each trial")
plt.xlabel("Trial")
plt.ylabel("Average reward")
plt.grid(True)
plt.show()

5.4.6 Average performance of each armed bandit for each trial

import numpy as np
import matplotlib.pyplot as plt
import random


def epsilonGreedyMethodMem(e, numberOfBanditArms, qValuesOfBanditArms, runs, trials):

    allMachineQValueMatrix = np.zeros((numberOfBanditArms, runs, trials))

    for run in range(runs):

        # starting a fresh game
        actionSelectedCounter = np.zeros(numberOfBanditArms)

        for trial in range(trials):

            rand_num = np.random.uniform(0, 1)

            '''
            Step 1: Action selection:
            e = epsilon
            probability of exploration = e
            probability of exploitation = 1 - e

            R is the reward
            A is the action
            '''
            if rand_num <= 1 - e:
                # exploit

                # storing the expected Q value of each arm for the current trial
                QvalueOfBanditArms = np.zeros(numberOfBanditArms)
                for arm in range(numberOfBanditArms):
                    QvalueOfBanditArms[arm] = allMachineQValueMatrix[arm][run][trial]

                # finding the max expected value of an action (Q value)
                maxExpectedValue = max(QvalueOfBanditArms)

                # finding which actions give this maxExpectedValue
                shortListedActions = [action for action, estimateValue in enumerate(QvalueOfBanditArms)
                                      if estimateValue == maxExpectedValue]

                # breaking the tie randomly if there is more than one selected action
                if len(shortListedActions) > 1:
                    finalAction = random.choice(shortListedActions)
                else:
                    finalAction = shortListedActions[0]

                actionSelectedCounter[finalAction] += 1
                A = finalAction

            else:
                # explore

                # selecting randomly from among all the actions with equal probability
                randActionIndex = np.random.randint(0, numberOfBanditArms)
                actionSelectedCounter[randActionIndex] += 1
                A = randActionIndex

            '''
            Step 2: Reward Allocation
            The actual reward R is drawn from a normal distribution with mean q(A) and variance 1.
            '''
            R = np.random.normal(qValuesOfBanditArms[A], 1)

            '''
            Step 3: Action Value Updation:
            NewEstimate <- OldEstimate + stepSize[Target - OldEstimate]

            Matrix updation for performance measurement
            A is the actual arm lever that was pulled down
            '''
            # if condition to prevent index out of bound error
            if trial + 1 != trials:
                for arm in range(numberOfBanditArms):
                    if arm == A:
                        # updating the expected Q value of the arm that we pulled down
                        update = (R - allMachineQValueMatrix[arm][run][trial]) / actionSelectedCounter[arm]
                        allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial] + update
                    else:
                        # carrying the Q value of the arms that were not pulled forward to the next trial
                        allMachineQValueMatrix[arm][run][trial + 1] = allMachineQValueMatrix[arm][run][trial]
            else:
                # carrying the Q values of the last trial of this run over to the first trial of the next run
                if run + 1 != runs:
                    for arm in range(numberOfBanditArms):
                        allMachineQValueMatrix[arm][run + 1][0] = allMachineQValueMatrix[arm][run][trial]

    return allMachineQValueMatrix


runs = 2000
trials = 500
allMachineQValueMatrix = epsilonGreedyMethodMem(e=0.9, numberOfBanditArms=numberOfBanditArms,
                                                qValuesOfBanditArms=qValuesOfBanditArms,
                                                runs=runs, trials=trials)

column_wise_mean = []
for banditArmIndex in range(numberOfBanditArms):
    column_wise_mean.append(np.mean(allMachineQValueMatrix[banditArmIndex], axis=0))

timeStepArray = np.arange(1, trials + 1)

for i in range(numberOfBanditArms):
    plt.plot(timeStepArray, column_wise_mean[i], label="Bandit Arm: " + str(i))

plt.legend(loc="best")
plt.title("Average performance of each armed bandit for each trial")
plt.xlabel("Trial")
plt.ylabel("Average reward")
plt.grid(True)
plt.show()
6. Reinforcement Learning

UCB balances exploration and exploitation. It aims to maximize the total reward received over time
by efficiently selecting actions with higher expected rewards.

6.0.1 Introduction
The statistical potential is:

At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]

where Qt(a) is the estimated value of action ‘a’ at time step ‘t’.
Nt(a) is the number of times that action ‘a’ has been selected, prior to time ‘t’.
‘c’ is a confidence value that controls the level of exploration.

The exploration term for an action decreases as that action is selected more often, which is what we wanted.



6.0.2 Upper Confidence Bound

UCB selects actions based on their potential to maximize the reward and the uncertainty around the
true value of each action. It estimates the mean reward of each action using the data collected so far
and then adds an uncertainty term to this estimate, which is proportional to the variance of the
rewards. The action with the highest score (which is the sum of the mean estimate and uncertainty
term) is selected for the next trial.
The algorithm has two stages: the initialization stage and the action selection stage. During the
initialization stage, each action is played at least once, and the corresponding rewards are recorded.
The action selection stage begins after the initialization stage and continues for a predetermined
number of trials. At each trial, the algorithm selects the action with the highest score, plays it, and
records the reward. The estimate of the mean reward for that action is updated, and the algorithm
calculates the score for each action based on the updated estimates and uncertainties.
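A minimal sketch of this selection rule; the estimate array Q, the count array N, the time step t and the confidence value c are assumed inputs, not part of the notes' code:

import numpy as np

def ucb_select(Q, N, t, c=2.0):
    # initialization stage: play every action at least once first
    if np.any(N == 0):
        return int(np.argmin(N))
    # score = mean estimate + exploration bonus, which shrinks as N(a) grows
    scores = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(scores))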

6.0.3 Soft-max Distribution

Softmax is a mathematical function that converts a vector of real numbers into a probability distribu-
tion. The resulting probability distribution assigns a probability to each element of the input vector,
with the sum of probabilities equaling one.

softmax(xi) = e^{xi} / ∑_{j=1}^{n} e^{xj}

where x is a vector of real numbers, i is an index in the vector, and n is the length of the vector.

The denominator of the softmax function ensures that the resulting probabilities sum to one,
while the exponentiation of the input values ensures that the probabilities are non-negative.
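A minimal sketch of the softmax distribution used for action selection (the input scores are illustrative):

import numpy as np

def softmax(x):
    # subtracting the max is for numerical stability; the probabilities are unchanged
    z = np.exp(x - np.max(x))
    return z / z.sum()

probs = softmax(np.array([1.0, 2.0, 0.5]))          # a probability vector that sums to 1
action = np.random.choice(len(probs), p=probs)       # sample an action from the distribution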

6.0.4 Importance of Baseline

In the case of softmax distribution in policy gradients, a baseline is often used to reduce the variance
of the estimated policy gradient. The policy gradient represents the change in the policy parame-
ters that maximizes the expected reward, and is typically estimated using the softmax distribution.
However, the policy gradient estimate has high variance, which can make it difficult to learn an
accurate policy. By introducing a baseline in the policy gradient, we can reduce the variance of
the estimate and make it easier to learn an accurate policy. The baseline subtracts a value from the
estimated value of the action, reducing the variance of the estimate and resulting in a more stable
policy gradient estimate.
7. SARSA

7.0.1 Introduction
State-Action-Reward-State-Action (SARSA) is a reinforcement learning algorithm that is used to
learn a policy for an agent in an environment. The SARSA algorithm is based on the idea of learning
a Q-value function that estimates the expected future reward for taking an action in a given state. The
Q-value function is used to determine the policy, which is the agent’s strategy for selecting actions in
different states.

7.0.2 Algorithm working


The SARSA algorithm updates the Q-value function using a sequence of state-action-reward-state-
action tuples. In each tuple, the agent observes the current state, takes an action based on its current
policy, receives a reward for the action, and transitions to a new state. The new state is used to
determine the next action, based on the policy.
The SARSA algorithm updates the Q-value function using the Bellman equation, which expresses
the Q-value of a state-action pair as the sum of the immediate reward and the discounted future
reward for the next state-action pair. The discount factor is a value between 0 and 1 that determines
the relative importance of immediate and future rewards.
The SARSA algorithm updates the Q-value function after each state-action-reward-state-action tuple,
using the following formula:
Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
where Q(s,a) is the current estimate of the Q-value for the state-action pair, s is the current state,
a is the current action, r is the reward for the current action, s’ is the next state, a’ is the next action,
alpha is the learning rate, and gamma is the discount factor.

The SARSA algorithm updates the Q-value function after each tuple, and the policy is updated
based on the Q-value function. The policy is typically an epsilon-greedy policy, which selects the
action with the highest Q-value with probability 1-epsilon, and a random action with probability
epsilon.
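A minimal sketch of one SARSA episode. The environment object (with reset() and step(a) returning (next state, reward, done)) and the tabular Q array of shape (n_states, n_actions) are assumptions, not part of the notes' code:

import numpy as np
import random

def epsilon_greedy(Q, s, eps):
    if random.random() < eps:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, eps)
        # on-policy target: bootstrap on the action a' actually chosen by the current policy
        Q[s, a] += alpha * (r + gamma * (0 if done else Q[s_next, a_next]) - Q[s, a])
        s, a = s_next, a_next
    return Q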

7.0.3 Q-learning
The Q-value function is updated using the Bellman equation, which expresses the Q-value of a
state-action pair as the sum of the immediate reward and the discounted future reward for the
next state-action pair. The discount factor is a value between 0 and 1 that determines the relative
importance of immediate and future rewards.

The Q-value function is initialized with random values, and the agent interacts with the environ-
ment by taking actions and observing the resulting state and reward. After each action, the Q-value
function is updated using the following formula:

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]


where Q(s,a) is the current estimate of the Q-value for the state-action pair, s is the current state,
a is the current action, r is the reward for the current action, s’ is the next state, alpha is the learning
rate, and gamma is the discount factor.

The max(Q(s’,a’)) term in the formula represents the maximum Q-value for the next state-action
pair, which represents the optimal action to take in the next state. The learning rate determines
the extent to which the new Q-value is used to update the current estimate, and the discount factor
determines the importance of future rewards.

Q-learning is an off-policy learning algorithm, which means that it learns the optimal policy
even if it follows a different policy during the learning process. The optimal policy is derived from
the Q-value function by selecting the action with the highest Q-value in each state.
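A minimal sketch of one Q-learning episode, under the same assumed environment interface as the SARSA sketch above; note the max over next actions in the update:

import numpy as np
import random

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy
        a = random.randrange(Q.shape[1]) if random.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # off-policy target: bootstrap on the greedy (max) action, whatever is actually taken next
        Q[s, a] += alpha * (r + gamma * (0 if done else np.max(Q[s_next])) - Q[s, a])
        s = s_next
    return Q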

Cliff-walking example

Reasons why the graph looks the way it does: for Q-learning, the max operator only tells you the best
accumulative discounted reward you could get in the future, regardless of danger.
For SARSA, the returns earned during training on the faster path are sometimes offset by the negative
rewards due to falling into the cliff. Therefore the returns on the longer, safer path, though discounted
over more actions, are higher, and it is more optimal to go through it.

Here is a table differentiating SARSA and Q-learning


8. Q-learning

8.0.1 Bellman’s optimality condition


Define the value function V∗(s) as the expected cumulative reward starting from state s and following
the optimal policy:

V∗(s) = max_{a∈A(s)} Q∗(s, a) = max_{a∈A(s)} E[r_{t+1} + γ V∗(s_{t+1}) | s_t = s, a_t = a]

Here, A(s) is the set of possible actions in state s, Q∗(s, a) is the optimal action-value function,
and γ is the discount factor.
Using the Bellman equation for the state-value function V∗(s), we can rewrite V∗(s) as:

V∗(s) = max_{a∈A(s)} E[r_{t+1} + γ V∗(s_{t+1}) | s_t = s, a_t = a]
      = max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [r + γ V∗(s′)]

Here, p(s′, r | s, a) is the probability of transitioning to state s′ and receiving reward r when taking
action a in state s.
Taking the maximum over all actions a in state s, we obtain the Bellman optimality equation:

V∗(s) = max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [r + γ V∗(s′)]

This equation states that the optimal value of a state is equal to the maximum expected return
that can be obtained by taking any action in that state, plus the discounted value of the next state.
Similarly, we can derive the Bellman optimality equation for the optimal action-value function
Q∗(s, a):

Q∗(s, a) = ∑_{s′,r} p(s′, r | s, a) [r + γ max_{a′∈A(s′)} Q∗(s′, a′)]

The optimal action-value function is equal to the expected return that can be obtained by taking
action a in state s, plus the discounted value of the next state-action pair, where the action is chosen
to maximize the value of the next state.

8.0.2 Quality of a state


We can use the Bellman equation for the state-value function V(s) and solve for V(s) in terms of the
optimal action-value function Q∗(s, a):

V(s) = max_{a∈A(s)} Q∗(s, a) = max_{a∈A(s)} E[r_{t+1} + γ V∗(s_{t+1}) | s_t = s, a_t = a]
     = max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [r + γ V∗(s′)]

Rearranging terms, we obtain:

V(s) − max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [r + γ V(s′)] = 0

Now, we can define the quality of a state as the maximum difference between the value of that
state and the value of its neighboring states under the optimal policy. This is known as the Bellman
error, denoted δ(s):

δ(s) = max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [r + γ V(s′)] − V(s)

This quantity represents the amount by which the current estimate of the value of state s differs
from the optimal value, and it can be used to update the value function in a variety of reinforcement
learning algorithms.

8.0.3 An iterative example of Q-table updation


1. Initialize the Q-table with zeros or small random values.
2. Observe the current state s and select an action a using an exploration-exploitation strategy (e.g. ε-greedy).
3. Execute the action a and observe the reward r and the new state s′.
4. Update the Q-value for the current state-action pair using the Q-learning update rule:

   Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]

   where α is the learning rate and γ is the discount factor.
5. Set the current state to the new state, s ← s′, and repeat the process from step 2 until the agent
   reaches the terminal state.
During each iteration, the Q-values are updated based on the observed rewards and the maximum
Q-value of the next state. Over time, the Q-values converge to their optimal values, allowing the
agent to make better decisions and achieve higher rewards in the environment.
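A tiny numerical illustration of one such update; the sizes, α = 0.5, γ = 0.9 and the observed transition are assumed for illustration:

import numpy as np

Q = np.zeros((2, 2))            # 2 states, 2 actions, initialized with zeros
alpha, gamma = 0.5, 0.9

# one observed transition: in state 0, action 1 gave reward 10 and led to state 1
s, a, r, s_next = 0, 1, 10.0, 1
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q)                        # Q[0, 1] becomes 5.0 after this first update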
9. Deep q-learning

Deep Q-learning (DQL) is a type of reinforcement learning that uses neural networks to approximate
the optimal action-value function, which estimates the expected reward of taking a particular action
in a given state.

In DQL, an agent interacts with an environment by taking actions and receiving rewards. The
agent’s goal is to learn a policy that maximizes the cumulative reward over time. To achieve this, the
agent uses a Q-learning algorithm to update its estimates of the action-value function based on the
observed rewards and the next state of the environment.

In traditional Q-learning, the action-value function is represented as a lookup table that stores the
expected reward for each possible state-action pair. However, this approach becomes infeasible when
the state space is large or continuous. To address this, DQL uses a neural network to approximate
the action-value function.

During training, the agent repeatedly samples experiences from its interaction with the envi-
ronment and uses them to update the parameters of the neural network. The network is trained to
minimize the difference between the predicted action values and the observed rewards, using a loss
function such as mean squared error.

Training the neural network includes the following steps:


1. Initialization
2. Forward Propagation
3. Calculation of Error
4. Backward Propagation
5. Update Weights
6. Repeat (steps 2-5)

But before we dive into neural networks, let's see the structure:
input layer → hidden layer → output layer
The input layer consists of neurons corresponding to the features that we consider, and the output
layer consists of neurons corresponding to the outputs/categories that we have. The hidden layer sits
in between.
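As a small illustration of this structure (the layer sizes and random weights are assumptions), a forward pass that maps input features to one output value per action:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)      # input -> hidden weights and biases
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)      # hidden -> output weights and biases

x = rng.normal(size=4)                             # one input (e.g. state features)
h = np.maximum(0, x @ W1 + b1)                     # hidden layer with ReLU activation
q_values = h @ W2 + b2                             # output layer: one value per action/category
print(q_values)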
Here is an example of handwritten number recognition:

For an in-depth analysis:



This could be applied in a stock trading bot, and has a great number of applications in the stock market
and trading.
10. Introduction to calculus of variations

Function: x → f(x)
Takes an input value and produces an output value. For example: a curve that describes the shape of
the path.
Functional: f(x) → F[f(x)]
Takes a function as input and produces a scalar value as output. For example: an integral that
represents the total energy associated with a path.

10.1 Finding the shortest path optimally


Important steps in the derivation:
1. We want to solve the problem as a two-dimensional variational problem.
2. Take a small segment on the curve and derive the incremental distance dS.
3. Use the Euler-Lagrange equation (it describes the condition under which a given function is a
stationary point of a functional):
0 = d/dt (∂L/∂q̇) − ∂L/∂q
4. Take the integral of a functional f(x, y, y′) that minimizes the interval.
5. Compare the two forms of the integral.
6. Rearrange in terms of y′.
7. We get that the slope is constant, hence the shortest path is a straight line (see the sketch below).
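A minimal symbolic sketch of steps 3-7 using sympy's euler_equations helper (the variable names are illustrative):

import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
y = sp.Function('y')

L = sp.sqrt(1 + y(x).diff(x)**2)            # the arc-length integrand f(x, y, y')
eq = euler_equations(L, y(x), x)[0]
print(eq)                                    # the Euler-Lagrange equation; it is equivalent to y''(x) = 0
print(sp.dsolve(y(x).diff(x, 2), y(x)))      # y(x) = C1 + C2*x, i.e. the shortest path is a straight line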

10.2 Brachistochrone problem


The brachistochrone problem asks what shape a curve should take so that a ball can roll down it in
the least amount of time. It was first studied by mathematicians in the 17th century. The solution to
the problem is a curve called a cycloid. The cycloid is the path traced out by a point on a circle as it
rolls along a straight line.

T = ∫_{P1}^{P2} √( (1 + y′²) / (2gy) ) dx

10.3 Derivation of Euler-Lagrange Equation


1. Catenary Problem - Potential energy minimization
2. Brachistochrone Problem (time minimization)
11. Calculus of Variations

In mathematics, a functional is a function that takes one or more functions as inputs and produces
a scalar value as output. Formally, a functional is a mapping from a space of functions to the real
numbers or complex numbers.
A functional can be viewed as a generalization of a function, which takes a single point or vector as
input and produces a scalar value as output. In contrast, a functional takes a function as input, which
can be thought of as an infinite dimensional vector.

11.1 History- Interesting

Functionals have a long history in mathematics, with roots in the calculus of variations and the study
of differential equations. Here are some notable milestones in the history of functionals:
1. The calculus of variations was developed in the 17th century by mathematicians such as Johann
Bernoulli and Jacob Bernoulli, who were interested in finding the path of least time for light to travel
between two points in a medium of varying density.
2. In the 19th century, mathematicians such as Karl Weierstrass and Bernhard Riemann developed the
theory of functionals and their properties. This led to the development of functional analysis, which
is the study of the properties of functionals and the spaces on which they act.
3. In the mid-20th century, functional analysis and optimization theory were used extensively in
engineering and economics, particularly in the development of control theory and game theory.

In the optimization problems we have addressed thus far, our objective was to find numerical values for a number of variables, such as x1, ..., xn, that maximise a specific quantity F depending on x1 through xn. We now examine optimization problems that aim to identify functions x1(t), ..., xn(t) that optimally maximise a specific quantity F depending on those functions.

11.2 Production planning


An organisation is given the task of producing B units of a good in time T. They want to spend the least amount possible to complete the order. There are two sources of cost: (1) the cost of storage, charged per item and per hour; (2) the cost of production, which at production rate r is b·r per item. How might the business organise its production to reduce overall costs? Let's convert the issue into a strictly mathematical one. Denote by x(t) the volume produced up until time t; we are looking for this function. Let's write down the complete cost incurred as a function of x(t). The cost incurred over a brief time period
[t,\ t + \Delta t]
is roughly equal to:
(1)\ \text{storage}:\ a\cdot x(t)\cdot\Delta t
(2)\ \text{production}:\ b\cdot x'(t)\cdot x'(t)\,\Delta t = b\,x'(t)^2\,\Delta t


Summing over the entire interval and taking the limit as \Delta t \to 0:
\text{total cost} = \int_0^T \left(a\,x(t) + b\,x'(t)^2\right) dt
The optimization problem is then the following.
Problem 1: find a function x(t) that minimizes
\int_0^T \left(a\,x(t) + b\,x'(t)^2\right) dt

subject to x(0)=0 and x(T) = B
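A small sympy sketch of this problem (assuming sympy is available): the Euler-Lagrange equation for the integrand a·x + b·x'² reduces to 2b·x'' = a, and the candidate solution below — a parabola pinned down by the boundary conditions — is checked against both the equation and the conditions x(0) = 0, x(T) = B.

import sympy as sp

t, a, b, B, T = sp.symbols('t a b B T', positive=True)

# Candidate minimizer: a parabola fixed by x(0) = 0 and x(T) = B
x = a*t**2/(4*b) + (B/T - a*T/(4*b))*t

# Boundary conditions
print(sp.simplify(x.subs(t, 0)))        # -> 0
print(sp.simplify(x.subs(t, T) - B))    # -> 0

# Euler-Lagrange equation for F = a*x + b*x'^2:  d/dt(2*b*x') - a = 0
el = sp.diff(2*b*sp.diff(x, t), t) - a
print(sp.simplify(el))                  # -> 0

This is the parabolic solution whose qualitative behaviour is discussed in the remark later in this chapter.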

11.3 The brachistochrone problem


The calculus of variations was born out of this physics conundrum. The objective is to create a path
between two points A and B that will allow a ball to travel there in the shortest amount of time under
the influence of gravity. Let’s convert the issue to a strictly mathematical one. Assume that A’s
coordinates are (0, a) and B’s coordinates are (b, 0). Assume the graph of a function y of x specifies
the path. Let’s note how long it took the ball to go along this course.
Suppose the ball is at (x, y) and rolls to (x + \Delta x,\ y - \Delta y) in time \Delta t. The distance travelled is
\sqrt{(\Delta x)^2 + (\Delta y)^2} = \sqrt{1 + (y')^2}\;\Delta x
The speed v is given by equating the kinetic energy with the loss in potential energy (assuming the initial speed is zero):
\frac{1}{2}mv^2 = mg(a - y)
v = \sqrt{2g(a - y)}
so the time taken is
\Delta t = \frac{\sqrt{1 + (y')^2}\;\Delta x}{\sqrt{2g(a - y)}}
Summing over the whole interval and taking the limit as \Delta x \to 0 gives the total time:
\text{Total time} = \int_0^b \frac{\sqrt{1 + (y')^2}}{\sqrt{2g(a - y)}}\; dx

This leads to the following problem.

11.4 Investment strategy


Suppose we are planning an investment and consumption strategy, in the following setup. A capital
K yields returns at a rate F(K). The returns can be spent or reinvested, or a combination of both.
Spending results in enjoyment, or utility, and re-investment results in greater K for greater future
returns. Suppose consumption at rate C yields utility
U(C). Our goal is to devise a consumption/re-investment strategy that will maximize total utility.
Denote by K(t) the invested capital at time t, by C(t) the rate of consumption at time t and by R(t) the
rate of re-investment at time t. Then we have the equation:
F(K(t)) = C(t) + R(t)
which expresses the fact that the returns from capital K(t) are partially consumed and partially
re-invested. Furthermore, the rate of re-investment is precisely the rate of growth of K(t). Therefore,
we have
K ′ (t) = R(t)
Using the two equations above, we can write C(t) = F(K(t)) - K’(t) We are given the initial value
K(0) and the utility function U. Our goal is to figure out the function K(t) that leads to the maximum
total utility
\int_0^T U(C(t))\,dt

This completes the formulation of the problem.


Remark: Strictly speaking, the only conclusion we can draw from solving the Euler-Lagrange equation is that if a minimizer x(t) exists, then it must be the one we found. But let us assume that it does (which we will not prove). Note some qualitative features of the solution. If a = 0 (no storage cost), then the optimal strategy is to produce at a constant rate, which means that the graph of x(t) is a straight line. If a > 0, then the graph of x(t) is a parabola, which gets steeper as a/4b increases. So, for large a/b (production cost negligible compared to storage cost), the optimal strategy is to do most of the production close to the deadline.

11.5 Investment planning


Let us take the rate of returns to be proportional to the capital and the utility function to be the
logarithm. Let us also assume that both the initial and the final value of the capital is given. For
simplicity, let us assume T = 1.
12. Fermat’s Principle

12.1 Euler-Lagrange Equation


S[y] = \int_a^b F(x, y, y')\,dx, \qquad \text{where } y(a) = A \text{ and } y(b) = B
The stationary paths for this functional are given by solving the Euler-Lagrange equation
\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) - \frac{\partial F}{\partial y} = 0
Special case (Beltrami's identity), when x does not appear in F explicitly:
F - y'\frac{\partial F}{\partial y'} = C
where C is a constant.

12.2 distance minimisation


I = \int_A^B ds = \int_A^B \sqrt{1 + (y')^2}\;dx
With the application of the Euler-Lagrange equation to this functional,
y' = \sqrt{\frac{c^2}{1 - c^2}} = m \quad (\text{a constant})
y = mx + c
Hence, the result is a straight line.

12.3 Light
12.3.1 Introduction
Ray and wave optics are connected by Fermat’s concept, also referred to as the principle of least time.
According to Fermat’s principle in its original, "strong" form, the path that a ray takes between any
two places is the one that can be traversed in the shortest amount of time. In order for this statement
to hold in all situations, the "least" time must be replaced with a time that is "stationary" with respect
to changes in the path, such that a change in the path only results in a second-order change in the
traversal time. In general, a ray path is surrounded by nearby paths that take very nearly the same time to traverse. With this technical description, it can be demonstrated that this definition of a ray corresponds to less technical definitions, like a line of sight or the course of a narrow beam.
Fermat’s concept, which he first put forth in 1662 to explain the natural rule of light refraction,
caused some controversy at the time since it appeared to attribute intelligence and intention to nature.
It wasn’t until the 19th century that the ability of nature to examine different courses was recognised
as merely a fundamental characteristic of waves. A wavefront spreading from point A sweeps all potential ray routes radiating from point A, whether they pass via point B or not. If the wavefront reaches point B, it sweeps an infinite number of neighbouring pathways with the same endpoints in addition to the ray path(s) from point A to point B. Any ray that happens to reach point B is described by Fermat's principle.

12.3.2 principle of least time


Ray and wave optics are connected by Fermat’s concept, also referred to as the principle of least
time. Fermat’s principle states that the route taken by a ray between any two places is the one that
can be traversed in the shortest amount of time. This is known as its original "strong" form.

12.3.3 Derivation
We express the geometry in terms of the sines of the angles (take θ1 to be the initial angle and θ2 the final angle); this also shows how the refractive index n enters. The total travel time is the sum of the times spent in the two media:
T = t_1 + t_2 = \int_1 dt + \int_2 dt

dl_1 = V_1\,dt_1 \;\Rightarrow\; dt_1 = \frac{dl_1}{V_1}
dl_2 = V_2\,dt_2 \;\Rightarrow\; dt_2 = \frac{dl_2}{V_2}
Therefore the path element in the first medium is
dl_1 = \sqrt{(dx)^2 + (dy)^2} = dx\sqrt{1 + \left(\frac{dy}{dx}\right)^2}

Therefore the value of the sine function is
\sin\theta_1 = \frac{dy}{dl_1} = \frac{dy}{\sqrt{(dx)^2 + (dy)^2}} = \frac{\frac{dy}{dx}}{\sqrt{1 + \left(\frac{dy}{dx}\right)^2}}
Hence the value of T will be derived as
T = \int\frac{dl_1}{V_1} + \int\frac{dl_2}{V_2}
T = \frac{1}{V_1}\int_0^a\sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx + \frac{1}{V_2}\int_a^A\sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx

Using the Heaviside step function θ(·), the two integrals for T can be combined into one:
T = \int_0^A\left[\frac{\theta(a - x)}{V_1} + \frac{\theta(x - a)}{V_2}\right]\sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx
given that
\frac{\theta(a - x)}{V_1} + \frac{\theta(x - a)}{V_2} = \frac{1}{v(x)}
where v(x) = V_1 for x < a and v(x) = V_2 for x > a.
Hence T is the travel time along the path from 0 to A, which Fermat's principle tells us to minimize:
T = \int_0^A\frac{1}{v(x)}\sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx

To apply the Euler-Lagrange equation, let
G = \frac{1}{v(x)}\sqrt{1 + \left(\frac{dy}{dx}\right)^2}

Since G does not depend on y explicitly, the Euler-Lagrange equation implies that
\frac{\partial G}{\partial\left(\frac{dy}{dx}\right)} = \text{constant}
Applying the derivative to the square root,
\frac{\partial}{\partial\left(\frac{dy}{dx}\right)}\sqrt{1 + \left(\frac{dy}{dx}\right)^2} = \frac{\frac{dy}{dx}}{\sqrt{1 + \left(\frac{dy}{dx}\right)^2}}
so that
\frac{\partial G}{\partial\left(\frac{dy}{dx}\right)} = \frac{1}{v(x)}\cdot\frac{\frac{dy}{dx}}{\sqrt{1 + \left(\frac{dy}{dx}\right)^2}}
Recall the expression for the sine derived above:
\sin\theta = \frac{dy}{dl} = \frac{\frac{dy}{dx}}{\sqrt{1 + \left(\frac{dy}{dx}\right)^2}}
Hence, equating the two expressions, the conserved quantity along the ray is
\frac{\sin\theta}{v(x)} = \text{constant}
Evaluating this constant on the two sides of the interface (x < a and x > a) gives Snell's law:
\frac{\sin\theta_1}{V_1} = \frac{\sin\theta_2}{V_2} \quad\Longleftrightarrow\quad \frac{\sin\theta_1}{\sin\theta_2} = \frac{V_1}{V_2}
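A small numerical sketch of this result, using made-up speeds V1 = 1 and V2 = 0.6 and made-up endpoints: we scan over the crossing point on the interface, take the one with the least travel time (Fermat's principle), and check that sinθ1/sinθ2 matches V1/V2.

import numpy as np

# Light goes from P1 = (0, 1) in medium 1 to P2 = (2, -1) in medium 2;
# the interface is the line y = 0 and the ray crosses it at (x, 0).
V1, V2 = 1.0, 0.6                          # assumed speeds in the two media
P1, P2 = np.array([0.0, 1.0]), np.array([2.0, -1.0])

xs = np.linspace(0.0, 2.0, 200001)         # candidate crossing points on the interface
t = (np.hypot(xs - P1[0], P1[1]) / V1      # time in medium 1
     + np.hypot(P2[0] - xs, P2[1]) / V2)   # time in medium 2
x_best = xs[np.argmin(t)]                  # Fermat: pick the least-time crossing point

# Angles measured from the normal to the interface
sin1 = (x_best - P1[0]) / np.hypot(x_best - P1[0], P1[1])
sin2 = (P2[0] - x_best) / np.hypot(P2[0] - x_best, P2[1])
print("sin(theta1)/sin(theta2) =", sin1 / sin2)
print("V1/V2                   =", V1 / V2)   # the two ratios agree (Snell's law)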

Inhomogeneous medium
Let us consider a medium M that is optically inhomogeneous, in which the light velocity is a single-valued continuous function of the y coordinate:

u = u(y)

We approximate M by using a sequence of homogeneous media M1, M2, ..., Mn. As the number of subdivisions increases and their widths approach zero, the approximation becomes closer to the actual path through the medium M.

Assumption: the refractive index depends upon the position.
A continuous variation of the refractive index can be thought of as the limiting case of a medium consisting of thin layers, each with a specific value of the refractive index.
Here, the beam of light travels through N distinct layers, each with a different refractive index. As the beam propagates, it experiences multiple refractions at all the interfaces.
13. Variational Constraint Problem

13.1 Introduction
A catenary is the curve formed by an idealized chain or cable that hangs freely under the influence of gravity between two points. Although it superficially resembles a parabola, it is a different curve with a U-like appearance, and it is commonly found in the fields of physics and geometry. The catenary shape is frequently used in the design of arches and can also be seen in the cross-section of a soap film confined between two parallel circular rings, which is known as a catenoid.
The shape of the cable is determined by the function y(x), where y represents the vertical
displacement and x represents the horizontal position along the cable.

13.2 Derivation
Assuming we have two supporting points, (−a, A) and (a, A), then:
S[y] = \int_{-a}^{a}\sqrt{1 + y'(x)^2}\,dx, \qquad y(-a) = A,\ y(a) = A
We seek the function y that minimizes the potential energy while the length of the curve remains constant:
L = \int_{-a}^{a}\sqrt{1 + (y')^2}\,dx
The catenary has the shape c\cosh\left(\frac{x}{c}\right) + d, where c and d are constants.
Let μ = mass per unit length and L = total length. To minimize the potential energy of the chain, we first write the length of the chain as
L = \int ds = \int_{-a}^{a}\sqrt{1 + (y')^2}\,dx

PE = \int_A^B \mu g y\;ds
PE = \int_{-a}^{a}\mu g y\sqrt{1 + (y')^2}\;dx
Solving the Euler-Lagrange equation with F = \mu g y\sqrt{1 + (y')^2} alone would only give the path with the least potential energy, with no regard for the fixed length of the chain. Hence, we take a constraint equation:
N = PE + \lambda L
where λ = the Lagrange multiplier and L = the length of the chain. Substituting PE and L into the new equation, we get:
N = \int\mu g y\sqrt{1 + (y')^2}\,dx + \lambda\int\sqrt{1 + (y')^2}\,dx = \int(\mu g y + \lambda)\sqrt{1 + (y')^2}\,dx
which is subject to the new constraint equation.
Using Euler-Lagrange and taking into account
F = (\mu g y + \lambda)\sqrt{1 + (y')^2} \;\rightarrow\; F(y, y')

we find that F does not explicitly depend on x, and therefore we can use the Beltrami identity. Start from the Euler-Lagrange equation:
\frac{\partial F}{\partial y} - \frac{d}{dx}\frac{\partial F}{\partial y'} = 0
Multiplying both sides by y':
y'\frac{\partial F}{\partial y} - y'\frac{d}{dx}\frac{\partial F}{\partial y'} = 0
The total derivative of F(y, y') with respect to x is
\frac{dF}{dx} = \frac{\partial F}{\partial y}y' + \frac{\partial F}{\partial y'}y''
\Longrightarrow\; \frac{\partial F}{\partial y}y' = \frac{dF}{dx} - \frac{\partial F}{\partial y'}y''
Substituting this into the equation obtained by multiplying by y':
\frac{dF}{dx} - \frac{\partial F}{\partial y'}y'' - y'\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) = 0
which is exactly the negative of the expanded form of \frac{d}{dx}\left(y'\frac{\partial F}{\partial y'} - F\right). We rewrite it as:
\frac{d}{dx}\left(y'\frac{\partial F}{\partial y'} - F\right) = 0
implying that the term inside the derivative is a constant. This gives Beltrami's identity:
F - y'\frac{\partial F}{\partial y'} = C_1
p
After substituting F = (\mu g y + \lambda)\sqrt{1 + (y')^2} into the identity:
(\mu g y + \lambda)\sqrt{1 + (y')^2} - y'\left[\frac{(\mu g y + \lambda)\cdot 2y'}{2\sqrt{1 + (y')^2}}\right] = c_1
(\mu g y + \lambda)\sqrt{1 + (y')^2} - y'\left[\frac{(\mu g y + \lambda)\,y'}{\sqrt{1 + (y')^2}}\right] = c_1
(\mu g y + \lambda)\left[\sqrt{1 + (y')^2} - \frac{(y')^2}{\sqrt{1 + (y')^2}}\right] = c_1
(\mu g y + \lambda)\left[\frac{1 + (y')^2 - (y')^2}{\sqrt{1 + (y')^2}}\right] = c_1
\frac{\mu g y + \lambda}{\sqrt{1 + (y')^2}} = c_1
Writing the equation in terms of y':
\mu g y + \lambda = c_1\sqrt{1 + (y')^2}
\left(\frac{\mu g y + \lambda}{c_1}\right)^2 = 1 + (y')^2
\frac{dy}{dx} = y' = \sqrt{\left(\frac{\mu g y + \lambda}{c_1}\right)^2 - 1}
Writing it in terms of dy:
\frac{dy}{\sqrt{\left(\frac{\mu g y + \lambda}{c_1}\right)^2 - 1}} = dx
For integrating, we substitute \frac{\mu g y + \lambda}{c_1} = \cosh u:
\int\frac{dy}{\sqrt{\cosh^2 u - 1}} = \int dx = x
Differentiating the substitution gives \frac{\mu g}{c_1}\,dy = \sinh u\,du, i.e. dy = \frac{c_1}{\mu g}\sinh u\,du, so
\frac{c_1}{\mu g}\int\frac{\sinh u}{\sqrt{\sinh^2 u}}\,du = \frac{c_1}{\mu g}\,u = x + c_2
u = \cosh^{-1}\left(\frac{\mu g y + \lambda}{c_1}\right) = \frac{\mu g}{c_1}(x + c_2)
Writing the equation in terms of y:
\frac{\mu g y + \lambda}{c_1} = \cosh\left(\frac{\mu g}{c_1}(x + c_2)\right)
y = \frac{c_1}{\mu g}\cosh\left(\frac{\mu g}{c_1}(x + c_2)\right) - \frac{\lambda}{\mu g}
This is the equation of the chain.
Applying the boundary conditions at (−a, 0) and (a, 0):
0 = \frac{c_1}{\mu g}\cosh\left(\frac{\mu g}{c_1}(-a + c_2)\right) - \frac{\lambda}{\mu g} \;\Longrightarrow\; \frac{c_1}{\mu g}\cosh\left(\frac{\mu g}{c_1}(-a + c_2)\right) = \frac{\lambda}{\mu g}
0 = \frac{c_1}{\mu g}\cosh\left(\frac{\mu g}{c_1}(a + c_2)\right) - \frac{\lambda}{\mu g} \;\Longrightarrow\; \frac{c_1}{\mu g}\cosh\left(\frac{\mu g}{c_1}(a + c_2)\right) = \frac{\lambda}{\mu g}
As both equations' right-hand sides are the same, their left-hand sides must also be equal. As a cannot be 0, this is only conceivable when c_2 = 0.
When c_2 = 0 is substituted in both equations:
\frac{c_1}{\mu g}\cosh\left(\frac{-\mu g a}{c_1}\right) = \frac{c_1}{\mu g}\cosh\left(\frac{\mu g a}{c_1}\right)
As cosh is an even function, where \cosh(-k) = \cosh(k), both equations reduce to
\lambda = c_1\cosh\left(\frac{\mu g a}{c_1}\right)
To find c_1, we substitute c_2 and λ into y:
y = \frac{c_1}{\mu g}\cosh\left(\frac{\mu g x}{c_1}\right) - \frac{c_1}{\mu g}\cosh\left(\frac{\mu g a}{c_1}\right) = \frac{c_1}{\mu g}\left[\cosh\left(\frac{\mu g x}{c_1}\right) - \cosh\left(\frac{\mu g a}{c_1}\right)\right]
So y' is found to be:
\frac{dy}{dx} = y' = \sinh\left(\frac{\mu g x}{c_1}\right)
Substituting this into the constraint equation:
L = \int_{-a}^{a}\sqrt{1 + (y')^2}\,dx = \int_{-a}^{a}\sqrt{1 + \sinh^2\left(\frac{\mu g x}{c_1}\right)}\,dx = \int_{-a}^{a}\cosh\left(\frac{\mu g x}{c_1}\right)dx
= \frac{c_1}{\mu g}\left[\sinh\left(\frac{\mu g a}{c_1}\right) - \sinh\left(\frac{-\mu g a}{c_1}\right)\right] = \frac{2c_1}{\mu g}\sinh\left(\frac{\mu g a}{c_1}\right)
L is not a function but a real number. Solving this last equation for c_1, knowing L, and substituting that c_1 into the equation for y gives the solution to the catenary problem.
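A small numerical sketch of this last step, with assumed values μg = 1, a = 1 and chain length L = 3 (any L > 2a would do): we solve (2c1/μg)·sinh(μg·a/c1) = L for c1 by bisection and then evaluate the resulting catenary.

import math

mu_g = 1.0      # mu * g (assumed)
a = 1.0         # half-distance between the supports
L = 3.0         # length of the chain (must exceed 2*a)

def length_mismatch(c1):
    # (2*c1/mu_g)*sinh(mu_g*a/c1) - L: positive for small c1, negative for large c1
    return 2.0 * c1 / mu_g * math.sinh(mu_g * a / c1) - L

# Bisection: the mismatch decreases monotonically towards 2a - L < 0 as c1 grows
lo, hi = 0.05, 100.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if length_mismatch(mid) > 0.0:
        lo = mid
    else:
        hi = mid
c1 = 0.5 * (lo + hi)

# Catenary through (-a, 0) and (a, 0): y(x) = (c1/mu_g)*(cosh(mu_g*x/c1) - cosh(mu_g*a/c1))
y = lambda x: c1 / mu_g * (math.cosh(mu_g * x / c1) - math.cosh(mu_g * a / c1))
print("c1 =", c1)
print("sag at the middle, y(0) =", y(0.0))
print("y(-a), y(a) =", y(-a), y(a))   # both are 0 by construction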
14. Probability Distribution using Max. Entropy

14.1 Lagrange Multipliers


Lagrange multipliers are used to optimize a function subject to a constraint. By introducing a
Lagrange multiplier variable and constructing a new function called the Lagrangian, we can solve
for the optimal values of x and y that maximize or minimize the original function while satisfying
the given constraint.
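As a concrete toy illustration (assuming sympy is available), the sketch below minimizes f(x, y) = x² + y² subject to the constraint x + y = 1 by solving the stationarity conditions of the Lagrangian; the particular f and g here are arbitrary choices made only for the example.

import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

f = x**2 + y**2            # function to minimize
g = x + y - 1              # constraint g(x, y) = 0

# Lagrangian and its stationarity conditions (all first derivatives equal to zero)
L = f + lam * g
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(eqs, (x, y, lam), dict=True)
print(sol)   # -> [{x: 1/2, y: 1/2, lambda: -1}]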

14.2 Principle of Maximum Entropy


The principle of maximum entropy states that, subject to precisely stated prior data (such as a
proposition that expresses testable information), the probability distribution which best represents the
current state of knowledge is the one with largest entropy. Another way of stating this: Take precisely
stated prior data or testable information about a probability distribution function. Consider the set
of all trial probability distributions that would encode the prior data. According to this principle,
the distribution with maximal information entropy is the proper one. In ordinary language, the
principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum
ignorance. The selected distribution is the one that makes the least claim to being informed beyond
the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior
data.

Surprise is inversely related to probability. However, simply taking the inverse of the probability does not behave the way we want: for an event that is certain, such as getting heads from a two-headed coin, the inverse of the probability is 1 when the surprise should be 0. So surprise is the log of the inverse of the probability:
\text{Surprise} = \log\frac{1}{\text{Probability}}
We can say that entropy is the expected value of surprise:
\text{Entropy} = \sum_{i=1}^{n}\log\left(\frac{1}{p_i}\right)p_i
where surprise \to \log\frac{1}{p_i} and P(\text{surprise}) \to p_i. Therefore,
\text{Entropy} = -\sum_{i=1}^{n} p_i\log(p_i)

14.2.1 Shannon Entropy Formula


If X is a discrete random variable with distribution P(X = x_i) = p_i, then the entropy of X is:
H(X) = -\sum_i p_i\log p_i
But in our case, X is a continuous random variable with probability density p(x):
H(X) = -\int_{-\infty}^{+\infty} p(x)\log p(x)\,dx
The principle of maximum entropy suggests choosing the probability distribution with the largest
entropy when we have prior data. This promotes epistemic modesty by selecting the distribution that
admits the most uncertainty beyond what we already know.
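A quick numerical illustration with a made-up four-outcome random variable: among a few example distributions, the uniform one has the largest entropy, in line with the principle above.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log(p))

distributions = {
    "uniform":        [0.25, 0.25, 0.25, 0.25],
    "mildly peaked":  [0.40, 0.30, 0.20, 0.10],
    "very peaked":    [0.97, 0.01, 0.01, 0.01],
}
for name, p in distributions.items():
    print(f"{name:13s} H = {entropy(p):.4f}")
# The uniform distribution attains the maximum possible entropy log(4) ~ 1.3863.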

Where do Lagrangian multiplier fit into this? The principle of maximum entropy is a method
used to determine probability distributions that are consistent with a given set of constraints. The
basic idea is to choose the probability distribution that has the maximum entropy subject to these
constraints. The Lagrange multipliers are used to incorporate these constraints into the maximization
problem.
Now using the principle of maximum entropy and the Lagrange multipliers, we will derive 3 common
probability distributions that we’ve all studied before.

14.3 Derivation 1 - general


J(p) = -\int p(x)\ln p(x)\,dx + \lambda_0\left(\int p(x)\,dx - 1\right) + \sum_{i=1}^{N}\lambda_i\left(\int p(x) f_i(x)\,dx - a_i\right)

Here λ_0 and the λ_i are the Lagrange multipliers. These variables are introduced by us in order to incorporate the constraints. Now, taking a functional derivative with respect to p(x) gives us the following:
\frac{\delta J}{\delta p(x)} = -1 - \ln p(x) + \lambda_0 + \sum_{i=1}^{N}\lambda_i f_i(x)
0 = -1 - \ln p(x) + \lambda_0 + \sum_{i=1}^{N}\lambda_i f_i(x)
p(x) = \exp\left(\lambda_0 - 1 + \sum_{i=1}^{N}\lambda_i f_i(x)\right)

Depending on the different values of f(x), our constraints would change.



14.4 Derivation 2

Here, our value of f(x) = 0, because in the uniform distribution the only constraint we require is
\int p(x)\,dx - 1 = 0
So our Lagrange equation now becomes:
J(p) = -\int p(x)\log p(x)\,dx + \lambda_0\left(\int p(x)\,dx - 1\right)
Now, taking the functional derivative with respect to p(x) and equating it to 0 gives us the following:
\frac{\delta J}{\delta p(x)} = -1 - \log p(x) + \lambda_0 = 0
\log p(x) = \lambda_0 - 1
p(x) = \exp(\lambda_0 - 1)
To check that this is a maximum (which would maximize entropy, given the way the equation was set up), we also need the second derivative with respect to p(x) to be negative, which it clearly always is (it equals -1/p(x)). Applying the normalization constraint on [a, b]:
\int_a^b p(x)\,dx = 1
\int_a^b \exp(\lambda_0 - 1)\,dx = 1
\exp(\lambda_0 - 1) = \frac{1}{b - a}
Comparing the equations, we get
p(x) = \frac{1}{b - a}
Of course, this is only defined in the range between a and b, so the final density is the uniform distribution on [a, b]:

Graph of Uniform Distribution



14.5 Derivation 3 - mean, variance

Derivation of the maximum entropy probability distribution for given fixed mean μ and variance σ² (Gaussian distribution).
First constraint: \int_{-\infty}^{+\infty} p(x)\,dx - 1 = 0
Second constraint: \int_{-\infty}^{+\infty} (x - \mu)^2 p(x)\,dx - \sigma^2 = 0
J(p(x)) = -\int_{-\infty}^{+\infty} p(x)\ln p(x)\,dx + \lambda_0\left(\int_{-\infty}^{+\infty} p(x)\,dx - 1\right) + \lambda_1\left(\int_{-\infty}^{+\infty}(x - \mu)^2 p(x)\,dx - \sigma^2\right)
where the first constraint is the definition of a pdf and the second is the definition of the variance (which also gives us the mean for free). Taking the derivative with respect to p(x) and setting it to zero,
\frac{\delta J(p(x))}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \lambda_1(x - \mu)^2 = 0
p(x) = \exp\left(\lambda_0 - 1 + \lambda_1(x - \mu)^2\right) = \exp(\lambda_0 - 1)\exp\left(\lambda_1(x - \mu)^2\right)
To check that this is a maximum, we take the second-order derivative, which is always negative:
\frac{\delta^2 J}{\delta p(x)^2} = -\frac{1}{p(x)}
This p(x) should satisfy the two constraints.
Satisfying the first constraint,
\int_{-\infty}^{+\infty} p(x)\,dx = 1
\int_{-\infty}^{+\infty}\exp(\lambda_0 - 1)\exp\left(\lambda_1(x - \mu)^2\right)dx = 1
\exp(\lambda_0 - 1)\sqrt{\frac{\pi}{-\lambda_1}} = 1
\exp(\lambda_0 - 1) = \sqrt{\frac{-\lambda_1}{\pi}}
Satisfying the second constraint,
\int_{-\infty}^{+\infty}\exp(\lambda_0 - 1)(x - \mu)^2\exp\left(\lambda_1(x - \mu)^2\right)dx = \sigma^2
\sqrt{\frac{-\lambda_1}{\pi}}\int_{-\infty}^{+\infty}(x - \mu)^2\exp\left(\lambda_1(x - \mu)^2\right)dx = \sigma^2
\sqrt{\frac{-\lambda_1}{\pi}}\cdot\frac{1}{2}\sqrt{\frac{\pi}{(-\lambda_1)^3}} = \sigma^2 \;\Longrightarrow\; -\frac{1}{2\lambda_1} = \sigma^2 \;\Longrightarrow\; \lambda_1 = -\frac{1}{2\sigma^2}
We can also write,
\exp(\lambda_0 - 1) = \sqrt{\frac{-\lambda_1}{\pi}} = \frac{1}{\sqrt{2\pi\sigma^2}}
Putting everything together into the equation below,
p(x) = \exp(\lambda_0 - 1)\exp\left(\lambda_1(x - \mu)^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x - \mu)^2}{2\sigma^2}\right)

This is the Gaussian distribution.
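A small sympy check of this result (assuming sympy can evaluate the Gaussian integrals symbolically): substituting λ1 = −1/(2σ²) and exp(λ0 − 1) = 1/√(2πσ²) into p(x) reproduces both constraints.

import sympy as sp

x = sp.symbols('x', real=True)
mu = sp.symbols('mu', real=True)
sigma = sp.symbols('sigma', positive=True)

lam1 = -1 / (2 * sigma**2)                       # the multiplier found above
norm = 1 / sp.sqrt(2 * sp.pi * sigma**2)         # exp(lambda_0 - 1)
p = norm * sp.exp(lam1 * (x - mu)**2)

print(sp.simplify(sp.integrate(p, (x, -sp.oo, sp.oo))))                 # -> 1
print(sp.simplify(sp.integrate((x - mu)**2 * p, (x, -sp.oo, sp.oo))))   # -> sigma**2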

14.6 Derivation 4 - mean


We have the Lagrangian,
J(p(x), \lambda_0, \lambda_1) = -\int_0^{+\infty} p(x)\ln p(x)\,dx + \lambda_0\left(\int_0^{+\infty} p(x)\,dx - 1\right) + \lambda_1\left(\int_0^{+\infty} x\,p(x)\,dx - \mu\right)

Now, taking the partial derivative with respect to p(x) and setting it to zero,
\frac{\delta J(p(x), \lambda_0, \lambda_1)}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \lambda_1 x = 0
That would give,

p(x) = \exp(\lambda_0 - 1)\exp(\lambda_1 x) \qquad (14.1)

Substituting this into the constraints and solving for the multipliers, the steps followed are:
1. Divide the two constraint equations.
2. Substitute the value of λ_1 into our first constraint.
3. Go back to the initial p(x) and substitute the values we obtained.
This gives the p.d.f. of an exponential distribution over [0, ∞):
p(x) = \exp(\lambda_0 - 1)\exp(\lambda_1 x) = \frac{1}{\mu}\exp\left(\frac{-x}{\mu}\right) \qquad (14.2)

14.7 Applications
14.7.1 Cryptography
Are your passwords strong? Making the key as random (high-entropy) as possible ensures that whoever is trying to decrypt your secrets has a very hard time.

Statistical Modelling: deciding which of your candidate models best describes your experimental results.

Constants: additionally, this principle can be used to estimate universal physical constants.
15. Max Entropy and Reinforcement Learning

Lagrange multiplier: Let us consider a function that we want to maximise or minimize,

f(x, y) is the function.
Let there be another function that is the constraint:

g(x, y) = c = constant

Entropy: Entropy is defined as a measure of the disorder of the information being processed.
Reinforcement Learning: Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximise the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

15.1 Maximum Entropy Principle


The principle of maximum entropy states that the probability distribution which best represents the
current state of knowledge about a system is the one with largest entropy, in the context of precisely
stated prior data.

The intuition of the principle of maximum entropy here is to find the distribution that has maximum entropy. In many RL algorithms, an agent may converge to a local optimum. Adding a maximum entropy term to the objective function enables the agent to search for a policy distribution with high entropy. As we defined earlier, in Maximum Entropy RL the aim is to learn the optimal policy that achieves both the highest cumulative reward and maximum entropy. Because the system has to keep the entropy high as well, it explores more, and the chance of avoiding convergence to a local optimum is higher.
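A minimal numerical sketch of the idea (not a full MaxEnt RL algorithm such as soft actor-critic): for one state with made-up Q-value estimates, a Boltzmann (softmax) policy with temperature α trades expected reward against entropy; small α is nearly greedy, large α keeps many ways of acting alive.

import numpy as np

def softmax_policy(q, alpha):
    """Boltzmann policy pi(a) proportional to exp(Q(a)/alpha); alpha > 0 is the temperature."""
    z = np.exp((q - q.max()) / alpha)        # subtract max for numerical stability
    return z / z.sum()

def entropy(pi):
    return -np.sum(pi * np.log(pi + 1e-12))

q = np.array([1.0, 0.9, 0.2])                # made-up action-value estimates for one state

for alpha in (0.05, 0.5, 2.0):
    pi = softmax_policy(q, alpha)
    # Entropy-regularized objective for this state: E_pi[Q] + alpha * H(pi)
    objective = float(pi @ q + alpha * entropy(pi))
    print(f"alpha={alpha:4.2f}  pi={np.round(pi, 3)}  H={entropy(pi):.3f}  objective={objective:.3f}")
# Small alpha -> nearly greedy (low entropy); larger alpha -> closer to uniform,
# i.e. more exploration and more distinct ways of solving the task are kept alive.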

15.2 Applying Entropy to Classical Reinforcement Learning

So, why would we expect MaxEnt RL to be robust to disturbances in the environment? Recall that

MaxEnt RL trains policies to not only maximize reward, but to do so while acting as randomly as
possible. In essence, the policy itself is injecting as much noise as possible into the environment, so
it gets to “practice” recovering from disturbances. Thus, if the change in dynamics appears like just
a disturbance in the original environment, our policy has already been trained on such data. Another
way of viewing MaxEnt RL is as learning many different ways of solving the task (Kappen '05). For example, consider a task in which we want a robot to push a white object to a green region. Standard RL tends to always take the shortest path to the goal, whereas MaxEnt RL takes many different paths to the goal. Now, let's imagine that we add a new obstacle (red blocks) that wasn't included during training. The policy learned by standard RL almost always collides with the obstacle, rarely reaching the goal. In contrast, the MaxEnt RL policy often chooses routes around the obstacle, continuing to reach the goal for a large fraction of trials.

15.3 Applications
15.3.1 An alternative method to look at the changes in the geometric structure of the price
series
The price of an asset varies in function of the available information. A simple change in the
information will be immediately priced in. The connected nature of financial markets needs the
adoption of a measure to catch the spatial and temporal dimensions. In this sense, Structural Entropy
helps us to make statements and perform computations in this scenario about situations involving
uncertainties.

15.3.2 Example 2
Your willingness to adjust your wardrobe to fit the new circumstances is directly related to how
“high-entropy” your clothing policy was.
If you had a high-entropy policy, you will quickly be able to adapt to the new city.
If you had a low-entropy clothing policy, then you may stubbornly hold on to your pre-existing
clothing patterns, and suffer as a result.
The key here is not only having a high-entropy policy in the moment, but ensuring that when
something like a move to a new city happens, the entropy will also be high. This would correspond
to not spending all your money on T-shirts, for example.

15.3.3 Example 3
Scientists attempting to optimize the rewards (fame, truth, societal/technological impact, etc) of
scientific discovery, were faced with an opportunity to either continue along their pre-existing lines
of research, or to adapt to the new paradigm.
Those with “high-entropy” research programs are more likely to adapt to a scientific program based
on a sun-centered universe, or natural-selection based development of an organism’s traits.
In contrast, those with low-entropy policies are more likely to continue with their pre-existing
programs to their detriment.

15.3.4 Gaming
Adapts to the opponent's strategy
Makes actions more unpredictable to help avoid being exploited by its opponent

Balances exploration and exploitation, which is crucial for success with insufficient information

15.3.5 Recommendation System


Recommend items to users based on their preferences and behavior.
Can learn a policy that is diverse and can handle uncertainty in the user’s preferences
Enables the system to recommend items that are more likely to be of interest to the user, even if they
have not explicitly expressed their interest in the item.

15.3.6 Finance
Can learn a policy that is adaptive to changing market conditions
A policy that balances exploration and exploitation, which is crucial for discovering profitable trading
strategies in the market.

15.3.7 Robotics
Object manipulation
Locomotion
Navigation
Eg: training a robot to walk or run.
In traditional RL (TRL), a robot would learn a policy that maximises the expected cumulative reward, which can lead to conservative and repetitive behaviours.
In contrast, by using maximum entropy RL (MERL), the robot can learn a policy that also maximises its entropy, which encourages more diverse and unpredictable movements.

15.3.8 Autonomous Driving


Intelligent control policies
Complex traffic scenarios
Eg: training a vehicle to navigate through a crowded intersection
Consider the case of an autonomous vehicle that needs to navigate through a busy intersection with
pedestrians and other vehicles.
Using MERL, the vehicle can learn a policy that maximizes its entropy, which enables it to explore a variety of actions, such as adjusting its speed, changing its lane, or stopping abruptly if necessary. This can help the vehicle adapt to the crowded intersection and avoid getting stuck in a local minimum where it repeats the same behaviour.
16. Geodesics

16.0.1 Geodesic on a sphere

Given two points A and B on the surface of a sphere, find the path from A to B that minimizes the
distance.
Now we know that the length of a curve between two points A and B is given by
L = \int_A^B ds
This is the functional which we want to minimize for this problem.
Now, using the Pythagorean theorem, we can write
ds = \sqrt{dx^2 + dy^2 + dz^2}
This is the distance formula in 3D space.

Now we have to transform the coordinates into spherical coordinates.

Here, in the above figure, x = R\cos\theta\sin\phi;\quad y = R\sin\theta\sin\phi;\quad z = R\cos\phi
where θ is the angle from the x axis and φ is the angle from the z axis.
So, to find dx, dy, and dz; we use chain rule

So, to find dx, dy and dz, we use the chain rule:
dx = \frac{\partial x}{\partial\theta}d\theta + \frac{\partial x}{\partial\phi}d\phi
dy = \frac{\partial y}{\partial\theta}d\theta + \frac{\partial y}{\partial\phi}d\phi
dz = \frac{\partial z}{\partial\theta}d\theta + \frac{\partial z}{\partial\phi}d\phi
dx = -R\sin\theta\sin\phi\,d\theta + R\cos\theta\cos\phi\,d\phi
dy = R\cos\theta\sin\phi\,d\theta + R\sin\theta\cos\phi\,d\phi
dz = -R\sin\phi\,d\phi
Now, putting these values into the expression for ds,
ds = \sqrt{(-R\sin\theta\sin\phi\,d\theta + R\cos\theta\cos\phi\,d\phi)^2 + (R\cos\theta\sin\phi\,d\theta + R\sin\theta\cos\phi\,d\phi)^2 + (-R\sin\phi\,d\phi)^2}
By simplifying the above equation, we get
ds = \sqrt{R^2\sin^2\phi\,(d\theta)^2 + R^2(d\phi)^2}
= R\,d\phi\sqrt{\sin^2\phi\left(\frac{d\theta}{d\phi}\right)^2 + 1}
= R\sqrt{1 + \sin^2\phi\,(\theta')^2}\;d\phi

Now, substituting the value of ds into the functional,
L = \int_{\phi_A}^{\phi_B} R\sqrt{\sin^2\phi\left(\frac{d\theta}{d\phi}\right)^2 + 1}\;d\phi
Now, applying the Euler-Lagrange equation to the functional
L = \int_{\phi_A}^{\phi_B} F(\phi, \theta, \theta')\,d\phi, \qquad\text{where } F = R\sqrt{1 + \sin^2\phi\,(\theta')^2}
Since F does not depend on θ explicitly, the Euler-Lagrange equation implies that \frac{\partial F}{\partial\theta'} is constant:
k = \frac{\theta'\sin^2\phi}{\sqrt{(\theta')^2\sin^2\phi + 1}}
From here we'll get
\theta' = \frac{d\theta}{d\phi} = \frac{k}{\sin\phi\sqrt{\sin^2\phi - k^2}}

So the integral will be
\theta = \int\frac{k}{\sin\phi\sqrt{\sin^2\phi - k^2}}\,d\phi
Now we use the substitution u = k\cot\phi. Here k < 1, because if k > 1 the quantity under the square root becomes negative, which would introduce an imaginary part.
From u, we know that
du = -k\csc^2\phi\,d\phi, \qquad \sin\phi = \frac{k}{\sqrt{u^2 + k^2}}
Simplifying, the integral becomes
\theta = -\int\frac{du}{\sqrt{1 - k^2 - u^2}}
Now, as we said, k is less than 1, so 1 - k^2 is a positive number and we can call it a^2:
\theta = -\int\frac{du}{\sqrt{a^2 - u^2}}
To evaluate this, we make another substitution, u = a\cos\psi. Simplifying this we get
\theta - \theta_0 = \arccos\left(\frac{u}{a}\right) \;\Longrightarrow\; k\cot\phi = \sqrt{1 - k^2}\,\cos(\theta - \theta_0)
This is the equation of the geodesic on the sphere, where the constants (equivalently β = k/\sqrt{1 - k^2} and θ_0 in the form θ - θ_0 = \arccos(\beta\cot\phi)) can be found from the two given points A and B.

16.0.2 Intuitive Understanding

It is the intersection of a sphere's surface and a plane passing through the sphere's center.
Equation of the great circle: we know that the equation of a plane passing through the origin is
Ax + By + Cz = 0
Now, this plane also intersects the sphere, so the intersection points should also satisfy the equation of the sphere:
x = R\cos\theta\sin\phi;\quad y = R\sin\theta\sin\phi;\quad z = R\cos\phi
Substituting this in the equation of the plane, the steps to follow are:
1. Cancel R (common term).
2. Divide both sides by \sin\phi.
3. Multiply and divide by \sqrt{A^2 + B^2}.
4. Consider a triangle with angle \theta_0, base A and perpendicular B.
We get
\sqrt{A^2 + B^2}\;\cos(\theta - \theta_0) = -C\cot\phi
\theta - \theta_0 = \arccos(\beta\cot\phi), \qquad\text{where } \beta = \frac{-C}{\sqrt{A^2 + B^2}}
which has the same form as the geodesic equation derived above, confirming that geodesics on a sphere are great circles.
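A small numerical sketch on the unit sphere (R = 1), with two arbitrarily chosen points: the great-circle distance is compared with the length of a naive path that linearly interpolates θ and φ, computed from ds = R√(dφ² + sin²φ dθ²); the great circle is never longer.

import numpy as np

R = 1.0
# Two points given as (theta, phi): theta measured from the x axis, phi from the z axis
A = (0.3, 1.2)
B = (2.1, 0.7)

def to_xyz(theta, phi):
    return np.array([R * np.cos(theta) * np.sin(phi),
                     R * np.sin(theta) * np.sin(phi),
                     R * np.cos(phi)])

# Great-circle (geodesic) distance: R times the angle between the two position vectors
u, v = to_xyz(*A), to_xyz(*B)
geodesic = R * np.arccos(np.clip(np.dot(u, v) / R**2, -1.0, 1.0))

# Length of the path that is "straight" in (theta, phi) coordinates,
# using ds = R * sqrt(dphi^2 + sin(phi)^2 * dtheta^2)
n = 100000
theta = np.linspace(A[0], B[0], n)
phi = np.linspace(A[1], B[1], n)
dtheta, dphi = np.diff(theta), np.diff(phi)
naive = np.sum(R * np.sqrt(dphi**2 + np.sin(phi[:-1])**2 * dtheta**2))

print("great-circle length :", geodesic)
print("naive path length   :", naive)     # always >= the great-circle length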

16.1 Applications
16.1.1 Gravitational Lensing
Gravity warps the fabric of space-time, and light traveling through this warped space follows a
curved path. This leads to an effect called gravitational lensing, where the image of a distant object
is distorted by the gravitational field of a massive object (such as a galaxy or black hole) located
between the observer and the source.

16.1.2 Robotics
The concept of the shortest path is used in robotics to plan the movements of robots and to control
their motion. This involves computing optimal paths and trajectories, avoiding obstacles and
collisions, and maintaining stability and balance. Further used in computer vision to define paths or
contours on images that minimize certain energy functions, such as the length or area of the path.
This is useful for a wide range of applications, such as image segmentation, object recognition, and
medical imaging.

16.1.3 Molecular Biology


Geodesics is used in biology to study the shape and structure of biological molecules, such as
DNA and proteins. This involves using geodesic techniques to analyze and model the geometry
and topology of molecular structures, and to understand their function and behavior. Also used
in material science to study the properties of surfaces and interfaces. In particular, the geodesic
curvature of a surface measures the deviation of the surface from a planar shape

16.1.4 Supply Chain Management


To optimize the flow of goods and services between suppliers, manufacturers, and customers. This
involves minimizing transportation costs, reducing inventory levels, and improving delivery times
and customer satisfaction.

16.1.5 Network Design


Designing the optimal layout of communication networks, transportation networks, or power grids.
This involves minimizing the length of cables or transmission lines, maximizing connectivity and
reliability, and balancing capacity and cost constraints
17. Derivation of n parameters

17.1 Introduction to multi-variable


17.1.1 Derivation
Consider the functional I, written as an integral over the parameter x:
I = \int_a^b f(x, y_1, y_2, \ldots, y_n, y_1', y_2', \ldots, y_n')\,dx
The given boundary conditions are:
y_i(x_1) = y_{i1}, \qquad y_i(x_2) = y_{i2}
By applying a small variation to each y_i(x):
z_i(x) = y_i(x) + \alpha\,\eta_i(x)
z_i'(x) = y_i'(x) + \alpha\,\eta_i'(x)
\eta_i(x_1) = \eta_i(x_2) = 0
I(\alpha) = \int_a^b f(x, z_1, z_2, \ldots, z_n, z_1', z_2', \ldots, z_n')\,dx
Here the z_i, and hence I, depend on α.
\frac{dI(\alpha)}{d\alpha} = \frac{d}{d\alpha}\int_a^b f(x, z_1, z_2, \ldots, z_n, z_1', z_2', \ldots, z_n')\,dx
For an extremizing path,
\frac{\partial I}{\partial\alpha}\bigg|_{\alpha = 0} = 0

Applying the chain rule to the obtained expression,
\frac{dI}{d\alpha} = \int_a^b \sum_i\left(\frac{\partial F}{\partial z_i}\frac{\partial z_i}{\partial\alpha} + \frac{\partial F}{\partial z_i'}\frac{\partial z_i'}{\partial\alpha}\right)dx = 0
\frac{\partial z_i}{\partial\alpha} = \eta_i(x), \qquad \frac{\partial z_i'}{\partial\alpha} = \eta_i'(x)
\frac{\partial I}{\partial\alpha} = \int_a^b \sum_i\left(\frac{\partial F}{\partial z_i}\,\eta_i(x) + \frac{\partial F}{\partial z_i'}\,\eta_i'(x)\right)dx
Integrating the second term by parts, and using \eta_i(x_1) = \eta_i(x_2) = 0, this becomes
\int_a^b \sum_i\left(\frac{\partial F}{\partial z_i} - \frac{d}{dx}\frac{\partial F}{\partial z_i'}\right)\eta_i\,dx
Setting \alpha = 0 (so that z_i = y_i) and using the fact that the \eta_i are arbitrary, the final expression will be:
\frac{\partial F}{\partial y_i} = \frac{d}{dx}\frac{\partial F}{\partial y_i'}
where i = 1, 2, \ldots, n.

17.1.2 Solving a Problem


Expression for two variables:
\frac{\partial f}{\partial x} - \frac{d}{dt}\left(\frac{\partial f}{\partial\dot{x}}\right) = 0
\frac{\partial f}{\partial y} - \frac{d}{dt}\left(\frac{\partial f}{\partial\dot{y}}\right) = 0
Determine the closed curve with a given fixed length that encloses the largest possible area

Given: the expression for the area enclosed by the curve:
A = \frac{1}{2}\oint_C\left(x\,dy - y\,dx\right)

Let D be an open, simply connected region with a boundary curve C that is a piece-wise smooth,
simple closed curve oriented counterclockwise.
Let F =< P, Q >

Let this be a vector field with component functions that have continuous partial derivatives on D.
Then,
\oint_C \mathbf{F}\cdot d\mathbf{r} = \oint_C P\,dx + Q\,dy = \iint_D\left(Q_x - P_y\right)dA
\oint_C P\,dx + Q\,dy = \oint_C \mathbf{F}\cdot\mathbf{T}\,ds

So, Green’s theorem verifies the equation for the area given.
If the curve is parameterised as x = x(t) and y = y(t), then it will modify the Area and Length
accordingly.
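As a quick numerical illustration of the area formula for a parameterized curve, here is a sketch that evaluates A = ½∮(x dy − y dx) for an ellipse x = a cos t, y = b sin t and compares the result with the known area πab; the semi-axes are arbitrary choices.

import numpy as np

a, b = 3.0, 2.0                            # semi-axes of the ellipse (arbitrary)
t = np.linspace(0.0, 2.0 * np.pi, 200001)
x, y = a * np.cos(t), b * np.sin(t)

# A = (1/2) * closed integral of (x dy - y dx), evaluated with the trapezoidal rule
dx_dt = np.gradient(x, t)
dy_dt = np.gradient(y, t)
area = 0.5 * np.trapz(x * dy_dt - y * dx_dt, t)

print("area from Green's theorem:", area)
print("pi * a * b               :", np.pi * a * b)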
18. Euler-Lagrange’s Take on Higher Variables

18.1 Calculus v/s Calculus of Variations


Two disciplines of mathematics, calculus and calculus of variations, are linked yet have different
focuses and applications. The study of rates of change and how objects change over time is the focus
of the mathematic branch known as calculus. In order to address problems involving optimization,
integration, and differentiation, it is concerned with the analysis of functions and their derivatives.
The study of optimizing functionals, which are mathematical objects that give a function a value,
is the focus of the discipline of mathematics known as calculus of variations. It is concerned with
identifying the function that, given certain restrictions, minimizes or maximizes a given functional.
There are numerous fields that use the calculus of variations, such as physics, engineering, and
economics, to find optimal solutions to problems that involve optimization of functionals.

Calculus and calculus of variations differ primarily in their scope. Calculus of variations deals with
the optimization of functionals, whereas calculus is concerned with the analysis of functions and their
derivatives. Calculus of variations is also used to identify the best solutions to problems involving
the optimization of functionals, while calculus is used to address problems involving optimization,
integration, and differentiation. Hence, whereas calculus is a fundamental tool in many branches
of science and mathematics, calculus of variations is a more specialized topic that is utilized in
particular contexts.

18.2 Higher Dimensions


Higher dimensional problems are those that involve optimizing or analyzing functions with more
than one independent variable in optimization and calculus of variations. To put it another way,
it is the study of functions, such as those in three-dimensional or higher-dimensional spaces, that
depend on a number of different factors. A functional, for instance, can be described as a map
from a space of functions to a scalar field in the calculus of variations. The functional may be
dependent on the derivatives of the function with respect to each of these variables if the func-
tions being optimized are functions of many variables, such as time and space. The challenge in

this situation is to optimize a function over a space of functions with numerous independent variables.

Similar to this, when a function is specified over a space with more than two dimensions, the
challenge of determining the maximum or minimum of a function can be expanded to higher
dimensions. For instance, a function can be constructed over a space of high-dimensional vectors in
machine learning, where each component of the vector corresponds to a different attribute of the
data. Finding the values of the variables that reduce the discrepancy between the model predictions
and the actual data is thus the goal of the optimization problem.
In conclusion, the study of functions with many independent variables, which can be represented
as points in spaces with more than two dimensions, is referred to as having greater dimensions in
optimization and calculus of variations.

18.3 Euler-Lagrange in Higher Variables


To determine the functions that minimize or maximize a functional, the Euler-Lagrange equation
is a key tool in the calculus of variations. The Euler-Lagrange equation, which can be represented
as points in spaces with more than two dimensions, is an extension of this equation to functions of
many independent variables in higher variables. The Euler-Lagrange equation takes the form of a
partial differential equation in the context of calculus of variations in higher variables, involving
derivatives of the function with respect to each independent variable. By taking into account the
functional’s variation with respect to each independent variable and setting it to zero, the equation
can be constructed.

Subject to certain boundary conditions, the Euler-Lagrange equation’s solution provides the function
that minimizes or maximizes the functional. This is a crucial tool that is used to identify the best
answers to issues involving the optimization of functionals over spaces with multiple variables in
many branches of research and engineering. For instance, in physics, the action of the system is the
function to be minimized, and the Euler-Lagrange equation is used to determine the equations of
motion for a system of particles. In economics, the production function for a firm—which explains
the connection between inputs and outputs—is derived using the Euler-Lagrange equation.

18.4 Proof


Consider a functional of the form:

 
I[y] = \int L\left(x, y, \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n}\right) dx_1\,dx_2\cdots dx_n

where y = y(x1 , x2 , ..., xn ) is a function of n independent variables and L is a Lagrangian that depends
on y and its partial derivatives with respect to each of the independent variables. We want to find the
function y that minimizes or maximizes the functional I[y]. To derive the Euler-Lagrange equation,
we consider a small variation of the function y, given by:

y (x1 , x2 , . . . , xn ) → y (x1 , x2 , . . . , xn ) + εη (x1 , x2 , . . . , xn )



where ε is a small parameter and η(x1 , x2 , ..., xn ) is an arbitrary function that vanishes at the boundary
of the integration domain. The variation of the functional I[y] under this perturbation is given by:

"
n
∂ (εη)
Z
δ I[y] = I[y + εη] − I[y] = Ly εη + ∑ Lyi +
i=1 ∂ xi
O ε 2

dx1 dx2 . . . dxn

where Ly and Lyi denote the partial derivatives of the Lagrangian L with respect to y and ∂ y/∂ xi
respectively. Integrating by parts and neglecting the boundary terms, we obtain:

" #
n  
∂L ∂ ∂L
Z
εη dx1 dx2 . . . dxn + O ε 2

δ I[y] = ∑ −
i=1 ∂ yi ∂ xi ∂ yi

where we have used the product rule for partial derivatives and the fact that η vanishes at the
boundary. Since the variation εη is arbitrary, the condition for y to be an extremum of I[y] is that the
integrand in the above expression vanishes:

∂L ∂ ∂L
− = 0, i = 1, 2, . . . , n
∂ yi ∂ xi ∂ yi

These n equations are the Euler-Lagrange equations in higher variables, which must be satisfied by
any function that minimizes or maximizes the functional I[y].
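A small sympy sketch of this equation (assuming sympy is available) for the Dirichlet energy L = ½[(∂u/∂x)² + (∂u/∂y)²] of a function u(x, y): applying the equation above with u in place of y produces Laplace's equation u_xx + u_yy = 0.

import sympy as sp

x, y = sp.symbols('x y')
u = sp.Function('u')(x, y)
p, q = sp.symbols('p q')                 # placeholders for u_x and u_y

L = sp.Rational(1, 2) * (p**2 + q**2)    # Dirichlet energy density

# dL/du = 0 here; the flux terms use dL/dp and dL/dq evaluated at p = u_x, q = u_y
dLdp = sp.diff(L, p).subs({p: u.diff(x), q: u.diff(y)})
dLdq = sp.diff(L, q).subs({p: u.diff(x), q: u.diff(y)})
el = 0 - (sp.diff(dLdp, x) + sp.diff(dLdq, y))

print(sp.simplify(el))
# -> -(u_xx + u_yy); setting this to zero gives Laplace's equation.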

18.5 Area Under the Curve


To derive the formula for the area enclosed by a curve using Green’s theorem, we start with the
two-dimensional form of the theorem:

ZZ  
∂Q ∂P
I
⃗F · d⃗r = − dA
C R ∂x ∂y

where \vec{F} is a vector field, C is a simple closed curve that encloses a region R, d\vec{r} is an infinitesimal tangent vector to C, and dA is an infinitesimal area element in the plane. P and Q are the components of \vec{F}. To find the area enclosed by a curve, we choose \vec{F} = \left(-\frac{1}{2}y, \frac{1}{2}x\right), which has P = -\frac{1}{2}y and Q = \frac{1}{2}x. For this vector field,
\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} = \frac{1}{2} + \frac{1}{2} = 1
so the right-hand side of Green's theorem reduces to the area of the enclosed region:
\iint_R\left(\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y}\right)dA = \iint_R dA = A
Therefore, the line integral on the left-hand side equals the enclosed area:
A = \oint_C \vec{F}\cdot d\vec{r} = \frac{1}{2}\oint_C\left(x\,dy - y\,dx\right)

Since C is a simple closed curve, it can be parameterized by a single variable, say t. For a circle of radius r, take x = r\cos t and y = r\sin t with 0 \le t \le 2\pi, so that r'(t) = (-r\sin t, r\cos t). Then the line integral can be written as:
\oint_C \vec{F}\cdot d\vec{r} = \int_0^{2\pi}\vec{F}(r(t))\cdot r'(t)\,dt = \int_0^{2\pi}\left(-\frac{1}{2}r\sin t,\ \frac{1}{2}r\cos t\right)\cdot\left(-r\sin t,\ r\cos t\right)dt = \int_0^{2\pi}\frac{1}{2}r^2\,dt = \pi r^2
Hence
A = \frac{1}{2}\oint_C\left(x\,dy - y\,dx\right) = \oint_C \vec{F}\cdot d\vec{r} = \pi r^2
which is the familiar area of a circle of radius r.

18.6 Applications
The Euler-Lagrange equation in higher variables and dimensions has several applications in mathe-
matics and physics. Here are a few examples:

1. Field Theory: The field variables and their derivatives with regard to space and time determine
the Lagrangian density in field theory. The equations of motion for the field variables can be derived
from the Euler-Lagrange equation. For instance, the Euler-Lagrange equation produces Maxwell’s
equations for the fields in the electromagnetic field theory, where the Lagrangian density is propor-
tional to the square of the electric and magnetic fields.

2. Classical Mechanics: The motion equations for physical systems with numerous degrees of
freedom can be derived from the Euler-Lagrange equation. For instance, the position of each particle
in a system of n particles in three dimensions is a function of time and three spatial variables, for a
total of 4n independent variables. The positions and velocities of the particles determine the system’s
Lagrangian, and the Euler-Lagrange equation produces 4n linked second-order partial differential
equations that describe the motion of the particles.

3. Optimal Control: Finding a control strategy that minimizes a cost function while taking into
account restrictions is the aim of optimal control theory. The cost function is a function of the control
policy and the state variables, while the control policy is often a function of time and other variables.
The essential conditions for the control policy to be optimal can be derived from the Euler-Lagrange
equation in the form of a system of partial differential equations.

4. Geometric Analysis: Geometric analysis uses the Euler-Lagrange equation, notably when looking
at minimum surfaces and harmonic maps. A harmonic map is a map between two Riemannian
manifolds that reduces its energy, whereas a minimum surface is a surface that minimizes its area
according to specific requirements. The equations governing the behavior of various surfaces and
maps can be derived from the Euler-Lagrange equation.
19. Multi-Independent Variables

19.1 What are Multi-Independent Variables?


A system that depends on multiple factors or parameters is described by a set of variables called
multi-independent variables, commonly referred to as multivariables or multidimensional variables.
In other words, we have a multivariable system when there are two or more independent variables
that influence the behavior of a system. For instance, in physics, the three independent variables x, y,
and z can be used to describe the location of an object in three dimensions. In statistics, a dependent
variable’s behavior may be influenced by a number of independent variables, including age, income,
and educational attainment.

Since multivariable systems frequently feature intricate relationships and interactions between
the independent variables, they can be difficult to study and improve. But in many disciplines,
including engineering, physics, economics, and biology, where systems frequently depend on several
factors, they are crucial. Multivariable calculus, linear algebra, and optimization techniques like
gradient descent, stochastic gradient descent, and convex optimization are all used in the study of
multivariable systems. We can obtain insights into complex processes and create more effective
systems by comprehending and optimizing multivariable systems.

19.2 Derivation of the E-L Equation


To determine a functional’s extremum, the Euler-Lagrange (E-L) equation is a crucial tool in the
calculus of variations. An integral of a Lagrangian, which is dependent on one or more independent
variables and their derivatives, is often used to express the functional. The E-L equation for a
functional that depends on two independent variables is derived as follows. Consider a functional of
the form:

S[y] = \int_{x_1}^{x_2}\!\int_{y_1}^{y_2} L\left(x, y, \frac{\partial y}{\partial x}, \frac{\partial y}{\partial y}\right) dy\,dx

where,
L is the Lagrangian, x and y are the independent variables
∂y ∂y
∂ x and ∂ y are the partial derivatives of y with respect to x and y, respectively.

To find the extremum of the functional, we use the calculus of variations to vary the function y(x)
while keeping the endpoints fixed. We define a variation of y(x) as:

y(x) → y(x) + εη(x)

where,
ε is a small parameter
η(x) is an arbitrary function that satisfies η(x1 ) = η(x2 ) = 0.

The variation of the functional is then:

S[y + \varepsilon\eta] = \int_{x_1}^{x_2}\!\int_{y_1}^{y_2} L\left(x,\ y + \varepsilon\eta,\ \frac{\partial y}{\partial x} + \varepsilon\frac{\partial\eta}{\partial x},\ \frac{\partial y}{\partial y} + \varepsilon\frac{\partial\eta}{\partial y}\right) dy\,dx
= \int_{x_1}^{x_2}\!\int_{y_1}^{y_2} L\left(x, y, \frac{\partial y}{\partial x}, \frac{\partial y}{\partial y}\right) dy\,dx + \varepsilon\int_{x_1}^{x_2}\!\int_{y_1}^{y_2}\left(\frac{\partial L}{\partial y}\eta + \frac{\partial L}{\partial y'}\frac{\partial\eta}{\partial x} + \frac{\partial L}{\partial y''}\frac{\partial\eta}{\partial y}\right) dy\,dx + O\!\left(\varepsilon^2\right)

where y' = \frac{\partial y}{\partial x} and y'' = \frac{\partial y}{\partial y}.

The first term on the right-hand side is the original functional, while the second term is the variation
of the functional. The terms of order ε 2 and higher are neglected. To find the extremum of the
functional, we require that the variation of the functional vanishes for all possible variations η(x)
that satisfy the boundary conditions. This leads to the Euler-Lagrange equation:

\frac{\partial L}{\partial y} - \frac{\partial}{\partial x}\left(\frac{\partial L}{\partial y'}\right) - \frac{\partial}{\partial y}\left(\frac{\partial L}{\partial y''}\right) = 0

This equation must be satisfied for all values of x and y in the domain of the functional. The E-L
equation gives the necessary condition for the function y(x) to be an extremum of the functional.

19.3 Principle of Least Action


The principle of least action, commonly referred to as Hamilton's principle, states that the motion of a physical system is determined by the path of least (more precisely, stationary) action. In other words, the path a system follows between two points is the one that minimizes the action, a numerical value that characterizes the system's motion. The action is defined as the integral of the Lagrangian, a function that describes the energy and motion of the system. According to the concept of least action, the system will follow the

path that minimizes the action, which is the integral of the Lagrangian. The Lagrangian is defined as
the difference between the kinetic energy and the potential energy of the system.

For example,

A beam of light traveling through a medium is an illustration of the concept of least action. The path
used by the light is the one that cuts down on the amount of time needed to get from one place to
another, which is the same as cutting down on the amount of action. An additional illustration is the
swing of a basic pendulum. The difference between the pendulum’s kinetic and potential energies
determines the path the pendulum will take in order to minimize the action. The pendulum’s path
minimizes the action, which is the integral of the Lagrangian equation.

The least action that can be taken will decide the motion of a physical system, according to the
principle of least action, which is a fundamental tenet of physics. The system follows the path that
minimizes the action, which is the integral of the Lagrangian.

19.4 Derivation of the Wave Equation


The wave equation is a partial differential equation that describes the propagation of waves. It
can be derived using the principles of wave motion and the laws of conservation of energy and
momentum.Consider a small element of a string, with length dx and tension T. Let u(x,t) be the
transverse displacement of the element at position x and time t, and let v(x,t) be the transverse
velocity of the element. The kinetic energy of the element is given by:
\frac{1}{2}\rho\,dx\left(\frac{\partial u}{\partial t}\right)^2

where,
ρ is the linear mass density of the string. The potential energy of the element is given by:
\frac{1}{2}T\,dx\left(\frac{\partial u}{\partial x}\right)^2

where,
the factor of 1/2 comes from assuming that the element is symmetric about its center. The total
energy of the element is the sum of its kinetic and potential energies:

E = \frac{1}{2}\rho\,dx\left(\frac{\partial u}{\partial t}\right)^2 + \frac{1}{2}T\,dx\left(\frac{\partial u}{\partial x}\right)^2

The principle of conservation of energy states that the total energy of the system remains constant.
Therefore, the rate of change of energy with respect to time must be zero:
\frac{\partial E}{\partial t} = \rho\,dx\left(\frac{\partial u}{\partial t}\right)\left(\frac{\partial^2 u}{\partial t^2}\right) + T\left(\frac{\partial u}{\partial x}\right)\left(\frac{\partial^2 u}{\partial x^2}\right) = 0

Using the principle of conservation of momentum, the force acting on the element is given by:

F = T\frac{\partial^2 u}{\partial x^2}\,dx
The principle of conservation of momentum states that the rate of change of momentum with respect
to time must be equal to the force acting on the system:

\frac{\partial}{\partial t}\left(\rho\,dx\,\frac{\partial u}{\partial t}\right) = T\frac{\partial^2 u}{\partial x^2}\,dx

Simplifying this equation and dividing by dx, we get

\frac{\partial^2 u}{\partial t^2} = \frac{T}{\rho}\frac{\partial^2 u}{\partial x^2}

This is the wave equation, which describes the propagation of waves in a medium with tension T
and linear mass density ρ.
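A small sympy check (assuming sympy is available) with the trial solution u(x, t) = sin(kx − ωt): substituting it into the wave equation shows it is a solution exactly when ω² = (T/ρ)k², i.e. when the wave speed is √(T/ρ).

import sympy as sp

x, t, k, omega = sp.symbols('x t k omega', real=True)
T, rho = sp.symbols('T rho', positive=True)

u = sp.sin(k * x - omega * t)            # trial travelling-wave solution

residual = sp.simplify(sp.diff(u, t, 2) - (T / rho) * sp.diff(u, x, 2))
print(residual)                          # proportional to (T*k**2/rho - omega**2)

# The residual vanishes for all x and t exactly when omega**2 == (T/rho)*k**2
print(sp.simplify(residual.subs(omega, sp.sqrt(T / rho) * k)))   # -> 0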

The diffusion equation and the wave equation have a strong relationship. In fact, if the wave speed
is assumed to be infinitely tiny, the diffusion equation can be obtained from the wave equation.
Diffusion equation, which describes the dispersion of particles or energy in a medium, follows from
this. A function called the Lagrangian characterizes a system’s motion in terms of its coordinates and
velocities. The notion of least action is strongly related to this crucial idea in classical mechanics. A
system’s equations of motion can be derived from the Lagrangian and then solved using the wave
equation or the diffusion equation.
Physics and engineering frequently use multiple independent variables to explain complicated
systems. For instance, the Navier-Stokes equations used in fluid dynamics incorporate multiple
independent variables to explain fluid motion. Similar to this, several independent variables can
be used to explain a system’s entropy in thermodynamics. As a result, it can be said that the wave
equation and the diffusion equation are significant partial differential equations in both physics and
engineering. With the use of the Lagrangian, one can generate equations of motion in classical
mechanics that can subsequently be solved with the wave equation or the diffusion equation. Complex
systems are frequently described by multiple independent variables, which can be combined with
these equations to simulate real-world processes.
20. The Hamiltonian

The Hamiltonian Equation


The Lagrangian multipliers approach of solving constrained optimization problems uses a set of equa-
tions called the Hamiltonian equations. By adding extra variables known as Lagrange multipliers, the
Lagrangian multiplier approach is a technique used to optimize a function subject to restrictions. The
initial objective function, the sum of each Lagrange multiplier and its related constraint, and their
products are together referred to as the Hamiltonian function in optimization. The partial derivatives
of the Hamiltonian function with respect to the initial variables and the Lagrange multipliers are
then used to obtain the Hamiltonian equations.

In particular, the Hamiltonian equations for optimization take the form:

\frac{\partial H}{\partial x} = \frac{\partial f}{\partial x} + \lambda\frac{\partial g}{\partial x}
\frac{\partial H}{\partial\lambda} = g(x)
where,
H is the Hamiltonian function
f is the objective function
g(x) are the constraints
λ is the Lagrange multiplier
x is the original variable.

These equations are used to find the values of x and λ that satisfy the necessary conditions for
optimality, known as the Karush-Kuhn-Tucker (KKT) conditions. Solving the Hamiltonian equations
is an important step in finding the optimal solution to a constrained optimization problem using the
method of Lagrangian multipliers.

20.1 Comparing The Hamiltonian, Lagrange & Schrödinger Equations


Despite having various uses in optimization, the Hamiltonian, Lagrange, and Schrödinger equations
all have significant roles to play in physics and engineering. In constrained optimization problems,
the values of the initial variables and the Lagrange multipliers that satisfy the prerequisites for
optimality are found using the Hamiltonian equation. The Lagrange equation is used to find the best
pathways for motion and to construct the equations of motion for a system. In order to simulate
decision-making under restrictions, it is frequently utilized in economics and finance. In order to
overcome combinatorial optimization issues and improve quantum algorithms, the Schrödinger
equation is employed in quantum optimization.

The domain in which these equations are applied is one of their fundamental distinctions. The
Lagrange equation has applications in both classical mechanics and optimization in economics
and finance, whereas the Hamiltonian equation is mostly employed in classical mechanics and
engineering optimization. On the other hand, quantum optimization—a relatively young area that
blends optimization with quantum computing—uses the Schrödinger equation. The kinds of issues
that these equations are used to solve also differ. Constrained optimization issues are resolved using
the Hamiltonian equation, differential equation issues are resolved using the Lagrange equation,
and combinatorial and quantum algorithm optimization issues are resolved using the Schrödinger
equation. These equations have different mathematical forms and require different techniques to
solve, making them suitable for different types of optimization problems.

20.1.1 Derivation
Suppose we have a constrained optimization problem of the form

min f (x) subject to g(x) = 0

We can solve this problem using the method of Lagrange multipliers, where we define the Lagrangian
function as:

L (x, λ ) = f (x) + λ g(x)

The Hamiltonian function is defined as:

\mathcal{H}(x, \lambda, p) = \lambda g(x) - f(x) + p^{T}\dot{x}

where,
p is a vector of Lagrange multipliers
ẋ is the derivative of x with respect to time

To derive the Hamiltonian equations, we first take the partial derivatives of the Hamiltonian function
with respect to x, λ , and p:

∂H/∂x = −∂L/∂x = −∇f(x)

∂H/∂λ = g(x)

∂H/∂p = ẋ

Next, we use the chain rule to find the time derivative of each of these partial derivatives:

ẋ = d/dt(−∇f(x)) = −∇²f(x)ẋ − ∇f(x)λ̇

λ̇ = d/dt(g(x)) = ∇g(x)ẋ

ṗ = d/dt(ẋ) = −∇²f(x)p − ∇f(x) − ∇g(x)λ

These are the Hamiltonian equations, which can be used to solve for the values of x, λ , and p that
satisfy the necessary conditions for optimality.
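
As a numerical sanity check (again an illustrative sketch with a made-up objective and constraint, not part of the notes), a small constrained problem can be solved with SciPy and the stationarity condition ∇f + λ∇g = 0 verified at the reported solution:

import numpy as np
from scipy.optimize import minimize

# Hypothetical toy problem: minimize f(x, y) = (x - 2)**2 + (y - 1)**2
# subject to g(x, y) = x + 2*y - 2 = 0.
f = lambda z: (z[0] - 2)**2 + (z[1] - 1)**2
g = lambda z: z[0] + 2*z[1] - 2.0

res = minimize(f, x0=[0.0, 0.0], method="SLSQP",
               constraints=[{"type": "eq", "fun": g}])
x, y = res.x

# Verify the necessary condition grad f + lam * grad g = 0 at the optimum.
grad_f = np.array([2*(x - 2), 2*(y - 1)])
grad_g = np.array([1.0, 2.0])
lam = -grad_f[0] / grad_g[0]             # infer the multiplier from the first component
print(res.x, lam, grad_f + lam*grad_g)   # residual should be close to zero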

20.2 Applications
Hamiltonian equations are a crucial tool for understanding the behavior of dynamic systems in a
variety of disciplines, such as optimization, mechanics, physics, and finance. From the standpoint of
optimization, a few examples of Hamiltonian equation applications are the following:

1. Control Theory: Hamiltonian equations are employed in control theory to describe the dynamics
of control systems. Optimal control theory uses the Hamiltonian formalism to derive necessary
conditions for optimal control problems. Pontryagin's Maximum Principle, which is frequently applied
in control theory to identify the optimal control policy, follows from these conditions.

2. Mechanics: In mechanics, Hamiltonian equations describe the motion of particles in a system.
The equations of motion of a particle in a conservative force field can be derived using the
Hamiltonian formalism. This approach is helpful in solving optimization problems such as finding the
lowest-energy path between two points or locating a particle's equilibrium position in a potential well.

3. Finance: Hamiltonian equations are employed in finance to simulate the behavior of financial
markets. In this setting the Hamiltonian reflects the overall value of a portfolio of financial assets,
and the equations of motion describe its evolution. The Hamiltonian formalism offers a method for
evaluating the predictability and stability of financial markets as well as for optimizing investment
strategies.

4. Quantum Mechanics: Hamiltonian equations are frequently used in quantum mechanics to explain
how quantum states change over time. The Hamiltonian operator expresses the energy of a quantum
system, and the equations of motion define how that energy evolves over time. These equations are
used to solve quantum optimization problems such as finding a quantum system's ground state or
designing quantum algorithms.

5. Harmonic Oscillator: The harmonic oscillator is a ubiquitous model system in physics and
engineering, and Hamiltonian equations are often used to describe its behavior. In optimization
problems, the Hamiltonian formalism is used to analyze the stability and predictability of the system
and to determine the optimal control strategy for manipulating the oscillator's motion; a minimal
numerical sketch follows below.
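
Below is the minimal numerical sketch referred to in item 5 (an illustration added here, not from the notes): Hamilton's equations for the harmonic oscillator, q̇ = ∂H/∂p = p/m and ṗ = −∂H/∂q = −mω²q, integrated with the semi-implicit (symplectic) Euler method. The parameter values are arbitrary.

import numpy as np

# Harmonic oscillator: H = p**2/(2*m) + 0.5*m*w**2*q**2
m, w = 1.0, 2.0          # mass and angular frequency (arbitrary toy values)
dt, steps = 1e-3, 10000  # time step and number of steps
q, p = 1.0, 0.0          # initial position and momentum

energies = []
for _ in range(steps):
    p -= dt * m * w**2 * q               # dp/dt = -dH/dq
    q += dt * p / m                      # dq/dt = +dH/dp (uses the updated p)
    energies.append(p**2/(2*m) + 0.5*m*w**2*q**2)

print(energies[0], energies[-1])         # the energy drifts only slightly

The semi-implicit update is chosen because it respects the symplectic structure of Hamilton's equations, so the energy oscillates within a small band instead of drifting over long integrations.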

In summary, Hamiltonian equations offer an effective tool for understanding the behavior of dynamic
systems and improving their performance. The Hamiltonian formalism provides a framework for
describing the evolution of a system in terms of its energy, while the equations of motion provide a
way to assess the system's stability and predictability. Hamiltonian equations have numerous uses
beyond optimization, from finance and quantum mechanics to control theory and mechanics, and in
many disciplines optimization problems can be formulated and solved with them, leading to more
effective and efficient system design and control.
21. Hamilton-Jacobi Equation

21.1 Recap

Recalling the Hamiltonian of the harmonic oscillator: H = (1/(2m))(p² + m²ω²q²) = E,
where ω² = k/m.

21.2 Introduction
The Hamilton-Jacobi equation is a partial differential equation that plays an important role in classical
mechanics and mathematical physics. It is named after the mathematicians William Rowan Hamilton
and Carl Gustav Jacob Jacobi, who independently discovered the equation in the mid-19th century.

The Hamilton-Jacobi equation describes the evolution of a classical mechanical system in terms of
a so-called "action" function, which is a mathematical function that describes the dynamics of the
system. The equation is given by:
∂S/∂t + H(q, ∂S/∂q, t) = 0

where S(q, t) is the action function, H(q, p, t) is the Hamiltonian function of the system, and ∂S/∂q is
the gradient of the action function with respect to the position variable q.

The Hamilton-Jacobi equation is a central tool in classical mechanics for solving the equations
of motion of a system, and it provides a way to compute the trajectory of a system by extremizing
the action function. The equation also has important applications in quantum mechanics, where it
plays a role in the development of the path integral formulation of quantum mechanics.

21.3 Derivation
Consider a classical mechanical system described by a Hamiltonian H(q, p, t), where,

q = (q1, q2, . . . , qn) represents the generalized coordinates,
p = (p1, p2, . . . , pn) represents the conjugate momenta, and
t represents time.

We define a function called Hamilton's characteristic function, denoted by W(q, S, t), where
S = S(q, t) is an unknown function of the generalized coordinates and time, which we seek to find.
The characteristic function W has the following form:

W(q, S, t) = S(q, t) − Et

where,
S(q, t) is Hamilton's principal function, or action function, which depends on the generalized
coordinates and time, and E is a constant known as the characteristic energy (so the term Et grows
linearly in time).

We can calculate the partial derivatives of W with respect to q and t as follows:

∂ W/∂ q = ∂ S/∂ q − E∂ t/∂ q

∂ W/∂ t = ∂ S/∂ t − E
We can use the canonical equations of Hamiltonian mechanics to express the partial derivatives of S
with respect to q and t in terms of the generalized coordinates, momenta, and the Hamiltonian H.
The canonical equations are:

∂S/∂t = −H

∂S/∂qi = pi for i = 1, 2, . . . , n

where,
pi represents the conjugate momentum corresponding to the generalized coordinate qi. (Since
S = S(q, t) does not depend on the momenta, only these two relations are needed.)

Substituting the partial derivatives of S from the canonical equations into the expressions for the
partial derivatives of W, we get:

∂W/∂q = p − E ∂t/∂q
∂W/∂t = −H − E

Since we are seeking a solution for S, we set the partial derivatives of W with respect to q and t to
zero:

∂W/∂q = 0 ⇒ p = E ∂t/∂q
∂W/∂t = 0 ⇒ H + E = 0
The second equation above gives us the Hamilton-Jacobi equation in its standard form:

H(q, ∂ S/∂ q,t) + ∂ S/∂t = 0


or equivalently,
∂ S/∂t + H(q, ∇S) = 0
where,
∇S = (∂ S/∂ q1 , ∂ S/∂ q2 , . . . , ∂ S/∂ qn ) is the gradient of S with respect to the generalized
coordinates q, and we have used the fact that p = ∂ S/∂ q from the canonical equations.

So, the Hamilton-Jacobi equation is a partial differential equation that relates the Hamilton’s principal
function S to the Hamiltonian H of a classical mechanical system. Solving the Hamilton-Jacobi
equation allows us to find the action function S, which in turn provides a complete description of the
motion of the system in terms of the generalized coordinates and the characteristic energy E.

21.4 Harmonic Oscillator


Recall, we have: H = (1/(2m))(p² + m²ω²q²)
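
As a short check (a sketch added here, not part of the notes), the separable ansatz S(q, t) = W(q) − Et with dW/dq = √(2mE − m²ω²q²) can be substituted into the Hamilton-Jacobi equation ∂S/∂t + H(q, ∂S/∂q) = 0 for this Hamiltonian; SymPy confirms that the equation is satisfied:

import sympy as sp

# Harmonic oscillator: H = (p**2 + m**2*w**2*q**2) / (2*m)
q, t, m, w, E = sp.symbols("q t m w E", positive=True)

# Separable ansatz S(q, t) = W(q) - E*t, expressed through its derivatives:
dS_dq = sp.sqrt(2*m*E - m**2*w**2*q**2)   # dW/dq for a conservative system
dS_dt = -E                                # from the -E*t term

H = (dS_dq**2 + m**2*w**2*q**2) / (2*m)   # Hamiltonian with p replaced by dS/dq
print(sp.simplify(dS_dt + H))             # prints 0: the HJE is satisfied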

21.5 Applications
21.5.1 Predicting Traffic Flow
1. Assume current traffic density as the initial scalar field.
2. Input factors -> traffic speed, volume, and road geometry.
3. Predicts traffic density and congestion

21.5.2 Computer vision


1. Often used for image segmentation
2. Input factors -> image gradients, edge strength, and texture information
3. Determines foreground and background segmentation of image

21.5.3 Options Pricing


1. Option pricing and portfolio optimization.
2. Input factors → volatility, expiration time and interest rate.
3. Determine options pricing

21.5.4 Mimicking HJE


u: initial conditions for the scalar field
HJE parameters: D, dt, niter (time duration)
Gradient: the gradient of a scalar field is a vector that points in the direction of the steepest increase
in the field, and its magnitude gives the rate of that increase.
Front: the boundary between the foreground and background regions in an image.
Laplacian: the curvature of the front.
Diffusion: used to smoothen the front.

The code uses the HJE to evolve a curve that separates different regions in an image based on the
properties of the image.
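
A rough Python sketch of this idea is given below. The parameter names u, D, dt, and niter follow the notes, but the specific update rule (move the front at speed D along its gradient and add a small diffusion term with strength eps) and the toy image are a simplification added here, not the actual course code.

import numpy as np

def evolve_front(u, D=1.0, dt=0.1, niter=100, eps=0.1):
    """Evolve a scalar field u with an HJE-like update:
    u_t = -D*|grad u| + eps*Laplacian(u)."""
    u = u.astype(float).copy()
    for _ in range(niter):
        gy, gx = np.gradient(u)                   # gradient of the scalar field
        grad_mag = np.sqrt(gx**2 + gy**2)         # front speed term |grad u|
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4*u)   # discrete Laplacian (periodic boundaries)
        u += dt * (-D * grad_mag + eps * lap)     # HJE-style evolution step
    return u

# Toy "image": a bright square on a dark background; thresholding the evolved
# field at 0.5 gives a crude foreground/background split.
u0 = np.zeros((64, 64))
u0[20:44, 20:44] = 1.0
segmented = evolve_front(u0, niter=50) > 0.5
print(segmented.sum(), "foreground pixels")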
