
Ballistic Missile Maneuver Penetration Based on Reinforcement

Learning
Chaojie Yang, Jiang Wu, Guoqing Liu, Yuncan Zhang

Abstract—Ballistic missiles, as the main weapon for long-range precision strikes, reflect a country's level of military development and strategic capability. This paper focuses on the midcourse penetration phase of ballistic missile flight. Assuming that the interceptor missile uses a proportional guidance strategy, a reinforcement learning method is used to train the maneuver policy. The method avoids the need of traditional control-theoretic approaches to establish a precise mathematical model of the controlled object, and thereby reduces the difficulty of solving the performance model for an optimal analytical solution. Discretization of the state space reduces the action space and improves learning efficiency. Finally, simulation shows that reinforcement learning can greatly increase the miss distance in missile maneuver penetration.

I. INTRODUCTION

Ballistic missiles have the characteristics of long-range strike, fast delivery of firepower, and strong damage capability. They are the core force for achieving precise strikes from outside the defense zone, and have become one of the most threatening offensive weapons in modern war [1]. The United States and Russia are the leaders in ballistic missile development. They have been tirelessly promoting it in recent years, and their level of development represents the state of the art [2].

Because of the long inertial flight time in the missile's midcourse phase, the trajectory is relatively fixed and easy to intercept. Midcourse defense has therefore become the focus of current deployment and development of missile defense systems. Raising the penetration ability of ballistic missiles in the midcourse phase is an important means of preserving their survivability and combat effectiveness. Maneuvering penetration technology has become an important development direction in penetration technology by virtue of its dual features of anti-detection and anti-interception [3].

Existing maneuvering penetration strategies are mainly divided into procedural maneuvers (such as square-wave and sine-wave maneuvers) and optimization methods. Procedural maneuvers lack flexibility: they cannot adjust the strategy in time according to the actual battlefield situation and have difficulty coping with increasingly advanced interception and defense systems. Optimization methods must solve for control commands based on accurate mathematical models and index functions, but an exact mathematical model is difficult to obtain, and when the model is complicated an analytical solution cannot be found.

Current missile penetration strategies therefore lack autonomy and intelligence. New approaches such as machine learning and other artificial intelligence methods are rarely applied to missiles, yet these intelligent methods have been studied extensively and deeply for autonomous air combat of drones, with very good results in both simulation and practice [4-7]. Nicholas Ernest et al. of the University of Cincinnati used a genetic fuzzy tree method for air combat decision-making. This provides guidance and lessons for research on intelligent missile strategies, and such methods can be applied to the real-time active maneuvering of high-speed missiles.

This paper focuses on the midcourse penetration process of ballistic missiles and builds a ballistic missile maneuvering penetration strategy system, so that a ballistic missile can autonomously make maneuver decisions without knowing the interceptor model and thereby achieve better penetration. The method is based on reinforcement learning: through interaction with the environment, the model gains knowledge and improves its capability. The system gradually optimizes its strategy through well-designed offline training tasks. Finally, a real-time task test was performed on the resulting maneuver penetration strategy system. Simulation results demonstrate the effectiveness and generalization ability of the mobile penetration strategy system.

II. INTERCEPTOR MISSILE GUIDANCE STRATEGY

The proportional guidance method is a guidance law used while the interceptor missile attacks the target ballistic missile, in which the angular velocity of the interceptor's velocity vector is proportional to the angular velocity of the line of sight to the target.

This guidance method is simple, reliable, easy to implement, and has high tracking efficiency; the scale factor can be chosen according to the flight characteristics of the target ballistic missile. With this method the overload required of the interceptor is less than the overload available, omnidirectional attacks can be achieved, and the interceptor trajectory is relatively straight and technically easy to realize. Assuming that the interceptor and the target ballistic missile have the same speed and the target ballistic missile does not maneuver, it is the optimal guidance method.

Chaojie Yang, Jiang Wu, Guoqing Liu, and Yuncan Zhang are with the State Key Laboratory of Integrated Guidance and Control, Beihang University, Beijing, 100083, China (corresponding author phone: 86-13910069931; e-mail: wujiang@buaa.edu.cn).
Assume that the attack plane is horizontal; the relative positional relationship between the interceptor missile and the target ballistic missile is shown in Fig. 1.

Fig. 1: Relative positional relationship between interceptor and target ballistic missile (M: interceptor; T: target ballistic missile; O: strike target; the baseline passes through O).

Among them, M is the location of the interceptor; T is the location of the target ballistic missile; O is the location of the strike target; MT is the target line of sight; OT is the striking line of sight.

The parameters are defined as follows:

R is the relative distance between the interceptor and the target ballistic missile; when the interceptor hits the target ballistic missile, R = 0.

L is the relative distance between the strike target and the target ballistic missile; when the ballistic missile hits the target, L = 0.

q is the angle between the baseline and the target line of sight, called the target line-of-sight angle. It is positive when the baseline rotates counterclockwise to the target line of sight, and negative otherwise.

p is the angle between the baseline and the striking line of sight, called the striking line-of-sight angle. It is positive when the baseline rotates counterclockwise to the striking line of sight, and negative otherwise.

V is the speed of the interceptor.

Vt is the speed of the target ballistic missile.

θ is the angle between the velocity vector of the interceptor and the baseline, called the ballistic inclination of the interceptor. It is positive when the baseline rotates counterclockwise to the interceptor's velocity vector, and negative otherwise.

θt is the angle between the velocity vector of the target ballistic missile and the baseline, called the ballistic inclination of the target ballistic missile. It is positive when the baseline rotates counterclockwise to the target's velocity vector, and negative otherwise.

η is the angle between the velocity vector of the interceptor and the target line of sight, called the speed lead angle of the interceptor. It is positive when the baseline rotates counterclockwise to the interceptor's velocity vector, and negative otherwise.

ηt is the angle between the velocity vector of the target ballistic missile and the target line of sight, called the speed lead angle of the target ballistic missile. It is positive when the baseline rotates counterclockwise to the target's velocity vector, and negative otherwise.

It can be seen from Fig. 1 that at time t0 the target line-of-sight angle is q0 and the ballistic inclination of the interceptor is θ0; at time t1 the target line-of-sight angle is q1 and the ballistic inclination of the interceptor is θ1. According to the definition of proportional guidance:

    (θ1 − θ0)/Δt = K (q1 − q0)/Δt                          (1)

In (1):

    Δt = t1 − t0                                           (2)

Let Δt → 0. From (1) and (2), the proportional guidance law is:

    dθ/dt = K dq/dt                                        (3)

In (3), K is the scale factor (1 < K < ∞), which can be constant or variable; dθ/dt is the angular velocity of the ballistic inclination of the interceptor, and dq/dt is the angular velocity of rotation of the target line of sight.

Assume that the proportional coefficient K is constant. Another form of the proportional guidance equation is obtained by integrating (3):

    θ − θ0 = K (q − q0)                                    (4)

In (4), θ0 is the initial ballistic inclination of the interceptor and q0 is the initial target line-of-sight angle.

According to Fig. 1, q = θ + η. Differentiating both sides of this formula gives:

    dq/dt = dθ/dt + dη/dt                                  (5)

From (3) and (5), the other two forms of the proportional guidance equation follow:

    dη/dt = (1 − K) dq/dt                                  (6)

    dη/dt = ((1 − K)/K) dθ/dt                              (7)
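To make the discrete-to-continuous limit in (1)-(3) concrete, the guidance law can be sketched as a short planar simulation in which the interceptor's heading is advanced by K times the measured change of the line-of-sight angle at each step. This is an illustrative sketch only, not the paper's simulation: the speeds, initial geometry, step size, and K = 4 are assumed values.

```python
import math

def pn_pursuit(K=4.0, dt=0.01, steps=3000):
    """Planar proportional guidance: theta_{k+1} - theta_k = K * (q_{k+1} - q_k),
    the discrete form of (1) and (3). Returns the smallest relative distance R seen."""
    mx, my, theta, vm = 0.0, 0.0, 0.0, 300.0                # interceptor M
    tx, ty, theta_t, vt = 5000.0, 3000.0, math.pi, 250.0    # non-maneuvering target T
    q = math.atan2(ty - my, tx - mx)                        # initial line-of-sight angle q0
    r_min = math.hypot(tx - mx, ty - my)
    for _ in range(steps):
        mx += vm * math.cos(theta) * dt                     # advance interceptor
        my += vm * math.sin(theta) * dt
        tx += vt * math.cos(theta_t) * dt                   # advance target
        ty += vt * math.sin(theta_t) * dt
        q_new = math.atan2(ty - my, tx - mx)                # new line-of-sight angle
        theta += K * (q_new - q)                            # proportional guidance update (3)
        q = q_new
        r_min = min(r_min, math.hypot(tx - mx, ty - my))    # track closest approach
    return r_min
```

Consistent with (6), a gain K > 1 drives the lead angle so that the line-of-sight rotation slows, which is what straightens the interceptor's trajectory toward a collision course.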
III. THE STATE SPACE AND ACTION SPACE OF THE BALLISTIC MISSILE

This paper simplifies the interception-and-confrontation process between the ballistic missile and the interceptor missile. Assuming that both fly in the same plane, the ballistic missile achieves penetration by changing its lateral acceleration.

In order to reduce the action space, the control quantity is discretized into three values, represented by actions A1, A2, and A3. Action A1 represents the missile turning left with a fixed lateral acceleration; action A2 represents the missile maintaining its current velocity; action A3 represents the missile turning right with a fixed lateral acceleration.

Fig. 2: The action space of the ballistic missile (A1: turn left; A2: keep the current velocity; A3: turn right).

Assuming that the speed and position of the interceptor are known, the state quantities R, q, L, and p can be calculated. The state space can then be represented by {R, q, L, p}.

Assume that the damage radius of the interceptor is R0 and the ballistic missile's damage radius is R1. The mobile penetration policy system can then be defined as follows:

    Objective: L ≤ R1 && R ≥ R0
    Input:     S = {R, q, L, p}                            (8)
    Output:    a ∈ {A1, A2, A3}

IV. MOBILE PENETRATION STRATEGY SYSTEM SELF-TRAINING METHOD

According to (8), the Maneuver Penetration Strategy System needs to select appropriate actions from the action space, which can be modeled as a Markov decision process. In this paper, this Markov decision process is solved with reinforcement learning; in particular, tabular Q-learning is applied in the maneuver penetration strategy system. By constantly interacting with the environment, the maneuver penetration strategy system can be trained.

A. The MDP for the MPSs Action Selection

As shown in Fig. 3, the action selection model of the maneuver penetration strategy system can be expressed as a tuple {S, A, r, p}, where S is the state space, A is the action space, r is the current reward, and p is the transition probability between two states.

Fig. 3: The MDP of the Mobile Penetration Strategy System.

S and A have already been defined in the previous discussion. The reward and penalty function guides the mobile penetration strategy system toward a maneuver penetration strategy and can be defined as:

        | 10,                                        L ≤ R1 && R ≥ R0
    r = | 2*(R − R0)/R0 − (L − L0)/L0 − t/T − Δθ,    otherwise          (9)
        | −10,                                       R ≤ R0

When R ≤ R0, the ballistic missile is within the damage radius of the interceptor, and penetration is considered to have failed. When L ≤ R1 && R ≥ R0, the ballistic missile has successfully hit the target without being blocked by the interceptor. Otherwise, the reward is given by the middle branch of (9), which takes distance, energy, angle, and other factors into account.

B. The Q-Learning Algorithm of the Mobile Penetration Strategy System

The Mobile Penetration Strategy System is trained offline, mainly with the Q-learning algorithm, a method of reinforcement learning. Unlike supervised learning, the algorithm has no labeled data and acquires knowledge and strategies primarily through continuous interaction with the environment. Based on this principle, the maneuver penetration strategy system is established as shown in Fig. 4.

Fig. 4: Architecture of the MPSs system (an MPSs policy module that maps states to actions, and a policy learning module that updates the policy from the reward signal).

The essence of the Q-learning algorithm is the learning and use of the Q function, written Q(s, a): the weighted sum of all rewards and penalties received when the corresponding state is s and the action taken is a. The Q function
can be approximated as:

    Q(st, at) = Σ_{k=0}^{∞} γ^k r_{t+k+1}                  (10)

In (10), the discount factor satisfies 0 ≤ γ ≤ 1, and r_{t+1} is the reward or penalty value at time t+1.

As seen in Fig. 4, the architecture of the MPSs system is composed of two modules: the MPSs policy module and the policy learning module.

The MPSs policy module contains the maneuver strategies for the different states. The essence of this module is the use of the Q function: its input is the current state of the target ballistic missile and its output is the maneuver selected from the action space. The choice of maneuver is based on a greedy strategy, which selects the maneuver that maximizes the Q-function value. In this process, reinforcement learning must not only exploit acquired knowledge to gain more reward, but also explore and try new maneuvers so as to make better action choices in the future. To balance exploitation and exploration, the ε-greedy strategy is used throughout. The detailed process is described in Table I.

TABLE I. MPSS POLICY MODULE

The policy learning module is the most important part of the mobile penetration strategy system and plays the central role in self-training. The essence of this module is the self-learning of the Q function: through continuous interaction with the environment, reward and penalty values are obtained and the Q(s, a) values are updated. According to dynamic programming theory, (10) can be rearranged as:

    Q(st, at) = Σ_{k=0}^{∞} γ^k r_{t+k+1}
              = r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+1+k+1}    (11)
              = r_{t+1} + γ max_{ai} Q(s_{t+1}, ai)

To control the update step of Q(s, a), a learning rate is introduced, giving the update rule (12):

    Q(st, at) ← (1 − α) Q(st, at) + α [ r_{t+1} + γ max_{ai} Q(s_{t+1}, ai) ]
              = Q(st, at) + α [ r_{t+1} + γ max_{ai} Q(s_{t+1}, ai) − Q(st, at) ]    (12)

α is the learning rate, which controls the Q(s, a) update step size.

Each time the mobile penetration strategy system interacts with the environment, it obtains a reward signal, and Q(s, a) is updated according to (12). Repeating this process many times, the Q function eventually converges, and the maneuver penetration strategy system is obtained at the same time.

The above self-learning method for the maneuver penetration strategy system can be summarized as follows:

TABLE II. SELF-LEARNING MPSS METHOD

V. TRAINING AND RESULTS

In the verification experiment, the initial distance between the interceptor and the ballistic missile was assumed to be 200 km. The starting point of the interceptor is set as the origin (0, 0) and the target is at [50 km, 35 km]. The initial position of the ballistic missile is then randomized subject to the starting distance.

To verify the effectiveness of the learned strategy, the speed of the ballistic missile is set to 1700 m/s, while the interceptor missile is given the higher speed of 1800 m/s. If the missile can still penetrate by maneuvering while at a speed disadvantage, the mobile penetration strategy is shown to be effective.

A. Train#1

Before training, the missile's flight path is disorderly and purposeless, as shown in Fig. 5.
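The pieces defined in (8), (9), and (12) can be sketched together as follows: a reward function following the piecewise form of (9), ε-greedy selection over {A1, A2, A3} as in Table I, and the tabular Q update of (12). This is a minimal illustration rather than the paper's code; the damage radii, the normalizers L0 and T (read here as the initial target distance and a maximum episode time), the Δθ term (read as the magnitude of the commanded heading change), and all hyperparameters are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ("A1", "A2", "A3")           # turn left / keep course / turn right

def reward(R, L, t, d_theta, R0=250.0, R1=500.0, L0=200_000.0, T=600.0):
    """Piecewise reward of (9); all constants are illustrative."""
    if R <= R0:                        # inside interceptor damage radius: failure
        return -10.0
    if L <= R1:                        # reached the target while outside R0: success
        return 10.0
    # shaping: reward separation from the interceptor, penalize remaining distance
    # to the target, elapsed time, and heading changes (a proxy for maneuver energy)
    return 2.0 * (R - R0) / R0 - (L - L0) / L0 - t / T - d_theta

def epsilon_greedy(Q, s, eps=0.1):
    """Explore with probability eps, otherwise act greedily on the Q table."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One application of update rule (12)."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Training then loops over episodes: observe the discretized state {R, q, L, p}, pick an action with epsilon_greedy, step the engagement simulation, score the transition with reward, and call q_update until the table converges.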

Fig. 5: Trajectories of the beginning episodes

B. Train#2

After training 100,000 times, the missile shows some purpose and some awareness of avoiding interception, but the effect is still not good, as shown in Fig. 6.

Fig. 6: Trajectories after training 100,000 times

C. Train#3

After training 500,000 times, the missile has a clear purpose and a strong ability to maneuver around the interceptor missiles, indicating that the learned maneuvering strategy is effective.

Fig. 7: Trajectories after training 500,000 times

VI. CONCLUSION

In this paper, we propose a Mobile Penetration Strategy System to deal with interception by unknown interceptor missiles. The method is based on reinforcement learning, learning a better maneuvering strategy through continuous trial and error. Simulation and verification show that the ballistic missile can maneuver in a timely manner to evade the interceptor missiles and hit the target, demonstrating the effectiveness of our mobile penetration strategy system.

REFERENCES

[1] Shupe N K. The Control of Aircraft in Terrain-Following and Formation-Flight Applications[J]. IEEE Transactions on Aerospace & Electronic Systems, 1966, AES-2(6): 502-507.
[2] Zaharia S M, tefneanu, Rare Ioan. Design and Manufacturing Process for a Ballistic Missile[J]. Scientific Bulletin, 2017, 21(2).
[3] Liu S H, Bi Z J, Lu L. Research of Characteristic Extraction and Matching Algorithm of Ballistic Missile Hyperspectral Data Based on Quantization Coding Method[J]. Advanced Materials Research, 2014, 945-949: 1936-1941.
[4] Schvaneveldt R W, Goldsmith T E, Benson A E, et al. Neural Network Models of Air Combat Maneuvering. 1992.
[5] Smith R E, Dike B A. Learning Novel Fighter Combat Maneuver Rules Via Genetic Algorithms[J]. International Journal of Expert Systems, 1995, 8(3): 247-276.
[6] Smith R E, Dike B A, Mehra R K, et al. Classifier Systems in Combat: Two-sided Learning of Maneuvers for Advanced Fighter Aircraft[J]. Computer Methods in Applied Mechanics & Engineering, 2000, 186(2-4): 421-437.
[7] Krishna Kumar K, Kaneshige J, Satyadas A. Challenging Aerospace Problems for Intelligent Systems[C]. Proc. of the Von Karman Lecture Series on Intelligent Systems for Aeronautics, Belgium, 2002: 1-15.