Dual ISP failover with RPM ip monitoring

The Internet isn't perfect and we may have link failures from time to time. How do we react to these
failures: manually, or in an automated way? In this post I would like to show how Junos
can take action when an upstream gateway becomes unreachable and how SRX flow behaves in such
a scenario. To achieve this we will use a handful of features currently available on an SRX
box. Before getting started, check my test topology below in order to understand this post. It is a
simulated Internet environment with some fake public IP addresses. BranchC is our client-side
SRX device with two connected PCs, and we will do all the config magic on this BranchC
device.

Test Plan

1) Create two routing instances, one for each ISP, & cross-import the routes between these two
instances

2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2 by using Filter Based
Forwarding

3) Monitor each ISP by using RPM (Real Time Performance Monitoring) feature

4) Test the ideal condition traffic flow

5) If any ISP link fails, failover the default route to the other ISP by using ip monitoring
feature

6) Analyse the effects of this failover on established TCP/UDP traffic

Now we will go step by step and complete each task.


1) Create two routing instances, one for each ISP
First we need to create RIB groups so that each ISP routing instance can have the interface routes
of the other ISP.
[edit]
root@branchC# show routing-options
rib-groups {
    ISP1-to-ISP2 {
        import-rib [ ISP1.inet.0 ISP2.inet.0 ];
    }
    ISP2-to-ISP1 {
        import-rib [ ISP2.inet.0 ISP1.inet.0 ];
    }
}
Then create routing instances and activate rib-groups.

[edit]
root@branchC# show routing-instances
ISP1 {
    instance-type virtual-router;
    interface ge-0/0/0.951;
    routing-options {
        interface-routes {
            rib-group inet ISP1-to-ISP2;
        }
        static {
            route 0.0.0.0/0 next-hop 173.1.1.1;
        }
    }
}
ISP2 {
    instance-type virtual-router;
    interface ge-0/0/0.202;
    routing-options {
        interface-routes {
            rib-group inet ISP2-to-ISP1;
        }
        static {
            route 0.0.0.0/0 next-hop 212.44.1.1;
        }
    }
}
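To make the rib-group idea a bit more concrete, here is a small Python model of it. This is not Junos code, only a sketch under the assumption that an interface route installed through a rib-group ends up in every table listed in its import-rib statement. The names RIB_GROUPS and install_interface_route are made up for illustration.

```python
# Toy model of rib-group import (illustrative only, not Junos code).
# Assumption: a route installed through a rib-group is placed in every
# table listed in import-rib, the first entry being the primary table.
RIB_GROUPS = {
    "ISP1-to-ISP2": ["ISP1.inet.0", "ISP2.inet.0"],
    "ISP2-to-ISP1": ["ISP2.inet.0", "ISP1.inet.0"],
}


def install_interface_route(tables, rib_group, prefix, interface):
    """Install an interface route into all tables of the given rib-group."""
    for table in RIB_GROUPS[rib_group]:
        tables.setdefault(table, {})[prefix] = interface
    return tables


tables = {}
# Interface route of the ISP1 instance, shared via ISP1-to-ISP2
install_interface_route(tables, "ISP1-to-ISP2", "173.1.1.0/24", "ge-0/0/0.951")
# Interface route of the ISP2 instance, shared via ISP2-to-ISP1
install_interface_route(tables, "ISP2-to-ISP1", "212.44.1.0/30", "ge-0/0/0.202")
```

After both calls, each table in the model knows the directly connected network of the other ISP, which is what makes a next-hop in the other instance resolvable later on.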
Now the routing table should be ready, i.e. the interface routes of each instance should be cross-imported into the other.
root@branchC> show route

inet.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

173.63.1.0/24      *[Direct/0] 1d 01:58:44
                    > via ge-0/0/0.963
173.63.1.1/32      *[Local/0] 1d 01:58:45
                      Local via ge-0/0/0.963

ISP1.inet.0: 6 destinations, 6 routes (6 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 1d 01:53:34
                    > to 173.1.1.1 via ge-0/0/0.951
173.1.1.0/24       *[Direct/0] 1d 01:54:14
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 01:54:14
                      Local via ge-0/0/0.951
173.1.1.10/32      *[Static/1] 1d 01:54:14
                      Receive
212.44.1.0/30      *[Direct/0] 1d 01:37:00 <<<< --- This is the route of ISP2
                    > via ge-0/0/0.202
212.44.1.2/32      *[Local/0] 1d 01:37:00
                      Local via ge-0/0/0.202

ISP2.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 1d 01:54:14
                    > to 212.44.1.1 via ge-0/0/0.202
173.1.1.0/24       *[Direct/0] 1d 01:37:00 <<<< --- This is the route of ISP1
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 01:37:00
                      Local via ge-0/0/0.951
212.44.1.0/30      *[Direct/0] 1d 01:54:14
                    > via ge-0/0/0.202
212.44.1.2/32      *[Local/0] 1d 01:54:14
                      Local via ge-0/0/0.202
We have completed the first task. Each routing instance is now aware of the other instance's interface routes.
Now we should route traffic from the clients to their respective ISPs.
2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2
Below, by using a firewall filter, we redirect each client's traffic to its routing instance.
[edit]
root@branchC# show firewall
family inet {
    filter redirect-to-isp {
        term to-isp1 {
            from {
                source-address {
                    173.63.1.100/32;
                }
            }
            then {
                routing-instance ISP1;
            }
        }
        term to-isp2 {
            from {
                source-address {
                    173.63.1.200/32;
                }
            }
            then {
                routing-instance ISP2;
            }
        }
        term default-allow {
            then accept;
        }
    }
}
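If the term ordering is not obvious, the filter above can be read as a first-match lookup. Below is a rough Python sketch of that logic, not of how Junos implements it internally. FILTER_TERMS and evaluate_filter are names I made up for this illustration; the prefixes and actions are taken from the config.

```python
import ipaddress

# Terms of the redirect-to-isp filter, in order: (name, source prefix, action).
# A prefix of None means the term has no "from" clause and matches everything.
FILTER_TERMS = [
    ("to-isp1", "173.63.1.100/32", "routing-instance ISP1"),
    ("to-isp2", "173.63.1.200/32", "routing-instance ISP2"),
    ("default-allow", None, "accept"),
]


def evaluate_filter(source_ip):
    """Return (term, action) of the first term matching the source address."""
    addr = ipaddress.ip_address(source_ip)
    for name, prefix, action in FILTER_TERMS:
        if prefix is None or addr in ipaddress.ip_network(prefix):
            return name, action
    # Not reachable here because of default-allow, but a filter without a
    # catch-all term ends with an implicit discard.
    return None, "discard"


print(evaluate_filter("173.63.1.100"))  # Debian1
print(evaluate_filter("173.63.1.200"))  # HostC
print(evaluate_filter("173.63.1.50"))   # any other client
```

The default-allow term matters: without it, traffic from any other client on this interface would hit the implicit discard at the end of the filter.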
The filter isn't active until we apply it on the incoming interface:
[edit]
root@branchC# show interfaces ge-0/0/0.963
vlan-id 963;
family inet {
    filter {
        input redirect-to-isp; <<< --- We are redirecting client traffic.
    }
    address 173.63.1.1/24;
}
Redirecting client traffic to routing instances is also completed. Now we will monitor ISP links.
3) Monitor each ISP by using RPM
Junos has a great real-time monitoring feature. You can continuously check link quality and
probe remote hosts. RPM really deserves a dedicated post of its own, but in short, what we do below
is probe each ISP gateway with ICMP 5 times at 1-second intervals (with 3 seconds between tests), and if
all 5 probes in a single test are lost, then the TEST FAILS. What does a test failure mean practically for us? It
means we can take an IP monitoring action for this failure.
[edit]
root@branchC# show services rpm
probe probe-isp1 {
    test test-isp1 {
        probe-type icmp-ping;
        target address 173.1.1.1;
        probe-count 5;
        probe-interval 1;
        test-interval 3;
        source-address 173.1.1.2;
        routing-instance ISP1;
        thresholds {
            total-loss 5;
        }
    }
}
probe probe-isp2 {
    test test-isp2 {
        probe-type icmp-ping;
        target address 212.44.1.1;
        probe-count 5;
        probe-interval 1;
        test-interval 3;
        source-address 212.44.1.2;
        routing-instance ISP2;
        thresholds {
            total-loss 5;
        }
    }
}
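The pass/fail decision of a single test can be pictured with a few lines of Python. This is only a sketch of the thresholds logic as I described it above, assuming a test fails once the number of lost probes reaches total-loss; rpm_test_failed is a hypothetical helper name, not something Junos exposes.

```python
def rpm_test_failed(probe_replies, total_loss_threshold=5):
    """Decide whether one RPM test failed.

    probe_replies: one boolean per probe in the test (True = reply received).
    Assumption: the test fails once the lost-probe count reaches the threshold.
    """
    lost = sum(1 for got_reply in probe_replies if not got_reply)
    return lost >= total_loss_threshold


# probe-count 5, total-loss 5: only a test with no replies at all fails
print(rpm_test_failed([False, False, False, False, False]))
print(rpm_test_failed([True, False, False, False, False]))
```

With probe-count 5 and total-loss 5, a single reply within a test is enough to keep the test passing, so a lossy but alive link does not trigger a failover with this config.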
To check the probe results:
root@branchC> show services rpm probe-results owner probe-isp1 test test-isp1

    Owner: probe-isp1, Test: test-isp1
    Target address: 173.1.1.1, Source address: 173.1.1.2, Probe type: icmp-ping
    Routing Instance Name: ISP1
    Test size: 5 probes
    Probe results:
      Response received, Mon Nov 10 23:17:00 2014, No hardware timestamps
      Rtt: 5225 usec
    Results over current test:
      Probes sent: 5, Probes received: 5, Loss percentage: 0 <<<--- Probes are sent and received without any problem, no loss.
      Measurement: Round trip time
        Samples: 5, Minimum: 5212 usec, Maximum: 5307 usec, Average: 5264 usec, Peak to peak: 95 usec,
        Stddev: 39 usec, Sum: 26319 usec
    Results over last test:
      Probes sent: 5, Probes received: 5, Loss percentage: 0
      Test completed on Mon Nov 10 23:17:00 2014
      Measurement: Round trip time
        Samples: 5, Minimum: 5212 usec, Maximum: 5307 usec, Average: 5264 usec, Peak to peak: 95 usec,
        Stddev: 39 usec, Sum: 26319 usec
    Results over all tests:
      Probes sent: 64740, Probes received: 63097, Loss percentage: 2
      Measurement: Round trip time
        Samples: 63097, Minimum: 617 usec, Maximum: 15220 usec, Average: 5399 usec,
        Peak to peak: 14603 usec, Stddev: 631 usec, Sum: 340640344 usec
As we can see, there isn't any loss at the moment. RPM monitoring alone, without an action,
doesn't really mean anything in our scenario. We need to take an action if a test fails, and that action is IP monitoring. Let's do it.
4) Test the ideal condition traffic flow
For this test to be successful, you must have source NAT configured, and security policies
must allow the traffic.

I am running traceroute from each host, and traffic follows a different ISP for each host. This is
what we wanted in the first place when both links are functional.
root@debian1:~# traceroute -n 87.1.1.6


traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
1 173.63.1.1 3.857 ms 3.811 ms 4.635 ms
2 173.1.1.1 13.120 ms 13.130 ms 13.128 ms
3 87.1.1.6 12.489 ms 13.112 ms 13.106 ms
root@hostC:~# traceroute -n 87.1.1.6


traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
1 173.63.1.1 2.876 ms 2.875 ms 3.493 ms
2 212.44.1.1 12.244 ms 12.249 ms 12.305 ms
3 87.1.1.6 12.080 ms 12.154 ms 12.188 ms
5) If any ISP link fails, failover!
With the following config, we check the RPM probes for failure. If a probe test fails, we install a preferred
default route pointing to the other ISP's default gateway, which is exactly what we
want. This happens automatically on each failure event.
[edit]
root@branchC# show services ip-monitoring
policy track-isp1 {
    match {
        rpm-probe probe-isp1;
    }
    then {
        preferred-route {
            routing-instances ISP1 {
                route 0.0.0.0/0 {
                    next-hop 212.44.1.1;
                }
            }
        }
    }
}
policy track-isp2 {
    match {
        rpm-probe probe-isp2;
    }
    then {
        preferred-route {
            routing-instances ISP2 {
                route 0.0.0.0/0 {
                    next-hop 173.1.1.1;
                }
            }
        }
    }
}
Now we will simulate a failure on ISP1, after which the Debian1 device will also be routed
through ISP2 instead of ISP1. Aha, link failed!

Now check IP monitoring status


root@branchC> show services ip-monitoring status

Policy - track-isp1 (Status: FAIL)
  RPM Probes:
    Probe name             Test Name       Address          Status
    ---------------------- --------------- ---------------- --------
    probe-isp1             test-isp1       173.1.1.1        FAIL <<< --- TEST FAILED
  Route-Action:
    route-instance    route             next-hop         state
    ----------------- ----------------- ---------------- ------------
    ISP1              0.0.0.0/0         212.44.1.1       APPLIED <<< --- Route action is taken and 0/0 next-hop is set to ISP2.

Policy - track-isp2 (Status: PASS)
  RPM Probes:
    Probe name             Test Name       Address          Status
    ---------------------- --------------- ---------------- --------
    probe-isp2             test-isp2       212.44.1.1       PASS
  Route-Action:
    route-instance    route             next-hop         state
    ----------------- ----------------- ---------------- ------------
    ISP2              0.0.0.0/0         173.1.1.1        NOT-APPLIED
root@branchC> show route table ISP1.inet.0

ISP1.inet.0: 6 destinations, 7 routes (6 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/1] 00:00:25, metric2 0
                    > to 212.44.1.1 via ge-0/0/0.202 <<< --- New default gateway is ISP2 now. Yupppiii!!!
                    [Static/5] 1d 02:24:50
                    > to 173.1.1.1 via ge-0/0/0.951
173.1.1.0/24       *[Direct/0] 1d 02:25:30
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 02:25:30
                      Local via ge-0/0/0.951
173.1.1.10/32      *[Static/1] 1d 02:25:30
                      Receive
212.44.1.0/30      *[Direct/0] 1d 02:08:16
                    > via ge-0/0/0.202
212.44.1.2/32      *[Local/0] 1d 02:08:16
                      Local via ge-0/0/0.202
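Why does the new default route win? In the ISP1.inet.0 output the route injected by IP monitoring shows up as Static/1, while the configured one is Static/5, and Junos prefers the route with the lower preference value. A tiny Python sketch of that selection follows; active_route and the dictionaries are illustrative names of mine, and only the next hops and preference values come from the output.

```python
def active_route(candidates):
    """Pick the active route for a prefix: the lowest preference value wins."""
    return min(candidates, key=lambda route: route["preference"])


# Candidate default routes in ISP1.inet.0 after the failure
default_routes_isp1 = [
    {"next_hop": "173.1.1.1", "preference": 5},   # configured static route
    {"next_hop": "212.44.1.1", "preference": 1},  # injected by ip-monitoring
]

print(active_route(default_routes_isp1)["next_hop"])
```

Note that the original Static/5 route is not deleted. It stays in the table as a non-active route, so once the injected route goes away, the configured default is the only candidate left in this model and becomes active again.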


Now let's see if this new condition is working for hosts Debian1 and HostC. As you can see
below, Debian1 is now following the ISP2 link instead of the failed ISP1 link.
root@debian1:~# traceroute -n 87.1.1.6


traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
1 173.63.1.1 1.165 ms 1.154 ms 1.141 ms
2 212.44.1.1 10.567 ms 10.929 ms 10.923 ms
3 87.1.1.6 10.501 ms 10.501 ms 10.486 ms
root@hostC:~# traceroute -n 87.1.1.6


traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
1 173.63.1.1 4.354 ms 4.353 ms 4.980 ms
2 212.44.1.1 14.263 ms 14.261 ms 14.258 ms
3 87.1.1.6 13.552 ms 14.179 ms 14.172 ms
6) Analyse the effects of this failover on established TCP/UDP traffic
In order to investigate how SRX flow behaves upon this route update, I initiated an SSH
connection towards the remote box 87.1.1.6 before the link failure. Below is the session entry of
this connection.
root@branchC> show security flow session destination-prefix 87.1.1.6


Session ID: 6786, Policy name: default-policy-00/2, Timeout: 1788, Valid
In: 173.63.1.100/33127 --> 87.1.1.6/22;tcp, If: ge-0/0/0.963, Pkts: 42, Bytes: 4959
Out: 87.1.1.6/22 --> 173.1.1.2/5184;tcp, If: ge-0/0/0.951, Pkts: 41, Bytes: 5317
Total sessions: 1
The session is established and working fine. I have also enabled flow trace to see what flow tells
me once I send a packet after the link failure. My comments are inline in the flow trace.

Nov 10 22:36:39 22:36:39.463629:CID-0:RT:<173.63.1.100/33127->87.1.1.6/22;6> matched filter rpm-ff:
Nov 10 22:36:39 22:36:39.463652:CID-0:RT:packet [100] ipid = 23312, @0x4c4123d2
Nov 10 22:36:39 22:36:39.463660:CID-0:RT:---- flow_process_pkt: (thd 1): flow_ctxt type 15, common flag 0x0, mbuf 0x4c412180, rtbl_idx = 7
Nov 10 22:36:39 22:36:39.463665:CID-0:RT: flow process pak fast ifl 76 in_ifp ge-0/0/0.963
Nov 10 22:36:39 22:36:39.463698:CID-0:RT: ge-0/0/0.963:173.63.1.100/33127->87.1.1.6/22, tcp, flag 18
Nov 10 22:36:39 22:36:39.463707:CID-0:RT: find flow: table 0x58f397a0, hash 61800(0xffff), sa 173.63.1.100, da 87.1.1.6, sp 33127, dp 22, proto 6, tok 7
Nov 10 22:36:39 22:36:39.463715:CID-0:RT:Found: session id 0x1a82. sess tok 7
Nov 10 22:36:39 22:36:39.463720:CID-0:RT: flow got session.
Nov 10 22:36:39 22:36:39.463721:CID-0:RT: flow session id 6786 <<< --- Session is matched
Nov 10 22:36:39 22:36:39.463735:CID-0:RT:flow_ipv4_rt_lkup success 87.1.1.6, iifl 0x0, oifl 0x44
Nov 10 22:36:39 22:36:39.463740:CID-0:RT: route lookup failed: dest-ip 87.1.1.6 orig ifp ge-0/0/0.951 output_ifp ge-0/0/0.202 fto 0x53bf21a8 orig-zone 6 out-zone 10 vsd 0 <<< --- route lookup fails as currently this destination is pointing to a different interface
Nov 10 22:36:39 22:36:39.463745:CID-0:RT: readjust timeout to 6 s <<< --- flow adjusts the timeout to 6 seconds immediately.
Nov 10 22:36:39 22:36:39.463748:CID-0:RT: packet dropped, pak dropped since re-route failed <<< --- and the packet is dropped.
Nov 10 22:36:39 22:36:39.463751:CID-0:RT: ----- flow_process_pkt rc 0x7 (fp rc -1)
I have learned something new here. Apparently, in this new situation flow drops the session
timeout to 6 seconds immediately. After seeing this, I ran the flow session command once again and
saw that the session timeout was now down to 2 seconds.
root@branchC> show security flow session destination-prefix 87.1.1.6


Session ID: 6786, Policy name: default-policy-00/2, Timeout: 2, Valid
In: 173.63.1.100/33127 --> 87.1.1.6/22;tcp, If: ge-0/0/0.963, Pkts: 42, Bytes: 4959
Out: 87.1.1.6/22 --> 173.1.1.2/5184;tcp, If: ge-0/0/0.951, Pkts: 41, Bytes: 5317
Total sessions: 1

And once these two seconds have also passed, flow deletes the session from the session table.
Nov 10 22:36:45 22:36:45.157575:CID-0:RT:jsf sess close notify
Nov 10 22:36:45 22:36:45.157587:CID-0:RT:flow_ipv4_del_flow: sess 6786, in hash 32
Nov 10 22:36:45 22:36:45.157592:CID-0:RT:flow_ipv4_del_flow: sess 6786, in hash 32
We have seen the effects of failover on TCP, but I will leave the effects of this failover on UDP
traffic to the reader. UDP behaviour is a bit different from this, and if required some
measures can be taken to mitigate it, but I leave it to the reader to discover and share with me.