
Troubleshooting Cisco
Secure Firewall Cluster
Failures and Packet Drops

Oscar Montoya Torres


Security Team Captain
TACSEC-2006

Your presenter

Oscar Montoya Torres


• Mexico City
• Technical Consulting Engineer CX Security
• Security Team Captain
• 4+ years in Cisco TAC
• Focused on Secure Firewall FTD/ASA

“Simple can be harder than complex.
You have to work hard to get your
thinking clean to make it simple.”
-Steve Jobs

Agenda

• Introduction
• Connection Flags and Packet Flow
• Unit Join Failures
• MTU Issues
• NAT/PAT Failures
• Troubleshooting Packet Drops
• Q&A

Introduction
Introduction – What is a Cluster?
Clustering lets you group multiple units together as a single logical device
while providing increased throughput and redundancy.

Introduction – Requirements
All units in a cluster:
• Must be the same model
• Must run the same software version
• Must use the same firewall mode:
• Physical appliance: routed/transparent
• Virtual appliance: routed only

Supported on:
• FPR9300/FPR4100 - Up to 16 units
• Secure Firewall 3100 - Up to 8 units
• vFTD on AWS, GCP, Azure - Up to 16 units
• vFTD on KVM, VMware - Up to 4 units

Introduction – How is an FTD cluster deployed?

• Spanned EtherChannels: all data links are grouped into one EtherChannel on
the switch side (a switch-side sketch follows below).
• The Cluster Control Link (CCL) carries both control and data traffic.
• A per-unit EtherChannel is recommended on the CCL for link redundancy.
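
A minimal NX-OS sketch of the switch side; the member interfaces, port-channel
numbers and trunk/MTU details are illustrative assumptions, not taken from the
slides (on some Nexus platforms jumbo MTU is enabled through a network-qos
policy instead of a per-interface command):

! Spanned data EtherChannel: one port-channel whose members connect to all units
interface ethernet 1/1-3
  channel-group 1 mode active
interface port-channel 1
  switchport mode trunk

! Per-unit CCL EtherChannel: a separate port-channel per firewall (FTD1 shown)
interface ethernet 1/11-12
  channel-group 10 mode active
interface port-channel 10
  mtu 9216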

Introduction – Cluster Node terminology
Control Unit - One node is elected as the Control Unit.
Data Unit - The rest of the nodes are Data Units.

The following centralized features are handled by the
control unit only:
• Inspections: NetBIOS, TFTP, SUNRPC, SQLNET
• Site-to-Site VPN
• Dynamic routing
• Security Group Tags (SGT) are learned by the
Control Unit and propagated to the Data Units
• FMC communicates with the Control Unit for policy
deployment changes; the configuration is then
replicated to the Data Units
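
Because these features run only on the control unit, it helps to confirm which
node currently holds that role before troubleshooting them, for example:

FTD1# show cluster info
FTD1# show cluster history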

Connection Flags
and Packet Flow
Connection Flags - Cluster flow terminology
• Flow Owner: whichever unit receives the first packet of a new connection
becomes the flow owner.
FTD1# cluster exec show conn add 192.0.2.152 | in 59718
FTD1(LOCAL):******************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 15826, flags UIO
FTD2:*************************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 0, flags Y

• Flow Director: the connection state is backed up on a different unit,
called the director. The director tells other units which unit is
the flow owner.
FTD1# cluster exec show conn add 192.0.2.152 | in 59718
FTD1(LOCAL):******************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 15826, flags UIO
FTD2:*************************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 0, flags Y

Connection Flags - Cluster flow terminology
• Forwarder Flow: if a unit receives a packet for a flow that it doesn't own, it
contacts the director for that flow to learn which unit owns it. Once it knows,
it becomes a forwarder and forwards any packets it receives on that connection
directly to the owner.
FTD1# cluster exec show conn add 192.0.2.152 | in 59718
[output omitted]
FTD3:*************************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 0, flags z

• Backup director flow: if the unit chosen as director for a flow is also the
owner, a 'backup director' flow is created on another unit.
FTD1# cluster exec show conn add 192.0.2.152 | in 59718
[output omitted]
FTD4:*************************************************************
TCP VLAN2401 192.0.2.152:59718 VLAN2401 172.16.10.131:1523, idle 0:06:32, bytes 0, flags y

Packet Flow

1. A TCP SYN originates from the Client and arrives at FTD1. FTD1 becomes the flow owner. FTD2 is elected as the flow director.
2. The TCP SYN/ACK packet arrives from the Server at FTD3.
3. FTD3 asks the director for the flow owner, then forwards the packet to the owner.
4. The owner unit sends a state update to the director unit.
5. The owner reinjects the packet on the OUTSIDE interface and then forwards it towards the Client.
6. Any subsequent packets delivered to the director or a forwarder are forwarded to the owner.

Unit Join Failures
Case Study 1: Interface health check failure
• By default, interface monitoring is enabled.
• In case of a link failure, the node is removed from the cluster until the issue is
fixed.
• In the following example FTD1 is out of the cluster due to a CCL link failure; the
same issue can also happen due to a data link failure.

[Topology: FTD1 (Control Unit), FTD2 and FTD3 (Data Units) each connect to the
switching infra with CCL Po48 and Data Po1; the CCL of FTD1 has failed.]

Troubleshooting commands (an FXOS example follows below):
• show cluster history
• show cluster info trace
• scope eth-uplink
  scope fabric a
  show port-channel
• connect fxos
  show port-channel summary
  show port-channel database
  show lacp neighbor
  show lacp counters interface port-channel ID
  show lacp interface ethernet x/x
  show lacp internal event-history errors

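For example, running the FXOS LACP checks for the CCL (Po48 assumed here,
matching the diagram):

FTD1> connect fxos
FTD1(fxos)# show port-channel summary
FTD1(fxos)# show lacp neighbor interface port-channel 48
FTD1(fxos)# show lacp counters interface port-channel 48
FTD1(fxos)# show lacp internal event-history errors
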
Case Study 1: Interface health check failure
• show cluster history command:
FTD1> show cluster history

09:01:34 UTC May 2 2023
CONTROL_NODE    DISABLED     Cluster interface down

FTD2> show cluster history

09:01:34 UTC May 2 2023
DATA_NODE       DATA_NODE    Event: Broadcast announce drop message
                             to all units with reason
                             CLUSTER_QUIT_REASON_MASTER_UNIT_HC
09:01:34 UTC May 2 2023
DATA_NODE       DATA_NODE    Event: Cluster unit FTD1 state
                             is DISABLED

• show cluster info trace command:

FTD1> show cluster info trace

May 02 09:01:31.162 [DBUG]Cluster state machine client Cluster Unit_Test Client returns is done with progression
May 02 09:01:31.162 [DBUG]Cluster state machine notify client Cluster Unit_Test Client of progression
May 02 09:01:31.162 [INFO]State machine changed from state CONTROL_NODE to DISABLED
May 02 09:01:31.162 [INFO]Interface Port-channel48 is going down
May 02 09:01:31.162 [CRIT]Unit FTD1 is quitting due to Cluster Control Link down (1 times after last rejoin). Rejoin will be attempted after 5
minutes.
May 02 09:01:31.162 [DBUG]Send event (DISABLE, RESTART | INTERNAL-EVENT, 300000 msecs, Cluster interface down) to FSM. Current state CONTROL_NODE
May 02 09:01:29.932 [DBUG]RPC call, Cluster SVM Client to id 0 with parameter 0x0000000000000000, returns RPC_SUCCESS

Case Study 1: Interface health check failure
• LACP errors on FXOS:
FTD1(fxos)# show lacp internal event-history interface ethernet 1/2

10) FSM:<Ethernet2/1> Transition at 258423 usecs after Tue May 2 09:01:31 2023
Previous state: [LACP_ST_PORT_MEMBER_COLLECTING_AND_DISTRIBUTING_ENABLED]
Triggered event: [LACP_EV_UNGRACEFUL_DOWN]
Next state: [LACP_ST_PORT_IS_DOWN_OR_LACP_IS_DISABLED]

11) FSM:<Ethernet2/1> Transition at 350583 usecs after Tue May 2 09:01:31 2023
Previous state: [LACP_ST_PORT_IS_DOWN_OR_LACP_IS_DISABLED]
Triggered event: [LACP_EV_PORT_HW_PATH_DISABLED]
Next state: [FSM_ST_NO_CHANGE]

12) FSM:<Ethernet2/1> Transition at 434181 usecs after Tue May 2 09:01:31 2023
Previous state: [LACP_ST_PORT_IS_DOWN_OR_LACP_IS_DISABLED]
Triggered event: [LACP_EV_CLNUP_PHASE_II]
Next state: [LACP_ST_PORT_IS_DOWN_OR_LACP_IS_DISABLED]

Interfaces on the switch were modified, causing their operational state to
change to down; the CCL port-channel went down because LACP PDUs were no
longer received.
• Switch Logs:
Nexus# show logging

2023 May 2 09:01:11 Nexus1 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel10: first operational port changed from Ethernet1/1 to none
2023 May 2 09:01:11 Nexus1 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel10: Ethernet1/1 is down
2023 May 2 09:01:11 Nexus1 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel10 is down (No operational members)
2023 May 2 09:01:11 Nexus1 %ETHPORT-5-IF_BANDWIDTH_CHANGE: Interface port-channel10,bandwidth changed to 100000 Kbit
2023 May 2 09:01:11 Nexus1 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/1 is down (Link failure)
2023 May 2 09:01:11 Nexus1 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel10 is down (No operational members)
2023 May 2 09:01:11 Nexus1 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first operational port changed from Ethernet1/2 to none
2023 May 2 09:01:11 Nexus1 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel1: Ethernet1/2 is down

Next Actions:
• Verify the port-channel configuration and make sure the port-channels
are up.
• Schedule switch interface/vPC/port-channel configuration changes during
a maintenance window.
• If data port-channels will be modified on the switch, disable health
monitoring on the cluster side to avoid a cluster event.
• Take advantage of the auto-rejoin configuration to tweak the existing unit
rejoin timers (see the sketch below). In FMC: Device Management > Edit
Cluster > Cluster Tab > Edit Cluster Health Monitor Settings.
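
For reference, the equivalent ASA CLI under the cluster group looks like the
sketch below; the group name and timer values are illustrative assumptions (on
FTD these settings are pushed from the FMC dialog above):

cluster group ftd_cluster
  health-check holdtime 3
  health-check auto-rejoin cluster-interface unlimited 5 1
  health-check auto-rejoin data-interface 3 5 2

For each auto-rejoin line the three values are the number of attempts, the
interval in minutes, and the interval multiplier.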

Case Study 2: Snort engine failure
In the following example FTD2 is out of the cluster due to a snort failure.

[Topology: FTD1 (Control Unit), FTD2 and FTD3 (Data Units) each connect to the
switching infra with CCL Po48 and Data Po1; snort has failed on FTD2.]

Troubleshooting commands:
• show cluster history
• show cluster info trace
• show logging | inc 7481
• For a snort failure:
  > expert
  sudo su
  cd /ngfw/var/log/
  less messages | grep crash
  less messages | grep snort
  top
  pmtool status | grep <process>
  ls -lrth /ngfw/var/log/crashinfo/ - for snort3
  ls -lrth /ngfw/var/data/cores/ - for snort2
• For high disk usage:
  df -ah
  du -hsx
  find /ngfw -type f -exec du -Sh {} + | sort -rh | head -n 15

Case Study 2: Snort engine failure
• show cluster history command:
FTD1> show cluster history

06:40:41 UTC May 2 2023
CONTROL_NODE    CONTROL_NODE    Event: Asking data node FTD2
                                to quit due to snort Application
                                health check failure, and
                                data node's application state
                                is down
06:40:41 UTC May 2 2023
CONTROL_NODE    CONTROL_NODE    Event: Cluster unit FTD2 state
                                is DISABLED

FTD2> show cluster history

06:40:44 UTC May 2 2023
DATA_NODE       DISABLED        Received control message DISABLE (application health check failure)

• show cluster info trace command:

FTD2> show cluster info trace
[output omitted]
May 02 06:40:41.882 [INFO]State machine changed from state DATA_NODE to DISABLED
May 02 06:40:41.882 [DBUG]Send event (DISABLE, RESTART | MESSAGE, permanent, Received control message DISABLE (application health check
failure)) to FSM. Current state DATA_NODE
May 02 06:40:41.882 [INFO]ASLR enabled, text region 55d3c6367000-55d3caa12cfd
May 02 06:40:41.882 [INFO]Notify chassis de-bundle port for blade unit-1-1, stack 0x000055d3c7c6da2f 0x000055d3ca3de0a3 0x000055d3c7c68edd
May 02 06:40:41.882 [DBUG]Receive CCP message: CCP_MSG_QUIT from FTD2 to FTD1 for reason CLUSTER_QUIT_REASON_APP_HC
May 02 06:40:41.882 [DBUG]Send CCP message to all: CCP_MSG_APP_STATE
May 02 06:40:41.882 [INFO]snort application status is changed from up to down

Case Study 2: Snort engine failure
• Syslog logs:
FTD1# show logging | include 7481

May 02 06:40:41 %FTD-3-748101: Clustering: Peer unit FTD2(1) reported its snort application status is down
May 02 06:40:41 %FTD-3-748103: Clustering: Asking data node FTD2 to quit due to snort Application health check failure, and data node's
application state is down
May 02 06:40:41 %FTD-3-748101: Clustering: Peer unit FTD2(1) reported its diskstatus application status is up
May 02 06:40:41 %FTD-3-748101: Clustering: Peer unit FTD2(1) reported its snort application status is down

• Firepower logs from FTD2:


root@FTD2:/ngfw/var/log# less messages | grep snort
May 2 06:40:41 firepower SF-IMS[20813]: [20813] pm:process [INFO] Calling crash command '/ngfw/usr/local/sf/bin/snort3-save-crashinfo.py' for process
'37c0a584-e89c-11ed-b8ee-1a2fcfb32d3f'.
May 2 06:40:41 firepower SF-IMS[20964]: [21102] ndclientd:ndclientd [WARN] [snort] Received a signal of snort failure
May 2 06:40:41 firepower SF-IMS[20964]: [21102] ndclientd:ndclientd [WARN] [snort] Critical process failures have exceeded the threshold!
May 2 06:40:41 firepower SF-IMS[20964]: [21069] ndclientd:ndclientd [WARN] [snort] Service has failed, stopping Notification Daemon heartbeats.
May 2 06:40:41 firepower SF-IMS[20964]: [21069] ndclientd:ndclientd [WARN] [snort] sending version [2] HB stop message
May 2 06:40:41 firepower Notification Daemon[20963]: Notification Daemon :NGFW-1.0-snort-1.0--->OFFLINE
May 2 06:40:41 firepower Notification Daemon[20963]: Notification Daemon: Sending a Status Down for NGFW-1.0-snort-1.0 with failure reason More than 50
percent of snort instances are down
May 2 06:40:41 firepower SF-IMS[20964]: [21069] ndclientd:ndclientd [INFO] [snort] Received valid HeartbeatStop response from ND.
May 2 06:40:41 firepower snort3-save-crashinfo.py: /snort3-save-crashinfo.py: Successfully generated snort3 crash information to the
file /ngfw/var/log/crashinfo/snort3-crashinfo.1682531972.473076

May 2 06:40:42 firepower snort[56909]: --------------------------------------------------


May 2 06:40:42 firepower snort[56909]: o")~ Snort++ 3.1.36.100-2
May 2 06:40:42 firepower snort[56909]: --------------------------------------------------

Next Actions:
• By default, the snort engine and disk status are monitored by the ndclientd process as part
of the cluster health check.
• If snort fails or the disk is full, the unit is removed from the cluster as it is not healthy.
• For a snort failure:
• Check the /ngfw/var/log/messages file for the failure reason.
• Snort tracebacks and core files can be collected from /ngfw/var/log/crashinfo and
/ngfw/var/data/cores, respectively.
• Engage TAC with a troubleshooting file for further RCA.
• If high disk usage is detected, remove unnecessary files to free disk space (a sketch
follows below); see also Troubleshoot Excessive Disk Utilization.
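
A minimal expert-mode sequence to see what is consuming space; the exact
directories worth pruning vary per system, so treat /ngfw/var/common (a common
location for old troubleshoot files) as an assumed example:

> expert
sudo su
df -h /ngfw
du -hsx /ngfw/var/* | sort -rh | head -n 10
ls -lrth /ngfw/var/common/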

MTU Issues
Case Study 3: CCL MTU mismatch
In the following example FTD2 is not joining the cluster due to a CCL MTU
test failure.

FTD1# show run mtu
mtu inside 1500
mtu outside 1500
mtu cluster 1600

FTD2# show run mtu
mtu inside 1500
mtu outside 1500
mtu cluster 1600

FTD3# show run mtu
mtu inside 1500
mtu outside 1500
mtu cluster 1600

[Topology: switch-side CCL port-channels Po10 to FTD1 (MTU 1600), Po11 to FTD2
(MTU 1500) and Po12 to FTD3 (MTU 1600); each FTD uses CCL Po48 and Data Po1,
with CCL IPs 127.2.1.1, 127.2.2.1 and 127.2.3.1.]

Troubleshooting commands:

Firewall:
• show cluster history
• show cluster info trace
• show running-config mtu
• show interface
• show interface port-channel ID
• ping <interface-name> <IP> size <value>

Switch:
• show interface Ethernet x/x
• show port-channel xx
• show run | in mtu

Case Study 3: CCL MTU mismatch
• show cluster history command:
FTD1> show cluster history

21:22:17 UTC May 7 2023
CONTROL_NODE    CONTROL_NODE    Event: Cluster unit FTD2 state
                                is DATA_NODE_APP_SYNC
[output omitted]
21:22:27 UTC May 7 2023
CONTROL_NODE    CONTROL_NODE    Event: CCL MTU test to unit FTD2
                                failed

• show cluster info trace command:

FTD1# show cluster info trace | inc MTU
May 07 21:45:08.500 [WARN]CCL MTU test to unit FTD2 failed
May 07 21:22:27.183 [WARN]CCL MTU test to unit FTD2 failed
May 07 21:09:45.853 [WARN]CCL MTU test to unit FTD2 failed

Case Study 3: CCL MTU size mismatch
• Ping test over the CCL to verify the MTU:
FTD2# ping cluster 127.2.2.1 size 1600
Type escape sequence to abort.
Sending 5, 1600-byte ICMP Echos to 127.2.2.1, timeout is 2 seconds:
?????
Success rate is 0 percent (0/5)

FTD1# ping cluster 127.2.1.1 size 1600
Type escape sequence to abort.
Sending 5, 1600-byte ICMP Echos to 127.2.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms

• Console logs at the moment of the join failure:

FTD2# WARNING: Unit FTD2 is not reachable in CCL jumbo frame ICMP test, please check cluster interface and switch MTU configuration
The data unit has left the cluster because application configuration sync is timed out on this unit. Disabling cluster now!
Cluster disable is performing cleanup..done.
Unit FTD2 is quitting due to system failure for 1 time(s) (last failure is data unit application configuration sync timeout).
Rejoin will be attempted after 5 minutes.
All data interfaces have been shutdown due to clustering being disabled. To recover either enable clustering or remove cluster group
configuration.

Case Study 3: CCL MTU size mismatch
• Check interface MTU configuration on FTD2 and switch:
FTD2# show interfaces port-channel 1
Interface Port-channel1 "inside", is up, line protocol is up
Hardware is EtherSVI, BW 80000 Mbps, DLY 1600 usec
MAC address f8e5.7e1f.418e, MTU 1500
IP address 172.20.1.1, subnet mask 255.255.255.0

FTD2# show interface port-channel 48
Interface Port-channel48 "cluster", is up, line protocol is up
Hardware is EtherSVI, BW 40000 Mbps, DLY 10 usec
Description: Clustering Interface
MAC address 0015.c500.028f, MTU 1600
IP address 192.168.2.1, subnet mask 255.255.0.0

Nexus1# show interface port-channel 10
port-channel10 is up
admin state is up,
vPC Status: Up, vPC number: 10
Hardware: Port-Channel, address: 4488.1618.ee24 (bia 4488.1618.ee24)
MTU 1600 bytes, BW 40000000 Kbit , DLY 10 usec

Nexus1# show interface port-channel 11
port-channel11 is up,
admin state is up,
vPC Status: Up, vPC number: 11
Hardware: Port-Channel, address: 88fc.5dba.d788 (bia 88fc.5dba.d788)
MTU 1500 bytes, BW 40000000 Kbit , DLY 10 usec

The MTU is wrongly configured on CCL Po11 on the switch.

Case Study 4: Database connections timeout through
the Firewall
In the following example, users report intermittent connection problems
between an application and a database.

Troubleshooting commands (instantiated for this flow in the sketch below):

• Define a flow (source IP / destination IP / destination port /
protocol).
• Apply packet captures on all units on ingress, egress and asp-drop.
• Take packet captures on the source and destination endpoints when
possible.
• Review the cluster configuration:
• show run cluster
• show run mtu
• cluster exec show conn detail <IP_address>
• cluster exec show conn long address <IP_address>
• Collect syslogs

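A sketch of the capture setup for this flow; the capture names and the buffer
size are arbitrary choices, and the interface names follow the syslogs on the
next slides:

FTD1# cluster exec capture CAPDB interface Inside match tcp host 172.16.20.1 host 192.168.100.2 eq 1524
FTD1# cluster exec capture CAPDBO reinject-hide interface Outside match tcp host 172.16.20.1 host 192.168.100.2 eq 1524
FTD1# cluster exec capture ASPDB type asp-drop all buffer 33554432 match ip host 172.16.20.1 host 192.168.100.2
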
Case Study 4: Database connections timeout through
the Firewall
• show conn detail command for the specific flow:
FTD1# show conn detail port 46638
TCP Inside:172.16.20.1/46638 Outside: 192.168.100.2/1524,
flags UIO , idle 12m5s, uptime 12m5s, timeout 1h0m, bytes 28576, cluster sent/rcvd bytes
[output omitted]
From director/backup FTD2: 16858 (23 byte/s)
Initiator: 172.16.20.1, Responder: 192.168.100.2
Connection lookup keyid: 1345113130

• ASP drop captures:

FTD1# show capture

capture ASP type asp-drop all circular-buffer headers-only [Capturing - 3700 bytes]
match ip host 172.16.20.1 host 192.168.100.2

FTD1# cluster exec show cap ASP | i 172.16.20.1.46638

FTD1(LOCAL):******************************************************

FTD2:*************************************************************
1: 19:31:40.797093 Outside P0 192.168.100.2.1524 > 172.16.20.1.46638: P 81975167:81975360(193) ack
17763954 win 122 Drop-reason: (tcp-not-syn) First TCP packet not SYN,
Drop-location: frame 0x000055d587d1c36a flow (NA)/NA

Case Study 4: Database connections timeout through
the Firewall
• Syslogs about this flow:
FTD1> show logging | inc 192.168.100.2

May 23 19:26:53 10.129.10.34 : %FTD-6-302023: Teardown director TCP connection for
Inside:172.16.20.1/46638 to Outside:192.168.100.2/1524 duration 0:01:15 forwarded bytes 16858 Cluster
flow with CLU removed from due to idle timeout

May 23 19:25:38 10.129.10.34 : %FTD-6-302022: Built director stub TCP connection for Inside:/46638
(172.16.20.1/46638) to Outside:192.168.100.2/1524 (192.168.100.2/1524)

May 23 19:25:38 10.129.10.34 : %FTD-6-302023: Teardown forwarder TCP connection for
Outside:192.168.100.2/1524 to unknown:172.16.20.1/46638 duration 0:00:00 forwarded bytes 0 Forwarding or
redirect flow removed to create director or backup flow

May 23 19:25:38 10.129.10.34 : %FTD-6-302022: Built forwarder stub TCP connection for
Outside:192.168.100.2/1524 (192.168.100.2/1524) to unknown:172.16.20.1/46638 (172.16.20.1/46638)

May 23 19:25:38 10.129.10.33 : %FTD-6-302013: Built inbound TCP connection 796624636 for
Inside:172.16.20.1/46638 (172.16.20.1/46638) to Outside:192.168.100.2/1524 (192.168.100.2/1524)

Case Study 4: Database connections timeout through
the Firewall
• cluster-cflow-clu-timeout:
A cluster flow with CLU is considered idle if the director/backup
unit no longer receives periodic updates from the owner.
• show conn detail confirms there is no director/backup
flow for the connection on FTD2.

FTD1# cluster exec show conn detail port 46638 port 1524
FTD1(LOCAL):******************************************************
TCP Inside: 172.16.20.1/46638 Outside: 192.168.100.2/1524,
flags UIO , idle 12m5s, uptime 12m5s, timeout 1h0m, bytes 28576, cluster sent/rcvd bytes
[output omitted]
From director/backup FTD2: 16858 (23 byte/s)
Initiator: 172.16.20.1, Responder: 192.168.100.2
Connection lookup keyid: 1345113130

FTD2:*************************************************************

Case Study 4: Database connections timeout
through the Firewall

MTU on FTD1 and FTD2:

FTD1# show run mtu
mtu Inside 9000
mtu Outside 9000
mtu diagnostic 1500
mtu cluster 9184

FTD2# show run mtu
mtu Inside 9000
mtu Outside 9000
mtu diagnostic 1500
mtu cluster 9184

MTU on the switch:

switch# show int po7 | grep MTU
MTU 9000 bytes, BW 10000000 Kbit, DLY 1 usec
switch# show int po17 | grep MTU
MTU 9000 bytes, BW 10000000 Kbit, DLY 1 usec
switch# show int po9 | grep MTU
MTU 9000 bytes, BW 20000000 Kbit, DLY 1 usec

The FTD cluster CCL interface MTU is set to 9184, while the switch-side cluster
interface MTU is set to 9000. CLU control messages (communication between
cluster members) follow the MTU settings over the CCL, so make sure the MTU
configuration matches between the Firewalls and the Switch.

Next Actions:
The cluster control link traffic includes data packet forwarding, so the cluster
control link needs to accommodate the entire size of a data packet plus
cluster traffic overhead.

• Always make sure:

CCL MTU = Data Interface MTU + at least 100 bytes of trailer
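
If the switch-side CCL MTU must be raised to match, a minimal NX-OS sketch (the
port-channel ID is a placeholder, and on some Nexus platforms jumbo frames are
enabled through a network-qos policy instead of a per-interface command):

switch# configure terminal
switch(config)# interface port-channel <ccl-po-id>
switch(config-if)# mtu 9216

Then verify from the firewall with a jumbo ping over the CCL, for example:
FTD1# ping cluster <peer-ccl-ip> size 9184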

NAT/PAT Failures
Case Study 5: PAT allocation imbalance (Firepower 6.6)
In the following example, FTD1 is unable to create new NAT connections
after it rejoined the cluster following a cluster failure.

Troubleshooting commands:

• cluster exec show nat pool cluster
• cluster exec show nat pool
• cluster exec show xlate | inc <IP>
• show run nat
• show run all xlate
• cluster exec capture <name> interface <name> trace match <protocol>
host <IP1> host <IP2>
• cluster exec capture <name> type asp-drop all <protocol> host <IP1>
host <IP2>
• cluster exec show asp drop

Case Study 5: PAT allocation imbalance (Firepower 6.6)
• Before the failure, each unit is owner of one IP address of the pool:

FTD1# show running-config object
object network inside-net
subnet 192.168.100.0 255.255.255.0
object network Mapped-IPGroup
range 192.0.2.150 192.0.2.151

FTD1# show running-config nat
object network inside-net
nat (Inside,Outside) dynamic pat-pool Mapped-IPGroup

FTD1# show nat pool cluster
IP outside 192.0.2.150, owner FTD1, backup FTD2
IP outside 192.0.2.151, owner FTD2, backup FTD1

[Topology: Client → switch infra → Inside → FTD1 (Control Unit) / FTD2 (Data
Unit) → Outside (PAT pool 192.0.2.150-192.0.2.151) → switch infra]

• show nat pool command shows allocation on both IPs:

FTD1# show nat pool
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.150, range 1-511, allocated 1
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.150, range 512-1023, allocated 2
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.150, range 1024-65535, allocated 12312
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.151, range 1-511, allocated 3
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.151, range 512-1023, allocated 12
UDP PAT pool outside:Mapped-IPGroup, address 192.0.2.151, range 1024-65535, allocated 421

Case Study 5: PAT allocation imbalance (Firepower 6.6)
• FTD1 failed and left the cluster. FTD2 now becomes the owner of both
IP addresses:
FTD2# show nat pool cluster
IP outside 192.0.2.150, owner FTD2, backup FTD1
IP outside 192.0.2.151, owner FTD2, backup FTD1

• Due to the high traffic volume, FTD2 created connections
using both public IP addresses.
• FTD1 recovers and re-joins the cluster.
• The control unit (FTD2) will attempt to find an unused IP
address in the PAT pool to assign to FTD1.

Case Study 5: PAT allocation imbalance (Firepower 6.6)
• Since the PAT pool is composed of only two IP addresses,
FTD2 keeps ownership of both, as it has active xlates
on each of them.

FTD2# show nat pool cluster


IP outside 192.0.2.150, owner FTD2, backup FTD1
IP outside 192.0.2.151, owner FTD2, backup FTD1

• show xlate command displays the NAT sessions:

FTD2# show xlate
138693 in use, 3046971 most used
Flags: D - DNS, e - extended, I - identity, i - dynamic, r - portmap,
s - static, T - twice, N - net-to-net

TCP PAT from inside:192.168.100.10/53740 to outside:192.0.2.150/53740 flags ri idle 0:07:36 timeout 0:00:30
TCP PAT from inside:192.168.100.23/63850 to outside:192.0.2.150/63850 flags ri idle 0:38:16 timeout 0:00:30
TCP PAT from inside:192.168.100.12/63841 to outside:192.0.2.151/33683 flags ri idle 0:42:38 timeout 0:00:30
TCP PAT from inside:192.168.100.114/62036 to outside:192.0.2.151/62036 flags ri idle 2:02:13 timeout 0:00:30

Next Actions:
• An IP address can be re-balanced only when zero xlates exist for that IP
address.
• As a workaround, the xlates can be cleared from the IP to make it
available for redistribution (see the sketch below).
• Starting with Firepower 6.7, the cluster uses port block-based PAT
distribution.
• More enhancements were made in Firepower 7.0 and 7.1.
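
A minimal sketch of the workaround, using the pool IP from this example; note
that clearing xlates is disruptive and tears down the translations (and their
connections) currently using that IP:

FTD2# clear xlate global 192.0.2.151
FTD2# show nat pool cluster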

Troubleshooting
Packet Drops
How to troubleshoot packet drops through a
cluster?
1. Define a specific flow.
2. What service is impacted?
3. Define a source host, destination host,
destination port and protocol.
4. Define the ingress and egress interfaces.

Source Host: 172.16.10.10
Destination Host: 72.163.4.161 (cisco.com)
Destination port: 443
Protocol: TCP
Ingress Interface: Inside
Egress Interface: Outside

5. Collect packet captures: they can be applied on the data plane, the CCL and ASP drops.

• cluster exec: enables captures on all cluster members
• reinject-hide: hides packets reinjected from the CCL

FTD1# cluster exec capture CAPI interface INSIDE match tcp host 172.16.10.10 host 72.163.4.161 eq 443
FTD1# cluster exec cap CAPO reinject-hide interface OUTSIDE match tcp host 192.0.2.150 host 72.163.4.161 eq 443
FTD1# cluster exec cap ASP type asp-drop all buffer 33554432 headers-only match ip host 172.16.10.10 host 72.163.4.161
FTD1# cluster exec capture capccl interface cluster trace match icmp any any

FTD1# cluster exec show capture CAPI
FTD1(LOCAL):******************************************************
capture CAPI type raw-data buffer interface INSIDE [Capturing - 5140 bytes]
match tcp host 172.16.10.10 host 72.163.4.161 eq 443

FTD2:*************************************************************
capture CAPI type raw-data buffer 33554432 interface INSIDE [Capturing - 260 bytes]
match tcp host 172.16.10.10 host 72.163.4.161 eq 443

• trace: to see how packets are handled by the data plane
FTD1# show cap CAPI packet-number 1 trace
25985 packets captured
1: 08:42:09.362697 802.1Q vlan#201 P0 172.16.10.10.45954 > 72.163.4.161.443: S 992089269:992089269(0) win 29200
<mss 1460,sackOK,timestamp 495153655 0,nop,wscale 7>
...
Phase: 4
Type: CLUSTER-EVENT
Subtype:
Result: ALLOW
Config:
Additional Information:
Input interface: 'INSIDE'
Flow type: NO FLOW
I (0) got initial, attempting ownership.

Phase: 5
Type: CLUSTER-EVENT
Subtype:
Result: ALLOW
Config:
Additional Information:
Input interface: 'INSIDE'
Flow type: NO FLOW
I (0) am becoming owner
...

6. Connection logs and syslogs:

• show conn command shows the connection table:

FTD1# cluster exec show conn add 172.16.10.10
FTD1(LOCAL):******************************************************
TCP Outside 72.163.4.161:443 INSIDE 172.16.10.10:1526, idle 0:06:32, bytes 15826, flags UIO
FTD2:*************************************************************
TCP Outside 72.163.4.161:443 INSIDE 172.16.10.10:1526, idle 0:06:32, bytes 0, flags Y

• Syslogs are helpful for reviewing connection event data; see the
Cisco Secure Firewall Threat Defense Syslog Messages guide.
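
For example, filtering the syslogs for the flow defined earlier (same
destination host assumed):

FTD1# cluster exec show logging | include 72.163.4.161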

Key Takeaways

• CCL MTU = Data Interface MTU + at least 100 bytes of trailer

• MTU configuration must match between the cluster and the switch

• Interface health monitoring is enabled on all interfaces by default

• Snort engine and disk status monitoring are enabled by default

• For packet drops, define a specific source host, destination host,
destination port and protocol to set up packet captures

Cluster Troubleshooting commands:
show cluster history: Shows the event history for the cluster.
show cluster access-list: Shows hit counters for access policies.
show cluster conn: Shows the aggregated count of in-use connections for all units.
show cluster conn count: Shows only the connection count.
show cluster interface-mode: Shows the cluster interface mode, either spanned or individual.
show cluster memory: Shows system memory utilization.
show cluster resource usage: Shows system resources and usage.
show cluster traffic: Shows traffic statistics.
show cluster xlate count: Shows current translation information.
show cluster info: Shows cluster information.
show cluster info trace: Shows the debug information.
show cluster info trace module hc: Shows the debug information regarding health checks.
show cluster info health details: Verifies the heartbeat frequency.
show cluster info conn-distribution: Shows the connection distribution in the cluster.
show cluster info packet-distribution: Shows the packet distribution in the cluster.
cluster exec show nat pool cluster: Checks whether the pool is distributed.
cluster exec show nat pool: Displays statistics of NAT pool usage on all units.
cluster exec show conn detail: Displays connections in detail, including translation type and interface
information.
cluster exec show conn long address: Displays connections in long format.
cluster exec capture <name> interface <name> trace match <protocol> host <IP1> host <IP2>: Configures captures on
the data plane and CCL.
cluster exec capture <name> type asp-drop all <protocol> host <IP1> host <IP2>: Configures ASP drop captures.
cluster exec show cap <name>: Displays the details of a capture.
cluster exec show asp drop: Debugs packets or connections dropped by the accelerated security path.

Fill out your session surveys!

Attendees who fill out a minimum of four session
surveys and the overall event survey will get
Cisco Live-branded socks (while supplies last)!

Attendees will also earn 100 points in the
Cisco Live Game for every survey completed.

These points help you get on the leaderboard and increase your chances of winning daily and grand prizes.

Q&A

Continue your education
• Visit the Cisco Showcase for related demos
• Book your one-on-one Meet the Engineer meeting
• Attend the interactive education with DevNet,
Capture the Flag, and Walk-in Labs
• Visit the On-Demand Library for more sessions at
www.CiscoLive.com/on-demand

Thank you
