
How To Perform a Gaia and SecurePlatform Firewall Health Check

22 July 2014

Classification: [Protected]
© 2011 Check Point Software Technologies Ltd.
All rights reserved. This product and related documentation are protected by copyright and distributed under
licensing restricting their use, copying, distribution, and decompilation. No part of this product or related
documentation may be reproduced in any form or by any means without prior written authorization of Check
Point. While every precaution has been taken in the preparation of this book, Check Point assumes no
responsibility for errors or omissions. This publication and features described herein are subject to change
without notice.
RESTRICTED RIGHTS LEGEND:
Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph
(c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and FAR
52.227-19.
TRADEMARKS:
Refer to the Copyright page (http://www.checkpoint.com/copyright.html) for a list of our trademarks.
Refer to the Third Party copyright notices (http://www.checkpoint.com/3rd_party_copyright.html) for a list of
relevant copyrights and third-party licenses.

Important Information
Latest Software
We recommend that you install the most recent software release to stay up-to-date with the latest functional
improvements, stability fixes, security enhancements and protection against new and evolving attacks.

Latest Documentation
The latest version of this document is at:
http://supportcontent.checkpoint.com/documentation_download?ID=12143
For additional technical information, visit the Check Point Support Center
(http://supportcenter.checkpoint.com).

Revision History
Date Description

22 July 2014 Updated to include support for the Gaia operating system

11 May 2011 First release of this document

Feedback
Check Point is engaged in a continuous effort to improve its documentation.
Please help us by sending your comments
(mailto:cp_techpub_feedback@checkpoint.com?subject=Feedback on How To Perform A Gaia and
SecurePlatform Firewall Health Check ).

Contents
Important Information
How To Perform a Firewall Health Check
Before You Start
Performing a Firewall Health Check
  Health Check Action Severity Guide
  Section 1 Physical Platform Checks
    Date, System Uptime and Clock
    Disk Space
    Physical RAM and Swap Space
    Memory Usage
    CPU Usage
    Interface Errors
    Fragmentation
    Checking dmesg and the Messages File
  Section 2 Firewall Application Checks
    Processes
    Capacity Optimization
    ClusterXL and State Synchronization
    SecureXL
    Aggressive Ageing
    HFA Patching
Completing the Procedure
Verifying


How To Perform a Firewall Health Check
Objective
This document explains the steps for performing a complete health check on a Gaia or SecurePlatform
firewall. The health checks in this document are based on best practice recommendations given by
Check Point experts.

Supported Versions
All supported Check Point versions including R70+

Supported Operating Systems


Gaia and SecurePlatform versions 2.4 and 2.6

Supported Appliances
All supported Check Point appliances and Open Servers

Before You Start


Related Documentation and Assumed Knowledge
The reader should be familiar with using the firewall's command line and have administrator experience.
The articles mentioned in this health check document and other articles can be found by searching the
Check Point Secure Knowledge: https://supportcenter.checkpoint.com/supportcenter/index.jsp

Impact on the Environment and Warnings


The procedure does not add any significant load to the appliance being reviewed. For the best
overall interpretation of the firewall's health, carry it out during normal working hours, unless
the firewall is under stress.


Performing a Firewall Health Check


Health Check Action Severity Guide
The health checks described in this report are based around best practice recommendations. The findings in
example output from individual checks are rated as follows:

Serious Needs immediate attention.

Attention Needs investigation

Good No need for any action.

Section 1 Physical Platform Checks


Physical platform health checks are designed to monitor the physical health of the firewall appliance by
examining key components of the system such as memory, CPU and hard disk usage to determine if any
attention is required.

Date, System Uptime and Clock:


Confirm the correct date is set on the system using the date command.

The system uptime can be examined using the command:


uptime

Example output:
Zulu# uptime
09:46:34 up 124 days, 9:40, 1 user, load average: 0.36, 0.19, 0.14

If a low uptime is shown it normally indicates that the firewall has been administratively rebooted but it may
also have been due to a self-reboot, for example due to a panic.

Low uptime: if you suspect that the uptime is lower than it should be, check the
/var/log/messages file for the reason for the last reboot.


For state synchronization between cluster members to function properly the clocks on the cluster members
must be set to within 1 minute of each other. The best means of achieving this is to use NTP. To check the
time use the uptime command. (The time command does not show seconds).
If there is a discrepancy of more than 10 seconds on the cluster members there may be an issue with NTP.

Examine the /var/log/messages file to determine if the NTP server updates are working properly:

Sep 6 16:46:42 Zulu ntpdate[6291]: step time server 10.225.227.57 offset 1.633937 sec

Sep 6 16:48:52 Shaka ntpdate[28347]: no server suitable for synchronization found

Manually adjust the time and date if required and fix the NTP configuration or network issue.
To have accurate timestamps on the logs, it is recommended that non-clustered firewalls are also
configured to use NTP to synchronise their clock to an NTP reference clock.

Disk Space
The disk space usage can be examined using the command:
df -k

Example output:

[Expert@Zulu]# df -k
Filesystem  1K-blocks     Used  Available  Use%  Mounted on
/dev/sda5      600832   187800     382512   33%  /
none           600832   187800     382512   33%  /dev/pts
/dev/sda1      147766    10124     130013    8%  /boot
/dev/sda7     1541680   930324     533044   64%  /opt
none          2045688        0    2045688    0%  /dev/shm
/dev/sda6     1541680   593844     869524   41%  /sysimg
/dev/sda8    27024000  5472984   20178264   22%  /var
[Expert@Zulu]#

In the above example, all partitions are under 70% usage.

If a partition has a use% that is more than 70% but less than 90%, it needs attention.

If the use% is 90% or more, it is serious and needs immediate attention.

See if the partition can be cleaned up to free up disk space:

/var/opt/CPsuite-RXX/fw1/log may be filled with old log files if the firewall has been logging locally.

/var/log may contain old messages files.
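
The thresholds above can be applied to saved df -k output with a short script. The following Python sketch is illustrative only: the function name is hypothetical, and mapping the 70% and 90% thresholds to the "Attention" and "Serious" ratings of the severity guide is an assumption of this example.

```python
# Sketch: rate partitions from `df -k` output using the 70% / 90% thresholds.
# Assumption: 70-90% maps to "Attention" and 90%+ maps to "Serious".

SAMPLE_DF = """\
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 600832 187800 382512 33% /
/dev/sda7 1541680 930324 533044 64% /opt
/dev/sda8 27024000 5472984 20178264 22% /var
"""


def rate_disk_usage(df_output):
    """Return {mount point: rating} for every data line of `df -k` text."""
    ratings = {}
    for line in df_output.strip().splitlines():
        fields = line.split()
        # Skip the header and anything that is not "<...> <n>% <mount>".
        if len(fields) < 6 or not fields[-2].endswith("%"):
            continue
        use_pct = int(fields[-2].rstrip("%"))
        mount = fields[-1]
        if use_pct >= 90:
            ratings[mount] = "Serious"
        elif use_pct > 70:
            ratings[mount] = "Attention"
        else:
            ratings[mount] = "Good"
    return ratings


if __name__ == "__main__":
    for mount, rating in rate_disk_usage(SAMPLE_DF).items():
        print(mount, rating)
```

The sample lines are taken from the example output above; on a live system the text would come from running df -k in expert mode.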


Physical RAM and Swap Space:


Examine the RAM and swap space usage (kilobytes) with:
free -k -t

Example output:

[Expert@Zulu]# free -k -t
             total       used       free     shared    buffers     cached
Mem:       2058236     971332    1086904          0      95104     268984
-/+ buffers/cache:     607244    1450992
Swap:      4192944          0    4192944
Total:     6251180     971332    5279848
[Expert@Zulu]#

The total column shows the amount of RAM installed in the system (2GB in the above example)
and the amount of disk space allocated for swap space (4GB).

The amount of swap space is normally automatically set to twice the size of the physical memory,
with 4 GB being the maximum.

The used column indicates how much RAM and swap space are being used.

The free column indicates how much RAM and swap space are available.

In the above example output the used column indicates <1 GB of RAM is being used and no
swap space is being used.

If for some reason the amount of free RAM becomes low, the appliance will start to preserve free
RAM by swapping out the contents of the memory to the hard disk (swap space). The performance
will be sub-optimal if swap space is being used due to time and resources spent writing and reading
to the hard-disk.

Example Output:

[Expert@Zulu]# free -k -t
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
Total:     6248032    2633404    3614628
[Expert@Zulu]#

Swap space usage may indicate not enough memory is installed in the appliance. The kernel is
32 bit and can use up to 4GB. It is recommended to upgrade the memory if less than 4GB of RAM
are installed.

For further information about the amount of RAM that is supported by SecurePlatform refer to:
sk22343: What is the maximum memory supported by SecurePlatform?
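
As a rough illustration of the swap check described above, the following Python sketch reads the "used" value from the Swap line of free -k -t output. The function name is hypothetical, and the sample text is an excerpt of the second example output.

```python
# Sketch: report swap usage from `free -k -t` output.
# A non-zero value suggests the appliance has started swapping.

SAMPLE_FREE = """\
Mem: 2055120 1897424 157696 0 98732 697688
-/+ buffers/cache: 1101004 954116
Swap: 4192912 735980 3456932
Total: 6248032 2633404 3614628
"""


def swap_used_kb(free_output):
    """Return the 'used' column of the Swap line in kilobytes, or None."""
    for line in free_output.splitlines():
        fields = line.split()
        # The Swap line reads: Swap: <total> <used> <free>
        if len(fields) >= 4 and fields[0] == "Swap:":
            return int(fields[2])
    return None


if __name__ == "__main__":
    used = swap_used_kb(SAMPLE_FREE)
    print("swap in use (KB):", used)
```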


Memory Usage
The firewall's memory usage can be examined by using the command:
fw ctl pstat

The output of this command is vast and can be difficult to understand as not all the output is intuitive. The
statistics that need to be checked to ensure memory is healthy are:
hash kernel memory (hmem)
system kernel memory (smem)
kernel memory (kmem)

Example output:

[Expert@Zulu]# fw ctl pstat | more

Machine Capacity Summary:
  Memory used: 7% (128MB out of 1638MB) - below low watermark
  Concurrent Connections: 21% (43253 out of 199900) - below low watermark
  Aggressive Aging is not active

Hash kernel memory (hmem) statistics:
  Total memory allocated: 142606336 bytes in 34782 4KB blocks using 34 pools
  Initial memory allocated: 20971520 bytes (Hash memory extended by 121634816 bytes)
  Memory allocation limit: 335544320 bytes using 512 pools
  Total memory bytes used: 39254196 unused: 103352140 (72.47%) peak: 133739228
  Total memory blocks used: 10335 unused: 24447 (70%) peak: 32795
  Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free

System kernel memory (smem) statistics:
  Total memory bytes used: 188577580 peak: 227270504
  Blocking memory bytes used: 1958392 peak: 2205256
  Non-Blocking memory bytes used: 186619188 peak: 225065248
  Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free

Kernel memory (kmem) statistics:
  Total memory bytes used: 84876956 peak: 177110948
  Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
  External Allocations: 0 for packets, 31589936 for SXL

In the above example there are no hmem, smem, kmem failed allocations.

Presence of hmem failed allocations indicates that the hash kernel memory was full. This is not
a serious memory problem but indicates that there is a configuration problem. The value assigned to the
hash memory pool (either manually or automatically, by changing the number of concurrent
connections in the Capacity Optimization section of a firewall) determines the size of the hash kernel
memory. If a low hmem limit was configured, it leads to improper usage of the OS memory. See
Capacity Optimization in the Firewall Application Checks section for further information.

Presence of smem failed allocations indicates that the OS memory was exhausted or there are
large non-sleep allocations. This is symptomatic of a memory shortage. If there are failed smem
allocations and the memory is less than 2 GB, upgrading to 2GB may fix the problem. Decreasing
the TCP end timeout and decreasing the number of concurrent connections can also help reduce
memory consumption.


Presence of kmem failed allocations means that some applications did not get memory. This is
usually an indication of a memory problem, most commonly a memory shortage. (The natural limit is
2GB, since the kernel is 32-bit.)

A memory shortage sometimes indicates a memory leak. To troubleshoot a memory shortage, stop
the load and let connections close. If the memory consumption returns to normal, you are not
dealing with a memory leak. Such a shortage might happen when traffic volumes are too high for
the device capacity. If the memory shortage happens after a change in the system or the
environment, undo the change and check whether kmem memory consumption goes down.

For optimum performance there should not be any failed memory allocations.
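
When fw ctl pstat output has been saved to a file, the failed allocation counters can be picked out automatically. The following Python sketch is one possible approach, assuming the output has the same layout as the example above; the function name and regular expressions are illustrative, not part of any Check Point tool.

```python
# Sketch: extract the "failed alloc" counters for hmem, smem and kmem
# from `fw ctl pstat` text laid out like the example output.
import re

SECTION_RE = re.compile(r"\((hmem|smem|kmem)\) statistics")
FAILED_RE = re.compile(r"(\d+) failed alloc")

SAMPLE_PSTAT = """\
Hash kernel memory (hmem) statistics:
  Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
  Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
  Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
  External Allocations: 0 for packets, 31589936 for SXL
"""


def failed_allocations(pstat_output):
    """Return {section: failed allocation count} for hmem, smem and kmem."""
    failed = {}
    section = None
    for line in pstat_output.splitlines():
        header = SECTION_RE.search(line)
        if header:
            section = header.group(1)
            continue
        match = FAILED_RE.search(line)
        # Keep only the first counter found inside each section.
        if section and match and section not in failed:
            failed[section] = int(match.group(1))
    return failed


if __name__ == "__main__":
    for name, count in failed_allocations(SAMPLE_PSTAT).items():
        print(name, "failed allocations:", count)
```

Any non-zero count would then be investigated as described for the corresponding memory type.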

CPU Usage
CPU usage on single and multicore platforms can be checked with the command:
top

Example top output from a badly optimized multi-core system:

Explanation of the above output:

%us: Time spent running non-kernel code (user)
%sy: Time spent running kernel code (system)
%ni: Nice time
%id: Time spent idle
%wa: Time spent waiting for I/O
%hi: Hardware interrupt
%si: Software interrupt
%st: Steal time (involuntary wait time)
The idle value (%id) shows how busy the appliance is. If the value is 0, the CPU is maxed out. With the
firewall under load, examine the output of idle column (%id) for each CPU and determine if core usage is
spread out evenly.


In the above example the core usage is uneven; some cores are maxed out while other cores
are mostly idle. The core allocation (sim affinity) may require tuning to optimize the usage of the
cores and improve the performance.

For information on core tuning, refer to:


sk33250: Automatic SIM Affinity on Multi-Core CPU Systems

The CPU usage is broken down into:

High CPU in user time (%us) indicates that some daemon process is consuming high CPU;
security server processes like fwssd and in.ahttpd have been offenders in the past. (Figure out
which process it is from the output of ps or top.)

High CPU usage in system (%sy) indicates that the Check Point kernel (traffic being inspected
by Check Point or SmartDefense) is consuming CPU. Certain configurations in SmartDefense and
web-Intelligence can cause this to occur by disabling SecureXL templating or completely disabling
SecureXL acceleration.

High CPU in wait time (%wa) occurs when the CPU was idle due to the system waiting for an
outstanding disk I/O request to complete. This indicates your system is probably low on physical
memory and is swapping out memory (paging)*. The CPU is not actually busy if this number is
spiking; the CPU is blocked from doing any useful work waiting for an I/O event to complete.

A high value against software interrupt (%si) indicates that there is probably a high load of traffic
on the appliance. The interface errors (netstat -i) should be examined to see if this is a cause for
concern.

* The occurrence of paging can be determined by running vmstat -n 5 5 and checking the swapped
in (si) and swapped out (so) statistics. Disregard the first line as it is an average value since the
appliance started.


Interface Errors
Interface statistics are displayed using the command:
netstat -i

Example output:

[Expert@Zulu]# netstat -i
Iface    MTU  Met       RX-OK  RX-ERR  RX-DRP  RX-OVR       TX-OK  TX-ERR  TX-DRP  TX-OVR  Flg
eth0    1500    0    29597525       0       0       0    42570398       0       0       0  BMRU
eth1    1500    0  1032315302       0    3976       0  1615311511       0       0       0  BMRU
eth2    1500    0  1624715902       0   12111       0  1025019332       0       0       0  BMRU
eth6    1500    0    26828076       0       0       0   477906370       0       0       0  BMRU
lo     16436    0     5922470       0       0       0     5922470       0       0       0  LRU
[Expert@Zulu]#

In the above example, RX-DRP indicates that the appliance is dropping packets at the
network. This is not ideal, but as a percentage of received packets the number of RX-DRP packets
is insignificant and can be disregarded as a source of concern. If the ratio is higher than
0.5%, attention is required.

The RX and TX columns show how many packets have been received or transmitted error-free
(RX-OK/TX-OK) or damaged (RX-ERR/TX-ERR); how many were dropped (RX-DRP/TX-DRP); and how many
were lost because of an overrun (RX-OVR/TX-OVR).

RX-ERR/TX-ERR errors usually indicate a mismatch in duplex setting or MTU size, bad cabling, or
possibly a faulty interface card. Check the switch settings and fix the speed and duplex settings if
there is a mismatch, check the cabling, and try a spare interface.

RX-DRP implies the appliance is dropping packets at the network. If the ratio of RX-DRP to
RX-OK is greater than 0.5%, attention is required, as it is a sign that the firewall does not have
enough FIFO memory buffer (descriptors) to hold the packets while waiting for a free interrupt to
process them.

When the FIFO buffer is full the appliance will drop new packets as it does not have any spare buffer to hold
them. A possible solution is to use Link Aggregation or tune the driver by increasing the descriptors, see:
sk25921: Tuning Intel PRO/1000 family NICs driver parameters for maximal throughput

TX-DRP usually indicates that there is a downstream issue and the firewall has to drop the
packets as it is unable to put them on the wire fast enough. Increasing the bandwidth through link
aggregation or introducing flow control may be a possible solution to this problem.
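
The 0.5% rule for RX-DRP can be checked per interface with a few lines of code. The following Python sketch assumes netstat -i output with the column layout shown in the example above; the function names are hypothetical.

```python
# Sketch: compute RX-DRP as a percentage of RX-OK for each interface
# and list the interfaces that exceed the 0.5% threshold.

SAMPLE_NETSTAT = """\
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 29597525 0 0 0 42570398 0 0 0 BMRU
eth1 1500 0 1032315302 0 3976 0 1615311511 0 0 0 BMRU
eth2 1500 0 1624715902 0 12111 0 1025019332 0 0 0 BMRU
"""


def rx_drop_percent(netstat_output):
    """Return {interface: RX-DRP / RX-OK * 100} from `netstat -i` text."""
    ratios = {}
    header = None
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields
            ok_col = header.index("RX-OK")
            drp_col = header.index("RX-DRP")
            continue
        # Ignore anything before the header and lines that are too short.
        if header is None or len(fields) <= drp_col:
            continue
        rx_ok = int(fields[ok_col])
        rx_drp = int(fields[drp_col])
        ratios[fields[0]] = 100.0 * rx_drp / rx_ok if rx_ok else 0.0
    return ratios


def needs_attention(ratios, threshold_pct=0.5):
    """Return the interfaces whose drop percentage exceeds the threshold."""
    return sorted(name for name, pct in ratios.items() if pct > threshold_pct)


if __name__ == "__main__":
    ratios = rx_drop_percent(SAMPLE_NETSTAT)
    for name, pct in ratios.items():
        print(f"{name}: {pct:.6f}% dropped")
    print("needs attention:", needs_attention(ratios))
```

With the example figures, both eth1 and eth2 are far below 0.5%, which matches the assessment given for the example output.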


Fragmentation
Excessive fragmentation will have a detrimental impact on the firewall's performance. When packets are
fragmented by the network, the kernel may receive them out of order. The kernel has to wait until it has
received all the fragments before it can re-assemble them and then inspect the re-assembled
packet. Fragmented traffic cannot be accelerated by the performance pack (SecureXL).

To examine the level of fragmentation run the following command:


fw ctl pstat

Find the section in the output for fragmentation and if there is fragmentation, examine the expired and
failures values.

Example fw ctl pstat fragmentation output (truncated):

Fragments:
130963 fragments, 64066 packets, 2337 expired, 0 short,
4 large, 304 duplicates, 0 failures

Expired denotes how many fragments were expired when the firewall failed to reassemble
them within a 20-second time frame or when, due to memory exhaustion, they could not be kept in
memory any longer.

Failures denotes the number of fragmented packets that were received that could not be
successfully re-assembled.

The number of failures should be viewed in context with the amount of fragmentation occurring and relative
to the total packet throughput (netstat -i). The values in pstat are cumulative, and large values may
actually be small relative to the total packet throughput. However, if there is a significant number of
failures, the cause should be traced to determine if there is a way to mitigate it.

In the above example output 1.8% of fragments that were received had to be expired by the
firewall but as there were no failures it implies that the fragments were subsequently re-transmitted
and successfully re-assembled by the firewall so no packets were lost.

If the source of fragmentation is external, there is little that can be done to alleviate the problem,
but if it is internal, reducing the MTU size on the offending server may resolve the problem.
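
The 1.8% figure quoted for the example is the number of expired fragments divided by the total number of fragments (2337 / 130963). The following Python sketch reproduces that arithmetic from the Fragments lines of fw ctl pstat; the function names are hypothetical and the parsing assumes the layout of the example output.

```python
# Sketch: parse the "Fragments:" counters and compute the expired ratio.
import re

SAMPLE_FRAGMENTS = """\
Fragments:
130963 fragments, 64066 packets, 2337 expired, 0 short,
4 large, 304 duplicates, 0 failures
"""

COUNTER_RE = re.compile(
    r"(\d+) (fragments|packets|expired|short|large|duplicates|failures)"
)


def fragment_counters(text):
    """Return {counter name: value} for the fragmentation statistics."""
    return {label: int(count) for count, label in COUNTER_RE.findall(text)}


def expired_percent(counters):
    """Return expired fragments as a percentage of all fragments."""
    if counters["fragments"] == 0:
        return 0.0
    return 100.0 * counters["expired"] / counters["fragments"]


if __name__ == "__main__":
    counters = fragment_counters(SAMPLE_FRAGMENTS)
    print(f"expired: {expired_percent(counters):.1f}%")
    print("failures:", counters["failures"])
```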


Checking dmesg and the Messages File


The output of the dmesg command and the /var/log/messages file should be examined for tell-tale
messages:

Neighbour table overflow

If this message is seen it indicates that the default limit of the kernel ARP cache (1024) is set too low. This
will only occur if there is a large subnet connected directly to the firewall or cluster. If the message is seen it
is possible to increase the size of the table by editing the /etc/sysctl.conf file to include the lines:

net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

This will increase the ARP cache to 4096 after the firewall has been re-booted.

FW-1: State synchronization is in risk. Please examine your
synchronization network to avoid further problems!

If this message is seen it indicates that there is an issue with the state synchronization network which can
impede network performance. Consult the State Synchronization section in the Firewall Application
Checks for further information.

By default all services are state synchronized, but some services do not need syncing and may cause
excessive load on the sync network (e.g. DNS). Disable state sync for all short-lived connections and/or
services which don't require stateful failover.

FW-1: SecureXL: Connection templates are not possible for the
installed policy (network quota is active). Please refer to the
documentation for further details.

If this message is seen it indicates that there is a SmartDefense option active (in this case network quota)
that has disabled templating of connections in SecureXL. Disabling SecureXL templates restricts the
performance of SecureXL and is therefore undesirable. In this case, disabling the network quota option
would restore the ability to produce templates and increase the performance of the firewall.

Out of Memory: Killed process <PID> (<PROCESS_NAME>)

If this message is seen it means there is no more memory available in the user space. As a result, Gaia
or SecurePlatform starts to kill processes.

From time to time other messages of a similar nature may appear in dmesg, the /var/log/messages file and
on the console. It is always a good idea to research the message in the Check Point Secure Knowledge if
you are unsure of the meaning.

For further information see: sk33219: Critical error messages and logs
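
Searching for the tell-tale messages listed in this section can be automated. The following Python sketch looks for the four message texts quoted above in a list of log lines; the function name is hypothetical, and two of the three sample lines are invented for illustration (only the ntpdate line comes from this document).

```python
# Sketch: scan log lines for the tell-tale messages described above.
# Matching is done on substrings of the message texts quoted in this section.

TELLTALE_MESSAGES = (
    "Neighbour table overflow",
    "State synchronization is in risk",
    "Connection templates are not possible",
    "Out of Memory: Killed process",
)

# Illustrative input; the second and third lines are invented examples.
SAMPLE_LOG = """\
Sep 6 16:46:42 Zulu ntpdate[6291]: step time server 10.225.227.57 offset 1.633937 sec
Sep 6 17:02:10 Zulu kernel: Neighbour table overflow.
Sep 6 17:05:31 Zulu kernel: Out of Memory: Killed process 1234 (example_proc)
"""


def find_telltale(lines):
    """Return {message: [matching lines]} for every tell-tale message found."""
    hits = {}
    for line in lines:
        for message in TELLTALE_MESSAGES:
            if message in line:
                hits.setdefault(message, []).append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    for message, matched in find_telltale(SAMPLE_LOG.splitlines()).items():
        print(f"{message}: {len(matched)} occurrence(s)")
```

On a firewall the input would typically be the lines of /var/log/messages or captured dmesg output.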


Section 2 Firewall Application Checks:


Firewall application checks are designed to monitor the health of the firewall application.

Processes
A list of processes running on the firewall can be displayed with the following commands:
top
ps auxw

Use the top command to check if any process is hogging CPU or Memory and to see if there are any
Zombie processes.

Example output:

[Expert@Zulu]# top
09:46:44 up 24 days, 9:40, 1 user, load average: 0.30, 0.19, 0.14
55 processes: 50 sleeping, 2 running, 3 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 15.0% 0.0% 1.0% 10.0% 24.0% 0.0% 150.0%
cpu00 7.0% 0.0% 0.0% 0.0% 1.0% 0.0% 92.0%
cpu01 8.0% 0.0% 1.0% 10.0% 23.0% 0.0% 58.0%
Mem: 4091376k av, 1390028k used, 2701348k free, 0k shrd, 90864k buff
786476k active, 140320k inactive
Swap: 4192944k av, 0k used, 4192944k free 278224k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
1526 root 25 0 97280 95M 11396 R 15.8 2.3 2590m 1 fw
1 root 15 0 512 512 452 S 0.0 0.0 0:17 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration
4 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
5 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd
6 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd
9 root 25 0 0 0 0 SW 0.0 0.0 0:00 1 bdflush
7 root 15 0 0 0 0 SW 0.0 0.0 0:10 0 kswapd
8 root 15 0 0 0 0 SW 0.0 0.0 0:12 0 kscand
10 root 15 0 0 0 0 SW 0.0 0.0 0:14 0 kupdated
17 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 scsi_eh_0
22 root 15 0 0 0 0 SW 0.0 0.0 0:14 0 kjournald
90 root 25 0 0 0 0 SW 0.0 0.0 0:00 1 khubd

The above example output indicates there are 3 zombie processes but there are no resource
hogging processes. The Zombie processes should be identified to see if there is any cause for
action.


Use ps auxw | more to examine the value in the START column of the init process, then check the
START column of the cpd, fwd and vpnd processes and other daemons to see if they have restarted since the
last boot. Identify any Zombie processes.

Example output:

[Expert@Zulu]# ps auxw | more


USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1524 512 ? S Jun13 0:17 init
root 731 0.0 0.0 1524 476 ? S Jun13 0:00 klogd -x -c 1
root 1174 0.0 0.0 3040 1348 ? S Jun13 0:00 /usr/sbin/sshd -4
root 1212 0.0 0.0 1572 620 ? S Jun13 0:00 crond
root 1265 0.0 0.0 2724 904 ? S Jun13 0:00 /bin/sh /opt/spwm/bin/cpwmd_wd
root 1269 0.0 0.1 34412 7348 ? S Jun13 0:18 cpwmd -D -app SPLATWebUI
root 1389 0.0 0.1 7948 4608 ? S Jun13 0:00 /opt/CPshrd-R65/bin/cprid
root 1402 0.0 0.0 9120 3908 ? S Jun13 2:30 /opt/CPshrd-R65/bin/cpwd
root 1416 0.2 4.9 331348 204012 ? S Jun13 88:42 cpd
root 1526 7.3 2.3 422392 97280 ? S Jun13 2590:42 fwd
root 1578 0.0 1.6 220252 66864 ? S Jun13 0:42 in.asessiond 0
root 1579 0.0 1.6 220220 66800 ? S Jun13 0:43 in.aufpd 0
root 1580 0.1 1.7 240988 69844 ? S Jun13 57:51 vpnd 0
root 1586 0.2 0.1 11508 6172 ? S Jun13 95:09 dtlsd 0
root 1680 0.0 2.0 273760 82716 ? S Jun13 15:20 rtmd

No daemons in the ps auxw output have restarted.

A daemon process that has restarted does not necessarily indicate a fault, because somebody may
have restarted it, for example by performing cpstop;cpstart. Normally the cause of a process restart
can be determined by looking at the /var/log/messages file or by examining the daemon's error log file
(cpd.elg, fwd.elg, vpnd.elg etc).

In the above example of top output there were 3 Zombie processes. Zombie processes do
not consume resources but should not be present. Check the process list to identify the Zombie
(STAT: Z) processes and determine if action is required.

[Expert@Zulu]# ps auxw | more


USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18374 0.5 0.0 4680 1932 ttyp0 S 09:46 0:00 cpinfo -n -z -o BCCF-CWH-EXT.cpinfo
root 18399 0.0 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]
root 18403 0.2 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]
root 18413 0.4 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]

The process cpprod_util was called by a process used by CPinfo to gather Ethernet stats. The Zombie
process is also marked defunct, which means the same as Zombie. A defunct or Zombie process is a
process that has finished but whose entry stays in the process list until its parent, which is still
alive, collects its exit status. After the completion and termination of the parent process these
Zombie processes should no longer be shown in the process list. If the Zombie processes are still
there after completion of the CPinfo, killing the parent process will be required to remove them
from the process list.

Sometimes Zombie processes are the result of an error in the daemon coding. For example if a
Zombie vpnd process is seen there is a hotfix for it, refer to:
sk33941: "Zombie" vpnd process
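
Zombie processes can also be listed directly from saved ps auxw output by filtering on the STAT column. The following Python sketch assumes the standard eleven-column ps auxw layout shown in the examples above; the function name is hypothetical.

```python
# Sketch: list Zombie (STAT starting with Z) processes from `ps auxw` text.
# Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

SAMPLE_PS = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18374 0.5 0.0 4680 1932 ttyp0 S 09:46 0:00 cpinfo -n -z -o BCCF-CWH-EXT.cpinfo
root 18399 0.0 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]
root 18403 0.2 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]
root 18413 0.4 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util <defunct>]
"""


def zombie_processes(ps_output):
    """Return a list of (pid, command) tuples for Zombie processes."""
    zombies = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        if fields[7].startswith("Z"):
            zombies.append((int(fields[1]), fields[10]))
    return zombies


if __name__ == "__main__":
    for pid, command in zombie_processes(SAMPLE_PS):
        print(pid, command)
```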


Capacity Optimization
The maximum number of concurrent connections that a firewall can handle is configured in the Capacity
Optimization section of the firewall or cluster object. It is recommended under normal circumstances to use
the automatic hash table size and memory pool configuration when increasing or decreasing the number of
maximum concurrent connections (default 25,000).

To check what value the maximum number of concurrent connections has been set to either check the
setting in the GUI firewall/cluster object or run the following command on the firewall:
fw tab -t connections | grep limit

Example output:

[Expert@Zulu]# fw tab -t connections | grep limit
dynamic, id 8158, attributes: keep, sync, aggressive aging, expires 25,
refresh, limit 100000, hashsize 534288, kbuf 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31, free function c0b98510 0, post sync handler c0b9a370

The number (100000) directly after limit is the maximum value as set in the Capacity Optimization
page on the firewall or cluster object (GUI).

To check the number of concurrent connections (#VALS) and the peak value (#PEAK) use the following
command on the firewall:
fw tab -t connections -s

Example output:

[Expert@Zulu]# fw tab -t connections -s
HOST NAME ID #VALS #PEAK #SLINKS
localhost connections 8158 23055 77921 29141
[Expert@Zulu]#

The values that we are interested in are the limit and peak values. Ensure that there is about 15-20%
headroom before Aggressive Ageing is activated to ensure there is adequate spare capacity in the
connections table to cope with an increase in connections. If necessary, change the value in the capacity
optimization section on the firewall object and push the policy to make it effective. Greatly over-provisioning
the maximum concurrent connections is not recommended as it can lead to inefficient use of memory.

In the above example, a maximum of 100,000 concurrent connections has been set in the
Capacity Optimization section for the firewall and the peak number of connections (#PEAK) was
77,921 over the last 124 days (uptime).

The headroom above the #PEAK is set too low because the Aggressive Ageing default threshold of 80% will
be activated at 80,000. Increase the concurrent connections limit to around 120,000 connections to give
between 15-20% head-room before Aggressive Ageing becomes active.
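
The arithmetic behind this recommendation can be written out as follows. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that headroom is measured as the gap between #PEAK and the Aggressive Ageing activation point, expressed as a percentage of that activation point.

```python
# Sketch: headroom between the peak connection count and the point at
# which Aggressive Ageing activates (default threshold 80% of the limit).
# Assumption: headroom = (activation point - peak) / activation point.


def ageing_activation_point(limit, threshold_pct=80):
    """Return the connection count at which Aggressive Ageing activates."""
    return limit * threshold_pct / 100


def headroom_percent(limit, peak, threshold_pct=80):
    """Return the headroom above the peak as a percentage."""
    activation = ageing_activation_point(limit, threshold_pct)
    return 100.0 * (activation - peak) / activation


if __name__ == "__main__":
    peak = 77921  # #PEAK from the example output
    for limit in (100000, 120000):
        print(
            f"limit {limit}: activates at "
            f"{ageing_activation_point(limit):.0f}, "
            f"headroom {headroom_percent(limit, peak):.1f}%"
        )
```

Under this definition a limit of 100,000 leaves only about 2.6% headroom, while a limit of 120,000 moves the activation point to 96,000 and leaves roughly 19%.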

If NAT is performed on the module check the fwx_cache table using the command:
fw tab -t fwx_cache -s

Example output:

[Expert@Zulu]# fw tab -t fwx_cache -s
HOST NAME ID #VALS #PEAK #SLINKS
localhost fwx_cache 8116 10000 10000 0
[Expert@Zulu]#

In the above example, the value of #PEAK is equal to 10,000, which indicates that the NAT cache
table (default 10,000) was full at some time. (#VALS equal to 10,000 indicates that the NAT cache
table is still full.)

For improved NAT cache performance, either increase the size of the NAT cache or decrease the time
that entries are held in the table. For further information see: sk21834: How to modify the
values of the properties related to the NAT cache table
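
The NAT cache check can be scripted along the following lines. This is a minimal sketch: the function name nat_cache_status is hypothetical, it reads the command output on standard input, and it assumes the column layout shown in the example above (#VALS in the fourth column, #PEAK in the fifth).

```shell
# nat_cache_status [SIZE]   (hypothetical helper)
# Reads `fw tab -t fwx_cache -s` output on stdin and compares #VALS and
# #PEAK with the configured cache size (default 10000, as in the text).
nat_cache_status() {
    awk -v size="${1:-10000}" '$2 == "fwx_cache" {
        if ($4 + 0 >= size + 0)      print "full now"
        else if ($5 + 0 >= size + 0) print "was full"
        else                         print "ok"
    }'
}
```

Usage on the firewall: fw tab -t fwx_cache -s | nat_cache_status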

Performing a Gaia and SecurePlatform Firewall Health Check Page 17


Section 2 Firewall Application Checks:

ClusterXL and State Synchronization


The health of ClusterXL can be examined using a number of different commands:

cphaprob -a if
cphaprob state
cphaprob list
cpstat ha -f all | more
fw ctl pstat

Use the cphaprob -a if command on the cluster members to check which interfaces have been
configured for state synchronization and verify that the sync mode is consistent across the cluster members:

Example output:

[Expert@Zulu]# cphaprob -a if
eth1c0 non sync(non secured)
eth2c0 non sync(non secured)
eth3c0 non sync(non secured)
eth4c0 sync(secured), multicast

Virtual cluster interfaces: 3

eth1c0 192.168.1.1
eth2c0 192.168.2.1
eth3c0 10.1.1.1
[Expert@Zulu]#

[Expert@Shaka]# cphaprob -a if
eth1c0 non sync(non secured)
eth2c0 non sync(non secured)
eth3c0 non sync(non secured)
eth4c0 sync(secured), broadcast

Virtual cluster interfaces: 3

eth1c0 192.168.1.1
eth2c0 192.168.2.1
eth3c0 10.1.1.1
[Expert@Shaka]#

In the above example, interface eth4c0 has been configured on both cluster members for state
sync, but the sync mode is inconsistent: one member is using multicast mode and the other
broadcast mode. Ensure the cluster members use the same mode. (The default mode is multicast.)

The following document explains how to change between broadcast and multicast
mode:
sk20576: How to set ClusterXL Control Protocol (CCP) in broadcast mode in ClusterXL
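
The sync mode comparison can be sketched as follows. The function name sync_mode is hypothetical, and the parsing assumes the output format shown in the examples above, where the sync interface line reads "sync(secured), <mode>".

```shell
# sync_mode   (hypothetical helper)
# Reads `cphaprob -a if` output on stdin and prints the name and CCP mode
# of each interface configured for state synchronization.
sync_mode() {
    awk '$2 ~ /^sync\(/ { print $1, $NF }'
}
```

Run cphaprob -a if | sync_mode on each cluster member and compare the results. For the two members in the example, the helper prints "eth4c0 multicast" on one and "eth4c0 broadcast" on the other, which exposes the mismatch.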


Use the cphaprob state command to check if state sync is up and running. The local and remote
state synchronization IP addresses should be displayed and their state should be shown as Active on
the HA Master and Standby on the HA Backup. In a load-sharing cluster the state should be shown as
Active on both the local and remote firewalls:

Example output - HA:

[Expert@Zulu]# cphaprob state

Cluster Mode:   New High Availability (Active Up)

Number     Unique Address  Assigned Load   State

1 (local)  1.1.1.1         100%            Active
2          1.1.1.2         0%              Standby

[Expert@Zulu]#

In a HA cluster configuration (above), one member should be Active and the other Standby.

Example output - Load-Sharing:

[Expert@Dingaan]# cphaprob state

Cluster Mode:   New High Availability (Active Up)

Number     Unique Address  Assigned Load   State

1 (local)  1.1.1.3         50%             Active
2          1.1.1.4         50%             Active

[Expert@Dingaan]#

In a load-sharing cluster configuration (above), both members should be shown as Active.

Example output - HA or Load-Sharing:

[Expert@Zulu]# cphaprob state

Cluster Mode:   New High Availability (Active Up)

Number     Unique Address  Assigned Load   State

1 (local)  1.1.1.1         100%            Active

[Expert@Zulu]#

Remote cluster partner is missing!

If the remote partner is not shown, it is usually due to one of the following:

There is no network connectivity between the members of the cluster on the state sync network
The partner does not have state synchronization enabled
One partner is using broadcast mode and the other is using multicast mode
One of the monitored processes has an issue, such as no policy loaded
The partner firewall is down


Example output - HA or Load-Sharing:

[Expert@Zulu]# cphaprob state

Cluster Mode:   New High Availability (Active Up)

Number     Unique Address  Assigned Load   State

1 (local)  1.1.1.1         100%            Active
2          1.1.1.2         0%              Ready

[Expert@Zulu]#

Partner is in the Ready state. If one of the partners is in the Ready state it indicates that
there is an issue with state synchronization.

The Ready state is normally caused by another member of the cluster running a higher version of
code or HFA, for example, as would happen during an upgrade. This state is also seen when
CoreXL has been configured to use a different number of cores on the individual cluster members.
For further information see:
sk42096: Cluster member with CoreXL is in 'Ready' state

The Ready state can also occur if a cluster member receives state synchronization traffic from a
different cluster that is using the same MAC magic number and the other cluster is running a higher
version of code. For further information see:
sk36913: Connecting several clusters on the same network

Example output - HA or Load-Sharing:

[Expert@Zulu]# cphaprob state

Cluster Mode:   New High Availability (Active Up)

Number     Unique Address  Assigned Load   State

1 (local)  1.1.1.1         100%            Active
2          1.1.1.2         0%              Down

[Expert@Zulu]#

A remote cluster member in the Down state indicates that there is either a problem on the
remote member or the state synchronization network between the cluster members is broken.
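
The cphaprob state cases described above (missing partner, Ready, Down) can be detected with a short script. This is a sketch under the assumption that member rows begin with the member number and end with the state, as in the example outputs; the function name cluster_state_check is hypothetical.

```shell
# cluster_state_check   (hypothetical helper)
# Reads `cphaprob state` output on stdin. Prints one line for each member
# that is neither Active nor Standby, and "partner missing" when fewer
# than two members are listed. Prints nothing for a healthy pair.
cluster_state_check() {
    awk '
        $1 ~ /^[0-9]+$/ {
            members++
            if ($NF != "Active" && $NF != "Standby")
                printf "member %s is %s\n", $1, $NF
        }
        END { if (members < 2) print "partner missing" }
    '
}
```

Usage on a cluster member: cphaprob state | cluster_state_check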

To investigate why a member shows itself to be locally Down, use the cpstat ha -f all | more
command on the firewall that shows Down. This command displays the Problem Notification Table and
the state of health of the monitored processes:

Example output (truncated):

[Expert@Zulu]# cpstat ha -f all | more

Problem Notification table
-------------------------------------------------
|Name           |Status |Priority|Verified|Descr|
-------------------------------------------------
|Synchronization|OK     |       0|    3383|     |
|Filter         |OK     |       0|    3383|     |
|cphad          |OK     |       0|       0|     |
|fwd            |OK     |       0|       0|     |
-------------------------------------------------
All monitored processes have the OK status.


Example output (truncated):

[Expert@Shaka]# cpstat ha -f all | more

Problem Notification table
-------------------------------------------------
|Name           |Status |Priority|Verified|Descr|
-------------------------------------------------
|Synchronization|problem|       0|    3383|     |
|Filter         |problem|       0|    3383|     |
|cphad          |OK     |       0|       0|     |
|fwd            |OK     |       0|       0|     |
-------------------------------------------------

State synchronization is in a problem state because the policy is unloaded on this cluster
member. Installing the policy will fix this issue.
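
Scanning the Problem Notification table for entries that are not OK can be automated. The sketch below assumes the pipe-delimited table layout shown in the examples; the function name pnote_problems is hypothetical.

```shell
# pnote_problems   (hypothetical helper)
# Reads `cpstat ha -f all` output on stdin and prints "name: status" for
# every row of the Problem Notification table whose status is not OK.
pnote_problems() {
    awk -F'|' 'NF >= 6 && $2 !~ /^Name/ {
        name = $2; status = $3
        gsub(/ /, "", name); gsub(/ /, "", status)
        if (status != "OK") print name ": " status
    }'
}
```

Usage on the firewall: cpstat ha -f all | pnote_problems. For the second example above, the helper lists Synchronization and Filter as being in the problem state.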

Alternatively, the cphaprob list command displays the same information plus some additional details:

Example output:

[Expert@Zulu]# cphaprob list


Registered Devices:

Device Name: Synchronization


Registration number: 0
Timeout: none
Current state: OK
Time since last report: 12139.6 sec

Device Name: Filter


Registration number: 1
Timeout: none
Current state: OK
Time since last report: 12124.5 sec

Device Name: cphad


Registration number: 2
Timeout: 5 sec
Current state: OK
Time since last report: 0.6 sec

Device Name: fwd


Registration number: 3
Timeout: 5 sec
Current state: OK
Time since last report: 0.6 sec

All monitored processes are shown as OK.

Assuming that state synchronization on the cluster is healthy, use the following command to check if the
state tables are synchronized:

fw tab -t connections -s

Run the command on both cluster members at the same time and compare the values of #VALS. If state
synchronization is working, the values on both firewalls should be similar, unless delayed notification
is used heavily.


Example output:
[Expert@Zulu]# fw tab -t connections -s
HOST       NAME         ID    #VALS  #PEAK  #SLINKS
localhost  connections  8158  3222   38026  9820
[Expert@Zulu]#

[Expert@Shaka]# fw tab -t connections -s
HOST       NAME         ID    #VALS  #PEAK  #SLINKS
localhost  connections  8158  3187   38026  9808
[Expert@Shaka]#

The #PEAK may be different depending on the uptime and when the last peak number of
connections occurred.

The #VALS on a HA pair should always be similar.
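
To put a number on "similar", the difference between the two #VALS readings can be expressed as a percentage. This is a sketch only: the function name vals_diff_pct is hypothetical, and the text does not define an acceptable threshold, so the helper reports the figure and leaves the judgement to the reader.

```shell
# vals_diff_pct VALS_A VALS_B   (hypothetical helper)
# Prints the difference between two #VALS readings as a percentage of
# the larger reading.
vals_diff_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        a += 0; b += 0
        hi = (a > b) ? a : b
        lo = (a > b) ? b : a
        if (hi == 0) { print "0.0"; exit }
        printf "%.1f\n", (hi - lo) / hi * 100
    }'
}
```

For the example above, vals_diff_pct 3222 3187 prints 1.1. The #VALS value itself can be extracted on each member with: fw tab -t connections -s | awk '$2 == "connections" { print $4 }'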

Examine the output of the sync section of fw ctl pstat.

Example output:

Sync:
Version: new
Status: Able to Send/Receive sync packets
Sync packets sent:
total : 13880231, retransmitted : 5, retrans reqs : 524, acks : 70
Sync packets received:
total : 692409645, were queued : 720, dropped by net : 517
retrans reqs : 5, received 43019 acks
retrans reqs for illegal seq : 0
dropped updates as a result of sync overload: 0
Callback statistics: handled 42940 cb, average delay : 1, max delay : 4

If the dropped by net counter has incremented then some sync packets have been lost and the
problem needs to be investigated to find the cause.
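
Because the counter is cumulative, one way to tell whether it is still incrementing is to sample it twice and compare the readings. The extraction step can be sketched as below; the function name dropped_by_net is hypothetical, and the pattern assumes the "dropped by net : N" wording shown in the example output.

```shell
# dropped_by_net   (hypothetical helper)
# Reads `fw ctl pstat` output on stdin and prints the value of the
# "dropped by net" counter from the Sync section.
dropped_by_net() {
    sed -n 's/.*dropped by net *: *\([0-9][0-9]*\).*/\1/p'
}
```

Usage: run fw ctl pstat | dropped_by_net, wait for a period of normal traffic, then run it again. A higher second value means sync packets are still being lost.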

For further information please refer to:


sk34476: Explanation of Sync section in the output of fw ctl pstat command


SecureXL
For optimum gateway performance, SecureXL must be enabled, the enforced SmartDefense and
Web-Intelligence (or IPS) options must not interfere with SecureXL, and the extent of templating
should be maximized by careful rulebase ordering.

For further information, refer to:


sk42401: Factors that adversely affect performance in SecureXL

The following command can be used to determine that SecureXL is turned on and the creation of templates
has not been disabled:

fwaccel stat

Example output showing SecureXL turned on and templating enabled:

[Expert@Zulu]# fwaccel stat


Accelerator Status : on
Accept Templates : on
Accelerator Features : Accounting, NAT, Cryptography, Routing,
HasClock, Templates, VirtualDefrag, GenerateIcmp,
IdleDetection, Sequencing, TcpStateDetect,
AutoExpire, DelayedNotif, McastRouting,
WireMode
Cryptography Features : Tunnel, UDPEncapsulation, MD5, SHA1, NULL,
3DES, DES, AES-128, AES-256, ESP, LinkSelection,
DynamicVPN, NatTraversal, EncRouting
[Expert@Zulu]#

If SecureXL is disabled it can be turned on from cpconfig.

Note: SecureXL is incompatible with FloodGate and will be disabled if FloodGate is active.

The following command can be used to examine the SecureXL statistics to understand how well
SecureXL is configured and performing:
fwaccel stats

Examine the output of fwaccel stats:

Check that templates are being created. This number rises and falls as templates are created and
expire.

Examine the ratio of F2F packets to accelerated packets. For best performance the firewall
should be accelerating the majority of the packets; the number of packets being forwarded to the
firewall (F2F) should be minimal.
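
The ratio can be worked out from the two packet counters reported by fwaccel stats. The sketch below takes the two counts as arguments rather than parsing the command output; the function name f2f_share is hypothetical.

```shell
# f2f_share ACCEL_PACKETS F2F_PACKETS   (hypothetical helper)
# Prints the share of packets forwarded to the firewall (F2F) as a
# percentage of all packets counted (accelerated plus F2F).
f2f_share() {
    awk -v accel="$1" -v f2f="$2" 'BEGIN {
        total = accel + f2f
        if (total == 0) { print "n/a"; exit }
        printf "%.1f\n", f2f / total * 100
    }'
}
```

With illustrative (not measured) figures, f2f_share 950000 50000 prints 5.0, meaning 5% of the packets were handled by the firewall path.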


[Figure: example fwaccel stats output. Templates are being formed and the number of F2F packets is
small compared with the number of accelerated packets.]

Aggressive Ageing
Aggressive Ageing helps manage the connections table capacity and memory consumption of the firewall to
increase durability and stability, allowing the gateway machine to handle large amounts of unexpected
traffic, especially during a Denial of Service attack.

Aggressive Ageing uses short timeouts called aggressive timeouts. When a connection is idle for more than
its aggressive timeout, it is marked as "eligible for deletion". When the connections table or memory
consumption reaches a user-defined threshold (high-water mark), Aggressive Ageing begins to delete
"eligible for deletion" connections until memory consumption or connections table usage falls back to the
desired level.

The user defined thresholds are set in the GUI for the specific protection enforced by the firewall
(SmartDefense > Network Security > Denial of Service > Aggressive Ageing).


To check the state of Aggressive Ageing on the firewall use the fw ctl pstat command:

Example output:

[Expert@Zulu]# fw ctl pstat | grep Aggressive


Aggressive Ageing is not active
[Expert@Zulu]#

The above output indicates that Aggressive Ageing has been set in SmartDefense to Protect but the
thresholds have not been reached to make it aggressively close connections that are eligible for deletion.

If Aggressive Ageing has been set in SmartDefense to Inactive, the output will say that
Aggressive Ageing is disabled:

[Expert@Zulu]# fw ctl pstat | grep Aggressive


Aggressive Ageing is disabled
[Expert@Zulu]#

If Aggressive Ageing is in Detect mode, the output will say it is in monitor only:

[Expert@Zulu]# fw ctl pstat | grep Aggressive


Aggressive Ageing is in monitor only
[Expert@Zulu]#
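
The three status lines described above can be mapped back to the SmartDefense setting with a small helper. This is a sketch: the function name ageing_mode is hypothetical, and it covers only the three messages shown in this section, so any other message is reported as unknown.

```shell
# ageing_mode STATUS_LINE   (hypothetical helper)
# Maps the Aggressive Ageing line from `fw ctl pstat` to the SmartDefense
# setting described in this section.
ageing_mode() {
    case "$1" in
        *"is not active"*)      echo "Protect (thresholds not reached)" ;;
        *"is disabled"*)        echo "Inactive" ;;
        *"is in monitor only"*) echo "Detect" ;;
        *)                      echo "unknown" ;;
    esac
}
```

Usage on the firewall: ageing_mode "$(fw ctl pstat | grep Aggressive)"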

There were some issues with the Aggressive Ageing mechanism which are fixed in R65 HFA_50:

Improved SecureXL notifications to the firewall resolve a connectivity issue that occurs when the Sequence
Verifier is enabled together with the Aggressive Aging mechanism.

Implementation: An immediate workaround is to disable either the Sequence Verifier or the Aggressive
Aging mechanism.

HFA Patching
Use the fwm ver and fw ver -k commands to inspect the patching on the management station and the
firewall modules.

Check that the HFA patching on the module is the same version (HFA_50) or lower than the patching on the
Provider-1 management station. The firewall module must never be patched with a higher version than the
management station.

Ensure patching on cluster members is identical.

Example output:
Provider-1 Management:-

[Expert@Manager]# fwm ver
This is Check Point SmartCenter Server NGX (R65) HFA_50, Hotfix 650 - Build 011
Installed Plug-ins: Connectra NGX R62CM
[Expert@Manager]#


Cluster:-

[Expert@Zulu]# fw ver -k
This is Check Point VPN-1(TM) & FireWall-1(R) NGX (R65) HFA_40, Hotfix
640 - Build 091
kernel: NGX (R65) HFA_40, Hotfix 640 - Build 091
[Expert@Zulu]#

[Expert@Shaka]# fw ver -k
This is Check Point VPN-1(TM) & FireWall-1(R) NGX (R65) HFA_40, Hotfix
640 - Build 091
kernel: NGX (R65) HFA_40, Hotfix 640 - Build 091
[Expert@Shaka]#

Versions on the clustered firewalls (HFA_40) are identical and the versions are not above the
Provider-1 version (HFA_50).
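
The HFA comparison can be scripted as shown below. This is a sketch that assumes the HFA level appears in the version output as HFA_ followed by a number, as in the examples; the function names hfa_level and hfa_ok are hypothetical.

```shell
# hfa_level   (hypothetical helper)
# Reads `fwm ver` or `fw ver -k` output on stdin and prints the number
# from the first HFA_<number> string found.
hfa_level() {
    sed -n 's/.*HFA_\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# hfa_ok MODULE_HFA MGMT_HFA   (hypothetical helper)
# Succeeds when the module is not patched above the management station.
hfa_ok() {
    [ "$1" -le "$2" ]
}
```

Usage: collect the level on each system, for example fw ver -k | hfa_level on the module and fwm ver | hfa_level on the management station, then run hfa_ok 40 50 with the two results. Comparing the levels of the cluster members with each other confirms that their patching is identical.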

Although the patching in the above example is consistent, it is out of date. Check Point always recommends
applying the latest HFA and Security Hotfixes on the SmartCenter and firewall modules.

The latest HFAs and Security Hotfix release notes are available on the Check Point website:

http://www.checkpoint.com/downloads/latest/hfa/index.html

CPinfo Package:

For troubleshooting purposes, Check Point TAC will require a CPinfo taken from the firewall and
SmartCenter Server or CMA. Ensure the CPinfo build number is higher than 911000023 so that the full set of
diagnostics from the appliance can be gathered successfully.

CPinfo build 911000023 often hangs while gathering the firewall's connections tables and produces
truncated output, so it should be replaced with the latest version.

The version installed on the appliance can be determined by running the following command:
cpvinfo /opt/CPinfo-10/bin/cpinfo |grep Build

Example output:

[Expert@Zulu]# cpvinfo /opt/CPinfo-10/bin/cpinfo |grep Build


Build number = 911000023
[Expert@Zulu]#

The above version is problematic and should be upgraded.
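
The build number check can be sketched as follows. The function name cpinfo_build_ok is hypothetical, and the parsing assumes the "Build number = N" line format shown in the example output.

```shell
# cpinfo_build_ok   (hypothetical helper)
# Reads the cpvinfo output on stdin and succeeds only when the build
# number is higher than 911000023, the problematic build named above.
cpinfo_build_ok() {
    build=$(sed -n 's/.*Build number *= *\([0-9][0-9]*\).*/\1/p' | head -n 1)
    [ -n "$build" ] && [ "$build" -gt 911000023 ]
}
```

Usage on the appliance: cpvinfo /opt/CPinfo-10/bin/cpinfo | cpinfo_build_ok || echo "upgrade CPinfo"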

The most up to date version of CPinfo can be downloaded using the following link:
sk30567: The CPinfo utility


Completing the Procedure


If all items in the health check have been checked and found to be Good, then the general
health of the firewall is considered to be good.

Items identified as Attention should be investigated and addressed as soon as possible.

Items that have been identified as Serious are serious issues that require immediate attention.

Verifying
After fixing any problems that were identified as Serious or Attention, the health check
should be repeated to confirm that all the health checks are now good.

Completing the Procedure Page 27
