Monitoring System Technical Proposal

The document is a technical proposal for a monitoring system prepared by NEX4 ICT Solutions for Shwe Bank, detailing hardware requirements and a comprehensive scope of work for implementing a Zabbix monitoring solution. It outlines steps for planning, installation, configuration, data collection, alerting, reporting, testing, and training, along with specific metrics and triggers for various devices including Cisco switches, Palo Alto firewalls, and Check Point firewalls. The proposal emphasizes the importance of monitoring network performance, security events, and resource utilization across the bank's infrastructure.

Uploaded by

waiaung.wa92

Monitoring System

Technical Proposal
(Shwe Bank)

Prepared by NEX4 ICT Solutions


Hardware List
No.  Device                        Manufacturer  Part Number            Quantity
1    Access switch                 Cisco         WS-C2960X-48TD-L       12
2    Access switch                 Cisco         WS-C2960X-24TD-L       4
3    MGMT switch                   Cisco         WS-C2960X-24PD-L       2
4    WAN switch                    Cisco         C9200L-48T-4X          1
5    Core Switch                   Cisco         N9K C93180YC-EX        2
6    Core Switch                   Cisco         N7K C7010              2
7    Core Switch                   Cisco         N3K-C3172TQ-1          4
8    CBM ISR Router                Cisco         ISR4431/K9             2
9    CBM switch                    Cisco         C9200L-24              2
10   Edge Firewall                 Palo Alto     PAN-PA-3410            2
11   Core Firewall                 Check Point   CPAP-SG6700-PLUS-SNBT  2
12   Web Application Firewall      F5            F5 R4800 LTM + AWAF    2
13   Hyperflex Servers             Cisco         HXAF240C-M6SN          4
14   Fabric Interconnect switches  Cisco         UCS-FI-6454-U          2
If additional devices or services need to be added to the monitoring tool, NEX4 will provide these additional settings,
limited to a maximum of 5 devices or services per month.
General Scope of Work

Step 1: Planning and Design


➢ Assessment: Conduct a detailed assessment of the current network and server infrastructure to identify specific
monitoring requirements for each component.

➢ Requirement Gathering: Identify the key metrics, thresholds, and events that need to be monitored for each
device type (e.g., CPU, memory, network traffic, security events).

Step 2: Installation and Configuration


➢ Zabbix Server Setup: Install and configure the Zabbix server, database, and web interface.

➢ Agent/Proxy Deployment: Deploy and configure Zabbix agents or proxies as necessary on VMware ESXi hosts and
Cisco HyperFlex Servers.

➢ SNMP Configuration: Set up SNMP (Simple Network Management Protocol) for monitoring Cisco Catalyst
Switches, Nexus Switches, and Firewalls (Palo Alto, Checkpoint).
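As an illustration of the SNMP host setup described above, the following minimal sketch builds a JSON-RPC request body for the Zabbix API method `host.create` with an SNMPv2c interface. The host name, IP address, group ID, and template ID are hypothetical placeholders, and the field layout assumes a Zabbix version that uses the interface `details` object (5.0 or later). Authentication and the HTTP call itself are omitted because they differ between Zabbix versions.

```python
import json


def build_snmp_host_request(hostname, ip, group_id, template_id, request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix API method host.create."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [
                {
                    "type": 2,  # 2 = SNMP interface
                    "main": 1,  # default interface of this type
                    "useip": 1,  # connect by IP rather than DNS name
                    "ip": ip,
                    "dns": "",
                    "port": "161",
                    "details": {
                        "version": 2,  # SNMPv2c
                        "community": "{$SNMP_COMMUNITY}",  # user macro
                    },
                }
            ],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
        },
        "id": request_id,
    }


# Hypothetical values for illustration only.
request = build_snmp_host_request("core-sw-01", "192.0.2.10", "2", "10218")
print(json.dumps(request, indent=2))
```

Keeping the community string in a user macro rather than in the payload means it can be changed per host or per template without recreating the host.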

Step 3: Template Creation


➢ Cisco Catalyst and Nexus Switches: Monitor CPU usage, memory, interface status, traffic, and error rates.
➢ Fabric Interconnect Switches: Monitor network performance, interface statuses, and connectivity issues.
➢ Palo Alto Firewall: Monitor threat logs, session counts, interface traffic, and system resources.
➢ Checkpoint Firewall: Monitor security events, VPN statuses, CPU, memory, and interface traffic.
➢ VMware ESXi: Monitor host performance, VM statuses, CPU, memory, datastore usage, and network metrics.
➢ Cisco HyperFlex servers: Monitor cluster health, storage performance, network performance, and hardware status.

Step 4: Data Collection and Monitoring


➢ Metric Collection: Configure data collection intervals and retention policies specific to the critical metrics of each
device type.

➢ Network Monitoring: Set up SNMP traps and polling for real-time monitoring of network devices (Cisco Switches,
Firewalls).

➢ Virtualization Monitoring: Configure VMware API integration to monitor ESXi hosts and VMs.

➢ Security Event Monitoring: Configure monitoring for security events and logs from Palo Alto and Checkpoint
Firewalls.
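Collection intervals and retention periods directly determine database size. The sketch below applies the history sizing rule of thumb from the Zabbix documentation (days x values per second x 86400 x bytes per value). The figure of about 90 bytes per numeric value is an approximation that depends on the database engine, and the item count, interval, and retention used in the example are hypothetical, not figures from this proposal.

```python
SECONDS_PER_DAY = 24 * 3600


def history_bytes(items, interval_s, retention_days, bytes_per_value=90):
    """Estimate history storage for `items` polled every `interval_s` seconds."""
    values_per_second = items / interval_s
    return retention_days * values_per_second * SECONDS_PER_DAY * bytes_per_value


# Hypothetical sizing input: 3000 items, 60-second interval, 90 days of history.
estimate = history_bytes(items=3000, interval_s=60, retention_days=90)
print(f"Estimated history size: {estimate / 1024**3:.1f} GiB")
```

Halving the polling interval doubles the estimate, so short intervals are best reserved for the metrics that drive triggers.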

Step 5: Alerts and Notifications


➢ Thresholds Configuration: Set up specific thresholds for key metrics (e.g., high CPU usage, memory consumption,
network errors) and define corresponding triggers.

➢ Notification Setup: Establish notification channels (email, SMS, etc.) for different severity levels, ensuring timely
alerts to the relevant teams.

➢ Escalation Procedures: Define and configure escalation procedures for unresolved critical alerts.
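The threshold and escalation logic above can be sketched as follows. This is not Zabbix trigger syntax; it only illustrates the idea of averaging recent samples before comparing against a threshold, and of stepping an unresolved alert through escalation levels. The threshold values, severity names, step timings, and recipients are all assumptions to be agreed during requirement gathering.

```python
from statistics import mean

# Hypothetical escalation steps: (minutes unresolved, who is notified).
ESCALATION_STEPS = [(0, "NOC on-call"), (30, "team lead"), (60, "manager")]


def classify(samples, warning=80.0, high=90.0):
    """Classify a metric by the average of its recent samples (in percent)."""
    avg = mean(samples)
    if avg > high:
        return "High"
    if avg > warning:
        return "Warning"
    return "OK"


def escalation_target(minutes_unresolved):
    """Return the last escalation step whose delay has already elapsed."""
    target = ESCALATION_STEPS[0][1]
    for after_min, who in ESCALATION_STEPS:
        if minutes_unresolved >= after_min:
            target = who
    return target


print(classify([95, 92, 91]))   # average above the high threshold
print(escalation_target(45))    # between the 30 and 60 minute steps
```

Averaging over a window rather than alerting on a single sample reduces flapping when a metric briefly spikes.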

Step 6: Reporting and Dashboards


➢ Custom Dashboards: Create real-time dashboards for each device type, providing a centralized view of
performance and health metrics. The plan is 10 dashboards per device; the detailed scope will be discussed during the project.

➢ Scheduled Reports: Set up periodic reporting for trend analysis, capacity planning, and compliance with Service
Level Agreements (SLAs).

➢ Historical Analysis: Implement tools for historical data analysis to identify patterns and optimize resource usage.

Step 7: Testing and Validation


➢ System Testing: Conduct rigorous testing of the Zabbix setup to ensure accurate monitoring, data collection, and
alerting.

➢ Validation: Verify that all critical infrastructure components are being monitored according to the defined
requirements.
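One simple way to support the validation step is to compare the agreed device inventory against the hosts actually present in the monitoring tool. The sketch below does this with set operations; the host names are hypothetical, and in practice the monitored list would be exported from Zabbix.

```python
def coverage_report(expected, monitored):
    """Return hosts missing from monitoring and hosts monitored but not expected."""
    expected_set, monitored_set = set(expected), set(monitored)
    return {
        "missing": sorted(expected_set - monitored_set),
        "unexpected": sorted(monitored_set - expected_set),
    }


# Hypothetical host names for illustration only.
inventory = ["core-sw-01", "core-sw-02", "edge-fw-01", "edge-fw-02"]
in_zabbix = ["core-sw-01", "edge-fw-01", "edge-fw-02", "lab-sw-99"]

report = coverage_report(inventory, in_zabbix)
print(report)
```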

Step 8: Training and Documentation


➢ Training Sessions: Provide training for network and system administrators on using Zabbix, managing alerts, and
interpreting data.

➢ Documentation: Develop detailed documentation covering the configuration, monitoring templates, and
troubleshooting procedures for each device type.
Dashboard Information (Cisco switches)
Collected Items (Cisco Switches)
Name Description
ICMP ping
Uptime (network) The time (in hundredths of a second) since the network management portion of the
system was last re-initialized.
Uptime (hardware) The amount of time since this host was last initialized. Note that this is different from
sysUpTime in the SNMPv2-MIB [RFC1907] because sysUpTime is the uptime of the
network management portion of the system.
SNMP traps (fallback) The item is used to collect all the SNMP traps unmatched by the other snmp trap items.
System contact details The textual identification of the contact person for the managed node (or: this node),
together with the contact information of this person. If no contact information is known,
the value is a zero-length string.
System description The textual description of the entity. This value should include the full name and version
identification number of the system's hardware type, software operating-system, and the
networking software.
Hardware model name MIB: ENTITY-MIB.
Hardware serial number MIB: ENTITY-MIB.
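The network uptime item above is reported in hundredths of a second (SNMP TimeTicks), so it is usually converted to seconds before it is displayed or used in a restart check. A minimal sketch of that conversion follows; the 10-minute restart window is an assumption chosen for illustration, not a value stated in this proposal.

```python
from datetime import timedelta


def timeticks_to_seconds(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to seconds."""
    return ticks / 100


def format_uptime(ticks):
    """Render TimeTicks as a human-readable duration, dropping fractions."""
    return str(timedelta(seconds=ticks // 100))


def recently_restarted(ticks, window_s=600):
    """True when uptime is shorter than the (assumed) restart window."""
    return timeticks_to_seconds(ticks) < window_s


print(format_uptime(8_640_000))      # 1 day, 0:00:00
print(recently_restarted(30_000))    # 300 seconds of uptime
```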
Trigger Events (Cisco switches)
Unavailable by ICMP ping
High ICMP ping loss
High ICMP ping response time
Device has been replaced
System name has changed
Operating system description has changed
Device has been restarted or reinitialized
No SNMP data collection

Dashboard Information (Palo Alto Firewall)
Collected Items (Palo Alto Firewall)
Name                           Description
App-ID content date            Currently installed application definition release date. If no release date is found, unknown is returned.
App-ID Version                 Currently installed application definition version. If no application definition is found, 0 is returned.
Chassis type                   Chassis type for this Palo Alto device.
Global Protect Client Version  Currently installed global-protect client package version. If package is not installed, 0.0.0 is returned.
GP active tunnels              Number of active tunnels.
GP gateway utilization         GlobalProtect Gateway utilization percentage.
GP tunnels supported           Max tunnels allowed.
HA Mode                        Current high-availability mode (disabled, active-passive, or active-active).
HA Peer State                  Current peer high-availability state.
HA State                       Current high-availability state.
HW Version                     Hardware version of the unit.
ICMP Check                     Ping to device.
PAN-OS Version                 Full software version. The first two components of the full version are the major and minor versions. The third component indicates the maintenance release number.
Processor 1 Load (mgmt)        The average, over the last minute, of the percentage of time that this processor was not idle. Implementations may approximate this one minute smoothing period if necessary.
Processor 2 Load (data)        The average, over the last minute, of the percentage of time that this processor was not idle. Implementations may approximate this one minute smoothing period if necessary.
Serial Number                  The serial number of the unit. If not available, an empty string is returned.
Session table utilization      Session table utilization percentage. Values should be between 0 and 100.
SNMP availability              SNMP availability.
System Description             A textual description of the entity. This value should include the full name and version identification of the system's hardware type, software operating-system, and networking software. It is mandatory that this only contain printable ASCII characters.
System Name                    An administratively-assigned name for this managed node. By convention, this is the node's fully-qualified domain name.
System Uptime                  The time (in hundredths of a second) since the network management portion of the system was last re-initialized. Preprocessed to seconds.
Threat Version                 Currently installed threat definition version. If no threat definition is found, 0 is returned.
Total active ICMP sessions     Total number of active ICMP sessions.
Total active sessions          Total number of active sessions.
Trigger Events (Palo Alto Firewall)
App-ID content date
App-ID Version
Chassis type
Global Protect Client Version
GP active tunnels
GP gateway utilization
GP tunnels supported
HA Mode
HW Version
ICMP Check
PAN-OS Version
Processor 1 Load (mgmt)
Processor 2 Load (data)
Serial Number
Session table utilization
SNMP availability
System Description
System Name
System Uptime
Threat Version
Total active ICMP sessions
Total active sessions (TCP/UDP)

Dashboard Information (CheckPoint Firewall)
Type               Items                        Value/Unit
FAN                Status of FAN, Speed of FAN
CPU                Number of CPUs
                   CPU Utilization              %
Memory             Total Memory                 MB/GB
                   Active Memory                MB/GB
                   Free Memory                  MB/GB
                   Used Memory                  MB/GB
                   Memory Utilization           %
Storage            Storage size                 MB/GB
                   Storage Utilization          %
Network Interface  Network Traffic received     bps
                   Network Traffic sent         bps
                   Operational status           Up/Down
Sessions           Concurrent Connection
                   Peak Concurrent Connection
Overview info      SNMP Agent Availability
                   Uptime
                   ICMP Ping
Collected Items (CheckPoint Firewall)
Name Description
Appliance product name MIB: CHECKPOINT-MIB
Appliance product name.
Appliance serial number MIB: CHECKPOINT-MIB
Appliance serial number.
Appliance manufacturer MIB: CHECKPOINT-MIB
Appliance manufacturer.
Remote Access users MIB: CHECKPOINT-MIB
Number of remote access users.
System contact details MIB: SNMPv2-MIB
Name and contact information of the contact person for the node. If not
provided, the value is a zero-length string.
System description MIB: SNMPv2-MIB
Full name and version identification of the system's hardware type,
software operating system, and networking software.
System name MIB: SNMPv2-MIB
An administratively-assigned name for the node (the node's fully-
qualified domain name). If not provided, the value is a zero-length string.
System object ID MIB: SNMPv2-MIB
The vendor's authoritative identification of the entity as part of the
vendor's SMI enterprises subtree with the prefix 1.3.6.1.4.1 (e.g., a vendor
with the identifier 1.3.6.1.4.1.4242 might assign a system object with the
OID 1.3.6.1.4.1.4242.1.1).
System uptime MIB: HOST-RESOURCES-V2-MIB
Time since the network management portion of the system was last re-
initialized.
Number of CPUs MIB: CHECKPOINT-MIB
Number of processors.
CPU utilization MIB: CHECKPOINT-MIB
CPU utilization per core in %.
Load average (1m avg) MIB: UCD-SNMP-MIB
Average number of processes being executed or waiting over the last
minute.
Load average (5m avg) MIB: UCD-SNMP-MIB
Average number of processes being executed or waiting over the last 5
minutes.
Load average (15m avg) MIB: UCD-SNMP-MIB
Average number of processes being executed or waiting over the last 15
minutes.
CPU user time MIB: CHECKPOINT-MIB
Average time the CPU has spent running user processes that are not
niced.
CPU system time MIB: CHECKPOINT-MIB
Average time the CPU has spent running the kernel and its processes.
CPU idle time MIB: CHECKPOINT-MIB
Average time the CPU has spent doing nothing.
Context switches per second MIB: UCD-SNMP-MIB
Number of context switches per second.
CPU interrupts per second MIB: CHECKPOINT-MIB
Number of interrupts processed per second.
Total memory MIB: CHECKPOINT-MIB
Total real memory in bytes. Memory used by applications.
Active memory MIB: CHECKPOINT-MIB
Active real memory (memory used by applications that is not cached to
the disk) in bytes.
Free memory MIB: CHECKPOINT-MIB
Free memory available for applications in bytes.
Used memory Used real memory in bytes, calculated from total real memory and free real
memory.
Memory utilization Memory utilization in %.


Encrypted packets per second MIB: CHECKPOINT-MIB
Number of encrypted packets per second.
Decrypted packets per second MIB: CHECKPOINT-MIB
Number of decrypted packets per second.
ICMP ping Host accessibility by ICMP.
0 - ICMP ping fails.
1 - ICMP ping successful.
ICMP loss Percentage of lost packets.
ICMP response time ICMP ping response time (in seconds).
SNMP agent availability Availability of SNMP checks on the host. The value of this item corresponds to
the availability icons in the host list.
Possible values:
0 - not available
1 - available
2 - unknown
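The calculated items and status codes in the table above can be sketched as follows. The formula used for used memory (total minus free) is an assumption; the table only says the value is calculated from total and free real memory. The status code mappings are taken directly from the item descriptions.

```python
# Status codes as listed in the item descriptions above.
ICMP_PING = {0: "ICMP ping fails", 1: "ICMP ping successful"}
SNMP_AVAILABILITY = {0: "not available", 1: "available", 2: "unknown"}


def used_memory(total_bytes, free_bytes):
    """Used real memory in bytes (assumed to be total minus free)."""
    return total_bytes - free_bytes


def memory_utilization(total_bytes, free_bytes):
    """Memory utilization in percent."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_memory(total_bytes, free_bytes) * 100 / total_bytes


GIB = 1024**3
print(memory_utilization(8 * GIB, 2 * GIB))   # 75.0
print(SNMP_AVAILABILITY[2])                   # unknown
```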
Trigger Events (CheckPoint Firewall)
Interface Link Down
Interface High Bandwidth usage
Interface High error rate
Device has been replaced
System name has changed
Device has been restarted
High CPU utilization
Load average is too high
High memory utilization
Unavailable by ICMP ping
High ICMP ping loss
High ICMP ping response time
No SNMP data collection
Disk space is critically low
Temperature is above critical threshold
Temperature is above warning threshold
Temperature is too low
Power supply is in down state
License expires soon
License has expired

Dashboard Information
(Cisco Hyperflex servers)
Collected Items (Cisco Hyperflex Servers)
Name Description
Uptime (network) MIB: SNMPv2-MIB
The time in seconds since the network management portion of the system was last re-initialized.
Uptime (hardware) MIB: HOST-RESOURCES-MIB
The amount of time since this host was last initialized. Note that this is different from
sysUpTime in the SNMPv2-MIB [RFC1907] because sysUpTime is the uptime of the
network management portion of the system.
SNMP traps (fallback) The item is used to collect all SNMP traps unmatched by other
snmptrap items.
SNMP agent availability Availability of SNMP checks on the host. The value of this item
corresponds to availability icons in the host list.
Possible values:
0 - not available
1 - available
2 - unknown
Disk_arrays Disk array controller status
Disk_arrays Disk array controller model
Fans Fan status
Inventory Hardware model name
Inventory Hardware serial number
Physical_disks Physical disk status
Physical_disks Physical disk model name
Physical_disks Physical disk media type
Physical_disks Disk size
Power_supply Power supply status
Status Overall system health status
Temperature Ambient: Temperature
Temperature Front: Temperature
Temperature Rear: Temperature
Virtual_disks Status
Virtual_disks Layout type
Virtual_disks Disk size

Trigger Events (Cisco Hyperflex Servers)
Host has been restarted
Disk array controller is in critical state
Disk array controller is in warning state
Disk array controller is not in optimal state
Disk array cache controller battery is in critical state!
Fan is in critical state
Fan is in warning state
Device has been replaced (new serial number received)
Physical disk failed
Power supply is in critical state
Power supply is in warning state
System status is in critical state
System status is in warning state
Temperature is above warning threshold
Temperature is above critical threshold
Unavailable by ICMP ping

Dashboard Information (VMware ESXi Hypervisor)
Dashboard Information (VMware Guest)
Collected Items (VMware Guest)
Name Description
Cluster name Cluster name of the guest VM.

Number of virtual CPUs Number of virtual CPUs assigned to the guest.

CPU ready Time that the virtual machine was ready, but could not get scheduled to
run on the physical CPU during last measurement interval (VMware
vCenter/ESXi Server performance counter sampling interval - 20
seconds)
CPU usage Current upper-bound on CPU usage. The upper-bound is based on the
host the virtual machine is currently running on, as well as limits
configured on the virtual machine itself or any parent resource pool. Valid
while the virtual machine is running.
Datacenter name Datacenter name of the guest VM.
Hypervisor name Hypervisor name of the guest VM.
Ballooned memory The amount of guest physical memory that is currently reclaimed through
the balloon driver.
Compressed memory The amount of memory currently in the compression cache for this VM.
Private memory Amount of memory backed by host memory and not being shared.
Shared memory The amount of guest physical memory shared through transparent page
sharing.
Swapped memory The amount of guest physical memory swapped out to the VM's swap
device by ESX.
Guest memory usage The amount of guest physical memory that is being used by the VM.
Host memory usage The amount of host physical memory allocated to the VM, accounting for
saving from memory sharing with other VMs.
Memory size Total size of configured memory.
Power state The current power state of the virtual machine.
Committed storage space Total storage space, in bytes, committed to this virtual machine across all
datastores.
Uncommitted storage space Additional storage space, in bytes, potentially used by this virtual
machine on all datastores.
Unshared storage space Total storage space, in bytes, occupied by the virtual machine across all
datastores, that is not shared with any other virtual machine.
Uptime System uptime.

Guest memory swapped Amount of guest physical memory that is swapped out to the swap
space.
Host memory consumed Amount of host physical memory consumed for backing up guest
physical memory pages.
Host memory usage in percents Percentage of host physical memory that has been consumed.
CPU usage in percents CPU usage as a percentage during the interval.
CPU latency in percents Percentage of time the virtual machine is unable to run because it is
contending for access to the physical CPU(s).
CPU readiness latency in percents Percentage of time that the virtual machine was ready, but
could not get scheduled to run on the physical CPU.
CPU swap-in latency in percents Percentage of CPU time spent waiting for swap-in.
Uptime of guest OS Total time elapsed since the last operating system boot-up (in seconds).
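The CPU ready item above is a time value accumulated over the 20-second sampling interval, while the readiness latency item is a percentage. A minimal sketch of the commonly used conversion between the two follows. It assumes the ready time is reported in milliseconds and covers the whole VM; whether to normalize further by the number of virtual CPUs is a reporting choice not covered here.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert CPU ready time (ms) in one sampling interval to a percentage."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100 / (interval_s * 1000)


# 1000 ms of ready time within a 20-second interval.
print(cpu_ready_percent(1000))   # 5.0
```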
Trigger Events (VMware ESXi Hypervisor Cluster)
VMware Cluster     Cluster status is Red
                   Cluster status is Yellow
Datastore          Free space is critically low
                   Free space is low
VM                 VM has been restarted
VMware Hypervisor  Hypervisor is down
                   The health is Red
                   The health is Yellow
                   Hypervisor has been restarted
Estimated Project Timeline

[Gantt chart: high-level tasks plotted across Week 1 to Week 15, with 5 man-days allocated per week]

High-level tasks:
Kick Off Meeting & Requirement Gathering
Zabbix Implementation
Knowledge Transfer & Deliver Documents
Project Close


NEX4 Support Scope Details

No. Support Case Type Description

1. Incidents and Alerts in the Zabbix monitoring tool
   Issues reported with the NEX4 monitoring tool that disrupt the monitoring and management of critical
   infrastructure.
2. Incident Triage and Escalation
   Classification of incidents by severity and impact. Escalation of complex or high-priority issues to
   senior engineers or third-party vendors.
3. Root Cause Analysis
   A thorough investigation to determine the underlying cause of incidents and event problems in Zabbix,
   with solutions to prevent future occurrences.
4. Questionnaire
   General service questions related to the products NEX4 is responsible for.
5. Additional configuration
   If additional devices or services need to be added to the NOC tool, NEX4 will provide these additional
   settings, limited to a maximum of 5 devices or services per month.
NEX4 Premium Support SLAs
Severity level  Initial Response       Onsite Support (if required)  Subsequent Support  Severity Description
1 Critical      Within 30 minutes      24/7 x 4 hours                Every 30 min.       A problem has made a critical service unusable or unavailable and no workaround exists.
2 High          Within 1 hour          24/7 x 6 hours                Every 2 hours       A problem has made a service unavailable, but a workaround exists.
3 Medium        Within 2 hours         2 business days               Every 4 hours       A certain function in a service is degraded.
4 Low           Within 1 business day  Upon availability             Weekly              General assistance: configuration help, questions, etc.

Initial Response is when a ticket is opened and acknowledged by help desk staff. (Note: For Severity 1 and 2, we suggest informing the NEX4 team by
telephone for a faster response.)

Subsequent Support is the frequency with which the user who logged the ticket is updated on the resolution status.

Onsite Support starts once the NEX4 support manager decides that onsite support is required.
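The response and update targets in the SLA table can be expressed as simple time offsets from the moment a ticket is opened. The sketch below covers Severity 1 to 3 only; the Severity 4 initial response ("within 1 business day") needs a business calendar and is deliberately left out. The function names and the ticket timestamp are hypothetical.

```python
from datetime import datetime, timedelta

# Targets taken from the SLA table, keyed by severity level.
INITIAL_RESPONSE = {
    1: timedelta(minutes=30),
    2: timedelta(hours=1),
    3: timedelta(hours=2),
}
UPDATE_INTERVAL = {
    1: timedelta(minutes=30),
    2: timedelta(hours=2),
    3: timedelta(hours=4),
    4: timedelta(weeks=1),
}


def response_deadline(opened_at, severity):
    """Latest time by which the ticket must be acknowledged."""
    if severity not in INITIAL_RESPONSE:
        raise ValueError("business-day targets are not modelled in this sketch")
    return opened_at + INITIAL_RESPONSE[severity]


def next_update_due(last_update, severity):
    """Time by which the next status update is due to the ticket owner."""
    return last_update + UPDATE_INTERVAL[severity]


opened = datetime(2024, 1, 1, 9, 0)   # hypothetical ticket time
print(response_deadline(opened, 1))
print(next_update_due(opened, 2))
```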
24x7 Emergency Contact
Email: Support@nex4.net
Phone: +959 683 545 333, +959 765 203 073
Thank You
