
HA scope

• Unless stated otherwise, “HA setup” means that a given service can survive the loss of one node without impacting user experience
– For some services (typically, mediation services), simple rescheduling and a rapid restart are sufficient. This is the close equivalent of an “active/standby” setup
– For interactive services, an “active/active” setup is necessary:
• State is shared externally
• Containers are spread across nodes and availability zones. If a node goes down, the associated instance is lost and the remaining instance serves all traffic
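
The HA definition above (survive the loss of any one node) can be checked against an instance placement. The sketch below is illustrative only: the `placement` mapping, the instance names and the node names are hypothetical and do not come from any product inventory.

```python
def survives_single_node_loss(placement):
    """Check the "loss of one node" HA criterion for one service.

    `placement` maps an instance name to the node it runs on
    (hypothetical format, used only for this illustration).
    Returns True if at least one instance is still running after
    the loss of any single node.
    """
    if not placement:
        return False  # no instance at all: nothing can serve traffic
    for lost_node in set(placement.values()):
        remaining = [inst for inst, node in placement.items() if node != lost_node]
        if not remaining:
            return False  # every instance was on the lost node
    return True


# Active/active: two instances spread over two nodes.
spread = {"web-0": "node-a", "web-1": "node-b"}
# Two instances scheduled on the same node: redundancy without HA.
colocated = {"web-0": "node-a", "web-1": "node-a"}

print(survives_single_node_loss(spread))     # True
print(survives_single_node_loss(colocated))  # False
```

The check only covers placement. It says nothing about whether state is shared externally, which the active/active setup also requires.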

© 2021 MYCOM OSI Confidential


HA limitations
• The following scenarios are typically unsupported
– Loss of more than one node at the same time (i.e. loss of node 2 before node 1 has
been completely replaced)
– Loss of network connectivity
– Workload that fills the disk or exhausts memory
• This means that no instance, however redundant, will be able to execute the transaction without crashing. All redundant instances will go down in turn

• Additionally, HA mechanisms can have bugs
– Currently, more outages are caused by malfunctioning HA setups than by “known unavailable” services
– To be fixed as a priority



HA services with malfunction track record
• Redis
– Massive impact when down (almost all services depend on it)
– All product problems theoretically fixed, recent tests unable to reproduce corrupted state
– Usual limitations apply; monitoring needs attention (no plan currently)
– Migration to managed, scalable service planned in EAA 3.7 (March 2022)
• ZooKeeper/Kafka
– RCA: housekeeping problem causing occasional disk full
– Recently confirmed solved
– Plan needed to ensure monitoring / alerting is in place
• Elastic
– RCA for failure cannot be obtained (running version too old)
– Urgent action: upgrade all customers to the most recent (supported) version ASAP

• License manager
– Occasional cases of rescheduling failure (Airtel)
– RCA unknown, but active/active HA would mitigate the problem (see roadmap)
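
Several items above call for monitoring, and the ZooKeeper/Kafka root cause was an occasional full disk. As a minimal sketch of the kind of check such monitoring could start from, the code below flags a filesystem whose usage crosses a threshold. The 85% threshold and the function names are assumptions made for illustration, not an agreed alerting policy.

```python
import shutil


def used_fraction(total_bytes, used_bytes):
    """Return the fraction of the filesystem in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_alert(path, threshold=0.85):
    """Return True if the filesystem holding `path` is at or above `threshold`.

    The default threshold is an assumed value chosen for this example.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_fraction(usage.total, usage.used) >= threshold


# Check the filesystem of the current working directory.
print(should_alert("."))
```

A real deployment would export this as a metric to the existing monitoring stack rather than poll from a script. The point is only that the disk-full failure mode is cheap to detect before it takes down every redundant instance.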

Main services not in HA setup yet

| Product | Service | Customer impact | Roadmap |
|---|---|---|---|
| ECP | Prometheus | No direct impact on customer; loss of EAA monitoring data | FY23 (?) |
| ECP | License Manager | No direct impact on customer | EAA 3.7, Mar 2022 |
| Proptima | Web-reporter, gis, alarm-rest | End user UI session aborted | EAA 4.0, Jun 2022 |
| Proptima | Schedule engine, alarm engine | Possible (minor) loss of alarms; missed scheduled reports | N/A |
| Proptima | Hot deployment of adaptors and config pipeline | Moderate impact (latency), limited to Airtel | EAA 4.0, Jun 2022 |
| Proassure | Service designer | Contents creation UI session aborted | N/A |
| Proassure | Problem manager | End user UI session aborted | EAA 3.6, Dec 2021 |


Main services not in HA setup yet (continued)

| Product | Service | Customer impact | Roadmap |
|---|---|---|---|
| Netxpert | Alert manager | End user UI session aborted | N/A |
| Netxpert | IDEAS, gateways | Contents creation UI session aborted; some loss of event state updates | N/A |
| Proinsight | orchestrator | Contents creation UI session aborted | N/A |
| Proactor | profiler | Contents creation UI session aborted | N/A |
