
Consolidated Article for etcd guidelines with

OpenShift Container Platform 4.x


Updated September 16, 2022

Disclaimers:
The following guidelines are based on official Red Hat documentation and KB articles,
and are intended to provide information to help meet your system's needs. They must be
reviewed, monitored, and revisited by customers and partners according to their own
business requirements, application workloads, and new demands.

Links contained herein to external website(s) are provided for convenience only. Red
Hat has not reviewed the links and is not responsible for the content or its availability.
The inclusion of any link to an external website does not imply endorsement by Red Hat
of the website or its entities, products, or services. You agree that Red Hat is not
responsible or liable for any loss or expenses that may result due to your use of (or
reliance on) the external site or content.

Table of Contents
 Abstract
 Guidelines
 Metrics
 Alerts
 etcd Documentation
 etcd logs

Abstract
etcd is the key-value store for Red Hat OpenShift Container Platform and Kubernetes,
which persists the state of all resource objects. It is a critical component in the
OpenShift Container Platform control plane.

It is also a CNCF project at the Graduated maturity level, which, according to the
CNCF, means it is considered stable and ready for production use.
Despite this stability, sizing and monitoring your control plane nodes is key to
successful Red Hat OpenShift Container Platform installations.

This article is intended to summarize Red Hat’s guidelines for etcd implementation and
to index the most important Red Hat KB solutions and articles on this subject.

Guidelines
 Leverage the embedded metrics, dashboards, and alerts delivered with the Red Hat
OpenShift Container Platform monitoring stack.
o Keeping systems under the pre-built recommended thresholds contributes
to keeping clusters in a healthy state.
o A single metric will not give the full picture of the system. Additional
metrics and the cluster state, gathered through logs and API/CLI calls, must also be
taken into consideration when analyzing an RHOCP cluster’s health.
o In addition to the embedded dashboards, Red Hat OpenShift Container
Platform allows you to query metrics with the Prometheus Query
Language (PromQL) for advanced troubleshooting purposes.
 Each workload affects etcd differently. As new workloads are added
to RHOCP clusters, the capacity of the RHOCP control plane nodes and the
impact on etcd must be reevaluated.
o How far and how often systems go above the defined thresholds,
during installation or during daily operations, is unpredictable.
 Do not share etcd drives with non-control plane workloads. Although sharing is a
supported configuration, control plane drives must not coexist with applications or
other infrastructure components.
o A supported configuration does not guarantee adequate performance. For example,
during support activities, if Red Hat Support detects that etcd is underperforming
or is negatively impacted, Red Hat Support will make a clear statement to
discuss options. Customers and partners must reproduce the issue in a
suggested configuration with etcd in a healthy state. In these situations,
Red Hat provides support only on a commercially reasonable
basis. Customers and partners must resolve the underlying infrastructure
bottleneck, and Red Hat provides only general advice through our
documentation, articles, and solutions.
 For more information, see KB 6271341.
 Use dedicated SSD/NVMe drives for the master/control plane functions. Depending on
your workload, you may also need to dedicate an SSD/NVMe drive to etcd.
 etcd's backend storage device typically requires 1500-2000 sequential IOPS,
with appropriate response times, for normal operations, but may require more
IOPS on heavily loaded clusters as etcd holds more objects.
o See the I/O Metrics items in the Metrics section for response time targets and KB
6271341 for additional troubleshooting details.
o etcd is a write-intensive workload. NVMe and write-intensive SSD drives
are the recommended drive types for this kind of workload.
 Size the platform according to application requirements and workload, following
Red Hat recommendations.
 Benchmark tools such as fio and etcd-perf can be used as a baseline for the first stage
of a sizing study (see the sketch after this list), but they are short tests executed at a
specific moment and will not give an overall view of how performance behaves over
several days. In a live cluster, only the metrics will confirm whether the hardware in
place is suitable for your workload.
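
As an illustration only, here is a minimal fio sketch (assuming fio is available on the
control plane node and that /var/lib/etcd is the etcd mount point) that approximates
etcd's small, synchronous write pattern by issuing an fdatasync after every write. The
fdatasync latency percentiles reported by fio should stay below the 10ms WAL fsync
target discussed in the Metrics section below:

Raw
# Sketch only: run against the disk that backs /var/lib/etcd on a control plane node.
# 2300-byte writes with fdatasync=1 approximate etcd's WAL write pattern.
$ fio --name=etcd-perf-check \
      --directory=/var/lib/etcd \
      --rw=write --ioengine=sync --fdatasync=1 \
      --size=22m --bs=2300
# In the output, check the fsync/fdatasync latency percentiles:
# the 99.00th percentile should be below 10000 usec (10 ms).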

Metrics
 Monitor leadership changes:
o Leadership changes are expected as a result of the installation/upgrade process or
day-1/day-2 operations (for example, as a result of Machine Config daemon
operations), but they are not expected during normal operations.
o The etcdHighNumberOfLeaderChanges alert can help identify that
situation.
o A Prometheus query can also be
used: sum(rate(etcd_server_leader_changes_seen_total[2m])).
o If leadership changes happen during normal operation, I/O and network metrics can
help identify the root cause.
 I/O metrics (example queries for these thresholds are shown after this list):
o etcd_disk_backend_commit_duration_seconds_bucket with a p99 duration
of less than 25ms.
o etcd_disk_wal_fsync_duration_seconds_bucket with a p99 duration of less
than 10ms.
 Network metrics:
o etcd_network_peer_round_trip_time_seconds_bucket with a p99 duration
of less than 50ms.
o Network RTT latency: high network latency and packet drops can also
lead to an unreliable etcd cluster state, so network health values (RTT and
packet drops) should be monitored.
 etcd can also suffer poor performance if the keyspace grows excessively large
and exceeds the space quota. Some key metrics to monitor are:
o etcd_server_quota_backend_bytes, which is the current quota limit.
o etcd_mvcc_db_total_size_in_use_in_bytes, which indicates the actual
database usage after a history compaction.
o etcd_debugging_mvcc_db_total_size_in_bytes, which shows the
database size including free space waiting for defragmentation.
o An etcd database can grow up to 8 GB.
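
For convenience, the thresholds above can be checked with PromQL. The following is only
a sketch: it queries the in-cluster Thanos querier route from the CLI and assumes a
session with permission to query cluster metrics (for example, cluster-admin); the same
expressions can also be pasted into the web console metrics page.

Raw
# Sketch: query the in-cluster Thanos querier from the CLI
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

# p99 WAL fsync duration per etcd member -- target below 10ms
$ curl -skG "https://$HOST/api/v1/query" \
    -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# Other expressions of interest (same call pattern):
#   histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))   # target < 25ms
#   histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))   # target < 50ms
#   sum(rate(etcd_server_leader_changes_seen_total[2m]))
#   etcd_mvcc_db_total_size_in_use_in_bytes / etcd_server_quota_backend_bytes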

Alerts
 Monitor and analyze the alerts generated by the OpenShift Monitoring alerting
rules (https://console-openshift-console.apps.<domain>/monitoring/alertrules?rowFilter-alerting-rule-source=platform&orderBy=asc&sortBy=Name&alerting-rule-name=etcd); the rule
expressions behind these alerts can also be inspected from the CLI, as sketched after
this list:
o etcdBackendQuotaLowSpace
o etcdExcessiveDatabaseGrowth
o etcdHighNumberOfFailedGRPCRequests
o etcdGRPCRequestsSlow
o etcdHighCommitDurations
o etcdHighFsyncDurations
o etcdHighNumberOfFailedProposals
o etcdHighNumberOfLeaderChanges
o etcdInsufficientMembers
o etcdMemberCommunicationSlow
o etcdMembersDown
o etcdNoLeader
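
The expressions and thresholds behind these alerts can be reviewed directly on the
cluster by inspecting the PrometheusRule objects that define them. The following is only
a sketch; the rule object name and namespace can differ between versions:

Raw
# Locate the PrometheusRule objects that carry the etcd alerting rules
$ oc get prometheusrules -A | grep -i etcd

# Dump the rule definitions (substitute the name/namespace returned above)
$ oc get prometheusrule <name> -n <namespace> -o yaml | grep -B 2 -A 8 'alert: etcd'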

etcd Documentation
 Red Hat Documentation: Recommended etcd practices
 Red Hat Documentation: Data storage management
 Red Hat Documentation: Other specific application storage recommendations
 KB etcd performance troubleshooting guide for OpenShift Container Platform
 KB Backend Performance Requirements for OpenShift etcd
 KB Mounting separate disk for OpenShift 4 etcd
 KB What does the etcd warning "failed to send out heartbeat on time" mean?
 KB How to list number of objects in etcd?
 KB rafthttp: the clock difference against peer XXXX is too high [Xm.XXXs > 1s]
 KB Guidance for Red Hat OpenShift Container Platform Clusters - Deployments
Spanning Multiple Sites (Data Centers/Regions)
 Upstream Project etcd Metrics
 Upstream Project What does the etcd warning “failed to send out heartbeat on
time” mean?
 Upstream Project What does the etcd warning “apply entries took too long”
mean?
 Upstream Project Leader failure

etcd logs
 Together with alerts and metrics, there are a few events in the etcd logs that can
also confirm whether etcd is performing as expected. A combined check for all of them
is sketched after this list.
o The following checks are a summary of the KB etcd performance
troubleshooting guide for OpenShift Container Platform. Check that article
for advanced troubleshooting guidelines.
o You can also consider a script written by one of our OpenShift
support engineers, available at https://github.com/peterducai/openshift-
etcd-suite. Please be aware that there is no official support for this
initiative.
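
For a quick first pass, the individual log checks shown in the sections below can be
combined into a single loop. This is a convenience sketch only (it assumes the etcd pods
carry the app=etcd label) and simply counts the warning signatures discussed in this
article for every etcd container:

Raw
# Count the etcd warning signatures discussed in this article across all etcd pods.
# A healthy cluster should report 0 for every pattern on every member.
$ for POD in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
    echo "== $POD =="
    for PATTERN in 'failed to send out heartbeat on time' \
                   'server is likely overloaded' \
                   'took too long' \
                   'clock difference' \
                   'database space exceeded' \
                   'leader changed'; do
      printf '%-40s %s\n' "$PATTERN" \
        "$(oc logs -n openshift-etcd $POD -c etcd | grep -c "$PATTERN")"
    done
  done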

Cluster Operators State


In this particular example, this cluster is in a healthy state, and etcd has been running
properly for 12 days.

Raw
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.41    True        False         12d     Cluster version is 4.8.41

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.41    True        False         False      4d
baremetal                                  4.8.41    True        False         False      12d
cloud-credential                           4.8.41    True        False         False      12d
cluster-autoscaler                         4.8.41    True        False         False      12d
config-operator                            4.8.41    True        False         False      12d
console                                    4.8.41    True        False         False      5d
csi-snapshot-controller                    4.8.41    True        False         False      12d
dns                                        4.8.41    True        False         False      12d
etcd                                       4.8.41    True        False         False      12d   <<<< ###### etcd cluster operator
image-registry                             4.8.41    True        False         False      12d
ingress                                    4.8.41    True        False         False      12d
insights                                   4.8.41    True        False         False      12d
kube-apiserver                             4.8.41    True        False         False      5d
kube-controller-manager                    4.8.41    True        False         False      12d
kube-scheduler                             4.8.41    True        False         False      12d
kube-storage-version-migrator              4.8.41    True        False         False      12d
machine-api                                4.8.41    True        False         False      12d
machine-approver                           4.8.41    True        False         False      12d
machine-config                             4.8.41    True        False         False      12d
marketplace                                4.8.41    True        False         False      12d
monitoring                                 4.8.41    True        False         False      21h
network                                    4.8.41    True        False         False      12d
node-tuning                                4.8.41    True        False         False      12d
openshift-apiserver                        4.8.41    True        False         False      5d
openshift-controller-manager               4.8.41    True        False         False      11d
openshift-samples                          4.8.41    True        False         False      12d
operator-lifecycle-manager-catalog         4.8.41    True        False         False      12d
operator-lifecycle-manager-packageserver   4.8.41    True        False         False      12d
operator-lifecycle-manager                 4.8.41    True        False         False      12d
service-ca                                 4.8.41    True        False         False      12d
storage                                    4.8.41    True        False         False      12d

$ oc get co etcd -o yaml|grep -A 4 lastTransitionTime


  - lastTransitionTime: "2022-04-19T22:33:10Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      etcdMembersDegraded: No unhealthy members found
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:40:50Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 3
      etcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:30:41Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
      etcdMembersAvailable: 3 members are available
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:28:11Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable

etcd heartbeat
As etcd uses a leader-based consensus protocol for consistent data replication and log
execution, it relies on a heartbeat mechanism to keep the cluster members in a healthy
state.

If etcd is logging messages like failed to send out heartbeat on time, it means that
your etcd cluster is facing instability and slow response times, which may cause
unexpected leadership changes that directly affect the RHOCP control plane.

Here is a sample from etcd logs:

Raw
2022-07-06 14:48:38.562412 W | etcdserver: failed to send out heartbeat on time
(exceeded the 100ms timeout for 34.328790ms)

This issue may have multiple causes, but it is most commonly triggered by slow or shared
backend storage devices. CPU and network overload on the etcd members can also cause it,
although this is less frequent. You can also review the upstream etcd FAQ for additional
information.

Healthy etcd clusters must not have any of these events in the logs:

Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep 'failed to send out heartbeat on time' | wc -l
  done
0
0
0

etcd is likely overloaded


etcd heartbeat messages can be followed by server is likely overloaded:

Raw
2022-07-06 17:43:49.342412 W | etcdserver: failed to send out heartbeat on time
(exceeded the 100ms timeout for 34.328790ms)
2022-07-06 17:43:49.342430 W | etcdserver: server is likely overloaded

As with the heartbeat messages, healthy etcd clusters must not have any of these events
in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep "server is likely overloaded" | wc -l
  done
0
0
0

etcd warning “apply entries took too long”


As required by its consensus protocol implementation, after a majority of etcd
members agree to commit a request, each etcd server applies the request to its data
store and persists the result to disk. If a request takes longer than 100
milliseconds to apply, etcd logs a warning that the apply request took too long:

Raw
2022-06-20T17:51:50.325433212Z {"level":"warn","ts":"2022-06-
20T17:51:50.325Z","caller":"etcdserver/util.go:163","msg":"apply request took too
long","took":"232.747298ms","expected-duration":"100ms","prefix":"read-only range
","request":"key:\"/kubernetes.io/namespaces/default\" keys_only:true
","response":"range_response_count:1 size:54"}

This issue is most commonly triggered by slow or shared backend storage devices, but,
less frequently, it can also be caused by CPU overload.

It may be observed during upgrades or machine config operations, but during normal
operations there should ideally be no messages like this. In that case, additional
metrics and log events must be taken into consideration as well.

Ideally, healthy clusters should not have any of these events in the logs:

Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep 'took too long' | wc -l
  done
0
0
0

etcd clock difference


When the cluster members' clocks are out of sync with each other, they cause I/O
timeouts and liveness probe failures, which make the etcd pods restart frequently.
The following log entries can be observed in that situation:
Raw
2021-09-24T06:39:16.408674158Z 2021-09-24 06:39:16.408617 W | rafthttp: the clock
difference against peer ab37b46865cdfbf7 is too high [4m18.466926704s > 1s]
2021-09-24T06:39:16.465279570Z 2021-09-24 06:39:16.465225 W | rafthttp: the clock
difference against peer ab37b46865cdfbf7 is too high [4m18.463381838s > 1s]

Make sure that chrony is enabled, running, and in sync; this can be verified with
chronyc sources and chronyc tracking, as sketched below.
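
As an illustration, the following sketch checks time synchronization on every control
plane node from the CLI (it assumes oc debug access to the nodes and the
node-role.kubernetes.io/master label):

Raw
# Verify chrony synchronization on each control plane node
$ for NODE in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
    echo "== $NODE =="
    oc debug $NODE -- chroot /host chronyc tracking 2>/dev/null | \
      grep -E 'Reference ID|System time|Leap status'
  done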

Healthy etcd clusters must not have any of these events in the logs:

Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep 'clock difference' | wc -l
  done
0
0
0

etcd database space exceeded


etcd keeps a history of its keyspace. Without periodically compacting this history
(for example, by setting --auto-compaction), etcd will eventually exhaust its storage
space. If etcd runs low on storage space, it raises a space quota alarm to protect the
cluster from further writes. As long as the alarm is raised, etcd responds to write
requests with the error mvcc: database space exceeded.

In RHOCP 4.x, history compaction is performed automatically every five minutes and
leaves gaps in the back-end database. This fragmented space is available for use by
etcd, but is not available to the host file system. You must defragment etcd to make this
space available to the host file system.

Starting in RHOCP 4.9.z, defragmentation occurs automatically.

To recover from this issue in earlier versions, you can also trigger defragmentation
manually, as sketched below.
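
For reference, the following is a heavily simplified sketch of a manual defragmentation.
Follow the official documentation for the full procedure (defragment one member at a
time, leaving the leader for last); the pod name below is only an example:

Raw
# Open a shell in one etcd pod (example name) and defragment its local member only
$ oc rsh -n openshift-etcd etcd-<control-plane-node>

unset ETCDCTL_ENDPOINTS
etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag

# Verify the reclaimed database size, then clear the space quota alarm if one was raised
etcdctl endpoint status -w table --cluster
etcdctl alarm list
etcdctl alarm disarm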

Healthy etcd clusters must not have any of these events in the logs:

Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep 'database space exceeded' | wc -l
  done
0
0
0

etcd leadership changes and failures
Leadership changes are expected only during installations, upgrades, or machine
config operations. During day-to-day operation, they must be avoided.

Here is an example of this warning:

Raw
2020-09-15 01:33:49.736399 W | etcdserver: read-only range request
"key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took
too long (1.879359619s) to execute
2020-09-15 01:33:49.736422 W | etcdserver: read-only range request
"key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took
too long (4.851652166s) to execute

During the leader election the cluster cannot process any writes. Write requests sent
during the election are queued for processing until a new leader is elected. Until a new
leader is elected, instability, slow response times, and unexpected behaviors affecting
the RHOCP control plane will be observed.

This issue is a side effect of the previous events described in this article and may
have multiple contributing factors involved.

Healthy etcd clusters must not have any of these events in the logs during normal
operations:

Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do
    oc logs $POD -c etcd -n openshift-etcd | grep 'leader changed' | wc -l
  done
0
0
0
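
If leader changes do show up in the logs or metrics, the current leader, Raft term, and
database size of each member can be inspected from any etcd pod (a sketch; the pod name
is an example):

Raw
$ oc rsh -n openshift-etcd etcd-<control-plane-node>

# The IS LEADER and RAFT TERM columns show the current leader and how many
# elections the cluster has gone through
etcdctl endpoint status -w table --cluster
etcdctl member list -w table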
