Consolidated Article For Etcd Guidelines With OpenShift Container Platform 4.x
Disclaimers:
The following guidelines are based on official Red Hat documentation and KB articles,
and are intended to provide information to meet the system's needs. They must be
reviewed, monitored and revisited by customers and partners according to their own
business requirements, application workloads and new demands.
Links contained herein to external website(s) are provided for convenience only. Red
Hat has not reviewed the links and is not responsible for the content or its availability.
The inclusion of any link to an external website does not imply endorsement by Red Hat
of the website or its entities, products or services. You agree that Red Hat is not
responsible or liable for any loss or expenses that may result due to your use of (or
reliance on) the external site or content.
Table of Contents
Abstract
Guidelines
Metrics
Alerts
etcd Documentation
etcd logs
Abstract
etcd is the key-value store for Red Hat OpenShift Container Platform and Kubernetes,
which persists the state of all resource objects. It is a critical component in the
OpenShift Container Platform control plane.
It is also a CNCF project at the Graduated maturity level, which, according to CNCF,
means it is considered stable and ready for production use.
Despite its stability in production systems, sizing and monitoring your control plane
nodes is key to successful Red Hat OpenShift Container Platform installations.
This article is intended to summarize Red Hat’s guidelines for etcd implementation and
index the most important Red Hat’s KB solutions and articles about this subject.
Guidelines
Leverage the embedded metrics, dashboards and alerts delivered together
with the Red Hat OpenShift Container Platform monitoring stack
o Keeping systems under the pre-built recommended thresholds contributes
to keeping clusters in a healthy state.
o A single metric will not give the full picture of the system. Additional
metrics and the cluster state, gathered through logs and API/CLI calls, must
also be taken into consideration when analyzing an RHOCP cluster's health.
o In addition to the embedded dashboards, Red Hat OpenShift Container
Platform allows you to query metrics through the Prometheus Query
Language (PromQL) for advanced troubleshooting purposes.
Each workload has a different effect on etcd. As new workloads are added
to RHOCP clusters, the capacity of the RHOCP control plane nodes and the
effect on etcd must be reevaluated.
o How much, and how often, systems exceed the defined thresholds
during installation or during daily operations is unpredictable.
Do not share etcd drives with non-control-plane workloads. Although it is a
supported configuration, control plane drives must not coexist with applications or
other infrastructure.
o Being supported does not guarantee performance. For example, during support
activities, if Red Hat Support detects that etcd is underperforming or is
negatively impacted, Red Hat Support will make a clear statement to
discuss options. Customers and partners must reproduce the issue in a
suggested configuration with etcd in a healthy state. In these situations,
Red Hat provides support only on a commercially reasonable support
basis. Customers and partners must resolve the underlying infrastructure
bottleneck, and Red Hat provides only general advice by using our
documentation, articles, and solutions.
For more information, see KB 6271341.
Use dedicated SSD/NVMe drives for master/control plane functions. Depending on
your workload, you may also need to dedicate an SSD/NVMe drive to etcd.
etcd's backend storage device requires 1500-2000 sequential IOPS,
with appropriate response times, for normal operations, but may require more
IOPS for heavily loaded clusters as etcd holds more objects.
o See the I/O Metrics section under Metrics for response time metrics, and KB
6271341 for additional troubleshooting details.
o etcd is a write-intensive workload. NVMe and write-intensive SSD drives
are the recommended drive types for this kind of workload.
Size the platform according to application requirements, workload, and following
Red Hat recommendations.
Benchmark tools like fio and etcd-perf can be used as a baseline for the first stage
of a sizing study, but they are short tests executed at a specific moment and will not
give an overall view of how performance behaves over several days. In a live
cluster, only the metrics will confirm whether the hardware in place is suitable for
your workload.
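This first-stage disk check can be sketched with fio. The sketch below is modeled on the upstream fio guidance for etcd; the flags and the use of a temporary directory are assumptions to make the example self-contained (in practice, the test directory must sit on the same device that backs /var/lib/etcd). The fdatasync 99th percentile reported by fio should stay below 10ms:

```shell
# Minimal fio disk check modeled on the upstream etcd guidance (assumed flags;
# the temporary directory is an example -- use the etcd device in practice).
TESTDIR=$(mktemp -d)
if command -v fio >/dev/null 2>&1; then
  # 22m of 2300-byte writes with an fsync after each, mimicking etcd's WAL pattern
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory="$TESTDIR" --size=22m --bs=2300 --name=etcd-check
  FIO_RESULT=ran
else
  echo "fio not installed; install it to run this check"
  FIO_RESULT=skipped
fi
rm -rf "$TESTDIR"
```

Either way, treat the result only as a point-in-time baseline, as noted above; it does not replace ongoing monitoring of the metrics.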
Metrics
Monitor Leadership changes:
o Leadership changes are expected as a result of the installation/upgrade process
or day-1/day-2 operations (for example, Machine Config daemon operations), but
we do not expect to see them during normal operations.
o The etcdHighNumberOfLeaderChanges alert can help identify that
situation.
o A Prometheus query can also be
used (sum(rate(etcd_server_leader_changes_seen_total[2m]))).
o If leader changes happen during normal operation, I/O and network metrics
can help identify the root cause.
I/O Metrics:
o etcd_disk_backend_commit_duration_seconds_bucket with p99 duration
less than 25ms
o etcd_disk_wal_fsync_duration_seconds_bucket with p99 duration less
than 10ms
Network metrics:
o etcd_network_peer_round_trip_time_seconds_bucket p99 duration
should be less than 50ms.
o Network RTT latency: high network latency and packet drops can also
lead to an unreliable etcd cluster state, so network health values (RTT and
packet drops) should be monitored.
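The p99 thresholds above can be checked with standard histogram_quantile queries over the bucket metrics listed in this section (the 5m rate window is an example choice):

```promql
# 99th percentile backend commit duration (target: < 0.025s)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# 99th percentile WAL fsync duration (target: < 0.01s)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# 99th percentile peer round-trip time (target: < 0.05s)
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
```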
etcd can also suffer poor performance if the keyspace grows excessively large
and exceeds the space quota.
o Some key metrics to monitor are:
o etcd_server_quota_backend_bytes, which is the current quota limit.
o etcd_mvcc_db_total_size_in_use_in_bytes, which indicates the actual
database usage after a history compaction.
o etcd_debugging_mvcc_db_total_size_in_bytes, which shows the
database size, including free space waiting for defragmentation.
o An etcd database can grow up to 8 GB.
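A sketch of queries that combine these metrics to track keyspace growth against the quota (a ratio approaching 1 means the quota is close to being exhausted; metric names vary slightly across etcd versions, for example etcd_debugging_mvcc_db_total_size_in_bytes in older releases):

```promql
# Fraction of the backend quota currently allocated on disk
etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes

# Space held by the database file but not in use (reclaimable by defragmentation)
etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes
```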
Alerts
Monitor and analyze the alerts generated by the OpenShift Monitoring alerting
rules (https://console-openshift-console.apps.<domain>/monitoring/alertrules?rowFilter-alerting-rule-source=platform&orderBy=asc&sortBy=Name&alerting-rule-name=etcd):
o etcdBackendQuotaLowSpace
o etcdExcessiveDatabaseGrowth
o etcdHighNumberOfFailedGRPCRequests
o etcdGRPCRequestsSlow
o etcdHighCommitDurations
o etcdHighFsyncDurations
o etcdHighNumberOfFailedProposals
o etcdHighNumberOfLeaderChanges
o etcdInsufficientMembers
o etcdMemberCommunicationSlow
o etcdMembersDown
o etcdNoLeader
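As a quick check, the alerts above that are currently pending or firing can also be listed with a single query over Prometheus's built-in ALERTS metric:

```promql
# All etcd-related alerts currently pending or firing
ALERTS{alertname=~"etcd.*"}
```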
etcd Documentation
Red Hat Documentation: Recommended etcd practices
Red Hat Documentation: Data storage management
Red Hat Documentation: Other specific application storage recommendations
KB etcd performance troubleshooting guide for Openshift Container Platform
KB Backend Performance Requirements for OpenShift etcd
KB Mounting separate disk for OpenShift 4 etcd
KB What does the etcd warning "failed to send out heartbeat on time" mean?
KB How to list number of objects in etcd?
KB rafthttp: the clock difference against peer XXXX is too high [Xm.XXXs > 1s]
KB Guidance for Red Hat OpenShift Container Platform Clusters - Deployments
Spanning Multiple Sites(Data Centers/Regions)
Upstream Project etcd Metrics
Upstream Project What does the etcd warning “failed to send out heartbeat on
time” mean?
Upstream Project What does the etcd warning “apply entries took too long”
mean?
Upstream Project Leader failure
etcd logs
Together with alerts and metrics, there are a few events in the etcd logs that can
also confirm whether etcd is performing as expected.
o The following checks are a summary from the KB etcd performance
troubleshooting guide for Openshift Container Platform. Check that article
for advanced troubleshooting guidelines.
o You can also consider a script written by one of our OpenShift
support engineers, available at https://github.com/peterducai/openshift-etcd-suite.
Please be aware that there is no official support for this initiative.
Raw
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.41    True        False         12d     Cluster version is 4.8.41
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.41    True        False         False      4d
baremetal                                  4.8.41    True        False         False      12d
cloud-credential                           4.8.41    True        False         False      12d
cluster-autoscaler                         4.8.41    True        False         False      12d
config-operator                            4.8.41    True        False         False      12d
console                                    4.8.41    True        False         False      5d
csi-snapshot-controller                    4.8.41    True        False         False      12d
dns                                        4.8.41    True        False         False      12d
etcd                                       4.8.41    True        False         False      12d   <<<< ###### etcd cluster operator
image-registry                             4.8.41    True        False         False      12d
ingress                                    4.8.41    True        False         False      12d
insights                                   4.8.41    True        False         False      12d
kube-apiserver                             4.8.41    True        False         False      5d
kube-controller-manager                    4.8.41    True        False         False      12d
kube-scheduler                             4.8.41    True        False         False      12d
kube-storage-version-migrator              4.8.41    True        False         False      12d
machine-api                                4.8.41    True        False         False      12d
machine-approver                           4.8.41    True        False         False      12d
machine-config                             4.8.41    True        False         False      12d
marketplace                                4.8.41    True        False         False      12d
monitoring                                 4.8.41    True        False         False      21h
network                                    4.8.41    True        False         False      12d
node-tuning                                4.8.41    True        False         False      12d
openshift-apiserver                        4.8.41    True        False         False      5d
openshift-controller-manager               4.8.41    True        False         False      11d
openshift-samples                          4.8.41    True        False         False      12d
operator-lifecycle-manager-catalog         4.8.41    True        False         False      12d
operator-lifecycle-manager-packageserver   4.8.41    True        False         False      12d
operator-lifecycle-manager                 4.8.41    True        False         False      12d
service-ca                                 4.8.41    True        False         False      12d
storage                                    4.8.41    True        False         False      12d
If etcd is logging messages like failed to send out heartbeat on time, it means that
your etcd cluster is facing instability and slow response times, and it may cause
unexpected leadership changes, which directly affect the RHOCP control plane.
Raw
2022-07-06 14:48:38.562412 W | etcdserver: failed to send out heartbeat on time
(exceeded the 100ms timeout for 34.328790ms)
This issue may have multiple causes, but it is most commonly triggered by slow or
shared backend storage devices. CPU and networking overload on the etcd members
can also cause it, but that is less frequent. You can also review the upstream etcd FAQ
for additional information.
Healthy etcd clusters must not have any of these events in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'failed to send out heartbeat on time' | wc -l; done
0
0
0
Raw
2022-07-06 17:43:49.342412 W | etcdserver: failed to send out heartbeat on time
(exceeded the 100ms timeout for 34.328790ms)
2022-07-06 17:43:49.342430 W | etcdserver: server is likely overloaded
Same as heartbeat messages, healthy etcd clusters must not have any of these events
in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'server is likely overloaded' | wc -l; done
0
0
0
Raw
2022-06-20T17:51:50.325433212Z {"level":"warn","ts":"2022-06-
20T17:51:50.325Z","caller":"etcdserver/util.go:163","msg":"apply request took too
long","took":"232.747298ms","expected-duration":"100ms","prefix":"read-only range
","request":"key:\"/kubernetes.io/namespaces/default\" keys_only:true
","response":"range_response_count:1 size:54"}
This issue is most commonly triggered by slow or shared backend storage devices,
but, less frequently, also by CPU overload.
Ideally, healthy clusters should not have any of these events in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'took too long' | wc -l; done
0
0
0
Make sure that chrony is enabled, running, and in sync, using the chronyc sources and
chronyc tracking commands.
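A minimal sketch of an automated sync check: in a live cluster, the output would come from oc debug node/<node> -- chroot /host chronyc tracking; here a hypothetical sample of that output is parsed for illustration.

```shell
# Parse "chronyc tracking" output for the leap status; "Normal" indicates the
# clock is synchronized. The sample below is hypothetical illustration data.
sample='Reference ID    : C0A80101 (ntp.example.com)
Stratum         : 3
Leap status     : Normal'
if printf '%s\n' "$sample" | grep -q '^Leap status.*Normal'; then
  CHRONY_STATE=synced
else
  CHRONY_STATE=not-synced
fi
echo "chrony: $CHRONY_STATE"
```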
Healthy etcd clusters must not have any of these events in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'clock difference' | wc -l; done
0
0
0
In RHOCP 4.x, history compaction is performed automatically every five minutes and
leaves gaps in the back-end database. This fragmented space is available for use by
etcd, but is not available to the host file system. You must defragment etcd to make this
space available to the host file system.
To recover from this issue in earlier versions, you can trigger defragmentation
manually.
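Manual defragmentation can be sketched as below, following the documented procedure of running etcdctl defrag inside each etcd pod; the pod name is an example, and non-leader members should be defragmented first, the leader last.

```shell
# Sketch of manual etcd defragmentation (the pod name etcd-master-0 is an
# example; repeat for each etcd member). Guarded so it only runs with cluster
# access.
if oc get pods -n openshift-etcd >/dev/null 2>&1; then
  oc rsh -n openshift-etcd etcd-master-0 sh -c 'unset ETCDCTL_ENDPOINTS; etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag'
  DEFRAG_RESULT=attempted
else
  echo "no cluster access; run this against a live cluster"
  DEFRAG_RESULT=skipped
fi
```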
Healthy etcd clusters must not have any of these events in the logs:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'database space exceeded' | wc -l; done
0
0
0
etcd leadership changes and failures
Leadership changes are expected only during installations, upgrades, or machine
config operations. They should not occur during day-to-day operations.
Raw
2020-09-15 01:33:49.736399 W | etcdserver: read-only range request
"key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took
too long (1.879359619s) to execute
2020-09-15 01:33:49.736422 W | etcdserver: read-only range request
"key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took
too long (4.851652166s) to execute
During a leader election the cluster cannot process any writes. Write requests sent
during the election are queued for processing until a new leader is elected. Until a new
leader is elected, you will observe instability, slow response times, and unexpected
behavior affecting the RHOCP control plane.
This issue is a side effect of the events described earlier in this article and may
have multiple contributing factors involved.
Healthy etcd clusters must not have any of these events in the logs during normal
operations:
Raw
$ oc get pods -n openshift-etcd | grep etcd | grep -v quorum | while read POD line; do oc logs $POD -c etcd -n openshift-etcd | grep 'leader changed' | wc -l; done
0
0
0