Photo by Christian Wiediger on Unsplash

Solving Node DNS issues and other things


Rafael Piovesan C Machado
Jun 16 · 8 min read

If you’ve got here because the title caught your attention, then chances are you’ve
struggled before with DNS-related issues in Node.js. These might appear as the
infamous EAI_AGAIN or even the widely popular ETIMEDOUT, which happened to me
because I’d set a timeout limit on my HTTP requests.

In my case, my company’s service recently experienced a sudden increase in usage,
which caused these problems to occur more often, even to the point of causing outages. Our
service architecture follows a very common pattern: in order to fulfill one user
request, we have to call a handful of APIs, process their results and finally get
back to the user with a proper response.

With the spike in traffic, we started to see a lot of ETIMEDOUT errors, and when we
looked closely into it, we noticed that requests were not reaching the target hosts,
meaning they weren’t even being made by the client. All of the timeouts were occurring
while trying to establish the connection, more precisely, while trying to resolve the
servers’ hostnames to IP addresses.

Whatever the symptoms you’re facing or may have come across, you should probably
know by now that, although HTTP calls in Node can be asynchronous, the hostname
resolution is usually done by dns.lookup(), which is asynchronous from JavaScript’s
perspective but, under the hood, makes a synchronous call to a low-level function
running on a fixed number of threads.
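
To make the distinction concrete, here’s a minimal sketch (the hostname is just an example):

```js
const dns = require('dns');

// Looks asynchronous from JavaScript: the callback fires later and the
// event loop is never blocked...
dns.lookup('example.com', (err, address, family) => {
  if (err) throw err;
  console.log(`example.com resolved to ${address} (IPv${family})`);
});

// ...but under the hood each lookup is a blocking getaddrinfo(3) call
// running on one of libuv's worker threads (only 4 by default). This is
// the same code path http.request() uses to resolve hostnames.
```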

For more information on this, take a look at: https://nodejs.org/api/dns.html#dns_implementation_considerations

Also, from the same document, we can see that:

Though the call to dns.lookup() will be asynchronous from JavaScript’s perspective, it is
implemented as a synchronous call to getaddrinfo(3) that runs on libuv’s threadpool. This
can have surprising negative performance implications for some applications, see the
UV_THREADPOOL_SIZE documentation for more information.

And that’s where the problem lies. By default, there’ll only be 4 threads available for
each Node process, as stated here: https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size

Because libuv’s threadpool has a fixed size, it means that if for whatever reason any of these
APIs takes a long time, other (seemingly unrelated) APIs that run in libuv’s threadpool will
experience degraded performance. In order to mitigate this issue, one potential solution is
to increase the size of libuv’s threadpool by setting the ‘UV_THREADPOOL_SIZE’
environment variable to a value greater than 4 (its current default value).

The implication is that, as the text says, seemingly unrelated API calls might start to
fail because of contention for libuv’s threadpool during hostname resolution. Here’s one great
article that describes this exact problem as faced by Uber: https://eng.uber.com/denial-by-dns/

Resolving localhost to ::1 (which is needed to connect to the local sidecar) involves calling a
synchronous getaddrinfo(3). This operation is done in a dedicated thread pool (with a
default of size 4 in Node.js). We discovered that these long DNS responses made it
impossible for the thread pool to quickly serve localhost to ::1 conversions.

As a result, none of our DNS queries went through (even for localhost), meaning that our
login service was not able to communicate with the local sidecar to test username and
password combinations, nor call other providers. From the Uber app perspective, none of
the login methods worked, and the user was unable to access the app.

That’s why DNS issues are even more aggravating in this situation: the unavailability of
one service snowballs and ends up affecting other (seemingly) unrelated services.

With only this information, one could think of a short-term solution: simply increase
the number of threads by setting UV_THREADPOOL_SIZE to a reasonable value. And it
might work … in some cases. But it didn’t for me.
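
For reference, this is roughly what that short-term fix looks like (64 is just an
illustrative value; the variable has to be set before libuv’s threadpool is first used):

```js
// Preferred: set it when launching the process, e.g.
//   UV_THREADPOOL_SIZE=64 node server.js

// Alternative: set it on the very first line of the entry point, before
// anything touches the threadpool (dns.lookup, fs, crypto, zlib, ...):
process.env.UV_THREADPOOL_SIZE = '64';

// ...and only then load the rest of the application.
```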

So, I continued my search and found this other great article: https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

It explains in greater detail all the points addressed so far (and served as inspiration
while writing this article), and it also describes their experience fine-tuning the
threadpool size:

While preparing for upcoming ultimate shopping mania sprees, we where running load
tests across our whole system composed of bunch of different services. One team came back
with report that they had serious issues with latency of dns lookups and error rates with
coredns service in our kuberntes cluster even with many replicas and of course they already
had UV_THREADPOOL_SIZE magic number fine tuned. Their solution to the problem was
to include https://www.npmjs.com/package/lookup-dns-cache package in order to cache
dns lookups. In their load tests it showed amazing results by improving performance 2x.

Finally, it offers a great solution to these problems: enabling a node-level DNS
caching service on their Kubernetes cluster.

More on this here: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

So far, so good. It all made sense. And I happily explained the situation to the Ops
team, expecting it to be an easy fix. But they were worried that this change could cause
side effects and even affect other services running on the same cluster. So, it was a
no-go.

Back on my search, I stumbled upon another helpful article, not entirely related to DNS
issues, but whose insights could be applied to solve this kind of problem and also provide
other benefits. It described the upsides of reusing HTTP connections with the HTTP
Keep-Alive functionality.

Here’s the text: https://lob.com/blog/use-http-keep-alive

As it goes:

One of the best ways to minimize HTTP overhead is to reuse connections with HTTP Keep-
Alive. This feature is commonly enabled by default for many HTTP clients. These clients will
maintain a pool of connections — each connection initializes once and handles multiple
requests until the connection is closed. Reusing a connection avoids the overhead of making
a DNS lookup, establishing a connection, and performing an SSL handshake. However, not
all HTTP clients, including the default client of Node.js, enable HTTP Keep-Alive.

One of Lob’s backend services is heavily dependent on internal and external APIs to verify
addresses, dispatch webhooks, start AWS Lambda executions, and more. This Node.js server
has a handful of endpoints that make several outgoing HTTP requests per incoming request.
Enabling connection reuse for these outgoing requests led to a 50% increase in maximum
inbound request throughput, significantly reduced CPU usage, and lowered response
latencies. It also eliminated sporadic DNS lookup errors.

The key benefit here was the fact that, by reusing HTTP connections, the number of
calls made to the DNS service decreased (a lot), eliminating the threadpool contention
that caused all my problems. And, as an added bonus, it also improves performance by
avoiding the costs of establishing a new HTTP connection, like the SSL handshake and
the TCP slow-start (more on this here: https://hpbn.co/building-blocks-of-tcp/#slow-start).

Following the article’s recommendation, we changed the API calls to use the
agentkeepalive lib (https://github.com/node-modules/agentkeepalive). The results
were amazing. All the DNS issues and timeouts were gone.
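
For reference, wiring it up looks roughly like this (a minimal sketch using
agentkeepalive 4.x option names; the values are illustrative, not the ones we used):

```js
const https = require('https');
const { HttpsAgent } = require('agentkeepalive');

// One shared keep-alive agent for all outgoing HTTPS calls: sockets are
// kept open and reused instead of being re-established per request.
const keepAliveAgent = new HttpsAgent({
  maxSockets: 100,          // illustrative limits; tune for your workload
  maxFreeSockets: 10,
  freeSocketTimeout: 15000, // close idle sockets after 15s (keep this
                            // shorter than the server's idle timeout)
});

https.get('https://example.com', { agent: keepAliveAgent }, (res) => {
  res.resume();
  console.log('status:', res.statusCode);
});
```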

We were very happy with the results, but then we started to see some side effects (as
they say: “no good deed goes unpunished”). Nothing as bad as before, though. Actually, it
was a small price to pay for the improvements we’d made so far and, as it turned out,
a change for the better. As the article about HTTP Keep-Alive mentions
(https://lob.com/blog/use-http-keep-alive):

In some cases, reusing connections can lead to hard-to-debug issues. Problems can arise
when a client assumes that a connection is alive and well, only to discover that, upon
sending a request, the server has terminated the connection. In Node, this problem surfaces
as an Error: socket hang up.

To mitigate this, check the idle socket timeouts of both the client and the server. This value
represents how long a connection will be kept alive when no data is sent or received. Make
sure that the idle socket timeout of the client is shorter than that of the server. This should
ensure that the client closes a connection before the server, preventing the client from
sending a request down an unknowingly dead connection.

So, all that was left was to check for a possibly closed connection before sending
a new request down a reused socket. But, while trying to implement this change, we
noticed that our HTTP lib at the time didn’t offer any easy way to handle
specific errors that could occur during the HTTP request. We were using the widely
known request/request-promise (https://www.npmjs.com/package/request), which
is now deprecated. And that’s why we decided to make a change.

The alternative we chose is the also very popular, actively maintained and feature-rich
lib called got (https://github.com/sindresorhus/got). Just to name a few, here’s a list
of the features that appealed to us:

Retries on failure (https://github.com/sindresorhus/got#retry)

HTTP Keep-Alive (https://github.com/sindresorhus/got#agent)

Timeout handling (https://github.com/sindresorhus/got#timeout)

Caching (https://github.com/sindresorhus/got#cache-1)

DNS caching (https://github.com/sindresorhus/got#dnscache)

Hooks (https://github.com/sindresorhus/got#hooks)

The changes we’ve made, besides swapping the HTTP lib itself, involved defining and applying a
set of default options to all HTTP calls. Basically, we implemented a custom retry
handler function (provided through the calculateDelay option, as explained here:
https://github.com/sindresorhus/got#retry), which automatically retries any HTTP
request in case of common connection errors (ECONNRESET, EADDRINUSE,
ECONNREFUSED, EPIPE, ENOTFOUND, ENETUNREACH, EAI_AGAIN) or certain HTTP
status codes (429, 500, 502, 503, 504, 521, 522 and 524).
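
This is not our exact implementation, but a sketch of what those defaults could look
like, assuming got v11-style options; names and values are illustrative:

```js
const got = require('got');

// Connection errors and HTTP status codes we consider retryable
// (the same lists as above).
const RETRY_ERROR_CODES = new Set([
  'ECONNRESET', 'EADDRINUSE', 'ECONNREFUSED', 'EPIPE',
  'ENOTFOUND', 'ENETUNREACH', 'EAI_AGAIN',
]);
const RETRY_STATUS_CODES = new Set([429, 500, 502, 503, 504, 521, 522, 524]);

// A got instance whose defaults apply to every call made through it.
const client = got.extend({
  retry: {
    // Returning 0 means "don't retry"; any other number is the delay in ms.
    calculateDelay: ({ attemptCount, error }) => {
      const retryable =
        RETRY_ERROR_CODES.has(error.code) ||
        (error.response && RETRY_STATUS_CODES.has(error.response.statusCode));

      if (!retryable || attemptCount > 3) return 0;
      return 1000 * attemptCount; // simple linear backoff: 1s, 2s, 3s
    },
  },
});

module.exports = client;
```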


We also took the opportunity to put in place a set of more reasonable timeout and delay
values. For example, with the old request/request-promise, we could only
define a single timeout value, which was applied to the HTTP request as a whole.
Using the new got lib and a custom retry handler function, we make use of the
error metadata provided by the lib and check whether the timeout happened during the
‘lookup’, ‘connect’, ‘secureConnect’ or ‘socket’ phase and, if so, apply a different retry
policy (number of retries and delay until the next retry) than for a timeout that
occurs during the ‘response’ phase.
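
Roughly, that phase check could look like this (again a sketch, assuming got v11,
where a TimeoutError carries the failing phase in error.event):

```js
const got = require('got');

// Phases where a timeout means the connection was never established.
const CONNECTION_PHASES = new Set(['lookup', 'connect', 'secureConnect', 'socket']);

// Drop-in for the calculateDelay handler shown above: connection-phase
// timeouts get more, faster retries than response-phase timeouts.
function retryDelay({ attemptCount, error }) {
  if (error instanceof got.TimeoutError && CONNECTION_PHASES.has(error.event)) {
    return attemptCount > 3 ? 0 : 200 * attemptCount; // quick retries
  }
  if (error instanceof got.TimeoutError && error.event === 'response') {
    return attemptCount > 1 ? 0 : 2000; // a single, slower retry
  }
  return 0; // non-timeout errors would be handled as in the previous sketch
}
```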

It’s also possible to define different timeout values for each specific phase of the HTTP
request, as described here: https://github.com/sindresorhus/got#timeout.
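
For example (the values are illustrative, not our production settings):

```js
const got = require('got');

const client = got.extend({
  timeout: {
    lookup: 500,         // DNS resolution
    connect: 1000,       // TCP connection
    secureConnect: 1000, // TLS handshake
    socket: 5000,        // inactivity on the socket
    response: 10000,     // waiting for the first byte of the response
    request: 30000,      // hard cap on the whole request/response cycle
  },
});
```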

The final result is better than what we expected. We started out looking at what
seemed to be a misbehaving service with performance issues, based on the first
timeout errors we saw, and ended up with a more resilient and fault-tolerant system
(thanks to the error handling), the added bonus of a performance increase (from reusing
HTTP connections) and a better user experience (thanks to the revised, more
reasonable timeout values).

Conclusion

It was a journey for us and a first-hand lesson on how a seemingly well-known and
well-debated problem can present itself in a different light, and on how the first
solution you find will not always be the best one. More importantly, the mindset you
have when approaching a problem, even if it isn’t a new one, is what will make the
difference. A curious mind should never settle for an easy answer.
