Platform
Contents
1 What's in this document
1.1 What's not in this document
2 Use of Anycast in a Velocix CDN
3 Anycast in an ISP network
4 Route Withdrawal
4.1 Link-state
4.2 HTTP Probe
5 Anycast vs. HA in DNS-based Redirection
5.1 Anycast in DNS-based Redirection
5.2 HA in DNS-based Redirection
5.3 Comparison
6 Anycast vs. HA in HTTP-based Redirection
6.1 Anycast in HTTP-based Redirection
6.2 HA in HTTP-based Redirection
6.3 Comparison
7 Anycast vs. HA in Console (Web Portal)
7.1 Anycast in Console (Web Portal)
7.2 HA in Console (Web Portal)
7.3 Multiple HA in Console (Web Portal)
7.4 Externally Load-Balanced Console (Web Portal)
7.5 Comparison
8 Conclusion
You do not have to use Anycast in your Velocix CDN, but the pros and cons should be
carefully considered. There are three places where Anycast can help your CDN's
availability:
• DNS-based Redirection
• HTTP-based Redirection
• The Asset Control Portal (Console)
You can use Anycast for all, none, or some of the above.
Some Definitions:
• A consumer is a human. They wish to consume some content off the CDN
of the ISP
4 Route Withdrawal
It is important to ensure that if a Service Node fails, the router upstream of it withdraws its
advertisement promptly - and re-advertises it on recovery. For this to work, the router must
have a mechanism whereby it can know if the Service Node is properly operational.
For a Service Node to count as "failed", it must have lost all of the Routing Applications
in that node (usually there are two). This would usually be due to a large-scale failure such
as loss of power to the Service Node, or a network failure isolating the Service Node (which
would presumably isolate the upstream router, automatically withdrawing the Anycast route).
The Velocix CDN supports two mechanisms to detect Service Node failure. It is possible to
change mechanisms on a live system, if you are careful.
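
The failure rule above can be sketched in a few lines of Python. This is only an illustration, not the Velocix implementation: the names ServiceNode, node_failed and route_action are hypothetical, and the sketch assumes some monitoring mechanism already reports whether each Routing Application is up.

```python
from dataclasses import dataclass


@dataclass
class ServiceNode:
    """Hypothetical view of a Service Node as seen by a monitoring mechanism."""
    name: str
    routing_apps_up: list[bool]  # one flag per Routing Application (usually two)


def node_failed(node: ServiceNode) -> bool:
    """A node counts as failed only when every Routing Application is down."""
    return not any(node.routing_apps_up)


def route_action(node: ServiceNode, currently_advertised: bool) -> str:
    """Decide what the upstream router should do with the Anycast route."""
    failed = node_failed(node)
    if failed and currently_advertised:
        return "withdraw"      # prompt withdrawal on failure
    if not failed and not currently_advertised:
        return "advertise"     # re-advertise on recovery
    return "no-change"
```

Losing one of two Routing Applications leaves the route in place; only the loss of both leads to a withdrawal.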
4.1 Link-state
This method is very simple, but may not work with all routers. Most routers will, if link is
lost on a port, withdraw any routes that go via that port (assuming they don't have another
path to it). Should power to a Service Node fail, or the Ethernet cables be accidentally
removed, etc., then there would be no link on the cable coming out of the Service Node
chassis. If that cable were directly connected to the router, the router would then see the
loss of link.
This method is not appropriate if the link remains up at the router (maybe because of
switching infrastructure between the Service Node and its router) or if the router does not
automatically withdraw the route on link failure.
Link-state detection is usually the least-preferred method.
• In example.net:
    ns0.cdn IN A 203.0.113.1
    cdn IN NS ns0.cdn
Here, when a consumer wishes to browse to www.example.com (or otherwise use it):
• The client sends a DNS request for www.example.com to its upstream DNS Server (1)
• The consumer's upstream DNS Server (the recursive server) checks with the root
servers to find out which DNS Server(s) are responsible for example.com (2a and 2b)
• The consumer's upstream DNS Server asks the example.com DNS Server(s) for
www.example.com (3)
• The example.com DNS Server(s) respond, saying to look at wp-1234.id.cdn.example.net
instead (4)
• The consumer's upstream DNS Server checks with the root servers to find out which
DNS Server(s) are responsible for example.net (5a and 5b)
• The consumer's upstream DNS Server asks the example.net DNS Server(s) for
wp-1234.id.cdn.example.net (6)
• The example.net DNS Server(s) say to check with ns0.cdn.example.net (203.0.113.1)
(7)
• The consumer's upstream DNS Server asks 203.0.113.1 for
wp-1234.id.cdn.example.net (8)
• The Velocix CDN responds with the IP addresses of some appropriate Delivery
Applications, and states that the information can be used for sixty seconds (9)
• The consumer's upstream DNS Server tells the client to use the provided Delivery
Application addresses (10)
• The client connects to one of the Delivery Applications, and requests the content (11)
• The Anycast address route to the failed Service Node has been withdrawn, and the
request goes to the remaining Service Node.
The DNS messages involved are very small, so they will not cause a transition from UDP
DNS to TCP DNS. Each query and its response is therefore a single, connectionless
datagram exchange, so there is no problem with the Anycast route changing from one
active Service Node to another during active requests.
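
The resolution walk described above can be modelled with a small in-memory Python sketch. It is only an illustration under stated assumptions: the reply kinds ("CNAME", "REFER", "A"), the function names, and the Delivery Application addresses 192.0.2.10 and 192.0.2.11 are hypothetical, the "look at ... instead" reply is assumed to behave like a CNAME, and the root-server lookups (steps 2 and 5) are reduced to a suffix match.

```python
# Each "server" maps a name to a reply tuple:
#   ("CNAME", target), ("REFER", nameserver_ip) or ("A", addresses, ttl_seconds)
EXAMPLE_COM = {"www.example.com": ("CNAME", "wp-1234.id.cdn.example.net")}
EXAMPLE_NET = {"cdn.example.net": ("REFER", "203.0.113.1")}  # cdn IN NS ns0.cdn
CDN_ANYCAST = {
    # Hypothetical Delivery Application addresses, valid for sixty seconds.
    "wp-1234.id.cdn.example.net": ("A", ["192.0.2.10", "192.0.2.11"], 60),
}
SERVERS = {
    "example.com": EXAMPLE_COM,
    "example.net": EXAMPLE_NET,
    "203.0.113.1": CDN_ANYCAST,  # the Anycast address
}


def authoritative_zone(name):
    """Stand-in for asking the root servers which zone is responsible."""
    for zone in ("example.com", "example.net"):
        if name == zone or name.endswith("." + zone):
            return zone
    raise LookupError(name)


def lookup(server, name):
    """Return the reply for name, or for its closest enclosing entry."""
    labels = name.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in server:
            return server[candidate]
    raise LookupError(name)


def resolve(name):
    """Follow redirections and referrals until an address reply is found."""
    trace = []
    server_key = authoritative_zone(name)
    while True:
        reply = lookup(SERVERS[server_key], name)
        trace.append((server_key, name, reply[0]))
        if reply[0] == "CNAME":        # step 4: look at another name instead
            name = reply[1]
            server_key = authoritative_zone(name)
        elif reply[0] == "REFER":      # step 7: check with the CDN nameserver
            server_key = reply[1]
        else:                          # step 9: addresses plus lifetime
            return reply[1], reply[2], trace
```

Calling resolve("www.example.com") walks example.com, then example.net, then the Anycast address 203.0.113.1. Whichever Service Node currently receives traffic for that address answers the final query.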
• In example.net:
    ns1.cdn IN A 192.0.2.4
    ns2.cdn IN A 198.51.100.4
    cdn IN NS ns1.cdn
    cdn IN NS ns2.cdn
Here, when a consumer wishes to browse to www.example.com (or otherwise use it):
• The client sends a DNS request for www.example.com to its upstream DNS Server (1)
• The consumer's upstream DNS Server (the recursive server) checks with the root
servers to find out which DNS Server(s) are responsible for example.com (2a and 2b)
• The consumer's upstream DNS Server asks the example.com DNS Server(s) for
www.example.com (3)
• The example.com DNS Server(s) respond, saying to look at wp-1234.id.cdn.example.net
instead (4)
• The consumer's upstream DNS Server checks with the root servers to find out which
DNS Server(s) are responsible for example.net (5a and 5b)
• The consumer's upstream DNS Server asks the example.net DNS Server(s) for
wp-1234.id.cdn.example.net (6)
• The example.net DNS Server(s) say to check with ns1.cdn.example.net (192.0.2.4) or
ns2.cdn.example.net (198.51.100.4) (7)
• The consumer's upstream DNS Server asks either 192.0.2.4 or 198.51.100.4 for
wp-1234.id.cdn.example.net (8)
• The Velocix CDN responds with the IP addresses of some appropriate Delivery
Applications, and states that the information can be used for sixty seconds (9)
• The consumer's upstream DNS Server tells the client to use the provided Delivery
Application addresses (10)
• The client connects to one of the Delivery Applications, and requests the content (11)
5.3 Comparison
A Service Node failure severe enough to impact delivery is a highly unlikely event: a
single blade failure inside a Service Node is not enough; there must be a complete
networking failure, loss of power, or multiple failures.
In the Anycast case, the worst-case scenario is that a persistent, erroneous advertisement
of the Anycast address isolates clients who get routed to that node, so that they become
unable to use the CDN at all. This failure mode can be mitigated by making the
advertisement of the route dependent on one of the monitoring techniques discussed in the
section on Route Withdrawal. When one of those methods is used effectively, this
failure mode is exceedingly unlikely.
In the Anycast case, the second-worst-case scenario is the failure of a Service Node, which
then takes a long convergence time to effect the routing change - long enough for the
recursive DNS Server to give up. The route change convergence time is something only the
ISP can quantify - but ten seconds would normally be considered a very long time. This
could result in a period of no service - possibly resulting in the consumer having to perform a
"reload" operation (when using their client).
In the HA case, the worst-case scenario is that there are periodic ~2s delays while one
node is failed. Not every request will be delayed, because 50% of the requests the
recursive DNS server makes will go to the up node, and because it will cache
responses. In an extended or scheduled outage, the ISP can change the NS records in
example.net to remove the failed node.
The client may disguise the ~2s delays from the consumer. For example, a video streaming
application may have a video buffer of more than two seconds.
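
A back-of-the-envelope Python sketch of the HA figures above follows. It assumes, as the text does, that the recursive DNS server picks one of the listed nameservers with equal probability, and that a query to a failed nameserver costs one timeout before a live one answers. The function names and the client_queries_per_ttl parameter are hypothetical.

```python
def first_try_down_probability(total_ns: int, down_ns: int) -> float:
    """Chance that the recursive server's first pick is a failed nameserver."""
    return down_ns / total_ns


def expected_extra_delay(total_ns: int, down_ns: int, timeout_s: float) -> float:
    """Average extra delay, in seconds, on one uncached lookup."""
    return first_try_down_probability(total_ns, down_ns) * timeout_s


def delayed_client_fraction(total_ns: int, down_ns: int,
                            client_queries_per_ttl: int) -> float:
    """Fraction of client queries that see the delay.

    Only the first client query in each cache lifetime reaches the CDN
    nameservers; the rest are answered from the recursive server's cache.
    """
    return first_try_down_probability(total_ns, down_ns) / client_queries_per_ttl
```

With two nameservers, one failed, and a 2 s timeout, expected_extra_delay(2, 1, 2.0) gives 1.0 second per uncached lookup. If ten client queries arrive per sixty-second cache lifetime, about one query in twenty sees the delay.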
We normally recommend the use of HA for the DNS-based redirection, rather than
Anycast, due to the better reliability during failure.
• The Anycast address route to the failed Service Node has been withdrawn, and the
request goes to the remaining Service Node.
If the Anycast route switches from one active Service Node to another active Service Node
in the middle of an HTTP request, but before the "GET" has arrived at the Service Node,
the connection may break and need to be retried, or may time out. Once the "GET" (the third
packet the client will send) gets to the Service Node, it does not matter if the route changes,
as the reply is so short that it will not require any ACKs to be received before it is finished.
So, if the SYN goes to one Service Node and the ACK to a different one, or the ACK to one
and the "GET" to another, then the request may fail or be delayed. Most clients will disguise
this behaviour by retrying. The following four diagrams show the possible combinations.
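
As a rough illustration of the rule above, the following Python sketch enumerates where the ACK and the "GET" can land once the SYN has reached a given Service Node. The node labels "A" and "B" and the function names are hypothetical, and the sketch deliberately ignores retries.

```python
from itertools import product


def request_at_risk(syn_node: str, ack_node: str, get_node: str) -> bool:
    """True if the route changed between SYN and ACK, or between ACK and GET."""
    return syn_node != ack_node or ack_node != get_node


def combinations_after_syn(syn_node: str = "A", nodes=("A", "B")) -> dict:
    """Map each (SYN, ACK, GET) placement to whether the request is at risk."""
    return {
        (syn_node, ack_node, get_node): request_at_risk(syn_node, ack_node, get_node)
        for ack_node, get_node in product(nodes, repeat=2)
    }
```

With two Service Nodes and the SYN fixed at one of them, this yields four placements. Only the one where all three packets reach the same node is free of risk.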
In the following scenarios, I have ignored the possibility that the failed Service Node is
recovered and put back in service during the lifetime of this process.
If a Service Node should fail, and the client is badly-behaved, there are three possibilities:
• The client happens to connect to the node that is up, and succeeds. Or,
• The client happens to connect to the node that is down. The request times out. The
client gives up. This results in a delay, then an error to the consumer. Or,
• The client happens to connect to the node that is down. The request times out. The
client tries the same Service Node again, repeatedly. This results in a long delay, then
an error to the consumer.
In the unlikely event that a Service Node fails, and the client is well-behaved, there are three
possibilities:
• The client happens to connect to the node that is up, and succeeds. Or,
• The client happens to connect to the node that is down. The request times out. The client
tries the other address. This results in a delay. Or,
• The client remembers that one of the addresses was unavailable, and so prefers the
other address.
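
The well-behaved behaviour can be sketched in Python as follows. This is a minimal illustration, not a description of any particular client: FallbackClient and the injected connect callable are hypothetical, and connect(address) is assumed to return True on success and False after a timeout.

```python
from typing import Callable, Iterable, Optional


class FallbackClient:
    """Tries each address in turn and remembers which ones failed."""

    def __init__(self, addresses: Iterable[str], connect: Callable[[str], bool]):
        self.addresses = list(addresses)
        self.connect = connect            # hypothetical: True on success
        self.known_bad: set[str] = set()

    def request(self) -> Optional[str]:
        """Return the address that served the request, or None if all failed."""
        # Prefer addresses not remembered as failed; keep the failed ones as a
        # last resort so that a recovered node is eventually used again.
        ordered = sorted(self.addresses, key=lambda a: a in self.known_bad)
        for address in ordered:
            if self.connect(address):
                self.known_bad.discard(address)
                return address
            self.known_bad.add(address)   # remember the timeout
        return None
```

The first request after a failure may pay one timeout. Later requests go straight to the address that worked. A badly-behaved client corresponds to dropping the loop and always retrying the first address.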
6.3 Comparison
In the Anycast case, the worst-case scenario is again the persistent, erroneous
advertisement to a failed Service Node - as in the DNS-based Redirection case. It is still
highly unlikely.
In the Anycast case, the second-worst-case scenario is the failure of a Service Node, which
then takes a long convergence time to effect the routing change - long enough for the client
to give up. The route change convergence time is something only the ISP can quantify - but
ten seconds would normally be considered a very long time. This could result in a period of
no service - possibly resulting in the consumer having to perform a "reload" operation
(whatever that may mean, when using their client).
In the HA case, the worst-case scenario is that a badly-behaved client will consistently try
the same failed Service Node, over and over again, including on a "reload" operation.
In the HA case, the worst-case scenario with a well-behaved client is that there are periodic
~15s delays while one node is failed. Not every request will be delayed, because some of
the requests the client makes will go to the up node, and because it will re-use a
successful address many times. In an extended or scheduled outage, the ISP can change
the A records in cdn.example.net to remove the failed node.
We normally recommend the use of Anycast for the HTTP-based redirection, rather than
HA, unless the client base is a closed set of clients, which are known to be well-behaved.
In this scenario, a user would specify which Service Node they wished to use. If it failed,
they would try the other.
This is only really practical if there are no external content-providing customers, and the ISP
would otherwise not have to deploy Anycast at all. That is, that only the ISP and Velocix
themselves would be using the Console, as the experience (manually selecting a Service
Node) is rather ugly.
7.5 Comparison
Here, Anycast is usually the simplest solution that works well. HA does not work, Multiple
HA is a poor solution (as it requires the user to switch between Service Nodes), and an
External Load Balancer requires an external system.
We would normally recommend Anycast for the Console (Web Portal).
8 Conclusion
The ISP can choose between using Anycast or HA, for three different use-cases
independently. Additionally, the Web Portal has two extra options - Multi-HA, and a Load
Balancer. The use-cases are:
• DNS-based Redirection, where we generally recommend HA,
• HTTP-based Redirection, where we generally recommend Anycast, and
• The Web Portal (Console), where we generally recommend Anycast.