
SEATTLE
– A Scalable Ethernet Architecture for Large Enterprises

M.Sc. Pekka Hippeläinen
IBM
phippela@gmail

1.10.2012 T-110.6120 – Special Course in Future Internet Technologies


SEATTLE

 Based on, with pictures borrowed from:
Kim, Changhoon; Caesar, Matthew; Rexford, Jennifer.
Floodless in SEATTLE: A Scalable Ethernet
Architecture for Large Enterprises

 Is it possible to build a protocol that maintains the
same configuration-free properties as Ethernet
bridging, yet scales to large networks?
Contents

 Motivation: network management challenge


 Ethernet features: ARP and DHCP broadcasts
 1) Ethernet Bridging
 2) Scaling with Hybrid networks
 3) Scaling with VLANs
 Distributed Hashing
 SEATTLE approach
 Results
 Conclusions

Network management challenge
 IP networks require massive effort to configure and manage
 As much as 70% of an enterprise network's cost goes to maintenance and configuration
 Ethernet is much simpler to manage
 However, Ethernet does not scale well beyond small LANs
 The SEATTLE architecture aims to provide the scalability of IP with the simplicity of Ethernet management

Why is Ethernet so wonderful?
 Easy to set up, easy to manage
 A DHCP server, some hubs, plug'n play

Flooding query 1: DHCP requests
 Let's say node A joins the Ethernet
 To obtain or confirm an IP address, node A sends a DHCP request as a broadcast
 The request floods through the broadcast domain

Flooding query 2: ARP
 For node A to communicate with node B in the same broadcast domain, the sender needs the MAC address of node B
 Let's assume that node B's IP address is known
 Node A sends an Address Resolution Protocol (ARP) broadcast to find out the MAC address of node B
 As with the DHCP broadcast, the request is flooded through the whole broadcast domain
 This is essentially an {IP -> MAC} lookup

Why is flooding bad?
 Large Ethernet deployments contain a vast number of hosts and thousands of bridges
 Ethernet was not designed for such a scale
 Virtualization and mobile deployments can cause many dynamic events, which generate control traffic
 Broadcast messages need to be processed by the end hosts, interrupting the CPU
 The bridges' forwarding tables grow roughly linearly with the number of hosts

1) Ethernet bridging
 An Ethernet consists of segments, each comprising a single physical layer
 Ethernet bridges interconnect the segments into a multi-hop network, i.e., a LAN
 This forms a single broadcast domain
 A bridge learns how to reach a host by inspecting incoming frames and associating the source MAC address with the incoming port
 The bridge stores this information in a forwarding table and uses the table to forward frames in the correct direction
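
A minimal sketch of this learning logic in Python (class and variable names are illustrative, not from the paper):

# Minimal sketch of an Ethernet learning bridge.
class LearningBridge:
    def __init__(self, ports):
        self.ports = ports             # port identifiers of this bridge
        self.table = {}                # forwarding table: MAC -> port

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.table[src_mac] = in_port  # learn: src_mac is reachable via in_port
        if dst_mac in self.table:
            return [self.table[dst_mac]]                 # known: forward directly
        return [p for p in self.ports if p != in_port]   # unknown: flood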
Bridge spanning tree
 One bridge is configured to be the root bridge
 The other bridges collectively compute a spanning tree based on their distance to the root
 Thus traffic is not forwarded along the shortest path but along the spanning tree
 This approach avoids forwarding loops and broadcast storms

2) Hybrid IP/Ethernet
 In this approach, multiple LANs are interconnected with IP routing
 In hybrid networks each LAN contains at most a few hundred hosts that form an IP subnet
 Each IP subnet is associated with an IP prefix
 Assigning IP prefixes to subnets and associating subnets with router interfaces is a manual process
 Unlike a MAC address, which is a host identifier, an IP address denotes the host's current location in the network

Drawbacks of the hybrid approach
 The biggest drawback is configuration overhead
 Router interfaces must be configured
 Hosts must have an IP address corresponding to the subnet they are located in (DHCP can be used)
 Networking policies are usually defined per network prefix, i.e., tied to the topology
 When the network changes, the policies must be updated
 Limited mobility support
 Mobile users & virtualized hosts in datacenters
 If the IP address is to stay constant, the host must stay in the same subnet
3) Virtual LANs
 Overcome some problems of Ethernet and IP networks
 Administrators can logically group hosts into the same broadcast domain
 VLANs can be configured to overlap, by configuring the bridges rather than the hosts
 Broadcast overhead is reduced by the isolated domains
 Mobility is simplified: the IP address can be retained while moving between bridges
Virtual LANs
 Traffic from B1 to B2 can be 'trunked' over multiple bridges
 Inter-domain traffic needs to be routed

Drawbacks of VLANs
 Trunk configuration overhead
 Extending a VLAN across multiple bridges requires the VLAN to be configured at each participating bridge, often manually
 Limited control-plane scalability
 Bridges keep forwarding-table entries and handle broadcast traffic for every active host in every VLAN visible to them
 Insufficient data-plane efficiency
 A single spanning tree is still used within each VLAN
 Inter-VLAN traffic must be routed via IP gateways

Distributed Hash Tables
 Hash tables store {key -> value} pairs
 With multiple nodes, consistent hashing gives a nice way to
 Keep nodes symmetric
 Distribute the hash-table entries evenly among nodes
 Keep reshuffling of entries small when nodes are added or removed
 The idea is to calculate H(key), which is mapped to a host; one can visualize this as mapping to an angle (or to a point on a circle)
Distributed Hash Tables
 Each node is mapped to several randomly distributed points on the circle
 Thus each node owns multiple buckets
 One calculates H(key) and stores the entry at the node owning that bucket
 If a node is removed, its values are reassigned to the next buckets
 If a node is added, only the entries falling into its new buckets are moved
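
A compact Python sketch of such a ring, assuming SHA-256 as the hash function and a fixed number of points ("virtual nodes") per node; all names are illustrative:

import bisect, hashlib

def hash_point(key):
    # Map any string key to a point on the circle [0, 2**32).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % 2**32

class ConsistentHashRing:
    def __init__(self, points_per_node=8):
        self.points_per_node = points_per_node
        self.points = []    # sorted bucket boundaries on the circle
        self.owner = {}     # point -> node owning the bucket ending there

    def add_node(self, node):
        for i in range(self.points_per_node):    # one node -> many buckets
            p = hash_point(f"{node}#{i}")
            bisect.insort(self.points, p)
            self.owner[p] = node

    def remove_node(self, node):
        # The removed node's buckets merge into the next ones clockwise.
        self.points = [p for p in self.points if self.owner[p] != node]
        self.owner = {p: n for p, n in self.owner.items() if n != node}

    def lookup(self, key):
        # The entry for key is stored at the node owning the next point.
        p = hash_point(key)
        i = bisect.bisect(self.points, p) % len(self.points)
        return self.owner[self.points[i]]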
SEATTLE approach 1/2
 1) Switches calculate shortest paths among themselves
 This is a link-state protocol, essentially Dijkstra's algorithm
 A switch-level discovery protocol; Ethernet hosts do not respond to it
 The switch topology is much more stable than the host topology
 It is also much more scalable than routing at the host level
 Each switch has an ID: the MAC address of one of the switch's interfaces
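
As a sketch, the per-switch computation is plain Dijkstra over the switch graph (the example topology and names below are hypothetical; the real protocol first floods link-state advertisements to build the graph):

import heapq

# Sketch: single-source shortest paths over a link-state switch graph.
# graph maps switch ID -> {neighbor switch ID: link cost}.
def dijkstra(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # skip stale heap entries
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist                           # switch ID -> distance from source

# Hypothetical three-switch topology:
# dijkstra({"s1": {"s2": 1}, "s2": {"s1": 1, "s3": 2}, "s3": {"s2": 2}}, "s1")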

SEATTLE approach 2/2
 2) A DHT is run across the switches
 {IP -> MAC} mapping
 This essentially answers ARP requests while avoiding flooding
 {MAC -> location} mapping
 Once the host's switch is located, routing along the shortest path can be used
 The DHCP service location can also be stored
 SEATTLE thus reduces flooding, allows shortest-path forwarding, and offers a nice way to locate the DHCP service
SEATTLE
 Control overhead is reduced with consistent hashing
 When the set of switches changes due to network failure or recovery, only some entries must be moved
 Load is balanced with virtual switches
 A more powerful switch can represent itself as several virtual switches, attracting more load
 Flexible service discovery is enabled
 This is mainly for DHCP, but could be something like {"PRINTER" -> location}

Topology changes
 Adding and removing switches/links can alter the topology
 Switch/link failures and recoveries can also lead to partitioning events (rarer)
 Non-partitioning link failures are easy to handle: the resolver for a hash entry does not change

Switch failures
 If a switch fails or recovers, hash entries need to be moved
 The switch that published a value monitors the liveness of its resolver, republishing the entry when needed
 The entries also have a TTL

Partitioning events
 Each switch must also keep track of the location entries it stores locally
 If a switch s_old is removed or becomes unreachable, all switches need to remove the location entries pointing to s_old
 This approach correctly handles partitioning events

Scaling: location
 A directory service is used to publish and maintain {MAC -> location} mappings
 When host a with MAC address mac_a arrives, it attaches to switch S_a (steps 1-3 in the paper's figure)
 Switch S_a publishes {mac_a -> location} by calculating the correct bucket F(mac_a), i.e., the resolver switch
 When node b wants to send a message to node a
 F(mac_a) is calculated to fetch the location
 'Reactive resolution': even cache misses do not lead to flooding
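
Assuming F(.) is modeled by the ring lookup sketched earlier, the publish/resolve flow could look like this ('directory' and all other names are illustrative):

# Illustrative sketch: each resolver switch holds a slice of the directory.
# directory maps resolver switch -> {key: value}.
def publish_location(ring, directory, mac, access_switch):
    resolver = ring.lookup(mac)                   # F(mac_a): resolver switch
    directory.setdefault(resolver, {})[mac] = access_switch

def resolve_location(ring, directory, mac):
    resolver = ring.lookup(mac)                   # same hash -> same resolver
    return directory.get(resolver, {}).get(mac)   # unicast lookup, no flooding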
Scaling: ARP
 When node b makes an ARP request, SEATTLE converts it into a unicast lookup at resolver F(IP_a), which returns mac_a
 The resolver switch for F(IP_a) is usually different from the one for F(mac_a)
 Optimization for hosts making ARP requests
 The F(IP_a) address resolver can also store mac_a and S_a together
 When node b makes the F(IP_a) ARP request, the {mac_a -> S_a} mapping is also cached at S_b
 The shortest path can now be used directly (path 10 in the paper's figure)
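
A sketch of this optimization, reusing the hypothetical ring and directory from above: the F(IP_a) resolver returns both the MAC and the location, so a single lookup also fills the cache at node b's switch:

# Sketch: the F(IP_a) resolver stores (mac_a, S_a) together.
def publish_arp(ring, directory, ip, mac, access_switch):
    resolver = ring.lookup(ip)                    # F(IP_a)
    directory.setdefault(resolver, {})[ip] = (mac, access_switch)

def resolve_arp(ring, directory, ip, local_cache):
    mac, location = directory[ring.lookup(ip)][ip]
    local_cache[mac] = location                   # cache {mac_a -> S_a} at S_b
    return mac                                    # shortest path usable at once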
Handling host dynamics
 Location change
 Wireless handoff
 VM moved while retaining its MAC
 Host MAC address changes
 NIC replaced
 Failover event
 VM migration forcing a MAC change
 Host changes IP address
 DHCP lease expires
 Manual reconfiguration
Insert, delete and update
 Location change
 Host h moves from s_old to s_new
 s_new updates the existing MAC-to-location entry
 MAC change
 IP-to-MAC update
 MAC-to-location deletion (old) and insertion (new)
 IP change
 S_h deletes the old IP-to-MAC entry and inserts the new one

Ethernet: Bootstrapping hosts
 Hosts are discovered by access switches
 SEATTLE switches snoop ARP requests
 Most OSes generate an ARP request at boot-up or when an interface comes up
 DHCP messages, or detecting that a host went down, can also be used
 Host configuration without broadcast
 The DHCP server's switch hashes the string "DHCP_SERVER" and stores the server's location at the resulting resolver switch
 Other switches hash the same "DHCP_SERVER" string to locate the service
 No need to broadcast for ARP or DHCP
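
As a usage example of the location sketch above, service discovery simply hashes a well-known string instead of a MAC address (names remain illustrative):

# The DHCP server's switch publishes under the well-known key...
publish_location(ring, directory, "DHCP_SERVER", dhcp_access_switch)
# ...and any switch can later locate the service without a broadcast.
dhcp_switch = resolve_location(ring, directory, "DHCP_SERVER")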
Scalable and flexible VLANs
 To support broadcast, the authors suggest using groups
 Similar to a VLAN, a group is defined as a set of hosts that share the same broadcast domain
 Groups are not limited to layer-2 reachability
 Multicast-based group-wide broadcasting
 A multicast tree with a broadcast root for each group
 F(group_id) is used to locate the broadcast root

Simulations
 1) Campus: ~40,000 students
 517 routers and switches
 2) AP-Large (Access Provider)
 315 routers
 3) Datacenter (DC)
 4 core routers with 21 aggregation switches
 Routers were converted to SEATTLE switches

Cache timeout: AP-large with 50k hosts
 The shortest-path cache timeout affects the number of location lookups
 Even with a 60 s timeout, 99.98% of packets were forwarded without a lookup
 Control overhead (the blue curve in the paper's plot) decreases very fast, whereas the table size increases only moderately
 The shortest path is used for the majority of traffic in these simulations

Table size increase in DC
 Ethernet bridges store an entry for each destination, giving ~O(sh) state across the network (s switches, h hosts)
 SEATTLE requires only ~O(h) state, since only the access and resolver switches need to store location information for each host
 With this topology the table size was reduced by a factor of 22
 In the AP-large case the factor increased to 64

Control overhead in AP-large
 Measured as the number of control messages over all links in the topology, divided by the number of switches and the duration of the trace
 SEATTLE significantly reduces control overhead in the simulations
 This is mainly because Ethernet generates network-wide floods for a significant number of packets

Effect of switch failure in DC
 Switches were allowed to fail randomly
 The average recovery time was 30 seconds
 SEATTLE can use all the links in the topology, whereas Ethernet is restricted to the spanning tree
 Ethernet must recompute the tree, causing outages

Effect of host mobility in Campus
 Hosts were randomly moved between access switches
 For high mobility rates, SEATTLE's loss rate was lower than Ethernet's
 With Ethernet it takes some time for switches to evict stale location information and re-learn the new location
 SEATTLE provided low loss and low broadcast overhead

What was omitted
 The authors suggest multi-level one-hop DHTs
 In large, dynamic networks it can be beneficial to store entries close by
 This is achieved with regions and a backbone; border switches connect the regions to backbone switches
 On topology changes
 An approach to seamless mobility is described in the paper
 Updating remote host caches requires switch-based MAC revocation lists
 Some simulation results
 The authors also made a sample implementation
Conclusions
 Operators today face challenges in managing and configuring large networks. This is largely due to the complexity of administering IP networks.
 Ethernet is not a viable alternative
 Poor scaling and inefficient path selection
 SEATTLE promises scalable self-configuring routing
 Simulations suggest efficient routing and low latency with quick recovery
 Host mobility is supported with low control overhead
 Ethernet stacks at the end hosts are not modified

Thank you for your attention!
Questions? Comments?
