
Packet Walk(s) In

Kubernetes
Don Jayakody, Big Switch Networks
@techjaya
linkedin.com/in/jayakody
About
About Big Switch
• We are in the business of "abstracting networks" (as one Big Switch) using open networking hardware
About Me
• Spent 4 years in Engineering building our products; now in Technical Product Management
Why Not Abstract the Network?
• Kubernetes: the abstracted compute domain (legacy vs. next-gen)
http://www.youtube.com/watch?v=HlAXp0-M6SY&t=0m43s
Agenda
Intro: K8S Networking
• Namespaces/Pods/CNIs
• What does that "pause" container really do?
• Flannel: Intro / Packet Flows
• Exposing Services
Calico: Networking
• Architecture
• IP-IP Mode (Route formation / Pod-to-pod communication / ARP Resolution / Packet Flow)
• BGP Mode (Peering requirements / Packet Flow)
Cilium: Networking
• Architecture
• Overlay Network Mode (Configuration / Pod-to-pod communication / Datapath / ARP Resolution / Packet Flow)
• Direct Routing Mode
K8S Networking: Basics
Namespaces
• The Linux kernel has 6 types of namespaces: pid, net, mnt, uts, ipc, user
• Network namespaces provide a brand-new network stack for all the processes within the namespace
• That includes network interfaces, routing tables and iptables rules
Diagram: Node1 with a root namespace (eth0, bridge, veth0/veth1); each veth peer is the eth0 of namespace-1 and namespace-2
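A minimal sketch of this wiring with the iproute2 tools (the names demo-ns, veth-host and veth-pod are made up for illustration):

# create a network namespace with its own stack
sudo ip netns add demo-ns
# create a veth pair: one end stays in the root namespace, the other moves into demo-ns
sudo ip link add veth-host type veth peer name veth-pod
sudo ip link set veth-pod netns demo-ns
# address and bring up both ends
sudo ip addr add 10.10.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo-ns ip addr add 10.10.0.2/24 dev veth-pod
sudo ip netns exec demo-ns ip link set veth-pod up
sudo ip netns exec demo-ns ip link set lo up
# the namespace has its own interfaces, routes and iptables rules
sudo ip netns exec demo-ns ip a
sudo ip netns exec demo-ns ip route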
K8S Networking: Basics
Pods
• Lowest common denominator in K8S. A pod is comprised of one or more containers along with a "pause" container
• The pause container acts as the "parent" container for the other containers inside the pod. One of its primary responsibilities is to bring up the network namespace
• Great for redundancy: termination of the other containers does not result in termination of the network namespace
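A rough way to confirm this with a Docker runtime (the container IDs below are placeholders; find them with 'docker ps'):

# the pod's app containers are started with --net=container:<pause-id>, i.e. they join the
# pause container's network namespace; compare the namespace inodes of the two containers
PAUSE_PID=$(docker inspect --format '{{.State.Pid}}' <pause-container-id>)
APP_PID=$(docker inspect --format '{{.State.Pid}}' <app-container-id>)
readlink /proc/$PAUSE_PID/ns/net /proc/$APP_PID/ns/net   # same net:[inode] => same namespace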
K8S Networking: Basics
Accessing Pod Namespaces
• Multiple ways to access pod namespaces:
• 'kubectl exec -it'
• 'docker exec -it'
• nsenter ("namespace enter": lets you run commands that are installed on the host but not in the container)
Both containers belong to the same pod => same network namespace => same 'ip a' output
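An nsenter sketch (the container ID is a placeholder; a Docker runtime is assumed):

# find the PID of any container in the pod, then enter only its network namespace
PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)
# run tools that exist on the host but not in the (often minimal) container image
sudo nsenter -t $PID -n ip a
sudo nsenter -t $PID -n ss -tnlp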
K8S Networking: Basics
Container Networking Interface (CNI)
• Interface between the container runtime and the network implementation
• A network plugin implements the CNI spec: it takes a container from the runtime and configures (attaches/detaches) it to the network
• A CNI plugin is an executable (in /opt/cni/bin)
• When invoked, it reads a JSON config and environment variables to get all the parameters required to configure the container with the network
Diagram: container runtime -> CNI plugin -> configures networking
Credit: https://www.slideshare.net/weaveworks/introduction-to-the-container-network-interface-cni
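For example, on a Flannel node (the file name and contents vary per plugin; this is illustrative):

# the JSON network config the runtime feeds to the plugin
cat /etc/cni/net.d/10-flannel.conflist
# the plugin executables themselves
ls /opt/cni/bin
# the runtime also passes parameters via environment variables such as
# CNI_COMMAND (ADD/DEL), CNI_CONTAINERID, CNI_NETNS and CNI_IFNAME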
Topology
Topology Info
• The node IPs belong to 2 different subnets: 25.25.25.0/24 & 35.35.35.0/24
• The gateway is configured as .254 in each network segment
• The bond0 interface is created by bonding two 10G interfaces
Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122), each with a bond0 interface in its root namespace
Flannel
Flannel CNI Plugin: Intro
• To make networking easier, Kubernetes does away with port-mapping and assigns a unique IP address to each pod
• If a host cannot get an entire subnet to itself, things get pretty complicated
• Flannel aims to solve this problem by creating an overlay mesh network that provisions a subnet to each server
Diagram: flanneld (VXLAN/UDP...) runs on each node, connected over the network
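On each node, flanneld records the subnet it leased for that host (values are illustrative):

cat /run/flannel/subnet.env
# FLANNEL_NETWORK=10.244.0.0/16
# FLANNEL_SUBNET=10.244.1.1/24
# FLANNEL_MTU=1450
# FLANNEL_IPMASQ=true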
Flannel
Default Config
• "flannel.1" is the VXLAN interface
• "cni0" is the bridge
Diagram: each node's root namespace has bond0, flannel.1 and cni0
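A quick way to see both devices on a node (output trimmed and illustrative):

# flannel.1 is a VXLAN device; -d shows the tunnel details
ip -d link show flannel.1
#   vxlan id 1 local 25.25.25.121 dev bond0 dstport 8472 nolearning ...
# cni0 is a plain Linux bridge with the pod veths attached as ports
ip -d link show cni0
bridge link show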
Flannel
Pod-to-Pod Communication
• Brought up 2 pods
Diagram: pod1 (10.244.1.2) on Node-121 and pod2 (10.244.2.2) on Node-122, each attached to cni0 via a veth pair (veth-x / veth-y)
Notice the index on the veth: e.g. "eth0@if12" in pod1's namespace corresponds to the 12th interface in the root namespace
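To check the pairing (pod name and index are illustrative; assumes the pod image ships the ip tool):

# inside pod1, eth0 shows its peer's ifindex after the '@'
kubectl exec -it pod1 -- ip a
#   3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> ... link-netnsid 0
# on the node, interface index 12 is the host side of that veth pair, attached to cni0
ip -o link | grep '^12:'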

Flannel
Pod-to-Pod Communication
• A veth interface named "veth-..." got created in the root namespace
• The other end is attached to the pod namespace
Diagram: (same topology: pod1 10.244.1.2 and pod2 10.244.2.2 attached to cni0 via veth-x / veth-y)
Flannel
ARP Handling
• The 'cni0' bridge replies to ARP requests from the pods
Diagram: (1) pod1 sends an ARP request, (2) cni0 sends the ARP reply
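What the pod sees (illustrative output, assuming the pod image ships the ip tool):

# pod1's default route points at cni0's address on this node's pod subnet
kubectl exec -it pod1 -- ip route
#   default via 10.244.1.1 dev eth0
# after the ping, the gateway's neighbor entry resolves to cni0's MAC
kubectl exec -it pod1 -- ip neigh
ip a show cni0   # compare the MAC on the node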
Flannel
Packet Flow
• A routing table lookup figures out where to send the packet
Diagram: (1) pod1 -> veth-x, (2) cni0 -> route table lookup
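On Node-121 the remote pod subnet points at the flannel.1 VXLAN device (routes are illustrative for this topology):

ip route
#   10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
#   10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
# flanneld pre-populates the FDB and neighbor entries for the remote VTEP
bridge fdb show dev flannel.1
ip neigh show dev flannel.1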
Flannel
Packet Flow
• The flannel.1 device does the VXLAN encap/decap
• Traffic from pod1 goes through the flannel.1 device before exiting via bond0
Diagram: steps 1-12: pod1 eth0 -> veth-x -> cni0 -> flannel.1 -> bond0 -> network -> bond0 -> flannel.1 -> cni0 -> veth-y -> pod2 eth0
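To watch the encapsulation on the wire (flannel's VXLAN uses UDP port 8472 by default):

tcpdump -ni bond0 udp port 8472
# the inner ICMP between 10.244.1.2 and 10.244.2.2 is wrapped in an outer
# 25.25.25.121 -> 35.35.35.122 UDP/VXLAN header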
Calico
Felix
• The primary Calico agent that runs on each machine that hosts endpoints
• Responsible for programming routes and ACLs, and anything else required on the host
Bird
• BGP client, responsible for route distribution
• When Felix inserts routes into the Linux kernel FIB, Bird picks them up and distributes them to the other nodes in the deployment
Calico
Architecture
• Felix's primary responsibility is to program the host's iptables and routes to provide connectivity to the pods on that host
• Bird is a BGP agent for Linux that is used to exchange routing information between the hosts. The routes that are programmed by Felix are picked up by Bird and distributed among the cluster hosts
Diagram: Node1 and Node2, each running Felix and Bird, with a per-host route table, iptables and veths to the pods
*etcd/confd components are not shown for clarity
Calico
Default Configuration
• Node-to-node mesh
• IP-IP encapsulation
Diagram: each node's root namespace has bond0, tunl0, iptables and a route table
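Checking this with calicoctl (v3 syntax; output is illustrative):

# the default IPPool has IP-IP enabled
calicoctl get ippool -o wide
#   NAME                  CIDR             NAT    IPIPMODE   DISABLED
#   default-ipv4-ippool   192.168.0.0/16   true   Always     false
# the node-to-node BGP mesh is established between the two nodes
calicoctl node status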
Calico
Default Configuration ("ipip" tunnel)
• Default: node mesh with an IP-IP tunnel
• The route table has entries to all the other tunl0 interfaces, reached through the other node IPs
Diagram: tunl0 is 192.168.83.64 on Node-121 and 192.168.243.0 on Node-122; each route table has a route to the other node's tunnel
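On Node-121 the routes look roughly like this (prefixes from this topology; "proto bird" marks routes installed by Bird):

ip route
#   192.168.243.0/26 via 35.35.35.122 dev tunl0 proto bird onlink
#   blackhole 192.168.83.64/26 proto bird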
Calico
Pod-to-Pod Communication
• Brought up 2 pods
• "calicoctl get wep" ("workloadendpoints") shows the endpoints on the Calico end
Diagram: pod1 (192.168.83.67) and pod2 (192.168.243.2), each attached via a "cali-" veth interface (cali-x / cali-y)
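Sample output (trimmed and illustrative):

calicoctl get workloadendpoints
#   WORKLOAD   NODE       NETWORKS            INTERFACE
#   pod1       node-121   192.168.83.67/32    cali-x
#   pod2       node-122   192.168.243.2/32    cali-y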
Calico
Pod-to-Pod Communication
• A veth interface named "cali-xxx" got created in the root namespace
• The other end is attached to the pod namespace
Diagram: (same topology: pod1 192.168.83.67 and pod2 192.168.243.2 attached via cali-x / cali-y)
Calico
ARP Resolution
• How does ARP get resolved?
• The default route of pod1 & pod2 points to the link-local address 169.254.1.1
Diagram: pod1 (192.168.83.67) and pod2 (192.168.243.2) attached via cali-x / cali-y
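Inside pod1 (illustrative, assuming the pod image ships the ip tool):

kubectl exec -it pod1 -- ip route
#   default via 169.254.1.1 dev eth0
#   169.254.1.1 dev eth0 scope link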
Calico
ARP Resolution
• Initiate a ping from pod1 to pod2
• The ARP request is sent to the default GW 169.254.1.1
• The cali interface replies to the ARP request with its own MAC (a tcpdump on the root-namespace veth shows the reply)
Diagram: (1) ping from pod1 to pod2, (2) ARP request hits cali-x, (3) cali-x replies with its own MAC
Calico
ARP Resolution
• Why do the "cali-" veth interfaces reply to the ARP request?
• Reason: proxy ARP is enabled on the veth interface in the root namespace
• The network only learns the node IPs/MACs; no pod MACs/IPs are learned on the network
Diagram: proxy ARP is enabled on the cali veth
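You can verify this on the node (the interface name is illustrative):

cat /proc/sys/net/ipv4/conf/cali-x/proxy_arp
# 1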
Calico
Packet Forwarding
• Once ARP is resolved, the routing table rules kick in
• After ARP resolution the packet gets forwarded according to the routing table entries
• Packets get encapsulated via IP-IP (notice the 2 IP headers of the IP-IP protocol)
Diagram: pod1 -> cali-x -> route table -> tunl0 -> bond0 on Node-121, across the network to Node-122
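IP-IP is IP protocol 4, so the encapsulation is easy to spot on the uplink:

tcpdump -ni bond0 ip proto 4
# outer header: node to node (25.25.25.121 -> 35.35.35.122)
# inner header: pod to pod (192.168.83.67 -> 192.168.243.2)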
Calico
Packet Forwarding
• After ARP resolution the packet gets forwarded according to the routing table entries
• Packets get encapsulated via IP-IP
Diagram: steps 1-10: pod1 eth0 -> cali-x -> route table -> tunl0 -> bond0 -> network -> bond0 -> tunl0 -> cali-y -> pod2 eth0
Calico
Packet Forwarding on the Same Host
• /32 routes are present in the routing table for all the local containers
• Packets get forwarded to the appropriate Calico veth interface based on the routing rules
Diagram: steps 1-3: pod1 eth0 (192.168.83.67) -> cali-x -> route table -> cali-z -> pod3 eth0 (192.168.83.68), all on Node-121
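The local /32 routes on Node-121 look roughly like this (interface names illustrative):

ip route | grep cali
#   192.168.83.67 dev cali-x scope link
#   192.168.83.68 dev cali-z scope link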
Calico
Non IP-IP
• Disable IP-IP mode
• No more tunl0: pod routes point directly at bond0
Diagram: each node's root namespace now has only bond0, iptables and the route table
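One way to turn IP-IP off on the default pool (calicoctl v3; the pool name is the usual default and may differ):

calicoctl patch ippool default-ipv4-ippool -p '{"spec": {"ipipMode": "Never"}}'
# routes to remote pod blocks now point at the physical interface instead of tunl0
ip route | grep 192.168.243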

Calico
Non IP-IP
• With IP-IP mode disabled, the tunl0 interface is not present anymore
• Routes point at the bond0 interface
• Bring up 2 pods as before
Diagram: pod1 (192.168.83.69) and pod2 (192.168.243.4) attached via cali-x / cali-y
Calico
Non IP-IP
• A ping from pod1 to pod2 is unsuccessful: the packets don't go across
• Reason: the pod routes are not advertised to the network, so the network has no routes to the pod networks
Diagram: steps 1-3: the packet leaves pod1 and Node-121 but is dropped in the network
Calico
BGP Mode
• Need the Calico nodes to peer with the network fabric
Diagram: Bird on each node peers with the network over bond0
Calico
BGP Mode
• Create a global BGP configuration with AS 63400
• Create the network as a "global" BGP peer (**assuming an abstracted cloud network; the config will vary depending on the vendor)
• Both nodes are now peering with the global BGP peer
Diagram: (same topology: pod1 192.168.83.69 and pod2 192.168.243.4)
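A minimal sketch of the two resources with calicoctl (v3 syntax; the fabric peer IP and AS numbers are illustrative):

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 63400
EOF
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: fabric-peer
spec:
  peerIP: 25.25.25.254
  asNumber: 63400
EOF
# a BGPPeer with no node selector is "global", i.e. every Calico node peers with it
calicoctl node status   # the fabric peer should show as Established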
Calico
BGP Mode
• Configure the Calico nodes as BGP neighbors on the network
• The network is peering with both Calico nodes and learning the pod network routes via BGP
• As a result the network learns the 192.168.83.64/26 & 192.168.243.0/26 routes
Diagram: (same topology)
Calico
BGP Mode
• Packets go across the network without any encapsulation
Diagram: steps 1-8: pod1 eth0 -> cali-x -> route table -> bond0 -> network -> bond0 -> cali-y -> pod2 eth0
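On the wire there is no tunnel header anymore; the pod IPs show up directly:

tcpdump -ni bond0 icmp
#   IP 192.168.83.69 > 192.168.243.4: ICMP echo request ...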
Cilium
Architecture
• The Cilium agent, Cilium CLI client and CNI plugin run on every node
• The Cilium agent compiles BPF programs and makes the kernel run them at key points in the network stack, giving visibility into and control over all network traffic in/out of all containers
• Cilium interacts with the Linux kernel to install BPF programs which then perform networking tasks and implement security rules
Diagram: Node1 with the Cilium agent and BPF programs attached to eth0 and to the pod veths
*etcd/monitor components are not shown for clarity


Cilium: Networking
Overlay Network Mode
• All nodes form a mesh of tunnels using a UDP-based encapsulation protocol: VXLAN (default) or Geneve
• Simple: the only requirement is that cluster nodes can reach each other over IP/UDP
• Auto-configured: when Kubernetes is run with the "--allocate-node-cidrs" option, Cilium can form an overlay network automatically without any configuration by the user
Direct/Native Routing Mode
• In direct routing mode, Cilium hands all packets which are not addressed to another local endpoint to the routing subsystem of the Linux kernel
• Packets are routed as if a local process had emitted them
• Admins can use a routing daemon such as Zebra, Bird or BGPD; the routing protocol announces the node's allocation prefix via the node's IP to all other nodes
Cilium
Default Configuration
• Overlay networking mode with VXLAN encapsulation
• Both veth and IPVLAN are supported (higher performance gains with IPVLAN)
Diagram: each node's root namespace has bond0, cilium_health, cilium_net, cilium_vxlan and cilium_host (192.168.1.1 on Node-121, 192.168.2.1 on Node-122)
Cilium
Default Configuration
• The 'cilium_vxlan' interface is in metadata mode: no IP is assigned to it, and it can send & receive on multiple addresses
• Cilium's addressing model allows the node address to be derived from each container address
• This is also used to derive the VTEP address, so you don't need to run a control-plane protocol to distribute these addresses. All you need are routes to make the node addresses routable
Diagram: (same interfaces as the previous slide)
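Two ways to see this on a node (run the cilium CLI inside the agent pod; output is illustrative):

# the prefix -> VTEP mapping lives in a BPF map, not in a routing protocol
cilium bpf tunnel list
#   TUNNEL           VALUE
#   192.168.2.0/24   35.35.35.122
# cilium_vxlan is an "external" (metadata-mode) VXLAN device with no local/remote IP configured
ip -d link show cilium_vxlan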
Cilium
Pod-to-Pod Communication
• A veth interface named "lxc-xxx" got created in the root namespace
• The other end is attached to the pod namespace
Diagram: pod1 (192.168.1.231) and pod2 (192.168.2.251) attached via lxc-x / lxc-y
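Each pod shows up as a Cilium endpoint backed by an lxc* veth (output trimmed and illustrative):

cilium endpoint list
#   ENDPOINT   IDENTITY   IPv4            STATUS
#   2470       21877      192.168.1.231   ready
ip link | grep lxc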
Cilium
Datapath
• The Cilium datapath uses eBPF hooks to load BPF programs
• The XDP BPF hook sits at the earliest possible point in the networking driver and triggers a run of the BPF program upon packet reception
• Traffic Control (TC) hooks: BPF programs are attached to the TC ingress hook on the host side of the veth pair for monitoring & enforcement
Diagram: XDP hooks on bond0, TC ingress hooks on the lxc-x / lxc-y veths
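The TC hook is visible with plain tc (interface and program/section names are illustrative):

tc filter show dev lxc-x ingress
#   filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_lxc.o:[from-container] direct-action ...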
Cilium
ARP Resolution
• The default GW of the container points to the IP of "cilium_host" (192.168.1.1 on Node-121)
• A BPF program is installed to reply to the ARP request
• The lxc interface's MAC is used for the ARP reply
Diagram: (1) pod1 ARPs for the cilium_host IP; the BPF program replies with the MAC of the lxc-x veth
Cilium
Packet Flow
• The cilium_vxlan device does the VXLAN encap/decap
• Traffic from pod1 goes through the "cilium_vxlan" device before exiting
Diagram: steps 1-10: pod1 eth0 -> lxc-x -> cilium_vxlan -> bond0 -> network -> bond0 -> cilium_vxlan -> lxc-y -> pod2 eth0
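To watch it (Cilium's VXLAN also defaults to UDP port 8472; the cilium CLI runs inside the agent pod):

tcpdump -ni bond0 udp port 8472
cilium monitor --type trace   # per-packet trace events from the BPF datapath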
Cilium
Direct Routing
• No VXLAN/Geneve overlays
• As an admin you can run your own flavor of routing daemon (Bird/BGPD/Zebra etc.) to distribute routes
• As a result you can make the pod IPs routable
Diagram: each node runs a routing daemon alongside the Cilium interfaces
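An illustrative agent configuration for native routing (exact option names vary by Cilium version; treat this as an example, not the canonical flags):

cilium-agent --tunnel=disabled --auto-direct-node-routes=true
# with tunnelling off, remote pod prefixes must be reachable via the fabric or a routing
# daemon (Bird/BGPD/Zebra) announcing each node's allocation prefix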
Yes! Back to the Basics…

K8S Networking: Basics
Services
• Pods are mortal
• Need a higher-level abstraction: Services
• A "Service" in Kubernetes is a conceptual construct. A Service is not a process/daemon, and outside networks don't learn Service IP addresses
• Implemented through kube-proxy with iptables rules
Diagram: a Service IP, load-balanced by kube-proxy across Pod1, Pod2 and Pod3
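On any node you can see a Service rendered as NAT rules (the chain-name hash suffixes below are placeholders; they are generated per Service):

iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>
# the matching KUBE-SVC-XXXXXXXXXXXXXXXX chain spreads connections across one
# KUBE-SEP-* chain per backend pod using the "statistic" match; each KUBE-SEP-*
# chain DNATs to a pod IP:port
iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n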
K8S Networking: Basics
Exposing Services
• If Services are an abstracted concept without any meaning outside of the K8S cluster, how do we access them?
• NodePort / LoadBalancer / Ingress etc.
• NodePort: the Service is accessed via 'NodeIP:port'
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
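A quick NodePort example (names and the allocated port are illustrative):

kubectl expose deployment web --port=80 --type=NodePort
kubectl get svc web
#   NAME   TYPE       CLUSTER-IP    PORT(S)
#   web    NodePort   10.96.23.41   80:31650/TCP
# the Service is now reachable on every node at <node-ip>:<node-port>
curl http://25.25.25.121:31650/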
K8S Networking: Basics
Exposing Services
• LoadBalancer: the Service is accessed via a load balancer. It spins up a load balancer and binds the Service IPs to the load-balancer VIP
• Very common in public cloud environments
• For bare-metal workloads: 'MetalLB' (an up-and-coming project, a load-balancer implementation for bare-metal K8S clusters using standard routing protocols)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
K8S Networking: Basics
Exposing Services
• Ingress: a K8S concept that lets you decide how to let traffic into the cluster
• Sits in front of multiple Services and acts as a "router"
• Implemented through an ingress controller (NGINX/HAProxy)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
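A minimal Ingress sketch (names are illustrative; the API version depends on your cluster, newer clusters use networking.k8s.io/v1):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
EOF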
K8S Networking: Basics
Exposing Services
• Ingress: the network (an "abstracted network" with virtual IPs) can really help you out here
• Ingress controllers are deployed on some of the "public" nodes in your cluster
• E.g. Big Cloud Fabric (by Big Switch) can expose a virtual IP in front of the ingress controllers and perform load balancing / health checks / analytics
Diagram: nodes Kube-1862 and Kube-1863 running ingress controllers, fronted by a virtual IP
Thanks!
• Repo for all the command outputs/PDF slides:
https://github.com/jayakody/ons-2019

• Credits:
• Inspired by Life of a Packet- Michael Rubin, Google (https://www.youtube.com/watch?v=0Omvgd7Hg1I)
• Sarath Kumar, Prashanth Padubidry : Big Switch Engineering
• Thomas Graf, Dan Wendlandt: Isovalent (Cilium Project)
• Project Calico Slack Channel: special shout out to: Casey Davenport, Tigera
• And so many other folks who took time to share knowledge around this emerging space through different mediums
(Blogs/YouTube videos etc)
