
Packet Walk(s) In

Kubernetes
Don Jayakody, Big Switch Networks
@techjaya
linkedin.com/in/jayakody
About
About Big Switch
• We are in the business of "abstracting networks" (as one Big Switch) using open networking hardware
About Me
• Spent 4 years in Engineering building our products; now in Technical Product Management
Why Not Abstract the Network?
• Kubernetes: the abstracted compute domain (legacy vs. next-gen)
http://www.youtube.com/watch?v=HlAXp0-M6SY&t=0m43s
Agenda
Intro: K8S Networking
• Namespaces/Pods/CNIs
• What does that "pause" container really do?
• Flannel: Intro / Packet Flows
• Exposing Services
Calico: Networking
• Architecture
• IP-IP Mode (Route formation / Pod-to-pod communication / ARP Resolution / Packet Flow)
• BGP Mode (Peering requirements / Packet Flow)
Cilium: Networking
• Architecture
• Overlay Network Mode (Configuration / Pod-to-pod communication / Datapath / ARP Resolution / Packet Flow)
• Direct Routing Mode
K8S Networking: Basics
Namespaces
• The Linux kernel has 6 types of namespaces: pid, net, mnt, uts, ipc, user
• Network namespaces provide a brand-new network stack for all the processes within the namespace
• That includes network interfaces, routing tables and iptables rules
Diagram: Node1 with a root namespace (eth0, bridge, veth0/veth1); each veth peer is the eth0 of namespace-1 and namespace-2
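A minimal sketch of this wiring with the iproute2 tools (the names demo-ns, veth-host and veth-pod are made up for illustration):

# create a network namespace with its own stack
sudo ip netns add demo-ns
# create a veth pair: one end stays in the root namespace, the other moves into demo-ns
sudo ip link add veth-host type veth peer name veth-pod
sudo ip link set veth-pod netns demo-ns
# address and bring up both ends
sudo ip addr add 10.10.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo-ns ip addr add 10.10.0.2/24 dev veth-pod
sudo ip netns exec demo-ns ip link set veth-pod up
sudo ip netns exec demo-ns ip link set lo up
# the namespace has its own interfaces, routes and iptables rules
sudo ip netns exec demo-ns ip a
sudo ip netns exec demo-ns ip route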
K8S Networking: Basics
Pods
• Lowest common denominator in K8S. A pod is comprised of one or more containers along with a "pause" container
• The pause container acts as the "parent" container for the other containers inside the pod. One of its primary responsibilities is to bring up the network namespace
• Great for redundancy: termination of the other containers does not result in termination of the network namespace
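A rough way to confirm this with a Docker runtime (the container IDs below are placeholders; find them with 'docker ps'):

# the pod's app containers are started with --net=container:<pause-id>, i.e. they join the
# pause container's network namespace; compare the namespace inodes of the two containers
PAUSE_PID=$(docker inspect --format '{{.State.Pid}}' <pause-container-id>)
APP_PID=$(docker inspect --format '{{.State.Pid}}' <app-container-id>)
readlink /proc/$PAUSE_PID/ns/net /proc/$APP_PID/ns/net   # same net:[inode] => same namespace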
K8S Networking: Basics
Accessing Pod Namespaces
• Multiple ways to access pod namespaces:
• 'kubectl exec -it'
• 'docker exec -it'
• nsenter ("namespace enter": lets you run commands that are installed on the host but not in the container)
Both containers belong to the same pod => same network namespace => same 'ip a' output
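An nsenter sketch (the container ID is a placeholder; a Docker runtime is assumed):

# find the PID of any container in the pod, then enter only its network namespace
PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)
# run tools that exist on the host but not in the (often minimal) container image
sudo nsenter -t $PID -n ip a
sudo nsenter -t $PID -n ss -tnlp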
K8S Networking: Basics
Container Networking Interface (CNI)
• Interface between the container runtime and the network implementation
• A network plugin implements the CNI spec: it takes a container from the runtime and configures (attaches/detaches) it to the network
• A CNI plugin is an executable (in /opt/cni/bin)
• When invoked, it reads a JSON config and environment variables to get all the parameters required to configure the container with the network
Diagram: container runtime -> CNI plugin -> configures networking
Credit: https://www.slideshare.net/weaveworks/introduction-to-the-container-network-interface-cni
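For example, on a Flannel node (the file name and contents vary per plugin; this is illustrative):

# the JSON network config the runtime feeds to the plugin
cat /etc/cni/net.d/10-flannel.conflist
# the plugin executables themselves
ls /opt/cni/bin
# the runtime also passes parameters via environment variables such as
# CNI_COMMAND (ADD/DEL), CNI_CONTAINERID, CNI_NETNS and CNI_IFNAME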
Topology
Topology Info
• The node IPs belong to 2 different subnets: 25.25.25.0/24 & 35.35.35.0/24
• The gateway is configured as .254 in each network segment
• The bond0 interface is created by bonding two 10G interfaces
Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122), each with a bond0 interface in its root namespace
Flannel
Flannel CNI Plugin: Intro
• To make networking easier, Kubernetes does away with port-mapping and assigns a unique IP address to each pod
• If a host cannot get an entire subnet to itself, things get pretty complicated
• Flannel aims to solve this problem by creating an overlay mesh network that provisions a subnet to each server
Diagram: flanneld (VXLAN/UDP...) runs on each node, connected over the network
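On each node, flanneld records the subnet it leased for that host (values are illustrative):

cat /run/flannel/subnet.env
# FLANNEL_NETWORK=10.244.0.0/16
# FLANNEL_SUBNET=10.244.1.1/24
# FLANNEL_MTU=1450
# FLANNEL_IPMASQ=true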
Flannel
Default Config
• "flannel.1" is the VXLAN interface
• "cni0" is the bridge
Diagram: each node's root namespace has bond0, flannel.1 and cni0
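A quick way to see both devices on a node (output trimmed and illustrative):

# flannel.1 is a VXLAN device; -d shows the tunnel details
ip -d link show flannel.1
#   vxlan id 1 local 25.25.25.121 dev bond0 dstport 8472 nolearning ...
# cni0 is a plain Linux bridge with the pod veths attached as ports
ip -d link show cni0
bridge link show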
Flannel
Pod-to-Pod Communication
• Brought up 2 pods
Diagram: pod1 (10.244.1.2) on Node-121 and pod2 (10.244.2.2) on Node-122, each attached to cni0 via a veth pair (veth-x / veth-y)
Notice the index on the veth: e.g. "eth0@if12" in pod1's namespace corresponds to the 12th interface in the root namespace
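To check the pairing (pod name and index are illustrative; assumes the pod image ships the ip tool):

# inside pod1, eth0 shows its peer's ifindex after the '@'
kubectl exec -it pod1 -- ip a
#   3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> ... link-netnsid 0
# on the node, interface index 12 is the host side of that veth pair, attached to cni0
ip -o link | grep '^12:'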

Flannel
Pod-to-Pod Communication
• A veth interface named "veth-..." got created in the root namespace
• The other end is attached to the pod namespace
Diagram: (same topology: pod1 10.244.1.2 and pod2 10.244.2.2 attached to cni0 via veth-x / veth-y)
Flannel
ARP Handling
• The 'cni0' bridge replies to ARP requests from the pods
Diagram: (1) pod1 sends an ARP request, (2) cni0 sends the ARP reply
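What the pod sees (illustrative output, assuming the pod image ships the ip tool):

# pod1's default route points at cni0's address on this node's pod subnet
kubectl exec -it pod1 -- ip route
#   default via 10.244.1.1 dev eth0
# after the ping, the gateway's neighbor entry resolves to cni0's MAC
kubectl exec -it pod1 -- ip neigh
ip a show cni0   # compare the MAC on the node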
Flannel
Packet Flow
• A routing table lookup figures out where to send the packet
Diagram: (1) pod1 -> veth-x, (2) cni0 -> route table lookup
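On Node-121 the remote pod subnet points at the flannel.1 VXLAN device (routes are illustrative for this topology):

ip route
#   10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
#   10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
# flanneld pre-populates the FDB and neighbor entries for the remote VTEP
bridge fdb show dev flannel.1
ip neigh show dev flannel.1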
Flannel
Packet Flow
• The flannel.1 device does the VXLAN encap/decap
• Traffic from pod1 goes through the flannel.1 device before exiting via bond0
Diagram: steps 1-12: pod1 eth0 -> veth-x -> cni0 -> flannel.1 -> bond0 -> network -> bond0 -> flannel.1 -> cni0 -> veth-y -> pod2 eth0
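To watch the encapsulation on the wire (flannel's VXLAN uses UDP port 8472 by default):

tcpdump -ni bond0 udp port 8472
# the inner ICMP between 10.244.1.2 and 10.244.2.2 is wrapped in an outer
# 25.25.25.121 -> 35.35.35.122 UDP/VXLAN header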
Calico
Felix
• The primary Calico agent that runs on each machine that hosts endpoints
• Responsible for programming routes and ACLs, and anything else required on the host
Bird
• BGP client, responsible for route distribution
• When Felix inserts routes into the Linux kernel FIB, Bird picks them up and distributes them to the other nodes in the deployment
Calico
Architecture
• Felix's primary responsibility is to program the host's iptables and routes to provide connectivity to the pods on that host
• Bird is a BGP agent for Linux that is used to exchange routing information between the hosts. The routes that are programmed by Felix are picked up by Bird and distributed among the cluster hosts
Diagram: Node1 and Node2, each running Felix and Bird, with a per-host route table, iptables and veths to the pods
*etcd/confd components are not shown for clarity
Calico
Default Configuration
• Node-to-node mesh
• IP-IP encapsulation
Diagram: each node's root namespace has bond0, tunl0, iptables and a route table
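Checking this with calicoctl (v3 syntax; output is illustrative):

# the default IPPool has IP-IP enabled
calicoctl get ippool -o wide
#   NAME                  CIDR             NAT    IPIPMODE   DISABLED
#   default-ipv4-ippool   192.168.0.0/16   true   Always     false
# the node-to-node BGP mesh is established between the two nodes
calicoctl node status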
Calico
Default Configuration ("ipip" tunnel)
• Default: node mesh with an IP-IP tunnel
• The route table has entries to all the other tunl0 interfaces, reached through the other node IPs
Diagram: tunl0 is 192.168.83.64 on Node-121 and 192.168.243.0 on Node-122; each route table has a route to the other node's tunnel
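On Node-121 the routes look roughly like this (prefixes from this topology; "proto bird" marks routes installed by Bird):

ip route
#   192.168.243.0/26 via 35.35.35.122 dev tunl0 proto bird onlink
#   blackhole 192.168.83.64/26 proto bird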
Calico
Pod-to-Pod Communication
• Brought up 2 pods
• "calicoctl get wep" ("workloadendpoints") shows the endpoints on the Calico end
Diagram: pod1 (192.168.83.67) and pod2 (192.168.243.2), each attached via a "cali-" veth interface (cali-x / cali-y)
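Sample output (trimmed and illustrative):

calicoctl get workloadendpoints
#   WORKLOAD   NODE       NETWORKS            INTERFACE
#   pod1       node-121   192.168.83.67/32    cali-x
#   pod2       node-122   192.168.243.2/32    cali-y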
Calico
Pod-to-Pod Communication
• A veth interface named "cali-xxx" got created in the root namespace
• The other end is attached to the pod namespace
Diagram: (same topology: pod1 192.168.83.67 and pod2 192.168.243.2 attached via cali-x / cali-y)
Calico
ARP Resolution
• How does ARP get resolved?
• The default route of pod1 & pod2 points to the link-local address 169.254.1.1
Diagram: pod1 (192.168.83.67) and pod2 (192.168.243.2) attached via cali-x / cali-y
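Inside pod1 (illustrative, assuming the pod image ships the ip tool):

kubectl exec -it pod1 -- ip route
#   default via 169.254.1.1 dev eth0
#   169.254.1.1 dev eth0 scope link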
Calico
ARP Resolution
• Initiate a ping from pod1 to pod2
• The ARP request is sent to the default GW 169.254.1.1
• The cali interface replies to the ARP request with its own MAC (a tcpdump on the root-namespace veth shows the reply)
Diagram: (1) ping from pod1 to pod2, (2) ARP request hits cali-x, (3) cali-x replies with its own MAC
Calico
ARP Resolution
• Why do the "cali-" veth interfaces reply to the ARP request?
• Reason: proxy ARP is enabled on the veth interface in the root namespace
• The network only learns the node IPs/MACs; no pod MACs/IPs are learned on the network
Diagram: proxy ARP is enabled on the cali veth
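You can verify this on the node (the interface name is illustrative):

cat /proc/sys/net/ipv4/conf/cali-x/proxy_arp
# 1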
Calico
Packet Forwarding
• Once ARP is resolved, the routing table rules kick in
• After ARP resolution the packet gets forwarded according to the routing table entries
• Packets get encapsulated via IP-IP (notice the 2 IP headers of the IP-IP protocol)
Diagram: pod1 -> cali-x -> route table -> tunl0 -> bond0 on Node-121, across the network to Node-122
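IP-IP is IP protocol 4, so the encapsulation is easy to spot on the uplink:

tcpdump -ni bond0 ip proto 4
# outer header: node to node (25.25.25.121 -> 35.35.35.122)
# inner header: pod to pod (192.168.83.67 -> 192.168.243.2)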
Calico
Packet Forwarding
• After ARP resolution the packet gets forwarded according to the routing table entries
• Packets get encapsulated via IP-IP
Diagram: steps 1-10: pod1 eth0 -> cali-x -> route table -> tunl0 -> bond0 -> network -> bond0 -> tunl0 -> cali-y -> pod2 eth0
Calico
Packet Forwarding on the Same Host
• /32 routes are present in the routing table for all the local containers
• Packets get forwarded to the appropriate Calico veth interface based on the routing rules
Diagram: steps 1-3: pod1 eth0 (192.168.83.67) -> cali-x -> route table -> cali-z -> pod3 eth0 (192.168.83.68), all on Node-121
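The local /32 routes on Node-121 look roughly like this (interface names illustrative):

ip route | grep cali
#   192.168.83.67 dev cali-x scope link
#   192.168.83.68 dev cali-z scope link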
Calico
Non IP-IP
• Disable IP-IP mode
• No more tunl0: pod routes point directly at bond0
Diagram: each node's root namespace now has only bond0, iptables and the route table
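One way to turn IP-IP off on the default pool (calicoctl v3; the pool name is the usual default and may differ):

calicoctl patch ippool default-ipv4-ippool -p '{"spec": {"ipipMode": "Never"}}'
# routes to remote pod blocks now point at the physical interface instead of tunl0
ip route | grep 192.168.243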

Calico
Non IP-IP
• With IP-IP mode disabled, the tunl0 interface is not present anymore
• Routes point at the bond0 interface
• Bring up 2 pods as before
Diagram: pod1 (192.168.83.69) and pod2 (192.168.243.4) attached via cali-x / cali-y
Calico
Non IP-IP
• A ping from pod1 to pod2 is unsuccessful: the packets don't go across
• Reason: the pod routes are not advertised to the network, so the network has no routes to the pod networks
Diagram: steps 1-3: the packet leaves pod1 and Node-121 but is dropped in the network
Calico
BGP Mode
• Need the Calico nodes to peer with the network fabric
Diagram: Bird on each node peers with the network over bond0
Calico
BGP Mode
• Create a global BGP configuration with AS 63400
• Create the network as a "global" BGP peer (**assuming an abstracted cloud network; the config will vary depending on the vendor)
• Both nodes are now peering with the global BGP peer
Diagram: (same topology: pod1 192.168.83.69 and pod2 192.168.243.4)
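A minimal sketch of the two resources with calicoctl (v3 syntax; the fabric peer IP and AS numbers are illustrative):

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 63400
EOF
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: fabric-peer
spec:
  peerIP: 25.25.25.254
  asNumber: 63400
EOF
# a BGPPeer with no node selector is "global", i.e. every Calico node peers with it
calicoctl node status   # the fabric peer should show as Established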
Calico
BGP Mode
• Configure the Calico nodes as BGP neighbors on the network
• The network is peering with both Calico nodes and learning the pod network routes via BGP
• As a result the network learns the 192.168.83.64/26 & 192.168.243.0/26 routes
Diagram: (same topology)
Calico
BGP Mode
• Packets go across the network without any encapsulation
Diagram: steps 1-8: pod1 eth0 -> cali-x -> route table -> bond0 -> network -> bond0 -> cali-y -> pod2 eth0
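On the wire there is no tunnel header anymore; the pod IPs show up directly:

tcpdump -ni bond0 icmp
#   IP 192.168.83.69 > 192.168.243.4: ICMP echo request ...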
Cilium
Architecture
• The Cilium agent, Cilium CLI client and CNI plugin run on every node
• The Cilium agent compiles BPF programs and makes the kernel run them at key points in the network stack, giving visibility into and control over all network traffic in/out of all containers
• Cilium interacts with the Linux kernel to install BPF programs which then perform networking tasks and implement security rules
Diagram: Node1 with the Cilium agent and BPF programs attached to eth0 and to the pod veths
*etcd/monitor components are not shown for clarity


Cilium: Networking
Overlay Network Mode
• All nodes form a mesh of tunnels using a UDP-based encapsulation protocol: VXLAN (default) or Geneve
• Simple: the only requirement is that cluster nodes can reach each other over IP/UDP
• Auto-configured: when Kubernetes is run with the "--allocate-node-cidrs" option, Cilium can form an overlay network automatically without any configuration by the user
Direct/Native Routing Mode
• In direct routing mode, Cilium hands all packets which are not addressed to another local endpoint to the routing subsystem of the Linux kernel
• Packets are routed as if a local process had emitted them
• Admins can use a routing daemon such as Zebra, Bird or BGPD; the routing protocol announces the node's allocation prefix via the node's IP to all other nodes
Cilium
Default Configuration
• Overlay networking mode with VXLAN encapsulation
• Both veth and IPVLAN are supported (higher performance gains with IPVLAN)
Diagram: each node's root namespace has bond0, cilium_health, cilium_net, cilium_vxlan and cilium_host (192.168.1.1 on Node-121, 192.168.2.1 on Node-122)
Cilium
Default Configuration
• The 'cilium_vxlan' interface is in metadata mode: no IP is assigned to it, and it can send & receive on multiple addresses
• Cilium's addressing model allows the node address to be derived from each container address
• This is also used to derive the VTEP address, so you don't need to run a control-plane protocol to distribute these addresses. All you need are routes to make the node addresses routable
Diagram: (same interfaces as the previous slide)
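Two ways to see this on a node (run the cilium CLI inside the agent pod; output is illustrative):

# the prefix -> VTEP mapping lives in a BPF map, not in a routing protocol
cilium bpf tunnel list
#   TUNNEL           VALUE
#   192.168.2.0/24   35.35.35.122
# cilium_vxlan is an "external" (metadata-mode) VXLAN device with no local/remote IP configured
ip -d link show cilium_vxlan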
Cilium
Pod-to-Pod Communication
• A veth interface named "lxc-xxx" got created in the root namespace
• The other end is attached to the pod namespace
Diagram: pod1 (192.168.1.231) and pod2 (192.168.2.251) attached via lxc-x / lxc-y
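Each pod shows up as a Cilium endpoint backed by an lxc* veth (output trimmed and illustrative):

cilium endpoint list
#   ENDPOINT   IDENTITY   IPv4            STATUS
#   2470       21877      192.168.1.231   ready
ip link | grep lxc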
Cilium
Datapath
• The Cilium datapath uses eBPF hooks to load BPF programs
• The XDP BPF hook sits at the earliest possible point in the networking driver and triggers a run of the BPF program upon packet reception
• Traffic Control (TC) hooks: BPF programs are attached to the TC ingress hook on the host side of the veth pair for monitoring & enforcement
Diagram: XDP hooks on bond0, TC ingress hooks on the lxc-x / lxc-y veths
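The TC hook is visible with plain tc (interface and program/section names are illustrative):

tc filter show dev lxc-x ingress
#   filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_lxc.o:[from-container] direct-action ...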
Cilium
ARP Resolution
• The default GW of the container points to the IP of "cilium_host" (192.168.1.1 on Node-121)
• A BPF program is installed to reply to the ARP request
• The lxc interface's MAC is used for the ARP reply
Diagram: (1) pod1 ARPs for the cilium_host IP; the BPF program replies with the MAC of the lxc-x veth
Cilium
Packet Flow
• The cilium_vxlan device does the VXLAN encap/decap
• Traffic from pod1 goes through the "cilium_vxlan" device before exiting
Diagram: steps 1-10: pod1 eth0 -> lxc-x -> cilium_vxlan -> bond0 -> network -> bond0 -> cilium_vxlan -> lxc-y -> pod2 eth0
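To watch it (Cilium's VXLAN also defaults to UDP port 8472; the cilium CLI runs inside the agent pod):

tcpdump -ni bond0 udp port 8472
cilium monitor --type trace   # per-packet trace events from the BPF datapath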
Cilium
Direct Routing
• No VXLAN/Geneve overlays
• As an admin you can run your own flavor of routing daemon (Bird/BGPD/Zebra etc.) to distribute routes
• As a result you can make the pod IPs routable
Diagram: each node runs a routing daemon alongside the Cilium interfaces
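An illustrative agent configuration for native routing (exact option names vary by Cilium version; treat this as an example, not the canonical flags):

cilium-agent --tunnel=disabled --auto-direct-node-routes=true
# with tunnelling off, remote pod prefixes must be reachable via the fabric or a routing
# daemon (Bird/BGPD/Zebra) announcing each node's allocation prefix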
Yes! Back to the Basics…

K8S Networking: Basics
Services
• Pods are mortal
• Need a higher-level abstraction: Services
• A "Service" in Kubernetes is a conceptual construct. A Service is not a process/daemon, and outside networks don't learn Service IP addresses
• Implemented through kube-proxy with iptables rules
Diagram: a Service IP, load-balanced by kube-proxy across Pod1, Pod2 and Pod3
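On any node you can see a Service rendered as NAT rules (the chain-name hash suffixes below are placeholders; they are generated per Service):

iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>
# the matching KUBE-SVC-XXXXXXXXXXXXXXXX chain spreads connections across one
# KUBE-SEP-* chain per backend pod using the "statistic" match; each KUBE-SEP-*
# chain DNATs to a pod IP:port
iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n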
K8S Networking: Basics
Exposing Services
• If Services are an abstracted concept without any meaning outside of the K8S cluster, how do we access them?
• NodePort / LoadBalancer / Ingress etc.
• NodePort: the Service is accessed via 'NodeIP:port'
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
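A quick NodePort example (names and the allocated port are illustrative):

kubectl expose deployment web --port=80 --type=NodePort
kubectl get svc web
#   NAME   TYPE       CLUSTER-IP    PORT(S)
#   web    NodePort   10.96.23.41   80:31650/TCP
# the Service is now reachable on every node at <node-ip>:<node-port>
curl http://25.25.25.121:31650/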
K8S Networking: Basics
Exposing Services
• LoadBalancer: the Service is accessed via a load balancer. It spins up a load balancer and binds the Service IPs to the load-balancer VIP
• Very common in public cloud environments
• For bare-metal workloads: 'MetalLB' (an up-and-coming project, a load-balancer implementation for bare-metal K8S clusters using standard routing protocols)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
K8S Networking: Basics
Exposing Services
• Ingress: a K8S concept that lets you decide how to let traffic into the cluster
• Sits in front of multiple Services and acts as a "router"
• Implemented through an ingress controller (NGINX/HAProxy)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
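A minimal Ingress sketch (names are illustrative; the API version depends on your cluster, newer clusters use networking.k8s.io/v1):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
EOF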
K8S Networking: Basics
Exposing Services
• Ingress: the network (an "abstracted network" with virtual IPs) can really help you out here
• Ingress controllers are deployed on some of the "public" nodes in your cluster
• E.g. Big Cloud Fabric (by Big Switch) can expose a virtual IP in front of the ingress controllers and perform load balancing / health checks / analytics
Diagram: nodes Kube-1862 and Kube-1863 running ingress controllers, fronted by a virtual IP
Thanks!
• Repo for all the command outputs/PDF slides:
https://github.com/jayakody/ons-2019

• Credits:
• Inspired by Life of a Packet- Michael Rubin, Google (https://www.youtube.com/watch?v=0Omvgd7Hg1I)
• Sarath Kumar, Prashanth Padubidry : Big Switch Engineering
• Thomas Graf, Dan Wendlandt: Isovalent (Cilium Project)
• Project Calico Slack Channel: special shout out to: Casey Davenport, Tigera
• And so many other folks who took time to share knowledge around this emerging space through different mediums
(Blogs/YouTube videos etc)
