Kubernetes
Don Jayakody, Big Switch Networks
@techjaya
linkedin.com/in/jayakody
About
About Big Switch
• We are in the business of “Abstracting Networks” (as one Big Switch) using Open Networking Hardware
• Why not abstract the network? Kubernetes already gives you an abstracted compute domain
About Me
• Spent 4 years in Engineering building our products. Now in Technical Product Management
http://www.youtube.com/watch?v=HlAXp0-M6SY&t=0m43s
Agenda
• Intro: K8S Networking
• Calico: Networking
• Cilium: Networking
Container Networking Interface: CNI
• Interface between the container runtime and the network implementation
• A network plugin implements the CNI spec: invoked by the container runtime, it configures (attaches/detaches) the container to/from the network
• A CNI plugin is an executable (in /opt/cni/bin)
• When invoked, it reads a JSON config & environment variables to get all the parameters required to configure the container's network
[Diagram: container runtime → CNI plugin → configures networking]
Credit: https://www.slideshare.net/weaveworks/introduction-to-the-container-network-interface-cni
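A minimal sketch of what a plugin receives when invoked (file name, network name, and subnet are illustrative; the keys follow the CNI spec):

    # Example CNI config the runtime passes on stdin (e.g. /etc/cni/net.d/10-mynet.conf)
    {
      "cniVersion": "0.3.1",
      "name": "mynet",
      "type": "bridge",
      "bridge": "cni0",
      "ipam": { "type": "host-local", "subnet": "10.244.1.0/24" }
    }

    # Environment variables set by the runtime for the invocation
    CNI_COMMAND=ADD              # ADD / DEL / CHECK / VERSION
    CNI_CONTAINERID=<id>         # container ID from the runtime
    CNI_NETNS=/proc/<pid>/ns/net # path to the container's network namespace
    CNI_IFNAME=eth0              # interface name to create inside the container
    CNI_PATH=/opt/cni/bin        # where plugin executables live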
Topology
• The node IPs belong to two different subnets: 25.25.25.0/24 & 35.35.35.0/24
• The gateway is configured as .254 in each network segment
• The bond0 interface is created by bonding two 10G interfaces
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Flannel
Flannel CNI Plugin
Intro
• To make networking easier, Kubernetes does away with port-mapping and assigns a unique IP address to each pod
• If a host cannot get an entire subnet to itself, things get pretty complicated
• Flannel aims to solve this problem by creating an overlay mesh network that provisions a subnet to each server
[Diagram: flanneld (VXLAN/UDP..) running on each node]
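As an illustration of that per-server subnet lease, flanneld writes its result to /run/flannel/subnet.env, which the flannel CNI plugin reads (values assume the 10.244.0.0/16 pod network used later in this deck):

    FLANNEL_NETWORK=10.244.0.0/16
    FLANNEL_SUBNET=10.244.1.1/24   # this node's slice; pod IPs are allocated from here
    FLANNEL_MTU=1450               # leaves room for the VXLAN header
    FLANNEL_IPMASQ=true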
Flannel
[Diagram: flannel overlay network spanning Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Flannel
[Diagram: pod 10.244.1.2 on Node-121 (25.25.25.121), pod 10.244.2.2 on Node-122 (35.35.35.122)]
• Notice the index on the veth pair: e.g. “eth0@if12” in pod1's namespace corresponds to the 12th interface in the root namespace
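To check that pairing yourself (interface names and indexes here are illustrative):

    # Inside the pod's network namespace
    $ ip link show eth0
    3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...

    # In the root namespace: interface index 12 is the host-side peer
    $ ip link | grep '^12:'
    12: veth1a2b3c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> ... master cni0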
Flannel
Pod-to-Pod Communication
[Diagram: pod 10.244.1.2 (Node-121, 25.25.25.121) to pod 10.244.2.2 (Node-122, 35.35.35.122), via each node's root namespace and bond0]
Flannel
[Diagram, animation steps 5-8: the packet from pod 10.244.1.2 is encapsulated by flanneld on Node-121, crosses the underlay from 25.25.25.121 to 35.35.35.122, and is decapsulated and delivered to pod 10.244.2.2]
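The node routing table that drives this hop might look like this (illustrative; cni0 and flannel.1 are flannel's default device names):

    $ ip route
    default via 25.25.25.254 dev bond0
    10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
    10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink   # remote node's pod subnet via the VXLAN device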
Calico
Felix
• The primary Calico agent that runs on each machine that hosts endpoints
• Responsible for programming routes and ACLs, and anything else required on the host
Bird
• BGP client, responsible for route distribution
• When Felix inserts routes into the Linux kernel FIB, Bird picks them up and distributes them to the other nodes in the deployment
Calico
Architecture
• Felix's primary responsibility is to program the host's iptables and routes to provide connectivity to the pods on that host
• Bird is a BGP agent for Linux that is used to exchange routing information between the hosts. The routes programmed by Felix are picked up by Bird and distributed among the cluster hosts
[Diagram: Felix and Bird on each node, programming the iptables and route tables; pods attached via veth pairs on Node1 and Node2]
*etcd/confd components are not shown for clarity
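For instance, the routes Felix programs (and Bird advertises) typically look like this on a node (illustrative output using this deck's addressing; the cali interface name is made up):

    $ ip route
    192.168.83.67 dev cali1a2b3c4d5e6 scope link      # local pod, via its cali veth
    blackhole 192.168.83.64/26 proto bird             # this node's pod CIDR
    192.168.243.0/26 via 35.35.35.122 dev tunl0 proto bird onlink   # remote pods over IPIP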
Calico
“ipip” tunnel
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122) connected by an “ipip” tunnel (tunl0)]
Calico
[Diagram: veth interfaces (cali…) for each pod on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Calico
[Diagram: tcpdump on the root-namespace veth (steps 2-3) shows the root veth replying to ARP with its own MAC; Node-121 (25.25.25.121), Node-122 (35.35.35.122)]
Calico
ARP Resolution
• The default GW inside the pod is 169.254.1.1
• The cali interface replies to the ARP request with its own MAC
[Diagram: pod1 eth0 (192.168.83.67) and pod2 eth0 (192.168.243.2) resolving 169.254.1.1 in the root namespace of each node]
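From inside a pod this resolution looks roughly like the following (illustrative; the fixed ee:ee:ee:ee:ee:ee MAC is what Calico conventionally assigns to cali interfaces):

    $ ip route
    default via 169.254.1.1 dev eth0
    169.254.1.1 dev eth0 scope link

    $ ip neigh
    169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee REACHABLE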
Calico
• Once ARP is resolved, the routing-table rules kick in
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Calico
[Diagram: packet-flow steps 4-7 across Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Calico
• Packets get forwarded to the appropriate cali veth interface based on the routing rules
[Diagram: pod1 eth0 (192.168.83.67) and pod3 eth0 (192.168.83.68) on Node-121 (25.25.25.121); pod2 eth0 (192.168.243.2) on Node-122 (35.35.35.122)]
Calico
• No more tunl0: pod routes now point directly at bond0
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
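This is the effect of disabling IPIP on the Calico IP pool. A minimal sketch with calicoctl (pool name and CIDR are the common defaults; treat them as assumptions for your cluster):

    $ calicoctl apply -f - <<EOF
    apiVersion: projectcalico.org/v3
    kind: IPPool
    metadata:
      name: default-ipv4-ippool
    spec:
      cidr: 192.168.0.0/16
      ipipMode: Never        # stop encapsulating pod-to-pod traffic
      natOutgoing: true
    EOF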
Calico
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122), IPIP tunnel removed]
Calico
❓ Packets won't go through (step 3). Reason: no routes to the pod networks; with the tunnel gone, the network in between doesn't know how to reach the pod subnets
[Diagram: pods 192.168.83.69 and 192.168.243.4 on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
• Fix: create a BGP configuration with AS 63400, and create a “global” BGP peer, as sketched below
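A sketch of those two objects with calicoctl (AS 63400 is from the slide; the peer IP, here the fabric gateway, is illustrative; a BGPPeer with no node selector applies globally):

    $ calicoctl apply -f - <<EOF
    apiVersion: projectcalico.org/v3
    kind: BGPConfiguration
    metadata:
      name: default
    spec:
      asNumber: 63400
    ---
    apiVersion: projectcalico.org/v3
    kind: BGPPeer
    metadata:
      name: fabric-peer
    spec:
      peerIP: 25.25.25.254
      asNumber: 63400
    EOF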
Calico
3 6
192.168.83.69 192.168.243.4
Node-121 Node-122
25.25.25.121 35.35.35.122
Cilium
Architecture
• The Cilium agent, Cilium CLI client, and CNI plugin run on every node
• The Cilium agent compiles BPF programs and has the kernel run them at key points in the network stack, giving visibility and control over all network traffic in/out of all containers
• Cilium interacts with the Linux kernel to install BPF programs, which then perform networking tasks and implement security rules
[Diagram: BPF programs attached to the pod1/pod2 interfaces on Node1]
• Both VETH and IPVLAN modes are supported (higher performance gains with IPVLAN)
[Diagram: Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
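To see these pieces on a node, the agent's own CLI can be used from the cilium pod (e.g. via kubectl exec; output omitted here):

    $ cilium status          # agent, kvstore, and datapath health
    $ cilium endpoint list   # one BPF-managed endpoint per local pod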
Cilium
Configuration
• The default “cilium_vxlan” interface is in metadata mode: no IP is assigned to it, and it can send & receive on multiple addresses
• Cilium's addressing model allows the node address to be derived from each container address (cilium_host: 192.168.1.1 on Node-121, 192.168.2.1 on Node-122)
Communication
• Each pod attaches via a VETH interface with an “lxc-” name prefix on the host side
[Diagram: cilium_health, cilium_vxlan, cilium_net, and cilium_host devices in each root namespace; pods 192.168.1.231 and 192.168.2.251 on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
Cilium
Datapath
• The Cilium datapath uses eBPF hooks to load BPF programs
• The XDP BPF hook is at the earliest possible point in the networking driver and triggers a run of the BPF program upon packet reception
• Traffic Control (TC) hooks: BPF programs are attached to the TC ingress hook on the host side of the VETH pair for monitoring & enforcement
[Diagram: XDP hooks at bond0, TC ingress hooks at the lxc-x/lxc-y veths; pods 192.168.1.231 and 192.168.2.251 on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
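One way to observe the TC attachment on the host side of a pod's veth (device name from the diagram; output trimmed and illustrative):

    $ tc filter show dev lxc-x ingress
    filter protocol all pref 1 bpf chain 0
    filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_lxc.o:[from-container] direct-action ...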
Cilium
ARP Resolution
• The default GW of the container points to the IP of “cilium_host” (192.168.1.1 on Node-121)
• A BPF program is installed to reply to the ARP request
• The reply uses the LXC interface's MAC: the BPF program answers the ARP request with the veth's (lxc-x) MAC
[Diagram: ARP reply from lxc-x (step 1); pods 192.168.1.231 and 192.168.2.251 on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
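From inside the pod this looks roughly like the following (illustrative, using this deck's addressing):

    $ ip route
    default via 192.168.1.1 dev eth0   # the cilium_host IP
    192.168.1.1 dev eth0 scope link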
Cilium
Packet Flow
[Diagram: packet-flow steps 3-8 from pod 192.168.1.231 (Node-121, 25.25.25.121) through bond0 to pod 192.168.2.251 (Node-122, 35.35.35.122)]
Cilium
Direct Routing
• No VXLAN/GENEVE overlays
• As an admin, you can run your own flavor of routing daemon (Bird/BGPD/Zebra etc.) to distribute routes
• As a result, you can make the pod IPs routable
[Diagram: a routing daemon alongside Cilium in each root namespace; pods 192.168.1.231 and 192.168.2.251 on Node-121 (25.25.25.121) and Node-122 (35.35.35.122)]
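A sketch of the agent setting this mode relies on (key name as in the cilium-config ConfigMap of Cilium releases around this deck's era; treat it as an assumption to verify against your version):

    # cilium-config ConfigMap
    tunnel: "disabled"   # no VXLAN/GENEVE encapsulation; pod routes are then
                         # distributed by your routing daemon (e.g. Bird)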
Yes! Back to the Basics…
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
K8S Networking: Basics
Exposing Services
• LoadBalancer: spins up a load balancer and binds service IPs to the load balancer VIP; the service is accessed via the load balancer (sketched below)
• Very common in public cloud environments
• For bare-metal workloads: “MetalLB” (an up-and-coming project: a load-balancer implementation for bare-metal K8S clusters, using standard routing protocols)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
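A minimal Service of type LoadBalancer (names and ports are illustrative):

    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      type: LoadBalancer   # the cloud provider (or MetalLB) allocates the external VIP
      selector:
        app: web
      ports:
      - port: 80
        targetPort: 8080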
K8S Networking: Basics
Exposing Services
• Ingress: a K8S concept that lets you decide how to let traffic into the cluster
• Sits in front of multiple services and acts as a “router” (see the sketch below)
• Implemented through an ingress controller (NGINX/HAProxy)
Credit: https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0
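A small Ingress routing two paths to two services (illustrative names; shown with the current networking.k8s.io/v1 schema, which is newer than the extensions/v1beta1 API of this deck's era):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example
    spec:
      rules:
      - host: example.com
        http:
          paths:
          - path: /foo
            pathType: Prefix
            backend:
              service:
                name: foo-svc
                port:
                  number: 80
          - path: /bar
            pathType: Prefix
            backend:
              service:
                name: bar-svc
                port:
                  number: 80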
K8S Networking: Basics
Exposing Services
• Ingress: an abstracted network (“Abstracted Virtual-IP Network”) can really help you out here
• Ingress controllers are deployed on some of the “public” nodes in your cluster
• E.g. Big Cloud Fabric (by Big Switch) can expose a Virtual IP in front of the ingress controllers and perform load balancing / health checks / analytics
[Diagram: a VIP in front of ingress controllers on nodes Kube-1862 and Kube-1863]
Thanks!
• Repo for all the command outputs/PDF slides:
https://github.com/jayakody/ons-2019
• Credits:
• Inspired by “Life of a Packet” by Michael Rubin, Google (https://www.youtube.com/watch?v=0Omvgd7Hg1I)
• Sarath Kumar, Prashanth Padubidry: Big Switch Engineering
• Thomas Graf, Dan Wendlandt: Isovalent (Cilium Project)
• Project Calico Slack channel: special shout-out to Casey Davenport, Tigera
• And so many other folks who took time to share knowledge around this emerging space through different mediums (blogs/YouTube videos etc.)