Kubernetes without kube-proxy¶
This guide explains how to provision a Kubernetes cluster without kube-proxy and to use Cilium to fully replace it. For simplicity, we will use kubeadm to bootstrap the cluster.
For installing kubeadm and for more provisioning options, please refer to the official kubeadm documentation.
Cilium’s kube-proxy replacement depends on the Host-Reachable Services feature, therefore a v4.19.57, v5.1.16, v5.2.0 or more recent Linux kernel is required. We recommend a v5.3 or more recent Linux kernel as Cilium can perform additional optimizations in its kube-proxy replacement implementation.
Note that v5.0.y kernels do not have the fix required to run the kube-proxy replacement since at this point in time the v5.0.y stable kernel is end-of-life (EOL) and not maintained anymore on kernel.org. For individual distribution maintained kernels, the situation could differ. Therefore, please check with your distribution.
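The kernel requirements above can be checked with a small script. This is a minimal sketch, not part of Cilium: the helper names version_at_least and kernel_supported are hypothetical, the script assumes a version string of the form x.y.z, and it relies on the version sort (sort -V) provided by GNU coreutils. Distribution kernels may carry backported fixes, so a negative result is only a hint to check with your distribution.

```shell
#!/bin/sh
# Succeeds (exit 0) if version $2 is at least version $1.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
  required="$1"
  actual="$2"
  # The smaller version sorts first; if that is the required one,
  # the actual version is new enough.
  [ "$(printf '%s\n%s\n' "$required" "$actual" | sort -V | head -n 1)" = "$required" ]
}

# Succeeds if the upstream kernel version $1 (format x.y.z) meets the minimums
# listed above: v4.19.57, v5.1.16, v5.2.0 or more recent; v5.0.y is excluded.
kernel_supported() {
  case "$1" in
    4.19.*) version_at_least 4.19.57 "$1" ;;
    5.0.*)  return 1 ;;
    5.1.*)  version_at_least 5.1.16 "$1" ;;
    *)      version_at_least 5.2.0 "$1" ;;
  esac
}

# Strip a distribution suffix such as "-42-generic" from the running kernel.
running="$(uname -r | cut -d- -f1)"

if kernel_supported "$running"; then
  echo "kernel $running meets the upstream minimum"
else
  echo "kernel $running may be too old, check with your distribution"
fi
```

The same helper can be reused for the recommended version, for example version_at_least 5.3.0 "$running".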
Initialize the control-plane node via kubeadm init, set a pod network CIDR and skip the kube-proxy add-on.
Afterwards, join worker nodes by specifying the control-plane node IP address and the token returned by kubeadm init:
kubeadm join <..>
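The kubeadm init invocation described above could be sketched as follows. The pod network CIDR 10.217.0.0/16 is an assumption, chosen only because it matches the pod IP addresses shown later in this guide, and the helper name kubeadm_init_cmd is hypothetical. The --skip-phases=addon/kube-proxy flag tells kubeadm to skip installing the kube-proxy add-on. The sketch prints the command instead of running it, so that it can be reviewed first.

```shell
#!/bin/sh
# Assumed example value; pick a pod network CIDR that suits your environment.
POD_CIDR="${POD_CIDR:-10.217.0.0/16}"

# Print the kubeadm invocation rather than running it, so that it can be
# reviewed before being executed on the control-plane node.
kubeadm_init_cmd() {
  printf 'kubeadm init --pod-network-cidr=%s --skip-phases=addon/kube-proxy\n' "$1"
}

kubeadm_init_cmd "$POD_CIDR"
```

Run the printed command as root on the control-plane node once the CIDR has been adjusted.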
First, make sure you have Helm 3 installed.
If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).
Set up the helm repository:
helm repo add cilium https://helm.cilium.io/
Next, generate the required YAML files and deploy them. Important: Replace API_SERVER_IP and API_SERVER_PORT below with the concrete control-plane node IP address and the kube-apiserver port number reported by kubeadm init (usually, it is port 6443). Specifying this is necessary because kubeadm init is run explicitly without setting up kube-proxy. As a consequence, while it exports KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT with the ClusterIP of the kube-apiserver service to the environment, there is no kube-proxy in our setup provisioning that service. The Cilium agent therefore needs to be made aware of this information through the configuration below.
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=strict \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT
This will install Cilium as a CNI plugin with the BPF kube-proxy replacement to implement handling of Kubernetes services of type ClusterIP, NodePort, ExternalIPs and LoadBalancer.
Finally, verify that Cilium has come up correctly on all nodes and is ready to operate:
kubectl -n kube-system get pods -l k8s-app=cilium
NAME           READY   STATUS    RESTARTS   AGE
cilium-fmh8d   1/1     Running   0          10m
cilium-mkcmb   1/1     Running   0          10m
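On clusters with many nodes, scanning this table by eye gets tedious. The following sketch filters the kubectl table output for pods that are not fully ready. The function name not_ready_pods is hypothetical, and the parsing assumes the default column layout shown above (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Reads `kubectl get pods` table output on stdin and prints the name of every
# pod that is not fully ready, i.e. READY is not n/n or STATUS is not Running.
not_ready_pods() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if (ready[1] != ready[2] || $3 != "Running") print $1
       }'
}

# Sample input copied from the output shown above.
sample='NAME           READY   STATUS    RESTARTS   AGE
cilium-fmh8d   1/1     Running   0          10m
cilium-mkcmb   1/1     Running   0          10m'

if [ -z "$(printf '%s\n' "$sample" | not_ready_pods)" ]; then
  echo "all Cilium pods are ready"
fi
```

Against a live cluster, the same function can be fed directly: kubectl -n kube-system get pods -l k8s-app=cilium | not_ready_pods.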
Note that in the above helm configuration, kubeProxyReplacement has been set to strict mode. This means that the Cilium agent will bail out if the underlying Linux kernel support is missing.
Without an explicitly specified kubeProxyReplacement option, helm uses probe by default, which automatically disables a subset of the kube-proxy replacement features instead of bailing out if the kernel support is missing. This makes the assumption that Cilium’s BPF kube-proxy replacement co-exists with kube-proxy on the system to optimize Kubernetes services. Given that we have used kubeadm to explicitly deploy a kube-proxy-free setup, strict mode has been used instead to ensure that we do not rely on a (non-existent) fallback.
Cilium’s BPF kube-proxy replacement is supported in direct routing as well as in tunneling mode.
Validate the Setup¶
After deploying Cilium with the above Quick-Start guide, we can first validate that the Cilium agent is running in the desired mode:
kubectl exec -it -n kube-system cilium-fmh8d -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict   [NodePort (SNAT, 30000-32767), ExternalIPs, HostReachableServices (TCP, UDP)]
As a next, optional step, we deploy nginx pods, create a new NodePort service and validate that Cilium installed the service correctly.
The following yaml is used for the backend pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80
Verify that the nginx pods are up and running:
kubectl get pods -l run=my-nginx -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE   NOMINATED NODE   READINESS GATES
my-nginx-756fb87568-gmp8c   1/1     Running   0          62m   10.217.0.149   apoc   <none>           <none>
my-nginx-756fb87568-n5scv   1/1     Running   0          62m   10.217.0.107   apoc   <none>           <none>
In the next step, we create a NodePort service for the two instances:
kubectl expose deployment my-nginx --type=NodePort --port=80
service/my-nginx exposed
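For setups that keep manifests under version control, a declarative Service definition can be used instead of kubectl expose. The following is a sketch of a roughly equivalent manifest, assuming the run: my-nginx label from the deployment above; the file location is a hypothetical choice.

```shell
#!/bin/sh
# Hypothetical file location; any writable path works.
manifest="${TMPDIR:-/tmp}/my-nginx-svc.yaml"

# NodePort service selecting the nginx pods by their "run: my-nginx" label.
# No nodePort is set, so Kubernetes allocates one from the NodePort range.
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: NodePort
  selector:
    run: my-nginx
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
EOF

echo "wrote $manifest"
```

The manifest can then be applied with kubectl apply -f on the generated file.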
Verify that the NodePort service has been created:
kubectl get svc my-nginx
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.104.239.135   <none>        80:31940/TCP   24m
With the help of the cilium service list command, we can validate that Cilium’s BPF kube-proxy replacement created the new NodePort service under port 31940:
kubectl exec -it -n kube-system cilium-fmh8d -- cilium service list
ID   Frontend               Service Type   Backend
[...]
4    10.104.239.135:80      ClusterIP      1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
5    10.217.0.181:31940     NodePort       1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
6    0.0.0.0:31940          NodePort       1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
7    192.168.178.29:31940   NodePort       1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
At the same time, we can verify through iptables in the host namespace that no iptables rule for the service is present:
iptables-save | grep KUBE-SVC
[ empty line ]
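For scripted checks, counting matching rules is easier to act on than inspecting empty output. A minimal sketch, with the hypothetical helper name kube_svc_rule_count:

```shell
#!/bin/sh
# Reads iptables-save output on stdin and prints the number of lines that
# mention a KUBE-SVC chain. `grep -c` exits non-zero when the count is 0,
# so the exit status is neutralized with `|| true`.
kube_svc_rule_count() {
  grep -c 'KUBE-SVC' || true
}

# Demonstration with empty input, which corresponds to the kube-proxy-free
# result shown above.
count="$(printf '' | kube_svc_rule_count)"
echo "KUBE-SVC rules found: $count"
```

On a node, the function would be used as iptables-save | kube_svc_rule_count, which typically requires root privileges.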
Last but not least, a simple curl test shows connectivity for the exposed NodePort port 31940 as well as for the ClusterIP:
curl 127.0.0.1:31940
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[....]
curl 10.104.239.135:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[....]
As can be seen, Cilium’s BPF kube-proxy replacement is set up correctly.
This section covers a few advanced configuration modes for the kube-proxy replacement that go beyond the above Quick-Start guide and are entirely optional.
Direct Server Return (DSR)¶
By default, Cilium’s BPF NodePort implementation operates in SNAT mode. That is, when node-external traffic arrives and the node determines that the backend for the NodePort or ExternalIPs service is at a remote node, then the node redirects the request to the remote backend on its behalf by performing SNAT. This does not require any additional MTU changes, at the cost that replies from the backend need to make the extra hop back to that node in order to perform the reverse SNAT translation there before returning the packet directly to the external client.
This setting can be changed through the global.nodePort.mode helm option to dsr in order to let Cilium’s BPF NodePort implementation operate in DSR mode. In this mode, the backends reply directly to the external client without taking the extra hop, meaning that backends reply using the service IP/port as the source. DSR currently requires Cilium to be deployed in Direct / Native Routing Mode, i.e. it will not work in either tunneling mode.
Another advantage in DSR mode is that the client’s source IP is preserved, so policy can match on it at the backend node. In the SNAT mode this is not possible. Given a specific backend can be used by multiple services, the backends need to be made aware of the service IP/port which they need to reply with. Therefore, Cilium encodes this information as an IPv4 option or IPv6 extension header at the cost of advertising a lower MTU. For TCP services, Cilium only encodes the service IP/port for the SYN packet.
The above helm example configuration in a kube-proxy-free environment with DSR enabled would look as follows:
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.tunnel=disabled \
    --set global.autoDirectNodeRoutes=true \
    --set global.kubeProxyReplacement=strict \
    --set global.nodePort.mode=dsr \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT
NodePort Device, Port and Bind settings¶
By default, the NodePort implementation prevents application bind(2) requests to NodePort service ports. In such a case, the application will typically see a bind: Operation not permitted error. This happens either globally for older kernels or, starting from v5.7 kernels, only for the host namespace by default, and therefore no longer affects any application pod bind(2) requests. In order to opt out of this behavior in general, this setting can be changed for expert users by switching
kube-proxy Hybrid Modes¶
Cilium’s BPF kube-proxy replacement can be configured in several modes, i.e. it can replace kube-proxy entirely or it can co-exist with kube-proxy on the system if the underlying Linux kernel requirements do not support a full kube-proxy replacement.
This section therefore elaborates on the various global.kubeProxyReplacement options:
global.kubeProxyReplacement=strict: This option expects a kube-proxy-free Kubernetes setup where Cilium is expected to fully replace all kube-proxy functionality. Once the Cilium agent is up and running, it takes care of handling Kubernetes services of type ClusterIP, NodePort, ExternalIPs and LoadBalancer. If the underlying kernel version requirements are not met (see Kubernetes without kube-proxy note), then the Cilium agent will bail out on start-up with an error message.
global.kubeProxyReplacement=probe: This option is intended for a hybrid setup, that is, kube-proxy is running in the Kubernetes cluster where Cilium partially replaces and optimizes kube-proxy functionality. Once the Cilium agent is up and running, it probes the underlying kernel for the availability of needed BPF kernel features and, if not present, disables a subset of the functionality in BPF by relying on kube-proxy to complement the remaining Kubernetes service handling. The Cilium agent will emit an info message into its log in such case. For example, if the kernel does not support Host-Reachable Services, then the ClusterIP translation for the node’s host-namespace is done through kube-proxy’s iptables rules.
global.kubeProxyReplacement=partial: Similarly to probe, this option is intended for a hybrid setup, that is, kube-proxy is running in the Kubernetes cluster where Cilium partially replaces and optimizes kube-proxy functionality. As opposed to probe, which checks the underlying kernel for available BPF features and automatically disables components responsible for the BPF kube-proxy replacement when kernel support is missing, the partial option requires the user to manually specify which components for the BPF kube-proxy replacement should be used. When global.kubeProxyReplacement is set to partial, make sure to also set the helm option for the NodePort health check server to false, so that the Cilium agent does not start the NodePort health check server. Similarly to strict mode, the Cilium agent will bail out on start-up with an error message if the underlying kernel requirements are not met. For fine-grained configuration, global.hostServices.enabled, global.nodePort.enabled, global.externalIPs.enabled and global.hostPort.enabled can be set to true. By default all four options are set to false. A few example configurations for the partial option are provided below.
The following helm setup would be equivalent to global.kubeProxyReplacement=strict in a kube-proxy-free environment:
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=partial \
    --set global.hostServices.enabled=true \
    --set global.nodePort.enabled=true \
    --set global.externalIPs.enabled=true \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT
The following helm setup would be equivalent to the default Cilium service handling in v1.6 or earlier in a kube-proxy environment, that is, serving ClusterIP for pods:
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=partial
The following helm setup would optimize Cilium’s ClusterIP handling for TCP in a kube-proxy environment:
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=partial \
    --set global.hostServices.enabled=true \
    --set global.hostServices.protocols=tcp
The following helm setup would optimize Cilium’s NodePort and ExternalIPs handling for external traffic ingressing into the Cilium managed node in a kube-proxy environment:
helm install cilium cilium/cilium --version 1.7.15 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=partial \
    --set global.nodePort.enabled=true \
    --set global.externalIPs.enabled=true
global.kubeProxyReplacement=disabled: This option disables any Kubernetes service handling by fully relying on kube-proxy instead, except for ClusterIP services accessed from pods if cilium-agent’s flag --disable-k8s-services is set to false.
In Cilium’s helm chart, the default mode is global.kubeProxyReplacement=probe.
For existing Cilium deployments in version v1.6 or prior, please consult the 1.7 Upgrade Notes.
The current Cilium kube-proxy replacement mode can also be introspected through the
cilium status CLI command:
kubectl exec -it -n kube-system cilium-xxxxx -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict   [NodePort (SNAT, 30000-32767), ExternalIPs, HostReachableServices (TCP, UDP)]
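When this check is automated, for example in a post-install script, the mode can be extracted from the status line. The sketch below assumes the line layout shown above, with the mode as the first word after the KubeProxyReplacement: label; the function name kpr_mode is hypothetical, and only the Strict value is taken from this guide.

```shell
#!/bin/sh
# Reads `cilium status` output on stdin and prints the kube-proxy replacement
# mode, i.e. the first word after the "KubeProxyReplacement:" label.
kpr_mode() {
  awk '$1 == "KubeProxyReplacement:" { print $2; exit }'
}

# Sample line copied from the output shown above.
sample='KubeProxyReplacement:   Strict   [NodePort (SNAT, 30000-32767), ExternalIPs, HostReachableServices (TCP, UDP)]'

mode="$(printf '%s\n' "$sample" | kpr_mode)"
case "$mode" in
  Strict) echo "strict mode: Cilium is expected to replace kube-proxy entirely" ;;
  *)      echo "mode '$mode': kube-proxy may still be required" ;;
esac
```

Against a live cluster, the output of kubectl exec on a Cilium pod running cilium status can be piped into kpr_mode in place of the sample line.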
- NodePort and ExternalIPs services are currently exposed through the native device which has the default route on the host or a user specified device. In tunneling mode, they are additionally exposed through the tunnel interface (cilium_vxlan or cilium_geneve). Exposing services through multiple native devices will be supported in upcoming Cilium versions. See GH issue 9620 for additional details.
- Cilium’s BPF kube-proxy replacement currently cannot be used with Transparent Encryption (stable/beta).
- Cilium’s BPF kube-proxy replacement relies upon the Host-Reachable Services feature which uses BPF cgroup hooks to implement the service translation. The getpeername(2) hook is currently missing which will be addressed for newer kernels. It is known to currently not work with libceph deployments.
- Cilium in general currently does not support IP de-/fragmentation. This also includes the BPF kube-proxy replacement. Meaning, while the first packet with L4 header will reach the backend, all subsequent packets will not due to service lookup failing. This will be addressed via GH issue 10076.
- Kubernetes Service sessionAffinity is currently not implemented. This will be addressed via GH issue 9076.
- The BPF kube-proxy replacement currently cannot be used in combination with CNI chaining, e.g. when deployed with --set global.cni.chainingMode=portmap. Future Cilium versions are going to provide native portmap support via BPF, and therefore without the need for chaining; tracked via GH issue 10359.