L2 Announcements / L2 Aware LB (Beta)

Note

This is a beta feature. Please provide feedback and file a GitHub issue if you experience any problems.

L2 Announcements is a feature which makes services visible and reachable on the local area network. This feature is primarily intended for on-premises deployments within networks without BGP based routing such as office or campus networks.

When used, this feature will respond to ARP queries for ExternalIPs and/or LoadBalancer IPs. These IPs are Virtual IPs (not installed on network devices) shared by multiple nodes, so for each service one node at a time answers the ARP queries with its MAC address. This node performs load balancing with the service load balancing feature, thus acting as a north/south load balancer.

The advantage of this feature over NodePort services is that each service can use a unique IP so multiple services can use the same port numbers. When using NodePorts, it is up to the client to decide to which host to send traffic, and if a node goes down, the IP+Port combo becomes unusable. With L2 announcements the service VIP simply migrates to another node and will continue to work.
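
To illustrate, the sketch below (service names and ports are hypothetical) shows two LoadBalancer services that both expose port 80; with L2 announcements each service is reached through its own VIP, so the two do not conflict the way two NodePort services on the same port would:

apiVersion: v1
kind: Service
metadata:
  name: web-frontend   # hypothetical service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: web-frontend
---
apiVersion: v1
kind: Service
metadata:
  name: web-backend    # hypothetical service
spec:
  type: LoadBalancer
  ports:
  - port: 80           # same port, different VIP
    targetPort: 8080
  selector:
    app: web-backend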

The L2 Announcements feature and all the requirements can be enabled as follows:

$ helm upgrade cilium ./cilium \
   --namespace kube-system \
   --reuse-values \
   --set l2announcements.enabled=true \
   --set k8sClientRateLimit.qps={QPS} \
   --set k8sClientRateLimit.burst={BURST} \
   --set kubeProxyReplacement=true \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT}

Warning

Sizing the client rate limit (k8sClientRateLimit.qps and k8sClientRateLimit.burst) is important when using this feature due to increased API usage. See Sizing client rate limit for sizing guidelines.

Prerequisites

  • Kube Proxy replacement mode must be enabled. For more information, see Kubernetes Without kube-proxy.

  • All devices over which L2 Aware LB will be announced should be enabled and, if the --devices flag or devices Helm option is set explicitly, included in it; see NodePort Devices, Port and Bind settings.

  • The externalIPs.enabled=true Helm option must be set if usage of externalIPs is desired; otherwise, service load balancing for external IPs is disabled. An illustrative Helm invocation covering these options follows this list.
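
As an illustrative sketch (the interface names eth0 and eth1 are assumptions for this example), the prerequisites above can be combined with the feature flags in a single Helm invocation:

$ helm upgrade cilium ./cilium \
   --namespace kube-system \
   --reuse-values \
   --set l2announcements.enabled=true \
   --set kubeProxyReplacement=true \
   --set devices='{eth0,eth1}' \
   --set externalIPs.enabled=true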

Limitations

  • The feature currently does not support IPv6/NDP.

  • Due to the way L3->L2 translation protocols work, one node receives all ARP requests for a specific IP, so no load balancing can happen before traffic hits the cluster.

  • The feature currently has no traffic balancing mechanism so nodes within the same policy might be asymmetrically loaded. For details see Leader Election.

Policies

Policies provide fine-grained control over which services should be announced, where, and how. This is an example policy using all optional fields:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: policy1
spec:
  serviceSelector:
    matchLabels:
      color: blue
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  interfaces:
  - ^eth[0-9]+
  externalIPs: true
  loadBalancerIPs: true

Service Selector

The service selector is a label selector that determines which services are selected by this policy. If no service selector is provided, all services are selected by the policy. A service must have loadBalancerClass unspecified or set to io.cilium/l2-announcer to be selected by a policy for announcement.
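
For example, a service matched by the policy shown above could look like the following sketch (the service name and ports are hypothetical); note the color: blue label and the optional loadBalancerClass:

apiVersion: v1
kind: Service
metadata:
  name: blue-service       # hypothetical service
  labels:
    color: blue            # matched by the serviceSelector of policy1
spec:
  type: LoadBalancer
  loadBalancerClass: io.cilium/l2-announcer   # may also be left unspecified
  ports:
  - port: 80
    targetPort: 8080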

There are a few special purpose selector fields which don’t match on labels but instead on other metadata like .meta.name or .meta.namespace.

Selector                           Field
io.kubernetes.service.namespace    .meta.namespace
io.kubernetes.service.name         .meta.name
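
As a sketch, these special purpose selectors can be combined with regular labels in a serviceSelector; the namespace value below is hypothetical:

  serviceSelector:
    matchLabels:
      io.kubernetes.service.namespace: tenant-a   # hypothetical namespace
      color: blue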

Node Selector

The node selector field is a label selector which determines which nodes are candidates to announce the services from.

It might be desirable to pick a subset of nodes in your cluster, since the chosen node (see Leader Election) will act as the north/south load balancer for all of the traffic for a particular service.
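
For instance, announcements could be limited to nodes carrying a dedicated label; the label below is hypothetical and would have to be applied to the chosen nodes first (for example with kubectl label node):

  nodeSelector:
    matchLabels:
      l2-announcements: enabled   # hypothetical label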

Interfaces

The interfaces field is a list of regular expressions (golang syntax) that determine over which network interfaces the selected services will be announced. This field is optional; if not specified, all interfaces will be used.

The expressions are OR-ed together, so any network device matching any of the expressions will be matched.

L2 announcements only work if the selected devices are also part of the set of devices specified in the devices Helm option, see NodePort Devices, Port and Bind settings.
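
For example, to announce over both plain NICs and bonded interfaces, multiple patterns can be listed (interface naming is environment specific, so treat these as assumptions):

  interfaces:
  - ^eth[0-9]+
  - ^bond[0-9]+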

Note

This selector is NOT a security feature; services will still be available via interfaces when not advertised (for example, by hard-coding ARP entries).

IP Types

The externalIPs and loadBalancerIPs fields determine what sort of IPs are announced. They are both set to false by default, so a functional policy should always have one or both set to true.

If externalIPs is true, all IPs in the .spec.externalIPs field are announced. These IPs are managed by service authors.

If loadBalancerIPs is true, all IPs in the service’s .status.loadBalancer.ingress field are announced. These can be assigned by LoadBalancer IP Address Management (LB IPAM), which can be configured by cluster admins for better control over which IPs can be allocated.
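
To make the distinction concrete, the following sketch shows a service with a manually managed external IP and an LB IPAM assigned ingress IP (the name and addresses are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: example-service    # hypothetical service
spec:
  type: LoadBalancer
  externalIPs:
  - 192.0.2.10             # announced when the policy sets externalIPs: true
  ports:
  - port: 80
status:
  loadBalancer:
    ingress:
    - ip: 192.0.2.20       # announced when the policy sets loadBalancerIPs: true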

Note

If a user intends to use externalIPs, the externalIPs.enabled=true Helm option should be set to enable service load balancing for external IPs.

Status

If a policy is invalid for any number of reasons, the status of the policy will reflect that. For example if an invalid match expression is provided:

$ kubectl describe l2announcement
Name:         policy1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  cilium.io/v2alpha1
Kind:         CiliumL2AnnouncementPolicy
Metadata:
  #[...]
Spec:
  #[...]
  Service Selector:
    Match Expressions:
      Key:       something
      Operator:  NotIn
      Values:
Status:
  Conditions:
    Last Transition Time:  2023-05-12T15:39:01Z
    Message:               values: Invalid value: []string(nil): for 'in', 'notin' operators, values set can't be empty
    Observed Generation:   1
    Reason:                error
    Status:                True
    Type:                  io.cilium/bad-service-selector

The status of these error conditions will go to False as soon as the user updates the policy to resolve the error.
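
To query the conditions directly, for example from a script, something like the following can be used (policy1 being the policy name from the example above):

$ kubectl get l2announcement policy1 -o jsonpath='{.status.conditions}'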

Leader Election

Due to the way ARP/NDP works, hosts only store one MAC address per IP, that being the latest reply they see. This means that only one node in the cluster is allowed to reply to requests for a given IP.

To implement this behavior, every Cilium agent resolves which services are selected for its node and starts participating in leader election for each of them. We use the Kubernetes lease mechanism to achieve this. Each service translates to a lease; the lease holder starts replying to requests on the selected interfaces.

The lease mechanism is a first come, first served picking order, so the first node to claim a lease gets it. This might cause asymmetric traffic distribution.

Leases

The leases are created in the same namespace where Cilium is deployed, typically kube-system. You can inspect the leases with the following command:

$ kubectl -n kube-system get lease
NAME                                  HOLDER                                                    AGE
cilium-l2announce-default-deathstar   worker-node                                               2d20h
cilium-operator-resource-lock         worker-node2-tPDVulKoRK                                   2d20h
kube-controller-manager               control-plane-node_9bd97f6c-cd0c-4565-8486-e718deb310e4   2d21h
kube-scheduler                        control-plane-node_2c490643-dd95-4f73-8862-139afe771ffd   2d21h

The leases starting with cilium-l2announce- are leases used by this feature. The last part of the name is the namespace and service name. The holder indicates the name of the node that currently holds the lease and thus announces the IPs of that given service.

To inspect a lease:

$ kubectl -n kube-system get lease/cilium-l2announce-default-deathstar -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-05-09T15:13:32Z"
  name: cilium-l2announce-default-deathstar
  namespace: kube-system
  resourceVersion: "449966"
  uid: e3c9c020-6e24-4c5c-9df9-d0c50f6c4cec
spec:
  acquireTime: "2023-05-09T15:14:20.108431Z"
  holderIdentity: worker-node
  leaseDurationSeconds: 3
  leaseTransitions: 1
  renewTime: "2023-05-12T12:15:26.773020Z"

The acquireTime is the time at which the current leader acquired the lease. The holderIdentity is the name of the current holder/leader node. If the leader does not renew the lease for leaseDurationSeconds seconds a new leader is chosen. leaseTransitions indicates how often the lease changed hands and renewTime the last time the leader renewed the lease.

There are three Helm options that can be tuned with regards to leases:

  • l2announcements.leaseDuration determines the leaseDurationSeconds value of created leases and, by extension, how long a leader must be “down” before failover occurs. Its default value is 15s; it must always be greater than 1s and larger than leaseRenewDeadline.

  • l2announcements.leaseRenewDeadline is the interval at which the leader should renew the lease. Its default value is 5s; it must be greater than leaseRetryPeriod by at least 20% and is not allowed to be below 1ns.

  • l2announcements.leaseRetryPeriod determines how long the agent waits before retrying if renewing the lease fails. Its default value is 2s; it must be smaller than leaseRenewDeadline by at least 20% and above 1ns.

Note

The theoretical shortest time between failure and failover is leaseDuration - leaseRenewDeadline and the longest leaseDuration + leaseRenewDeadline. So with the default values, failover occurs between 10s and 20s. For the example below, these times are between 2s and 4s.

$ helm upgrade cilium ./cilium \
   --namespace kube-system \
   --reuse-values \
   --set l2announcements.enabled=true \
   --set kubeProxyReplacement=true \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT} \
   --set k8sClientRateLimit.qps={QPS} \
   --set k8sClientRateLimit.burst={BURST} \
   --set l2announcements.leaseDuration=3s \
   --set l2announcements.leaseRenewDeadline=1s \
   --set l2announcements.leaseRetryPeriod=200ms
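
Spelled out in the same back-of-the-envelope style as the QPS calculation below, the failover bounds for the defaults and for the example above are:

// defaults
min failover = leaseDuration - leaseRenewDeadline = 15s - 5s = 10s
max failover = leaseDuration + leaseRenewDeadline = 15s + 5s = 20s

// example above
min failover = 3s - 1s = 2s
max failover = 3s + 1s = 4s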

There is a trade-off between fast failure detection and CPU + network usage. Each service incurs a CPU and network overhead, so clusters with smaller numbers of services can more easily afford faster failover times. Larger clusters might need to increase these parameters if the overhead is too high.

Sizing client rate limit

The leader election process continually generates API traffic; the exact amount depends on the configured lease duration, the configured renew deadline, and the number of services using the feature.

The default client rate limit is 5 QPS with allowed bursts up to 10 QPS. This default limit is quickly reached when utilizing L2 announcements, so users should size the client rate limit accordingly.

In a worst case scenario, services are distributed unevenly, so we will assume a peak load based on the renew deadline. In complex scenarios with multiple policies over disjoint sets of nodes, the max QPS per node will be lower.

QPS = #services * (1 / leaseRenewDeadline)

// example
#services = 65
leaseRenewDeadline = 2s
QPS = 65 * (1 / 2s) = 32.5 QPS

Setting the base QPS to around the calculated value should be sufficient, given that in multi-node scenarios leases are spread across nodes, and non-holders participating in the election have a lower QPS.

The burst QPS should be slightly higher to allow for bursts of traffic caused by other features which also use the API server.
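
Continuing the worked example above (65 services, a 2s renew deadline, so roughly 33 QPS), a sketch of the resulting settings might be as follows; the burst value is a judgment call, not a prescribed number:

$ helm upgrade cilium ./cilium \
   --namespace kube-system \
   --reuse-values \
   --set k8sClientRateLimit.qps=33 \
   --set k8sClientRateLimit.burst=40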

Failover

When nodes participating in leader election detect that the lease holder did not renew the lease for leaseDurationSeconds amount of seconds, they will ask the API server to make them the new holder. The first request to be processed gets through and the rest are denied.

When a node becomes the leader/holder, it will send out a gratuitous ARP reply over all of the configured interfaces. Clients that accept these will update their ARP tables at once, causing them to send traffic to the new leader/holder. Not all clients accept gratuitous ARP replies, since they can be used for ARP spoofing. Such clients might experience longer downtime than configured in the leases, since they will only re-query via ARP once the TTL of their internal table entries has been reached.
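
On a Linux client, which MAC address a service VIP currently resolves to can be checked in the neighbour table, for example (the VIP below is hypothetical):

$ ip neigh show | grep 192.0.2.20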

Note

Since this feature has no IPv6 support yet, only ARP messages are sent, no Unsolicited Neighbor Advertisements are sent.

L2 Pod Announcements

L2 Pod Announcements announce Pod IP addresses on the L2 network using Gratuitous ARP replies. When enabled, the node transmits Gratuitous ARP replies for every locally created pod on the configured network interface. This feature is enabled separately from the L2 announcements feature described above.

To enable L2 Pod Announcements, set the following:

$ helm upgrade cilium ./cilium \
   --namespace kube-system \
   --reuse-values \
   --set l2podAnnouncements.enabled=true \
   --set l2podAnnouncements.interface=eth0

Note

Since this feature has no IPv6 support yet, only ARP messages are sent, no Unsolicited Neighbor Advertisements are sent.