BGP Control Plane Resources

The Cilium BGP Control Plane is managed by a set of custom resources that provide a flexible way to configure BGP peers, policies, and advertisements.

The following resources are used to manage the BGP Control Plane:

  • CiliumBGPClusterConfig: Defines BGP instances and peer configurations that are applied to multiple nodes.

  • CiliumBGPPeerConfig: A common set of BGP peering settings that can be shared across multiple peers.

  • CiliumBGPAdvertisement: Defines prefixes that are injected into the BGP routing table.

  • CiliumBGPNodeConfigOverride: Defines node-specific BGP configuration to provide finer control.

The relationship between these resources is shown in the following diagram:

[Figure: bgpv2.png, relationship between the BGP Control Plane resources]

BGP Cluster Configuration

The CiliumBGPClusterConfig resource defines the BGP configuration for one or more nodes in the cluster, selected by its nodeSelector field. Each CiliumBGPClusterConfig defines one or more BGP instances, which are uniquely identified by their name field.

A BGP instance can have one or more peers. Each peer is uniquely identified by its name field. The peer's autonomous system number and address are defined by the peerASN and peerAddress fields, respectively. The configuration of a peer is defined by the peerConfigRef field, which references a peer configuration resource. The group and kind fields in peerConfigRef are optional and default to cilium.io and CiliumBGPPeerConfig, respectively.

By default, the BGP Control Plane instantiates each router instance without a listening port. This means the BGP router can initiate connections to the configured peers but cannot accept incoming connections. This is the default because the BGP Control Plane is designed to function in environments where another BGP router (such as BIRD) is running on the same node. When incoming connections must be accepted, use the localPort field to specify the listening port.

Warning

CiliumBGPPeeringPolicy and CiliumBGPClusterConfig should not be used together. If both resources are present and the node of a Cilium agent matches the node selector of both, CiliumBGPPeeringPolicy takes precedence.

Warning

Listening on the default BGP port (179) requires CAP_NET_BIND_SERVICE. To use the default port, grant the CAP_NET_BIND_SERVICE capability through the securityContext.capabilities.ciliumAgent Helm value.

Here is an example configuration of the CiliumBGPClusterConfig with a BGP instance named instance-65000 and two peers configured under this BGP instance.

apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  bgpInstances:
  - name: "instance-65000"
    localASN: 65000
    localPort: 179
    peers:
    - name: "peer-65000-tor1"
      peerASN: 65000
      peerAddress: fd00:10:0:0::1
      peerConfigRef:
        name: "cilium-peer"
    - name: "peer-65000-tor2"
      peerASN: 65000
      peerAddress: fd00:11:0:0::1
      peerConfigRef:
        name: "cilium-peer"
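
As a point of comparison, here is a minimal sketch of the default connect-only behavior: localPort is omitted, so the router only initiates sessions. The resource name, the rack1 label, and the peer address 192.0.2.1 (a documentation address) are placeholders chosen for illustration, not values from a real deployment.

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp-rack1          # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      rack: rack1                 # hypothetical label
  bgpInstances:
  - name: "instance-65000"
    localASN: 65000
    # No localPort: the instance does not listen and only dials out.
    peers:
    - name: "peer-65000-tor1"
      peerASN: 65000
      peerAddress: 192.0.2.1      # placeholder address
      peerConfigRef:
        name: "cilium-peer"
```

Because nothing binds to port 179 here, this variant should not need the CAP_NET_BIND_SERVICE capability mentioned above.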

BGP Peer Configuration

The CiliumBGPPeerConfig resource defines a BGP peer configuration. Multiple peers can share the same configuration by referencing a common CiliumBGPPeerConfig resource.

The CiliumBGPPeerConfig resource contains configuration options for:

  • MD5 Password

  • Timers

  • EBGP Multihop

  • Graceful Restart

  • Transport

  • Address Families

Here is an example configuration of the CiliumBGPPeerConfig resource. The following sections go over each configuration option.

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  timers:
    holdTimeSeconds: 9
    keepAliveTimeSeconds: 3
  authSecretRef: bgp-auth-secret
  ebgpMultihop: 4
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

MD5 Password

The authSecretRef field in CiliumBGPPeerConfig configures an RFC-2385 TCP MD5 password on the session with any BGP peer that references this configuration.

Here is an example of setting authSecretRef:

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  authSecretRef: bgp-auth-secret

The authSecretRef field should reference the name of a secret in the BGP secrets namespace (kube-system by default when using the Helm chart). The secret should contain a key named password.

BGP secrets are limited to a configured namespace to keep the permissions needed on each Cilium Agent instance to a minimum. The Helm chart will configure Cilium to be able to read from it by default.

An example of creating a secret is:

$ kubectl create secret generic -n kube-system --type=string bgp-auth-secret --from-literal=password=my-secret-password
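
The same kind of secret can also be written declaratively. The manifest below is a sketch, assuming the kube-system namespace and the bgp-auth-secret name used by the authSecretRef example above; the Opaque type and the password value are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bgp-auth-secret          # must match authSecretRef in CiliumBGPPeerConfig
  namespace: kube-system         # the BGP secrets namespace (Helm chart default)
type: Opaque                     # assumption: generic secret type
stringData:
  password: my-secret-password   # placeholder; the key must be named "password"
```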

To change the namespace, set the bgpControlPlane.secretsNamespace.name Helm chart value. To have the namespace created automatically, set the bgpControlPlane.secretsNamespace.create Helm chart value to true.

Because TCP MD5 passwords sign the header of the packet, they cannot be used if the session is address-translated by Cilium (in other words, the Cilium Agent’s pod IP address must be the address that the BGP peer sees).

If the password is incorrect, or if the header is otherwise changed, then the TCP connection will not succeed. This will appear as dial: i/o timeout in the Cilium Agent’s logs rather than a more specific error message.

If a CiliumBGPPeerConfig is deployed with an authSecretRef that Cilium cannot find, the BGP session will use an empty password and the agent will log an error such as in the following example:

level=error msg="Failed to fetch secret \"secretname\": not found (will continue with empty password)" component=manager.fetchPeerPassword subsys=bgp-control-plane

Timers

The BGP Control Plane supports modifying the following BGP timer parameters. For a more detailed description of each timer, refer to RFC-4271.

Name                 Field                      Default (seconds)
ConnectRetryTimer    connectRetryTimeSeconds    120
HoldTimer            holdTimeSeconds            90
KeepaliveTimer       keepAliveTimeSeconds       30

In datacenter networks where Kubernetes clusters are deployed, it is generally recommended to lower the HoldTimer and KeepaliveTimer for faster failure detection. For example, you can set the minimum possible values holdTimeSeconds=9 and keepAliveTimeSeconds=3.

To ensure fast reconnection after losing connectivity with the peer, reduce connectRetryTimeSeconds (for example, to 5 or less). Because random jitter is applied to the configured value internally, the actual value used for the ConnectRetryTimer lies in the interval [connectRetryTimeSeconds, 2 * connectRetryTimeSeconds). With connectRetryTimeSeconds set to 5, for example, the effective timer is at least 5 and less than 10 seconds.

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  timers:
    connectRetryTimeSeconds: 5
    holdTimeSeconds: 9
    keepAliveTimeSeconds: 3

EBGP Multihop

By default, the IP TTL of BGP packets is set to 1 for eBGP sessions. Changing the TTL is generally discouraged, but some deployments require it. For example, when the BGP peer is a route server located in a different subnet, you may need to set the TTL to a value greater than 1.

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  ebgpMultihop: 4 # <-- specify the TTL value

Graceful Restart

The Cilium BGP Control Plane can be configured to act as a graceful restart Restarting Speaker. When you enable graceful restart, the BGP session restarts and the “graceful restart” capability is advertised in the BGP OPEN message.

In the event of a Cilium Agent restart, the peering BGP router does not withdraw routes received from the Cilium BGP control plane immediately. The datapath continues to forward traffic during Agent restart, so there is no traffic disruption.

Optionally, you can set the restartTimeSeconds parameter. RestartTime is the time advertised to the peer within which the Cilium BGP Control Plane is expected to re-establish the BGP session after a restart. When RestartTime expires, the peer removes the routes previously advertised by the Cilium BGP Control Plane.

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15

When the Cilium Agent restarts, it closes the BGP TCP socket, causing the emission of a TCP FIN packet. On receiving this TCP FIN, the peer changes its BGP state to Idle and starts its RestartTime timer.

The Cilium Agent's boot time varies depending on the deployment. If you set RestartTime, choose a duration greater than the time the Cilium Agent takes to boot.

The default value of RestartTime is 120 seconds. More details on graceful restart and RestartTime can be found in RFC-4724 and RFC-8538.

Transport

The transport section of CiliumBGPPeerConfig can be used to configure a custom destination port for a peer’s BGP session.

By default, when BGP is operating in active mode (with the Cilium agent initiating the TCP connection), the destination port is 179 and the source port is ephemeral.

Here is an example of setting the transport configuration:

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  transport:
    peerPort: 179
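
The example above spells out the default. As a sketch of the non-default case, assume a peer whose BGP daemon listens on port 1790 (an arbitrary port chosen for illustration):

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer-custom-port   # hypothetical name
spec:
  transport:
    peerPort: 1790   # destination port of the outgoing TCP connection
```

Every peer that references this resource through peerConfigRef is then contacted on port 1790 instead of 179.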

Address Families

The families field is a list of entries, each consisting of an AFI (Address Family Identifier) and SAFI (Subsequent Address Family Identifier) pair and an advertisement selector. The only AFI/SAFI combinations currently supported are {afi: ipv4, safi: unicast} and {afi: ipv6, safi: unicast}.

By default, if no address families are specified, the BGP Control Plane advertises both the IPv4 Unicast and IPv6 Unicast Multiprotocol Extensions capabilities (RFC-4760) to the peer.

In each address family, you can control route publication through the advertisements label selector. The available advertisement types are described in the BGP Advertisements section.

Note

Without matching advertisements, no prefix is advertised to the peer. By default, no prefix is advertised.

apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

BGP Advertisements

The CiliumBGPAdvertisement resource defines advertisement types and the attributes associated with them. The advertisements label selector defined in the families field of a peer configuration may match one or more CiliumBGPAdvertisement resources.
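
To illustrate how the selector picks resources, the sketch below defines two CiliumBGPAdvertisement resources. Assuming a peer configuration whose families[*].advertisements selector uses matchLabels advertise: "bgp", only the first resource is selected; the second carries a different label and is ignored for that peer. The resource names and the advertise: other label are made up for this example.

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr-selected        # hypothetical name
  labels:
    advertise: bgp               # matches the selector advertise: "bgp"
spec:
  advertisements:
    - advertisementType: "PodCIDR"
---
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr-not-selected    # hypothetical name
  labels:
    advertise: other             # does not match, so it is not used by that peer
spec:
  advertisements:
    - advertisementType: "PodCIDR"
```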

BGP Attributes

You can configure BGP path attributes for the prefixes advertised by the Cilium BGP Control Plane using the attributes field in advertisements[*]. Two types of path attributes can be advertised: Communities and LocalPreference.

Here is an example configuration of the CiliumBGPAdvertisement resource that advertises pod prefixes with the community value of “65000:99” and local preference of 99.

apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
      attributes:
        communities:
          standard: [ "65000:99" ]
        localPreference: 99

Community

Communities defines a set of community values advertised in the supported BGP Communities Path Attributes.

The values can be of three types:

  • Standard: represents a value of the “standard” 32-bit BGP Communities Attribute (RFC-1997) as a 4-byte decimal number or two 2-byte decimal numbers separated by a colon (for example: 64512:100).

  • WellKnown: represents a value of the “standard” 32-bit BGP Communities Attribute (RFC-1997) as a well-known string alias to its numeric value. Allowed values and their mapping to the numeric values are displayed in the following table:

    Well-Known Value              Hexadecimal Value    16-bit Pair Value
    internet                      0x00000000           0:0
    planned-shut                  0xffff0000           65535:0
    accept-own                    0xffff0001           65535:1
    route-filter-translated-v4    0xffff0002           65535:2
    route-filter-v4               0xffff0003           65535:3
    route-filter-translated-v6    0xffff0004           65535:4
    route-filter-v6               0xffff0005           65535:5
    llgr-stale                    0xffff0006           65535:6
    no-llgr                       0xffff0007           65535:7
    blackhole                     0xffff029a           65535:666
    no-export                     0xffffff01           65535:65281
    no-advertise                  0xffffff02           65535:65282
    no-export-subconfed           0xffffff03           65535:65283
    no-peer                       0xffffff04           65535:65284

  • Large: represents a value of the BGP Large Communities Attribute (RFC-8092), as three 4-byte decimal numbers separated by colons (for example: 64512:100:50).
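
The three community types can be combined in a single attributes block. The following sketch reuses the example values from the list above; the resource name is hypothetical, and the wellKnown and large field names are assumed to follow the same pattern as the standard field shown earlier.

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-communities   # hypothetical name
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
      attributes:
        communities:
          standard: [ "64512:100" ]      # two 2-byte decimal numbers
          wellKnown: [ "no-export" ]     # alias for 65535:65281
          large: [ "64512:100:50" ]      # three 4-byte decimal numbers
```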

Local Preference

LocalPreference defines the preference value advertised in the BGP Local Preference Path Attribute. As Local Preference is only valid for iBGP peers, this value will be ignored for eBGP peers (no Local Preference Path Attribute will be advertised).

BGP Configuration Override

The CiliumBGPNodeConfigOverride resource can be used to override some of the auto-generated configuration on a per-node basis.

Here is an example of the CiliumBGPNodeConfigOverride resource that sets the Router ID, listening port, local autonomous system number, and per-peer local addresses for the node named bgpv2-cplane-dev-multi-homing-worker.

apiVersion: cilium.io/v2
kind: CiliumBGPNodeConfigOverride
metadata:
  name: bgpv2-cplane-dev-multi-homing-worker
spec:
  bgpInstances:
    - name: "instance-65000"
      routerID: "192.168.10.1"
      localPort: 1790
      localASN: 65010
      peers:
        - name: "peer-65000-tor1"
          localAddress: fd00:10:0:2::2
        - name: "peer-65000-tor2"
          localAddress: fd00:11:0:2::2

Note

The name of the CiliumBGPNodeConfigOverride resource must match the name of the node for which the configuration is intended. Similarly, the names of the BGP instances and peers must match those defined in CiliumBGPClusterConfig.

This is a per node configuration.

RouterID

The bgpControlPlane.routerIDAllocation.mode Helm chart value determines how the Router ID is allocated. Currently, only default is supported. In default mode, when Cilium runs in IPv4 single-stack or dual-stack mode, the BGP Control Plane can use the IPv4 address assigned to the node as the BGP Router ID: the Router ID is 32 bits long, and the uniqueness of the IPv4 address makes the Router ID unique. When running in IPv6 single-stack mode, the lower 32 bits of the MAC address of the cilium_host interface are used as the Router ID. If automatic assignment of the Router ID is not desired, the administrator needs to define it manually.

To configure a custom Router ID, set the routerID field to a value in IPv4 address format.
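
For example, on an IPv6 single-stack node you might prefer an explicit Router ID over the MAC-derived one. A minimal sketch, with a hypothetical node name and a documentation address as the Router ID:

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPNodeConfigOverride
metadata:
  name: worker-1                 # hypothetical; must equal the node name
spec:
  bgpInstances:
    - name: "instance-65000"     # must match the instance in CiliumBGPClusterConfig
      routerID: "192.0.2.10"     # 32-bit ID in IPv4 dotted notation; placeholder value
```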

Listening Port

The localPort field in the CiliumBGPClusterConfig can be used to specify the listening port. To override it on a per-node basis, set the localPort field in the CiliumBGPNodeConfigOverride resource. This works even if the localPort field is not set in the CiliumBGPClusterConfig.
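
A sketch of that last case: the cluster-wide configuration leaves localPort unset, and a single node (hypothetically named worker-1) opts in to listening on port 1790. Peer definitions are trimmed to the minimum needed for illustration.

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  bgpInstances:
  - name: "instance-65000"
    localASN: 65000              # no localPort here: nodes do not listen by default
    peers:
    - name: "peer-65000-tor1"
      peerASN: 65000
      peerAddress: fd00:10:0:0::1
      peerConfigRef:
        name: "cilium-peer"
---
apiVersion: cilium.io/v2
kind: CiliumBGPNodeConfigOverride
metadata:
  name: worker-1                 # hypothetical node name
spec:
  bgpInstances:
    - name: "instance-65000"
      localPort: 1790            # only this node accepts incoming sessions
```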

Local Peering Address

The source interface and address used by the BGP Control Plane to set up peering with the neighbor are based on a route lookup of the peer address defined in CiliumBGPClusterConfig. In some cases, multiple links are present on the node and you want tighter control over which link BGP peering is set up on.

To configure the source address, the peers[*].localAddress field can be set. It should be an address configured on one of the links on the node.

Local ASN

You can override the Autonomous System Number (ASN) of a node using the localASN field of the CiliumBGPNodeConfigOverride resource. When this field is not defined, the localASN from the matching CiliumBGPClusterConfig is used as the local ASN for the node. This customization allows individual nodes to operate with a different ASN when required by the network design.
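
As a sketch, suppose the matching CiliumBGPClusterConfig sets localASN: 65000 for instance-65000, as in the cluster configuration example earlier. The override below makes one node, hypothetically named worker-1, use 65010 instead:

```yaml
apiVersion: cilium.io/v2
kind: CiliumBGPNodeConfigOverride
metadata:
  name: worker-1                 # hypothetical node name
spec:
  bgpInstances:
    - name: "instance-65000"
      localASN: 65010            # replaces the cluster-wide 65000 on this node only
```

Note that if the peers are configured with peerASN: 65000, sessions from this node become eBGP rather than iBGP, which affects attributes such as Local Preference.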

Sample Configurations

Refer to the containerlab examples in the Cilium repository under contrib/containerlab/bgpv2.