Cilium BGP Control Plane (Beta)
BGP Control Plane provides a way for Cilium to advertise routes to connected routers by using the
Border Gateway Protocol (BGP). BGP Control Plane makes Pod networks and/or Services of type
LoadBalancer reachable from outside the cluster for environments that support BGP. Because BGP
Control Plane does not program the datapath, do not use it to establish
reachability within the cluster.
Usage
Currently a single flag in the Cilium Agent exists to turn on the
BGP Control Plane feature set.
--enable-bgp-control-plane=true
If using Helm charts instead, the relevant values are the following:
bgpControlPlane:
  enabled: true
Note
The BGP Control Plane feature is mutually exclusive with the older MetalLB-based BGP feature (MetalLB BGP ControlPlane, deprecated). To use the BGP Control Plane, the MetalLB-based feature has to be disabled. In other words, enabling this feature does not switch the existing BGP implementation from MetalLB to GoBGP.
When set to true the BGP Control Plane Controllers will be
instantiated and will begin listening for CiliumBGPPeeringPolicy
events.
Currently, the BGP Control Plane will only work when IPAM mode is set to
“cluster-pool”, “kubernetes”, or “multi-pool”.
CiliumBGPPeeringPolicy CRD
All BGP peering topology information is carried in a
CiliumBGPPeeringPolicy CRD.
CiliumBGPPeeringPolicy can be applied to one or more nodes based on
its nodeSelector fields.
A Cilium node may only have a single CiliumBGPPeeringPolicy apply to
it and if more than one does, it will apply no policy at all.
Each CiliumBGPPeeringPolicy defines one or more
CiliumBGPVirtualRouter configurations.
When these CRDs are written or read from the cluster the Controllers
will take notice and perform the necessary actions to drive the
BGP Control Plane to the desired state described by the policy.
The policy in yaml form is defined below:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: 01-bgp-peering-policy
spec: # CiliumBGPPeeringPolicySpec
  nodeSelector:
    matchLabels:
      bgp-policy: a
  virtualRouters: # []CiliumBGPVirtualRouter
  - localASN: 64512
    exportPodCIDR: true
    neighbors: # []CiliumBGPNeighbor
    - peerAddress: 'fc00:f853:ccd:e793::50/128'
      peerASN: 64512
      authSecretRef: secretname
      eBGPMultihopTTL: 10
      connectRetryTimeSeconds: 120
      holdTimeSeconds: 90
      keepAliveTimeSeconds: 30
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 120
Fields
nodeSelector: Nodes which are selected by this label selector will apply the given policy
virtualRouters: One or more peering configurations outlined below. Each peering configuration can be thought of as a BGP router instance.
virtualRouters[*].localASN: The local ASN for this peering configuration
virtualRouters[*].serviceSelector: Services which are selected by this label selector will be announced.
virtualRouters[*].podIPPoolSelector: Allocated CIDRs from CiliumPodIPPools which are selected by this label selector will be announced.
virtualRouters[*].exportPodCIDR: Whether to export the private pod CIDR block to the listed neighbors
virtualRouters[*].neighbors: A list of neighbors to peer with
neighbors[*].peerAddress: The address of the peer neighbor
neighbors[*].peerPort: Optional TCP port number of the neighbor. Valid values are 1-65535; defaults to 179 when unspecified.
neighbors[*].peerASN: The ASN of the peer
neighbors[*].authSecretRef: Optional name of a secret in the BGP secrets namespace to use to retrieve a TCP MD5 password.
neighbors[*].eBGPMultihopTTL: Time To Live (TTL) value used in BGP packets. The value 1 implies that eBGP multi-hop feature is disabled.
neighbors[*].connectRetryTimeSeconds: Initial value for the BGP ConnectRetryTimer (RFC 4271, Section 8). Defaults to 120 seconds.
neighbors[*].holdTimeSeconds: Initial value for the BGP HoldTimer (RFC 4271, Section 4.2). Defaults to 90 seconds.
neighbors[*].keepAliveTimeSeconds: Initial value for the BGP KeepaliveTimer (RFC 4271, Section 8). Defaults to 30 seconds.
neighbors[*].gracefulRestart.enabled: The flag to enable graceful restart capability.
neighbors[*].gracefulRestart.restartTimeSeconds: The restart time advertised to the peer (RFC 4724 section 4.2).
Note
Setting unique configuration details of a particular instantiated virtual router on a particular Cilium node is explained in Virtual Router Attributes.
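The defaults listed in the field reference can be summarized in a short Python sketch. The `Neighbor` class and its `validate` method are hypothetical names used only for illustration and are not part of Cilium; the default values and the port range are the ones documented above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Neighbor:
    """Hypothetical model of one CiliumBGPNeighbor entry (illustration only)."""
    peer_address: str
    peer_asn: int
    peer_port: int = 179                   # documented default
    connect_retry_time_seconds: int = 120  # documented default
    hold_time_seconds: int = 90            # documented default
    keep_alive_time_seconds: int = 30      # documented default
    auth_secret_ref: Optional[str] = None  # optional, no MD5 password if unset

    def validate(self) -> None:
        # The field reference states that 1-65535 are valid port values.
        if not 1 <= self.peer_port <= 65535:
            raise ValueError(f"peerPort out of range: {self.peer_port}")


n = Neighbor(peer_address="fc00:f853:ccd:e793::50/128", peer_asn=64512)
n.validate()
print(n.peer_port, n.connect_retry_time_seconds, n.hold_time_seconds)
```

Only the optional fields with a documented default carry one here; fields such as eBGPMultihopTTL are left out because the reference above does not state their default.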
Creating a BGP Topology
Rules
Follow the rules below to have a CiliumBGPPeeringPolicy correctly
apply to a node.
Only a single CiliumBGPPeeringPolicy can apply to a Cilium node.
  If the BGP Control Plane on a node iterates through the CiliumBGPPeeringPolicy CRs currently written to the cluster and discovers that more than one policy (n > 1) matches its labels, it will return an error and remove any existing BGP sessions. Exactly one policy (n == 1) must match a node’s label set.
  Administrators should test a new BGP topology in a staging environment before making permanent changes in production.
Within a CiliumBGPPeeringPolicy each CiliumBGPVirtualRouter defined must have a unique localASN field.
  A node cannot host two or more logical routers with the same local ASN. Local ASNs are used as unique keys for a logical router.
  A node can define the remote ASN on a per-neighbor basis to mitigate this scenario. See the CiliumBGPNeighbor CR sub-structure.
IPv6 single stack deployments must set an IPv4 encoded routerID field in each defined CiliumBGPVirtualRouter object within a CiliumBGPPeeringPolicy.
  Cilium running on an IPv6 single stack cluster cannot reliably generate a unique 32 bit BGP router ID, as it defines no unique IPv4 addresses for the node. The administrator must define these IDs manually or an error applying the policy will occur.
  This is explained further in Virtual Router Attributes.
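The single-policy rule can be illustrated with a minimal Python sketch. It is not Cilium's implementation: `matches` and `select_policy` are hypothetical helpers, policies are reduced to a name plus matchLabels, and matchExpressions are ignored for brevity.

```python
from typing import Dict, List, Optional


def matches(match_labels: Dict[str, str], node_labels: Dict[str, str]) -> bool:
    """True when every matchLabels entry is present on the node."""
    return all(node_labels.get(k) == v for k, v in match_labels.items())


def select_policy(policies: List[dict], node_labels: Dict[str, str]) -> Optional[dict]:
    """Return the single matching policy, or None if no policy matches.

    Raises ValueError when more than one policy matches, mirroring the
    rule that such a node ends up applying no policy at all.
    """
    hits = [p for p in policies if matches(p["matchLabels"], node_labels)]
    if len(hits) > 1:
        names = ", ".join(p["name"] for p in hits)
        raise ValueError(f"multiple policies match this node: {names}")
    return hits[0] if hits else None


policies = [
    {"name": "rack-a", "matchLabels": {"bgp-policy": "a"}},
    {"name": "rack-b", "matchLabels": {"bgp-policy": "b"}},
]
print(select_policy(policies, {"bgp-policy": "a"})["name"])
```

Running a check like this against the labels of every node before applying a new policy is one way to catch overlapping selectors in a staging environment.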
Defining Topology
Within a CiliumBGPPeeringPolicy multiple
CiliumBGPVirtualRouter(s) can be defined.
Each one can be thought of as a logical BGP router instance.
Defining more than one CiliumBGPVirtualRouter in a
CiliumBGPPeeringPolicy creates more than one logical BGP router on
the hosts which the policy matches.
It is possible to create a single CiliumBGPPeeringPolicy for all
nodes by giving each node in a cluster the same label and defining a
single CiliumBGPPeeringPolicy which applies to this label.
It is also possible to provide each Kubernetes node its own
CiliumBGPPeeringPolicy by giving each node a unique label and
creating a CiliumBGPPeeringPolicy for each unique label.
This allows for selecting subsets of nodes which peer to a particular BGP router while another subset of nodes peer to a separate BGP router, akin to an “AS-per-rack” topology.
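As a rough illustration of the per-label approach, the following Python sketch generates one policy manifest per rack. Everything specific in it is an assumption made for the example: the `rack` label key, the rack names, the ASNs, and the peer addresses are hypothetical, and `rack_policy` is not a Cilium helper. The output is JSON, which is also valid YAML.

```python
import json


def rack_policy(rack: str, local_asn: int, peer_address: str, peer_asn: int) -> dict:
    """Build one CiliumBGPPeeringPolicy manifest for the nodes of one rack."""
    return {
        "apiVersion": "cilium.io/v2alpha1",
        "kind": "CiliumBGPPeeringPolicy",
        "metadata": {"name": f"rack-{rack}"},
        "spec": {
            # Nodes of this rack are assumed to carry the label rack=<name>.
            "nodeSelector": {"matchLabels": {"rack": rack}},
            "virtualRouters": [{
                "localASN": local_asn,
                "exportPodCIDR": True,
                "neighbors": [{"peerAddress": peer_address, "peerASN": peer_asn}],
            }],
        },
    }


# Hypothetical racks: label value -> (local ASN, router address, router ASN).
racks = {
    "a": (64512, "10.0.1.1/32", 65001),
    "b": (64513, "10.0.2.1/32", 65002),
}
policies = [rack_policy(name, *args) for name, args in racks.items()]
print(json.dumps(policies[0], indent=2))
```

Because each policy selects a distinct label value, no node is matched by more than one policy.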
Virtual Router Attributes
A CiliumBGPPeeringPolicy can apply to multiple nodes.
When a CiliumBGPPeeringPolicy applies to one or more nodes each node
will instantiate one or more BGP routers as defined by the list of
CiliumBGPVirtualRouter.
However, there are times where fine-grained control over an instantiated virtual router’s configuration needs to take place.
To accomplish this a Kubernetes annotation is defined which applies to Kubernetes Node resources.
A single annotation is used to specify a set of configuration attributes to apply to a particular virtual router instantiated on a particular host.
The syntax of the annotation is as follows:
cilium.io/bgp-virtual-router.{asn}="key=value,..."
The {asn} portion should be replaced by the virtual router’s local
ASN you wish to apply these configuration attributes to.
The following sections outline the currently supported attributes.
Note
Each following section describes the syntax of applying a single attribute. However, the annotation’s value supports a comma-separated list of attributes, so multiple attributes can be applied in a single annotation.
Note
When duplicate key=value attributes are defined the last
one will be selected.
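The annotation format described above can be parsed in a few lines. This Python sketch is only an illustration of the documented syntax, not Cilium's parser; `parse_annotation` is a hypothetical helper name.

```python
from typing import Dict, Tuple

PREFIX = "cilium.io/bgp-virtual-router."


def parse_annotation(key: str, value: str) -> Tuple[int, Dict[str, str]]:
    """Split one node annotation into (local ASN, attribute map)."""
    if not key.startswith(PREFIX):
        raise ValueError(f"not a virtual router annotation: {key}")
    asn = int(key[len(PREFIX):])
    attrs: Dict[str, str] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {item}")
        attrs[name] = val  # a later duplicate overwrites an earlier one
    return asn, attrs


asn, attrs = parse_annotation(
    "cilium.io/bgp-virtual-router.64512",
    "router-id=10.0.0.1,local-port=45450,router-id=10.0.0.2",
)
print(asn, attrs)
```

The second router-id replaces the first, which matches the rule that the last duplicate attribute is the one selected.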
Router ID Attribute
When Cilium is running on an IPv4 or a dual-stack IPv4/6 cluster
the BGP Control Plane will utilize the IPv4 address used by
Cilium for external reachability.
This will typically be Kubernetes’ reported external IP address but can also be configured with a Cilium agent flag.
When running in IPv6 single stack or when the administrator needs to
manually define the instantiated BGP server’s router ID a Kubernetes
annotation can be placed on the node.
The annotation takes the following syntax:
cilium.io/bgp-virtual-router.{asn}="router-id=127.0.0.1"
The above annotation syntax should replace {asn} with the local ASN
of the CiliumBGPVirtualRouter you are setting the provided router ID
for.
When the BGP Control Plane evaluates a CiliumBGPPeeringPolicy with
a CiliumBGPVirtualRouter it also searches for an annotation which
targets the aforementioned CiliumBGPVirtualRouter local ASN.
If found it will use the provided router ID and not attempt to use the IPv4 address assigned to the node.
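The selection order described above (annotation first, node IPv4 address otherwise) can be sketched as follows. This is an illustrative Python model under stated assumptions, not the agent's code: `resolve_router_id` is a hypothetical name, and the node's IPv4 address is passed in as a plain string or None.

```python
import ipaddress
from typing import Dict, Optional


def resolve_router_id(attrs: Dict[str, str], node_ipv4: Optional[str]) -> str:
    """Pick the router ID for one virtual router.

    attrs:     attributes parsed from the node annotation for this local ASN.
    node_ipv4: IPv4 address Cilium uses for the node, or None on an
               IPv6 single stack cluster.
    """
    candidate = attrs.get("router-id") or node_ipv4
    if candidate is None:
        raise ValueError("no router-id annotation and no IPv4 address available")
    # A BGP router ID is a 32-bit value written in IPv4 dotted notation,
    # so anything that is not IPv4-encoded is rejected here.
    return str(ipaddress.IPv4Address(candidate))


print(resolve_router_id({"router-id": "10.0.0.5"}, "192.0.2.1"))
print(resolve_router_id({}, "192.0.2.1"))
```

On an IPv6 single stack node without the annotation the function raises, which corresponds to the policy failing to apply.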
Local Listening Port
By default the GoBGP BGPRouterManager will instantiate each virtual
router without a listening port.
It is possible to deploy a virtual router which creates a local listening port where BGP connections may take place.
If this is desired, the following annotation can be provided:
cilium.io/bgp-virtual-router.{asn}="local-port=45450"
Neighbors
Each CiliumBGPVirtualRouter can contain multiple CiliumBGPNeighbor sections,
each specifying configuration for a neighboring BGP peer of the Virtual Router.
Each neighbor is uniquely identified by the address and the ASN of the peer, and can
contain additional configuration specific for the given BGP peering, such as BGP timer
values, graceful restart configuration and others.
Warning
Changing an existing neighbor configuration can reset the existing BGP peering connection, which results in route flaps and transient packet loss while the session reestablishes and peers exchange their routes. To prevent packet loss, it is recommended to configure BGP graceful restart.
MD5 passwords
Setting authSecretRef for a neighbor configures an RFC-2385 TCP MD5
password on the session with this BGP peer.
authSecretRef should reference the name of a secret in the BGP secrets
namespace (if using the Helm chart this is kube-system by default). The
secret should contain a key with a name of password.
BGP secrets are limited to a configured namespace to keep the permissions needed on each Cilium Agent instance to a minimum. The Helm chart will configure Cilium to be able to read from it by default.
An example of creating a secret is:
$ kubectl create secret generic -n kube-system --type=string secretname --from-literal=password=my-secret-password
If you wish to change the namespace, you can set the
bgpControlPlane.secretNamespace.name Helm chart value. To have the
namespace created automatically, you can set the
bgpControlPlane.secretNamespace.create Helm chart value to true.
Because TCP MD5 passwords sign the header of the packet they cannot be used if the session will be address translated by Cilium (i.e. the Cilium Agent’s pod IP address must be the address the BGP peer sees).
If the password is incorrect, or the header is otherwise changed the TCP
connection will not succeed. This will appear as dial: i/o timeout in the
Cilium Agent’s logs rather than a more specific error message.
If a CiliumBGPPeeringPolicy is deployed with an authSecretRef that Cilium cannot find, the BGP session will use an empty password and the agent will log an error such as in the following example:
level=error msg="Failed to fetch secret \"secretname\": not found (will continue with empty password)" component=manager.fetchPeerPassword subsys=bgp-control-plane
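The fallback behaviour for a missing secret can be modelled in a few lines of Python. This is a sketch, not the agent's code: `fetch_peer_password` is a hypothetical name, a plain dictionary stands in for the secrets of the BGP secrets namespace, and returning an empty password for a secret that lacks the password key is an assumption of this example.

```python
import logging
from typing import Dict, Optional

log = logging.getLogger("bgp-control-plane-sketch")


def fetch_peer_password(secrets: Dict[str, Dict[str, str]], ref: Optional[str]) -> str:
    """Return the TCP MD5 password for a neighbor, or "" if unavailable.

    secrets: stand-in for the secrets in the BGP secrets namespace,
             keyed by secret name.
    """
    if not ref:
        return ""  # no authSecretRef configured: no MD5 password
    secret = secrets.get(ref)
    if secret is None:
        log.error('Failed to fetch secret "%s": not found '
                  "(will continue with empty password)", ref)
        return ""
    # Assumption: a secret without the "password" key yields an empty password.
    return secret.get("password", "")


store = {"secretname": {"password": "my-secret-password"}}
print(fetch_peer_password(store, "secretname"))
```

Since a missing secret does not stop the session from being configured, checking the agent log for this error is worth doing whenever a peering with authSecretRef times out.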
Graceful Restart
The Cilium BGP control plane can be configured to act as a graceful restart
Restarting Speaker. When you enable graceful restart, the BGP session will restart
and the “graceful restart” capability will be advertised in the BGP OPEN message.
In the event of a Cilium Agent restart, the peering BGP router does not withdraw routes received from the Cilium BGP control plane immediately. The datapath continues to forward traffic during Agent restart, so there is no traffic disruption.
Configure graceful restart on per-neighbor basis, as follows:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
virtualRouters: # []CiliumBGPVirtualRouter
- localASN: 64512
  # [...]
  neighbors: # []CiliumBGPNeighbor
  - peerAddress: 'fc00:f853:ccd:e793::50/128'
    # [...]
    gracefulRestart:
      enabled: true
      restartTimeSeconds: 120
Note
When enabled, graceful restart capability is advertised for IPv4 and IPv6 address families.
Optionally, you can set the RestartTime parameter (restartTimeSeconds in the policy). RestartTime is the time
advertised to the peer within which Cilium BGP control plane is expected to re-establish
the BGP session after a restart. On expiration of RestartTime, the peer removes
the routes previously advertised by the Cilium BGP control plane.
When the Cilium Agent restarts, it closes the BGP TCP socket, causing the emission of a
TCP FIN packet. On receiving this TCP FIN, the peer changes its BGP state to Idle and
starts its RestartTime timer.
The Cilium agent boot up time varies depending on the deployment. If using RestartTime,
you should set it to a duration greater than the time taken by the Cilium Agent to boot up.
The default value of RestartTime is 120 seconds. More details on graceful restart and
RestartTime can be found in RFC-4724 and RFC-8538.
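The relationship between RestartTime and the agent's boot time can be expressed as a small Python model of the peer's behaviour. It is a simplification for illustration only: `peer_keeps_routes` is a hypothetical name, the upper bound reflects the 12-bit Restart Time field of RFC 4724, and the lower bound of 1 is an assumption of this sketch.

```python
MAX_RESTART_TIME = 4095  # RFC 4724 carries Restart Time in a 12-bit field


def peer_keeps_routes(restart_time_seconds: int,
                      session_back_after_seconds: float) -> bool:
    """Model the peer's decision after the agent's TCP session closes.

    The peer starts a timer of restart_time_seconds when the session drops
    and removes the previously advertised routes if the timer expires
    before the session is re-established.
    """
    if not 1 <= restart_time_seconds <= MAX_RESTART_TIME:
        raise ValueError(f"restart time out of range: {restart_time_seconds}")
    return session_back_after_seconds < restart_time_seconds


# With the default of 120 seconds, a 45 second agent restart keeps the routes,
# while a 180 second restart does not.
print(peer_keeps_routes(120, 45), peer_keeps_routes(120, 180))
```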
Advertised Path Attributes
BGP advertisements can be extended with additional BGP Path Attributes - BGP Communities (RFC-1997) or Local Preference. These Path Attributes can be configured selectively for each BGP peer and advertisement type.
The following code block shows an example configuration of AdvertisedPathAttributes for a BGP neighbor,
which adds a BGP community attribute with the value 65001:100 to all Service announcements from the
matching CiliumLoadBalancerIPPool and sets the Local Preference value for all Pod CIDR announcements
to the value 150:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
virtualRouters: # []CiliumBGPVirtualRouter
- localASN: 64512
  # [...]
  neighbors: # []CiliumBGPNeighbor
  - peerASN: 64512
    peerAddress: 172.0.0.1/32
    # [...]
    advertisedPathAttributes:
    - selectorType: CiliumLoadBalancerIPPool
      selector:
        matchLabels:
          environment: production
      communities:
        standard:
        - 65001:100
    - selectorType: PodCIDR
      localPreference: 150
      communities:
        standard:
        - 65001:150
Note
The Local Preference Path Attribute is sent only to iBGP peers (not to eBGP peers).
Each AdvertisedPathAttributes configuration item consists of two parts:
SelectorType with Selector define which BGP advertisements will be extended with additional Path Attributes.
Communities and / or LocalPreference define the additional Path Attributes applied on the selected routes.
There are three possible values of the SelectorType which define the object type on which the Selector applies:
PodCIDR: matches CiliumNode custom resources (Path Attributes apply to routes announced for PodCIDRs of selected CiliumNode objects).
CiliumLoadBalancerIPPool: matches CiliumLoadBalancerIPPool custom resources (Path Attributes apply to routes announced for selected CiliumLoadBalancerIPPool objects).
CiliumPodIPPool: matches CiliumPodIPPool custom resources (Path Attributes apply to routes announced for allocated prefixes of selected CiliumPodIPPool objects).
There are two types of additional Path Attributes that can be advertised with the routes: Communities and LocalPreference.
Communities defines a set of community values advertised in the supported BGP Communities Path Attributes.
The values can be of two types:
Standard: represents a value of the “standard” 32-bit BGP Communities Attribute (RFC-1997) as a 4-byte decimal number or two 2-byte decimal numbers separated by a colon (e.g. 65100:100).
Large: represents a value of the BGP Large Communities Attribute (RFC-8092), as three 4-byte decimal numbers separated by colons (e.g. 65100:100:50).
LocalPreference defines the preference value advertised in the BGP Local Preference Path Attribute.
As Local Preference is only valid for iBGP peers, this value will be ignored for eBGP peers
(no Local Preference Path Attribute will be advertised).
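The value formats above can be checked with a short Python sketch. The helper names (`parse_standard`, `parse_large`, `effective_local_pref`) are hypothetical and this is not how Cilium parses the fields; it only illustrates the documented notations and the iBGP-only rule, taking iBGP to mean that local and peer ASN are equal.

```python
from typing import Optional, Tuple


def parse_standard(value: str) -> int:
    """Parse 'high:low' or a plain decimal into a 32-bit community value."""
    if ":" in value:
        high, low = (int(part) for part in value.split(":"))
        if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
            raise ValueError(f"each half must fit in 2 bytes: {value}")
        return (high << 16) | low
    number = int(value)
    if not 0 <= number <= 0xFFFFFFFF:
        raise ValueError(f"must fit in 4 bytes: {value}")
    return number


def parse_large(value: str) -> Tuple[int, int, int]:
    """Parse 'a:b:c' into the three 4-byte parts of a large community."""
    parts = [int(part) for part in value.split(":")]
    if len(parts) != 3 or any(not 0 <= p <= 0xFFFFFFFF for p in parts):
        raise ValueError(f"expected three 4-byte numbers: {value}")
    return parts[0], parts[1], parts[2]


def effective_local_pref(local_asn: int, peer_asn: int,
                         local_pref: Optional[int]) -> Optional[int]:
    """Local Preference is only advertised on iBGP sessions (same ASN)."""
    return local_pref if local_asn == peer_asn else None


print(parse_standard("65100:100"), parse_large("65100:100:50"))
print(effective_local_pref(64512, 64512, 150), effective_local_pref(64512, 65001, 150))
```

Both standard notations describe the same 32-bit value: the part before the colon occupies the upper 16 bits and the part after it the lower 16 bits.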
Once configured, the additional Path Attributes advertised with the routes for a peer can be verified using the
cilium-dbg bgp routes CLI command, for example:
$ cilium-dbg bgp routes advertised ipv4 unicast peer 172.0.0.1
VRouter   Prefix               NextHop     Age     Attrs
65000     10.244.0.0/24        172.0.0.2   3m31s   [{Origin: i} {LocalPref: 150} {Nexthop: 172.0.0.2}]
65000     192.168.100.190/32   172.0.0.2   3m32s   [{Origin: i} {LocalPref: 100} {Communities: 64512:100} {Nexthop: 172.0.0.2}]
Service announcements
By default, virtual routers will not announce services. Virtual routers will announce
the ingress IPs of any LoadBalancer services that match the .serviceSelector
of the virtual router and have loadBalancerClass
unspecified or set to io.cilium/bgp-control-plane.
If you wish to announce ALL services within the cluster, a NotIn match expression
with a dummy key and value can be used like:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
virtualRouters: # []CiliumBGPVirtualRouter
- localASN: 64512
  # [...]
  serviceSelector:
    matchExpressions:
    - {key: somekey, operator: NotIn, values: ['never-used-value']}
There are a few special purpose selector fields which don’t match on labels but
instead on other metadata like .meta.name or .meta.namespace.
Selector | Field
io.kubernetes.service.namespace | .meta.namespace
io.kubernetes.service.name | .meta.name
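The announcement rule and the dummy-key NotIn trick can be made concrete with a small Python model of Kubernetes-style label selectors. It is a sketch under assumptions, not Cilium's reconciler: `expr_matches` and `should_announce` are hypothetical names, and a Service is reduced to a plain dictionary.

```python
from typing import Dict, Optional

BGP_LB_CLASS = "io.cilium/bgp-control-plane"


def expr_matches(expr: dict, labels: Dict[str, str]) -> bool:
    """Evaluate one matchExpressions entry against a label set."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A missing key also satisfies NotIn, which is why a dummy key
        # with a never-used value selects every Service.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unknown operator: {op}")


def should_announce(svc: dict, selector: Optional[dict]) -> bool:
    """Decide whether a virtual router announces a Service's ingress IPs."""
    if selector is None:  # no serviceSelector: nothing is announced
        return False
    if svc["type"] != "LoadBalancer":
        return False
    if svc.get("loadBalancerClass") not in (None, BGP_LB_CLASS):
        return False
    # Expose name and namespace under the special purpose selector keys.
    labels = dict(svc.get("labels", {}))
    labels["io.kubernetes.service.name"] = svc["name"]
    labels["io.kubernetes.service.namespace"] = svc["namespace"]
    return (
        all(labels.get(k) == v for k, v in selector.get("matchLabels", {}).items())
        and all(expr_matches(e, labels) for e in selector.get("matchExpressions", []))
    )


match_all = {"matchExpressions": [
    {"key": "somekey", "operator": "NotIn", "values": ["never-used-value"]}]}
svc = {"name": "web", "namespace": "default", "type": "LoadBalancer", "labels": {}}
print(should_announce(svc, match_all), should_announce(svc, None))
```

The same function also shows how the special keys behave: a selector with matchLabels of io.kubernetes.service.namespace: default would select every eligible Service in that namespace regardless of its labels.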
Semantics of the externalTrafficPolicy: Local
When the service has externalTrafficPolicy: Local, BGP Control Plane keeps track
of the endpoints for the service on the local node and stops advertisement when there’s
no local endpoint.
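A minimal Python sketch of this behaviour is shown below. `advertising_nodes` is a hypothetical helper, and the treatment of policies other than Local (every node keeps advertising) is an assumption, since only the Local case is specified here.

```python
from typing import Dict, List


def advertising_nodes(external_traffic_policy: str,
                      endpoints_by_node: Dict[str, int]) -> List[str]:
    """Return the nodes expected to advertise the Service.

    endpoints_by_node maps a node name to the number of Service
    endpoints running on that node.
    """
    if external_traffic_policy == "Local":
        return sorted(n for n, count in endpoints_by_node.items() if count > 0)
    # Assumption: with any other policy every node keeps advertising.
    return sorted(endpoints_by_node)


endpoints = {"node-a": 2, "node-b": 0}
print(advertising_nodes("Local", endpoints))
```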
CiliumPodIPPool announcements
By default, virtual routers will not announce any CiliumPodIPPool CIDRs. To announce allocated
CIDRs of a CiliumPodIPPool, specify the .podIPPoolSelector for the virtual router. The
.podIPPoolSelector field is a label selector that selects allocated CIDRs of CiliumPodIPPools
matching the specified .matchLabels or .matchExpressions.
Note
The CiliumPodIPPool CIDR must be allocated to a CiliumNode that matches the .nodeSelector for
the virtual router to announce the CIDR as a BGP route.
If you wish to announce ALL CiliumPodIPPool CIDRs within the cluster, a NotIn match expression
with a dummy key and value can be used like:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
virtualRouters: # []CiliumBGPVirtualRouter
- localASN: 64512
  # [...]
  podIPPoolSelector:
    matchExpressions:
    - {key: somekey, operator: NotIn, values: ['never-used-value']}
There are two special purpose selector fields that match CiliumPodIPPools based on name and/or
namespace metadata instead of labels:
Selector | Field
io.cilium.podippool.namespace | .meta.namespace
io.cilium.podippool.name | .meta.name
For additional details regarding CiliumPodIPPools, see the Multi-Pool (Beta) section.
CLI
There are two CLIs available to view Cilium BGP peering state. One CLI is present inside the Cilium Agent. The second CLI is the cluster-wide Cilium CLI.
Warning
The Cilium CLI is experimental. Consider carefully before using it in production environments!
Cilium Agent CLI
The following command shows peering status:
cilium# cilium-dbg bgp peers -h
List state of all peers defined in CiliumBGPPeeringPolicy

Usage:
  cilium-dbg bgp peers [flags]

Flags:
  -h, --help            help for peers
  -o, --output string   json| yaml| jsonpath='{}'

Global Flags:
      --config string   Config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
The following command shows BGP routes available in the RIB / advertised to the peers:
cilium# cilium-dbg bgp routes -h
List routes in the BGP Control Plane's Routing Information Bases (RIBs)

Usage:
  cilium-dbg bgp routes <available | advertised> <afi> <safi> [vrouter <asn>] [peer|neighbor <address>] [flags]

Examples:
  Get all IPv4 unicast routes available:
    cilium bgp routes available ipv4 unicast

  Get all IPv6 unicast routes available for a specific vrouter:
    cilium bgp routes available ipv6 unicast vrouter 65001

  Get IPv4 unicast routes advertised to a specific peer:
    cilium bgp routes advertised ipv4 unicast peer 10.0.0.1

Flags:
  -h, --help            help for routes
  -o, --output string   json| yaml| jsonpath='{}'

Global Flags:
      --config string   Config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
Cilium-CLI
Cilium CLI displays the BGP peering status of all nodes.
$ cilium bgp peers -h
Gets BGP peering status from all nodes in the cluster

Usage:
  cilium bgp peers [flags]

Flags:
      --agent-pod-selector string   Label on cilium-agent pods to select with (default "k8s-app=cilium")
  -h, --help                        help for peers
      --node string                 Node from which BGP status will be fetched, omit to select all nodes
  -o, --output string               Output format. One of: json, summary (default "summary")
      --wait-duration duration      Maximum time to wait for result, default 1 minute (default 1m0s)

Global Flags:
      --context string     Kubernetes configuration context
  -n, --namespace string   Namespace Cilium is running in (default "kube-system")
Architecture
The BGP Control Plane is split into an Agent-Side Control Plane
and an Operator-Side control plane (not yet implemented).
Both control planes are implemented by a Controller which follows
the Kubernetes controller pattern.
Both control planes primarily listen for CiliumBGPPeeringPolicy CRDs,
along with other Cilium and Kubernetes resources used for implementing
a BGP control plane.
Agent-Side Architecture
At a high level, the Agent-Side Control Plane is divided into the following
sub-modules:
Agent
Manager
Router
Agent
The Agent implements a controller located in pkg/bgpv1/agent/controller.go.
The controller listens for CiliumBGPPeeringPolicy changes and
determines if the policy applies to its current host.
It will then capture some information about Cilium’s current state
and pass down the desired state to Manager.
Manager
The Manager implements the interface BGPRouterManager, which
defines a declarative API between the Controller and instances of
BGP routers.
The interface defines a single declarative method whose argument is the
desired CiliumBGPPeeringPolicy (among a few others).
The Manager is in charge of pushing the BGP Control Plane
to the desired CiliumBGPPeeringPolicy or returning an error if it
is not possible.
Implementation Details
The Manager implementation takes the desired CiliumBGPPeeringPolicy
and translates it into imperative router API calls:
evaluate the desired CiliumBGPPeeringPolicy
create/remove the desired BGP routers
advertise/withdraw the desired BGP routes
enable/disable any BGP server specific features
inform the caller if the policy cannot be applied
The Manager evaluates each CiliumBGPVirtualRouter in isolation.
While applying a CiliumBGPPeeringPolicy, it will attempt to create each
CiliumBGPVirtualRouter.
If a particular CiliumBGPVirtualRouter fails to instantiate, the error
message is logged, and the Manager will continue to the next
CiliumBGPVirtualRouter.
It is worth expanding on how the Manager works internally.
Manager views each CiliumBGPVirtualRouter as a BGP router instance.
Each CiliumBGPVirtualRouter is defined by a local ASN, a router ID and a
list of CiliumBGPNeighbors with whom it will establish peering.
This is enough for the Manager to create a Router instance.
Manager groups Router instances by their local ASNs.
Note
A CiliumBGPPeeringPolicy applying to a node must not have two or more
CiliumBGPVirtualRouters with the same localASN fields.
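The per-router isolation described above can be sketched in Python. This is an illustrative model rather than the Manager's actual code (which is written in Go): `reconcile` and `create_router` are hypothetical names, and rejecting duplicate local ASNs up front is an assumption of this example.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("bgp-manager-sketch")


def reconcile(virtual_routers: List[dict],
              create_router: Callable[[dict], object]) -> Dict[int, object]:
    """Create one router per virtual router, keyed by local ASN."""
    asns = [vr["localASN"] for vr in virtual_routers]
    if len(asns) != len(set(asns)):
        raise ValueError("virtual routers must have unique localASN values")

    routers: Dict[int, object] = {}
    for vr in virtual_routers:
        try:
            routers[vr["localASN"]] = create_router(vr)
        except Exception as err:
            # One failing router is logged and does not stop the others.
            log.error("failed to create router for ASN %d: %s", vr["localASN"], err)
    return routers


def fake_create(vr: dict) -> str:
    """Stand-in for router creation that fails for one ASN."""
    if vr["localASN"] == 64513:
        raise RuntimeError("simulated failure")
    return f"router-{vr['localASN']}"


result = reconcile([{"localASN": 64512}, {"localASN": 64513}, {"localASN": 64514}],
                   fake_create)
print(sorted(result))
```

Keying the result by local ASN mirrors the way the Manager groups Router instances, and it is the reason two virtual routers with the same ASN cannot coexist on one node.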
The Manager employs a set of Reconcilers which perform an
order-dependent reconciliation action for each Router.
See the source code at pkg/bgpv1/manager/reconcile.go for a more in-depth
explanation of how each Reconciler works.
Router
The BGP Control Plane uses GoBGP as the underlying routing agent.
The GoBGP client-side implementation is located in pkg/bgpv1/gobgp.
The implementation API adheres to the Router interface defined in pkg/bgpv1/types/bgp.go.