Cilium BGP Control Plane (Beta)
BGP Control Plane provides a way for Cilium to advertise routes to connected routers by using the Border Gateway Protocol (BGP). BGP Control Plane makes Pod networks and/or Services of type LoadBalancer reachable from outside the cluster for environments that support BGP. Because BGP Control Plane does not program the datapath, do not use it to establish reachability within the cluster.
Usage
Currently, a single flag in the Cilium Agent exists to turn on the BGP Control Plane feature set:
--enable-bgp-control-plane=true
If using Helm charts instead, the relevant values are the following:
bgpControlPlane:
  enabled: true
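For example, assuming Cilium is installed as the Helm release cilium in the kube-system namespace (a common layout, though not guaranteed for every install), the value can be set with:
helm upgrade cilium cilium/cilium --namespace kube-system \
  --reuse-values --set bgpControlPlane.enabled=true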
Note
The BGP Control Plane feature is mutually exclusive with the MetalLB-based BGP feature (deprecated). To use BGP Control Plane, the older BGP feature has to be disabled. In other words, this feature does _not_ switch the BGP implementation from MetalLB to GoBGP.
When set to true, the BGP Control Plane Controllers will be instantiated and will begin listening for CiliumBGPPeeringPolicy events. Currently, the BGP Control Plane will only work when the IPAM mode is set to "cluster-pool" or "kubernetes".
CiliumBGPPeeringPolicy CRD
All BGP peering topology information is carried in a CiliumBGPPeeringPolicy CRD. A CiliumBGPPeeringPolicy can be applied to one or more nodes based on its nodeSelector fields. A Cilium node may only have a single CiliumBGPPeeringPolicy apply to it; if more than one matches, the node applies no policy at all.
Each CiliumBGPPeeringPolicy defines one or more CiliumBGPVirtualRouter configurations. When these CRDs are written to or read from the cluster, the Controllers will take notice and perform the necessary actions to drive the BGP Control Plane to the desired state described by the policy. The policy in YAML form is defined below:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: 01-bgp-peering-policy
spec: # CiliumBGPPeeringPolicySpec
  nodeSelector:
    matchLabels:
      bgp-policy: a
  virtualRouters: # []CiliumBGPVirtualRouter
  - localASN: 64512
    exportPodCIDR: true
    neighbors: # []CiliumBGPNeighbor
    - peerAddress: 'fc00:f853:ccd:e793::50/128'
      peerASN: 64512
      eBGPMultihopTTL: 10
      connectRetryTimeSeconds: 120
      holdTimeSeconds: 90
      keepAliveTimeSeconds: 30
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 120
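As a usage sketch, the policy above can be applied and a node labeled so that its nodeSelector matches; the node name worker-1 is hypothetical:
kubectl apply -f 01-bgp-peering-policy.yaml
kubectl label node worker-1 bgp-policy=a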
Fields
nodeSelector: Nodes selected by this label selector will have the given policy applied.
virtualRouters: One or more peering configurations outlined below. Each peering configuration can be thought of as a BGP router instance.
virtualRouters[*].localASN: The local ASN for this peering configuration.
virtualRouters[*].serviceSelector: Services selected by this label selector will be announced.
virtualRouters[*].exportPodCIDR: Whether to export the private pod CIDR block to the listed neighbors.
virtualRouters[*].neighbors: A list of neighbors to peer with.
neighbors[*].peerAddress: The address of the peer neighbor.
neighbors[*].peerPort: Optional TCP port number of the neighbor. Valid values are 1-65535; defaults to 179 when unspecified.
neighbors[*].peerASN: The ASN of the peer.
neighbors[*].eBGPMultihopTTL: Time To Live (TTL) value used in BGP packets. The value 1 implies that the eBGP multi-hop feature is disabled.
neighbors[*].connectRetryTimeSeconds: Initial value for the BGP ConnectRetryTimer (RFC 4271, Section 8). Defaults to 120 seconds.
neighbors[*].holdTimeSeconds: Initial value for the BGP HoldTimer (RFC 4271, Section 4.2). Defaults to 90 seconds.
neighbors[*].keepAliveTimeSeconds: Initial value for the BGP KeepaliveTimer (RFC 4271, Section 8). Defaults to 30 seconds.
neighbors[*].gracefulRestart.enabled: Flag to enable the graceful restart capability.
neighbors[*].gracefulRestart.restartTimeSeconds: The restart time advertised to the peer (RFC 4724, Section 4.2).
Note
Setting unique configuration details of a particular instantiated virtual router on a particular Cilium node is explained in Virtual Router Attributes.
Creating a BGP Topology
Rules
Follow the rules below to have a CiliumBGPPeeringPolicy correctly apply to a node:

1. Only a single CiliumBGPPeeringPolicy can apply to a Cilium node. If the BGP Control Plane on a node iterates through the CiliumBGPPeeringPolicy CRs currently written to the cluster and discovers that more than one policy matches its labels, it will return an error and remove any existing BGP sessions. Exactly one policy must match a node's label sets. Administrators should test a new BGP topology in a staging environment before making permanent changes in production.

2. Within a CiliumBGPPeeringPolicy, each CiliumBGPVirtualRouter defined must have a unique localASN field. A node cannot host two or more logical routers with the same local ASN, as local ASNs are used as unique keys for a logical router. A node can define the remote ASN on a per-neighbor basis to mitigate this scenario; see the CiliumBGPNeighbor CR sub-structure.

3. IPv6 single-stack deployments must set an IPv4-encoded routerID field in each defined CiliumBGPVirtualRouter object within a CiliumBGPPeeringPolicy. Cilium running on an IPv6 single-stack cluster cannot reliably generate a unique 32-bit BGP router ID, as it defines no unique IPv4 addresses for the node. The administrator must define these IDs manually, or an error will occur when applying the policy. This is explained further in Virtual Router Attributes.
Defining Topology
Within a CiliumBGPPeeringPolicy, multiple CiliumBGPVirtualRouter(s) can be defined. Each one can be thought of as a logical BGP router instance. Defining more than one CiliumBGPVirtualRouter in a CiliumBGPPeeringPolicy creates more than one logical BGP router on the hosts which the policy matches.
It is possible to create a single CiliumBGPPeeringPolicy for all nodes by giving each node in a cluster the same label and defining a single CiliumBGPPeeringPolicy which applies to this label. It is also possible to provide each Kubernetes node its own CiliumBGPPeeringPolicy by giving each node a unique label and creating a CiliumBGPPeeringPolicy for each unique label. This allows for selecting subsets of nodes which peer to a particular BGP router while another subset of nodes peers to a separate BGP router, akin to an "AS-per-rack" topology.
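As a concrete sketch of such a topology, the two policies below select different racks via node labels; the rack labels, ASNs, and peer addresses are illustrative assumptions, not defaults:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0-policy
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 65010
    exportPodCIDR: true
    neighbors:
    - peerAddress: '10.0.10.1/32' # hypothetical rack0 ToR router
      peerASN: 65000
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack1-policy
spec:
  nodeSelector:
    matchLabels:
      rack: rack1
  virtualRouters:
  - localASN: 65011
    exportPodCIDR: true
    neighbors:
    - peerAddress: '10.0.11.1/32' # hypothetical rack1 ToR router
      peerASN: 65000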
Virtual Router Attributes
A CiliumBGPPeeringPolicy can apply to multiple nodes. When a CiliumBGPPeeringPolicy applies to one or more nodes, each node will instantiate one or more BGP routers as defined by the list of CiliumBGPVirtualRouter configurations. However, there are times when fine-grained control over an instantiated virtual router's configuration is needed. To accomplish this, a Kubernetes annotation is defined which applies to Kubernetes Node resources. A single annotation is used to specify a set of configuration attributes to apply to a particular virtual router instantiated on a particular host.
The syntax of the annotation is as follows:
cilium.io/bgp-virtual-router.{asn}="key=value,..."
The {asn} portion should be replaced by the local ASN of the virtual router you wish to apply these configuration attributes to.
The following sections outline the currently supported attributes.
Note
Each following section describes the syntax of applying a single attribute; however, the annotation's value supports a comma-separated list of attributes, and applying multiple attributes in a single annotation is supported.
Note
When duplicate key=value attributes are defined, the last one will be selected.
Router ID Attribute
When Cilium is running on an IPv4 or a dual-stack IPv4/6 cluster, the BGP Control Plane will utilize the IPv4 address used by Cilium for external reachability. This will typically be Kubernetes' reported external IP address, but it can also be configured with a Cilium agent flag.
When running in IPv6 single-stack mode, or when the administrator needs to manually define the instantiated BGP server's router ID, a Kubernetes annotation can be placed on the node. The annotation takes the following syntax:
cilium.io/bgp-virtual-router.{asn}="router-id=127.0.0.1"
In the above annotation syntax, replace {asn} with the local ASN of the CiliumBGPVirtualRouter you are setting the provided router ID for. When the BGP Control Plane evaluates a CiliumBGPPeeringPolicy with a CiliumBGPVirtualRouter, it also searches for an annotation which targets that CiliumBGPVirtualRouter's local ASN. If found, it will use the provided router ID and not attempt to use the IPv4 address assigned to the node.
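For example, to set the router ID for the virtual router with local ASN 64512 on a given node (the node name worker-1 and the address are hypothetical):
kubectl annotate node worker-1 cilium.io/bgp-virtual-router.64512="router-id=192.168.10.1"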
Local Listening Port
By default, the GoBGP BGPRouterManager will instantiate each virtual router without a listening port. It is possible to deploy a virtual router which creates a local listening port where BGP connections may take place. If this is desired, the following annotation can be provided:
cilium.io/bgp-virtual-router.{asn}="local-port=45450"
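Because the annotation value is a comma-separated list of attributes, the listening port can be combined with other attributes such as the router ID in a single annotation; for example (hypothetical values):
kubectl annotate node worker-1 --overwrite \
  cilium.io/bgp-virtual-router.64512="local-port=45450,router-id=192.168.10.1"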
Neighbors
Each CiliumBGPVirtualRouter can contain multiple CiliumBGPNeighbor sections, each specifying the configuration for a neighboring BGP peer of the virtual router. Each neighbor is uniquely identified by the address and the ASN of the peer, and can contain additional configuration specific to the given BGP peering, such as BGP timer values, graceful restart configuration, and others.
Warning
Changing an existing neighbor configuration can cause a reset of the existing BGP peering connection, which results in route flaps and transient packet loss while the session re-establishes and peers exchange their routes. To prevent packet loss, it is recommended to configure BGP graceful restart.
Graceful Restart
The Cilium BGP control plane can be configured to act as a graceful restart Restarting Speaker. When you enable graceful restart, the BGP session will restart and the "graceful restart" capability will be advertised in the BGP OPEN message. In the event of a Cilium Agent restart, the peering BGP router does not immediately withdraw routes received from the Cilium BGP control plane. The datapath continues to forward traffic during the Agent restart, so there is no traffic disruption.
Configure graceful restart on a per-neighbor basis, as follows:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
  virtualRouters: # []CiliumBGPVirtualRouter
  - localASN: 64512
    # [...]
    neighbors: # []CiliumBGPNeighbor
    - peerAddress: 'fc00:f853:ccd:e793::50/128'
      # [...]
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 120
Note
When enabled, graceful restart capability is advertised for IPv4 and IPv6 address families.
Optionally, you can use the RestartTime parameter. RestartTime is the time advertised to the peer within which the Cilium BGP control plane is expected to re-establish the BGP session after a restart. On expiration of RestartTime, the peer removes the routes previously advertised by the Cilium BGP control plane.
When the Cilium Agent restarts, it closes the BGP TCP socket, causing the emission of a TCP FIN packet. On receiving this TCP FIN, the peer changes its BGP state to Idle and starts its RestartTime timer.
The Cilium Agent boot-up time varies depending on the deployment. If using RestartTime, you should set it to a duration greater than the time taken by the Cilium Agent to boot up.
The default value of RestartTime is 120 seconds. More details on graceful restart and RestartTime can be found in RFC 4724 and RFC 8538.
Service announcements
By default, virtual routers will not announce services. Virtual routers will announce the ingress IPs of any LoadBalancer services that match the .serviceSelector of the virtual router and have loadBalancerClass unspecified or set to io.cilium/bgp-control-plane.
If you wish to announce ALL services within the cluster, a NotIn match expression with a dummy key and value can be used, like:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
#[...]
  virtualRouters: # []CiliumBGPVirtualRouter
  - localASN: 64512
    # [...]
    serviceSelector:
      matchExpressions:
      - {key: somekey, operator: NotIn, values: ['never-used-value']}
There are a few special-purpose selector fields which don't match on labels but instead on other metadata, like .meta.name or .meta.namespace.

Selector                        | Field
--------------------------------|----------------
io.kubernetes.service.namespace | .meta.namespace
io.kubernetes.service.name      | .meta.name
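For instance, to announce only the LoadBalancer services of a single namespace, the special namespace selector can be used; the namespace name tenant-a is hypothetical:
serviceSelector:
  matchExpressions:
  - {key: io.kubernetes.service.namespace, operator: In, values: ['tenant-a']}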
Semantics of externalTrafficPolicy: Local
When the service has externalTrafficPolicy: Local, the BGP Control Plane keeps track of the endpoints for the service on the local node and stops advertisement when there is no local endpoint.
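A minimal sketch of a Service these semantics apply to, with hypothetical name and selector; note the loadBalancerClass discussed above:
apiVersion: v1
kind: Service
metadata:
  name: my-service # hypothetical
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerClass: io.cilium/bgp-control-plane
  selector:
    app: my-app # hypothetical
  ports:
  - port: 80
With this configuration, a node advertises the service's ingress IP only while it hosts at least one ready endpoint of the service.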
CLI
There are two CLIs available to view Cilium BGP peering state. One CLI is present inside the Cilium Agent; the second is the cluster-wide Cilium CLI.
Warning
The Cilium CLI is experimental. Consider carefully before using it in production environments!
Cilium Agent CLI
The following command shows peering status:
cilium# cilium bgp peers -h
List state of all peers defined in CiliumBGPPeeringPolicy
Usage:
cilium bgp peers [flags]
Flags:
-h, --help help for peers
-o, --output string json| yaml| jsonpath='{}'
Global Flags:
--config string Config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
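Since this CLI ships inside the agent, one way to run it is via kubectl exec against a Cilium agent pod; the daemonset name cilium in kube-system reflects a default installation and may differ in your cluster:
kubectl -n kube-system exec ds/cilium -- cilium bgp peers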
Cilium-CLI
Cilium CLI displays the BGP peering status of all nodes.
# cilium-cli bgp peers -h
Gets BGP peering status from all nodes in the cluster
Usage:
cilium bgp peers [flags]
Flags:
--agent-pod-selector string Label on cilium-agent pods to select with (default "k8s-app=cilium")
-h, --help help for peers
--node string Node from which BGP status will be fetched, omit to select all nodes
-o, --output string Output format. One of: json, summary (default "summary")
--wait-duration duration Maximum time to wait for result, default 1 minute (default 1m0s)
Global Flags:
--context string Kubernetes configuration context
-n, --namespace string Namespace Cilium is running in (default "kube-system")
Architecture
The BGP Control Plane is split into an Agent-Side Control Plane and an Operator-Side Control Plane (not yet implemented). Both control planes are implemented by a Controller which follows the Kubernetes controller pattern. Both control planes primarily listen for CiliumBGPPeeringPolicy CRDs, along with other Cilium and Kubernetes resources used for implementing a BGP control plane.
Agent-Side Architecture
At a high level, the Agent-Side Control Plane is divided into the following sub-modules:
Agent
Manager
Router
Agent
The Agent implements a controller located in pkg/bgpv1/agent/controller.go. The controller listens for CiliumBGPPeeringPolicy changes and determines if the policy applies to its current host. It will then capture some information about Cilium's current state and pass the desired state down to the Manager.
Manager
The Manager implements the BGPRouterManager interface, which defines a declarative API between the Controller and instances of BGP routers. The interface defines a single declarative method whose argument is the desired CiliumBGPPeeringPolicy (among a few others). The Manager is in charge of pushing the BGP Control Plane to the desired CiliumBGPPeeringPolicy or returning an error if it is not possible.
Implementation Details
The Manager implementation takes the desired CiliumBGPPeeringPolicy and translates it into imperative router API calls. It will:
evaluate the desired CiliumBGPPeeringPolicy
create/remove the desired BGP routers
advertise/withdraw the desired BGP routes
enable/disable any BGP server specific features
inform the caller if the policy cannot be applied
The Manager evaluates each CiliumBGPVirtualRouter in isolation. While applying a CiliumBGPPeeringPolicy, it will attempt to create each CiliumBGPVirtualRouter. If a particular CiliumBGPVirtualRouter fails to instantiate, the error message is logged and the Manager continues to the next CiliumBGPVirtualRouter.
It is worth expanding on how the Manager works internally. The Manager views each CiliumBGPVirtualRouter as a BGP router instance. Each CiliumBGPVirtualRouter is defined by a local ASN, a router ID, and a list of CiliumBGPNeighbors with whom it will establish peering. This is enough for the Manager to create a Router instance. The Manager groups Router instances by their local ASNs.
Note
A CiliumBGPPeeringPolicy applying to a node must not have two or more CiliumBGPVirtualRouters with the same localASN field.
The Manager employs a set of Reconcilers which perform an order-dependent reconciliation action for each Router. See the source code at pkg/bgpv1/manager/reconcile.go for a more in-depth explanation of how each Reconciler works.
Router
BGP Control Plane
utilizes GoBGP
as the underlying routing agent.
GoBGP client-side implementation is located in pkg/bgpv1/gobgp
.
Implementation API adheres to the Router
interface defined in pkg/bgpv1/types/bgp.go
.